Wednesday, October 3, 2007

The Analytic Value Chain - Extract, Transform, and Load the data

Related posts
The Analytic Value Chain introduced
Defining the problem
Determine Data Requirements
Locate Relevant Data

Extract, Transform, and Load (or ETL) is database administrator talk for getting data ready to analyze. For those who want a more formal definition, check out Wikipedia. I'll talk a little bit about each, but most of the action is in Transform.

From the data analyst perspective, there is not much to say on extraction; you need to get the data out of your systems and you need people who have the requisite skills. Think SQL. If you are hiring data analysts make sure they can write SQL and can architect a simple database. It will be a big help when they need to merge data sets or give requirements to your IT staff.

As I said above, transformation is much more interesting, involving both cleaning the data and creating new variables for analysis. Lets talk about data cleansing first. By data cleansing, I mean how your team handles missing values and outliers.

Though unglamorous, data cleansing is a critical task. Without having a consistent method for data cleaning, everyone will invent their own method. Once people start sharing data sets, a lack of consistency leads to bad analysis. Further, you want to ensure that your vendors and your data warehouse play by the same rules as your analysts. You might think it is obvious that missing values should be coded as missing, but your DBAs are not analysts or statisticians and they have different priorities.

True story on handling of missing data: We had one data set that tracked values over of something over time, call it Revenue per Week. If Revenue per Week for Time2 was missing, the DBA pulled the values from Time1 and plugged it into Time2. In our case, we found data from t1 was being used in t71. As a result, the data set was unnaturally stable over a very long period of time. Because of our data quality checking efforts, we found the problem and now missing data is coded as missing.

The second part of transformation involves creating variables. As I have said before, I like a focused approach to data analysis. If you start transforming variables for analysis (taking the square and cubes, etc.) you add variables. Seems to run counter to taking a focused approach, I know. I am not suggesting creating every possible transformation, but you are going to need a few to help you capture non-linear effects or normalize your data.

In my experience, if you take take the square, the log (base 10), and the inverse of your variables you are going to get most of the value out of your transformations. You are going to need to use some common sense (what is the square of a categorical variable?) but in general, you are not going to need to go crazy creating new variables. However, your data analysts are going to want to create every possible transformation that they can think of; it is easy and they might need them later. Serious emphasis on might. In my opinion, the time would be better spent thinking about what transformations are actually useful. Also, each of those transformations creates another column of data that has to be processed, slowing down your analyses. My recommendation is to use the big three and if you can logically think through other variables that may need to be transformed, tackle those on an one-off basis.

Another type of transformation is creating interaction terms. I had written a long and involved descriptions of how interactions work and why you should care, but I deleted it; I think interactions deserve deserves its own post. Interactions capture additional impact of two variables combines that is not captured by considering each variable separately. The short version is that you can (and should) create interaction variables by multiplying variables together. The challenge: it is difficult to know which interactions to create. You could guess by now that I am not in favor of creating every possible interaction terms, you would create an enormous data set that would not be useful. I am a fan of creating interactions that, due to your understanding of your business, you think exist and making them a regular part of your transformation process.

Last piece, Load. I have nothing to say here. In most analytic platforms, once you have extracted the data, it is loaded. Really, the process for data analyst should be called ELT.

Take aways? You want to be thoughtful about your transformation process. People often create tons of variables as a proxy for thinking deeply about the problem they are trying to solve. Creating big, dumb, datasets have real costs associated with them. Not least of which is that they are hard to analyze. Also, create a standard method for data cleaning and make sure everyone knows the standard.

2 comments:

Unknown said...

One might also consider transforming nominal data into a series of dichotomous variables, depending on your hypotheses (obviously). This is one case were it really pays to do a lot of thinking up front--what is my reference category? What does that mean in the context of my business/what my research it trying to address? Common sense reference categories might be the lowest (or highest) extreme, or the modal category.

Unknown said...

Mark is right. I should have included nominal data in the discussion.