Tuesday, October 23, 2007

AOL Layoffs

Posts are going to be a bit slow for a couple of weeks. AOL laid off about 80% of my business unit (the access business) and I was, as they say, impacted. When you stop doing marketing, you don't need folks in Marketing Analysis. If anyone has a need for business analysts, please contact me. A number of my staff are looking for work.

My resume is here.

Saturday, October 13, 2007

"Growing" a SAS analyst

The other day, someone asked me how to “grow” a SAS business analyst. My first thought was “Let Capital One do it for you”. I then got to thinking about what it means to “grow” someone. I think the question was really “what skills does a SAS analyst need?” I was talking to David Ye, who is a senior manager on my stats team about this problem and he noted that there is often a difference between SAS programmers and SAS business analyst. Just to be clear, I am talking about a business analyst role.

Putting aside things that make a good analyst (being a voracious and tenacious problem solver, have a good understanding of your business dynamics, good written communication skills, etc), an analyst who relies on SAS as their primary analytic tool needs to:

  • To be able to pull their own data (SQL skills and Proc SQL)
  • Know how to use SAS efficiently (can’t overstate the importance of this, think temp tables…)
  • Have a good understand the analytic procedures available (and options) they need for their job (I like Proc Means, Anova, Reg, Corr, Cluster, Factor, Chart, Plot, and Tabulate. Also, if you are doing serious experimental work, you need GLM and Mixed)
  • How to leverage the various programming options (Macro, SAS Code)

Some of these are easier than others to develop. Most of these skills are hard won, so if you are trying to train someone up on anything but the most basic Procs, my advice would be to hire an experienced person and have them train new staff. Someone who is new to SAS, in my opinion, needs someone close by to answer questions.

Wednesday, October 3, 2007

The Analytic Value Chain - Extract, Transform, and Load the data

Related posts
The Analytic Value Chain introduced
Defining the problem
Determine Data Requirements
Locate Relevant Data

Extract, Transform, and Load (or ETL) is database administrator talk for getting data ready to analyze. For those who want a more formal definition, check out Wikipedia. I'll talk a little bit about each, but most of the action is in Transform.

From the data analyst perspective, there is not much to say on extraction; you need to get the data out of your systems and you need people who have the requisite skills. Think SQL. If you are hiring data analysts make sure they can write SQL and can architect a simple database. It will be a big help when they need to merge data sets or give requirements to your IT staff.

As I said above, transformation is much more interesting, involving both cleaning the data and creating new variables for analysis. Lets talk about data cleansing first. By data cleansing, I mean how your team handles missing values and outliers.

Though unglamorous, data cleansing is a critical task. Without having a consistent method for data cleaning, everyone will invent their own method. Once people start sharing data sets, a lack of consistency leads to bad analysis. Further, you want to ensure that your vendors and your data warehouse play by the same rules as your analysts. You might think it is obvious that missing values should be coded as missing, but your DBAs are not analysts or statisticians and they have different priorities.

True story on handling of missing data: We had one data set that tracked values over of something over time, call it Revenue per Week. If Revenue per Week for Time2 was missing, the DBA pulled the values from Time1 and plugged it into Time2. In our case, we found data from t1 was being used in t71. As a result, the data set was unnaturally stable over a very long period of time. Because of our data quality checking efforts, we found the problem and now missing data is coded as missing.

The second part of transformation involves creating variables. As I have said before, I like a focused approach to data analysis. If you start transforming variables for analysis (taking the square and cubes, etc.) you add variables. Seems to run counter to taking a focused approach, I know. I am not suggesting creating every possible transformation, but you are going to need a few to help you capture non-linear effects or normalize your data.

In my experience, if you take take the square, the log (base 10), and the inverse of your variables you are going to get most of the value out of your transformations. You are going to need to use some common sense (what is the square of a categorical variable?) but in general, you are not going to need to go crazy creating new variables. However, your data analysts are going to want to create every possible transformation that they can think of; it is easy and they might need them later. Serious emphasis on might. In my opinion, the time would be better spent thinking about what transformations are actually useful. Also, each of those transformations creates another column of data that has to be processed, slowing down your analyses. My recommendation is to use the big three and if you can logically think through other variables that may need to be transformed, tackle those on an one-off basis.

Another type of transformation is creating interaction terms. I had written a long and involved descriptions of how interactions work and why you should care, but I deleted it; I think interactions deserve deserves its own post. Interactions capture additional impact of two variables combines that is not captured by considering each variable separately. The short version is that you can (and should) create interaction variables by multiplying variables together. The challenge: it is difficult to know which interactions to create. You could guess by now that I am not in favor of creating every possible interaction terms, you would create an enormous data set that would not be useful. I am a fan of creating interactions that, due to your understanding of your business, you think exist and making them a regular part of your transformation process.

Last piece, Load. I have nothing to say here. In most analytic platforms, once you have extracted the data, it is loaded. Really, the process for data analyst should be called ELT.

Take aways? You want to be thoughtful about your transformation process. People often create tons of variables as a proxy for thinking deeply about the problem they are trying to solve. Creating big, dumb, datasets have real costs associated with them. Not least of which is that they are hard to analyze. Also, create a standard method for data cleaning and make sure everyone knows the standard.

OOO

Sorry for the radio silence. I have been traveling for the last 10 days and have little access to an internet connection. Today, I am speaking at the "Optimization Summit" in San Francisco, traveling back to DC tomorrow, and back in the harness on Friday. I have a new "Value Chain" post in the hopper and I'll post it in a couple of days.