Showing posts with label analytic value chain. Show all posts

Saturday, December 1, 2007

The Analytic Value Chain - Understand the relationships in your data

Once the dataset is together and the data is QA'ed, new analysts want to dive right in and start mining the data or building statistical models. More experienced hands just putter. They run descriptives, they check that the relationships they expected to see actually exist (age and income are positively related, for example), and they create some histograms to look at the frequency distribution of the variables.
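To make the puttering concrete, here is a minimal sketch in plain Python. The ages and incomes are made-up toy values, and the helper names (`describe`, `pearson`, `histogram`) are my own for illustration; in practice you would use whatever your analytic platform provides.

```python
import math
import statistics
from collections import Counter

# Hypothetical toy records; a real dataset would come from your own extract.
ages = [23, 31, 38, 45, 52, 60, 28, 41]
incomes = [28000, 41000, 52000, 61000, 70000, 66000, 35000, 58000]

def describe(values):
    """Return basic descriptives for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def pearson(xs, ys):
    """Pearson correlation, computed by hand to keep the sketch dependency-free."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

print(describe(ages))
# Sanity check: do age and income move together, as expected?
print("age/income correlation:", round(pearson(ages, incomes), 2))
# Crude text histogram of the age distribution in 10-year bins.
for lower, count in sorted(histogram(ages, 10).items()):
    print(f"{lower:>3}-{lower + 9:<3} {'#' * count}")
```

If the correlation came back negative or near zero, that would be exactly the kind of surprise worth chasing down before any modeling starts.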

Why all the putterage? An effective analyst needs to develop an intuition about the data. Without that intuition, they won't have the common sense required to make the judgment calls they will face when they get down to modeling. Data analysis requires judgment (which is why you hire experts to do it), and without a good intuitive understanding of how the data behaves, those judgment calls rest on an unstable foundation. You also get to understand the business dynamics (by understanding what variables drive outcomes) in a way that the people who are running the business cannot.

Another point: you are going to find relationships that were unexpected. Take note of them; you are going to want to follow up and understand whether they are real. These unexpected relationships are the beginning of finding what I call "Game Changing Analyses", the home run of data analytics. What do I mean by game changing? Analyses that lead your business partners to change their business strategy, and that indicate that activities fundamental to the business need to be reassessed. A good analyst should be able to have this kind of impact multiple times a year.

For those data analysts who read the blog, developing the intuition is critical for your ability to have impact and for your reputation. Imagine that you are presenting some work to a senior executive and they ask a question that you can't answer based on your analysis. If you have a good understanding of the data you can say "I can't answer your question definitively, but given what I know of the data, I believe this to be the answer to your question. Of course, I will check the answer as soon as I get back to my desk." To the extent that you consistently get these kinds of questions right, you become a trusted resource for the executive. I know more than one analyst whose job was saved because they proved over and over again that they were an expert in the dynamics of a business unit.

Upshot: spend the time getting to know your data. If you are a manager, build time to explore data into your project plans. Understanding the relationships in the data is a critical part of the data analyst's and the manager's job.

Monday, November 12, 2007

The Analytic Value Chain - QA the data

Related posts
The Analytic Value Chain introduced
Defining the problem
Determine Data Requirements
Locate Relevant Data
Extract, Transform, and Load the data

Sorry for the long delay between posts. I have been focused on my job search and just got my head above water.

I won't do a big song and dance on this part of the value chain; I already did a post on QA'ing your data, here.

A quick story on the prevalence of data quality problems. I was speaking with someone who is an expert in the direct marketing industry about her brief sojourn in consulting. She went into consulting because she likes doing novel things and she thought that consulting was going to provide that variety. She lamented that all of the engagements she worked on were similar, with "correcting data hygiene" taking up most of her time. The problem is endemic.

No matter where you get your data, a third party or an internal data mart, you have to check the quality of the data. And regularly. Assume data is guilty until proven innocent.
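As a rough illustration of "guilty until proven innocent", here is a small sketch of a recurring check that counts missing and out-of-range values per field. The field names, the valid ranges, and the `qa_report` helper are all assumptions made up for this example; your own checks will depend on your data.

```python
def qa_report(rows, valid_ranges):
    """Count missing and out-of-range values for each field in valid_ranges."""
    report = {}
    for field, (low, high) in valid_ranges.items():
        values = [row.get(field) for row in rows]
        missing = sum(v is None for v in values)
        out_of_range = sum(
            v is not None and not (low <= v <= high) for v in values
        )
        report[field] = {"missing": missing, "out_of_range": out_of_range}
    return report

# Hypothetical records with one missing age and two implausible values.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 212, "income": -5},
]
# Assumed plausibility limits; set these from your own business knowledge.
valid_ranges = {"age": (0, 120), "income": (0, 10_000_000)}

report = qa_report(rows, valid_ranges)
for field, counts in report.items():
    print(field, counts)
```

Running something like this every time a data set is refreshed is one way to make the "and regularly" part routine.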

Wednesday, October 3, 2007

The Analytic Value Chain - Extract, Transform, and Load the data

Related posts
The Analytic Value Chain introduced
Defining the problem
Determine Data Requirements
Locate Relevant Data

Extract, Transform, and Load (or ETL) is database administrator talk for getting data ready to analyze. For those who want a more formal definition, check out Wikipedia. I'll talk a little bit about each, but most of the action is in Transform.

From the data analyst perspective, there is not much to say on extraction; you need to get the data out of your systems and you need people who have the requisite skills. Think SQL. If you are hiring data analysts make sure they can write SQL and can architect a simple database. It will be a big help when they need to merge data sets or give requirements to your IT staff.
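For a sense of the level of SQL I have in mind, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their columns are invented for the example; the point is simply that an analyst should be comfortable designing two small tables and merging them.

```python
import sqlite3

# In-memory database with two hypothetical tables to merge.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
""")

# LEFT JOIN keeps customers with no orders, so they show up with zero
# orders instead of silently dropping out of the analysis data set.
rows = conn.execute("""
    SELECT c.customer_id, c.region,
           COUNT(o.order_id) AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
    ORDER BY c.customer_id
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Choosing between an inner join and a left join is a small decision, but it changes which customers end up in the analysis, which is why the analyst needs to understand it.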

As I said above, transformation is much more interesting, involving both cleaning the data and creating new variables for analysis. Let's talk about data cleansing first. By data cleansing, I mean how your team handles missing values and outliers.

Though unglamorous, data cleansing is a critical task. Without having a consistent method for data cleaning, everyone will invent their own method. Once people start sharing data sets, a lack of consistency leads to bad analysis. Further, you want to ensure that your vendors and your data warehouse play by the same rules as your analysts. You might think it is obvious that missing values should be coded as missing, but your DBAs are not analysts or statisticians and they have different priorities.

True story on handling of missing data: We had one data set that tracked the value of something over time, call it Revenue per Week. If Revenue per Week for Time2 was missing, the DBA pulled the value from Time1 and plugged it into Time2. In our case, we found data from t1 was being used in t71. As a result, the data set was unnaturally stable over a very long period of time. Because of our data quality checking efforts, we found the problem and now missing data is coded as missing.
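Here is a small sketch of what that carry-forward does to a series, and of one simple check that can flag it. The weekly numbers are invented, and `forward_fill` and `longest_constant_run` are hypothetical helper names; this is not the code our DBA ran, just an illustration of the effect.

```python
from itertools import groupby

# Hypothetical weekly revenue series; None marks a genuinely missing week.
revenue_per_week = [1200.0, 1350.0, None, None, 1410.0, None, 1500.0]

def forward_fill(values):
    """Carry the last observed value into each gap (the practice to avoid)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def longest_constant_run(values):
    """Length of the longest run of identical consecutive values."""
    return max((len(list(group)) for _, group in groupby(values)), default=0)

filled = forward_fill(revenue_per_week)
print("missing weeks kept as missing:", revenue_per_week.count(None))
print("longest identical run after fill:", longest_constant_run(filled))
```

An implausibly long run of identical values in a measure that should move week to week is a hint that something upstream is filling gaps for you.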

The second part of transformation involves creating variables. As I have said before, I like a focused approach to data analysis. If you start transforming variables for analysis (taking squares, cubes, etc.) you add variables. Seems to run counter to taking a focused approach, I know. I am not suggesting creating every possible transformation, but you are going to need a few to help you capture non-linear effects or normalize your data.

In my experience, if you take the square, the log (base 10), and the inverse of your variables you are going to get most of the value out of your transformations. You are going to need to use some common sense (what is the square of a categorical variable?) but in general, you are not going to need to go crazy creating new variables. However, your data analysts are going to want to create every possible transformation that they can think of; it is easy and they might need them later. Serious emphasis on might. In my opinion, the time would be better spent thinking about what transformations are actually useful. Also, each of those transformations creates another column of data that has to be processed, slowing down your analyses. My recommendation is to use the big three and, if you can logically think through other variables that may need to be transformed, tackle those on a one-off basis.
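The big three are simple enough to sketch in a few lines of Python. The `big_three` name and the sample values are mine, and returning `None` where a transformation is undefined (log of zero or a negative number, inverse of zero) is an assumption that follows the "code missing as missing" rule rather than a required convention.

```python
import math

def big_three(x):
    """Return the square, log base 10, and inverse of a numeric value.

    Log is undefined for x <= 0 and the inverse for x == 0; those cases
    come back as None so they stay coded as missing.
    """
    return {
        "square": x * x,
        "log10": math.log10(x) if x > 0 else None,
        "inverse": 1 / x if x != 0 else None,
    }

# Hypothetical values of a single numeric variable.
incomes = [100.0, 1000.0, 0.0]
for income in incomes:
    print(income, big_three(income))
```

Note that each input variable becomes four columns once the transformations are added, which is the processing cost mentioned above.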

Another type of transformation is creating interaction terms. I had written a long and involved description of how interactions work and why you should care, but I deleted it; I think interactions deserve their own post. Interactions capture the additional impact of two variables combined that is not captured by considering each variable separately. The short version is that you can (and should) create interaction variables by multiplying variables together. The challenge: it is difficult to know which interactions to create. You could guess by now that I am not in favor of creating every possible interaction term; you would create an enormous data set that would not be useful. I am a fan of creating interactions that, due to your understanding of your business, you think exist and making them a regular part of your transformation process.
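A minimal sketch of that approach: keep a short, deliberate list of variable pairs and multiply them. The variable names (`tenure_years`, `is_premium`) and the `add_interactions` helper are hypothetical, chosen only to illustrate the idea.

```python
from math import comb

# Hypothetical customer rows with two candidate variables.
rows = [
    {"tenure_years": 2, "is_premium": 1},
    {"tenure_years": 5, "is_premium": 0},
    {"tenure_years": 8, "is_premium": 1},
]

# Interactions chosen from business understanding, not every possible pair.
interactions = [("tenure_years", "is_premium")]

def add_interactions(row, pairs):
    """Return a copy of row with one product column per (a, b) pair."""
    out = dict(row)
    for a, b in pairs:
        out[f"{a}_x_{b}"] = row[a] * row[b]
    return out

augmented = [add_interactions(row, interactions) for row in rows]
for row in augmented:
    print(row)

# Why not every pair? With 1,000 variables the pairwise count explodes.
print("pairwise interactions for 1000 variables:", comb(1000, 2))
```

With a thousand base variables, creating every pairwise product would add nearly half a million columns, which is the enormous and not very useful data set I am warning about.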

Last piece, Load. I have nothing to say here. In most analytic platforms, once you have extracted the data, it is loaded. Really, the process for data analysts should be called ELT.

Takeaways? You want to be thoughtful about your transformation process. People often create tons of variables as a proxy for thinking deeply about the problem they are trying to solve. Creating big, dumb datasets has real costs, not least of which is that they are hard to analyze. Also, create a standard method for data cleaning and make sure everyone knows the standard.

Monday, September 24, 2007

The Analytic Value Chain - Locate relevant data

Related posts
The Analytic Value Chain introduced
Defining the problem
Determine Data Requirements

This post overlaps with the previous Value Chain post on determining the data requirements, but is meant to be a bit more tactical. If you have followed my advice from the previous post, you have a good sense of what data you are going to need. Now you need to find it. In most organizations I have worked with, the data to solve any given business problem exists. The challenge is that the data often exists in a place that is not accessible to most of the folks in the organization. The data may not be in a production environment (it is sitting on a server under someone's desk) and if it is in production, the data warehouse might be so large that no one really knows what is in there (this is not an uncommon problem in real warehouses. I had an expert in physical warehousing once tell me that a really good warehouse knows where a specific pallet can be found 80% of the time). I once had both problems at the same time. I found two data sets that answered a critical business problem when combined, but were running on different desktops in two different parts of my client's organization. I found the data by luck, but wound up doing a very impactful analysis. My value was in carrying the data sets, on floppy disks, back to my PC. Obviously, you can't analyze data that you can't find.

So, what to do? I don't have a ton of advice here, but I have found some things work pretty well in identifying the data that is out there and making it accessible. First, treasure your staff who really know your data infrastructure because they have hard-fought knowledge (for those in my organization, you know who you are. And you know how much I value your contributions. You also know that I am understating.) There is no replacement for just having experience in your data infrastructure. Having said that we rely on people power, metadata helps. Even the best staff are not going to be able to find useful data if you don't have data dictionaries for every dataset.

Second, create tables that aggregate your most useful data. We do this and can get our hands on useful data sets pretty quickly; in minutes. We evaluate the variables in those tables about once a year (or after any major strategy shift) to ensure that the data set maintains its usefulness. This data set has an additional value in that it can be shared with your entire organization and forms a common "fact base" for the organization.

Third, try to think ahead and ask your team to be on the lookout for certain types of data. There are business questions that I know I am going to want to take a look at and by communicating to the team my topics of interest, they can make serendipitous discoveries.

Sunday, September 9, 2007

The Analytic Value Chain - Determine data requirements

Defining the analytic value chain post

This year, I have had three conversations with vendors who are trying to sell my company some piece of third party data. We currently have well over a thousand variables available to us for modeling and insight, before transformations; our problem is not a lack of data. Our problem is that we have too much of it. On a recent data mining project, we included over 6000 variables in the dataset. So, why do I even mention determining data requirements as part of the analytic value chain? If you can include every variable under the sun, why not just grab 'em all? This way, you don't have to make any hard choices about what to include and what to leave out of the analysis.

I wish that things were that simple. For well understood problems,

Thursday, September 6, 2007

The Analytic Value Chain - Defining the Problem

Related posts
The Analytic Value Chain introduced

One mistake analysts make when starting analytic work is to just jump right in with the data. I have had staff waste weeks of time learning that they (and I) misunderstood the problem being posed by our business partners. I am not proud of this, but it is a lesson learned. What have we done to mitigate the problem? We try to have kick off meetings for every large scale analysis so that the person doing the analysis (and their manager) can get a good bead on the problem the business is trying to solve. In these conversations, we spend a lot of time trying to understand why our internal clients think something is a problem, try to get a sense of root causes (that drive our data needs and analytic choices), and get agreement as to the deliverable (as well as timing). We also try to set some ground rules around changing the problem in mid-stream. Once we kick off an analysis, any change in the request resets the timeline that we provided for the analysis. Another best practice here is for the manager to check in with the analyst a week or so after the analyst starts the work.

So, the first way that an analytic staff can add value is to help their internal clients define the problem.

Wednesday, August 29, 2007

The Analytic Value Chain

I am currently hiring for a Director for my statistics team and it is a hard position to hire for (if you think you might be qualified, please send me a resume.) A perfect candidate has to have skills all across the data analysis value chain. I imagine you are thinking "consultant speak"; maybe so, but I think of analyses as a product that needs to run through an analytic "factory." You can manufacture impactful analyses, just like GM manufactures cars. And I think of the value chain as having 9 steps.

1. Define the problem
2. Determine data requirements
3. Locate relevant data
4. Extract, Transform, and Load the data
5. QA data
6. Understand relationships in data
7. Do the analysis
8. Create presentation materials
9. Share results with business

It is hard to find someone who has the business skills to understand the problem faced by my internal clients, has the technical skill to manage the analysis, has the process re-engineering skills to improve the process (it is a factory, remember?), and the communication skills to present the results back to the business.

In the next couple of weeks, I'll discuss each of the steps in the value chain. Hopefully this set of posts will help folks as they set up their own analytic shops.