Wednesday, September 19, 2007

The Analytic Value Chain - Determine Data Requirements

Related posts
The Analytic Value Chain introduced
Defining the problem

So, once you have a good understanding of the problem, you should probably grab every piece of data you can find, transform the variables every way you can think of, and start analyzing to figure out which variables impact your dependent variable, right?

This way lies madness and spurious results. It is the approach a naive data analyst would take (experts in data mining might have a different take, but I was trained as a research scientist and I don't like the kitchen-sink approach). Before you start grabbing data, you should spend some time thinking about the behavior you are trying to predict and what phenomena might drive that behavior. Let me go a step further: I recommend that you generate hypotheses about the relationships you expect to find. Write 'em down. I go so far as to create something I call a conceptual map, which graphically shows how I think all of the variables, including the dependent variable, impact each other. Once I see the relationships, I can quickly generate hypotheses. Maybe you are thinking, what does this have to do with determining your data requirements? Hang on, I am getting to it.

Once you have the hypotheses, you can start thinking about what data you need to test them. In my world, it is important to have a good idea of what kinds of analyses we are going to be doing, because most of the time we have access to many more variables than we can possibly look at. We can't go on a fishing expedition. Some other reasons to think hard about your data requirements:

  • In rapidly changing businesses, you'll spend more time finding data than analyzing it. So parsimony is key: you don't want to spend time getting data that is not going to be useful.
  • Most large enterprises have real restrictions on who can access specific types of data. So, even if you find the data, you'll have to figure out who owns it before you can use it.
  • Merging and analyzing larger datasets takes processor time. The smaller the datasets, the less time the analysts spend waiting for results.
  • The world is complicated. The more variables you have, the greater the chance of finding a spurious correlation. Also, if you have a good set of hypotheses and a conceptual map, you'll have a better idea of whether a relationship makes sense.
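
To make the last point concrete, here is a minimal sketch, not anything from a real project. It assumes a made-up setup: a dependent variable that is pure random noise and a pile of candidate variables that are also pure noise, so any correlation it finds is spurious by construction. The function names, the sample size of 50, and the checkpoints of 5, 20, and 200 variables are all my own illustrative choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_noise_correlations(n_obs=50, n_vars=200,
                                 checkpoints=(5, 20, 200), seed=0):
    """Correlate pure-noise variables with a pure-noise target.

    Returns, for each checkpoint k, the largest |r| found among the
    first k candidate variables. Nothing here is truly related to the
    target, so every correlation reported is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]  # hypothetical dependent variable
    abs_r = []
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]  # hypothetical candidate variable
        abs_r.append(abs(pearson_r(noise, target)))
    return {k: max(abs_r[:k]) for k in checkpoints}


if __name__ == "__main__":
    for k, r in strongest_noise_correlations().items():
        print(f"{k:>3} candidate variables: strongest |r| = {r:.2f}")
```

Because each larger set of candidates contains the smaller ones, the strongest correlation can only stay the same or grow as you throw more variables into the sink. None of them mean anything, which is why you want a hypothesis in hand before you go looking.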

In sum, planning for your data needs, though it takes time up front, saves you time in the end.
