Tuesday, June 9, 2009

Brute force vs. smarts

My dissertation was, in part, about how to encourage people, when solving problems, to think about the information they already have available to them rather than gathering information for its own sake. I found that you can save a lot of money by charging just a token amount for each new piece of information. When people have to pay for information, they think more deeply about the information already in their possession and stop asking for information they don't really need to solve the problem. I had a real-world brush with this phenomenon the other day.


I got a call from someone who had an enormous database (multiple petabytes) and was looking for advice on how to "scale" it. I almost choked. How much more scale do you need? They were saving every piece of information they gathered from their customers and were afraid to throw anything away. "We don't know what we are going to need in the future" was the refrain.

In my mind, the organization was not thinking hard about the data it had and how to use it efficiently. Rather, cheap storage led them down an easy path. The brute force path. The engineering path. I can tell you from experience that, given enough money, the technical folks will solve the problem of adding storage. The organization still thought of its challenge as one of engineering: "How do we save even more data?" But the engineers can't fix the underlying problem: the organization was not being thoughtful about the data it was saving. At AOL, we did an analysis and found that for predictive modeling we relied on a small set of data (fewer than 100 variables) and used only 10-15 for any given model. We had several thousand variables available to us, but most were either correlated with other variables (and could be deleted with no loss of usefulness), tied to a business we were no longer in, or saved for no reason we could determine other than that it was easy.
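If you want to see how much of your variable list is redundant, a minimal sketch of that kind of correlation pruning looks like the following. This is my own illustration, not the AOL analysis; the threshold and the "variables.csv" extract are assumptions.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one variable out of every highly correlated pair of numeric columns."""
    corr = df.corr(numeric_only=True).abs()
    to_drop: set[str] = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                to_drop.add(b)  # keep the first variable of the pair, drop the second
    return df.drop(columns=sorted(to_drop))

# Hypothetical usage: "variables.csv" stands in for whatever extract holds the candidate columns.
# df = pd.read_csv("variables.csv")
# slim = drop_correlated(df)
# print(f"kept {slim.shape[1]} of {df.shape[1]} variables")
```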

I would suggest to companies looking to "scale their databases" that they first do an inventory of the data they are saving and develop a simple process to determine whether the data is worth keeping. In AOL's case, each modeler assigned a letter grade to each variable and we "voted" on the data to throw away. And after this process, at no time did we say "I wish we had kept variable x." The reality was we still had too much data.
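A back-of-the-envelope version of that voting step could be as simple as the sketch below. The variable names, the modelers, and the grades are all invented for illustration.

```python
import pandas as pd

# Hypothetical letter grades from three modelers (A = clearly keep, F = clearly drop).
grades = pd.DataFrame({
    "modeler_1": {"age": "A", "tenure": "A", "legacy_dialup_flag": "F", "zip_plus_4": "D"},
    "modeler_2": {"age": "A", "tenure": "B", "legacy_dialup_flag": "F", "zip_plus_4": "C"},
    "modeler_3": {"age": "B", "tenure": "A", "legacy_dialup_flag": "D", "zip_plus_4": "D"},
})

score = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
avg = grades.apply(lambda col: col.map(score)).mean(axis=1).sort_values()

# Variables at the bottom of the ranking are the candidates to stop saving.
print(avg)
```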

Monday, June 8, 2009

Thinking about BI recently

It occurred to me that using a BI tool is a hard way to gain insight. You are limited by your own imagination. I like hypothesis-driven analysis, but I think you can do a much better job of providing insight if you understand a bit of econometrics (logit and OLS). You can simply run a stepwise regression on the variable you are trying to understand (say, the lifetime value of a customer) and see what pops out (say, the interaction of age and education). Once you see which variables pop, you can then use BI tools to illustrate the point. One critical piece: make sure you run all of the interactions. That is where the cool stuff lies.
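Here is a rough sketch of that workflow in Python with statsmodels, run on simulated data. The column names, the "ltv" target, and the forward-stepwise helper are my own assumptions, not a packaged routine; the point is simply that you throw the main effects plus all pairwise interactions at the model and see what sticks.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
# Simulated stand-in for a customer table (all names are hypothetical).
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "education": rng.normal(14, 2, n),
    "tenure": rng.normal(5, 2, n),
    "visits": rng.poisson(10, n).astype(float),
})
# LTV here depends on tenure plus an age x education interaction, plus noise.
df["ltv"] = 20 * df["tenure"] + 0.5 * df["age"] * df["education"] + rng.normal(0, 50, n)

main = ["age", "education", "tenure", "visits"]
terms = main + [f"{a}:{b}" for a, b in itertools.combinations(main, 2)]

def forward_stepwise_ols(data, target, candidates, p_enter=0.05):
    """Greedy forward selection: each round, add the candidate term whose
    coefficient has the smallest p-value, if it clears the entry threshold."""
    selected, remaining = [], list(candidates)
    while remaining:
        best_term, best_p = None, p_enter
        for term in remaining:
            formula = f"{target} ~ " + " + ".join(selected + [term])
            p = smf.ols(formula, data=data).fit().pvalues.get(term, 1.0)
            if p < best_p:
                best_term, best_p = term, p
        if best_term is None:
            break
        selected.append(best_term)
        remaining.remove(best_term)
    return selected

# The age:education interaction and tenure should surface near the top of the list.
print(forward_stepwise_ols(df, "ltv", terms))
```

Once the interesting terms pop out, a BI dashboard cut by those same variables is where the finding gets sold.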

Data Simplification

When I was trained, I was told that you should never take continuous data and make it categorical. One of the guiding principles of regression analysis is that variance is good; never reduce it by simplifying your data and creating categories. Maybe an example is in order:


Say you have a variable like temperature. This data is continuous; it is not bounded (except in extreme cases), and the temperature of something can take a wide range of values. In regression, there would be no reason to define ranges of temperature (0-10, 11-20, 21-30, etc.). The computer does the work, and if you created the ranges you might reduce the explanatory power of the data (or, if the data was used as a dependent variable, make it harder for other variables to predict its value). So, in research, categorizing data is a no-no. So said Professor Feldman.
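A quick sketch of what that categorization actually looks like, using made-up temperature readings: the one continuous number becomes a set of buckets, which then enter a regression as dummy variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
temperature = pd.Series(rng.uniform(-10, 40, 1000), name="temp_c")  # continuous readings

# Collapsing the readings into 10-degree buckets is the simplification the rule warns against.
buckets = pd.cut(temperature, bins=range(-10, 50, 10))
print(buckets.value_counts().sort_index())

# In a regression the buckets would enter as dummy variables instead of one number.
print(pd.get_dummies(buckets, prefix="temp").head())
```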

The funny thing was that I had a staff member (David) who kept telling me that while the theory was right, you can usually create categories without much loss of predictive power. And in certain applications, working with categories is much easier than working with continuous data (ad serving is one such application, but that is another post).

The other day, I went through the exercise. I took continuous data and made it categorical. David was right. The professor was wrong, at least in the world of digital marketing. The data retained its power, and it is easier for consumers of the data to use. Having said that, you still need a fair number of categories (more than 10) to retain the power. Even so, I thought this was a fact worth knowing.
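A small simulation of my own (toy data, not the marketing data set I actually used) shows the same pattern: with only a handful of bins the fit degrades noticeably, and with roughly ten or more bins the R-squared is close to what the continuous variable gives you.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 20000
x = rng.uniform(-10, 40, n)             # continuous driver, e.g. temperature
y = 3.0 * x + rng.normal(0, 15, n)      # outcome with a linear relationship plus noise

def r_squared(design: pd.DataFrame) -> float:
    """Fit OLS of y on the given design matrix and return R-squared."""
    return sm.OLS(y, sm.add_constant(design)).fit().rsquared

continuous = pd.DataFrame({"x": x})
print(f"continuous: R^2 = {r_squared(continuous):.3f}")

for bins in (3, 5, 10, 20):
    dummies = pd.get_dummies(pd.cut(x, bins=bins), drop_first=True, dtype=float)
    dummies.columns = dummies.columns.astype(str)
    print(f"{bins:>2} bins:    R^2 = {r_squared(dummies):.3f}")
```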