Tuesday, June 9, 2009

Brute force vs. smarts

My dissertation was, in part, about how to encourage people, when solving problems, to think about the information they already have available to them and not just gather information for its own sake. I found that you can save a lot of money if you charge just a token amount for each new piece of information. When you charge for information, people think more deeply about the information in their possession and stop asking for information that they don't really need to solve the problem. I had a real-world brush with this phenomenon the other day.


I got a call from someone who had an enormous database (multi-petabytes) and was looking for some advice on how to "scale" the database. I almost choked. How much more scale do you need? They were saving every piece of information they gathered from their customers and were afraid to throw anything away. "We don't know what we are going to need in the future" was the refrain.

In my mind, the organization was not thinking hard about the data they had and how to use it efficiently. Rather, they let cheap storage lead them down an easy path. The brute force path. The engineering path. I can tell you from experience that, given enough money, the technical folks will solve the problem of increasing storage. The organization still thought of their challenge as one of engineering: "How do we save even more data?" But the engineers can't fix the underlying problem: the organization was not being thoughtful about the data they were saving. At AOL, we did an analysis and found that for predictive modeling, we relied on a small set of data (fewer than 100 variables) and only used 10-15 for any given use. We had several thousand variables available to us, but most were either correlated with other variables and could be deleted with no loss of usability, supported a business we were no longer in, or were being saved for no reason that we could determine, other than that it was easy.
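To make the "correlated with other variables" point concrete, here is a minimal sketch of how one might flag redundant variables. It is not the analysis we ran at AOL; the variable names, the sample values, and the 0.95 threshold are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def redundant_variables(columns, threshold=0.95):
    """Split variables into those worth keeping and those that
    nearly duplicate an already-kept variable.

    `threshold` is an assumed cutoff, not a recommended value.
    """
    kept = []
    redundant = {}  # redundant variable -> the kept variable it duplicates
    for name, values in columns.items():
        match = next(
            (k for k in kept if abs(pearson(values, columns[k])) >= threshold),
            None,
        )
        if match is None:
            kept.append(name)
        else:
            redundant[name] = match
    return kept, redundant


# Hypothetical data: "sessions" moves in lockstep with "minutes_online".
columns = {
    "minutes_online": [10, 20, 30, 40, 50],
    "sessions": [1, 2, 3, 4, 5],
    "support_calls": [3, 1, 4, 1, 5],
}
kept, redundant = redundant_variables(columns)
print(kept)
print(redundant)
```

Which variable of a correlated pair survives depends only on the order they are visited in, so in practice you would want a person to make that call.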

I would suggest that companies looking to "scale their databases" first do an inventory of the data they are saving and develop a simple process to determine if the data is worth saving. In our case at AOL, each modeler assigned a letter grade to each variable and we "voted" on the data to throw away. And, after this process, at no time did we say "I wish we had kept variable x." The reality was we still had too much data.
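A grading process like this is simple enough to sketch in a few lines. The version below is only an illustration under my own assumptions: the grade-to-points scale, the cutoff, and the variable names are hypothetical, and I am not claiming this is how our votes were actually tallied.

```python
# Assumed scale: letter grades mapped to points, as on a report card.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def variables_to_discard(grades, cutoff=2.0):
    """Return the variables whose average grade falls below `cutoff`.

    `grades` maps a variable name to the letter grades the modelers gave it.
    The cutoff of 2.0 (a "C" average) is an arbitrary choice for this sketch.
    """
    discard = []
    for variable, votes in grades.items():
        average = sum(GRADE_POINTS[g] for g in votes) / len(votes)
        if average < cutoff:
            discard.append(variable)
    return sorted(discard)


# Hypothetical variables, each graded by three modelers.
grades = {
    "tenure_months": ["A", "A", "B"],
    "legacy_product_flag": ["F", "D", "F"],
    "dialup_speed": ["C", "D", "D"],
}
print(variables_to_discard(grades))
```

The arithmetic matters less than the fact that every variable has to earn its place from the people who actually use it.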
