Monday, June 8, 2009

Data Simplification

When I was trained, I was told that you should never take continuous-level data and make it categorical. One of the guiding principles of regression analysis is that variance is good: never reduce it by simplifying your data and creating categories. Maybe an example is in order:


Say you have a variable like temperature. This data is continuous: it is not bounded (except in extreme cases), and the temperature of something can take a wide range of values. In regression, there would be no reason to define ranges of temperature (0-10, 11-20, 21-30, etc.). The computer does the work, and if you created the ranges you might reduce the explanatory power of the data (or, if the data were used as a dependent variable, make it harder for other variables to predict its value). So, in research, categorizing data is a no-no. So said Professor Feldman.
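
To make the loss of variance concrete, here is a minimal sketch of the kind of binning described above. The function name `bin_temperature`, the 10-degree width, and the sample readings are all made up for illustration, and the ranges here are half-open (0 up to but not including 10, and so on) rather than the 0-10, 11-20 ranges in the text.

```python
import math

def bin_temperature(temp, width=10):
    """Return the lower edge of the fixed-width range that contains temp."""
    return int(math.floor(temp / width) * width)

# Hypothetical readings, chosen only to show the effect of binning.
readings = [3.2, 14.7, 15.1, 29.9]
binned = [bin_temperature(t) for t in readings]
print(binned)
```

Note that 14.7 and 15.1 land in the same category: after binning, the model can no longer tell them apart, which is exactly the variance the theory says you should not throw away.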

Funny thing was that I had a staff member (David) who kept telling me that while the theory was right, you can usually create categories without much loss of predictive power. And in certain applications, working with categories is much easier than working with continuous data (ad serving is one such application, but that is another post).

The other day, I went through the exercise. I took continuous-level data and made it categorical. David was right. Prof was wrong. At least in the world of digital marketing. The data retained its power, and it is easier for consumers of the data to use. Having said that, you still need a fair number of categories (over 10) to retain the power. Even so, I thought this was a fact worth knowing.
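
A rough sketch of this kind of comparison, on synthetic data rather than the data from the exercise above, might look like the following. Everything here is an assumption made for illustration: the linear relationship, the noise level, the 0-100 range, and the helper names `fit_continuous` and `fit_categorical`. The categorical fit predicts each value by the mean of its bin, which is what a regression on bin dummy variables would do.

```python
import random

def r_squared(y, predictions):
    """Share of the variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_resid = sum((v - p) ** 2 for v, p in zip(y, predictions))
    return 1 - ss_resid / ss_total

def fit_continuous(x, y):
    """Ordinary least squares with a single continuous predictor."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * a for a in x]

def fit_categorical(x, y, n_bins, low=0.0, high=100.0):
    """Cut x into n_bins equal-width categories; predict y by its bin mean."""
    width = (high - low) / n_bins
    labels = [min(int((a - low) / width), n_bins - 1) for a in x]
    sums, counts = {}, {}
    for label, value in zip(labels, y):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return [sums[label] / counts[label] for label in labels]

# Synthetic data: an assumed linear signal plus Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(0, 100) for _ in range(2000)]
y = [2 * a + rng.gauss(0, 20) for a in x]

r2_continuous = r_squared(y, fit_continuous(x, y))
for n_bins in (2, 5, 10, 20):
    r2_binned = r_squared(y, fit_categorical(x, y, n_bins))
    print(f"{n_bins:>2} bins: R^2 = {r2_binned:.3f} "
          f"(continuous: {r2_continuous:.3f})")
```

On data like this, a handful of wide bins gives up a noticeable share of the explained variance, while a larger number of bins comes close to the continuous fit. How many categories are enough will depend on the data, so treat this as a way to check rather than a rule.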
