Thursday, May 15, 2008

Add more data? No, just more understanding

A couple of months ago, Anand Rajaraman, a professor at Stanford who teaches a class on data mining, wrote a blog post about a class assignment in which students have to do a non-trivial data mining exercise. A bunch of his students decided to go after the Netflix Prize, a contest run by Netflix to see if anyone could improve its movie recommendation algorithm by more than 10%. I love these kinds of contests. One team in Anand's class added data from the Internet Movie Database to the dataset Netflix supplied. Another team did not add data; instead, they spent their time optimizing the recommendation algorithm. It turns out that the team that added data did much better on movie recommendations. So much better, in fact, that they made the leaderboard. So what should we take away from this?

Anand suggests that "adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set." He is right, but he glosses over a critical word: "independent." That is, the data being added must not be strongly correlated with the existing data. It needs to represent something new.

My take: the team that added the data was smart. They operationalized descriptions of the movies better than the Netflix data did. They found a dataset that supplied a missing theoretical construct, good descriptions of movies, and that made the difference. Just adding a bunch of data is not the takeaway here. (At my previous employer, we had over 8,000 variables at the household level, thanks to a lot of transformations, that we could use to predict whether someone was going to take an offer. A typical model used fewer than 20 variables. Across the 20 models we had in production, we used fewer than 200 of the 8,000 variables.)

So what is the secret sauce for model improvement? Adding data that operationalizes a construct that was previously in the error term. In English: the team thought about what factors (what I was calling theoretical constructs) could possibly be used in a recommendation system and went to find data that could serve as a proxy for those factors. You should be willing to add (read: pay for) more data, but only if it measures something for which you don't already have an effective proxy.
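
To make the "error term" point concrete, here is a small simulation sketch. It is not the students' method and uses no Netflix or IMDb data. Everything in it is made up for illustration: the variable names (existing_signal, movie_description, redundant_copy), the coefficients, and the noise levels are all assumptions. The outcome depends on two independent factors, but the baseline model only sees one of them, so the other sits in the error term. We then compare adding a variable that is just a noisy copy of what we already have against adding a proxy for the missing factor.

```python
import random

def r_squared(columns, y):
    """Fit ordinary least squares (with intercept) and return in-sample R^2."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Build the normal equations A b = c, where A = X'X and c = X'y.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    c = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    # Solve by Gaussian elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for j in range(i, k):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    pred = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical data: two independent factors drive the outcome.
existing_signal = [rng.gauss(0, 1) for _ in range(n)]    # what we already measure
movie_description = [rng.gauss(0, 1) for _ in range(n)]  # the missing construct
rating = [a + b + rng.gauss(0, 0.5)
          for a, b in zip(existing_signal, movie_description)]

# "More data" that is only a noisy copy of what we already have.
redundant_copy = [a + rng.gauss(0, 0.3) for a in existing_signal]

r2_base = r_squared([existing_signal], rating)
r2_redundant = r_squared([existing_signal, redundant_copy], rating)
r2_new = r_squared([existing_signal, movie_description], rating)

print(f"baseline:                  R^2 = {r2_base:.3f}")
print(f"+ correlated variable:     R^2 = {r2_redundant:.3f}")
print(f"+ proxy for new construct: R^2 = {r2_new:.3f}")
```

Under these assumptions, the correlated variable barely moves the fit, while the proxy for the missing construct explains a large share of what the baseline had left in the error term. The numbers are in-sample and the setup is deliberately simple, so treat it as an illustration of the argument rather than evidence for it.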
