Tuesday, March 4, 2008

More on data quality

I was speaking with someone about ways to assess data quality for predictive modeling. I have written before about tactics for ensuring quality data, but here is a little framework you can use when thinking about data quality. Your data needs to be accurate, granular, and complete.

Accuracy of data: Does the data accurately capture the attribute (e.g., income) it was intended to?

Granularity of data: In every case, the more granular the data, the better your predictive modeling can be. You can also usually roll granular data up to a higher level if you need to (from individual to household, household to ZIP+4, etc.) for analytic or appending purposes.

Completeness of data: Any given dataset is going to have missing data. Missing data is a funny thing. Of the three, accuracy, granularity, and completeness, the latter is the one you can most influence. Obviously, the less missing data, the better. But before choosing an appropriate remediation method, you need to understand why the data is missing. If the problem is that the data feeds are broken, you are going to need to get the feeds fixed. If the data simply does not exist but you have enough coverage to do some predictive modeling, you can predict the values of the missing data. Or you may just need to fill the missing values with the mean or median.
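To make the completeness point concrete, here is a minimal sketch in Python (pandas and scikit-learn, which the post doesn't specify; the file and column names are hypothetical) of the two remediation paths described above: a simple median fill, and predicting the missing values from fields you do have.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer file; assume 'income' has gaps.
df = pd.read_csv("customers.csv")

# First, understand how bad the problem is.
missing_rate = df["income"].isna().mean()
print(f"income is missing for {missing_rate:.1%} of records")

if missing_rate < 0.05:
    # Low missing rate: a simple median fill is usually tolerable.
    df["income"] = df["income"].fillna(df["income"].median())
else:
    # Higher missing rate: predict income from other fields.
    # Assumes these hypothetical predictor columns are themselves complete.
    predictors = ["age", "home_value"]
    known = df[df["income"].notna()]
    unknown = df[df["income"].isna()]

    model = LinearRegression()
    model.fit(known[predictors], known["income"])
    df.loc[df["income"].isna(), "income"] = model.predict(unknown[predictors])
```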

2 comments:

Mark Pilipczuk said...

At what point does filling in the missing data with mean/median values start to impact analytics and modeling?

And does that point of impact change depending on whether you're doing analytics or modeling?

Unknown said...

Depends on the size of the dataset. Smaller datasets are less tolerant of inserting the mean/median. For a big dataset, I would start to get worried after the 5% mark and take a different approach at the 8-10% mark. At some point, you need to start modeling this missing data.
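To illustrate that rule of thumb (the 5% and 8-10% cutoffs come from the comment above; the Python, pandas, and file name are assumptions), a quick per-column screen might look like this:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# Flag each column against the rough thresholds from the comment.
for col in df.columns:
    rate = df[col].isna().mean()
    if rate >= 0.08:
        print(f"{col}: {rate:.1%} missing -- model the missing data")
    elif rate >= 0.05:
        print(f"{col}: {rate:.1%} missing -- mean/median fill is getting risky")
    else:
        print(f"{col}: {rate:.1%} missing -- mean/median fill is probably fine")
```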