Tuesday, August 7, 2007

Data Quality on the Cheap

I had a funny moment about 3 months ago. I was chatting with the VP of Advertising for a large retail chain and we got to talking about data quality (He was also responsible for Direct Marketing). I asked him “How do you check your data quality?” and he replied “I don’t think we do. Should I?” Yes. Yes, you should. If no one is checking regularly, your data quality is bad and your resulting decisions are going to be…well, you know.

Data tables are like cars. They need regular attention to ensure that they are performing well. If no one is checking, then your tables are out of tune. And for those of you who get your data from an outside vendor, don’t think that they are doing regular data QA. In my experience, my vendors ensure that the tables are produced, but are not tracking to see if the values in a given variable are reasonable. So, how should you check your data quality? We did some very simple things to give ourselves a pretty complete picture of the state of our data.

The first thing we did was build a historical database of some basic statistics for each variable in our bi-weekly production table. We tracked: mean, median, mode, 25th percentile, 75th percentile, standard deviation, skew, and kurtosis. We also tracked the number of 0 values and the number of missing values.
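To make that concrete, here is a rough sketch of what the snapshot step can look like in SAS. The library and dataset names (prod.households, qa.profile_current, qa.profile_history) are made up for illustration, and mode is left out of the sketch (PROC UNIVARIATE can supply it).

/* Profile every numeric variable in the production table. */
proc means data=prod.households noprint;
   var _numeric_;
   output out=qa.profile_current (drop=_type_ _freq_)
      mean= median= p25= p75= std= skewness= kurtosis= nmiss= / autoname;
run;

/* Stamp the snapshot with the production date... */
data qa.profile_current;
   set qa.profile_current;
   run_date = today();
   format run_date date9.;
run;

/* ...and append it to the running history (created on the first run). */
proc append base=qa.profile_history data=qa.profile_current force;
run;

With the autoname option, the output columns come out as variable_Mean, variable_Median, and so on, which makes the trending step later on straightforward.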

We found, straight away, that roughly five percent of our variables had large numbers of 0’s or missing values. We went back to our data provider for an explanation of why data seemed to be missing in these variables. Over the course of the next 2 weeks, they either found a problem with the data feed or with the logic used to create the variable, or gave us a satisfactory reason why the data looked like it had so many holes.
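Counting 0’s is not a standard PROC MEANS statistic, so here is one hypothetical way to get a quick read on how many zeros and missing values each variable carries: push every numeric value through a three-bucket format and let PROC FREQ do the counting (again, the dataset name is made up).

proc format;
   value zerochk  .     = 'missing'
                  0     = 'zero'
                  other = 'non-zero';
run;

/* PROC FREQ tabulates by formatted value, so each numeric variable */
/* gets a missing / zero / non-zero breakdown.                      */
proc freq data=prod.households;
   tables _numeric_ / missing;
   format _numeric_ zerochk.;
run;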

Our next step was to look at variables that were not stable over time. Our dataset included all US households; the variables associated with the household don’t vary much in the aggregate. We focused on calculating the means, medians, and standard deviations over time for each variable (the other metrics, skew, kurtosis, mode, etc., did not add any value over the basics). I think we went back 12 weeks (or 6 periods).
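If you keep the snapshot history from the sketch above, the trending is mostly a matter of reshaping that table so there is one row per variable, statistic, and period. A rough sketch, with the same made-up dataset names:

/* One row per run_date and statistic, ready for trending. */
proc sort data=qa.profile_history;
   by run_date;
run;

proc transpose data=qa.profile_history
               out=qa.profile_long (rename=(_name_=metric col1=value));
   by run_date;
run;

proc sort data=qa.profile_long;
   by metric run_date;
run;

From there, a quick plot of value against run_date for a given metric (PROC GPLOT, or whatever plotting tool you have handy) usually makes the drifters obvious.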

I was frankly shocked at how easy it was to find “suspect” variables! If you plot the values over time, suspects just jump out at you. We had some variables (I want to say 10% of the total number) whose means varied by more than 10%, period over period. There were too many variables to chase down all at once, so we identified the 20 or so variables that were the worst offenders; their means varied by more than 50%, period over period. We went through the same process with the vendor as we did with the variables with missing values: fix the variable or justify why it varied so much. Over the next 8 weeks, we steadily reduced the amount of acceptable variation, going back and speaking with the vendor, variable by variable. This was a very valuable exercise. Our current variance threshold now hovers around 2%.
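The flagging step can be just as simple: compare each statistic to its value in the prior period and keep anything that moved by more than your tolerance. A sketch that picks up qa.profile_long from above and uses the 2% threshold:

/* Flag any statistic that moved more than 2% against the prior period. */
data qa.suspects;
   set qa.profile_long;
   by metric;
   prev = lag(value);
   if first.metric then prev = .;    /* no prior period for the first row */
   if prev not in (., 0) then pct_change = abs(value - prev) / abs(prev);
   if pct_change > 0.02 then output;
run;

proc print data=qa.suspects;
run;

Tightening the tolerance over time is then just a matter of lowering that 0.02.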

In the last part of the project, we made some process changes. First, I had a conversation with the vendor and offered them the SAS code we were using for data QA. They accepted immediately. They wanted to do the QA themselves, before we found a problem. We keep checking, but now the vendor can get ahead of the game and provide even better service. We review our data QA checks bi-weekly at my staff meeting. Typically, the person responsible now says “nothing to report.” In addition, we have created good SAS code to automate the process and have just moved the QA process to India (I guess I should do a “Lessons learned in off-shoring” piece). All in all, our ongoing QA process is relatively painless.

Was this the most bulletproof data QA process we could have put in? No. We are relying on changes in distribution to catch bad data. Some variables may be of poor quality and, because they have not varied much, it is possible that we may never catch them. I don’t think this is likely. Each variable is used in some project or another on a pretty regular basis. Once a variable makes it into a project, its quality is checked extensively. We have not found a new suspect variable this way yet, but you never know. I can say, pretty authoritatively, that our data quality has gotten much better.

2 comments:

John Aitchison said...

Nice to see that someone is taking data quality seriously, and has an ongoing "intelligent supervisory statistical app" type of approach.

Data cleansing and monitoring is, quite frankly, boring. It takes you away from the fun stuff. I wrote a bit here
http://dsanalytics.com/dsblog/%e2%80%9chorizontal-data-dancing%e2%80%9d-record-parsing-%e2%80%93-the-first-stage_63
about how one could do some of this with "intelligent objects", but you can still get caught. Worse, it is hard to estimate how much time will be involved in the cleanup after you discover a data error.

Over the last several weeks I have been working with the Australian Census data. Perhaps because I have different requirements from them, I have discovered several anomalies, unexpected aggregations, and inconsistent data formats: the lesson being that you simply MUST have some sort of automation in place to test your assumptions against the reality of the data in each and every single case. Only that way will you discover the anomalies, which may turn out to be real data errors, or errors in your assumptions.

Unknown said...

Thanks for the comments, John. That was a great first comment for the site. I think the US census is also messy. We use a 3rd-party data vendor that cleans it up for analytic use.

We went through a format clean-up period as well, but it was a fix-it-once kind of thing.