Tuesday, March 4, 2008

More on data quality

I was speaking with someone about ways to assess data quality for predictive modeling. I have written before about tactics for ensuring quality data, but here is a little framework you can use when thinking about data quality. Your data needs to be accurate, granular, and complete.

Accuracy of data: Does the data accurately capture the attribute (e.g., income) it was intended to?

Granularity of data: In every case, the more granular the data, the better your predictive modeling can be. You can also usually roll granular data up to a higher level if you need to (from individual to household, household to ZIP+4, etc.) for analytic or appending purposes.

Completeness of data: Any given dataset is going to have missing data. Missing data is a funny thing. Of the three, accuracy, granularity, and completeness, the latter is the one you can most influence. Obviously, the less missing data, the better. But before choosing an appropriate remediation method, you need to understand why the data is missing. If the problem is that the data feeds are broken, you are going to need to get the feeds fixed. If the data simply does not exist but you have enough coverage to do some predictive modeling, you can predict the values of the missing data. Or you may just need to fill the missing values with the mean or median.
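To make the completeness point concrete, here is a minimal sketch in Python (pandas and scikit-learn, which the post doesn't specify; the file and column names are hypothetical) of the two remediation paths described above: a simple median fill, and predicting the missing values from fields you do have.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer file; assume 'income' has gaps.
df = pd.read_csv("customers.csv")

# First, understand how bad the problem is.
missing_rate = df["income"].isna().mean()
print(f"income is missing for {missing_rate:.1%} of records")

if missing_rate < 0.05:
    # Low missing rate: a simple median fill is usually tolerable.
    df["income"] = df["income"].fillna(df["income"].median())
else:
    # Higher missing rate: predict income from other fields.
    # Assumes these hypothetical predictor columns are themselves complete.
    predictors = ["age", "home_value"]
    known = df[df["income"].notna()]
    unknown = df[df["income"].isna()]

    model = LinearRegression()
    model.fit(known[predictors], known["income"])
    df.loc[df["income"].isna(), "income"] = model.predict(unknown[predictors])
```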

2 comments:

Mark Pilipczuk said...

At what point does filling in the missing data with mean/median values start to impact analytics and modeling?

And does that point of impact change depending on whether you're doing analytics or modeling?

Unknown said...

Depends on the size of the dataset. Smaller datasets are less tolerant of inserting the mean/median. For a big dataset, I would start to get worried after the 5% mark and take a different approach at the 8-10% mark. At some point, you need to start modeling this missing data.
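To illustrate that rule of thumb (the 5% and 8-10% cutoffs come from the comment above; the Python, pandas, and file name are assumptions), a quick per-column screen might look like this:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# Flag each column against the rough thresholds from the comment.
for col in df.columns:
    rate = df[col].isna().mean()
    if rate >= 0.08:
        print(f"{col}: {rate:.1%} missing -- model the missing data")
    elif rate >= 0.05:
        print(f"{col}: {rate:.1%} missing -- mean/median fill is getting risky")
    else:
        print(f"{col}: {rate:.1%} missing -- mean/median fill is probably fine")
```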