Monday, September 24, 2007

The Analytic Value Chain - Locate relevant data

Related posts
The Analytic Value Chain introduced
Defining the problem
Determine Data Requirements

This post overlaps with the previous Value Chain post on determining data requirements, but is meant to be a bit more tactical. If you have followed my advice from the previous post, you have a good sense of what data you are going to need. Now you need to find it. In most organizations I have worked with, the data to solve any given business problem exists. The challenge is that the data often lives in a place that is not accessible to most of the folks in the organization. The data may not be in a production environment (it is sitting on a server under someone's desk), and even if it is in production, the data warehouse might be so large that no one really knows what is in there. (This is not an uncommon problem in physical warehouses either. An expert in physical warehousing once told me that a really good warehouse knows where a specific pallet can be found 80% of the time.) I once had both problems at the same time: I found two data sets that, when combined, answered a critical business question, but they were running on different desktops in two different parts of my client's organization. I found the data by luck, but wound up doing a very impactful analysis. My value-add was carrying the data sets, on floppy disks, back to my PC. Obviously, you can't analyze data that you can't find.

So, what to do? I don't have a ton of advice here, but I have found some things that work pretty well in identifying the data that is out there and making it accessible. First, treasure the staff who really know your data infrastructure, because they have hard-fought knowledge. (For those in my organization: you know who you are, and you know how much I value your contributions. You also know that I am understating.) There is no replacement for experience with your data infrastructure. Even though we rely on people power, metadata helps. And even the best staff are not going to be able to find useful data if you don't have data dictionaries for every dataset.
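To make the data-dictionary point concrete, here is a minimal sketch of what one entry per column might capture. All table names, column names, and source systems below are hypothetical illustrations, not anything from a real system.

```python
# A minimal data-dictionary sketch: one entry per column, recording the
# metadata an analyst needs to judge whether a dataset is usable.
# All names here are invented for illustration.

customer_monthly = {
    "customer_id": {
        "type": "string",
        "description": "Unique customer identifier",
        "source": "CRM extract",
    },
    "monthly_spend": {
        "type": "float",
        "description": "Total spend in the calendar month, USD",
        "source": "billing system",
    },
    "tenure_months": {
        "type": "int",
        "description": "Months since first purchase",
        "source": "derived",
    },
}

def describe(dictionary):
    """Render the data dictionary as readable text, one line per column."""
    lines = []
    for column, meta in dictionary.items():
        lines.append(
            f"{column} ({meta['type']}): {meta['description']} "
            f"[source: {meta['source']}]"
        )
    return "\n".join(lines)

print(describe(customer_monthly))
```

Even a lightweight format like this beats nothing: the point is that someone who did not build the dataset can decide in seconds whether it is relevant.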

Second, create tables that aggregate your most useful data. We do this and can get our hands on useful data sets pretty quickly, usually in minutes. We evaluate the variables in those tables about once a year (or after any major strategy shift) to ensure that the data set maintains its usefulness. These tables have an additional value in that they can be shared with your entire organization and form a common "fact base" for the organization.
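The aggregate-table idea can be sketched with SQLite from the Python standard library. The table and column names here are invented; the point is precomputing one customer-level summary from raw transactions so analysts can pull the most-used measures without touching the raw data.

```python
# Sketch of an "aggregate table" using SQLite (stdlib). Table and column
# names are hypothetical. One row per customer becomes the shared fact base.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id TEXT, amount REAL, channel TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("c1", 10.0, "web"), ("c1", 25.0, "store"), ("c2", 5.0, "web")],
)

# Precompute the measures analysts ask for most often.
conn.execute("""
    CREATE TABLE customer_summary AS
    SELECT customer_id,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total_spend,
           AVG(amount) AS avg_order
    FROM transactions
    GROUP BY customer_id
""")

for row in conn.execute("SELECT * FROM customer_summary ORDER BY customer_id"):
    print(row)
```

In a real warehouse the same pattern would be a scheduled job that rebuilds the summary table nightly; the annual variable review the post describes is just editing that SELECT list.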

Third, try to think ahead and ask your team to be on the lookout for certain types of data. There are business questions that I know I am going to want to look at, and by communicating my topics of interest to the team, I enable them to make serendipitous discoveries.

Currently using Google Presentations and Spreadsheets

What pieces of crap. The flaws are too numerous to list. I have confidence that they'll get better over time, but for now, unusable. Back to Office.

It is too bad; this is the first time I have not liked a Google product.

Wednesday, September 19, 2007

The Analytic Value Chain - Determine Data Requirements

Related posts
The Analytic Value Chain introduced
Defining the problem

So, once you have a good understanding of the problem, you should probably grab every piece of data you can find, transform the variables every way you can think of, and start analyzing to figure out which variables impact your dependent variable, right?

This way lies madness and spurious results. It is the approach a naive data analyst would take (experts in data mining might have a different take, but I was trained as a research scientist and I don't like the kitchen-sink approach). Before you start grabbing data, you should spend some time thinking about the behavior you are trying to predict and what phenomena might drive that behavior. Let me go a step further: I recommend that you generate hypotheses about what relationships you expect to find. Write 'em down. I go so far as to create something I call a conceptual map that graphically shows how I think all of the variables, including the dependent variable, impact each other. Once I see the relationships, I can quickly generate hypotheses. Maybe you are thinking: what does this have to do with determining your data requirements? Hang on, I am getting to it.

Once you have the hypotheses, then you can start thinking about what data you need to test them. In my world, it is important to have a good idea of what kinds of analyses we are going to be doing because most of the time we have access to many more variables than we can possibly look at. We can't go on a fishing expedition. Some other reasons to think hard about your data requirements:

  • In rapidly changing businesses, you'll spend more time finding data than analyzing it. So parsimony is key: you don't want to spend time getting data that is not going to be useful.
  • Most large enterprises have real restrictions on who can access specific types of data. So, even if you find the data, you'll have to figure out who owns it and get permission to use it.
  • Merging and analyzing larger datasets takes processor time. The smaller the datasets, the less time the analysts spend waiting for results.
  • The world is complicated. The more variables you have, the greater the chance of finding a spurious correlation. Also, if you have a good set of hypotheses and a conceptual map, you'll have a better sense of whether a relationship makes sense.
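The spurious-correlation point in the list above is easy to demonstrate with a quick simulation: screen enough variables of pure noise against a random target and a healthy number will clear a conventional significance bar. The cutoff of 0.28 below is my rough approximation of the p < .05 threshold for n = 50; the numbers are illustrative only.

```python
# Simulating the "spurious correlation" point: the more candidate variables
# you screen, the more pure noise clears any fixed significance bar.
import math
import random

random.seed(42)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

n_obs = 50
target = [random.gauss(0, 1) for _ in range(n_obs)]

# Screen 1000 variables of pure noise against the target.
hits = 0
for _ in range(1000):
    noise = [random.gauss(0, 1) for _ in range(n_obs)]
    if abs(pearson(target, noise)) > 0.28:  # roughly p < .05 for n = 50
        hits += 1

print(f"{hits} of 1000 random variables look 'significant'")
```

You should see on the order of 50 "significant" variables, none of which mean anything. A conceptual map is exactly the defense: it tells you which of those hits are even plausible.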

In sum, planning for your data needs, though it takes time up front, saves you time in the end.

Friday, September 14, 2007

SAS or SPSS

Someone asked me the other day whether I preferred SAS or SPSS. As with so many other things, it depends. Here is a very good discussion of the differences between the two packages. From my perspective, you only go with SAS if you need its functionality, typically required if you are conducting very advanced statistics or handling very large data sets. Also, its support is very good (which you need, because the product is complicated), and it is a very complete product suite (think end-to-end). SPSS is much more user friendly and produces better-looking output that can easily be imported into Office applications. But its product suite is a little more limited. They also have solutions, but they are more of a research-focused company. So, for my everyday use I like SPSS, but I like to have access to SAS for when I need the big guns. If you would like to talk about the differences, feel free to drop me a line at ken.rona@gmail.com. I am happy to offer guidance.

Thursday, September 13, 2007

Principles of Innovation

Over the past couple of years, I have been thinking about innovation. Specifically, what I do (and should do) with my teams to encourage innovation. Someone brought to my attention a video from Google's Vice President of Search Products & User Experience, Marissa Mayer.



Marissa talks about nine guiding principles that she believes are key enablers of innovation in her business. I would not argue that we all should adopt her principles, just that it was an interesting discussion and that having such a set of guiding principles is probably worth thinking about. Mine are (somewhat redundant with Marissa's):

1. Transparency is your friend.
2. Don't be afraid to iterate and experiment. Always test. Without it, you will not learn.
3. Noble failure is ok. The corollary is that stupid failure is not ok.
4. Share the credit. There is plenty to go around, and you will likely have more than one good idea in your life.
5. Try to work with people you would be honored to hang out with, and recruit them when you find them! I am fortunate in this regard.
6. Don't have an opinion. Have facts.
7. You can build a business on users, but you need to run it on money. So while you are innovating, think about how you are going to make money with the idea.
8. Be brave (but not disrespectful) in the presence of senior executives. That is how most of them got to their positions.
9. Encourage dissent in your organization.
10. Be dissatisfied, but don't be a jerk. Dissatisfaction makes you want to make things better.

Sunday, September 9, 2007

The Analytic Value Chain - Determine data requirements

Defining the analytic value chain post

This year, I have had three conversations with vendors trying to sell my company some piece of third-party data. We currently have well over a thousand variables available to us for modeling and insight, before transformations; our problem is not a lack of data. Our problem is that we have too much of it. On a recent data-mining project, we included over 6,000 variables in the dataset. So why do I even mention determining data requirements as part of the analytic value chain? If you can include every variable under the sun, why not just grab 'em all? That way, you don't have to make any hard choices about what to include and what to leave out of the analysis.

I wish that things were that simple. For well understood problems,

Thursday, September 6, 2007

The Analytic Value Chain - Defining the Problem

Related posts
The Analytic Value Chain introduced

One mistake analysts make when starting analytic work is to jump right in with the data. I have had staff waste weeks of time learning that they (and I) misunderstood the problem our business partners were asking us to solve. I am not proud of this, but it is a lesson learned. What have we done to mitigate the problem? We try to have kickoff meetings for every large-scale analysis so that the person doing the analysis (and their manager) can get a good bead on the problem the business is trying to solve. In these conversations, we spend a lot of time trying to understand why our internal clients think something is a problem, try to get a sense of root causes (which drive our data needs and analytic choices), and get agreement on the deliverable (as well as the timing). We also try to set some ground rules around changing the problem midstream: once we kick off an analysis, any change in the request resets the timeline that we provided. Another best practice here is for the manager to check in with the analyst a week or so after the analyst starts the work.

So, the first way that an analytic staff can add value is to help their internal clients define the problem.

Tuesday, September 4, 2007

Using segmentation in the real world

(note: This post might require a little bit of analytic expertise to understand. Let's see how it goes. I may need to write a post explaining clustering...)

I am always looking for creative uses of analytics in business, and I saw something interesting in Borders about a week ago; I noticed they were making product recommendations. Not on their web site, but in their stores. In effect, they have taken a page out of Amazon's playbook and moved product recommendations into the real world.

So what did I see? If you go into a Borders, you'll notice a couple of new aspects of how they are merchandising their books. First, they are running a 3-books-for-the-price-of-2 promotion, with about 20 books to choose from. Second, they have a section with a series of (maybe 6) pairs of books, side by side. The book on the right is labeled "If you liked this book" and the book on the left is labeled "You might like this book." I would bet that both of these merchandising efforts are being driven by someone who is mining Borders' purchase data. For each case, let me tell you what I think they are doing.

3/2

In the first example, the 3-for-2 promotion: they have figured out what set of books would be interesting to some targeted (though fairly large) set of customers and are both selling and cross-selling those books at the same time. The tricky part is getting a large enough set of books so that people can find three they would want to buy as a package. Amazon gets around this by showing you something like 20 items for cross-sell on any given page (I went to the latest Harry Potter book and counted 27 cross-selling opportunities). Borders probably ran a cluster analysis on customers' book purchases. A cluster analysis is a way of grouping like things together; in this case, they were probably grouping the types of books purchased by homogeneous groups of customers. The goal of the clustering exercise is to find sets of books purchased by groups of customers and to figure out how many people are in each group (to give you market sizing). They don't even need to know anything about the customers except that they like a certain set of books. They may be doing other kinds of optimization to ensure that the promotion is profitable, but let's table that for now. So the output would be a series of lists of 10-20 books, each of interest to a specific customer type.

<a quick aside>The number of books (10-20) is just illustrative. It could be more or less. In this case, the number of books is driven by a business need: obviously, you can't have a 3-for-2 promotion with only 2 books, but you also could not reasonably display 100 books. This kind of segmentation must be closely linked to business needs. For those who have heard me on my segmentation soapbox...well, let's just say some folks are tired of hearing me talk about the limits of segmentation.</a quick aside>

For a pilot, they probably picked the largest segments (for example, foodies and history buffs), pulled the books of interest to those segments, and got the books out on the floor. My guess is that they actually have the books for a couple of different segments on display. So, say there are 20 books eligible for the promotion: I found 3 books I wanted to read, but would have had a hard time coming up with another set of three. It will be interesting to see whether they refresh the merchandising with new books and whether they create different segments for each store.
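The clustering I am speculating about can be sketched in miniature: represent each customer as a vector of purchase indicators, cluster the vectors, and read each cluster's book list and head count off the result. Everything below is invented toy data and a deliberately tiny k-means; a real analysis would use a proper statistics package and far more data.

```python
# Toy sketch of clustering purchase data to build segment book lists.
# All customers and book titles are invented for illustration.
import random

random.seed(0)

books = ["cookbook", "wine_guide", "ww2_history", "civil_war", "napoleon"]
# One purchase vector per customer (1 = bought that book).
customers = [
    [1, 1, 0, 0, 0],  # looks like a foodie
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],  # looks like a history buff
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
]

def dist2(a, b):
    """Squared Euclidean distance between two purchase vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20):
    """Very small k-means: assign rows to nearest center, recompute means."""
    centers = random.sample(rows, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for row in rows:
            best = min(range(k), key=lambda i: dist2(row, centers[i]))
            groups[best].append(row)
        for i, grp in enumerate(groups):
            if grp:
                centers[i] = [sum(col) / len(grp) for col in zip(*grp)]
    return centers, groups

centers, groups = kmeans(customers, k=2)
for i, (center, grp) in enumerate(zip(centers, groups)):
    # Books bought by a majority of the cluster become its promotion list;
    # len(grp) is the market sizing for that segment.
    picks = [b for b, rate in zip(books, center) if rate > 0.5]
    print(f"cluster {i}: {len(grp)} customers, promote {picks}")
```

Note that the customer labels never enter the analysis, which matches the point above: you only need to know that a group of people likes a certain set of books.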

Pairwise

The pairwise book recommendations are a pretty straightforward analytic exercise. I would have done the analysis by category (Sci-Fi, Business, Romance, etc.) to see which of the most popular books had been purchased by people who had also purchased books with relatively few sales (and maybe high profit margins). My first thought is that a bivariate correlational analysis would be a good first stab at this. In effect, the most popular book is recommending the weaker-selling book. I have seen local bookstores do something similar by having staff recommendations in each section. I like the Borders approach better: it is data driven.
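The bivariate idea above reduces to: for a given bestseller, correlate its per-customer purchase indicator with each weak seller's, and pair it with the strongest companion. The purchase data and book names below are invented for illustration; Borders' actual method is unknown to me.

```python
# Sketch of pairing a bestseller with its most correlated weak seller.
# All purchase data and titles are hypothetical.
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Per-customer purchase indicators (1 = bought) within one category.
bestseller = [1, 1, 0, 1, 1, 0, 1, 1]
long_tail = {
    "obscure_scifi_a": [1, 1, 0, 1, 0, 0, 1, 1],  # overlaps bestseller fans
    "obscure_scifi_b": [0, 0, 1, 0, 0, 1, 0, 0],  # bought by everyone else
}

# "If you liked this book" -> the weak seller most correlated with it.
recommendation = max(long_tail, key=lambda b: pearson(bestseller, long_tail[b]))
print(recommendation)
```

With binary purchase indicators, Pearson correlation is essentially the phi coefficient, so this is a reasonable first stab; a real system would also screen on margin and inventory, as suggested above.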

While I have suggested some possible ways they may have developed these promotions, there are other options. Also, I assume they are testing the heck out of this new merchandising approach. All in all, very clever.