In the last few months I’ve been involved in nearly 20 data mining projects done by student teams at ISB, as part of an MBA-level course and an executive education program. All projects relied on real data. One of the data sources was transactional data from a large regional hypermarket. While the topics of the projects ranged across a large spectrum of business goals and opportunities for retail, one point in particular struck me as repeating across many projects and in many face-to-face discussions: the use of secondary data (data that were already collected for some other purpose) for making decisions and deriving insights regarding future interventions.
By intervention I mean any action. In a marketing context, we can think of personalized coupons, advertising, customer care, etc.
In particular, many teams defined a data mining problem that would help them determine appropriate target marketing. For example, predict whether a customer's next shopping trip will include dairy products, and then use the prediction to offer appropriate promotions. Another example: predict whether a relatively new customer will be a high-value customer at the end of a year (as defined by some metric related to the customer’s spending or shopping behavior), and use the prediction to target that customer for a “white glove” service. In other words, build a predictive model for deciding who, when and what to offer. While this approach seemed natural to many students and professionals, there are two major sticking points:
- We cannot properly evaluate the performance of the model in terms of actual business impact without post-intervention data. The reason is that without historical data on a similar intervention, we cannot evaluate how the targeted intervention will perform. For instance, while we can use a large existing transactional database to predict who is most likely to purchase dairy products, we cannot tell whether those customers would redeem a coupon targeted at them unless we have data collected after a similar coupon campaign.
- We cannot build a predictive model that is optimized for the intervention goal unless we have post-intervention data. For example, if coupon redemption is the intervention performance metric, we cannot build a predictive model that optimizes coupon redemption unless we have data on coupon redemption.
A predictive model is trained on past data. To build a model that optimizes the intervention goal, and to evaluate model performance in light of that goal, we must have some post-intervention data. A pilot study/period is therefore a good way to start: deploy the intervention either to a random sample or to the sample that a predictive model indicates is optimal in some way (it is best to do both: deploy to a sample that includes both randomly chosen and model-chosen customers). Once you have the post-intervention data on the results, you can build a predictive model to optimize results for a future, larger-scale intervention.
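To make the pilot idea concrete, here is a minimal sketch of how such a mixed sample might be assembled and compared. It is only an illustration under my own assumptions: the function names `assign_pilot` and `redemption_rate`, the dictionary of model scores, and the choice to draw the random group first and then take the top-scored customers from whoever remains are all hypothetical, not something prescribed by the projects described above.

```python
import random


def assign_pilot(scores, n_random, n_model, seed=0):
    """Split customers into a random pilot group and a model-chosen pilot group.

    scores: dict mapping customer id -> predicted score from an existing model
            (hypothetical input; any "higher is better" score would do).
    """
    if n_random + n_model > len(scores):
        raise ValueError("pilot groups exceed the number of customers")
    rng = random.Random(seed)
    customers = sorted(scores)  # fixed order so a given seed is reproducible
    # Assumption: draw the random group from all customers first.
    random_group = rng.sample(customers, n_random)
    chosen = set(random_group)
    remaining = [c for c in customers if c not in chosen]
    # Then take the highest-scored customers among those not already chosen.
    remaining.sort(key=lambda c: scores[c], reverse=True)
    model_group = remaining[:n_model]
    return {"random": random_group, "model": model_group}


def redemption_rate(group, redeemed):
    """Share of a pilot group that redeemed the coupon (post-intervention data)."""
    if not group:
        return 0.0
    return sum(1 for c in group if c in redeemed) / len(group)
```

After the pilot, comparing `redemption_rate` for the two groups gives a rough indication of whether model-based targeting does better than random targeting, and the random group supplies redemption data that was not filtered through the existing model's choices.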