OK, I admit it: I did peek over the shoulder of my fellow Metro rider last night (while returning from teaching Classification Trees) to get a better look at her Wall Street Journal’s front page. The article that caught my eye was “Democrats, Playing Catch-Up, Tap Database to Woo Potential Voters”. I only managed to catch the first few paragraphs before the newspaper’s owner flipped to the next page.
Luckily, my student Michael Melcer just emailed me the complete article. He put it very nicely:
Thought you might find this article interesting. Sounds like politicians are using regression with a binary y to predict who is likely to vote for a party member.
So what exactly are we reading about? Apparently, a direct marketing application used to capture the people who are most likely to vote Democratic. “The technique aims to identify potential supporters by collecting and analyzing the unprecedented amount of information now readily available — from census data to credit-card bills — to profile individual voters.”
In classic direct marketing, companies use information such as demographics and the customer’s history with the company (e.g., number, dates, and amounts of purchases) as predictor variables. They use data from a pilot study or from previous campaigns to collect information on the outcome variable of interest: did the customer respond to the marketing effort (e.g., to a credit card solicitation)? Combining the predictor and outcome information, they build a model that predicts the probability of responding to the marketing effort, based on the predictor information. In the voting solicitation context, the company “developed mathematic formulas based on such factors as length of residence, amount of money spent on golf, voting patterns in recent elections and a handful of other variables to calculate the likelihood that a particular American will vote Democratic.”
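To make the "predictors plus past outcomes give a response probability" idea concrete, here is a minimal sketch in plain Python. Everything in it is my own assumption for illustration: the two predictors (`purchases`, `tenure`), the simulated data, and the choice of logistic regression fitted by gradient descent. We do not know what model the company in the article actually uses.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the predictor coefficients.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=300):
    # Batch gradient descent on the average log-loss.
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Hypothetical "previous campaign": standardized predictors and a 0/1 response.
rng = random.Random(0)
X, y = [], []
for _ in range(500):
    purchases = rng.gauss(0, 1)  # made-up predictor: past purchase activity
    tenure = rng.gauss(0, 1)     # made-up predictor: length of relationship
    true_p = sigmoid(-1.0 + 1.5 * purchases + 0.5 * tenure)
    X.append([purchases, tenure])
    y.append(1 if rng.random() < true_p else 0)

weights = fit_logistic(X, y)

# Score a new (hypothetical) customer with high purchase activity.
print(predict_proba(weights, [2.0, 0.0]))
```

The output of interest is the predicted probability per person, not a yes/no label. Those probabilities are what get sorted when deciding whom to solicit.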
Since the final goal is to choose (“microtarget”, in the WSJ’s words) the people who are most likely to vote Democratic and solicit them to vote, the model does not necessarily need to classify as many people as possible correctly (= model accuracy). Rather, it needs to rank people correctly, so that those most likely to vote Democratic come out on top. This is an important distinction: a model can have very low accuracy and still be excellent at capturing the top 10% of Democratic voters. The main tool for assessing such performance is the lift chart (AKA gains chart).
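The accuracy-versus-ranking distinction is easy to see on a toy example. The sketch below is a bare-bones version of the numbers behind a gains/lift chart, under my own assumptions: the function name `cumulative_gains`, the 0.5 cutoff, and the 20 made-up voters are all hypothetical, and ties in the scores are not treated specially.

```python
def cumulative_gains(scores, labels, n_bins=10):
    # Sort people from highest to lowest predicted score, then report, for
    # each cumulative slice (top 10%, top 20%, ...), what fraction of all
    # positives was captured and the lift over random targeting.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    total_pos = sum(labels)
    rows = []
    for b in range(1, n_bins + 1):
        k = round(n * b / n_bins)
        captured = sum(labels[i] for i in order[:k])
        frac_targeted = k / n
        frac_captured = captured / total_pos
        rows.append((frac_targeted, frac_captured, frac_captured / frac_targeted))
    return rows


# Made-up data: 20 voters, 4 of whom actually vote Democratic (label 1).
labels = [1, 1, 1, 1] + [0] * 16
# A badly calibrated model: every score is above 0.5, but the ordering is perfect.
scores = [0.90, 0.85, 0.80, 0.75] + [0.70 - 0.01 * i for i in range(16)]

# Accuracy with a 0.5 cutoff: everyone is predicted "Democratic".
predicted = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(p == l for p, l in zip(predicted, labels)) / len(labels)
print("accuracy:", accuracy)

for frac_targeted, frac_captured, lift in cumulative_gains(scores, labels)[:3]:
    print(frac_targeted, frac_captured, lift)
```

On this toy data the classifier is right for only 4 of the 20 people, yet contacting the top 20% by score reaches all four Democratic voters, five times what random targeting would give. Plotting the fraction captured against the fraction targeted gives the gains curve.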
So Michael, whether it is a logistic regression model, a classification tree, a neural network, or any other classification method, we will not know. But the term “mathematical formulas” does hint at statistically oriented models such as logistic regression rather than machine-learning methods. I suppose we should get our hands on such “publicly available datasets” and see what gives good lift!