Yesterday I happened to hear talks by two excellent speakers, both on major data mining applications in industry. One common theme was that both speakers gave compelling and easy to grasp examples of what data mining algorithms and statistics can do beyond human intelligence, and how the two relate.
- Data analytics methods are designed not only to give an answer, but also to evaluate how confident they are about the answer. In answering the jeopardy questions, the data mining approach tells you not only what is the most likely answer, but also how confident you are about that answer.
- Building trust in an analytics tool occurs when you see it make mistakes and learn from those mistakes.
The second talk, “The Art and Science of Matching Items to Users” was given by Deepak Agarwal , a Yahoo! principle research scientist and fellow statistician, was webcasted at ISB’s seminar series. You can still catch it on Aug 10 at Yahoo!’s Big Thinker Series in Bangalore. The talk was about recommender systems and their use within Yahoo!. Among various approaches used by Yahoo! to improve recommendations, Deepak described a main idea for improving the customization of news item displays on news.yahoo.com.
On the relation between human intelligence and automation, the process of choosing which items to display on Yahoo! is a two-step process, where first human editors create a pool of potential interesting news items, and then automated machine-learning algorithms choose which individual items to display from that pool.
Like Christer Johnson’s point #2, Deepak illustrated the difference between “the answer” (what we statisticians call a point estimate) and “the potential of it being good” (what we call the confidence in the estimate, AKA variability) in a very cool way: Consider two news items of which one will be displayed to a user. The first item was already shown to 100 users and 2 users clicked on links from that page. The second was shown to 10,000 users and 250 users clicked on links. Which news item should you show to maximize clicks? (yes, this is about ad revenues…) Although the first item has a lower click-through-rate (2%), it is also less certain, in the sense that it is based on less data than item 2. Hence, it is potentially good. He then took this one step further: Combine the two! “Exploit what is known to be good, explore what is potentially good”.
So what do we have here? Very practical and clear examples of why we care about variance, the weakness of point estimates, and expanding the notion of diversification to combining certain good results with uncertain not-that-good results.