Advancing science vs. compromising privacy

Data mining is often associated with malicious organizations that violate individuals’ privacy. Three days ago, this tension rose a notch (at least in my eyes): Netflix decided to cancel the second round of the famous Netflix Prize. The reason is given in the New York Times article “Netflix Cancels Contest After Concerns Are Raised About Privacy”. Researchers from the University of Texas have shown that the data disclosed by Netflix in the first contest could be used to identify users. One woman sued Netflix. The Federal Trade Commission got involved, and the rest is history. What’s … Continue reading Advancing science vs. compromising privacy

Are experiments always better?

This continues my “To Explain or To Predict?” argument (in brief: statistical models aimed at causal explanation will not necessarily be good predictors). Now I move to a very early stage in the study design: how should we collect data? A well-known notion is that experiments are preferable to observational studies. The main difference between experimental studies and observational studies is one of control. In experiments, the researcher can deliberately choose “treatments”, control the assignment of subjects to those “treatments”, and then measure the outcome. In observational studies, by contrast, the researcher can only observe the subjects … Continue reading Are experiments always better?

Data Mining Cup 2008 releases data today

Although the call for this competition has been out for a while on KDnuggets.com, today is the day when the data and the task description are released. This data mining competition is aimed at students. The prizes might not sound that attractive to students (“participation in the KDD 2008, the world’s largest international conference for ‘Knowledge Discovery and Data Mining’ (August 24-27, 2008 in Las Vegas)”), so I’d say the real prize is cracking the problem and winning! An interesting related story that I recently heard from Chris Volinsky from the BellKor team (which is currently in first place) … Continue reading Data Mining Cup 2008 releases data today

Insights from the Netflix contest

The neat recent Wall Street Journal article “Netflix Aims to Refine Art of Picking Films” (Nov 20, 2007) was sent to me by Moshe Cohen, a dedicated former student from my data mining course. In the article, a spokesman from Netflix demystifies some of the winning techniques in the Netflix $1 million contest. OK, not really demystifying, but revealing two interesting insights: 1) Some teams joined forces by combining their predictions to obtain improved predictions (without disclosing their actual algorithms to each other). Today, for instance, the third best team on the Netflix Leaderboard is “When Gravity and Dinosaurs Unite”, which is the … Continue reading Insights from the Netflix contest