Running a data mining contest on Kaggle

Following last year’s success, I’ve decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle. They offer free competition hosting for university instructors, called InClass Kaggle. Setting up a competition on Kaggle is not trivial, and I’d like to share some tips that I discovered to help colleagues. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I’ve discovered). With some assistance from … Continue reading Running a data mining contest on Kaggle

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. A couple of years later, I am discovering the painful truth that many of these datasets are no longer available. The most likely reason is that the companies that shared the data have pulled them out. This unexpected twist is extremely harmful to both academia and industry: instructors … Continue reading An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

The world is flat? Only for US students

Learning and teaching have become a global endeavor with lots of online resources and technologies. Contests are an effective way to engage a diverse community from around the world. In the past I have written several posts about contests and competitions in data mining, statistics and more. And now about a new one. Tableau is a US-based company that sells a cool data visualization tool (there’s a free version too). The company has recently seen huge growth with lots of new adopters in industry and academia. Their “Tableau for teaching” (TfT) program is intended to assist instructors and teachers by … Continue reading The world is flat? Only for US students

Mining health-related data: How to benefit scientific research

Image from KDnuggets.com. While debates over privacy issues related to electronic health records are still ongoing, predictive analytics are beginning to be used with administrative health data (available to health insurance companies, aka “health provider networks”). One such venue is large data mining contests. Let me describe a few and then get to my point about their contribution to public health, medicine and to data mining research. The latest and grandest is the ongoing $3 million prize contest by Heritage Provider Network, which opened in 2010 and lasts 2 years. The contest’s stated goal is to create “an algorithm that … Continue reading Mining health-related data: How to benefit scientific research

Got Data?!

The American Statistical Association’s store used to sell cool T-shirts with the old-time beggar-statistician question “Got Data?” Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students’ projects. This is especially true for short courses, where students don’t have sufficient time to search and gather data (which is highly educational in itself!). One solution that I often offer is … Continue reading Got Data?!

Forecasting stock prices? The new INFORMS competition

Image from www.lumaxart.com. The 2010 INFORMS Data Mining Contest is underway. This time the goal is to predict 5-minute stock prices. That’s right – forecasting stock prices! In my view, the meta-contest is going to be the most interesting part. By meta-contest I mean looking beyond the winning result (what method, what prediction accuracy) and examining the distribution of prediction accuracies across all the contestants, how the winner is chosen, and most importantly, how the winning result will be interpreted when drawing conclusions about how predictable stocks are. Why is a stock prediction competition interesting? Because according to … Continue reading Forecasting stock prices? The new INFORMS competition

Advancing science vs. compromising privacy

Data mining often brings up the association of malicious organizations that violate individuals’ privacy. Three days ago, this tension was raised a notch (at least in my eyes): Netflix decided to cancel the second round of the famous Netflix Prize. The reason is apparent in the New York Times article “Netflix Cancels Contest After Concerns Are Raised About Privacy”. Researchers from the University of Texas have shown that the data disclosed by Netflix in the first contest could be used to identify users. One woman sued Netflix. The Federal Trade Commission got involved, and the rest is history. What’s … Continue reading Advancing science vs. compromising privacy

Data Exploration Celebration: The ENBIS 2009 Challenge

The European Network for Business and Industrial Statistics (ENBIS) has released the 2009 ENBIS Challenge. The challenge this time is to use an exploratory data analysis (EDA) tool to answer a bunch of questions regarding sales of laptop computers in London. The data on nearly 200,000 transactions include 3 files: sales data (for each computer sold, with time stamps and zipcode locations of customer and store), computer configuration information, and geographic information linking zipcodes to GIS coordinates. Participants are challenged to answer a set of 11 questions using EDA. The challenge is sponsored by JMP (by SAS), who are obviously … Continue reading Data Exploration Celebration: The ENBIS 2009 Challenge

Data Mining Cup 2008 releases data today

Although the call for this competition has been out for a while on KDnuggets.com, today is the day when the data and the task description are released. This data mining competition is aimed at students. The prizes might not sound that attractive to students (“participation in the KDD 2008, the world’s largest international conference for ‘Knowledge Discovery and Data Mining’ (August 24-27, 2008 in Las Vegas)”), so I’d say the real prize is cracking the problem and winning! An interesting related story that I recently heard from Chris Volinsky from the BellKor team (who is currently in first place) … Continue reading Data Mining Cup 2008 releases data today

Data mining competition season

Those who’ve been following my postings probably recall “competition season”, when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires the combination of data mining and creativity. Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition. The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs. Here are the similarities between the two competitions: they both provide real data and reasonably real business problems. Now to a more … Continue reading Data mining competition season