The use of dummy variables in predictive algorithms

Anyone who has taken a statistics course that covers linear regression has heard some version of the rule for pre-processing categorical predictors with more than two categories, which must be converted into binary dummy/indicator variables: “If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a ‘baseline’ and leave out its indicator.” (from Business Statistics by Sharpe, De Veaux & Velleman) Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a … Continue reading The use of dummy variables in predictive algorithms
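To make the quoted rule concrete, here is a minimal sketch in plain Python of coding a categorical variable with all k indicators versus k-1 indicators with one baseline category left out. The helper `make_dummies` and the `colors` data are hypothetical and are not taken from the post or the textbook. The sketch also shows a standard fact about the full coding: the k indicators in every row sum to 1, so together they duplicate a constant column.

```python
def make_dummies(values, drop_baseline=None):
    """Return (column_names, rows) of 0/1 indicators for a categorical variable.

    With drop_baseline=None, one indicator is created per category (k columns).
    With drop_baseline set to a category, that category's indicator is left
    out, giving k-1 columns; the baseline is then encoded as all zeros.
    """
    levels = sorted(set(values))
    if drop_baseline is not None:
        if drop_baseline not in levels:
            raise ValueError(f"unknown baseline category: {drop_baseline!r}")
        levels = [level for level in levels if level != drop_baseline]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical categorical predictor with k = 3 categories.
colors = ["red", "green", "blue", "green"]

full_names, full_rows = make_dummies(colors)                  # k columns
reduced_names, reduced_rows = make_dummies(colors, "blue")    # k-1 columns

print(full_names, full_rows)
print(reduced_names, reduced_rows)

# In the full coding every row sums to 1, so the k indicators together
# reproduce a column of ones (the same information as an intercept).
print([sum(row) for row in full_rows])
```

Which category serves as the baseline is a modeling choice; the sketch simply drops whichever one is passed in.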

Running a data mining contest on Kaggle

Following last year’s success, I’ve decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle. They offer free competition hosting for university instructors, called InClass Kaggle. Setting up a competition on Kaggle is not trivial and I’d like to share some tips that I discovered to help colleagues. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I’ve discovered). With some assistance from … Continue reading Running a data mining contest on Kaggle

The Scientific Value of Testing Predictive Performance

This week’s NY Times article Risk Calculator for Cholesterol Appears Flawed and CNN article Does calculator overstate heart attack risk? illustrate the power of evaluating the predictive performance of a model for purposes of validating the underlying theory. The NYT article describes findings by two Harvard Medical School professors, Ridker and Cook, about extreme overestimation of the 10-year risk of a heart attack or stroke when using a calculator released by the American Heart Association and the American College of Cardiology. “According to the new guidelines, if a person’s risk is above 7.5%, he or she should be put on a statin.” (CNN … Continue reading The Scientific Value of Testing Predictive Performance

A Tale of Two (Business Analytics) Courses

I have been teaching two business analytics elective MBA-level courses at ISB. One is called “Business Analytics Using Data Mining” (BADM) and the other, “Forecasting Analytics” (FCAS). Although we share the syllabi for both courses, I often receive the following question, in one variant or another: What is the difference between the two courses? The short answer is: BADM is focused on analyzing cross-sectional data, while FCAS is focused on time series data. This answer clarifies the issue to data miners and statisticians, but sometimes leaves aspiring data analytics students perplexed. So let me elaborate. What is the difference … Continue reading A Tale of Two (Business Analytics) Courses

Designing a Business Analytics program, Part 3: Structure

This post continues two earlier posts (Part 1: Intro and Part 2: Content) on Designing a Business Analytics (BA) program. This part focuses on the structure of a BA program, and especially course structure. In the program that I designed, each of the 16 courses combines on-ground sessions with online components. Importantly, the opening and closing of a course should be on-ground. The hybrid online/on-ground design is intended to accommodate participants who cannot take long periods of time off to attend campus. Yet, even in a residential program, a hybrid structure can be more effective if it is properly implemented. The … Continue reading Designing a Business Analytics program, Part 3: Structure

Designing a Business Analytics program, Part 2: Content

This post follows Part 1: Intro of Designing a Business Analytics program. In this post, I focus on the content to be covered in the program, in the form of courses and projects. The following design is based on my research of many programs, on discussions with faculty in various analytics areas, with analysts and managers at different levels, and on feedback from many past MBA students who have taken my analytics courses over the years (data mining, forecasting, visualization, statistics, etc.) and are now managing data at a broad range of companies and organizations. Content Dealing with data, little … Continue reading Designing a Business Analytics program, Part 2: Content

Designing a Business Analytics program, Part 1: Intro

I have been receiving many inquiries about programs in “Business Analytics” (BA), online and offline, in the US and outside the US. The few programs that are already out there (see an earlier post) are relatively new, so it is difficult to assess their success in producing data-savvy analysts. Rather than concentrate on the uncertainty, let me share my view and experience regarding the skill set that such programs should provide. To be practical, I will share the program that I designed for the Indian School of Business one-year certificate program in BA(*), in terms of content and structure. Both … Continue reading Designing a Business Analytics program, Part 1: Intro

Predictive relationships and A/B testing

I recently watched an interesting webinar on Seeking the Magic Optimization Metric: When Complex Relationships Between Predictors Lead You Astray by Kelly Uphoff, manager of experimental analytics at Netflix. The presenter mentioned that Netflix is a heavy user of A/B testing for experimentation, and in this talk focused on the goal of optimizing retention. In ideal A/B testing, the company would test the effect of an intervention of choice (such as displaying a promotion on their website) on retention, by assigning it to a random sample of users, and then comparing retention of the intervention group to that of a control … Continue reading Predictive relationships and A/B testing
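As a rough illustration of the comparison described in this excerpt, the sketch below computes the difference in retention between an intervention group and a control group. The function `retention_difference` and the counts are hypothetical. The two-proportion z statistic is a standard textbook addition on my part; nothing here is taken from the webinar or from Netflix's actual analysis.

```python
import math


def retention_difference(retained_treat, n_treat, retained_ctrl, n_ctrl):
    """Compare retention rates of a randomly assigned treatment and control group.

    Returns (difference in retention rates, two-proportion z statistic).
    The z statistic uses the pooled retention rate for the standard error.
    """
    rate_treat = retained_treat / n_treat
    rate_ctrl = retained_ctrl / n_ctrl
    diff = rate_treat - rate_ctrl

    pooled = (retained_treat + retained_ctrl) / (n_treat + n_ctrl)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    if std_err == 0:
        raise ValueError("retention is identical for all users; z is undefined")
    return diff, diff / std_err


# Hypothetical counts: 500 users saw the promotion, 500 did not.
diff, z = retention_difference(retained_treat=460, n_treat=500,
                               retained_ctrl=430, n_ctrl=500)
print(f"difference in retention: {diff:.3f}, z = {z:.2f}")
```

Because users are assigned at random, a difference of this kind can be read as the effect of the intervention rather than a mere association.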

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. A couple of years later, I am discovering the painful truth that many of these datasets are no longer available. The most likely reason is that the companies that shared the data have pulled them out. This unexpected twist is extremely harmful to both academia and industry: instructors … Continue reading An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

Collaborations of LaTeX and Word users

The two popular document-writing tools used by researchers in academia are LaTeX and Microsoft Word. Or, put differently: Microsoft Word and LaTeX. In more technical fields, LaTeX is king, while in less technical fields, it is Word. In the business school, the two worlds collide. Coming from a technical background, I am a heavy user of LaTeX, for research papers and even for book writing. However, many of my business school collaborators (e.g., from fields of Information Systems and Marketing) are Word users. While collaboration platforms such as Google Drive and Dropbox have greatly enhanced collaborative possibilities, including co-editing a document, the Word-or-LaTeX schism … Continue reading Collaborations of LaTeX and Word users