My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment with improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at the University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was growing fast in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! …

Data mining algorithms: how many dummies?

There are lots of posts on “k-NN for Dummies”. This one is about “Dummies for k-NN”. Categorical predictor variables are very common. Those who’ve taken a Statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps: (1) convert the categorical variable that has m categories into m binary dummy variables; (2) include only m-1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category). For example, if we have X={red, yellow, green}, in step 1 we create three dummies: D_red = …
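The two steps above can be sketched in a few lines of plain Python. This is only an illustration: the sample values, the helper name `make_dummies`, and the choice of green as the reference category are assumptions, not taken from the post.

```python
def make_dummies(values, categories):
    """Return one 0/1 indicator column per category, keyed as D_<category>."""
    return {f"D_{c}": [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical sample of the categorical predictor X
colors = ["red", "yellow", "green", "red"]
categories = ["red", "yellow", "green"]

# Step 1: create m = 3 dummies, one per category
all_dummies = make_dummies(colors, categories)

# Step 2: keep only m - 1 = 2 dummies for the regression model;
# "green" is arbitrarily chosen here as the reference category
reference = "D_green"
regression_dummies = {name: col for name, col in all_dummies.items() if name != reference}

print(all_dummies)
print(regression_dummies)
```

A record in the reference category is then represented by zeros in all of the remaining dummies.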

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I’ve had discussions with several instructors of data mining courses about a fact that is often left out of books but is quite important: the different treatment of dummy variables in different data mining methods. Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2, …, Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.) …
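The excerpt stops before the comparison with k-nearest neighbors. One way to see why the number of dummies could matter for a distance-based method is to compute Euclidean distances between categories under both encodings. The sketch below is an illustration only: the third country ("India"), the helper names, and the choice of Euclidean distance are assumptions.

```python
import math
from itertools import combinations

def encode(category, dummy_categories):
    """0/1 indicator vector for `category` over the dummy categories that are kept."""
    return [1.0 if category == c else 0.0 for c in dummy_categories]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(categories, dummy_categories):
    """Distance between every pair of categories under the given dummy encoding."""
    return {
        (a, b): euclidean(encode(a, dummy_categories), encode(b, dummy_categories))
        for a, b in combinations(categories, 2)
    }

countries = ["Japan", "USA", "India"]  # "India" is a hypothetical third category

all_m = pairwise_distances(countries, countries)            # all m dummies
m_minus_1 = pairwise_distances(countries, countries[:-1])   # "India" dropped as reference

for pair in all_m:
    print(pair, round(all_m[pair], 3), round(m_minus_1[pair], 3))
```

With all m dummies, every pair of distinct categories is the same distance apart. With m-1 dummies, the reference category ends up closer to each of the other categories than they are to one another, so the arbitrary choice of reference category changes the distances.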

The use of dummy variables in predictive algorithms

Anyone who has taken a course in statistics that covers linear regression has heard some version of the rule regarding pre-processing categorical predictors with more than two categories and the need to factor them into binary dummy/indicator variables: “If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a ‘baseline’ and leave out its indicator.” (from Business Statistics by Sharpe, De Veaux & Velleman). Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a …
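The excerpt is cut off before it gives the reason, so the sketch below shows only the standard linear-regression argument and may not match the post's wording. In a model with an intercept, the k dummy columns always add up to the intercept column, so the predictors are perfectly collinear. The level names and observations are hypothetical.

```python
levels = ["a", "b", "c"]                  # k = 3 hypothetical levels
observations = ["a", "c", "b", "b", "a"]  # hypothetical data

# One row per observation, one 0/1 column per level (all k dummies)
dummies = [[1 if obs == lv else 0 for lv in levels] for obs in observations]

# Intercept column of a regression design matrix: all ones
intercept = [1] * len(observations)

# Each observation belongs to exactly one level, so the k dummies sum to 1 in every row
full_sums = [sum(row) for row in dummies]
print(full_sums == intercept)     # True: exact linear dependence with the intercept

# Dropping one dummy (here the first level, used as the baseline) breaks the dependence
reduced_sums = [sum(row[1:]) for row in dummies]
print(reduced_sums == intercept)  # False
```

Because of this exact dependence, least squares has no unique solution when all k dummies are included alongside an intercept.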

Running a data mining contest on Kaggle

Following last year’s success, I’ve decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle. They offer free competition hosting for university instructors, called InClass Kaggle. Setting up a competition on Kaggle is not trivial, and I’d like to share some tips that I discovered to help colleagues. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I’ve discovered). With some assistance from …

What does “business analytics” mean in academia?

But what exactly does this mean? At the recent ISIS conference, I organized and moderated a panel called “Business Analytics and Big Data: How it affects Business School Research and Teaching”. The goal was to tackle the ambiguity in the terms “Business Analytics” and “Big Data” in the context of business school research and teaching. I opened with a few points: Some research b-schools are posting job ads for tenure-track faculty in “Business Analytics” (e.g., University of Maryland; Google “professor business analytics position” for plenty more). What does this mean? What is supposed to be the background of these candidates …

Flipping and virtualizing learning

Adopting new technology for teaching has been one of my passions, and luckily my students have been understanding even during glitches or choices that turn out to be ineffective (such as the mobile/Internet voting technology that I wrote about last year). My goal has been to use technology to make my courses more interactive: I use clickers for in-class polling (to start discussions and assess understanding, not for grading!); last year, after realizing that my students were constantly on Facebook, I finally opened a Facebook account and ran a closed FB group for out-of-class discussions; in my online courses on statistics.com …

The mad rush: Masters in Analytics programs

The recent trend among mainstream business schools is opening a graduate program or a concentration in Business Analytics (BA). Googling “MS Business Analytics” reveals lots of big players offering such programs. A few examples (among many others) are: Carnegie Mellon’s Heinz College, Michigan State’s Broad School of Business, NYU Stern, University of Connecticut’s School of Business, Rutgers Business School, and Drexel’s LeBow College of Business. These programs are intended (aside from making money) to bridge the knowledge gap between the “data or IT team” and the business experts. Graduates should be able to lead analytics teams in companies, identifying opportunities where …

Google Scholar — you’re not alone; Microsoft Academic Search coming up in searches

In searching for a few colleagues’ webpages I noticed a new URL popping up in the search results. It included either the prefix academic.microsoft.com or the IP address 65.54.113.26. I got curious and checked it out to discover Microsoft Academic Search (Beta), a neat presentation of an author’s research publications and collaborations. In addition to the usual list of publications, there are nice visualizations of publications and citations over time, a network chart of co-authors and citations, and even an Erdős Number graph. The genealogy graph claims that it is based on data mining so “might not be perfect”. All this is …

Big Data: The Big Bad Wolf?

“Big Data” is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly is big data depends on who you ask. Some people talk about lots of measurements (what I call “fat data”), others of huge numbers of records (“long data”), and some talk of both. How much is big? Again, depends who you ask. As a statistician who’s (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of “just get a good sample and get it …