Data mining algorithms: how many dummies?

There are lots of posts on “k-NN for Dummies”. This one is about “Dummies for k-NN”. Categorical predictor variables are very common. Those who’ve taken a statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps:

1. Convert the categorical variable that has m categories into m binary dummy variables.
2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category).

For example, if we have X = {red, yellow, green}, in step 1 we create three dummies: D_red = 1 if X = red and 0 otherwise, and likewise D_yellow and D_green. …
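To make the two steps concrete, here is a minimal sketch in Python using pandas; the toy data and the "D" column prefix are my own illustration, not from the post.

```python
import pandas as pd

# Hypothetical toy data: one categorical predictor with m = 3 categories
df = pd.DataFrame({"X": ["red", "yellow", "green", "red", "green"]})

# Step 1: convert the categorical variable into m binary dummy variables
all_dummies = pd.get_dummies(df["X"], prefix="D")

# Step 2: for regression, keep only m-1 dummies by dropping a reference category
regression_dummies = pd.get_dummies(df["X"], prefix="D", drop_first=True)

print(all_dummies.columns.tolist())         # ['D_green', 'D_red', 'D_yellow']
print(regression_dummies.columns.tolist())  # ['D_red', 'D_yellow'] (green is the reference)
```

Note that `drop_first=True` simply drops the alphabetically first level; any one of the m categories could serve as the reference instead.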

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I’ve had discussions with several instructors of data mining courses about a fact that is often left out of many books but is quite important: the different treatment of dummy variables in different data mining methods.

[Image from http://blog.excelmasterseries.com]

Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2, …, Dm (e.g., D1 = 1 if Country = Japan and 0 otherwise; D2 = 1 if Country = USA and 0 otherwise, etc.). …
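The excerpt cuts off here, but the reason regression can include only m-1 of these dummies is worth spelling out: the m dummies always sum to 1 in every row, so together with the intercept they are perfectly collinear. A minimal sketch of that fact, using hypothetical data for the country example in the text:

```python
import pandas as pd

# Hypothetical data for the country example
countries = pd.Series(["Japan", "USA", "France", "Japan", "USA"], name="Country")

# Factor the variable into m binary dummy variables, one per country
D = pd.get_dummies(countries)   # columns: France, Japan, USA

# Each row's dummies sum to exactly 1 -- perfectly collinear with an
# intercept column of 1s, which is why regression drops one dummy
print(D.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]
```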

Weighted nearest-neighbors

K-nearest neighbors (k-NN) is a simple yet often powerful classification/prediction method. The basic idea, for predicting a new observation, is to find the k most similar observations in terms of the predictor (X) values, and then let those k neighbors vote to determine the predicted class membership (or take the average of their Y values to predict a numerical outcome). Since this is such an intuitive method, I thought it would be useful to discuss two improvements that have been suggested by data miners. Both use weighting, but in different ways. One intuitive improvement is to weight the neighbors by their …
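The excerpt is cut off before naming the weighting scheme, but the most common variant weights each neighbor by the inverse of its distance, so that closer neighbors count more. Here is a minimal sketch of that idea for a numerical outcome; the function name and toy data are hypothetical, not from the post.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=3):
    """Distance-weighted k-NN for a numerical outcome:
    closer neighbors receive larger (inverse-distance) weights."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    w = 1.0 / (dists[nearest] + 1e-12)               # inverse-distance weights (avoid /0)
    return np.sum(w * y_train[nearest]) / np.sum(w)  # weighted average of neighbors' Y

# Hypothetical toy data: three nearby points and one distant outlier
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [8.0, 8.0]])
y_train = np.array([10.0, 12.0, 11.0, 50.0])
print(weighted_knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))
```

scikit-learn exposes the same idea through the `weights="distance"` option of `KNeighborsClassifier` and `KNeighborsRegressor`.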