Data mining algorithms: how many dummies?

There are lots of posts on “k-NN for Dummies”. This one is about “Dummies for k-NN”. Categorical predictor variables are very common. Those who’ve taken a statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps: (1) convert the categorical variable that has m categories into m binary dummy variables; (2) include only m − 1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category). For example, if we have X = {red, yellow, green}, in step 1 we create three dummies: D_red = … Continue reading Data mining algorithms: how many dummies?
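For concreteness, here is a minimal sketch of those two steps (not from the original post), using a made-up three-category predictor and pandas/statsmodels as an assumed toolchain:

```python
# Minimal sketch: dummy-coding a 3-category predictor for regression.
# The data and column names are hypothetical, for illustration only.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "color": ["red", "yellow", "green", "red", "green", "yellow"],  # X with m = 3 categories
    "y":     [10.2, 13.5, 11.1, 9.8, 12.0, 14.1],
})

# Step 1: create m binary dummies (D_green, D_red, D_yellow)
dummies = pd.get_dummies(df["color"], prefix="D")

# Step 2: keep only m - 1 dummies; the dropped column (D_green here)
# becomes the reference category
X = sm.add_constant(dummies.drop(columns="D_green").astype(float))
model = sm.OLS(df["y"], X).fit()

# With a single categorical predictor, the intercept estimates the mean of the
# reference category and each coefficient estimates the shift from it.
print(model.params)
```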

Linear regression for binary outcome: even better news

I recently attended the 8th World Congress in Probability and Statistics, where I heard an interesting talk by Andy Tsao. His talk, “Naivity can be good: a theoretical study of naive regression” (Abstract #0586), was about Naive Regression: the application of linear regression to a categorical outcome, treating the outcome as numerical. He asserted that predictions from Naive Regression will be quite good. My last post was about the “goodness” of a linear regression applied to a binary outcome in terms of the estimated coefficients. That’s what explanatory modeling is about. What Dr. Tsao alerted me to, … Continue reading Linear regression for binary outcome: even better news
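As a rough illustration of the idea (my own sketch, not Dr. Tsao’s code), naive regression fits ordinary least squares to a 0/1 outcome and classifies by thresholding the numerical prediction at 0.5; on simulated data its class predictions often agree closely with logistic regression:

```python
# Sketch of "naive regression" for a binary outcome, on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500) > 0).astype(int)

# Treat the 0/1 outcome as numerical and fit ordinary least squares
naive = LinearRegression().fit(X, y)
pred_naive = (naive.predict(X) >= 0.5).astype(int)   # classify by thresholding at 0.5

# Compare with the "proper" model for a binary outcome
logit = LogisticRegression().fit(X, y)
pred_logit = logit.predict(X)

print("agreement between naive and logistic predictions:",
      np.mean(pred_naive == pred_logit))
```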

Sensitivity, specificity, false positive and false negative rates

I recently had an interesting discussion with a few colleagues in Korea regarding the definitions of the false positive and false negative rates and their relation to sensitivity and specificity. Apparently there is real confusion out there, and if you search the web you’ll find conflicting information. So let’s sort this out: assume we have a dataset of bankrupt and solvent firms, and we want to evaluate the performance of a certain model for predicting bankruptcy. Clearly the important class here is “bankrupt”, as the consequences of misclassifying bankrupt firms as solvent are more severe than those of misclassifying solvent firms as bankrupt. … Continue reading Sensitivity, specificity, false positive and false negative rates
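The excerpt cuts off before the definitions are resolved, so as a reference point here is a small sketch using one common convention, with “bankrupt” coded as the positive class, sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), and the false positive/negative rates as their complements (the full post discusses the conflicting definitions; the labels below are made up):

```python
# Confusion-matrix quantities under one common convention.
# 1 = bankrupt (the important, "positive" class), 0 = solvent.
import numpy as np

actual    = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
predicted = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])

TP = np.sum((actual == 1) & (predicted == 1))
FN = np.sum((actual == 1) & (predicted == 0))
TN = np.sum((actual == 0) & (predicted == 0))
FP = np.sum((actual == 0) & (predicted == 1))

sensitivity = TP / (TP + FN)          # bankrupt firms correctly flagged
specificity = TN / (TN + FP)          # solvent firms correctly cleared
false_positive_rate = FP / (FP + TN)  # = 1 - specificity (under this convention)
false_negative_rate = FN / (FN + TP)  # = 1 - sensitivity (under this convention)

print(sensitivity, specificity, false_positive_rate, false_negative_rate)
```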

Weighted nearest-neighbors

K-nearest neighbors (k-NN) is a simple yet often powerful classification/prediction method. The basic idea, for predicting a new observation, is to find the k most similar observations in terms of the predictor (X) values, and then let those k neighbors vote to determine the predicted class membership (or take their average Y value to predict a numerical outcome). Since this is such an intuitive method, I thought it would be useful to discuss two improvements that have been suggested by data miners. Both use weighting, but in different ways. One intuitive improvement is to weight the neighbors by their … Continue reading Weighted nearest-neighbors
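The excerpt stops mid-sentence, but a standard version of the first idea is to weight each neighbor by its distance to the new observation, so closer neighbors count more in the vote. A minimal sketch of that assumption, using scikit-learn on simulated data (not the post’s own example):

```python
# Distance-weighted k-NN versus plain (equal-vote) k-NN on a toy dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

plain    = KNeighborsClassifier(n_neighbors=5)                      # each neighbor gets one vote
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance")  # closer neighbors count more

plain.fit(X, y)
weighted.fit(X, y)

x_new = np.array([[0.2, -0.1]])
print(plain.predict(x_new), weighted.predict(x_new))
```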