Data mining algorithms: how many dummies?

There’s lots of posts on “k-NN for Dummies”. This one is about “Dummies for k-NN” Categorical predictor variables are very common. Those who’ve taken a Statistics course covering linear (or logistic) regression, know the procedure to include a categorical predictor into a regression model requires the following steps: Convert the categorical variable that has m categories, into m binary dummy variables Include only m-1 of the dummy variables as predictors in the regression model (the dropped out category is called the reference category) For example, if we have X={red, yellow, green}, in step 1 we create three dummies: D_red = … Continue reading Data mining algorithms: how many dummies?

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I’ve had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods. From Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,…, Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.) … Continue reading Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

The use of dummy variables in predictive algorithms

Anyone who has taken a course in statistics that covers linear regression has heard some version of the rule regarding pre-processing categorical predictors with more than two categories and the need to factor them into binary dummy/indicator variables: “If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a “baseline” and leave out its indicator.” (from Business Statistics by Sharpe, De Veaux & Velleman) Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a … Continue reading The use of dummy variables in predictive algorithms

Linear regression for binary outcome: even better news

I recently attended the 8th World Congress in Probability and Statistics, where I heard an interesting talk by Andy Tsao. His talk “Naivity can be good: a theoretical study of naive regression” (Abstract #0586) was about the use of Naive Regression, which is the application of linear regression to a categorical outcome, treating the outcome as numerical. He asserted that predictions from Naive Regression will be quite good. My last post was about the “goodness” of a linear regression applied to a binary outcome in terms of the estimated coefficients. That’s what explanatory modeling is about. What Dr. Tsao alerted me to, … Continue reading Linear regression for binary outcome: even better news

Linear regression for a binary outcome: is it Kosher?

Regression models are the most popular tool for modeling the relationship between an outcome and a set of inputs. Models can be used for descriptive, causal-explanatory, and predictive goals (but in very different ways! see Shmueli 2010 for more). The family of regression models includes two especially popular members: linear regression and logistic regression (with probit regression more popular than logistic in some research areas). Common knowledge, as taught in statistics courses, is: use linear regression for a continuous outcome and logistic regression for a binary or categorical outcome. But why not use linear regression for a binary outcome? the … Continue reading Linear regression for a binary outcome: is it Kosher?

Discovering moderated relationship in the era of large samples

I am currently visiting the Indian School of Business (ISB) and enjoying their excellent library. As in my student days, I roam the bookshelves and discover books on topics that I know little, some, or a lot. Reading and leafing through a variety of books, especially across different disciplines, gives some serious points for thought. As a statistician I have the urge to see how statistics is taught and used in other disciplines. I discovered an interesting book coming from the psychology literature by Herman Aguinas called Regression Analysis for Categorical Moderators. “Moderators” in statistician language is “interactions”. However, when … Continue reading Discovering moderated relationship in the era of large samples

Testing directional hypotheses: p-values can bite

I’ve recently had interesting discussions with colleagues in Information Systems regarding testing directional hypotheses. Following their request, I’m posting about this apparently illusive issue. In information systems research, the most common type of hypothesis is directional, i.e. the parameter of interest is hypothesized to go in a certain direction. An example would be testing the hypothesis that teenagers are more likely than older folks to use Facebook. Another example is the hypothesis that higher opening bids on eBay lead to higher final prices. In the Facebook example, the researcher would test the hypothesis by gathering data on Facebook usage by … Continue reading Testing directional hypotheses: p-values can bite

Start the Revolution

Variability is a key concept in statistics. The Greek letter Sigma has such importance, that it is probably associated more closely with statistics than with Greek. Yet, if you have a chance to examine the bookshelf of introductory statistics textbooks in a bookstore or the library you will notice that the variability between the zillions of textbooks, whether in engineering, business, or the social sciences, is nearly zero. And I am not only referring to price. I can close my eyes and place a bet on the topics that will show up in the table of contents of any textbook … Continue reading Start the Revolution

Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe’s example, by Basset et al. (1986):We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets have the same X variable, and only differ on the Y values. Here are the summary statistics for each of the four Y variables (A, B, C, D): A B C D Average 20.95 20.95 20.95 20.95 Std 1.495794 1.495794 … Continue reading Summaries or graphs?