Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I’ve had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods. From http://blog.excelmasterseries.com Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,…, Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.) … Continue reading Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data driven, they provide transparency about the resulting classifier are are far from being a blackbox. For this reason trees are often in applications that require transparency, such as insurance or credit approvals. Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up at the top layers of the tree are good candidates as “key players”. Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that … Continue reading Good and bad of classification/regression trees

Classification Trees: CART vs. CHAID

When it comes to classification trees, there are three major algorithms used in practice. CART (“Classification and Regression Trees”), C4.5, and CHAID. All three algorithms create classification rules by constructing a tree-like structure of the data. However, they are different in a few important ways. The main difference is in the tree construction process. In order to avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinuous tree growth. In contrast, both CART and C4.5 first grow the full tree … Continue reading Classification Trees: CART vs. CHAID