Trees in pivot table terminology

Recently, non-data-mining colleagues have asked me to explain how Classification and Regression Trees work. While a detailed explanation with examples exists in my co-authored textbook Data Mining for Business Intelligence, I found that the following explanation worked well with people who are familiar with Excel’s Pivot Tables:

[Figure: Classification tree for predicting vulnerability to famine]

Suppose the goal is to generate predictions for some variable, numerical or categorical, given a set of predictors. The idea behind trees is to create groups of records with similar profiles in terms of their predictors, and then average the outcome variable of interest to …
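The pivot-table idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the predictor names (rainfall, income) and the outcome name (vulnerable) are all made up, and the groups are fixed by hand, whereas a tree algorithm chooses the grouping from the data.

```python
from collections import defaultdict

# Hypothetical records (invented for illustration, not from the post).
records = [
    {"rainfall": "low", "income": "low", "vulnerable": 1},
    {"rainfall": "low", "income": "low", "vulnerable": 1},
    {"rainfall": "low", "income": "high", "vulnerable": 1},
    {"rainfall": "low", "income": "high", "vulnerable": 0},
    {"rainfall": "high", "income": "low", "vulnerable": 0},
    {"rainfall": "high", "income": "low", "vulnerable": 0},
    {"rainfall": "high", "income": "low", "vulnerable": 0},
    {"rainfall": "high", "income": "low", "vulnerable": 1},
    {"rainfall": "high", "income": "high", "vulnerable": 0},
    {"rainfall": "high", "income": "high", "vulnerable": 0},
]


def pivot_average(rows, predictors, outcome):
    """Group rows by predictor profile and average the outcome per group,
    much like a pivot table with predictors as row/column fields."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        profile = tuple(row[p] for p in predictors)
        sums[profile] += row[outcome]
        counts[profile] += 1
    return {profile: sums[profile] / counts[profile] for profile in sums}


# Each "cell" holds the average outcome for records sharing that profile.
groups = pivot_average(records, ["rainfall", "income"], "vulnerable")


def predict(new_record, predictors=("rainfall", "income")):
    """Predict by looking up the average outcome of the matching group."""
    return groups[tuple(new_record[p] for p in predictors)]
```

For a 0/1 outcome such as this one, the group average can be read as the proportion of vulnerable records in that group; for a numerical outcome it is simply the group mean.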

Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data-driven, they provide transparency about the resulting classifier and are far from being a black box. For this reason, trees are often used in applications that require transparency, such as insurance or credit approvals. Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up at the top layers of the tree are good candidates as “key players”. Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that …

Classification Trees: CART vs. CHAID

When it comes to classification trees, there are three major algorithms used in practice: CART (“Classification and Regression Trees”), C4.5, and CHAID. All three algorithms create classification rules by constructing a tree-like structure of the data. However, they differ in a few important ways. The main difference is in the tree construction process. To avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinues tree growth. In contrast, both CART and C4.5 first grow the full tree …
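To make the idea of a statistical stopping rule concrete, here is a minimal Python sketch, not CHAID itself. It assumes a candidate split that produces a 2x2 table (two child nodes by two outcome classes) with non-zero row and column totals, and it allows the split only if the Pearson chi-square statistic exceeds the 5% critical value for one degree of freedom. The function names and the example tables are hypothetical; real CHAID implementations do more, such as merging categories and adjusting for multiple tests.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]].
    Rows are the two candidate child nodes, columns the two classes.
    Assumes every row and column total is non-zero."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE = 3.841


def should_split(table, critical=CRITICAL_VALUE):
    """Stopping rule: keep growing only if the split is significant."""
    return chi_square_2x2(table) > critical


# Hypothetical candidate splits (counts invented for illustration).
strong_split = [[30, 10], [10, 30]]  # children differ clearly in class mix
weak_split = [[20, 20], [21, 19]]    # children look almost identical
```

With these made-up counts, should_split(strong_split) allows the split while should_split(weak_split) stops growth at that node, which is the sense in which a stopping rule limits tree size during construction.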