Over-fitting analogies

To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy: Say you are at a tailor’s shop, having an expensive suit (or dress) sewn for you. The tailor takes your measurements and asks whether you’d like the suit to fit you exactly, or whether there should be some “wiggle room”. What would you choose? The answer is, “it depends on how you plan to use the suit”. If you are getting married in a few days, then a close fit is probably desirable. In contrast, if you plan to wear the … Continue reading Over-fitting analogies
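The tailor analogy can be made concrete with a small sketch. Everything here is an illustrative assumption and not taken from the post: the data are simulated from a straight line plus noise, the "exact fit" model is a memorizer that returns the outcome of the nearest training point, and the "wiggle room" model is an ordinary least-squares line. The function and variable names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line: the fit with 'wiggle room'."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """'Exact fit': predict the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def simulate_y(rng, xs):
    # Assumed data-generating process: y = 2x + standard normal noise.
    return [2.0 * x + rng.gauss(0, 1) for x in xs]

rng = random.Random(0)
train_x = [i / 5 for i in range(50)]               # 50 distinct training inputs
train_y = simulate_y(rng, train_x)
test_x = [rng.uniform(0, 10) for _ in range(2000)]  # new, unseen records
test_y = simulate_y(rng, test_x)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

line_train_mse = mse(line, train_x, train_y)
line_test_mse = mse(line, test_x, test_y)
memo_train_mse = mse(memorizer, train_x, train_y)
memo_test_mse = mse(memorizer, test_x, test_y)

print("line      train/test MSE:", line_train_mse, line_test_mse)
print("memorizer train/test MSE:", memo_train_mse, memo_test_mse)
```

The memorizer fits the training data perfectly, like a suit cut exactly to today's measurements, yet it predicts new records worse than the looser straight line because it has also memorized the noise.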

Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data driven, they provide transparency about the resulting classifier and are far from being a black box. For this reason, trees are often used in applications that require transparency, such as insurance or credit approvals. Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up at the top layers of the tree are good candidates as “key players”. Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that … Continue reading Good and bad of classification/regression trees
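As a rough illustration of the variable-selection and outlier points, here is a minimal sketch of how the top split of a classification tree could be chosen. It assumes a Gini impurity criterion and midpoint thresholds, which is one common choice and not necessarily what the post has in mind. The tiny credit-approval data set and the names `income`, `age`, and `best_split` are made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, names):
    """Return (variable, threshold, weighted impurity) of the best single split.

    Each row is a tuple of predictor values followed by a 0/1 label.
    """
    best = None
    for j, name in enumerate(names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [row[-1] for row in rows if row[j] <= threshold]
            right = [row[-1] for row in rows if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best

# Hypothetical records: (income, age, approved)
rows = [
    (20, 25, 0), (30, 60, 0), (40, 35, 0), (45, 50, 0),
    (55, 30, 1), (60, 55, 1), (75, 40, 1), (90, 45, 1),
]
names = ["income", "age"]

top_split = best_split(rows, names)

# Same data with an extreme income value in the last record.
rows_with_outlier = rows[:-1] + [(9000, 45, 1)]
top_split_outlier = best_split(rows_with_outlier, names)

print(top_split)
print(top_split_outlier)
```

In this toy data the top split lands on `income`, flagging it as the candidate "key player", and the resulting rule ("income above 50") is readable by a non-specialist. The extreme income value leaves the split unchanged because a split depends only on which side of the threshold each record falls, not on how far away it is.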