To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy:
Say you are at the tailor’s, who will be sewing an expensive suit (or dress) for you. The tailor takes your measurements and asks whether you’d like the suit to fit you exactly, or whether there should be some “wiggle room”. What would you choose?
The answer is, “it depends on how you plan to use the suit”. If you are getting married in a few days, then a close fit is probably desirable. In contrast, if you plan to wear the suit to work over the next few years, you’d most likely want some “wiggle room”… The latter case is similar to prediction, where you want to accommodate new records (your body’s measurements over the next few years) that are not exactly identical to the current data. Hence, you want to avoid over-fitting. The wedding scenario is similar to models built for causal explanation, where you do want the model to fit the data well (back to the explanation vs. prediction distinction).
I just found some nice terminology by Bruce Ratner (GenIQ.net), explaining the idea of over-fitting:
“A model is built to represent a training data… not to reproduce a training data. [Otherwise], a visitor from the validation data will not feel at home. The visitor encounters an uncomfortable fit in the model because s/he probabilistically does not look like a typical data-point from the training data. Thus, the misfit visitor takes a poor prediction.”
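The “uncomfortable fit” for validation visitors is easy to see numerically. Here is a minimal sketch (my own illustration, not from Ratner): fitting the same noisy data with a simple straight line versus a high-degree polynomial that passes through every training point. The data, the degrees, and the noise level are all hypothetical choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an underlying linear trend plus noise,
# split into a training set and a validation set
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=x_train.size)
x_valid = np.linspace(0.05, 0.95, 10)
y_valid = 2 * x_valid + rng.normal(0, 0.2, size=x_valid.size)

def fit_and_errors(degree):
    """Fit a polynomial of the given degree to the training data
    and return (training MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    valid_err = np.mean((np.polyval(coeffs, x_valid) - y_valid) ** 2)
    return train_err, valid_err

for d in (1, 9):
    tr, va = fit_and_errors(d)
    print(f"degree {d}: train MSE = {tr:.4f}, validation MSE = {va:.4f}")
```

The degree-9 polynomial "reproduces" the ten training points almost exactly (training error near zero), but the validation points, which don’t look like typical training points, receive worse predictions than the plain line gives: the suit was sewn with no wiggle room.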