Online data collection

Online data are a huge resource for research as well as for practice. Although it is often tempting to "scrape everything" using technologies like web crawling, it is extremely important to keep the goal of the analysis in mind. Are you trying to build a predictive model? A descriptive model? How will the model be used? Will it be deployed to new records? Dean Tau from Co-soft recently posted an interesting and useful comment in the LinkedIn group Data Mining, Statistics, and Data Visualization. With his permission, I am reproducing his post: "What you need to do before online data collection?"

Over-fitting analogies

To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy: Say you are at a tailor's shop, having an expensive suit (or dress) sewn for you. The tailor takes your measurements and asks whether you'd like the suit to fit you exactly, or whether there should be some "wiggle room". What would you choose? The answer is, "it depends on how you plan to use the suit". If you are getting married in a few days, then a close fit is probably desirable. In contrast, if you plan to wear the