This continues my “To Explain or To Predict?” argument (in brief: statistical models aimed at causal explanation will not necessarily be good predictors). Now I move to a very early stage of study design: how should we collect data?
A well-known notion is that experiments are preferable to observational studies. The main difference between the two is control. In an experiment, the researcher can deliberately choose the “treatments”, control the assignment of subjects to those “treatments”, and then measure the outcome. In an observational study, by contrast, the researcher can only observe the subjects and measure variables of interest.
An experimental setting is therefore considered “cleaner”: you manipulate what you can and randomize what you can’t (echoing the famous maxim, “block what you can and randomize what you cannot”). In his book Observational Studies, Paul Rosenbaum writes, “Experiments are better than observational studies because there are fewer grounds for doubt” (p. 11).
Better for what purpose?
I claim that sometimes observational data are preferable. Why is that? Well, it all depends on the goal. If the goal is to infer causality, then indeed an experimental setting wins hands down (if feasible, of course). However, what if the goal is to accurately predict some measure for new subjects? Say, to predict which statisticians will write blogs.
Because prediction relies not on causal arguments but on associations (e.g., “statistician blog writers attend more international conferences”), the choice between an experimental and an observational setting should be guided by considerations beyond the usual ethical, economic, and feasibility constraints. For prediction, we care about how close the study environment is to the reality in which we will be predicting; we care about measurement quality and about whether those measurements will be available at prediction time.
An experimental setting might be too clean compared to the reality in which prediction will take place, thereby eliminating the ability of a predictive model to capture authentic “noisy” behavior. Hence, if the “dirtier” observational context contains association-type information that benefits prediction, it might be preferable to an experiment.
There are additional benefits of observational data for building predictive models:
- Predictive reality: Not only can the predictive model benefit from the “dirty” environment, but the assessment of how well the model performs (in terms of predictive accuracy) will be more realistic if tested in the “dirty” environment.
- Effect magnitude: Even if an input is shown to cause an output in an experiment, the magnitude of the effect within the experiment might not generalize to the “dirtier” reality.
- The unknown: Even scientists don’t know everything! Predictive models can discover previously unknown relationships (associative or even causal). Hence, limiting ourselves to an experimental setting that is designed around, and limited by, our current knowledge can keep that knowledge stagnant and predictive accuracy low.
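To make the “effect magnitude” point concrete, here is a toy simulation (all numbers, names, and the data-generating story are hypothetical, invented for illustration). A hidden factor h affects the outcome y and, in the observational world, is also correlated with the measured input x1. Randomizing x1 in an experiment severs that correlation, so a model trained on experimental data recovers the causal effect of x1 but loses the association-type signal that x1 carries in the “dirty” reality where predictions will actually be made:

```python
import random

random.seed(0)


def simulate(n, experimental):
    """Generate (x1, y) pairs. A hidden factor h drives y; in the
    observational world h also leaks into x1, creating an association."""
    data = []
    for _ in range(n):
        h = random.gauss(0, 1)                    # hidden "lifestyle" factor
        if experimental:
            x1 = random.gauss(0, 1)               # randomized: independent of h
        else:
            x1 = 0.5 * h + random.gauss(0, 0.5)   # observed x1 correlates with h
        y = x1 + 2 * h + random.gauss(0, 0.5)     # true causal effect of x1 is 1
        data.append((x1, y))
    return data


def fit_ols(data):
    """Simple least-squares line y = a + b*x; returns (intercept, slope)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    b = sxy / sxx
    return my - b * mx, b


def mse(model, data):
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in data) / len(data)


n = 20000
deploy = simulate(n, experimental=False)              # "dirty" reality
model_obs = fit_ols(simulate(n, experimental=False))  # trained observationally
model_exp = fit_ols(simulate(n, experimental=True))   # trained experimentally

print("slope (obs):", round(model_obs[1], 2))   # inflated beyond the causal 1,
print("slope (exp):", round(model_exp[1], 2))   # because it absorbs h's effect
print("deployment MSE, obs-trained:", round(mse(model_obs, deploy), 2))
print("deployment MSE, exp-trained:", round(mse(model_exp, deploy), 2))
```

The experimentally trained model estimates the causal slope (about 1) correctly, yet predicts worse in the observational deployment environment, where the observationally trained model’s inflated slope captures the extra association with the hidden factor. For explanation the experiment is right; for prediction, here, it is not.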
The Netflix Prize competition is a good example: if the goal were to uncover the causal underpinnings of users’ movie ratings, then an experiment might have been useful. But since the goal is to predict users’ ratings of movies, observational data like those released to the public are perhaps better than an experiment.