Spatial data are inherently important in environmental applications. An example is collecting data from air or water quality sensors. Such data collection mechanisms introduce dependence in the collected data due to the sensors' spatial proximity or distance. This dependence must be taken into account not only in the data analysis stage (and there is a good statistical literature on spatial data analysis methods), but also in the design-of-experiments stage. One example of a design question is where to locate the sensors, and how many are needed. Where does explain vs. predict come into the picture? An interesting 2006 … Continue reading Designing an experiment on a spatial network: To Explain or To Predict?
Recently a posting on the Research Methods LinkedIn group asked what Principal Components Analysis (PCA) is in layman's terms, and what it is useful for. The answers clearly reflected the two “camps”: social science researchers and data miners. For data miners, PCA is a popular and useful data reduction method: it reduces the dimension of a dataset with many variables. For social scientists, PCA is a type of factor analysis without a rotation step. The last sentence might sound cryptic to a non-social-scientist, so a brief explanation is in order: The goal of rotation is to simplify and clarify the interpretation … Continue reading The PCA Debate
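The data miners' view of PCA as data reduction can be sketched in a few lines. This is my own illustration, not from the post: simulated data with four observed variables driven by two underlying dimensions, compressed to two principal components via the SVD (NumPy only).

```python
# Sketch: PCA as data reduction, computed from the SVD of the
# centered data matrix. Simulated data, not from the original post.
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))            # 2 underlying dimensions
X = latent @ rng.normal(size=(2, 4))          # 4 observed variables
X += 0.05 * rng.normal(size=(200, 4))         # small measurement noise

Xc = X - X.mean(axis=0)                       # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                        # project onto first 2 PCs
explained = (s**2 / (s**2).sum())[:2].sum()   # fraction of variance kept

print(scores.shape)      # (200, 2): the reduced dataset
print(round(explained, 3))
```

Two components capture nearly all the variance here, because the data really do live near a two-dimensional subspace.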
To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy: Say you are at the tailor’s, who will be sewing an expensive suit (or dress) for you. The tailor takes your measurements and asks whether you’d like the suit to fit you exactly, or whether there should be some “wiggle room”. What would you choose? The answer is, “it depends how you plan to use the suit”. If you are getting married in a few days, then probably a close fit is desirable. In contrast, if you plan to wear the … Continue reading Over-fitting analogies
Here is an interesting example of how similar mechanics lead to two very different statistical tools. Principal Components Analysis (PCA) is a powerful method for data compression, in the sense of capturing the information contained in a large set of variables by a smaller set of linear combinations of those variables. As such, it is widely used in applications that require data compression, such as visualization of high-dimensional data and prediction. Factor Analysis (FA), technically considered a close cousin of PCA, is popular in the social sciences, and is used for the purpose of discovering a small number of ‘underlying … Continue reading Principal Components Analysis vs. Factor Analysis
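To make the contrast concrete, here is a small sketch of my own (not from the post), using scikit-learn's PCA and FactorAnalysis on simulated data: PCA finds directions of maximal variance for compression, while FA estimates the loading matrix of assumed latent factors.

```python
# Sketch contrasting PCA (compression) with FA (latent factors).
# Simulated data: two latent factors drive six observed variables.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(1)
factors = rng.normal(size=(300, 2))            # latent factors
loadings = rng.normal(size=(2, 6))             # true loading matrix
X = factors @ loadings + 0.3 * rng.normal(size=(300, 6))

pca = PCA(n_components=2).fit(X)
fa = FactorAnalysis(n_components=2).fit(X)

# PCA: how much total variance two components capture
print(pca.explained_variance_ratio_.sum())
# FA: estimated loadings of the two latent factors on the 6 variables
print(fa.components_.shape)                    # (2, 6)
```

The mechanics are similar (both yield linear combinations), but the FA output is read as an estimate of an underlying structural model, whereas the PCA output is read as a compressed representation.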
This continues my “To Explain or To Predict?” argument (in brief: statistical models aimed at causal explanation will not necessarily be good predictors). And now, I move to a very early stage in the study design: how should we collect data? A well-known notion is that experiments are preferable to observational studies. The main difference between experimental studies and observational studies is an issue of control. In experiments, the researcher can deliberately choose “treatments”, control the assignment of subjects to the “treatments”, and then measure the outcome. In observational studies, by contrast, the researcher can only observe the subjects … Continue reading Are experiments always better?
I often glimpse the local newspapers while visiting a foreign country (as long as it is in a language I can read). Yesterday, the Australian Herald Sun had the article “Drop in light beer sales blamed for surge in street violence“. The facts presented: “Light beer sales have fallen 15% in seven years, while street crime has soared 43%”. More specifically: “Police statistics show street assaults rose from 6400 in 2000-01 to more than 9000 in 2007-08. At the same time, Victorians’ thirst for light beer dried up.” The interpretation by health officials: “there was a definite connection between the … Continue reading Beer and … crime
Last month The New York Times featured an article about Dr. Doom: Economics professor “Roubini, a respected but formerly obscure academic, has become a major figure in the public debate about the economy: the seer who saw it coming.” This article caught my statistician eye due to the description of “data” and “models”. While economists in the article portray Roubini as not using data and econometric models, a careful read shows that he actually does use data and models, but perhaps unusual data and unusual models! Here are two interesting quotes: “When I weigh evidence,” he told me, “I’m drawing … Continue reading Dr. Doom and data mining
Are explaining and predicting the same? An age-old debate in philosophy of science started with Hempel & Oppenheim’s 1948 paper, which equates the logical structure of predicting and explaining (saying that in effect they are the same, except that in explanation the phenomenon has already occurred, while in prediction it has not). Later on it was recognized that the two are in fact very different. When it comes to statistical modeling, how are the two different? Do we model data differently when the goal is to explain than to predict? In a recent paper co-authored with Otto Koppius from Erasmus University, … Continue reading Good predictions by wrong model?
Here’s another interesting example where explanatory and predictive tasks create different models: econometric models. These are essentially regression models of the form: Y(t) = beta0 + beta1 Y(t-1) + beta2 X(t) + beta3 X(t-1) + beta4 Z(t-1) + noise An example would be forecasting Y(t)= consumer spending at time t, where the input variables can be consumer spending in previous time periods and/or other information that is available at time t or earlier. In economics, when Y(t) is the state of the economy at time t, there is a distinction between three types of variables (aka “indicators”): Leading, coincident, and … Continue reading Forecasting with econometric models
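A regression of the form above, with lagged inputs, can be fit by ordinary least squares once the lagged columns are lined up. This is a minimal sketch of my own (simulated data, not from the post), using plain NumPy:

```python
# Sketch: fitting Y(t) = beta0 + beta1*Y(t-1) + beta2*X(t)
#                        + beta3*X(t-1) + beta4*Z(t-1) + noise
# by OLS on simulated data with known coefficients.
import numpy as np

rng = np.random.default_rng(42)
T = 200
X = rng.normal(size=T)
Z = rng.normal(size=T)
Y = np.zeros(T)
for t in range(1, T):  # simulate Y with known true coefficients
    Y[t] = (1.0 + 0.5 * Y[t-1] + 0.8 * X[t] + 0.3 * X[t-1]
            + 0.2 * Z[t-1] + 0.1 * rng.normal())

# design matrix: intercept, Y(t-1), X(t), X(t-1), Z(t-1)
A = np.column_stack([np.ones(T-1), Y[:-1], X[1:], X[:-1], Z[:-1]])
beta, *_ = np.linalg.lstsq(A, Y[1:], rcond=None)
print(np.round(beta, 2))  # estimates close to [1.0, 0.5, 0.8, 0.3, 0.2]
```

Note how each row of the design matrix pairs Y(t) with values available at time t or earlier, which is exactly what makes such a model usable for forecasting.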
When it comes to classification trees, there are three major algorithms used in practice: CART (“Classification and Regression Trees”), C4.5, and CHAID. All three algorithms create classification rules by constructing a tree-like structure from the data. However, they differ in a few important ways. The main difference is in the tree construction process. To avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinues tree growth. In contrast, both CART and C4.5 first grow the full tree … Continue reading Classification Trees: CART vs. CHAID
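The CART-style grow-then-prune approach can be illustrated with scikit-learn's DecisionTreeClassifier, whose ccp_alpha parameter applies cost-complexity pruning. A small sketch of my own, using the iris data for convenience:

```python
# Sketch: CART's grow-full-then-prune strategy via scikit-learn's
# cost-complexity pruning. Dataset and alpha value are illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# grow the full tree (no size limit)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
# prune it back: larger ccp_alpha penalizes tree complexity more
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# pruning trades a little training fit for a smaller, simpler tree
print(full.tree_.node_count, pruned.tree_.node_count)
```

This is the opposite philosophy from CHAID's stopping rule: rather than deciding up front when to stop splitting, CART overgrows and then cuts back the branches that do not earn their complexity.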