I’ve noticed that in almost every talk or discussion that involves the term Big Data, one of the first slides by the presenter or the first questions to be asked by the audience is “what is Big Data?” The typical answer has to do with some digits, many V’s, terms that end with “bytes”, or statements about software or hardware capacity.
I beg to differ.
“Big” is relative. It is relative to a certain field, and specifically to the practices in the field. We therefore must consider the benchmark of a specific field to determine if today’s data are “Big”. My definition of Big Data is therefore data that require a field to change its practices of data processing and analysis.
On the one extreme, consider weather forecasting, where data collection, huge computing power, and algorithms for analyzing huge amounts of data have been around for a long time. So is today’s climatology data “Big” for the field of weather forecasting? Probably not, unless you start considering new types of data that the “old” methods cannot process or analyze.
Another example is the field of genetics, where researchers have been working with and analyzing large-scale datasets (notably from the Human Genome Project) for some time. The “Big Data” in this field is about linking different databases and integrating domain knowledge with the patterns found in the data (“As big-data researchers churn through large tumour databases looking for patterns of mutations, they are adding new categories of breast cancer.”).
On the other extreme, consider studies in the social sciences, in fields such as political science or psychology that have traditionally relied on 3-digit sample sizes (if you were lucky). In these fields, a sample of 100,000 people is Big Data, because it challenges the methodologies used by researchers in the field. Here are some of the challenges that arise:
- Old methods break down: the common practice of using statistical significance tests to test theory no longer works, because p-values tend to be tiny irrespective of practical significance (one more reason to carefully consider the recent statement by the American Statistical Association about the danger of using the “p-value < 0.05” rule).
- Technology challenge: the statistical software and hardware used by many social science researchers might not be able to handle these new data sizes. Simple operations such as visualizing 100,000 observations in a scatter plot require new practices and software (such as state-of-the-art interactive software packages).
- Social science researchers need to learn how to ask more nuanced questions, now that richer data are available to them.
- Social scientists are not trained in data mining, yet the new sizes of datasets can allow them to discover patterns that are not hypothesized by theory.
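
To make the first challenge concrete, here is a minimal sketch of how the p-value for one fixed, practically negligible effect shrinks as the sample grows. The effect size of 0.02 standard deviations, the sample sizes, and the function name `two_sample_z_p_value` are my own illustrative assumptions, and the test is a simple two-sample z-test with known unit variance rather than any particular study's method.

```python
import math


def two_sample_z_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means between two equal-size groups.

    effect_size is the observed mean difference in standard-deviation units.
    Assumes a z-test with known unit variance (a simplification for illustration).
    """
    # The standard error of the difference in means is sqrt(2 / n),
    # so the z statistic is effect_size / sqrt(2 / n).
    z = effect_size * math.sqrt(n_per_group / 2)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


effect = 0.02  # hypothetical effect: tiny in practical terms
p_values = {n: two_sample_z_p_value(effect, n) for n in (100, 1_000, 100_000)}

for n, p in p_values.items():
    print(f"n per group = {n:>7,}: p = {p:.2g}")
```

With 100 people per group this effect is nowhere near "significant", while with 100,000 per group the same effect yields a p-value well below 0.001. The effect did not become any more important; only the sample size changed.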