“Big Data” is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly counts as big data depends on who you ask. Some people talk about lots of measurements (what I call “fat data”), others about huge numbers of records (“long data”), and some talk of both. How much is big? Again, it depends on who you ask.
As a statistician who has (luckily) strayed into data mining, my initial knee-jerk reaction was the traditional “just get a good sample and get it over with”. I later recognized that “fitting the data to the toolkit” (or, “to a hammer everything looks like a nail”) straitjackets some great opportunities.
The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question “What is the value of Big Data research vs. good samples?”, posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry – statisticians and data miners. I’d say that a sentiment analysis would come out mixed, but slightly negative at first (“at some level, big data is not necessarily a good thing“; “as statisticians, we need to point out the disadvantages of Big Data“). Over time, sentiment appears to have become more positive, though nowhere near the huge Big Data excitement in the media.
I created a Wordle of the discussion text to date (word size represents frequency). It highlights the main advantages and concerns of Big Data. Let me elaborate:
- Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
- Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
- Sampling is still highly useful with big data (see also blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance
- Statistical significance and p-values become meaningless when statistical models are fitted to very large samples. It is then practical significance that plays the key role.
- Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid “garbage-in-garbage-out”
- Such algorithms might be black boxes that do not help us understand the underlying relationship, but are useful in practice for accurately predicting new records
- Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about the data (such as independence of observations)
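To make the sampling point concrete, here is a minimal sketch (my own illustration, not from the discussion) of using many small subsamples of a big data set to gauge model stability. The data, the simple least-squares slope, and the subsample sizes are all hypothetical choices for the demo:

```python
import random
import statistics

random.seed(0)

# A hypothetical "big" dataset: y = 2x + noise
N = 100_000
xs = [random.uniform(0, 1) for _ in range(N)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]

def slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Fit the same model on many small random subsamples of the big data;
# the spread of the estimates across subsamples indicates model stability
slopes = []
for _ in range(200):
    idx = random.sample(range(N), 1_000)
    slopes.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))

print(f"mean slope: {statistics.mean(slopes):.3f}, "
      f"sd across subsamples: {statistics.stdev(slopes):.3f}")
```

A tight spread of estimates across subsamples suggests a stable model; a wide spread flags sensitivity to the particular records sampled. The same idea extends to holding out subsamples to evaluate predictive performance.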
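The point about p-values can also be demonstrated directly. The sketch below (my illustration, with an arbitrarily chosen tiny effect of 0.01) runs a one-sample z-test on a small and a very large sample drawn from the same population: the effect is practically negligible in both cases, but at n = 1,000,000 the p-value becomes vanishingly small:

```python
import math
import random

random.seed(42)

def z_test_pvalue(sample, mu0=0.0):
    """Two-sided one-sample z-test of the mean against mu0
    (population sd estimated from the sample)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    z = (mean - mu0) / math.sqrt(var / n)
    # Two-sided p-value from the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# A practically negligible effect: true mean 0.01 vs. a null of 0
effect = 0.01
small = [random.gauss(effect, 1.0) for _ in range(100)]
big = [random.gauss(effect, 1.0) for _ in range(1_000_000)]

print(f"n = 100:       p = {z_test_pvalue(small):.3f}")
print(f"n = 1,000,000: p = {z_test_pvalue(big):.2e}")
```

With a million records, even an effect far too small to matter in practice is declared wildly “significant”, which is why practical significance has to take over from statistical significance.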