Forecasting large collections of time series

With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting “at scale”! Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tend to be discipline-specific: Statisticians love ARIMA (autoregressive integrated moving average) models, with multivariate versions such as Vector ARIMA, as well as state space models and non-parametric methods such as STL decompositions. Econometricians and finance academics go one step further into ARIMA variations such as ARFIMA (f=fractional), …

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. There I briefly mentioned a third type of modeling: descriptive. However, I haven’t expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). Descriptive and predictive modeling both rely on correlations, whereas explanatory modeling relies on causality; yet the former two are in fact different. Descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims at generating values for new/future observations. …

Key challenges in online experiments: where are the statisticians?

Randomized experiments (or randomized controlled trials, RCTs) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to one of these groups. Story 1: (Internet) experiments in industry. Internet experiments have now become a major activity in giant companies such as Amazon, Google, and Microsoft, in smaller web-based companies, and among academic researchers in management and the social sciences. …
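The random assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the post: the function name `assign_randomly`, the patient labels, the seed, and the choice to balance group sizes by dealing shuffled subjects round-robin are all assumptions made for the example.

```python
import random
from collections import Counter


def assign_randomly(subjects, conditions, seed=None):
    """Randomly assign each subject to one condition (hypothetical helper).

    Subjects are shuffled, then dealt round-robin across the conditions,
    so group sizes differ by at most one.
    """
    rng = random.Random(seed)      # seeded generator so the run is reproducible
    shuffled = list(subjects)      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return {s: conditions[i % len(conditions)] for i, s in enumerate(shuffled)}


# Illustrative data: ten made-up patient IDs, one treatment and one control group.
patients = [f"patient_{i}" for i in range(1, 11)]
assignment = assign_randomly(patients, ["treatment", "control"], seed=42)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Because which subject lands in which group depends only on the shuffle, and not on any characteristic of the subject, the groups are comparable on average, which is what lets differences in outcomes be attributed to the treatment.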

Experimenting with quantified self: two months hooked up to a fitness band

It’s one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we’re all aware that our social network accounts and IoT devices share our private information with large and small companies and other organizations. And although we complain about our privacy, we are forgiving about sharing it, most likely because we really appreciate the benefits. So, I decided to check out my data sharing in a way that I cannot ignore: I started wearing a fitness band. I bought one of the …

A non-traditional definition of Big Data: Big is Relative

I’ve noticed that in almost every talk or discussion that involves the term Big Data, one of the first slides by the presenter or the first questions to be asked by the audience is “what is Big Data?” The typical answer has to do with some digits, many V’s, terms that end with “bytes”, or statements about software or hardware capacity. I beg to differ. “Big” is relative. It is relative to a certain field, and specifically to the practices in the field. We therefore must consider the benchmark of a specific field to determine if today’s data are “Big”. …

What’s in a name? “Data” in Mandarin Chinese

The term “data”, now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I’ve been aware of is whether the English term “data” is singular or plural. The tone of an entire article would be different based on the author’s decision. In Hebrew, the word is plural (Netunim, with the final “im” signifying plural), so no question arises. Today I discovered another “data” duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the …

What does “business analytics” mean in academia?

But what exactly does this mean? At the recent ISIS conference, I organized and moderated a panel called “Business Analytics and Big Data: How it affects Business School Research and Teaching”. The goal was to tackle the ambiguity in the terms “Business Analytics” and “Big Data” in the context of business school research and teaching. I opened with a few points: Some research b-schools are posting job ads for tenure-track faculty in “Business Analytics” (e.g., University of Maryland; Google “professor business analytics position” for plenty more). What does this mean? What is supposed to be the background of these candidates …

Linear regression for a binary outcome: is it Kosher?

Regression models are the most popular tool for modeling the relationship between an outcome and a set of inputs. Models can be used for descriptive, causal-explanatory, and predictive goals (but in very different ways! See Shmueli 2010 for more). The family of regression models includes two especially popular members: linear regression and logistic regression (with probit regression more popular than logistic in some research areas). Common knowledge, as taught in statistics courses, is: use linear regression for a continuous outcome and logistic regression for a binary or categorical outcome. But why not use linear regression for a binary outcome? The …
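To make the question concrete, here is a minimal sketch of fitting a linear regression to a binary outcome (often called a linear probability model). The excerpt above is truncated, so this does not reproduce the post's own argument; it only illustrates one well-known property of the approach, namely that fitted values can fall outside the [0, 1] range. The data are made up for illustration, and `fit_simple_ols` is a hypothetical helper written in plain Python.

```python
def fit_simple_ols(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: a single input x and a binary (0/1) outcome y.
x = list(range(1, 11))
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

intercept, slope = fit_simple_ols(x, y)
fitted = [intercept + slope * xi for xi in x]

# The fitted values are read as estimated probabilities of y = 1,
# yet nothing constrains them to stay between 0 and 1.
print(min(fitted), max(fitted))
```

With these data the smallest fitted value is below 0 and the largest is above 1. Whether that matters depends on the goal of the analysis (descriptive, causal-explanatory, or predictive).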

Policy-changing results or artifacts of big data?

The New York Times article Big Study Links Good Teachers to Lasting Gain covers a research study coming out of Harvard and Columbia on “The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood”. The authors used sophisticated econometric models applied to data from a million students to conclude: “We find that students assigned to higher VA [Value-Added] teachers are more successful in many dimensions. They are more likely to attend college, earn higher salaries, live in better neighborhoods, and save more for retirement. They are also less likely to have children as teenagers.” When I see social scientists using statistical …

Big Data: The Big Bad Wolf?

“Big Data” is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly big data is depends on who you ask. Some people talk about lots of measurements (what I call “fat data”), others of huge numbers of records (“long data”), and some talk of both. How much is big? Again, it depends who you ask. As a statistician who’s (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of “just get a good sample and get it …