Forecasting large collections of time series

With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting “at scale”! Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financial forecasting. The types of methods taught in forecasting courses tend to be discipline-specific: Statisticians love ARIMA (autoregressive integrated moving average) models, with multivariate versions such as Vector ARIMA, as well as state space models and non-parametric methods such as STL decompositions. Econometricians and finance academics go one step further into ARIMA variations such as ARFIMA (f=fractional), … Continue reading Forecasting large collections of time series
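
The excerpt above mentions fitting discipline-specific models such as ARIMA. As a rough illustration of what forecasting a collection of series can look like in code (not the approach of the post itself), here is a minimal Python sketch that fits one ARIMA(1,1,1) model per series and forecasts 12 months ahead; the series names and data are made up.

```python
# Hedged sketch: one ARIMA model per series in a small synthetic collection.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)

# Hypothetical collection: 5 monthly series of length 60
series_collection = {
    f"series_{i}": pd.Series(
        100 + np.cumsum(rng.normal(0, 5, 60)),
        index=pd.date_range("2013-01-01", periods=60, freq="MS"),
    )
    for i in range(5)
}

forecasts = {}
for name, y in series_collection.items():
    fitted = ARIMA(y, order=(1, 1, 1)).fit()     # one model per series
    forecasts[name] = fitted.forecast(steps=12)  # 12-step-ahead forecasts

print(pd.DataFrame(forecasts).head())
```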

Data Ethics Regulation: Two key updates in 2018

This year, two important new regulations will impact research with human subjects: the EU’s General Data Protection Regulation (GDPR), which takes effect in May 2018, and the USA’s updated Common Rule, called the Final Rule, which has been in effect since January 2018. Both changes relate to protecting individuals’ private information and will affect researchers using behavioral data in terms of data collection, access, use, applications for ethics committee (IRB) approvals/exemptions, collaborations within the same country/region and beyond, and collaborations with industry. Both GDPR and the Final Rule try to modernize what today constitutes “private data” and data subjects’ rights and balance … Continue reading Data Ethics Regulation: Two key updates in 2018

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven’t expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). While descriptive and predictive modeling both rely on correlations (whereas explanatory modeling relies on causality), the two are in fact different. Descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims at generating values for new/future observations. … Continue reading Election polls: description vs. prediction
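
To make the descriptive/predictive contrast concrete, here is a minimal sketch (with simulated data, not taken from the post) showing the two uses of the same linear model: summarizing an observed relationship in-sample versus evaluating predictions on held-out observations.

```python
# Hedged sketch: descriptive vs. predictive use of a simple linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Descriptive use: fit to all the data and report the estimated relationship
descriptive = LinearRegression().fit(X, y)
print("estimated intercept and slope:", descriptive.intercept_, descriptive.coef_[0])

# Predictive use: hold out data and measure error on unseen observations
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
predictive = LinearRegression().fit(X_tr, y_tr)
print("holdout RMSE:", mean_squared_error(y_te, predictive.predict(X_te)) ** 0.5)
```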

Statistical test for “no difference”

To most researchers and practitioners using statistical inference, the popular hypothesis testing universe consists of two hypotheses: H0, the null hypothesis of “zero effect”, and H1, the alternative hypothesis of a “non-zero effect”. The alternative hypothesis (H1) is typically what the researcher is trying to find: a different outcome for a treatment and control group in an experiment, a regression coefficient that is non-zero, etc. Recently, several colleagues have independently asked me if there’s a statistical way to show that an effect is zero, or that there’s no difference between groups. Can we simply use the above setup? The answer … Continue reading Statistical test for “no difference”
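
For readers who want to see the conventional setup the excerpt describes, here is a minimal sketch with simulated data: a two-sample t-test of H0 (“zero difference”) against H1 (“non-zero difference”). Note that this standard setup tests for a difference; whether it can be turned around to show “no difference” is the question the post addresses.

```python
# Hedged sketch: the standard two-hypothesis setup, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
treatment = rng.normal(loc=10.0, scale=2.0, size=50)
control = rng.normal(loc=10.0, scale=2.0, size=50)

# H0: the two group means are equal; H1: they differ
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# Failing to reject H0 here is NOT evidence of "no difference".
```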

My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment with improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at the University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was quickly growing in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! … Continue reading My videos for “Business Analytics using Data Mining” now publicly available!

Data mining algorithms: how many dummies?

There are lots of posts on “k-NN for Dummies”. This one is about “Dummies for k-NN”. Categorical predictor variables are very common. Those who’ve taken a Statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps: (1) convert the categorical variable that has m categories into m binary dummy variables; (2) include only m-1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category). For example, if we have X={red, yellow, green}, in step 1 we create three dummies: D_red = … Continue reading Data mining algorithms: how many dummies?
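
Here is a minimal pandas sketch of the two regression steps listed above, using the hypothetical color predictor X = {red, yellow, green} from the example; the data values are made up for illustration.

```python
# Hedged sketch: dummy coding of a categorical predictor for regression.
import pandas as pd

X = pd.DataFrame({"color": ["red", "yellow", "green", "red", "green"]})

# Step 1: m categories -> m binary dummy variables
all_dummies = pd.get_dummies(X["color"])
print(all_dummies)

# Step 2: keep only m-1 dummies for the regression model
# (drop_first=True drops one category, which becomes the reference category)
regression_dummies = pd.get_dummies(X["color"], drop_first=True)
print(regression_dummies)
```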

Key challenges in online experiments: where are the statisticians?

Randomized experiments (or randomized controlled trials, RCTs) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to one of these groups. Story 1: (Internet) experiments in industry. Internet experiments have now become a major activity in giant companies such as Amazon, Google, and Microsoft, in smaller web-based companies, and among academic researchers in management and the social sciences. … Continue reading Key challenges in online experiments: where are the statisticians?
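
As a small illustration of the random-assignment principle described above (not code from the post), here is a sketch that shuffles hypothetical subject IDs and splits them between a treatment and a control condition.

```python
# Hedged sketch: random assignment of subjects to two experimental conditions.
import numpy as np

rng = np.random.default_rng(3)
subjects = [f"subject_{i}" for i in range(10)]  # hypothetical subject IDs

# Shuffle the subjects, then split them evenly between the two groups
shuffled = rng.permutation(subjects)
assignment = {
    s: ("treatment" if i < len(shuffled) // 2 else "control")
    for i, s in enumerate(shuffled)
}
print(assignment)
```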

Experimenting with quantified self: two months hooked up to a fitness band

It’s one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we’re all aware that our social network accounts and IoT devices share our private information with large and small companies and other organizations. And although we complain about our privacy, we are forgiving about sharing it, most likely because we really appreciate the benefits. So, I decided to check out my data sharing in a way that I cannot ignore: I started wearing a fitness band. I bought one of the … Continue reading Experimenting with quantified self: two months hooked up to a fitness band

Statistical software should remove *** notation for statistical significance

Now that the emotional storm following the American Statistical Association’s statement on p-values is slowing down (is it? was there even a storm outside of the statistics area?), let’s think about a practical issue. One that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. Software companies thus affect entire disciplines and how they progress and communicate. Star notation for p-value thresholds in statistical software: no matter whether your field uses SAS, SPSS (now IBM), Stata, or another statistical software package, you’re likely to have seen the star … Continue reading Statistical software should remove *** notation for statistical significance
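
For context, the star notation the post refers to typically maps p-value thresholds to asterisks. The sketch below uses the common 0.05/0.01/0.001 defaults; the exact thresholds and symbols vary by package, so treat this mapping as illustrative rather than any particular software’s behavior.

```python
# Hedged sketch: the conventional p-value-to-stars mapping (common defaults).
def significance_stars(p):
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

for p in [0.0004, 0.008, 0.03, 0.2]:
    print(p, significance_stars(p))
```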

A non-traditional definition of Big Data: Big is Relative

I’ve noticed that in almost every talk or discussion involving the term Big Data, one of the presenter’s first slides, or one of the first questions from the audience, is “what is Big Data?” The typical answer has to do with some digits, many V’s, terms that end with “bytes”, or statements about software or hardware capacity. I beg to differ. “Big” is relative. It is relative to a certain field, and specifically to the practices in that field. We therefore must consider the benchmark of a specific field to determine if today’s data are “Big”. … Continue reading A non-traditional definition of Big Data: Big is Relative