The new Coursera course by Princeton Professor Mung Chiang was so popular that Amazon and the publisher ran out of copies of the textbook before the course even started (see “new website features” announcement; requires login). I experienced a stockout of my own textbook (“Data Mining for Business Intelligence”) a couple of years ago, which caused grief and slight panic among both students and instructors. With stockouts in mind, and recognizing the difficulty of obtaining textbooks outside of North America (unavailable, too expensive, or slow and costly to ship), I decided to take matters into my own hands and self-publish a “Practical Analytics” series of … Continue reading Self-publishing to the rescue
Quite a few of my social science colleagues think that predictive modeling is not a kosher tool for theory building. In our 2011 MISQ paper “Predictive Analytics in Information Systems Research” we argue that predictive modeling has a critical role to play not only in theory testing but also in theory building. How does it work? Here’s an interesting example: The new book The Secret Life of Pronouns by the cognitive psychologist Pennebaker is a fascinating read in many ways. The book describes how analysis of written language can be predictive of psychological state. In particular, the author describes an … Continue reading Language and psychological state: explain or predict?
I find it illuminating to read statistics “bibles” in various fields: they not only open my eyes to different domains, but also present the statistical approach and methods somewhat differently and consider unique domain-specific issues that cause “hmmm” moments. The 4th edition of Fundamentals of Clinical Trials, whose authors combine extensive practical experience at NIH and in academia, is full of hmmm moments. In one, the authors mention an important issue related to sampling that I have not encountered in other fields. In clinical trials, the gold standard is to allocate participants to either an intervention or a non-intervention (baseline) … Continue reading Statistical considerations and psychological effects in clinical trials
Visualizing a time series is an essential step in exploring its behavior. Statisticians think of a time series as a combination of four components: trend, seasonality, level and noise. All real-world series contain a level and noise, but not necessarily a trend and/or seasonality. It is important to determine whether trend and/or seasonality exist in a series in order to choose appropriate models and methods for descriptive or forecasting purposes. Hence, looking at a time plot, typical questions include: is there a trend? if so, what type of function can approximate it? (linear, exponential, etc.) is the trend fixed throughout the period … Continue reading Visualizing time series: suppressing one pattern to enhance another pattern
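The idea in the post's title, suppressing one pattern to make another easier to see, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the monthly series is synthetic (level + linear trend + yearly seasonality + noise), and the function names `make_series`, `yearly_means`, and `monthly_profile` are hypothetical, not taken from the post.

```python
import math
import random


def make_series(n_years=6, level=100.0, slope=0.5, amp=10.0, noise_sd=2.0, seed=0):
    """Synthetic monthly series: level + linear trend + yearly seasonality + noise."""
    rng = random.Random(seed)
    return [
        level
        + slope * t
        + amp * math.sin(2 * math.pi * (t % 12) / 12)
        + rng.gauss(0, noise_sd)
        for t in range(12 * n_years)
    ]


def yearly_means(series):
    """Suppress seasonality by averaging each full year, so the trend stands out."""
    return [sum(series[i:i + 12]) / 12 for i in range(0, len(series) - 11, 12)]


def monthly_profile(series):
    """Suppress the trend by subtracting each year's mean, then average by month.

    Assumes the series covers whole years. A little within-year trend remains
    in the result, so this is a rough view of seasonality, not an exact one.
    """
    means = yearly_means(series)
    profile = [0.0] * 12
    for t, y in enumerate(series):
        profile[t % 12] += (y - means[t // 12]) / len(means)
    return profile


series = make_series()
print("Yearly means (seasonality suppressed):", [round(m, 1) for m in yearly_means(series)])
print("Monthly profile (trend suppressed):", [round(p, 1) for p in monthly_profile(series)])
```

Plotting `yearly_means(series)` against the year, and `monthly_profile(series)` against the calendar month, gives two views of the same data, each emphasizing a different component.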
I found an interesting variation on the “correlation does not imply causation” mantra in the book Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by Cohen et al. (apparently one of the statistics bibles in behavioral sciences). The quote (p.7) looks like this: “Correlation does not prove causation; however, the absence of correlation implies the absence of the existence of a causal relationship.” Let’s let the first part rest in peace. At first glance, the second part seems logical: you find no correlation, then how can there be causation? However, after further pondering I reached the conclusion that this logic is flawed, … Continue reading No correlation -> no causation?
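The excerpt cuts off before the argument, so the following is not necessarily the author's reasoning. It is one standard textbook counterexample, sketched in Python: if y is completely determined by x through y = x², and x is spread symmetrically around zero, the Pearson (linear) correlation between them is zero even though x fully drives y. The helper `pearson` is written out here for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


# x is symmetric around zero; y depends on x deterministically.
xs = list(range(-50, 51))
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
print(f"Pearson correlation between x and x^2: {r:.6f}")
```

The correlation comes out at essentially zero because the positive and negative halves of x cancel in the covariance. Measures that capture nonlinear dependence would not be fooled in the same way.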
I am currently visiting the Indian School of Business (ISB) and enjoying their excellent library. As in my student days, I roam the bookshelves and discover books on topics about which I know little, some, or a lot. Reading and leafing through a variety of books, especially across different disciplines, gives some serious food for thought. As a statistician I have the urge to see how statistics is taught and used in other disciplines. I discovered an interesting book coming from the psychology literature by Herman Aguinis called Regression Analysis for Categorical Moderators. “Moderators” are what statisticians call “interactions”. However, when … Continue reading Discovering moderated relationship in the era of large samples
I just discovered a short set of videos (currently 35) on different data mining methods on the StatSoft website. They accompany StatSoft’s neat free online book (I admit, I did end up buying the print copy). The videos show up at the top of various data mining topics in the online book. You can also subscribe to the video series. Continue reading Short data mining videos
A new book is drawing emotional reactions from the normally calm statistics community (no pun intended): The Black Swan: The Impact of the Highly Improbable by Nassim Taleb uses blunt language to critique the field of statistics, statisticians, and users of statistics. I have not yet read the book, but judging from the many reviews and the coverage, I am running to get a copy. The widely read ASA statistics journal The American Statistician decided to devote a special section to reviewing the book and even obtained a (somewhat bland) response from the author. Four reputable statisticians (Robert Lund, Peter Westfall, Joseph … Continue reading Shaking up the statistics community