p-values in LARGE datasets

We had an interesting discussion in our department today, the result of confining statisticians and non-statisticians in a maze-like building. A colleague who calls himself a “non-stat-guru” sent a query to us “stat-gurus” (his labels) regarding p-values in a model that is estimated from a very large dataset. The problem: a certain statistical model was fit to 120,000 observations (that’s right, n=120K). And, as expected, all predictors turned out to be highly statistically significant. Why does this happen and what does it mean? When the number of observations is very large, standard errors of estimates become very small: a … Continue reading p-values in LARGE datasets
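To make the standard-error point concrete, here is a minimal sketch in Python. It is not the model from the discussion: it assumes a simple z-test on a single estimate, and the effect size of 0.01 standard deviations and the helper name `z_test_p_value` are made up for illustration. The same tiny effect is nowhere near significant with 100 observations and highly significant with 120,000, because the standard error shrinks like 1/sqrt(n).

```python
import math


def z_test_p_value(estimate, sd, n):
    """Two-sided z-test of H0: true value = 0 (hypothetical helper).

    Returns the standard error, the z statistic, and the p-value.
    """
    se = sd / math.sqrt(n)          # standard error shrinks as n grows
    z = estimate / se               # same estimate, larger z for larger n
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


# Assumed, illustrative numbers: an effect of 0.01 SD units, with SD = 1.
for n in (100, 10_000, 120_000):
    se, z, p = z_test_p_value(estimate=0.01, sd=1.0, n=n)
    print(f"n={n:>7}  se={se:.5f}  z={z:.2f}  p={p:.5f}")
```

The estimate itself never changes in this sketch; only the sample size does. That is why a small p-value in a very large dataset says little about whether the effect is large enough to matter in practice.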

Symposium on Statistical Challenges in eCommerce Research

The second symposium on statistical challenges in eCommerce will take place at the Carlson School of Management, University of Minnesota, May 22-23. For further details see http://misrc.csom.umn.edu/symposia/2006.05.22/ This symposium follows up the inaugural event held at the R. H. Smith School of Management of the University of Maryland last year, which brought together almost 100 researchers from the fields of information systems, statistics, data mining, marketing, and more. It was a stimulating event with lots of energy. Last year’s event led to collaborations, discussions, and a special issue of the high-impact journal Statistical Science, which should be out in May … Continue reading Symposium on Statistical Challenges in eCommerce Research

Interactive visualization of data

Two interesting articles describe how interactive visualization tools can be used for deriving insight from large business datasets. They both describe tools developed by the Human-Computer Interaction Lab at the University of Maryland. For James Bond fans, this place reminds me of Q branch: they come up with amazingly cool visualization tools that save your day. The first article, “Discovering Business Intelligence Using Treemap Visualizations” by Ben Shneiderman, describes Treemap, a tool for visualizing hierarchical data. Know smartmoney.com’s “Map of the Market”? Guess where that came from! The second article “The Surest Path to Visual Discovery” by Stephen Few … Continue reading Interactive visualization of data

You can’t escape Bayes…

My students are currently studying for a quiz on classification. One of the classifiers that we talked about is the Naive Bayes classifier. On Saturday evening I received a terrific email from Jason Madhosingh, one of my students. He writes: So, I’m taking a break from studying this evening by watching “numb3rs“, a CBS crime drama where a mathematician uses math to help solve crimes (go figure). Of course… he brings us Bayes theorem. There truly is no escape! And as a visualization junkie, I was also thrilled about his last comment “They also did a 3D scatterplot”! I guess I’ll have to … Continue reading You can’t escape Bayes…

Patenting predictive models?

A curious sentence in a short BusinessWeek report sent me hunting for clues. In Rep of a (Drug) Salesman, a consulting firm by the name of TargetRx “claims it can identify what really makes a sales rep effective”. From a press release on TargetRx’s website I found the following: Data collected from physicians via survey are then merged with actual prescribing and other behavioral data and analyzed using proprietary analytic methods to develop predictive models of physician prescribing behavior. The proprietary analytics are based in part on TargetRx’s patent-pending Method and System for Analyzing the Effectiveness of Marketing Strategies. TargetRx received notice … Continue reading Patenting predictive models?

Data mining and privacy

BusinessWeek touched upon a sensitive issue in the article If You’re Cheating on Your Taxes. It’s about federal and state agencies using data mining to find “the bad guys”. Although “data mining” is the term used in many of these stories, a more careful look reveals that there are hardly any advanced statistical/DM methods involved. The issue is the linkage/matching of different data sources. At the Statistical Challenges & Opportunities in eCommerce symposium last year, Stephen Fienberg, a professor of statistics at CMU and an expert on disclosure limitation, showed a semi-futuristic movie on a pizza parlor using an array … Continue reading Data mining and privacy

More on predictive vs. explanatory models

This week the predictive vs. explanatory modeling distinction came up on multiple occasions: first, in a study with an information systems colleague where the goal is to build a predictive application for ranking the most-likely auctions to transact; then, in an example that I gave in class of modeling eBay data to distinguish competitive from non-competitive auctions; and then, in a bunch of conversations with students that followed. The point that I want to make here, which I did not mention directly in my previous post on this subject, is that the set of PREDICTORS your model will include can be very … Continue reading More on predictive vs. explanatory models