So, when one of the PhD students taking my “Scientific Data Collection” course stumbled upon this recent Science Magazine article “Mission Improbable: A Concise and Precise Definition of P-Value” he couldn’t resist emailing it to me. The article showcases the abuse of p-values in medical research due to their elusive meaning. This is not even with large samples! Researchers incorrectly interpret a p-value as the *probability* of an effect rather than as a measure of its statistical significance. The result of such confusion can clearly be devastating when the issue at stake is the effectiveness of a new drug or vaccine.

There are obviously better ways for assessing statistical significance, which are better aligned with practical significance and are also less ambiguous than p-values. One is confidence intervals. You get an estimate of your effect plus/minus some margin. You can then evaluate what the interval means practically. Another approach (good to try both) is to test predictive accuracy of your model, to see whether the prediction error is at a reasonable level — this is achieved by applying your model to new data, and evaluating how well it fits those new data.
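To make the confidence interval idea concrete, here is a minimal sketch in Python. It assumes a simple two-group comparison with simulated data and a large-sample normal approximation; the function name `mean_diff_ci` and all the numbers are made up for illustration and do not come from the article.

```python
import math
import random
import statistics

def mean_diff_ci(a, b, level=0.95):
    """Large-sample (normal approximation) CI for mean(a) - mean(b)."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    # Standard error of the difference between two independent sample means
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    # Critical value, about 1.96 for a 95% interval
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se

# Simulated (hypothetical) outcomes for a treated and a control group
random.seed(0)
treated = [random.gauss(10.5, 2.0) for _ in range(200)]
control = [random.gauss(10.0, 2.0) for _ in range(200)]

diff, lo, hi = mean_diff_ci(treated, control)
print(f"estimated effect: {diff:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The output is an effect size with a margin, in the units of the measurement, so the practical question ("is an effect somewhere in this range worth acting on?") can be asked directly.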

Shockingly enough, people seem to really want to use p-values, even if they don’t understand them. I was recently involved in designing materials for a basic course on statistics for engineers and managers in a big company. We created an innovative and beautiful set of slides, with real examples, straightforward explanations, and practical advice. The 200+ slides did not mention p-values at all, but rather focused on measuring effects, understanding sampling variability, standard errors and confidence intervals, seeing the value of residual analysis in linear regression, and learning how to perform and evaluate prediction. Yet the company asked us to replace some of this material (“not sure if we will need residual analysis, sampling error etc. our target audience may not use it”) with material on p-values and on the 0.05 threshold (“It will be sufficient to know interpreting the p-value and R-sq to interpret the results”). Sigh.

It’s hard to change a culture with such a long history.

The Albright book uses the dreaded word probability in its definition of p-value.

"The p-value of a sample is the probability of seeing a sample with at least as much evidence in favor of the alternative hypothesis as the sample actually observed. The smaller the p-value, the more evidence there is in favor of the alternative hypothesis." pg 485

Indeed, this very popular definition is correct, although very cryptic. It also means that you have to keep in mind what the alternative hypothesis is, whether it is one- or two-sided, etc. You get so bogged down in these details that it is very easy to forget what the p-value really means and how it relates to practical importance.
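One way to see what that definition is saying is to compute a p-value by brute force. The sketch below uses a permutation test, which is my own choice for illustration and not the book's method: it shuffles the group labels many times (which is what the world looks like if the null hypothesis of "no difference" is true) and counts how often the shuffled data show at least as much evidence for the alternative as the real data. The function name and the one-sided alternative (group `a` has the larger mean) are assumptions.

```python
import random
import statistics

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """One-sided permutation p-value for the alternative mean(a) > mean(b)."""
    rng = random.Random(seed)
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        # Shuffling the pooled values mimics the null: labels carry no information
        rng.shuffle(pooled)
        perm_diff = statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        if perm_diff >= observed:
            count += 1
    # The +1 counts the observed arrangement itself, so the p-value is never 0
    return (count + 1) / (n_perm + 1)

# Hypothetical data: two clearly separated groups, then two identical groups
print(permutation_p_value(list(range(10, 20)), list(range(10))))
print(permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4]))
```

Note what the number is: a statement about how surprising the data would be if there were no effect. It is not the probability that the effect is real, and it says nothing about how large the effect is.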

Relating to the p-value point, another fundamental flaw is assuming a normal distribution for almost everything. It has become a tradition not only to use p-values excessively and inappropriately, but also to use them with the wrong distributional assumption, which further compounds the flaws in their output. Quality control systems such as Six Sigma are great tools, but with abnormal skewness and kurtosis they fail to do their job. I think it unfortunately comes down to presenting in a language that a non-technical audience would understand.

The normal assumption, when adequate, is very powerful. When it is violated, the consequences depend on what you are trying to do. If the goal is prediction, then perhaps it is not too bad (but this should be assessed through a holdout set or a similar method). If the goal is inference, then indeed you can be way off. Statisticians will always look at residuals for assessing normality (and perhaps some normality tests with p-values…). In Six Sigma I'd expect residual analysis to be a major emphasis, especially when the official Six Sigma software Minitab produces residual analysis almost automatically.

A dummy-proof solution for inference when you don't want to assume normality is to use more robust methods. Bootstrap is one good way (the empirical distribution of a coefficient is obtained by resampling).
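For readers who have not seen the bootstrap in action, here is a rough sketch for a simple regression slope. It assumes the simplest variant (resampling whole observations with replacement and taking percentiles of the resulting slopes); the function names and the simulated data with skewed, non-normal noise are hypothetical.

```python
import random
import statistics

def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def bootstrap_slope_ci(xs, ys, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Draw n observations with replacement and refit the slope
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    lo = slopes[int((1 - level) / 2 * n_boot)]
    hi = slopes[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Simulated data: true slope 2, with skewed (exponential) noise centered at zero
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + (random.expovariate(1.0) - 1.0) for x in xs]

lo, hi = bootstrap_slope_ci(xs, ys)
print(f"slope estimate: {slope(xs, ys):.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

No normality assumption is used anywhere: the spread of the coefficient comes from the resampled data themselves. With very small samples a resample can end up with identical x values, so this sketch is meant for reasonably sized data sets.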

I agree that culture is one of the driving forces in the reliance on the p-value. Many introductory courses teach the "simple rule" that anything with a p-value less than .05 is to be considered important. That simple rule then drives the interpretation of results. While the rule is remembered by many, few remember the associated caveats. I have found that it is hard to move away from the reliance on a rule when meaningful interpretation requires more in-depth analysis; people sometimes feel less comfortable and out of their element.

It's true, in the West we are very reductionist. We like to isolate single variables in complex systems, and hope that leads us to understand the system itself. I think this tendency is what causes us to look for simple numerical thresholds that provide us with a clear yes/no. Fuzziness is hard for the Western mind to deal with.

Hi Rob – interesting perspective! When I think of a recent blood test that I did, the report was a list of components, each with a confidence interval and a note whether my number was within that interval or not. My doctor explained that just picking the extreme numbers and treating them would be nonsense. Instead, one has to look at the whole picture including the "within-interval" numbers to understand what is going on. I guess we should learn from Integrative Medicine and encourage "Holistic Statistics".

I have mixed feelings on the whole thing. While I do disapprove of over-interpreting statistical figures, there are "costs" to consuming more accurate information.

When I was in high school doing the student newspaper, we always had to include more or less all the relevant facts in the first two paragraphs because we knew very few people would read beyond that. I used to be extremely annoyed by this fact but I ended up doing the same thing once I started working. When reading a movie review, for instance, I’d just skip to the box with the “pros” and “cons” or even just the final score.

We have to keep in mind that the entire goal of statistics is to make consuming a large amount of information more manageable by making things simpler via models. By doing this we always give up some information in exchange for simplicity and practicality (principal component analysis is a good illustration of this). The question is really how much is too much. For those of us that have some familiarity with the field, looking at just the p-value seems to be absurd but it’s understandable for the general public. Both the effort to learn deeper statistics and the time to actually consume more complex analysis are costly, especially when dealing with the business world.

Fan – I agree that simple is good. It's all about parsimony. The problem is that the p-value is misunderstood. A confidence interval is just as simple and less prone to error.

Yeah I definitely agree on that point. I think the thing about the p-value is that it's one of those things that everyone is "comfortable with." Kind of like how the "unemployment rate" that is reported is often misunderstood (it doesn't include people who are not actively looking for jobs, and it counts people with only part-time jobs as employed). To an economist, this number must be used with caution, but the general public prefers not to deal with all the details and just use the one number. This of course can lead to mistakes (like how the recent "reduction" in the unemployment rate from 10.2% to 10% is more likely caused by retailers hiring seasonal workers for the holiday shopping season than by any real improvement in the economy).

Besides being misunderstood, I think making a decision based on p-values can sometimes magnify the actual advantage of making one choice over the other. There was an interesting article that I had come across – http://www.medscape.com/viewarticle/524206 (To P or Not to P: Why Use a P Value, Anyway?). The author provides an example of choosing between 2 bicycle routes from work to home. In his hypothetical example, one route has a much lower p-value; however, since the time saved is relatively insignificant, one wouldn't change their mind based on the p-values alone.
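A small simulation in the spirit of that bicycle example (the numbers below are made up and deliberately exaggerated; they are not the ones from the Medscape article). It assumes a large number of recorded trips per route and a large-sample z-test, and the function name `mean_diff_with_p` is just for illustration.

```python
import math
import random
import statistics

def mean_diff_with_p(a, b):
    """Difference in means and a two-sided p-value from a large-sample z-test."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = diff / se
    p = 2 * statistics.NormalDist().cdf(-abs(z))
    return diff, p

random.seed(42)
# Hypothetical commute times in minutes; route B is about 30 seconds faster on average
route_a = [random.gauss(30.0, 2.0) for _ in range(2000)]
route_b = [random.gauss(29.5, 2.0) for _ in range(2000)]

saved, p = mean_diff_with_p(route_a, route_b)
print(f"time saved: {saved:.2f} min, p-value: {p:.2g}")
```

The p-value comes out tiny, yet the quantity that matters for the decision is the half minute or so saved per trip. The p-value alone would never tell you that.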

Thanks Shalin — cool example! Here is a URL that has this article publicly available.