p-values aren’t wrong; they’re just uninteresting

The journal Basic and Applied Social Psychology has come out with a ban on p-values. To be precise, they’ve banned the “null hypothesis significance testing procedure” from articles published in the journal. This ban means that authors in the journal can’t claim that an effect they see in their data is “statistically significant” in the usual way that we’re all accustomed to reading.

Like all right-thinking people, I believe that the only coherent way to think about statistical questions is the Bayesian way, and that there are serious problems with the alternative “frequentist” approach. Moreover, the sort of (frequentist) significance testing banned by this journal can indeed lead to serious problems. It’s a large part of the reason that a strong argument can be made that (at least in some scientific disciplines) most published research findings are false.

All of that suggests that I ought to applaud this decision, but at the risk of seeming disloyal to my fellow Bayesians, I don’t. In fact, the journal editors’ decision to impose this ban makes me trust the quality of the journal less than I otherwise would, not more.

I was pleased to see that my old friend Allen Downey, always a voice of sanity on matters of this sort, is of the same opinion. I won’t rehash everything he says in his post, but I heartily endorse it.

The main thing to realize is that the techniques in question aren’t actually wrong. On the contrary, they correctly answer the questions they’re supposed to answer.

If your data allow you to reject the null hypothesis with a significance (“p-value”) of 5%, that means that, if the null hypothesis were true, there’d be only a 5% chance of getting data at least as extreme as the data you actually got.
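In symbols, with $H_0$ standing for the null hypothesis, that’s

$$ p = P(\text{data at least this extreme} \mid H_0), $$

which is a statement about the data given the hypothesis, not about the hypothesis given the data.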

Some people — or so I’ve been told — labor under the misconception that the p-value tells you the probability that the null hypothesis is true, but it doesn’t. I’m going to rehash the old story here; skip ahead if you’ve heard it before.

Suppose that a pregnancy test yields correct results 95% of the time. Pat takes the test, which comes out positive. That means that the “null hypothesis” that Pat is not pregnant can be ruled out with a significance (p-value) of 5%. But it does not mean that there’s a 95% chance that Pat is pregnant. The probability that Pat is pregnant depends both on the result of the test and on any additional information you have about Pat — that is, the prior probability that Pat is pregnant. For example, if Pat is anatomically male, then the probability of pregnancy is zero, regardless of the test result.
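Just to put numbers on it (the 10% prior is invented for illustration, and I’m reading “correct 95% of the time” as both a 95% true-positive rate and a 5% false-positive rate), Bayes’ theorem gives

$$ P(\text{pregnant} \mid \text{positive}) = \frac{0.95 \times 0.10}{0.95 \times 0.10 + 0.05 \times 0.90} \approx 0.68, $$

a long way from the 95% you might naively read off the test’s reliability.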

Needless to say, Pat doesn’t care about the p-value; Pat cares about whether Pat is pregnant. The p-value is not wrong; it’s just uninteresting. As I’ve said before, this can be summarized with a convenient Venn diagram:

[Venn diagram]

So if p-values are uninteresting, why shouldn’t the journal ban them?

The main reason is that they can be an ingredient in drawing useful conclusions. You can combine the p-value with your prior knowledge to get an answer to the question you’re actually interested in (“How likely is the hypothesis to be true?”). Just because the p-value doesn’t directly answer Pat’s question about whether Pat is pregnant, that doesn’t mean it isn’t valid and useful information about the reliability of the pregnancy test, which Pat can (and should) use in drawing conclusions.
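Here’s a minimal sketch of that kind of update in Python. The function name, the priors, and the error rates are mine, purely for illustration, and I’m again treating the test’s true-positive and false-positive rates as the likelihoods under the two hypotheses:

```python
def prob_pregnant(prior, true_positive_rate=0.95, false_positive_rate=0.05):
    """Posterior probability of pregnancy given a positive test,
    via Bayes' theorem: combine an assumed prior with the test's
    (illustrative) error rates."""
    numerator = true_positive_rate * prior
    denominator = numerator + false_positive_rate * (1 - prior)
    return numerator / denominator

# The same 95%-reliable test supports very different conclusions
# depending on the prior:
for prior in (0.0, 0.1, 0.5):
    print(f"prior = {prior:.1f}  ->  P(pregnant | positive) = {prob_pregnant(prior):.3f}")
```

The point isn’t the particular numbers; it’s that the test’s reliability (the frequentist ingredient) and the prior both enter into the answer Pat actually wants.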

As far as I can tell, the argument in favor of banning p-values is that people sometimes misinterpret them, but that’s a weak argument for a ban. It’s worth distinguishing two possibilities here:

  1. The article describes the statistical procedures and results accurately, but statistically illiterate readers misunderstand them. In this situation, I’m perfectly happy to go all libertarian and caveat emptor. Why should an intelligent reader be denied the right to hear about a p-value just because someone else might be misled due to his own ignorance?
  2. The article describes the statistical procedures and results in a misleading way. Obviously, the journal should not allow this, but a ban shouldn’t be necessary to enforce it. The whole point of a peer-reviewed journal is that experts evaluate each article to make sure it doesn’t contain incorrect or misleading statements. If the editors feel the need for a ban, then they are in effect admitting that they and their referees cannot effectively evaluate the validity of an article’s statistical claims.

The editorial describing the reasons for the ban states, incorrectly, that the banned technique is “invalid.” Moreover, the perceived need for a ban seems to me to arise from the editors’ lack of confidence in their own ability to separate good statistics from bad. That’s why I say that, although I have no great love for p-values, this ban reduces my willingness to trust any results published in this journal.