The journal Basic and Applied Social Psychology has come out with a ban on p-values. To be precise, they’ve banned the “null hypothesis significance testing procedure” from articles published in the journal. This ban means that authors in the journal can’t claim that an effect they see in their data is “statistically significant” in the usual way that we’re all accustomed to reading.
Like all right-thinking people, I believe that the only coherent way to think about statistical questions is the Bayesian way, and that there are serious problems with the alternative “frequentist” approach. Moreover, the sort of (frequentist) significance testing banned by this journal can indeed lead to serious problems. It’s a large part of the reason that a strong argument can be made that (at least in some scientific disciplines) most published research findings are false.
All of that suggests that I ought to applaud this decision, but at the risk of seeming disloyal to my fellow Bayesians, I don’t. In fact, the journal editors’ decision to impose this ban makes me trust the quality of the journal less than I otherwise would, not more.
I was pleased to see that my old friend Allen Downey, always a voice of sanity on matters of this sort, is of the same opinion. I won’t rehash everything he says in his post, but I heartily endorse it.
The main thing to realize is that the techniques in question aren’t actually wrong. On the contrary, they correctly answer the questions they’re supposed to answer.
If your data allow you to reject the null hypothesis with a significance (“p-value”) of 5%, that means that, if the null hypothesis were true, there’d be only a 5% chance of getting data at least as extreme as the data you actually got.
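To make that definition concrete, here is a minimal sketch (my own illustration, not anything from the journal or its editorial) that estimates a p-value by simulation; the coin-flip numbers are made up for the example.

```python
# A rough sketch (illustrative, hypothetical numbers): estimating a p-value by
# simulation. Setup: we observe 60 heads in 100 coin flips and ask how often a
# fair coin (the null hypothesis) would produce a result at least that extreme.
import random

def simulated_p_value(observed_heads=60, n_flips=100, n_trials=50_000, seed=1):
    rng = random.Random(seed)
    at_least_as_extreme = sum(
        sum(rng.random() < 0.5 for _ in range(n_flips)) >= observed_heads
        for _ in range(n_trials)
    )
    return at_least_as_extreme / n_trials

print(simulated_p_value())  # about 0.028 -- and that is all the p-value tells you
```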
Some people — or so I’ve been told — labor under the misconception that the p-value tells you the probability that the null hypothesis is true, but it doesn’t. I’m going to rehash the old story here; skip ahead if you’ve heard it before.
Suppose that a pregnancy test yields correct results 95% of the time. Pat takes the test, which comes out positive. That means that the “null hypothesis” that Pat is not pregnant can be ruled out with a significance (p-value) of 5%. But it does not mean that there’s a 95% chance that Pat is pregnant. The probability that Pat is pregnant depends both on the result of the test and on any additional information you have about Pat — that is, the prior probability that Pat is pregnant. For example, if Pat is anatomically male, then the probability of pregnancy is zero, regardless of the test result.
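To see the distinction in numbers, here is a minimal sketch applying Bayes’ theorem to the example above; the 95% figure comes from the example, while the prior probabilities are made up for illustration.

```python
# A minimal sketch of the pregnancy example via Bayes' theorem. The 95% accuracy
# is taken from the example above; the priors are made up for illustration.

def posterior_pregnant(prior, sensitivity=0.95, false_positive_rate=0.05):
    """P(pregnant | positive test)."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive if p_positive > 0 else 0.0

# The p-value is 0.05 no matter who Pat is; the probability Pat cares about is not.
print(posterior_pregnant(prior=0.5))   # ~0.95 if pregnancy was 50-50 beforehand
print(posterior_pregnant(prior=0.01))  # ~0.16 if pregnancy was unlikely beforehand
print(posterior_pregnant(prior=0.0))   # 0.0: anatomically male, test result irrelevant
```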
Needless to say, Pat doesn’t care about the p-value; Pat cares about whether Pat is pregnant. The p-value is not wrong; it’s just uninteresting. As I’ve said before, this can be summarized with a convenient Venn diagram.
So if p-values are uninteresting, why shouldn’t the journal ban them?
The main reason is that they can be an ingredient in drawing useful conclusions. You can combine the p-value with your prior knowledge to get an answer to the question you’re actually interested in (“How likely is the hypothesis to be true?”). Just because the p-value doesn’t directly answer Pat’s question about whether Pat is pregnant, that doesn’t mean that it’s not valid and useful information about the reliability of the pregnancy test, which Pat can (and should) use in drawing conclusions.
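For the mechanics of that combination, here is a short sketch in odds form (again my illustration, reusing the same made-up test numbers): posterior odds are prior odds times the likelihood ratio of the test result.

```python
# Sketch of the same update in odds form (illustrative, made-up numbers):
# posterior odds = prior odds * likelihood ratio, where the likelihood ratio of a
# positive result is P(positive | pregnant) / P(positive | not pregnant).

def posterior_from_odds(prior, sensitivity=0.95, false_positive_rate=0.05):
    prior_odds = prior / (1 - prior)
    likelihood_ratio = sensitivity / false_positive_rate  # 19 for this test
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

print(posterior_from_odds(0.01))  # ~0.16, matching the Bayes sketch above
```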
As far as I can tell, the argument in favor of banning p-values is that people sometimes misinterpret them, but that’s a weak argument for a ban. It’s worth distinguishing two possibilities here:
- The article describes the statistical procedures and results accurately, but statistically illiterate readers misunderstand them. In this situation, I’m perfectly happy to go all libertarian and caveat emptor. Why should an intelligent reader be denied the right to hear about a p-value just because someone else might be misled due to his own ignorance?
- The article describes the statistical procedures and results in a misleading way. Obviously, the journal should not allow this. But a ban shouldn’t be necessary to enforce this. The whole point of a peer-reviewed journal is that experts evaluate the article to make sure it doesn’t contain incorrect or misleading statements. If the editors feel the need for a ban, then they are in effect admitting that they and their referees cannot effectively evaluate the validity of an article’s statistical claims.
The editorial describing the reasons for the ban states, incorrectly, that the banned technique is “invalid.” Moreover, the perceived need for a ban seems to me to arise from the editors’ lack of confidence in their own ability to weed out good statistics from bad. That’s why I say that, although I have no great love for p-values, this ban reduces my willingness to trust any results published in this journal.
There is so much bad use of misunderstood stats theorems, and such wide use of tests (like odds ratios) so unrigorous as not to be taught in courses in mathematical statistics, that all published experimental papers should require a sign-off by an actual PhD in MATHEMATICAL STATISTICS (from a math department), and I do not mean a biostats PhD; I mean a mathematician.
Well, the pregnancy test doesn’t know that Pat is male. Besides, the test is indeed highly unlikely to come out positive if Pat is male. So I think the example is a little weak. The whole point of making pregnancy tests is to get a jump on the more definitive knowledge that comes from visible pregnancy, ultrasounds, etc. And intrinsic to their validation (and interest) is to know what their rate of false positives/negatives is, which is all conveniently embodied in the p-value.
So you are right that p-values are perfectly valid. But they are also useful and interesting, in the right circumstances. And you are right as well that people who do not feel comfortable with statistics like this shouldn’t be publishing journals at all. The editors are right that these statistics are too often used to dress up uninteresting hypotheses, but they should be editing them out, not issuing fatwas. It is not an issue of mathematical validation, but of scientific relevance evaluation. Perhaps a more useful directive might be that all tests of this sort need to be thoroughly summarized / explained in words as well as in the p-value symbology.
Thanks for this post. I have often thought of the p-value as a measure of the strength of the evidence against the null hypothesis: it tells you how unlikely it would be to obtain data that go against the null hypothesis if the null hypothesis were true.
Susan, Susan! Yes, yes, yes! Studies based on statistical analyses that are way over the heads of the people doing the studies should be checked and validated by a real mathematician. I’m in biology, and I would welcome this with great enthusiasm. So many papers would simply be binned. Info glut reduced.