Nature has a depressing piece about how to interpret p-values (i.e., the numbers people generally use to describe “statistical significance”). What’s depressing about it? Sentences like this:
Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm.
If it’s really true that “most scientists” think this, then we’re in deep trouble.
Anyway, the article goes on to give a good explanation of why this is wrong:
But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.
The main point here is the standard workhorse idea of Bayesian statistics: the experimental evidence gives you a recipe for updating your prior beliefs about the probability that any given statement about the world is true. The evidence alone does not tell you the probability that a hypothesis is true. It cannot do so without folding in a prior.
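In symbols, this is just Bayes' theorem (a standard restatement, nothing specific to the Nature piece):

P(hypothesis | data) = P(data | hypothesis) × P(hypothesis) / P(data)

The evidence enters through P(data | hypothesis), which is the side of the equation a p-value lives on. Without the prior, P(hypothesis), you can't get to the thing you actually want, P(hypothesis | data).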
To rehash the old, standard example, suppose that you take a test to see if you have kuru. The test gives the right answer 99% of the time. You test positive. That test “rules out” the null hypothesis that you’re disease-free with a p-value of 1%. But that doesn’t mean there’s a 99% chance you have the disease. The reason is that the prior probability that you have kuru is very low. Say one person in 100,000 has the disease. When you test 100,000 people, you’ll get roughly one true positive and 1000 false positives. Your positive test is overwhelmingly likely to be one of the false ones, low p-value notwithstanding.
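Here's a minimal sketch of that calculation in Python, using the numbers above (99% test accuracy, one-in-100,000 prevalence); the function and parameter names are just mine for illustration:

```python
# Kuru example: P(disease | positive test) via Bayes' theorem.
# Numbers (99% accuracy, 1-in-100,000 prevalence) are the ones quoted above.

def posterior_prob_disease(prevalence, sensitivity, false_positive_rate):
    """Probability of having the disease given a positive test."""
    p_positive_and_sick = prevalence * sensitivity
    p_positive_and_healthy = (1 - prevalence) * false_positive_rate
    return p_positive_and_sick / (p_positive_and_sick + p_positive_and_healthy)

p = posterior_prob_disease(prevalence=1e-5,
                           sensitivity=0.99,
                           false_positive_rate=0.01)
print(f"P(kuru | positive test) = {p:.2%}")  # about 0.10%
```

With these numbers, your chance of actually having kuru given a positive test is about 0.1%: the one true positive hiding among roughly a thousand positives.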
For some reason, people regard “Bayesian statistics” as something controversial and heterodox. Maybe they wouldn’t think so if it were simply called “correct reasoning,” which is all it is.
You don’t have to think of yourself as “a Bayesian” to interpret p-values in the correct way. Standard statistics textbooks all state clearly that a p-value is not the probability that a hypothesis is true, but rather the probability that, if the null hypothesis is true, a result as extreme as the one actually found would occur.
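Written out, a p-value is Pr(a result at least this extreme | the null hypothesis is true). That is a different quantity from Pr(the null hypothesis is true | the result), which is what people actually want to know, and confusing the two is exactly the mistake in the quote at the top of this post.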
Here’s a convenient Venn diagram to help you remember this:
(Confession: this picture is a rerun.)
If Nature's readers really don't know this, then something's seriously wrong with the way we train scientists.
Anyway, there’s a bunch of good stuff in this article:
The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.
Fisher’s got this exactly right. The standard in many fields for “statistical significance” is a p-value of 0.05. Unless you set the value far, far lower than that, a very large number of “significant” results are going to be false. That doesn’t necessarily mean that you shouldn’t use p-values. It just means that you should regard them (particularly with this easy-to-cross 0.05 threshold) as ways to decide which hypotheses to investigate further.
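To make that concrete, here's a back-of-the-envelope Python sketch. The 0.05 threshold is the one under discussion; the fraction of tested hypotheses that are actually true (10%) and the power of the studies (80%) are made-up illustrative numbers, not figures from the article:

```python
# Back-of-the-envelope: of the results that clear p < 0.05, how many are false?

alpha = 0.05          # significance threshold
power = 0.8           # chance a real effect is detected (assumed)
prior_true = 0.1      # fraction of tested hypotheses that are true (assumed)

true_positives = prior_true * power
false_positives = (1 - prior_true) * alpha
false_discovery_rate = false_positives / (true_positives + false_positives)
print(f"Fraction of 'significant' results that are false: {false_discovery_rate:.0%}")
# -> about 36% with these assumptions; worse if real effects are rarer.
```

Even with fairly generous assumptions, a third or more of the results that clear the 0.05 bar are false alarms, and the fraction climbs as the hypotheses being tested get less plausible.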
Another really important point:
Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously.
I didn’t know the term P-hacking, although I’d heard some of the others. Anyway, it’s a sure-fire way to generate significant-looking but utterly false results, and it’s unfortunately not at all unusual.
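For the curious, here's a toy Python simulation of one flavor of p-hacking: there is no real effect anywhere, but each "experiment" measures 20 unrelated outcomes and reports whichever one clears p < 0.05. The setup and numbers are invented for illustration, not taken from the article:

```python
# Toy p-hacking simulation: no real effects, but we test 20 outcomes per
# experiment and count how often at least one looks "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_outcomes, n_subjects = 2000, 20, 30

false_alarms = 0
for _ in range(n_experiments):
    # Two groups, 20 unrelated outcomes each, drawn from the same distribution.
    group_a = rng.normal(size=(n_outcomes, n_subjects))
    group_b = rng.normal(size=(n_outcomes, n_subjects))
    pvals = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    if pvals.min() < 0.05:      # report only the "best" outcome
        false_alarms += 1

print(f"Experiments with at least one p < 0.05: {false_alarms / n_experiments:.0%}")
# Expect roughly 1 - 0.95**20, i.e. about 64%, even though every null is true.
```

Keep trying things and something will eventually cross the threshold, even when there's nothing there.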
My first formal introduction* to proper "correct reasoning" came in my second year of graduate school, in a class given only to observational cosmologists (which I took for fun). Many of my peers in other subfields have as yet had no such formal introduction, and at this point never will. Some of them, I know, misinterpret p-values.
In contrast, my introduction to p-values came somewhere in my Algebra II textbook in middle school, which, IIRC from my days teaching high school, included a statement akin to, "The p-value tests whether or not hypotheses are true." Giving someone a tool so far in advance of giving them proper guidance on what the tool is useful for…no wonder people get it wrong.
*Of course, as you obviously know, Ted, I was informally introduced by a very capable practitioner of Bayesian statistics much earlier.
It’s worth noting that physics students typically are not required to take any formal statistics courses, at least not in any program I’ve been part of. We’re supposed to just pick it up on the streets. I’m pretty sure things are very different in other fields. The fact that physicists aren’t dramatically worse than other scientists therefore seems like an indictment of the way statistics is taught, in those disciplines where it is taught.