Nature has a depressing piece about how to interpret p-values (i.e., the numbers people generally use to describe “statistical significance”). What’s depressing about it? Sentences like this:
Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm.
If it’s really true that “most scientists” think this, then we’re in deep trouble.
Anyway, the article goes on to give a good explanation of why this is wrong:
But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.
The main point here is the standard workhorse idea of Bayesian statistics: the experimental evidence gives you a recipe for updating your prior beliefs about the probability that any given statement about the world is true. The evidence alone does not tell you the probability that a hypothesis is true. It cannot do so without folding in a prior.
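In symbols, the updating recipe looks like this when written in terms of odds, which matches the article's phrase about "the odds that a real effect was there in the first place." The symbols are my own shorthand, not the article's: H_1 is "the effect is real," H_0 is the null hypothesis, and D is the data.

```latex
\[
\underbrace{\frac{P(H_1 \mid D)}{P(H_0 \mid D)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(D \mid H_1)}{P(D \mid H_0)}}_{\text{what the evidence supplies}}
\times
\underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}}
\]
```

The experiment only ever gives you the middle factor. If the prior odds are tiny, even an impressive-looking middle factor can leave the posterior odds small.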
To rehash the old, standard example, suppose that you take a test to see if you have kuru. The test gives the right answer 99% of the time. You test positive. That test “rules out” the null hypothesis that you’re disease-free with a p-value of 1%. But that doesn’t mean there’s a 99% chance you have the disease. The reason is that the prior probability that you have kuru is very low. Say one person in 100,000 has the disease. When you test 100,000 people, you’ll get roughly one true positive and 1000 false positives. Your positive test is overwhelmingly likely to be one of the false ones, low p-value notwithstanding.
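If you'd rather check the arithmetic than take my word for it, here is a minimal sketch in Python. It assumes that "the right answer 99% of the time" applies equally to sick and healthy people (that is, 99% sensitivity and 99% specificity), which the example above leaves implicit. The function name is just something I made up for illustration.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' theorem."""
    true_pos = prior * sensitivity                    # sick and correctly flagged
    false_pos = (1.0 - prior) * (1.0 - specificity)   # healthy but flagged anyway
    return true_pos / (true_pos + false_pos)


prior = 1 / 100_000   # one person in 100,000 has kuru
accuracy = 0.99       # assumed to hold for both sick and healthy people

posterior = posterior_given_positive(prior, accuracy, accuracy)
print(f"P(kuru | positive test) = {posterior:.5f}")
```

The answer comes out to roughly one in a thousand, which is just the "one true positive among about 1000 false positives" count from the paragraph above.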
For some reason, people regard “Bayesian statistics” as something controversial and heterodox. Maybe they wouldn’t think so if it were simply called “correct reasoning,” which is all it is.
You don’t have to think of yourself as “a Bayesian” to interpret p-values in the correct way. Standard statistics textbooks all state clearly that a p-value is not the probability that a hypothesis is true, but rather the probability that, if the null hypothesis is true, a result at least as extreme as the one actually found would occur.
Here’s a convenient Venn diagram to help you remember this:
(Confession: this picture is a rerun.)
If Nature’s readers really don’t know this, then something’s seriously wrong with the way we train scientists.
Anyway, there’s a bunch of good stuff in this article:
The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.
Fisher’s got this exactly right. The standard in many fields for “statistical significance” is a p-value of 0.05. Unless you set the threshold far, far lower than that, a very large number of “significant” results are going to be false. That doesn’t necessarily mean that you shouldn’t use p-values. It just means that you should regard them (particularly with this easy-to-cross 0.05 threshold) as ways to decide which hypotheses to investigate further.
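To put a rough number on "a very large number," here is a back-of-the-envelope sketch. The inputs are invented for illustration and do not come from the article: I assume 10% of the hypotheses people test are actually true, and that a real effect gets detected 80% of the time.

```python
def false_fraction_of_significant(alpha, power, prior_true):
    """Fraction of 'significant' results that are false alarms.

    alpha:      significance threshold (chance a true null crosses it)
    power:      chance a real effect crosses the threshold (assumed)
    prior_true: fraction of tested hypotheses that are real (assumed)
    """
    false_alarms = alpha * (1.0 - prior_true)
    real_hits = power * prior_true
    return false_alarms / (false_alarms + real_hits)


# Hypothetical inputs: 10% of hypotheses true, 80% power.
at_005 = false_fraction_of_significant(alpha=0.05, power=0.8, prior_true=0.1)
at_0001 = false_fraction_of_significant(alpha=0.001, power=0.8, prior_true=0.1)

print(f"threshold 0.05:  {at_005:.0%} of significant results are false")
print(f"threshold 0.001: {at_0001:.0%} of significant results are false")
```

With these made-up inputs, about a third of the results that clear the 0.05 bar are false, and the fraction drops to about 1% at a threshold of 0.001. Different assumptions about the prior and the power will move those numbers a lot, which is rather the point.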
Another really important point:
Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously.
I didn’t know the term P-hacking, although I’d heard some of the others. Anyway, it’s a sure-fire way to generate significant-looking but utterly false results, and it’s unfortunately not at all unusual.
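It's easy to see how well this works with a toy simulation. The setup below is entirely hypothetical: every "study" measures 20 unrelated outcomes, none of which has any real effect, and declares success if any one of them reaches p < 0.05. For simplicity the p-value comes from a two-sided z-test on data with known unit variance.

```python
import random
from statistics import NormalDist, fmean


def p_value_two_sided(sample):
    """Two-sided z-test of 'mean = 0' for data with known unit variance."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def hacked_study_finds_something(rng, n_outcomes=20, n_samples=20, alpha=0.05):
    """Try outcome after outcome and stop at the first 'significant' one."""
    for _ in range(n_outcomes):
        # The null hypothesis is true here: pure noise with mean zero.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if p_value_two_sided(sample) < alpha:
            return True
    return False


rng = random.Random(0)  # fixed seed so the run is repeatable
n_studies = 1000
hits = sum(hacked_study_finds_something(rng) for _ in range(n_studies))
fraction = hits / n_studies
expected = 1 - 0.95 ** 20  # chance that at least one of 20 null tests "succeeds"

print(f"simulated: {fraction:.2f}, expected: {expected:.2f}")
```

With 20 independent tries, the chance of at least one false alarm is 1 - 0.95^20, or about 64%, and the simulation should land close to that. There is no effect anywhere in the simulated data, yet most studies come back with something "significant" to report.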