If only scientists hyped marginal results more, our problems would be solved

Yesterday’s New York Times Sunday Review section contains one of the most gloriously silly pieces of science journalism I’ve seen in a while.

The main point of the article, which is by the science historian Naomi Oreskes and is headlined “Playing Dumb on Climate Change,” is that the 95% confidence threshold that’s commonly used as the requirement for “statistical significance” is too high. That’s right — in a world in which there’s strong reason to believe that most published research findings are false (in biomedical research), Oreskes thinks that the main problem we need to address is that scientists are too shy and retiring when it comes to promoting marginal results.

The truth, of course, is precisely the opposite. To quote from a good Nature article on this stuff from last year (which I wrote about back in February),

The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.

In case anyone doesn’t know, what both the Times article and I referred to as “95% confidence” is the same thing as what statisticians call a P-value of 5%.

Fisher, not surprisingly, had this exactly right. A “95% confidence” result is merely a hint that something interesting might be going on. It’s far from definitive evidence. And yet scientists and journalists routinely report these results as if they were virtual certainties. Oreskes’s proposal to lower that threshold is precisely the opposite of what we should do.

In the course of arguing for this position, Oreskes repeats a common misconception about P-values:

Typically, scientists apply a 95 percent confidence limit, meaning that they will accept a causal claim only if they can show that the odds of the relationship’s occurring by chance are no more than one in 20. But it also means that if there’s more than even a scant 5 percent possibility that an event occurred by chance, scientists will reject the causal claim. It’s like not gambling in Las Vegas even though you had a nearly 95 percent chance of winning.

A 95% confidence result (P < 5%) certainly does not mean that there’s a 5% probability that the event occurred by chance. It means that, if we assume that there is no causal link, then there’s less than a 5% chance of seeing results at least as extreme as the ones we found.
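To make the definition concrete, here is a minimal sketch using a made-up experiment (not one from the Times article): a coin lands heads 15 times in 20 flips, and the null hypothesis is that the coin is fair. The function name and the numbers are assumptions chosen purely for illustration. Note that the whole calculation is carried out under the assumption that the null hypothesis is true.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Probability of at least `heads` heads in `flips` flips, ASSUMING the
    null hypothesis (heads probability `p_null`) is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Hypothetical data: 15 heads in 20 flips of a supposedly fair coin.
p = one_sided_p_value(15, 20)
print(f"P(at least 15 heads | fair coin) = {p:.4f}")
```

The result comes out below 5%, so it would count as "significant," but nothing in the calculation says how likely it is that the coin is actually biased. That would require knowing how plausible a biased coin was to begin with.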

That distinction (which I’ve written about before) may sound like nitpicking, but it’s extremely important. Suppose that a pregnancy test is guaranteed to be 95% accurate. If I take that pregnancy test and get a positive result, it does not mean that there’s a 95% chance that I’m pregnant. Because there was a very low prior probability of my being pregnant (among other reasons, because I’m male), I’d be quite confident that that test result was a false positive.

When scientists quote P-values, they’re like the 95% accuracy quoted for the pregnancy test: both are probabilities of getting a given outcome (positive test result), assuming a given hypothesis (I’m pregnant). Oreskes (like all too many others) turns this into a statement about the probability of the hypothesis being true. You can’t do that without folding in information about the hypothesis’s prior probability (I’m unlikely to be pregnant, test or no test).
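The pregnancy-test argument is just Bayes’ theorem, and a rough sketch of the arithmetic may help. I am assuming here that "95% accurate" means the test is right 95% of the time both for people who are pregnant and for people who are not; the prior probabilities are hypothetical values picked for illustration, and the function name is my own.

```python
def posterior_given_positive(prior: float, sensitivity: float = 0.95,
                             specificity: float = 0.95) -> float:
    """Bayes' theorem: P(pregnant | positive test).

    sensitivity = P(positive | pregnant), specificity = P(negative | not pregnant).
    Both default to 0.95, which is an assumed reading of "95% accurate".
    """
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical priors: one chance in a thousand, and an even chance.
for prior in (0.001, 0.5):
    print(f"prior = {prior}: P(pregnant | positive) = "
          f"{posterior_given_positive(prior):.3f}")
```

With an even prior the positive result really does mean a 95% chance of pregnancy, but with a prior of one in a thousand the same positive result leaves the probability below 2%. The test is identical in both cases; only the prior differs.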

This is one reason that 95%-confidence results aren’t nearly as certain as people seem to think. Lots of scientific hypotheses are a priori not very likely, so even a 95%-confidence confirmation of them doesn’t mean all that much. Other phenomena, such as P-hacking and publication bias, mean that even fewer 95%-confidence results are true than you’d expect.
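Here is a small simulation sketch of that point. All of the numbers are assumptions for illustration, not measurements: I suppose that 10% of the hypotheses being tested are actually true, that a true effect reaches significance 80% of the time, and that a false one does so 5% of the time. The simulation contains no P-hacking or publication bias, so it is an optimistic case.

```python
import random


def fraction_of_significant_results_true(
    n_studies: int = 200_000,
    prior_true: float = 0.10,  # assumed share of tested hypotheses that are true
    power: float = 0.80,       # assumed chance a true effect reaches significance
    alpha: float = 0.05,       # the usual 5% threshold
    seed: int = 0,
) -> float:
    """Simulate many studies and return the fraction of "significant"
    results for which the hypothesis was actually true."""
    rng = random.Random(seed)
    true_hits = false_hits = 0
    for _ in range(n_studies):
        hypothesis_is_true = rng.random() < prior_true
        chance_of_significance = power if hypothesis_is_true else alpha
        if rng.random() < chance_of_significance:
            if hypothesis_is_true:
                true_hits += 1
            else:
                false_hits += 1
    return true_hits / (true_hits + false_hits)


print(f"{fraction_of_significant_results_true():.2f}")
```

Under these assumed numbers the expected fraction is 0.08 / (0.08 + 0.045) = 0.64, so roughly a third of the "95% confidence" results would be false before any other problems are taken into account.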

Oreskes says that scientists “practice a form of self-denial, denying themselves the right to believe anything that has not passed very high intellectual hurdles,” a description I’m happy to agree with, as an aspiration if not always a reality. Where she loses me is in suggesting that this is a bad thing.

She also claims that this posture of extreme skepticism is due to scientists’ fervent desire to distinguish their beliefs from religious beliefs. It’s possible that the latter claim could be justified (Oreskes is a historian of science, after all), but there’s not a hint of evidence or argument to support it in this article.

Oreskes’s article is misleading in another way:

We’ve all heard the slogan “correlation is not causation,” but that’s a misleading way to think about the issue. It would be better to say that correlation is not necessarily causation, because we need to rule out the possibility that we are just observing a coincidence.

This is at best a confusing way to think about the correlation-causation business, as it seems to suggest that the only two possibilities for explaining a correlation are coincidence and causation. This dichotomy is incorrect. There is a correlation between chocolate consumption and Nobel prizes. The correlation is not due to chance (the P-value is extremely low), but one cannot conclude that chocolate causes Nobel prizes (or vice versa).
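A toy simulation shows how a third possibility, a common cause, can produce a strong and highly significant correlation with no causal link in either direction. The hidden variable here is labeled "wealth," which is purely my assumption for the sake of the example; I am not claiming it is the actual explanation for the chocolate and Nobel data. All numbers and names are hypothetical.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n=20_000, seed=0):
    rng = random.Random(seed)
    # Hidden common cause (hypothetical).
    wealth = [rng.gauss(0, 1) for _ in range(n)]
    # Each quantity depends on wealth plus independent noise,
    # and neither one influences the other.
    chocolate = [w + rng.gauss(0, 0.5) for w in wealth]
    nobels = [w + rng.gauss(0, 0.5) for w in wealth]
    raw = pearson_r(chocolate, nobels)
    # Subtract the common cause from both and correlate what is left.
    adjusted = pearson_r(
        [c - w for c, w in zip(chocolate, wealth)],
        [m - w for m, w in zip(nobels, wealth)],
    )
    return raw, adjusted


raw, adjusted = simulate()
print(f"raw correlation: {raw:.2f}, after removing common cause: {adjusted:.2f}")
```

The raw correlation is large and, with this many samples, nowhere near a coincidence, yet it essentially vanishes once the common cause is removed.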


Published by

Ted Bunn

I am chair of the physics department at the University of Richmond. In addition to teaching a variety of undergraduate physics courses, I work on a variety of research projects in cosmology, the study of the origin, structure, and evolution of the Universe. University of Richmond undergraduates are involved in all aspects of this research. If you want to know more about my research, ask me!

5 thoughts on “If only scientists hyped marginal results more, our problems would be solved”

  1. The only context in which this would make sense is to fold in the results of not acting. For example, if some statistical test with the CMB has a p value of 0.05, I’m not going to get very excited about an “anomaly” or, without further evidence, see it as a hint of “new physics beyond the standard model” or whatever. On the other hand, if I get a tip that there is a 95% chance of a terrorist being on my plane, I certainly wouldn’t take that flight. I wouldn’t do it at 5% either. (Suppose I knew that someone smuggled a bomb onto one of 20 planes, one of which is mine. Would I fly? No.)

    (This reminds me of the story of the bad statistician who was caught trying to smuggle a bomb onto a plane. His defense was that he was trying to prevent terrorism: since the probability of a bomb on a plane is actually very low, the probability of two bombs on a plane must be negligible, and he wasn’t going to explode his.)


  2. Rather than make the blanket statement “there’s strong reason to believe that most published research findings are false” you should at least qualify the assertion by pointing out that your link refers to biomedical research.

  3. Dear Ted,
    Clearly you are very right in underlining one of the many misinterpretations of the p statistic.
    Yet, I feel you are too hard on Oreskes: she has a point in insisting on another fact, that the choice of .05 as the default threshold in null hypothesis statistical significance testing makes no sense.
    P<.05 as an accept/reject threshold is based on laziness and a misunderstanding of statistical reasoning. The threshold for P should be chosen, through decision theory, to minimise costs or maximise benefits. It can't be that in oncology (when people's lives are in danger) one uses the same threshold as in "not life-or-death" decisions. Anyway, despite not being a physicist, a mathematician, or a statistician, I feel p values in NHSST are almost "rubbish," as they are too often misunderstood (as you said) and are very counter-intuitive.

  4. I wish someone would enlighten me on this issue: a p-value is the probability of my test data (or “more extreme” data), under the assumption that the null hypothesis is TRUE: p-value = p(D|H0). As Ted said in another post, the question answerable by a p-value is not the good one, the latter being p(H0|D) (is my hypothesis supported by data?). Now NHSST is LOGICALLY a huge blunder, is it not? Because the whole thing is about answering the question p(H0|D) via a comparison of a p-value and an alpha (perfectly arbitrary threshold) value. But the p-value was calculated by ASSUMING that H0 was true! How come now I would want to use the p-value [whose calculation once again depends on the truth of H0] for deciding whether or not the H0 assumption is true? There is not only a transposed conditional fallacy at play here; there is also a fundamental logical inconsistency. In the elementary logic I am familiar with, one cannot use a “conclusion” to prove or disprove a “premise”; it is “illegal” as a form of reasoning. So please clarify… Thank you greatly.

Comments are closed.