This piece does a good job of explaining some of the dangers in the use and interpretation of statistics in scientific studies. It’s mostly about ways in which the statistical results quoted in the scientific literature can be misleading (to scientists as well as to the general public).

The article chiefly attacks the use of 95% confidence results (i.e., results that reject the null hypothesis with p=0.05 or less) as indications that something has been scientifically established. It does a good job laying out several related problems with this attitude:

- 95% isn’t all that high. If you do a bunch of such tests on effects that aren’t really there, you get a false positive about one time out of every 20. Lots of scientists are out there doing lots of tests, so there are bound to be lots of false positives.
- Sometimes even a single study will perform many comparisons, each of which could yield a positive result. In that case, the probability of getting false positives goes up very rapidly.
- Of course we hear about positive results, not the negative ones. The result is that a lot (much more than 5%) of the results you hear about are false positives.
- People, even lots of scientists, misunderstand what the probabilities here refer to. When a test is done that has a p-value of 5% (often referred to as a 95% confidence result), they think that it means that there’s a 95% chance the hypothesis being tested is correct. In fact, it means that there’s a 5% chance that the test would have come out the way it did if the hypothesis were false. That is, they’re probabilities about the *possible results of the test*, not probabilities about *the ideas being tested*. That distinction seems minor, but it’s actually hugely important.
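
The arithmetic behind the first three points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tests are taken to be independent, and the values `prior_true=0.1` (the fraction of tested hypotheses that are actually true) and `power=0.5` (the chance a real effect is detected) are made up for the example, not taken from the article.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent tests
    when none of the effects being tested is real."""
    return 1.0 - (1.0 - alpha) ** n_tests


def false_share_of_positives(prior_true: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of positive results that are false positives.

    prior_true and power are hypothetical inputs chosen for illustration.
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


for n in (1, 5, 20, 100):
    rate = familywise_error_rate(n)
    print(f"{n:>3} tests: P(at least one false positive) = {rate:.3f}")

share = false_share_of_positives(prior_true=0.1, power=0.5)
print(f"share of positive results that are false: {share:.2f}")
```

With these made-up inputs, nearly half of the positive results are false even though every individual test uses the 5% threshold. If only the positive results get reported, that fraction is what readers actually see.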

If you don’t think that that last distinction matters, imagine the following scenario. Your doctor gives you a test for a deadly disease, and the test is 99% accurate. If you get a positive result, it does *not* mean there’s a 99% chance you have the disease. Box 4 in the article works through a numerical example of this.
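
Here is a minimal sketch of that calculation using Bayes’ theorem. The numbers are my own illustrative choices, not the ones in Box 4: I’m assuming “99% accurate” means that both the sensitivity and the specificity are 0.99, and the prevalence values are hypothetical.

```python
def prob_disease_given_positive(prevalence: float,
                                sensitivity: float = 0.99,
                                specificity: float = 0.99) -> float:
    """Bayes' theorem: P(disease | positive test).

    The defaults encode the assumption that "99% accurate" means
    99% sensitivity and 99% specificity.
    """
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical prevalences: the rarer the disease, the less a positive means.
for prevalence in (0.5, 0.01, 0.001):
    posterior = prob_disease_given_positive(prevalence)
    print(f"prevalence {prevalence:>5}: P(disease | positive) = {posterior:.3f}")
```

If one person in a hundred has the disease, a positive result from this test means only a 50% chance of having it. At one in a thousand, the chance drops to about 9%.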

As the article puts it,

> Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the .05 level: *"This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance."*
>
> That interpretation commits an egregious logical error (technical term: "transposed conditional"): confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result. A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry. A dog may bark 5 percent of the time even if it is well-fed all of the time.

This is exactly right, and it’s a very important distinction.

The specific cases discussed in the article mostly have to do with medical research. I know very little about the cultural attitudes in that discipline, so it’s hard for me to judge some things that are said. The article seems (as I read it) to imply that lots of people, including scientists, regard a 95% confidence result as meaning that something is pretty well established as true. If that’s correct, then lots of people are out there believing lots of wrong things. A 95% confidence result should be regarded as an interesting hint that something *might* be true, leading to new hypotheses and experiments that will either confirm or refute it. Something’s not well-established until its statistical significance is way better than that.

Let me repeat: I have no idea whether medical researchers really do routinely make that error. The article seems to me to suggest that they do, but I have no way of telling whether it’s right. It certainly is true that science journalism falls into this trap with depressing regularity, though.

Since I don’t know much about medical research, let me comment on a couple of ways this stuff plays out in physics.

- In astrophysics we do quote 95% confidence results quite often, although we also use other confidence levels. Most of the time, I think, other researchers correctly adopt the interesting-hint attitude towards such results. In particle physics, they’re often quite strict in their use of terminology: a particle physicist would never claim a “detection” of a particle based on a mere 95% confidence result. I think that their usual threshold for use of that magic word is either 4 or 5 sigma (for normally distributed errors), which means roughly 99.99% or 99.9999% confidence.
- The multiple-tests problem, on the other hand, can be serious in physics. One way, which I’ve written about before, is in the various “anomalies” that people have claimed to see in the microwave background data. A bunch of these anomalies show up in statistical tests at 95% to 99% confidence. But we don’t have any good way of assessing how many tests have been done (many, especially those that yield no detection, aren’t published), so it’s hard to tell how to interpret these results.
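
For reference, here is a short sketch of how a sigma threshold translates into a confidence level for normally distributed errors. I’m assuming the two-sided convention (a deviation of at least that many sigma in either direction); a one-sided convention would halve the tail probabilities.

```python
import math


def two_sided_p_value(n_sigma: float) -> float:
    """Probability that a normally distributed quantity lands at least
    n_sigma standard deviations from its mean, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))


for n_sigma in (1.96, 3.0, 4.0, 5.0):
    p = two_sided_p_value(n_sigma)
    print(f"{n_sigma:>4} sigma: p = {p:.3e}, confidence = {100.0 * (1.0 - p):.5f}%")
```

Under this convention, 1.96 sigma corresponds to the familiar 95%, 4 sigma to about 99.994%, and 5 sigma to about 99.99994%.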

Although the Science News article is mostly right, I do have some complaints. The main one is just that it is overwrought from time to time:

> It's science's dirtiest secret: The "scientific method" of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions.

Nonsense. It’s certainly true that people mis- and overinterpret statistical statements all the time, both in the general-interest press and in the scholarly literature, but that doesn’t mean that the tools themselves are invalid. If I use a hammer to open a beer bottle, I’ll get bad results, but it’s not the hammer’s fault.

By the way, the “mutually inconsistent philosophies” here seem at one point to refer to the quite obscure difference between Fisher’s approach and that of Neyman and Pearson, and later to be the somewhat less obscure difference between frequentists and Bayesians. Either way, “mutually inconsistent” and “offer no meaningful basis” are huge exaggerations.

(Lots of people seem to think that such clash-of-the-titans language is correct when applied to frequentists vs. Bayesians, but I think that’s wrong. When it comes to statistical methods, as opposed to the pure philosophy of probability, the two approaches are simply different sets of tools, not irreconcilable ways of viewing the world. People can and do use tools from both boxes.)

The concluding few paragraphs of the article take this hand-wringing to absurd lengths. For the record, it is absolutely not true, as a quotation from the author David Salsburg claims, that the coexistence of Bayesian and frequentist attitudes to statistics means that the whole edifice “may come crashing down from the weight of its own inconsistencies.” The problems described in the article are real, but they’re cultural problems in the way people communicate and talk about their results, not problems in the philosophy of probability.

One other technical problem: the article suggests that randomized clinical trials have a problem because they don’t guarantee that all relevant characteristics are equally split between the trial and control groups:

> Randomization also should ensure that unknown differences among individuals are mixed in roughly the same proportions in the groups being tested. But statistics do not guarantee an equal distribution any more than they prohibit 10 heads in a row when flipping a penny. With thousands of clinical trials in progress, some will not be well randomized.

This is true but is not a problem. The possibility that chance leaves the two groups unevenly matched is automatically accounted for in the statistical analysis (i.e., the p-values) that results from the study.
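
A toy simulation illustrates the point. Everything in it is a hypothetical setup of my own, not something from the article: each participant has an unmeasured trait (present 30% of the time) that shifts the outcome by 2 units, the treatment does nothing at all, and the p-value comes from a simple two-sample test using the normal approximation. Chance imbalance in the hidden trait does occur from trial to trial, yet the fraction of trials that reach p < 0.05 stays near 5%.

```python
import math
import random
import statistics


def null_trial_p_value(rng: random.Random, n_per_group: int = 50) -> float:
    """Simulate one randomized trial in which the treatment has no effect
    and return its two-sided p-value (normal approximation)."""
    outcomes = []
    for _ in range(2 * n_per_group):
        hidden_trait = rng.random() < 0.3  # unmeasured characteristic (assumed 30%)
        outcomes.append(2.0 * hidden_trait + rng.gauss(0.0, 1.0))
    rng.shuffle(outcomes)  # random assignment to the two groups
    treated, control = outcomes[:n_per_group], outcomes[n_per_group:]
    diff = statistics.mean(treated) - statistics.mean(control)
    std_err = math.sqrt(statistics.variance(treated) / n_per_group
                        + statistics.variance(control) / n_per_group)
    return math.erfc(abs(diff / std_err) / math.sqrt(2.0))


def false_positive_rate(n_trials: int = 2000, alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of no-effect trials that nonetheless reach p < alpha."""
    rng = random.Random(seed)
    hits = sum(null_trial_p_value(rng) < alpha for _ in range(n_trials))
    return hits / n_trials


print(f"false positive rate across simulated trials: {false_positive_rate():.3f}")
```

The badly randomized trials are exactly the ones that make up that 5%: the p-value already counts them.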