I’m a little late getting to this, but in case you haven’t heard about it, here it is.
A journalist named John Bohannon did a stunt recently in which he and some coauthors “published” a “study” that “showed” that chocolate caused weight loss. (The reasons for the scare quotes will become apparent.) The study was picked up by a bunch of news outlets. Bohannon then wrote about the whole thing on io9, and the story has since been picked up in a number of other places, including the BBC radio program More or Less (which I’ve mentioned before a few times).
The idea was to do a study that was shoddy in precisely the ways that many “real” studies are, get it published in a low-quality journal, and see if they could get it picked up by credulous journalists.
My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.
There is an interesting question of journalistic ethics here. Bohannon calls himself a journalist, but he deliberately introduced bad science into the mediasphere with the specific intent of deception. Is it OK for a journalist to do that, if his motives are pure? I don’t know.
I don’t want to focus on that sort of issue, because I don’t have anything non-obvious to say. Instead, I want to dig a bit into the details of what Bohannon et al. did. Although Bohannon’s io9 post is well worth reading and gets the big picture largely right, it’s wrong or misleading in a few ways, which happen to be the sort of thing I care about.
Bohannon et al. recruited a group of subjects and divided them into three groups: a control group, a group that was put on a low-carb diet, and a group that was put on a low-carb diet but also told to eat a certain amount of chocolate each day. The chocolate group lost weight faster than the other groups. The result was “statistically significant,” in the usual meaning of that term — the p-value was below 0.05.
So what was wrong with this study? As Bohannon explains it in his io9 post,
Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.
…
The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?
P(winning) = 1 − (1 − p)^n
With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.
It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works.” Or they drop “outlier” data points.
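Just to check the arithmetic in that quote, here’s a quick Python sketch. It’s a toy version, not their actual analysis: I’ve assumed a simple two-group split of 15 subjects and 18 independent, purely random measurements, whereas the real study had three groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Analytic version: the chance of at least one "winning ticket"
# among 18 independent tests, each with a 5% false-positive rate.
p, n_tests = 0.05, 18
print(1 - (1 - p) ** n_tests)  # about 0.60

# Monte Carlo version: 15 subjects split into two groups, 18 independent
# measurements of pure noise (no real effect), a t-test on each one.
def any_false_positive(n_subjects=15, n_measurements=18):
    group = rng.permutation(n_subjects) < n_subjects // 2
    for _ in range(n_measurements):
        x = rng.normal(size=n_subjects)  # noise only
        if stats.ttest_ind(x[group], x[~group]).pvalue < 0.05:
            return True
    return False

n_trials = 10_000
print(sum(any_false_positive() for _ in range(n_trials)) / n_trials)  # also about 0.60
```

Each individual test behaves exactly as advertised; the trouble comes entirely from running 18 of them and reporting whichever one “wins.”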
Sadly, even in this piece, whose purpose is to debunk bad statistics, Bohannon repeats an incredibly common error. A p-value of 0.05 does not mean that “there is just a 5 percent chance that your result is a random fluctuation.” It means that, if you assume nothing but random fluctuations are at work, there’s a 5% chance of getting results as extreme as the ones you got. A (frequentist) p-value is incapable of telling you anything about the probability of any given hypothesis (such as “your result is a random fluctuation”).
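If the distinction sounds like hair-splitting, a toy simulation shows why it isn’t. In the sketch below, I’ve invented a world in which 10% of tested hypotheses are real and the real effects have a particular size; both numbers are pure assumptions for illustration, and nothing about them comes from the chocolate study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments, n_per_group = 20_000, 50
is_real = rng.random(n_experiments) < 0.10   # assumed: 10% of hypotheses are real
shift = np.where(is_real, 0.5, 0.0)          # assumed effect size when real

pvals = np.empty(n_experiments)
for i in range(n_experiments):
    a = rng.normal(size=n_per_group)
    b = rng.normal(loc=shift[i], size=n_per_group)
    pvals[i] = stats.ttest_ind(a, b).pvalue

significant = pvals < 0.05

# What the threshold actually controls: when nothing but random
# fluctuation is at work, about 5% of results come out "significant".
print(significant[~is_real].mean())      # about 0.05

# What it does not tell you: the chance that a given "significant"
# result is a random fluctuation. That depends on how often real
# effects occur, and it's much bigger than 5% in this toy world.
print((~is_real[significant]).mean())    # roughly 0.4 here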
One other quibble: the parenthetical remark about the measurements not being independent is literally true but misleading. The fact that the measurements aren’t independent means that the probability of a false positive “could be even higher”, but it could also be lower. In fact, the latter seems more likely to me. (The probability goes down if the measurements are positively correlated with each other, and up if they’re anticorrelated.)
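Here’s a rough illustration of that point (the correlation structure is a made-up stand-in, not anything estimated from the actual measurements): give all 18 null test statistics the same pairwise correlation and see how often at least one of them crosses the 0.05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def chance_of_any_false_positive(rho, n_tests=18, n_trials=50_000):
    # 18 standard-normal test statistics with pairwise correlation rho,
    # all generated under the null hypothesis (no real effect anywhere).
    cov = np.full((n_tests, n_tests), rho)
    np.fill_diagonal(cov, 1.0)
    z = rng.multivariate_normal(np.zeros(n_tests), cov, size=n_trials)
    pvals = 2 * stats.norm.sf(np.abs(z))
    return (pvals < 0.05).any(axis=1).mean()

print(chance_of_any_false_positive(rho=0.0))  # about 0.60, the independent case
print(chance_of_any_false_positive(rho=0.8))  # noticeably lower, not higher
```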
The other thing that’s worth focusing on is the number of subjects in the study, which was incredibly small (15 across all three groups). Bohannon suggests (in the first sentence quoted above) that this is part of the reason they got a false positive, and other pieces I’ve read on this say the same thing. But it’s not true. The reason they got a false positive was p-hacking (buying many lottery tickets), which would have worked just as well with a larger number of subjects. If you had more subjects, the random fluctuations would have gotten smaller, but the level of fluctuation required for statistical “significance” would have gone down as well. By definition, the odds of any one “lottery ticket” winning are 5%, whether you have a lot of subjects or a few.
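You can check that directly: simulate a single null measurement on a two-group split and count how often it comes out “significant,” once with 15 subjects and once with 100 times as many (the specific sizes are arbitrary; this is a sketch, not a reconstruction of their trial).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def false_positive_rate(n_subjects, n_trials=20_000):
    group = np.arange(n_subjects) < n_subjects // 2
    hits = 0
    for _ in range(n_trials):
        x = rng.normal(size=n_subjects)  # no real effect at all
        if stats.ttest_ind(x[group], x[~group]).pvalue < 0.05:
            hits += 1
    return hits / n_trials

print(false_positive_rate(15))     # about 0.05
print(false_positive_rate(1500))   # still about 0.05
```

The fluctuations themselves shrink like 1/√n, but the bar for “significance” shrinks right along with them, so each ticket pays off 5% of the time regardless.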
It’s true that with fewer subjects the effect size (i.e., the number of extra pounds lost, on average) is likely to be larger, since only a big fluctuation can clear the significance bar when the sample is that small, but the published article went to great lengths to downplay the effect size (e.g., not mentioning it in the abstract, which is often all anyone reads).
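The same kind of toy simulation shows the effect-size point: among the runs that come out “significant” by pure luck, the apparent group difference is much bigger when the sample is small (the units here are arbitrary, so read the two numbers only relative to each other).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def mean_spurious_effect(n_subjects, n_trials=50_000):
    # Average absolute group difference among the runs that happen to
    # reach p < 0.05, even though there is no real effect at all.
    group = np.arange(n_subjects) < n_subjects // 2
    diffs = []
    for _ in range(n_trials):
        x = rng.normal(size=n_subjects)
        if stats.ttest_ind(x[group], x[~group]).pvalue < 0.05:
            diffs.append(abs(x[group].mean() - x[~group].mean()))
    return float(np.mean(diffs))

print(mean_spurious_effect(15))    # a big apparent "effect"
print(mean_spurious_effect(150))   # a much smaller one
```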
Let me repeat that I think that Bohannon’s description of what he did is well worth reading and has a lot that’s right at the macro-scale, even though I wish that he’d gotten the above details right.