## Another strange NY Times article about probability

Oddly enough, I still read the Sunday New York Times on paper. As a result, I was extremely confused by George Johnson’s article headlined “Gamblers, Scientists, and the Mysterious Hot Hand.”

The heart of the article is this claim:

In a study that appeared this summer, Joshua B. Miller and Adam Sanjurjo suggest why the gambler’s fallacy remains so deeply ingrained. Take a fair coin — one as likely to land on heads as tails — and flip it four times. How often was heads followed by another head? In the sequence HHHT, for example, that happened two out of three times — a score of about 67 percent. For HHTH or HHTT, the score is 50 percent.

Altogether there are 16 different ways the coins can fall. I know it sounds crazy but when you average the scores together the answer is not 50-50, as most people would expect, but about 40-60 in favor of tails.

Maybe it’s just me, but I couldn’t make sense of this claim. The online version has a graphic that clears it up. The last sentence is literally correct (with the possible exception of the phrase “as most people would expect,” as I’ll explain), but on its own I couldn’t manage to parse it.

Here’s my summary of exactly what is being said:

Suppose you flip a coin four times. Every time heads comes up, you look at the next flip and see if it’s heads or tails. (Of course, you can’t do this if heads comes up on the last flip, since there is no next flip.) You write down the fraction of the time that it came up heads. For instance, if the coin flips went HHTH, you’d write down 1/2, because the first H was followed by an H, but the second H was followed by a T.

You then repeat the procedure many times (each time using a sequence of four coin flips). You average together all the results you get. The average comes out less than 1/2.

I guess that might be a counterintuitive result. Maybe. Personally, I find the described procedure so baroque that I’m not sure I would have had any intuition at all as to what the result should be. Hence my skepticism about the “as most people would expect” phrase. I think that if you took a survey, you’d get something like this:

(And, by the way, I don’t mean this as an insulting remark about the average person’s mathematical skills: I think I would have been in the 90%.)

The reason you don’t get 50% from this procedure is that it weights the outcomes of individual flips unevenly. For instance, whenever HHTH shows up, the HH gets an effective weight of 1/2 (because it’s averaged together with the HT). But in each instance of HHHH, each of the three HH’s gets an effective weight of 1/3 (because there are three of them in that sequence). The correct averaging procedure is to count all the individual instances of the things you’re looking for (HH’s), not to group them by sequence, average within each sequence, and then average those averages.
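The effect is easy to check by brute force. Here’s a quick sketch (my code, not anything from the paper) that enumerates all 16 four-flip sequences and applies the averaging procedure described above:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 16 sequences of four coin flips.
scores = []
for seq in product("HT", repeat=4):
    # For each H in the first three flips, record whether the next flip is H.
    followers = [seq[i + 1] for i in range(3) if seq[i] == "H"]
    if followers:  # skip TTTH and TTTT, which have nothing to average
        scores.append(Fraction(followers.count("H"), len(followers)))

# Average the per-sequence scores (the "average of averages").
avg = sum(scores) / len(scores)
print(avg, float(avg))  # 17/42, about 0.405 -- not 1/2
```

The exact answer is 17/42 ≈ 0.405, which is where the article’s “about 40-60 in favor of tails” comes from.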

My question is whether the average-of-averages procedure described in the article actually corresponds to anything that any actual human would do.

The paper Johnson cites (which, incidentally, is a non-peer-reviewed working paper) makes grandiose claims about this result. For one thing, it is supposed to explain the “gambler’s fallacy.” Also, supposedly some published analyses of the “hot hands” phenomenon in various sports are incorrect because the authors used an averaging method like this.

At some point, I’ll look at the publications that supposedly fall prey to this error, but I have to say that I find all this extremely dubious. It doesn’t seem at all likely to me that that bizarre averaging procedure corresponds to people’s intuitive notions of probability, nor does it seem likely that a statistician would use such a method in a published analysis.

## Replication in psychology

A few quick thoughts on the news that many published results in psychology can’t be replicated.

First, good for the researchers for doing this!

Second, I’ve read that R.A. Fisher, who was largely responsible for introducing the notion of using p-values to test for statistical significance, regarded the standard 5% level as merely an indication that something interesting might be going on, not as a definitive detection of an effect (although other sources seem to indicate that his views were more complicated than that). In any case, whether or not that’s what Fisher thought, it’s a good way to think of things. If you see a hypothesis confirmed with a 5% level of significance, you should think, “Hmm. Somebody should do a follow-up to see if this interesting result holds up,” rather than “Wow! It must be true, then.”

Finally, a bit of bragging about my own discipline. There’s plenty of bad work in physics, but I suspect that the particular problems that this study measured are not as bad in physics. The main reason is that in physics we do publish, and often even value, null results.

Take, for instance, attempts to detect dark matter particles. No one has ever done it, but the failed attempts to do so are not only publishable but highly respected. Here is a review article on the subject, which includes the following figure:

Every point in here is an upper limit — a record of a failed attempt to measure the number of dark matter particles.

I suspect that part of the reason we do this in physics is that we often think of our experiments primarily as measuring numbers, not testing hypotheses. Each dark matter experiment can be thought of as an attempt to measure the density of dark matter particles. Each measurement has an associated uncertainty. So far, all of those measurements have included the value zero within their error bars — that is, they have no statistically significant detection, and can’t rule out the null hypothesis that there are no dark matter particles. But if the measurement is better than previous ones — if it has smaller errors — then it’s valued.

## 538 on p-hacking

Christie Aschwanden has a piece on fivethirtyeight.com about the supposed “crisis” in science. She writes about recent high-profile results that turned out to be wrong, flaws in peer review, and bad statistics in published papers.

By far the best part of the article is the applet that lets you engage in your own p-hacking. It’s a great way to illustrate what p-hacking is and why it’s a problem. The idea is to take a bunch of data on performance of the US economy over time, and examine whether it has done better under Democrats or Republicans. There are multiple different measures of the economy one might choose to focus on, and multiple ways one might quantify levels of Democratic or Republican power. The applet lets you make different choices and determines whether there’s a statistically significant effect. By fiddling around for a few minutes, you can easily get a “significant” result in either direction.

Go and play around with it for a few minutes.

The rest of the article has some valuable observations, but it’s a bit of a hodgepodge. Curmudgeon that I am, I have to complain about a couple of things.

Here’s a longish quote from the article:

P-hacking is generally thought of as cheating, but what if we made it compulsory instead? If the purpose of studies is to push the frontiers of knowledge, then perhaps playing around with different methods shouldn’t be thought of as a dirty trick, but encouraged as a way of exploring boundaries. A recent project spearheaded by Brian Nosek, a founder of the nonprofit Center for Open Science, offered a clever way to do this.

Nosek’s team invited researchers to take part in a crowdsourcing data analysis project. The setup was simple. Participants were all given the same data set and prompt: Do soccer referees give more red cards to dark-skinned players than light-skinned ones? They were then asked to submit their analytical approach for feedback from other teams before diving into the analysis.

Twenty-nine teams with a total of 61 analysts took part. The researchers used a wide variety of methods, ranging — for those of you interested in the methodological gore — from simple linear regression techniques to complex multilevel regressions and Bayesian approaches. They also made different decisions about which secondary variables to use in their analyses.

Despite analyzing the same data, the researchers got a variety of results. Twenty teams concluded that soccer referees gave more red cards to dark-skinned players, and nine teams found no significant relationship between skin color and red cards.

The variability in results wasn’t due to fraud or sloppy work. These were highly competent analysts who were motivated to find the truth, said Eric Luis Uhlmann, a psychologist at the Insead business school in Singapore and one of the project leaders. Even the most skilled researchers must make subjective choices that have a huge impact on the result they find.

But these disparate results don’t mean that studies can’t inch us toward truth. “On the one hand, our study shows that results are heavily reliant on analytic choices,” Uhlmann told me. “On the other hand, it also suggests there’s a there there. It’s hard to look at that data and say there’s no bias against dark-skinned players.” Similarly, most of the permutations you could test in the study of politics and the economy produced, at best, only weak effects, which suggests that if there’s a relationship between the number of Democrats or Republicans in office and the economy, it’s not a strong one.

The last paragraph is simply appalling. This is precisely the sort of conclusion you can’t draw. Some methods got marginally “significant” results — if you define “significance” by the ridiculously weak 5% threshold — and others didn’t. The reason p-hacking is a problem is that people may be choosing their methods (either consciously or otherwise) to lead to their preferred conclusion. If that’s really a problem, then you can’t draw any valid conclusion from the fact that these analyses tended to go one way.

As long as I’m whining, there’s this:

Take, for instance, naive realism — the idea that whatever belief you hold, you believe it because it’s true.

Naive realism means different things to psychologists and philosophers, but this isn’t either of them.

Anyway, despite my complaining, there’s some good stuff in here.

## How strong is the scientific consensus on climate change?

There is overwhelming consensus among scientists that climate change is real and is caused largely by human activity. When people want to emphasize this fact, they often cite the finding that 97% of climate scientists agree with the consensus view.

I’ve always been dubious about that claim, but in the opposite way from the climate-change skeptics: I find it hard to believe that the figure can be as low as 97%. I’m not a climate scientist myself, but whenever I talk to one, I get the impression that the consensus is much stronger than that.

97% may sound like a large number, but a scientific paradigm strongly supported by a wide variety of forms of evidence will typically garner essentially 100% consensus among experts. That’s one of the great things about science: you can gather evidence that, for all practical purposes, completely settles a question. My impression was that human-caused climate change was pretty much there.

So I was interested to see this article by James Powell, which claims that the 97% figure is a gross understatement. The author estimates the actual figure to be well over 99.9%. A few caveats: I haven’t gone over the paper in great detail, this is not my area of expertise, and the paper is still under peer review. But I find this result quite easy to believe, both because the final number sounds plausible to me and because the main methodological point in the paper strikes me as unquestionably correct.

Powell points out that the study that led to the 97% figure was derived in a way that excluded a large number of articles from consideration:

Cook et al. (2013) used the Web of Science to review the titles and abstracts of peer-reviewed articles from 1991-2011 with the keywords “global climate change” and “global warming.” With no reason to suppose otherwise, the reader assumes from the title of their article, “Quantifying the consensus on anthropogenic global warming [AGW] in the scientific literature,” that they had measured the scientific consensus on AGW using consensus in its dictionary definition and common understanding: the extent to which scientists agree with or accept the theory. But that is not how CEA used consensus.

Instead, CEA defined consensus to include only abstracts in which the authors “express[ed] an opinion on AGW (p. 1).” No matter how clearly an abstract revealed that its author accepts AGW, if it did “not address or mention the cause of global warming (p. 3),” CEA classified the abstract as having “no position” and omitted it from their calculation of the consensus. Of the 11944 articles in their database, 7930 (66.4%) were labeled as taking no position. If AGW is the ruling paradigm of climate science, as CEA set out to show, then rather than having no position, the vast majority of authors in that category must accept the theory. I return to this point below.
CEA went on to report that “Among abstracts expressing a position on AGW, 97.1% endorsed the consensus position that humans are causing global warming (p. 6, italics added).” Thus, the now widely adopted “97% consensus” refers not to what scientists accept, the conventional meaning, but to whether they used language that met the more restrictive CEA definition.

This strikes me as a completely valid critique. When there is a consensus on something, people tend not to mention it at all. Articles in physics journals pretty much never state explicitly in the abstract that, say, quantum mechanics is a good theory, or that gravity is what makes things fall.

I particularly recommend the paper’s section on plate tectonics as an explicit example of how the above methodology would lead to a false conclusion.

Let me be clear: I haven’t studied this paper enough to vouch for the complete correctness of its methodology. But it looks to me like the author has made a convincing case that the 97% figure is a great understatement.

## Prescription: retire the words “prescriptivist” and “descriptivist”

If, like me, you like arguments about English grammar and usage, you frequently come across the distinction between “prescriptivists” and “descriptivists.” Supposedly, prescriptivists are people who think that grammatical rules are absolute and unchanging. Descriptivists, on the other hand, supposedly think that there’s no use in rules that tell people how they should speak and write, and that all that matters is the way people actually do speak and write. As the story goes, descriptivists regard prescriptivists as persnickety schoolmarms, while prescriptivists regard descriptivists as people with no standards who are causing the language to go to hell in a handbasket.

In fact, of course, these terms are mostly mere invective. If you call someone either of these names, you’re almost always trying to tar that person with an absurd set of beliefs that nobody actually holds.

A prescriptivist, in the usual telling, is someone who thinks that all grammatical rules are set in stone forever. (Anyone who actually believed this would presumably have to speak only in Old English, or better yet, proto-Indo-European.) A descriptivist, on the other hand, is supposedly someone who thinks that all ways of writing and speaking are equally valid, and that we therefore shouldn’t teach students to write standard English grammar.

I can’t claim that there is nobody who believes either of these caricatures — I’m old enough to have learned that there’s someone out there who believes pretty much any foolish thing you can think of — but I do claim that practically nobody you’re likely to encounter when wading into a usage debate fits them.

I was reminded of all this when listening to an interview with Bryan Garner, author of Modern American Usage, which is by far the best usage guide I know of. Incidentally, I’m not the only one in my household who likes Garner’s book: my dog Nora also devoured it.

Like many people, I first learned about Garner in David Foster Wallace’s essay “Authority and American Usage,” which originally appeared in Harper’s and was reprinted in his book Consider the Lobster. Wallace’s essay is very long and shaggy, but it’s quite entertaining (if you like that sort of thing) and has some insightful observations. The essay is nominally a review of Garner’s usage guide, but it turns into a meditation on the nature of grammatical rules and their role in society and education.

From his book, Garner strikes me as a clear thinker about lexicographic matters, so I was disappointed to hear him go in for the most simpleminded straw-man caricatures of the hated descriptivists in that interview.

Garner’s scorn is mostly reserved for Steven Pinker. Pinker is pretty much the arch-descriptivist, in the minds of those people for whom that is a term of invective. But that hasn’t stopped him from writing a usage guide, in which he espouses some (but not all) of the standard prescriptivist rules. Pinker’s and Garner’s approaches actually have something in common: both try to give reasons for the various rules they advocate, rather than simply issuing fiats. But because Garner thinks of Pinker as a loosey-goosey descriptivist, he can’t bring himself to engage with Pinker’s actual arguments.

Garner says that Pinker has “flip-flopped,” and that his new book is “a confused book, because he’s trying to be prescriptivist while at the same time being descriptivist.” As it turns out, what he means by this is that Pinker has declined to nestle himself into the pigeonhole that Garner has designated for him. I’ve read all of Pinker’s general-audience books on language — his first such book, The Language Instinct, may be the best pop-science book I’ve ever read — and I don’t see the new one as contradicting the previous ones. Pinker has never espoused the straw-man position that all ways of writing are equally good, or that there’s no point in trying to teach people to write more effectively. Garner thinks that that’s what a descriptivist believes, and so he can’t be bothered to check.

The Language Instinct has a chapter on “language mavens,” which is the place to go for Pinker’s views on prescriptive grammatical rules. (That chapter is essentially reproduced as an essay published in the New Republic.) Garner has evidently read this chapter, as he mockingly summarizes Pinker’s view as “You shouldn’t criticize the way people use language any more than you should criticize how whales emit their moans,” which is a direct reference to an analogy found in this chapter. But he either deliberately or carelessly misleads the reader about its meaning.

Pinker is not saying that there are no rules that can help people improve their writing. Rather, he’s making the simple but important point that scientists are more interested in studying language as a natural system (how people do talk) than in prescribing how they should talk.

So there is no contradiction, after all, in saying that every normal person can speak grammatically (in the sense of systematically) and ungrammatically (in the sense of nonprescriptively), just as there is no contradiction in saying that a taxi obeys the laws of physics but breaks the laws of Massachusetts.

Pinker is a descriptivist because, as a scientist, he’s more interested in the first kind of rules than the second kind. It doesn’t follow from this that he thinks the second kind don’t or shouldn’t exist. A physicist is more interested in studying the laws of physics than the laws of Massachusetts, but you can’t conclude from this that he’s an anarchist.

(Pinker does unleash a great deal of scorn on the language mavens, not for saying that there are guidelines that can help you improve your writing, but for saying specific stupid things, which he documents thoroughly and, in my opinion, convincingly.)

Although Pinker is the only so-called descriptivist Garner mentions by name, he does tar other people by saying, “There was this view, in the mid-20th century, that we should not try to change the dialect into which somebody was born.” He doesn’t indicate who those people were, but my best guess is that this is a reference to the controversy that arose over Webster’s Third in the early 1960s. If so, it sounds as if Garner (like Wallace) is buying into a mythologized version of that controversy.

It seems to me that the habit of lumping people into the “prescriptive” and “descriptive” categories is responsible for Garner’s inability to pay attention to what Pinker et al. are actually saying (and for various other silly things he says in this interview). All sane people agree with the prescriptivists that some ways of writing are more effective than others and that it’s worthwhile to try to teach people what those ways are. All sane people agree with the descriptivists that some specific things written by some language “experts” are stupid, and that at least some prescriptive rules are mere shibboleths, signaling membership in an elite group rather than enhancing clarity. All of the interesting questions arise after you acknowledge that common ground, but if you start by dividing people up according to a false dichotomy, you never get to them.

Hence my prescription.

## It’s still not rocket science

In the last couple of days, I’ve seen a little flareup of interest on social media in the “reactionless drive” that supposedly generates thrust without expelling any sort of propellant. This was impossible a year ago, and it’s still impossible.

OK, it’s not literally impossible in the mathematical sense, but it’s close enough. Such a device would violate the law of conservation of momentum, which is an incredibly well-tested part of physics. Any reasonable application of reasoning (or as some people insist on calling it, Bayesian reasoning) says, with overwhelmingly high probability, that conservation of momentum is right and this result is wrong.

Extraordinary claims require extraordinary evidence, never believe an experiment until it’s been confirmed by a theory, etc.

The reason for the recent flareup seems to be that another group has replicated the original group’s results. They actually do seem to have done a better job. In particular, they did the experiment in a vacuum. Bizarrely, the original experimenters went to great lengths to describe the vacuum chamber in which they did their experiment, and then noted, in a way that was easy for a reader to miss, that the experiments were done “at ambient pressure.” That’s important, because stray air currents were a plausible source of error that could have explained the tiny thrust they found.

The main thing to note about the new experiment is that they are appropriately circumspect in describing their results. In particular, they make clear that what they’re seeing is almost certainly some sort of undiagnosed effect of ordinary (momentum-conserving) physics, not a revolutionary reactionless drive.

We identified the magnetic interaction of the power feeding lines going to and from the liquid metal contacts as the most important possible side-effect that is not fully characterized yet. Our test campaign can not confirm or refute the claims of the EMDrive …

Just because I like it, let me repeat what my old friend John Baez said about the original claim a year ago. The original researchers speculated that they were seeing some sort of effect due to interactions with the “quantum vacuum virtual plasma.” As John put it,

“Quantum vacuum virtual plasma” is something you’d say if you failed a course in quantum field theory and then smoked too much weed.

## I’ll take Bayes over Popper any day

A provocative article appeared on the arxiv last month:

### Inflation, evidence and falsifiability

#### Giulia Gubitosi, Macarena Lagos, Joao Magueijo, Rupert Allison

(Submitted on 30 Jun 2015)
In this paper we consider the issue of paradigm evaluation by applying Bayes’ theorem along the following nested chain of progressively more complex structures: i) parameter estimation (within a model), ii) model selection and comparison (within a paradigm), iii) paradigm evaluation … Whilst raising no objections to the standard application of the procedure at the two lowest levels, we argue that it should receive an essential modification when evaluating paradigms, in view of the issue of falsifiability. By considering toy models we illustrate how unfalsifiable models and paradigms are always favoured by the Bayes factor … We propose a measure of falsifiability (which we term predictivity), and a prior to be incorporated into the Bayesian framework, suitably penalising unfalsifiability …

(I’ve abbreviated the abstract.)

Ewan Cameron and Peter Coles have good critiques of the article. Cameron notes specific problems with the details, while Coles takes a broader view. Personally, I’m more interested in the sort of issues that Coles raises, although I recommend reading both.

The nub of the paper’s argument is that the method of Bayesian inference does not “suitably penalise” theories that are unfalsifiable. My first reaction, like Coles’s, is not to care much, because the idea that falsifiability is essential to science is largely a fairy tale. As Coles puts it,

In fact, evidence neither confirms nor discounts a theory; it either makes the theory more probable (supports it) or makes it less probable (undermines it). For a theory to be scientific it must be capable having its probability influenced in this way, i.e. amenable to being altered by incoming data “i.e. evidence”. The right criterion for a scientific theory is therefore not falsifiability but testability.

Here’s pretty much the same thing, in my words:

For rhetorical purposes if nothing else, it’s nice to have a clean way of describing what makes a hypothesis scientific, so that we can state succinctly why, say, astrology doesn’t count.  Popperian falsifiability nicely meets that need, which is probably part of the reason scientists like it.  Since I’m asking you to reject it, I should offer up a replacement.  The Bayesian way of looking at things does supply a natural replacement for falsifiability, although I don’t know of a catchy one-word name for it.  To me, what makes a hypothesis scientific is that it is amenable to evidence.  That just means that we can imagine experiments whose results would drive the probability of the hypothesis arbitrarily close to one, and (possibly different) experiments that would drive the probability arbitrarily close to zero.
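To make that definition concrete, here’s a toy sketch (the 60/40 coin is my invented example, not anything from the post) of a hypothesis whose probability is driven toward one or toward zero by accumulating evidence:

```python
# Toy hypothesis H: "this coin lands heads 60% of the time";
# the alternative: "the coin is fair." The hypothesis is "amenable to
# evidence" because flip data can push P(H) arbitrarily close to 1 or 0.

def posterior_after(flips, prior=0.5):
    """Update P(H) with Bayes' theorem, one flip at a time."""
    p = prior
    for f in flips:
        like_h = 0.6 if f == "H" else 0.4   # likelihood of this flip under H
        like_fair = 0.5                     # likelihood under the fair-coin model
        p = p * like_h / (p * like_h + (1 - p) * like_fair)
    return p

print(posterior_after("HHHTT" * 200))  # data at 60% heads: P(H) driven near 1
print(posterior_after("HT" * 500))     # data at 50% heads: P(H) driven near 0
```

The same machinery applied to an unfalsifiable hypothesis (one that assigns the same likelihood to every possible outcome) would leave the prior untouched no matter how much data came in, which is the Bayesian version of “not scientific.”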

Sean Carroll is also worth reading on this point.

The problem with the Gubitosi et al. article is not merely that the emphasis on falsifiability is misplaced, but that the authors reason backwards from the conclusion they want to reach, rather than letting logic guide them to a conclusion. Because Bayesian inference doesn’t “suitably” penalize the theories they want to penalize, it “should” be replaced by something that does.

Bayes’s theorem is undisputedly true (that’s what the word “theorem” means), and conclusions derived from it are therefore also true. (That’s what I mean when I use the phrase “Bayesian reasoning, or as I like to call it, ‘reasoning’.”) To be precise, Bayesian inference is the provably correct way to draw probabilistic conclusions in cases where your data do not provide a conclusion with 100% logical certainty (i.e., pretty much all cases outside of pure mathematics and logic).

When reading this paper, it’s worthwhile keeping track of all of the places where words like “should” appear, and asking yourself what is meant by those statements. Are they moral statements? Aesthetic ones? And in any case, recall Hume’s famous dictum that you can’t reason from “is” to “ought”: those “should” statements are not, and by their nature cannot be, supported by the reasoning that leads up to them.

In particular, Gubitosi et al. are sad that the data don’t sufficiently disfavor the inflationary paradigm, which they regard as unfalsifiable. But their sadness is irrelevant. The Universe may have been born in an inflationary epoch, even if the inflation paradigm does not meet their desired falsifiability criterion. Bayesian inference is how you should decide how likely that is.

## That fake chocolate weight-loss study

I’m a little late getting to this, but in case you haven’t heard about it, here it is.

A journalist named John Bohannon did a stunt recently in which he and some coauthors “published” a “study” that “showed” that chocolate caused weight loss. (The reasons for the scare quotes will become apparent.) The work was picked up by a bunch of news outlets. Bohannon wrote about the whole thing on io9. It’s also been picked up in a bunch of other places, including the BBC radio program More or Less (which I’ve mentioned before a few times).

The idea was to do a study that was shoddy in precisely the ways that many “real” studies are, get it published in a low-quality journal, and see if they could get it picked up by credulous journalists.

My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

There is an interesting question of journalistic ethics here. Bohannon calls himself a journalist, but he deliberately introduced bad science into the mediasphere with the specific intent of deception. Is it OK for a journalist to do that, if his motives are pure? I don’t know.

I don’t want to focus on that sort of issue, because I don’t have anything non-obvious to say. Instead, I want to dig a bit into the details of what Bohannon et al. did. Although Bohannon’s io9 post is well worth reading and gets the big picture largely right, it’s wrong or misleading in a few ways, which happen to be the sort of thing I care about.

Bohannon et al. recruited a group of subjects and divided them into three groups: a control group, a group that was put on a low-carb diet, and a group that was put on a low-carb diet but also told to eat a certain amount of chocolate each day. The chocolate group lost weight faster than the other groups. The result was “statistically significant,” in the usual meaning of that term — the p-value was below 0.05.

So what was wrong with this study? As Bohannon explains it in his io9 post,

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?

P(winning) = 1 − (1 − p)^n

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works.” Or they drop “outlier” data points.
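Bohannon’s 60% figure checks out. Plugging a 5% threshold and 18 measurements into his formula:

```python
p = 0.05  # significance threshold for each individual measurement
n = 18    # number of measurements ("lottery tickets")

# Probability that at least one of n independent null tests comes up
# "significant": 1 - (1 - p)^n.
p_any = 1 - (1 - p) ** n
print(f"{p_any:.1%}")  # about 60%
```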

Sadly, even in this piece, whose purpose is to debunk bad statistics, Bohannon repeats an incredibly common error. A p-value of 0.05 does not mean that “there is just a 5 percent chance that your result is a random fluctuation.” It means that, if you assume that nothing but random fluctuations are at work, there’s a 5% chance of getting results as extreme as you did. A (frequentist) p-value is incapable of telling you anything about the probability of any given hypothesis (such as “your result is a random fluctuation”).

One other quibble: the parenthetical remark about the measurements not being independent is literally true but misleading. The fact that the measurements aren’t independent means that the probability of a false positive “could be even higher”, but it could also be lower. In fact, the latter seems more likely to me. (The probability goes down if the measurements are positively correlated with each other, and up if they’re anticorrelated.)
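You can see the direction of the correlation effect in a quick simulation. The sketch below (my construction, not anything from the study) draws 18 equicorrelated null measurements per “experiment” and counts how often at least one clears the 5% cutoff; positive correlation pushes the family-wise rate below the independent-case 60%:

```python
import numpy as np

rng = np.random.default_rng(0)

def family_wise_rate(rho, n_measurements=18, n_trials=20000):
    """Fraction of null experiments with at least one two-sided p < 0.05,
    when the measurements are equicorrelated normals with correlation rho."""
    # Build equicorrelated normals from a shared factor plus independent noise.
    shared = rng.standard_normal((n_trials, 1))
    noise = rng.standard_normal((n_trials, n_measurements))
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise
    # |z| > 1.96 corresponds to a two-sided p-value below 0.05.
    return np.mean((np.abs(z) > 1.96).any(axis=1))

print(family_wise_rate(0.0))  # ≈ 0.60, the independent case
print(family_wise_rate(0.8))  # noticeably lower with strong positive correlation
```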

The other thing that’s worth focusing on is the number of subjects in the study, which was incredibly small (15 across all three groups). Bohannon suggests (in the first sentence quoted above) that this is part of the reason they got a false positive, and other pieces I’ve read on this say the same thing. But it’s not true. The reason they got a false positive was p-hacking (buying many lottery tickets), which would have worked just as well with a larger number of subjects. If you had more subjects, the random fluctuations would have gotten smaller, but the level of fluctuation required for statistical “significance” would have gone down as well. By definition, the odds of any one “lottery ticket” winning are 5%, whether you have a lot of subjects or a few.
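This point is also easy to check by simulation. A sketch (mine, using a simple two-group z-test with known variance rather than whatever test the study used): the false-positive rate per test sits at 5% whether each group has 5 subjects or 500.

```python
import numpy as np

rng = np.random.default_rng(1)

def false_positive_rate(n_per_group, n_trials=50000):
    """Rejection rate of a null two-group comparison of means
    (unit-variance normals), using |z| > 1.96 as the 5% cutoff."""
    a = rng.standard_normal((n_trials, n_per_group)).mean(axis=1)
    b = rng.standard_normal((n_trials, n_per_group)).mean(axis=1)
    # Difference of means has standard deviation sqrt(2/n) under the null.
    z = (a - b) / np.sqrt(2 / n_per_group)
    return np.mean(np.abs(z) > 1.96)

for n in (5, 50, 500):
    print(n, false_positive_rate(n))  # ~0.05 regardless of sample size
```

More subjects shrink the fluctuations, but the significance threshold shrinks in lockstep, so each lottery ticket keeps its 5% odds.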

It’s true that with fewer subjects the effect size (i.e., the number of extra pounds lost, on average) is likely to be larger, but the published article went to great lengths to downplay the effect size (e.g., not mentioning it in the abstract, which is often all anyone reads).

Let me repeat that I think that Bohannon’s description of what he did is well worth reading and has a lot that’s right at the macro-scale, even though I wish that he’d gotten the above details right.

## Does the inside of a brick exist?

In Surely You’re Joking, Mr. Feynman, Richard Feynman tells a story of sitting in on a philosophy seminar and being asked by the instructor whether he thought that an electron was an “essential object.”

Well, now I was in trouble. I admitted that I hadn’t read the book, so I had no idea of what Whitehead meant by the phrase; I had only come to watch. “But,” I said, “I’ll try to answer the professor’s question if you will first answer a question from me, so I can have a better idea of what ‘essential object’ means. Is a brick an essential object?”

What I had intended to do was to find out whether they thought theoretical constructs were essential objects. The electron is a theory that we use; it is so useful in understanding the way nature works that we can almost call it real. I wanted to make the idea of a theory clear by analogy. In the case of the brick, my next question was going to be, “What about the inside of the brick?”–and I would then point out that no one has ever seen the inside of a brick. Every time you break the brick, you only see the surface. That the brick has an inside is a simple theory which helps us understand things better. The theory of electrons is analogous. So I began by asking, “Is a brick an essential object?”

The way he tells the story (which, of course, need not be presumed to be 100% accurate), he never got to the followup question, because the philosophers got bogged down in an argument over the first question.

I was reminded of this when I read A Crisis at the Edge of Physics, by Adam Frank and Marcelo Gleiser, in tomorrow’s New York Times. The article is a pretty good overview of some of the recent hand-wringing over certain areas of theoretical physics that seem, to some people, to be straying too far from experimental testability. (Frank and Gleiser mention a silly article by my old Ph.D. adviser that waxes particularly melodramatic on this subject.)

From the Times piece:

If a theory successfully explains what we can detect but does so by positing entities that we can’t detect (like other universes or the hyperdimensional superstrings of string theory) then what is the status of these posited entities? Should we consider them as real as the verified particles of the standard model? How are scientific claims about them any different from any other untestable — but useful — explanations of reality?

These entities are, it seems to me, not fundamentally different from the inside of Feynman’s brick, or from an electron for that matter. No one has ever seen an electron, or the inside of a brick, or the core of the Earth. We believe that those things are real, because they’re essential parts of a theory that we believe in. We believe in that theory because it makes a lot of successful predictions. If string theory or theories that predict a multiverse someday produce a rich set of confirmed predictions, then the entities contained in those theories will have as much claim to reality as electrons do.

Just to be clear, that hasn’t happened yet, and it may never happen. But it’s just wrong to say that these theories represent a fundamental retreat from the scientific method, just because they contain unobservable entities. (To be fair, Frank and Gleiser don’t say this, but many other people do.) Most interesting theories contain unobservable entities!