Let me tell you a story (originally inspired by this post on Allen Downey’s blog).
Frank and Betsy are wondering whether a particular coin is a fair coin (i.e., comes up heads and tails equally often when flipped). Frank, being a go-getter type, offers to do some tests to find out. He takes the coin away, flips it a bunch of times, and eventually comes back to Betsy to report his results.
“I flipped the coin 3022 times,” he says, “and it came up heads 1583 times. That’s 72 more heads than you’d expect with a fair coin. I worked out the p-value — that is, the probability of this large an excess occurring if the coin is fair — and it’s under 1%. So we can conclude that the coin is unfair at a significance level of 1% (or ‘99% confidence’ as physicists often say).”
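For anyone who wants to check Frank's arithmetic, here is a minimal Python sketch of the calculation as I read it. It does an exact two-sided binomial test, counting every outcome at least as far from an even split as the one observed. The function name is my own, and the two-sided convention is an assumption, since Frank doesn't say which one he used.

```python
from math import comb

def two_sided_p_value(n, heads):
    """Exact probability, for a fair coin, of a head count at least as far
    from n/2 as the observed one (in either direction)."""
    excess = abs(2 * heads - n)  # observed distance from an even split, doubled
    # Count the head/tail sequences that are at least this lopsided.
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= excess)
    return extreme / 2**n        # every sequence has probability 2**-n

p = two_sided_p_value(3022, 1583)
print(f"p-value = {p:.4f}")
```

The result should come out under 0.01, in line with what Frank reports.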
You can take my word for it that Frank’s done the calculation correctly (or you can check it yourself if you like). Now, I want you to consider two different possibilities:
- Frank is an honest man, who has followed completely orthodox (frequentist) statistical procedure. To be specific, he decided on the exact protocol for his test (including, for some reason, the decision to do 3022 trials) in advance.
- Frank is a scoundrel who, for some reason, wants to reach the conclusion that the coin is unfair. He comes up with a nefarious plan: he keeps flipping the coin for as long as it takes to reach that 1% significance threshold, and then he stops and reports his results.
(I thought about making up some sort of backstory to explain why scoundrel Frank would behave this way, but I couldn’t come up with anything that wasn’t stupid.)
Here are some questions for you:
- What should Betsy conclude on the basis of the information Frank has given her?
- Does the answer depend on whether Frank is an honest man or a scoundrel?
I should add one more bit of information: Betsy is a rational person — that is, she draws conclusions from the available evidence via Bayesian inference.
As you can guess, I’m asking these questions because I think the answers are surprising. In fact, they turn out to be surprising in two different ways.
There’s one thing we can say immediately: if Frank is a scoundrel, then the 1% significance figure is meaningless. It turns out that, if you start with a fair coin and flip it long enough, you will (with probability 1) eventually reach 1% significance (or, for that matter, any other significance level you care to name). So the fact that he reached 1% significance conveys no information in this scenario.
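If you'd like to watch this happen, here is a rough simulation sketch. It makes some assumptions of mine: scoundrel Frank checks after every flip, he uses the normal approximation (|z| > 2.576 for two-sided 1% significance, which is crude for the first few flips), and I give up on him after a fixed number of flips. The function names and the cutoff are hypothetical.

```python
import random
from math import sqrt

Z_CRIT = 2.576  # assumed two-sided 1% threshold under the normal approximation

def flips_until_significant(rng, max_flips):
    """Flip a fair coin until |z| exceeds Z_CRIT; return the number of flips,
    or None if that hasn't happened within max_flips."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        # z = (heads - n/2) / (sqrt(n)/2) = (2*heads - n) / sqrt(n)
        if abs(2 * heads - n) > Z_CRIT * sqrt(n):
            return n
    return None

def fraction_stopped(max_flips, runs=200, seed=0):
    """Fraction of simulated scoundrels who reach 'significance' in time."""
    rng = random.Random(seed)
    stopped = sum(
        flips_until_significant(rng, max_flips) is not None for _ in range(runs)
    )
    return stopped / runs

for horizon in (100, 1000, 5000):
    print(horizon, fraction_stopped(horizon))
```

An honest Frank with a fair coin would reach 1% significance about 1% of the time. The scoundrel's success rate should be well above that, and it should keep creeping upward as you give him more flips, although the approach to 100% is very slow.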
On the other hand, the fact that he reached 1% significance after 3022 trials does still convey some information, which Betsy will use when she performs her Bayesian inference. In fact, the conclusion Betsy draws will be exactly the same whether Frank is an honest man or a scoundrel. The reason is that, either way, the evidence Betsy uses in performing her Bayesian inference is the same, namely that there were 1583 heads in 3022 flips.
[Technical aside: if Frank is a scoundrel, and Betsy knows it, then she has some additional information about the order in which those heads and tails occurred. For instance, she knows that Frank didn’t start with an initial run of 20 heads in a row, because if he had he would have stopped long before 3022 flips. You can convince yourself that this doesn’t affect the conclusion.]
That’s surprise #1. (At least, I think it’s kind of surprising. Maybe you don’t.) From a frequentist point of view, the p-value is the main thing that matters. Once we realize that the p-value quoted by scoundrel Frank is meaningless, you might think that the whole data set is useless. But in fact, viewed rationally (i.e., using Bayesian inference), the data set means exactly the same thing as if Frank had produced it honestly.
Here’s surprise #2: for reasonable assumptions about Betsy’s prior beliefs, she should regard this evidence as increasing the probability that the coin is fair, even though Frank thinks the evidence establishes (at 1% significance) the coin’s unfairness. Moreover, even if Frank’s results had ruled out the coin’s fairness at a more stringent significance (0.1%, 0.00001%, whatever), it’s always possible that he’ll wind up with a result that Betsy regards as evidence in favor of the coin’s fairness.
Often, we expect Bayesians and frequentists to come up with different conclusions when the evidence is weak, but we expect the difference to go away when the evidence is strong. But in fact, no matter how strong the evidence is from a frequentist point of view, it’s always possible that the Bayesian will view it in precisely the opposite way.
I’ll show you that this is true with some specific assumptions, although the conclusion applies more generally.
Suppose that Betsy’s initial belief is that 95% of coins are fair — that is, the probability P that they come up heads is exactly 0.5. Betsy has no idea what the other 5% of coins are like, so she assumes that all values of P are equally likely for them. To be precise, her prior probability density on P, the probability that the given coin comes up heads, is
Pr[P] = 0.95 δ(P-0.5) + 0.05
over the range 0 < P < 1. (I’m using the Dirac delta notation here.)
The likelihood function (i.e., the probability of getting the observed evidence for any given P) is
Pr[E | P] = A P^1583 (1-P)^1439.
Here A is a constant whose value doesn’t matter. (To be precise, it’s the number of possible orders in which heads and tails could have arisen.) Turning the Bayes’s theorem crank, we find that the posterior probability distribution is
Pr[P | E] = 0.964 δ(P-0.5) + B P^1583 (1-P)^1439.
Here B is some other constant I’m not bothering to tell you because it doesn’t matter. What does matter is the factor 0.964 in front of the delta function, which says that, in this particular case, Betsy regards Frank’s information as increasing the probability that the coin is fair from 95% to 96.4%. In other words, she initially thought that there was a 5% chance the coin was unfair, but based on Frank’s results she now thinks there’s only a 3.6% chance that it is.
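If you want to turn the crank yourself, here is a short sketch of the calculation. It relies on one standard fact: integrating the binomial likelihood over a uniform prior on P gives exactly 1/(n+1), whatever the number of heads. The function name and argument names are mine.

```python
from math import comb

def posterior_fair(n, heads, prior_fair=0.95):
    """Posterior probability that the coin is fair, given heads in n flips,
    for a prior that puts prior_fair on P = 0.5 and spreads the rest
    uniformly over 0 < P < 1."""
    like_fair = comb(n, heads) / 2**n   # Pr[E | P = 0.5]
    like_unfair = 1 / (n + 1)           # Pr[E | P] averaged over uniform P
    numerator = prior_fair * like_fair
    return numerator / (numerator + (1 - prior_fair) * like_unfair)

print(round(posterior_fair(3022, 1583), 3))
```

The ratio like_fair / like_unfair is the Bayes factor in favor of fairness. Whenever it is bigger than 1, the evidence pushes Betsy toward the fair coin, whatever value she started with in place of 95%.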
It’s not surprising that a Bayesian and frequentist interpretation of the same result give different answers, but I think it’s kind of surprising that Frank and Betsy interpret the same evidence in opposite ways: Frank says it rules out the possibility that the coin is fair with high significance, but Betsy says it increases her belief that the coin is fair. Moreover, as I mentioned before, even if Frank had adopted a more stringent criterion for significance — say 0.01% instead of 1% — the same sort of thing could happen.
If Betsy had had a different prior, this evidence might not have had the same effect, but it turns out that you’d get the same kind of result for a pretty broad range of priors. In particular, you could change the 95% in the prior to any value you like, and you’d still find that the evidence increases the probability that the coin is fair. Also, you could decide that the assumption of a uniform prior for the unfair coins is unrealistic. (There probably aren’t any coins that come up heads 99% of the time, for instance.) But if you changed that uniform prior to any reasonably smooth, not too sharply peaked function, it wouldn’t change the result much.
In fact, you can prove a general theorem that says essentially the following:
No matter what significance level s Frank chooses, and what Betsy’s prior is, it’s still possible to find a number of coin flips and a number of heads such that Frank rules out the possibility that the coin is fair at significance s, while Betsy regards the evidence as increasing the probability that the coin is fair.
I could write out a formal proof of this with a few equations, but instead I’ll just sketch the main idea. Let n be the number of flips and k be the number of heads. Suppose Frank is a scoundrel, flipping the coin until he reaches the desired significance and then stopping. Imagine listing all the possible pairs (n,k) at which he might stop. If you just told Betsy that Frank had stopped at one of those points, but not which one, then you’d be giving Betsy no information at all (since Frank is guaranteed to stop eventually). With that information, therefore, her posterior probability distribution would be the same as her prior. But that posterior probability distribution is also a weighted average of the posterior probability distributions corresponding to each of the possible pairs (n,k), with weights given by the probability (as Betsy assesses it under her prior) that Frank stops at each of those points. Since the weighted average comes out the same as the prior, some terms in the average must give a probability of the coin being fair which is greater than the prior (and some must be less).
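Here is a numerical sketch of that averaging argument, under assumptions of my own. I use Betsy's 95%-plus-uniform prior from above, I let scoundrel Frank check |z| > 2.576 (the normal approximation to 1% significance) after every flip, and I cut him off at a hypothetical 2000 flips, counting "still flipping at the cutoff" as one more way the run can end. All the names are mine.

```python
from math import exp, lgamma, log, sqrt

Z_CRIT = 2.576      # assumed two-sided 1% threshold (normal approximation)
PRIOR_FAIR = 0.95   # Betsy's prior probability that the coin is fair

def unfair_to_fair_ratio(n, k):
    """Pr[sequence | uniform P] / Pr[sequence | P = 0.5], k heads in n flips."""
    return exp(n * log(2) + lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2))

def posterior_fair(n, k):
    return PRIOR_FAIR / (PRIOR_FAIR + (1 - PRIOR_FAIR) * unfair_to_fair_ratio(n, k))

def scoundrel_outcomes(max_flips):
    """List (weight, posterior, n, k, stopped) for every way the run can end."""
    alive = {0: 1.0}  # heads -> fair-coin probability of being here, unstopped
    outcomes = []
    for n in range(1, max_flips + 1):
        nxt = {}
        for k, q in alive.items():          # one more flip: tails or heads
            nxt[k] = nxt.get(k, 0.0) + q / 2
            nxt[k + 1] = nxt.get(k + 1, 0.0) + q / 2
        alive = {}
        for k, q in nxt.items():
            stopped = abs(2 * k - n) > Z_CRIT * sqrt(n)
            if stopped or n == max_flips:
                # Probability Betsy assigns to this ending under her prior.
                mix = PRIOR_FAIR + (1 - PRIOR_FAIR) * unfair_to_fair_ratio(n, k)
                outcomes.append((q * mix, posterior_fair(n, k), n, k, stopped))
            else:
                alive[k] = q
    return outcomes

outcomes = scoundrel_outcomes(2000)
total = sum(w for w, *_ in outcomes)
average = sum(w * p for w, p, *_ in outcomes) / total
favourable = [(n, k) for w, p, n, k, stopped in outcomes if stopped and p > PRIOR_FAIR]
print(total, average, len(favourable))
```

The weights should add up to 1 and the weighted average of the posteriors should come back to the prior of 0.95. The list favourable collects the stopping points where Frank declares significance but Betsy's belief in fairness goes up; with these settings they only show up once n is fairly large.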
Incidentally, in case you’re wondering, Betsy and Frank are my parents’ names, which fortuitously have the same initials as Bayesian and frequentist. My father Frank probably did learn statistics from a frequentist point of view (for which he deserves pity, not blame), but he would certainly never behave like a scoundrel.