Let me tell you a story (originally inspired by this post on Allen Downey’s blog).

Frank and Betsy are wondering whether a particular coin is a fair coin (i.e., comes up heads and tails equally often when flipped). Frank, being a go-getter type, offers to do some tests to find out. He takes the coin away, flips it a bunch of times, and eventually comes back to Betsy to report his results.

“I flipped the coin 3022 times,” he says, “and it came up heads 1583 times. That’s 72 more heads than you’d expect with a fair coin. I worked out the *p*-value — that is, the probability of this large an excess occurring if the coin is fair — and it’s under 1%. So we can conclude that the coin is unfair at a significance level of 1% (or ‘99% confidence’ as physicists often say).”

You can take my word for it that Frank’s done the calculation correctly (or you can check it yourself if you like). Now, I want you to consider two different possibilities:
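If you'd like to check Frank's calculation, here is a minimal stdlib-only Python sketch that computes the exact two-sided binomial *p*-value (the variable names are mine; an exact computation will differ slightly from a normal-approximation one, but both land under 1%):

```python
from math import comb

# Frank's data: 1583 heads in 3022 flips of a putatively fair coin.
n, k = 3022, 1583

# P(X >= k) for X ~ Binomial(n, 1/2), using exact integer arithmetic
# (Python handles the enormous binomial coefficients natively).
upper_tail = sum(comb(n, j) for j in range(k, n + 1))

# Two-sided p-value: by symmetry of the fair coin, an excess of tails
# at least this large is exactly as probable as an excess of heads.
p_value = 2 * upper_tail / 2**n

print(f"p-value = {p_value:.4f}")
```

The exact value comes out a little under 0.01, consistent with Frank's claim of significance at the 1% level.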

- Frank is an honest man, who has followed completely orthodox (frequentist) statistical procedure. To be specific, he decided on the exact protocol for his test (including, for some reason, the decision to do 3022 trials) in advance.
- Frank is a scoundrel who, for some reason, *wants* to reach the conclusion that the coin is unfair. He comes up with a nefarious plan: he keeps flipping the coin for as long as it takes to reach that 1% significance threshold, and then he stops and reports his results.

(I thought about making up some sort of backstory to explain why scoundrel Frank would behave this way, but I couldn’t come up with anything that wasn’t stupid.)

Here are some questions for you:

- What should Betsy conclude on the basis of the information Frank has given her?
- Does the answer depend on whether Frank is an honest man or a scoundrel?

I should add one more bit of information: Betsy is a rational person — that is, she draws conclusions from the available evidence via Bayesian inference.

As you can guess, I’m asking these questions because I think the answers are surprising. In fact, they turn out to be surprising in two different ways.

There’s one thing we can say immediately: if Frank is a scoundrel, then the 1% significance figure is meaningless. It turns out that, if you start with a fair coin and flip it long enough, you will (with probability 1) always eventually reach 1% significance (or, for that matter, any other significance you care to name). So the fact that he reached 1% significance conveys no information in this scenario.
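You can see this inflation of significance in a quick simulation. A sketch under my own assumptions (the function name, the 30-flip warm-up, and the 2,000-flip cap are mine; the true claim is that with *no* cap a fair coin reaches any threshold with probability 1, which a finite simulation can only gesture at):

```python
import random

def scoundrel_stops_at(rng, z_threshold=2.5758, max_flips=2000, min_flips=30):
    """Flip a fair coin, stopping as soon as the running z-score crosses
    the two-sided 1% threshold; return the flip count at which that
    happened, or None if the (artificial) cap is reached first."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        if n >= min_flips:
            z = abs(heads - n / 2) / (n ** 0.5 / 2)
            if z >= z_threshold:
                return n
    return None

rng = random.Random(0)
runs = 500
hits = sum(scoundrel_stops_at(rng) is not None for _ in range(runs))
frac = hits / runs
print(f"fraction of fair coins reaching 1% 'significance' within 2000 flips: {frac:.3f}")
```

Even with the cap, well over 1% of fair coins produce a "1% significant" result, because the scoundrel gets to test after every flip rather than once at a preregistered *n*.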

On the other hand, the fact that he reached 1% significance *after 3022 trials* does still convey some information, which Betsy will use when she performs her Bayesian inference. In fact, *the conclusion Betsy draws will be exactly the same whether Frank is an honest man or a scoundrel*. The reason is that, either way, the evidence Betsy uses in performing her Bayesian inference is the same, namely that there were 1583 heads in 3022 flips.

[Technical aside: if Frank is a scoundrel, and Betsy knows it, then she has some additional information about the order in which those heads and tails occurred. For instance, she knows that Frank didn’t start with an initial run of 20 heads in a row, because if he had he would have stopped long before 3022 flips. You can convince yourself that this doesn’t affect the conclusion.]

That’s surprise #1. (At least, I think it’s kind of surprising. Maybe you don’t.) From a frequentist point of view, the *p*-value is the main thing that matters. Once we realize that the *p*-value quoted by scoundrel Frank is meaningless, you might think that the whole data set is useless. But in fact, viewed rationally (i.e., using Bayesian inference), the data set means exactly the same thing as if Frank had produced it honestly.

Here’s surprise #2: for reasonable assumptions about Betsy’s prior beliefs, she should regard this evidence as *increasing* the probability that the coin is fair, even though Frank thinks the evidence establishes (at 1% significance) the coin’s unfairness. Moreover, even if Frank’s results had ruled out the coin’s fairness at a more stringent significance (0.1%, 0.00001%, whatever), it’s always possible that he’ll wind up with a result that Betsy regards as evidence *in favor* of the coin’s fairness.

Often, we expect Bayesians and frequentists to come up with different conclusions when the evidence is weak, but we expect the difference to go away when the evidence is strong. But in fact, no matter how strong the evidence is from a frequentist point of view, it’s always possible that the Bayesian will view it in precisely the opposite way.

I’ll show you that this is true with some specific assumptions, although the conclusion applies more generally.

Suppose that Betsy’s initial belief is that 95% of coins are fair — that is, the probability *P* that they come up heads is exactly 0.5. Betsy has no idea what the other 5% of coins are like, so she assumes that all values of *P* are equally likely for them. To be precise, her prior probability density on *P*, the probability that the given coin comes up heads, is

Pr[*P*] = 0.95 δ(*P*-0.5) + 0.05

over the range 0 < *P* < 1. (I’m using the Dirac delta notation here.)

The likelihood function (i.e., the probability of getting the observed evidence for any given *P*) is

Pr[E | *P*] = *A* *P*^{1583} (1-*P*)^{1439}.

Here *A* is a constant whose value doesn’t matter. (To be precise, it’s the number of possible orders in which the heads and tails could have arisen.) Turning the crank of Bayes’s theorem, we find that the posterior probability distribution is

Pr[*P* | E] = 0.964 δ(*P*-0.5) + *B* *P*^{1583} (1-*P*)^{1439}.

Here *B* is some other constant I’m not bothering to tell you because it doesn’t matter. What does matter is the factor 0.964 in front of the delta function, which says that, in this particular case, Betsy regards Frank’s information as increasing the probability that the coin is fair from 95% to 96.4%. In other words, she initially thought that there was a 5% chance the coin was unfair, but based on Frank’s results she now thinks there’s only a 3.6% chance that it is.
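If you want to reproduce the 0.964, here is a short Python sketch of the calculation (variable names are mine). The key fact it relies on is that, for the flat prior on unfair coins, the marginal likelihood is a beta integral: ∫₀¹ *P*^k (1-*P*)^(n-k) d*P* = k!(n-k)!/(n+1)!, which can be evaluated in logs with `lgamma` to avoid underflow:

```python
from math import lgamma, log, exp

n, k = 3022, 1583
prior_fair = 0.95

# Log-likelihood of the data if the coin is fair: (1/2)^n
log_like_fair = n * log(0.5)

# Marginal log-likelihood under the flat prior on unfair coins:
# integral of P^k (1-P)^(n-k) over [0, 1] = k! (n-k)! / (n+1)!
log_like_unfair = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

# Bayes factor (fair vs. unfair), then posterior via the odds form
# of Bayes's theorem: posterior odds = prior odds * Bayes factor.
bayes_factor = exp(log_like_fair - log_like_unfair)
prior_odds = prior_fair / (1 - prior_fair)
posterior_odds = prior_odds * bayes_factor
posterior_fair = posterior_odds / (1 + posterior_odds)

print(f"posterior probability of a fair coin: {posterior_fair:.3f}")
```

The combinatorial constant *A* multiplies both likelihoods and so cancels out of the Bayes factor, which is why its value doesn’t matter.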

It’s not surprising that a Bayesian and frequentist interpretation of the same result give different answers, but I think it’s kind of surprising that Frank and Betsy interpret the same evidence in *opposite* ways: Frank says it rules out the possibility that the coin is fair with high significance, but Betsy says it increases her belief that the coin is fair. Moreover, as I mentioned before, even if Frank had adopted a more stringent criterion for significance — say 0.01% instead of 1% — the same sort of thing could happen.

If Betsy had had a different prior, this evidence might not have had the same effect, but it turns out that you’d get the same kind of result for a pretty broad range of priors. In particular, you could change the 95% in the prior to any value you like, and you’d still find that the evidence increases the probability that the coin is fair. Also, you could decide that the assumption of a uniform prior for the unfair coins is unrealistic. (There probably aren’t any coins that come up heads 99% of the time, for instance.) But if you changed that uniform prior to any reasonably smooth, not too sharply peaked function, it wouldn’t change the result much.

In fact, you can prove a general theorem that says essentially the following:

No matter what significance level *s* Frank chooses, and what Betsy’s prior is, it’s still possible to find a number of coin flips and a number of heads such that Frank rules out the possibility that the coin is fair at significance *s*, while Betsy regards the evidence as increasing the probability that the coin is fair.

I could write out a formal proof of this with a few equations, but instead I’ll just sketch the main idea. Let *n* be the number of flips and *k* be the number of heads. Suppose Frank is a scoundrel, flipping the coin until he reaches the desired significance and then stopping. Imagine listing all the possible pairs (*n,k*) at which he might stop. If you just told Betsy that Frank had stopped at one of those points, but not which one, then you’d be telling Betsy no information at all (since Frank is guaranteed to stop eventually). With that information, therefore, her posterior probability distribution would be the same as her prior. But that posterior probability distribution is also a weighted average of the posterior probability distributions corresponding to each of the possible pairs (*n,k*), with weights given by the probability that Frank stops at each of those points. Since the weighted average comes out the same as the prior, some terms in the average must give a probability of the coin being fair which is greater than the prior (and some must be less).
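The averaging argument can be checked numerically. Here is a sketch under my own assumptions (the prior from the example above, a 1,000-flip cap, and these function names are mine; reaching the cap counts as one more possible stopping point, so the weighted average over all stopping points should still reproduce the prior, up to Monte Carlo noise):

```python
import random
from math import lgamma, log, exp

def posterior_fair(n, k, prior_fair=0.95):
    """Betsy's posterior probability that the coin is fair, given
    k heads in n flips, with the prior from the example above."""
    log_fair = n * log(0.5)
    log_unfair = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)
    odds = (prior_fair / (1 - prior_fair)) * exp(log_fair - log_unfair)
    return odds / (1 + odds)

def scoundrel_run(rng, z_threshold=2.5758, max_flips=1000, min_flips=30):
    """Draw a coin from Betsy's prior, let scoundrel Frank flip it until
    he reaches 1% significance (or hits the cap), and return Betsy's
    posterior at the point where he stops."""
    p = 0.5 if rng.random() < 0.95 else rng.random()
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < p
        if n >= min_flips and abs(heads - n / 2) / (n ** 0.5 / 2) >= z_threshold:
            break
    return posterior_fair(n, heads)

rng = random.Random(1)
runs = 2000
avg = sum(scoundrel_run(rng) for _ in range(runs)) / runs
print(f"average posterior P(fair) over Frank's stopping points: {avg:.3f}")
```

The average comes back close to the prior value of 0.95, as the conservation argument requires: the stopping points where Betsy’s belief in fairness goes up must be balanced by ones where it goes down.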

Incidentally, in case you’re wondering, Betsy and Frank are my parents’ names, which fortuitously have the same initials as Bayesian and frequentist. My father Frank probably did learn statistics from a frequentist point of view (for which he deserves pity, not blame), but he would certainly never behave like a scoundrel.

Related to this is of course publication bias: an article with an interesting result is more likely to get published. There is even a journal which officially says that papers without surprising results are less likely to get published, even if there is otherwise nothing wrong with them. I’m not sure if he did it himself or was discussing work by someone else, but Prasenjit Saha once told me about a study of 3-sigma results in the astronomical literature. After a few years, many more of them had gone away than one would expect based on unbiased publication.

Have you lost a P at the right of Pr[P]?

Pr[P] = 0.95 δ(P-0.5) + 0.05 P

E. T. Jaynes does a similar calculation, demonstrating that (even from a Bayesian point of view) perspectives can be driven in opposite directions depending on the priors. In his example, the data is D := “Mr N. has gone on TV with a sensational claim that a commonly used drug is unsafe”. Three viewers differ on their prior assessment of the safety of the drug, and on the reliability of Mr N. One then shows that the difference between their posteriors diverges. Jaynes then argues that this is one of the reasons why, over time, you get a polarized environment (think, any significant issue affecting society today).

Very interesting. Two observations. First, surprise 1 doesn’t require a Bayesian, merely an advocate of the Likelihood Principle (see Birnbaum or Edwards or, particularly, Royall (1997) Statistical Evidence.)

Second, my interpretation of surprise 2 is that the longer it takes the scoundrel to get the result he was looking for, the stronger the evidence for fairness. Eventually the evidence is strong enough to exceed any prior.

Antonio — No, there’s not a P missing there. The 0.05 is the flat (i.e., constant, independent of P) prior on unfair coins.

Could you provide the details of how you applied Bayes rule here? I’m trying to calculate it myself without success. Thank you.

Hi, I’m a new subscriber to your blog =) Just wondering, your surprise #1 is that when Betsy was calculating her posterior, she didn’t need to use the information whether Frank was a scoundrel or not; she just needed the fact that there were 1583 heads in 3022 flips. But at the end when you sketch the proof for your theorem, you did use that fact that Betsy has a posterior distribution that is identical to the prior, because she knew beforehand that Frank was a scoundrel…

so is there a contradiction between what you said?

Also, could you provide references for further reading? =)

I’m sorry, but I don’t think I understand the claimed contradiction. The claim at the beginning is that Betsy need not care about the procedure by which Frank gathered his data when assessing the posterior probability. She needs a prior on the properties of the coin, but she has no need to peer into Frank’s head and assess his motivation. And in the proof at the end, I’m not assuming that Betsy knows Frank to be a scoundrel; I’m assuming that we, looking at the whole situation from the outside, know he’s a scoundrel. Betsy does her analysis with blissful indifference to the question.

For further reading, I can recommend a couple of blogs that say intelligent things on how to interpret and work with probabilities (where by “intelligent” I mean “generally in agreement with me”): Andrew Jaffe’s “Leaves on the Line” and Allen Downey’s “Probably Overthinking It”. Allen has a book called Think Stats, which is probably a good resource for thinking about this sort of thing, although I confess I haven’t read it. And there’s E.T. Jaynes’s book Probability Theory: The Logic of Science.