Recently, I’ve been spending a lot of time with the peer review process. I’ve been revising a couple of papers in response to comments from the referees, and at roughly the same time I’ve been acting as referee for a couple of articles.

Some journals send each article just to a single referee, while others use multiple referees per article. In the latter case the author and the editor can get some idea of how well the process works by how consistent the different referee reports are with each other. In particular, if one referee says that a paper has a fundamental flaw that absolutely must be fixed, while the other notices nothing wrong, it’s natural to wonder what’s going on.

In fact, by keeping track of how often this sort of thing occurs, you can estimate how good referees are at spotting problems and hence how well the peer review process works. Let’s consider just problems that are so severe as to prevent acceptance of the paper (ignoring minor corrections and suggestions, which are often a big part of referees’ reports). Suppose that the typical referee has a probability p of finding each such problem. If the journal sends the paper to two referees, then the following things might happen:

- Both referees miss the problem. This happens with probability (1-p)^{2}.
- One referee finds the problem and the other misses it. The probability for this is 2p(1-p).
- Both referees find the problem. The probability here is p^{2}.

A journal editor, or an author who’s written a bunch of papers, can easily estimate the ratio of the last two options: when one referee finds a problem, how often does the other referee find the same problem? From that ratio you can solve for p and know how well the referees are doing.

In my experience, I’d say that it’s at least as common for just one referee to find a problem as it is for both referees to find it. That means 2p(1-p) is at least p^{2}, so the typical referee has at best a 2/3 chance of finding any given problem. And that means that the probability of both referees missing a problem (if there is one) is at least (1/3)^{2}, or 1/9. I’ll leave it up to you to decide whether you think that’s a good or bad success rate.
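Here’s a sketch of that arithmetic in Python, with made-up counts. The key step: the ratio r of “one referee finds it” to “both find it” is 2p(1-p)/p^{2} = 2(1-p)/p, which solves to p = 2/(r+2).

```python
def referee_skill(one_found: int, both_found: int) -> float:
    """Estimate a referee's per-problem detection probability p.

    one_found: problems flagged by exactly one of the two referees
    both_found: problems flagged by both referees
    Their ratio r = 2p(1-p) / p^2 = 2(1-p)/p, so p = 2 / (r + 2).
    """
    r = one_found / both_found
    return 2 / (r + 2)

# If the two outcomes are equally common (r = 1), p works out to 2/3,
# and the chance that BOTH referees miss a given problem is (1 - p)^2 = 1/9.
p = referee_skill(one_found=10, both_found=10)
miss_both = (1 - p) ** 2
```

The hypothetical counts (10 and 10) just encode the “at least as common” case from the text; a larger ratio r would push p, and the journal’s reliability, lower still.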

The main thing that made me think of this is that in one referee report I sent off recently, I pointed to what I think is a major, must-fix error, and I’m willing to bet the other referee won’t mention it. That’s not because the other referee is a slacker — I don’t know who it is, or even whether this particular journal uses multiple referees. It’s because the paper happens to have a problem in an area that I’m unusually picky about. It’s bad luck for the authors that they drew me as a referee for this paper. (But, assuming that I’m right to be picky about this issue — which naturally I think I am — it’s good for the community as a whole.)

This sort of calculation shows up in other places, by the way. I first heard of something like this when my friend Max Tegmark was finishing his Ph.D. dissertation. He got two of us to read the same chapter and find typos. By counting the number of typos we each found, and the number that both of us found, he worked out how many undiscovered typos there must be. I don’t remember the number, but it wasn’t small. On a less trivial level, I think that wildlife biologists use essentially this technique to assess how efficient their population counting methods are: If you know how many bears you found, and how many times you found the same bear twice, you can work out how many total bears there are.
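The wildlife version is the classic Lincoln–Petersen capture-recapture estimate. A minimal sketch, with invented typo counts for the dissertation example:

```python
def lincoln_petersen(found_a: int, found_b: int, found_both: int) -> float:
    """Capture-recapture estimate of total population size.

    found_a: items found by the first searcher (reader, survey, ...)
    found_b: items found by the second searcher
    found_both: items both searchers found
    Assuming the two searches are independent, total ~ found_a * found_b / found_both.
    """
    return found_a * found_b / found_both

# Hypothetical: reader A finds 30 typos, reader B finds 20, 10 in common.
# Estimated total: 30 * 20 / 10 = 60. The readers found 30 + 20 - 10 = 40
# distinct typos, leaving about 20 undiscovered.
total = lincoln_petersen(found_a=30, found_b=20, found_both=10)
```

The same formula covers the bears: found_a and found_b are the two counting passes, and found_both is how often you saw the same bear twice.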

I vote for “total bears” as a peer reviewing concept.