Some folks had the nice idea of examining the data from the Iran election returns for signs of fraud. In particular, they looked at the last and second-to-last digits of the vote totals for different candidates in different districts, to see whether those digits are uniformly distributed, as you'd expect for genuine counts. Regarding the last digit, they conclude
The ministry provided data for 29 provinces, and we examined the number of votes each of the four main candidates — Ahmadinejad, Mousavi, Karroubi and Mohsen Rezai — is reported to have received in each of the provinces — a total of 116 numbers.
The numbers look suspicious. We find too many 7s and not enough 5s in the last digit. We expect each digit (0, 1, 2, and so on) to appear at the end of 10 percent of the vote counts. But in Iran’s provincial results, the digit 7 appears 17 percent of the time, and only 4 percent of the results end in the number 5. Two such departures from the average — a spike of 17 percent or more in one digit and a drop to 4 percent or less in another — are extremely unlikely. Fewer than four in a hundred non-fraudulent elections would produce such numbers.
The calculations are correct. There's about a 20% chance of getting a downward fluctuation as large as the one seen (some digit appearing 4% of the time or less), about a 10% chance of getting an upward fluctuation as large as the one seen (some digit appearing 17% of the time or more), and about a 3.5% chance of getting both simultaneously.
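If you want to check those numbers yourself, a quick Monte Carlo will do it. Here's a minimal sketch in Python; the count thresholds (at least 20 of the 116 totals ending in the same digit for the spike, at most 5 for the drop) are my reading of the Op-Ed's 17% and 4% figures, so treat them as assumptions.

```python
# Monte Carlo check of the last-digit test: how often does a set of 116
# uniformly random last digits show both a "spike" and a "drop" as
# extreme as the ones reported? Thresholds are my reading of the Op-Ed.
import random

N_TRIALS = 100_000
N_COUNTS = 116   # 4 candidates x 29 provinces
SPIKE = 20       # "17 percent or more" of 116, rounded up
DROP = 5         # "4 percent or less" of 116

spikes = drops = both = 0
for _ in range(N_TRIALS):
    tallies = [0] * 10
    for _ in range(N_COUNTS):
        tallies[random.randrange(10)] += 1
    has_spike = max(tallies) >= SPIKE
    has_drop = min(tallies) <= DROP
    spikes += has_spike
    drops += has_drop
    both += has_spike and has_drop

# If the thresholds above are the right reading, these should land near
# the ~10%, ~20%, and ~3.5% figures quoted in the text.
print(f"P(spike) ~ {spikes / N_TRIALS:.3f}")
print(f"P(drop)  ~ {drops / N_TRIALS:.3f}")
print(f"P(both)  ~ {both / N_TRIALS:.3f}")
```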
The authors then go on to consider patterns in the last two digits.
Psychologists have also found that humans have trouble generating non-adjacent digits (such as 64 or 17, as opposed to 23) as frequently as one would expect in a sequence of random numbers. To check for deviations of this type, we examined the pairs of last and second-to-last digits in Iran’s vote counts. On average, if the results had not been manipulated, 70 percent of these pairs should consist of distinct, non-adjacent digits.
Not so in the data from Iran: Only 62 percent of the pairs contain non-adjacent digits. This may not sound so different from 70 percent, but the probability that a fair election would produce a difference this large is less than 4.2 percent.
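Incidentally, the 70 percent expectation pins down the adjacency convention: of the 100 possible ordered digit pairs, 10 are equal and 20 differ by one if 0 and 9 are counted as adjacent (wrapping around), leaving exactly 70. Here's a sketch that verifies this and reproduces the quoted p-value from a binomial tail; reading "only 62 percent" as 72 of the 116 pairs is my assumption.

```python
# Expected fraction of distinct, non-adjacent digit pairs, and the
# binomial probability that 116 fair pairs come out as low as observed.
# The adjacency convention (0 and 9 adjacent) is inferred from the 70%.
from math import comb

def non_adjacent(d1, d2):
    """True if the digits are distinct and not adjacent mod 10."""
    return d1 != d2 and (d1 - d2) % 10 not in (1, 9)

# 70 of the 100 ordered pairs qualify, so p = 0.70.
p = sum(non_adjacent(a, b) for a in range(10) for b in range(10)) / 100
print(f"expected fraction non-adjacent: {p:.2f}")

# One-sided binomial tail: chance that at most 72 of 116 pairs
# (~62%) are non-adjacent when the true fraction is 70%.
n, k = 116, 72   # 0.62 * 116 ~ 72, an assumption about their rounding
p_value = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
print(f"one-sided p-value ~ {p_value:.3f}")
```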
Each of these tests alone is of marginal statistical significance, it seems to me, though in combination they start to look significant. But I don't think that's a fair conclusion to draw: this analysis looks like an example of the classic error of a posteriori statistical significance. (This fallacy must have a catchy name, but I can't come up with it now. If you know it, please tell me.)
This error goes like this: you notice a surprising pattern in your data, and then you calculate how unlikely that particular pattern is to have arisen. When that probability is low, you conclude that there’s something funny going on. The problem is that there are many different ways in which your data could look funny, and the probability that one of them will occur is much larger than the probability that a particular one of them will occur. In fact, in a large data set, you’re pretty much guaranteed to find some sort of anomaly that, taken in isolation, looks extremely unlikely.
In this case, there are lots of things that one could have calculated instead of the probabilities for these particular outcomes. For instance, we could have looked at the number of times the last two digits were identical, or the number of times they differed by two, three, or any given number. Odds are that at least one of those would have looked surprising, even if there’s nothing funny going on.
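To make this concrete, here's a little simulation (the battery of tests is invented for illustration): generate perfectly clean, uniform digit pairs, run several tests of this sort on each fake election, and count how often at least one of them lands in its own 5% tail. The answer should come out well above 5 percent.

```python
# Illustration of a posteriori significance: even with perfectly clean
# (uniform) digits, running several tests and flagging the most extreme
# one produces "significant" anomalies far more than 5% of the time.
# The six tests (fraction of pairs differing by 0, 1, ..., 5) are
# invented for illustration.
import random

N_SIM, N_PAIRS = 2000, 116

def pair_stats(pairs):
    """Fraction of pairs whose digits differ by 0..5 (mod 10, folded)."""
    fracs = []
    for d in range(6):
        hits = sum(1 for a, b in pairs if min((a - b) % 10, (b - a) % 10) == d)
        fracs.append(hits / len(pairs))
    return fracs

def fake_election():
    return [(random.randrange(10), random.randrange(10)) for _ in range(N_PAIRS)]

# Simulate the null distribution of each statistic...
null = [[] for _ in range(6)]
for _ in range(N_SIM):
    for d, f in enumerate(pair_stats(fake_election())):
        null[d].append(f)
for lst in null:
    lst.sort()
# ...and take each test's central 95% interval as its "significance" cut.
lo = [lst[int(0.025 * N_SIM)] for lst in null]
hi = [lst[int(0.975 * N_SIM)] for lst in null]

# On fresh clean elections, how often does *some* test look significant?
alarms = sum(
    any(f < lo[d] or f > hi[d] for d, f in enumerate(pair_stats(fake_election())))
    for _ in range(N_SIM)
)
print(f"at least one 'anomaly' in {alarms / N_SIM:.0%} of clean elections")
```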
By the way, this is an issue that I’m worrying a lot about these days in a completely different context. There are a number of claims of anomalous patterns in observations of the cosmic microwave background radiation. It’s of great interest to know whether these anomalies are real, but any attempt to quantify their statistical significance runs into exactly the same problem.
We run into this kind of problem all the time in high-energy physics "bump hunts", where a hypothetical particle has a predicted decay mode but not a specific mass, so the hypothesis being tested isn't completely specified. These days we assess the statistical significance with computer simulations, checking how often random fluctuations would give a peak with the observed area at any mass within the acceptance range; there's some good discussion of this on Tommaso Dorigo's blog. However, this is still a pretty well-constrained hypothesis, where we know exactly which alternatives to test, so it doesn't generalize to the problem where there are no limits on what patterns people look at.
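As a toy version of that effect (my own sketch, not anyone's actual analysis): scan a flat background spectrum for the most significant single-bin excess, and compare the local significance of the best bin with how often pure noise produces such a bin somewhere in the spectrum.

```python
# Toy look-elsewhere effect: a 3-sigma excess in one pre-chosen bin is
# rare, but a 3-sigma excess *somewhere* in a 50-bin spectrum is not.
# All numbers here (bin count, background level) are made up.
import math
import numpy as np

rng = np.random.default_rng(0)
N_BINS, MU, N_SIM = 50, 100.0, 20_000

# Pseudo-experiments with no signal: Poisson background in every bin.
counts = rng.poisson(MU, size=(N_SIM, N_BINS))
z = (counts - MU) / math.sqrt(MU)   # per-bin significance (Gaussian approx)
max_z = z.max(axis=1)               # most significant bin per experiment

local_p = 0.5 * math.erfc(3.0 / math.sqrt(2))   # fixed bin at 3 sigma, ~0.0013
global_p = float((max_z >= 3.0).mean())          # 3 sigma anywhere in the scan
print(f"local p ~ {local_p:.4f}, global p ~ {global_p:.3f}")
```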
I gather in data mining this problem is known as “data dredging”. I don’t know of a catchy generic name for the problem, but it seems common enough that it ought to have one.
However, in the case of the Iranian elections, my understanding is that there are specific properties that have been determined to have good power for discriminating between genuine and fraudulent counts, and that the (better) statistical analyses are only looking at those specific properties, not dredging for any old random correlations. But most of what I know on the subject is from Andrew Gelman's blog…
Thanks for this information! I was working just from that Op-Ed piece, but the extra information at Andrew Gelman’s blog (http://www.stat.columbia.edu/~gelman/blog/) does make things look more convincing, at least at a quick glance.
I make some of the same points here – http://alchemytoday.com/2009/06/24/is-the-devil-in-the-digits/ – with some examples here – http://alchemytoday.com/2009/06/25/more-on-that-devil/
Looking for a rare event in the last digit is not equivalent to looking for one high-frequency and one low-frequency digit. It's pretty easy to think of a few hundred events that are equally rare and would support the same conclusion.