The Neyman-Scott paradox

I’m mostly posting this to let you know about Peter Coles’s nice exposition of something called the Neyman-Scott paradox. If you like thinking about probability and statistics, and in particular about the difference between Bayesian and frequentist ways of looking at things (and who doesn’t like that?), you should read it.

You should read the comments too, which have some actual defenders of the frequentist point of view. Personally, I’m terrible at characterizing frequentist arguments, because I don’t understand how those people think. To be honest, I think that Peter is a bit unfair to the frequentists, for reasons that you’ll see if you read my comment on his post. Briefly, he seems to suggest that “the frequentist approach” to this problem is not what actual frequentists would do.

The Neyman-Scott paradox is a somewhat artificial problem, although Peter argues that it’s not as artificial as some people seem to think. But the essential features of it are contained in a very common situation, familiar to anyone who’s studied statistics, namely estimating the variance of a set of random numbers.

Suppose that you have a set of measurements x1, …, xm. They’re all drawn from the same probability distribution, which has an unknown mean and variance. Your job is to estimate the mean and variance.

The standard procedure for doing this is worked out in statistics textbooks all over the place. You estimate the mean simply by averaging together all the measurements, and then you estimate the variance as

s² = [(x1 – x̄)² + (x2 – x̄)² + … + (xm – x̄)²] / (m – 1),

where x̄ is the estimated mean.

That is, you add up the squared deviations from the (estimated) mean, and divide by m – 1.

If nobody had taught you otherwise, you might be inclined to divide by m instead of m – 1. After all, the variance is supposed to be the mean of the squared deviations. But dividing by m leads to a biased estimate: on average, it’s a bit too small. Dividing by m – 1 gives an unbiased estimate.

In my experience, if a scientist knows one fact about statistics, it’s this: divide by m – 1 to get the variance.

Suppose that you know that the numbers are normally distributed (a.k.a. Gaussian). Then you can find the maximum-likelihood estimator of the mean and variance. Maximum-likelihood estimators are often (but not always) used in traditional (“frequentist”) statistics, so this seems like it might be a sensible thing to do. But in this case, the maximum-likelihood estimator turns out to be that bad, biased one, with m instead of m – 1 in the denominator.

The Neyman-Scott paradox just takes that observation and presents it in very strong terms. First, they set m = 2, so that the difference between the two estimators will be most dramatic. Then they imagine many repetitions of the experiment (that is, many pairs of data points, with different means but the same variance). Often, when you repeat an experiment many times, you expect the errors to get smaller, but in this case, because the error in question is bias rather than noise (that is, because it shifts the answer consistently one way), repeating the experiment doesn’t help. So you end up in a situation where you might have expected the maximum-likelihood estimate to be good, but it’s terrible.
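If you want to see the effect rather than take my word for it, here’s a quick numerical sketch (my own toy code, not Peter’s calculation). It generates many pairs of Gaussian measurements, each pair with its own mean but a common variance, and compares the maximum-likelihood estimate of that common variance with the divide-by-(m – 1) version:

```python
import numpy as np

# Toy version of the Neyman-Scott setup: many pairs of Gaussian measurements,
# each pair with its own mean but a shared variance.
rng = np.random.default_rng(1)

true_sigma2 = 4.0
n_pairs = 100_000
means = rng.uniform(-10, 10, size=n_pairs)            # a different mean for each pair
data = rng.normal(means[:, None], np.sqrt(true_sigma2), size=(n_pairs, 2))

# Squared deviations of each measurement from its own pair's mean.
dev2 = (data - data.mean(axis=1, keepdims=True)) ** 2

mle = dev2.sum() / (2 * n_pairs)        # divide by m = 2 per pair: the ML estimate
unbiased = dev2.sum() / (1 * n_pairs)   # divide by m - 1 = 1 per pair

print(f"true variance      : {true_sigma2}")
print(f"maximum likelihood : {mle:.3f}   (stuck near half the true value)")
print(f"divide by m - 1    : {unbiased:.3f}")
```

No matter how large you make n_pairs, the maximum-likelihood number stays stuck near half the true variance: collecting more pairs averages down the noise but does nothing about the bias.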

Bayesian methods give a thoroughly sensible answer. Peter shows that in detail for the Neyman-Scott case. You can work it out for other cases if you really want to. Of course the Bayesian calculation has to give sensible results, because the set of techniques known as “Bayesian methods” really consist of nothing more than consistently applying the rules of probability theory.

As I’ve suggested, it’s unfair to say that frequentist methods are bad because the maximum-likelihood estimator is bad. Frequentists know that the maximum-likelihood estimator doesn’t always do what they want, and they don’t use it in cases like this. In this case, a frequentist would choose a different estimator. The problem is in the word “choose”: the decision of what estimator to use can seem mysterious and arbitrary, at least to me. Sometimes there’s a clear best choice, and sometimes there isn’t. Bayesian methods, on the other hand, don’t require a choice of estimator. You use the information at your disposal to work out the probability distribution for whatever you’re interested in, and that’s the answer.

(Yes, “the information at your disposal” includes the dreaded prior. Frequentists point that out as if it were a crushing argument against the Bayesian approach, but it’s actually a feature, not a bug.)

Ranking the election forecasters

In Slate, Jordan Ellenberg asks How will we know if Nate Silver was right?

It’s a good question. If you have a bunch of models that make probabilistic predictions, is there any way to tell which one was right? Every model will predict some probability for the outcome that actually occurs. As long as that probability is nonzero, how can you say the model was wrong?

Essentially every question that a scientist asks is of this form. Because measurements always have some uncertainty, you can virtually never say that the probability of any given outcome is exactly zero, so how can you ever rule anything out?

The answer, of course, is statistics. You don’t rule things out with absolute certainty, but you rule them out with high confidence if they fit the data badly. And “fit the data badly” essentially means “have a low probability of occurring.”

So Ellenberg proposes that all the modelers publish detailed probabilities for all possible outcomes (specifically, all possible combinations of victories by the candidates in each state). Once we know the outcome, the one who assigned the highest probability to it is the best.

In statistics terminology, what he’s proposing is simply ranking the models by likelihood. That is indeed a standard thing to do, and if I had to come up with something, it’s what I’d suggest too. In this case, though, it’s probably not going to give a definitive answer, simply because all the forecasters will probably have comparable probabilities for the one outcome that will occur.

All of those probabilities will be low, because there are lots of possible outcomes, and any given one is unlikely. That doesn’t matter. What matters is whether they’re all similar. If 538 predicts a probability of 0.8%, and Princeton Election Consortium predicts 0.0000005%, then I agree that 538 wins. But what if the two predictions are 0.8% and 0.5%? The larger number still wins, but how strong is that evidence?

The way to answer that question is to use a technique called reasoning (or as some old-fashioned people insist on calling it, Bayesian reasoning). Bayes’s theorem gives a way of turning those likelihoods into posterior probabilities, which are the probabilities that any given model is correct, given the evidence. The answer depends on the prior probabilities — how likely you thought each model was before the data came in. If, as I suspect, the likelihoods come out comparable to each other, then the final outcome depends strongly on the prior probabilities. That is, the new information won’t change your mind all that much.
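To make that concrete, here’s a tiny sketch of the calculation (with made-up numbers, not real forecasts). Each model assigns some probability to the outcome that actually occurred, and Bayes’s theorem turns those numbers, together with your priors, into posterior probabilities that each model is the right one:

```python
# Convert model likelihoods into posterior probabilities for each model.
def posteriors(priors, likelihoods):
    unnormalized = [p * L for p, L in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

priors = [0.5, 0.5]   # no initial preference between the two models

print(posteriors(priors, [0.008, 0.000000005]))  # 0.8% vs 0.0000005%: clear winner
print(posteriors(priors, [0.008, 0.005]))        # 0.8% vs 0.5%: barely moves the needle
```

With likelihoods of 0.8% versus 0.0000005%, the first model ends up with essentially all of the posterior probability; with 0.8% versus 0.5%, the posterior barely budges from the prior.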

If things turn out that way, then Ellenberg’s proposal won’t answer the question, but that’s because there won’t be any good way to answer the question. The Bayesian analysis is the correct one, and if it says that the posterior distribution depends strongly on the prior, then that means that the available data don’t tell you who’s better, and there’s nothing you can do about it.

 

Gerrymandering without gerrymanderers

The Washington Post has a nice graphic explaining how gerrymandering works.

[Graphic: the Washington Post’s illustration of how district lines can be drawn to favor one party]

If the people drawing the district lines prefer one party, they can draw the lines as on the right: they pack as many opponents as possible into a few districts, so that those opponents win those districts in huge landslides. Then they spread out their people to win the other districts by slight majorities.

The solution people generally propose is to put the district-drawing power into the hands of non-partisan people or groups. While I think this is certainly a good idea, it’s worth mentioning that you can get “unfair” results even without deliberate gerrymandering. In particular, if members of the political parties happen to cluster in different ways, then even a nonpartisan system of drawing districts can lead to one party being overrepresented.

I don’t propose to dig into the details of this as it affects US politics. Briefly, the US House of Representatives is more Republican than it “should” be: the fraction of representatives who are Republican is larger than the nationwide fraction of votes cast for Republicans in House races. Similar statements are true for various states’ US House delegations and for state legislatures, sometimes favoring the Democrats. No doubt you can find a lot more about this if you dig around a bit. Instead, I just want to illustrate with a made-up example how you can get gerrymandering-like results even if no one is deliberately gerrymandering.

To forestall any misunderstanding, let me be 100% clear: I am not saying that there is no deliberate gerrymandering in US politics. I am saying that it need not be the whole explanation, and that even if we implemented a nonpartisan redistricting system, some disparity could remain.

I should also add that nothing I’m going to say is original to me. On the contrary, people who think about this stuff have known about it forever. But a lot of my politically-aware friends, who know all about deliberate gerrymandering, haven’t thought about the ways that “automatic gerrymandering” can happen.

Imagine a country called Squareland, whose population is distributed evenly throughout a square area. Half the population are Whigs, and half are Mugwumps. As it happens, the two political parties are unevenly distributed, with more Whigs in some areas and more Mugwumps in others:
[Map: the distribution of Whigs and Mugwumps across Squareland]
The bluer a region is, the more Whigs live there. But remember, each region has the same total number of people, and the total numbers of Whigs and Mugwumps nationwide are equal.

You ask a nonpartisan group to divide this region up into 400 Congressional districts. They’re not trying to help one party or another, so they decide to go for simple, compact districts:

[Map: the same population divided into 400 equal square districts]

Each district has an election and comes out either Whig or Mugwump, depending on who has a majority in the local population. In this particular example, you get 197 Mugwump districts and 203 Whig districts. Pretty good.

Now suppose that the people are distributed differently:

[Map: a second population distribution, with the Whigs concentrated in compact blue regions]

Note that the blue regions are much more compact. There’s lots of dull red area, which is majority-Mugwump, but no bright red extreme Mugwump majorities. The blue regions, on the other hand, are extreme. The result is that there’s a lot more red area than blue, even though there are equal numbers of red folks and blue folks.

Use the same districts:

[Map: the second population distribution with the same 400 square districts]

This time, the Mugwumps win 245 seats, and the Whigs get only 155. Nobody deliberately drew district lines to disenfranchise the Whigs, but it happened anyway. And the reason it happened is very similar to what you’d get with deliberate gerrymandering: the Whigs got concentrated into a small number of districts with big majorities.
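In case you’d like to play with this yourself, here’s a toy version of the Squareland experiment (my own quick sketch, not the code that made the maps above). The country is a fine grid of equal-population cells, f gives the local Whig fraction, and the “nonpartisan” districts are just a 20 × 20 array of equal squares:

```python
import numpy as np

N = 200                          # fine grid: N x N equal-population cells
D = 20                           # D x D = 400 square districts
y, x = np.mgrid[0:N, 0:N] / N    # cell coordinates in the unit square

def seats(f):
    """Return (Whig districts, Mugwump districts) for a Whig-fraction map f."""
    district_share = f.reshape(D, N // D, D, N // D).mean(axis=(1, 3))
    whig = int((district_share > 0.5).sum())
    return whig, D * D - whig

# Scenario 1: gentle, spread-out variation in party preference.
f1 = 0.5 + 0.08 * np.sin(2 * np.pi * x) * np.sin(2 * np.pi * y)

# Scenario 2: Whigs clumped into a few compact blobs, sitting in a mild
# Mugwump-majority countryside; rescaled so the nationwide split is 50/50.
f2 = np.full((N, N), 0.45)
centers = [(0.2, 0.2), (0.25, 0.75), (0.7, 0.3), (0.65, 0.8)]
bumps = sum(np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * 0.09 ** 2))
            for cx, cy in centers)
f2 = f2 + bumps * (0.5 - f2.mean()) / bumps.mean()

for label, f in [("spread-out Whigs", f1), ("clumped Whigs", f2)]:
    w, m = seats(f)
    print(f"{label}: nationwide Whig share {f.mean():.3f}, "
          f"Whig seats {w}, Mugwump seats {m}")
```

With the spread-out distribution the seats come out essentially even; with the clumped one, the Whigs win far fewer districts than their 50% of the vote, even though nobody drew a single line with partisan intent.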

Once again, I’m not saying that this is what’s happened in the US House of Representatives. On the contrary, the evidence of deliberate gerrymandering is very strong, and given the incentives, it would be quite surprising if it did not occur. But if members of one party tend to “clump” more strongly than members of the other party, then this sort of effect can certainly occur, and could form a part of the discrepancy we see.

 

UR is a place “Where Great Research Meets Great Teaching”

According to the Wall Street Journal (paywall, unfortunately).

[Screenshot from the Wall Street Journal article]

Most of the various college rankings strike me as somewhat silly, but I’ll make an exception for this one, because it says something nice about my home institution, which coincides very well with my own impression. Lots of places claim to be good at both teaching and research, but my experience at UR is that we really mean it. The faculty are excited both about their scholarship and about working closely with undergraduates, and the university’s reward structure conveys that both are valued.

I always say that student-faculty research is the best thing about this place.  I’m glad the WSJ agrees.

In case you’re wondering, the data used to generate this list consisted of two pieces: the number of research papers per faculty member, and a student survey asking about faculty accessibility and opportunities for collaborative learning. A school had to do well on both to make the list, although I don’t think they gave the exact recipe for combining them.

Eclipse!

In case you don’t know, there’s a total eclipse coming to North America next year, on August 21 to be precise. You can find lots of information about it by Googling, naturally.

You’ll only be able to see a total eclipse along a certain path through the US. I wanted to know where on the path I should go. In particular, where is the weather most likely to be clear on that day?

I asked my brother, who’s a climate scientist, where to go to find this out. He pointed me to the cloud-fraction data from the MODIS instruments on NASA’s Terra and Aqua satellites, which provide whole-earth cloud cover maps. Here’s the answer:

[Map: average cloud cover across the US for late August, with the center of the eclipse path of totality drawn in red]

The red curve is the middle of the path of totality of the eclipse. The eclipse will be total over a band of maybe 30-50 miles on either side. The colors show average cloud cover. Black is good and white is bad.

(In case you’re dying to know the details, here they are. I grabbed the data for the past 10 years for the time around August 21. The most sensible data product to use seems to be the 8-day averages. August 21 is right on the edge of one of those 8-day windows, so I took the two eight-day windows on either side, and averaged together those 20 maps.)
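For the curious, the averaging step is nothing fancy. Here’s a minimal sketch (not my actual processing script); it assumes the twenty 8-day cloud-fraction maps have already been read into 2-D arrays on a common grid, and I stand in for them here with random placeholder arrays:

```python
import numpy as np

def mean_cloud_fraction(maps):
    """Average a list of 2-D cloud-fraction maps, ignoring missing (NaN) pixels."""
    return np.nanmean(np.stack(maps), axis=0)

# Placeholder data: twenty maps on a 180 x 360 grid, with one missing pixel.
rng = np.random.default_rng(0)
fake_maps = [rng.uniform(0, 1, size=(180, 360)) for _ in range(20)]
fake_maps[0][10, 20] = np.nan

avg = mean_cloud_fraction(fake_maps)
print(avg.shape, float(avg.min()), float(avg.max()))
```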

Here’s another way to look at it: a graph of average cloud cover along the path of totality, across the US. The red dashed lines show state boundaries. The path clips little corners of Kansas, Illinois, and North Carolina, which I didn’t bother to mark.

[Graph: average cloud cover along the path of totality, west to east across the US, with state boundaries marked by red dashed lines]

After making my map, I found this one, which reaches similar conclusions.

 

[Map: the other cloud-cover analysis of the eclipse path mentioned above]

Hotel rooms along the path of totality are already hard to find. Make your plans right away!

We still don’t know if there have been alien civilizations

Pedants (a group in which I have occasionally been included) often complain that nobody uses the phrase “beg the question” correctly anymore. It’s supposed to refer to the logical fallacy of circular reasoning — that is, of assuming the very conclusion for which you are arguing. Because the phrase is so often used to mean other things, you can’t use it in this traditional sense anymore, at least not if you want to be understood.

I’ve never found this to be a big problem, because the traditional meaning isn’t something I want to talk about very often. Until today.

The article headlined Yes, There Have Been Aliens in today’s New York Times is the purest example of question-begging I’ve seen in a long time. The central claim is that “we now have enough information to conclude that they [alien civilizations] almost certainly existed at some point in cosmic history.”

The authors use a stripped-down version of the Drake equation, which is the classic way to talk about the number of alien civilizations out there. The Drake equation gives the expected number of alien civilizations in our Galaxy in terms of a bunch of probabilities and related numbers, such as the fraction of all stars that have planets and the fraction of planets on which life evolves. Of course, we don’t know some of these numbers, particularly that last one, so we can’t draw robust conclusions.

The authors estimate that “unless the probability for evolving a civilization on a habitable-zone planet is less than one in 10 billion trillion, then we are not the first” such civilization. Based on this number, they conclude that “the degree of pessimism required to doubt the existence, at some point in time, of an advanced extraterrestrial civilization borders on the irrational.”

Nonsense. It’s not the least bit irrational to believe that this probability is so low. We have precisely no evidence as to the value of the probability in question. Any conclusion you draw from this value is based solely on your prior (evidence-free) estimate of the probability.
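To see how completely the answer hinges on that one number, here’s a back-of-the-envelope version of the calculation (the planet count is a round order of magnitude I’m assuming for illustration, not the authors’ exact figure):

```python
# With N habitable-zone planets, the expected number of civilizations is N * p,
# so everything hinges on the unknown per-planet probability p.
N = 1e22   # rough order-of-magnitude planet count (an assumption)
for p in (1e-10, 1e-22, 1e-30):
    print(f"p = {p:.0e}: expected number of civilizations ~ {N * p:.1e}")
```

Pick p = 10^-10 and civilizations are everywhere; pick p = 10^-30 and we’re almost certainly alone. Nothing in the data tells you which to pick.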

I mean the phrase “evidence-free” in a precise Bayesian sense: All nonzero values of that probability are equally consistent with the world we observe around us, so no observation causes us to prefer any value over another.

They’d revoke my Bayesian card if I didn’t point out that there’s no problem with the fact that your conclusions depend on your prior probabilities. All probabilities do (with the possible exception of statements about pure mathematics and logic). But it’s absurd to say that it’s “irrational” to believe that the probability is below a certain value, when your assessment of that probability is determined entirely by your prior beliefs, with no contribution from actual evidence.

This sort of argument is occasionally known as “proof by Goldberger’s method”:

The proof is by the method of reductio ad absurdum. Suppose the result is false. Why, that’s absurd! QED.

 

Electability update

As I mentioned before, a fair amount of conversation about US presidential politics, especially at this time in the election cycle, is speculation about the “electability” of various candidates. If your views are aligned with one party or the other, so that you care more about which party wins than which individual wins, it’s natural to throw your support to the candidate you think is most electable. The problem is that you may not be very good at assessing electability.

I suggested that electability should be thought of as a conditional probability: given that candidate X secures his/her party’s nomination, how likely is the candidate to win the general election? The odds offered by the betting markets give assessments of the probabilities of nomination and of victory in the general election. By Bayes’s theorem, the ratio of the two is the electability.

Here’s an updated version of the table from my last post, giving the candidates’ probabilities:

Party      | Candidate | Nomination probability (%) | Election probability (%) | Electability (%)
Democrat   | Clinton   | 70.5                       | 44                       | 63
Democrat   | Sanders   | 28.5                       | 19.5                     | 68
Republican | Bush      | 8.5                        | 3.5                      | 41
Republican | Cruz      | 13.5                       | 5.4                      | 40
Republican | Rubio     | 32.5                       | 15                       | 46
Republican | Trump     | 47.5                       | 29.5                     | 62

As before, these are numbers from PredictIt, which is a betting market where you can go wager real money.
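If you want to check the arithmetic, the electability column is just the ratio of the two probabilities. Here’s a quick sketch using the PredictIt-based percentages from the table (small differences from the table come from rounding):

```python
# (nomination probability %, election probability %) for each candidate
predictit = {
    "Clinton": (70.5, 44.0),
    "Sanders": (28.5, 19.5),
    "Bush":    (8.5, 3.5),
    "Cruz":    (13.5, 5.4),
    "Rubio":   (32.5, 15.0),
    "Trump":   (47.5, 29.5),
}

# Electability = P(win general) / P(win nomination), expressed as a percentage.
for name, (p_nom, p_win) in predictit.items():
    print(f"{name}: {100 * p_win / p_nom:.0f}%")
```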

If you use numbers from PredictWise, they look quite different:

Party      | Candidate | Nomination probability (%) | Election probability (%) | Electability (%)
Democrat   | Clinton   | 84                         | 53                       | 63
Democrat   | Sanders   | 16                         | 8                        | 50
Republican | Bush      | 7                          | 3                        | 43
Republican | Cruz      | 8                          | 2                        | 25
Republican | Rubio     | 32                         | 13                       | 41
Republican | Trump     | 51                         | 18                       | 35

PredictWise aggregates information from various sources, including multiple betting markets as well as polling data. I don’t know which one is better. I do know that if you think PredictIt is wrong about any of these numbers, then you can go there and place a bet. Since PredictWise is an aggregate, there’s no correspondingly obvious way to make money off of it. If you do think the PredictWise numbers are way off, then it’s probably worth looking around at the various betting markets to see if there are bets you should be making: since PredictWise got its values in large part from these markets, there may be.

To me, the most interesting numbers are Trump’s. Many of my lefty friends are salivating over the prospect of his getting the nomination, because they think he’s unelectable. PredictIt disagrees, but PredictWise agrees. I don’t know what to make of that, but it remains true that, if you’re confident Trump is unelectable, you have a chance to make some money over on PredictIt.

My old friend John Stalker, who is an extremely smart guy, made a comment on my previous post that’s worth reading. He raises one technical issue and one broader issue.

The technical point is that whether you can make money off of these bets depends on the bid-ask spread (that is, the difference in prices to buy or sell contracts). That’s quite right.  I would add that you should also consider the opportunity cost: if you make these bets, you’re tying up your money until August (for bets on the nomination) or November (for bets on the general election). In deciding whether a bet is worthwhile, you should compare it to whatever investment you would otherwise have made with that money.

John’s broader claim is that “electability” as that term is generally understood in this context means something different from the conditional probabilities I’m calculating:

I suspect that by the term “electability” most people mean the candidate’s chances of success in the general election assuming voters’ current perceptions of them remain unchanged, rather than their chances in a world where those views have changed enough for them to have won the primary.

You should read the rest yourself.

I think that I disagree, at least for the purposes that I’m primarily interested in. As I mentioned, I’m thinking about my friends who hope that Trump gets the nomination because it’ll sweep a Democrat into the White House. I think that they mean (or at least, they should mean) precisely the conditional probability I’ve calculated. I think that they’re claiming that a world in which Trump gets the nomination (with whatever other events or changes go along with that) is a world in which the Democrat wins the Presidency. That’s what my conditional probabilities are about.

But as I said, John’s an extremely smart guy, so maybe he’s right and I’m wrong.

Horgan on Bayes

John Horgan has a piece at Scientific American‘s site entitled “Bayes’s Theorem: What’s the Big Deal?” The article’s conceit is that, after hearing people touting Bayesian reasoning to him for many years, he finally decided to learn what it was all about and explain it to his readers.

His explanation is not bad at first. He gets a lot of it from this piece by Eliezer Yudkowsky, which is very good but very long. (It does have jokes sprinkled through it, so keep reading!) Both Yudkowsky and Horgan emphasize that Bayes’s theorem is actually rather obvious. Horgan:

This example [of the probability of false positives in medical tests] suggests that the Bayesians are right: the world would indeed be a better place if more people—or at least more health-care consumers and providers–adopted Bayesian reasoning.

On the other hand, Bayes’ theorem is just a codification of common sense. As Yudkowsky writes toward the end of his tutorial: “By this point, Bayes’ theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose.”

That’s right! Bayesian reasoning is simply the (unique) correct way to reason quantitatively about probabilities, in situations where the experimental evidence doesn’t let you draw conclusions with mathematical certainty (i.e., pretty much all situations).

Unfortunately, Horgan eventually goes off the rails:

The potential for Bayes abuse begins with P(B), your initial estimate of the probability of your belief, often called the “prior.” In the cancer-test example above, we were given a nice, precise prior of one percent, or .01, for the prevalence of cancer. In the real world, experts disagree over how to diagnose and count cancers. Your prior will often consist of a range of probabilities rather than a single number.

In many cases, estimating the prior is just guesswork, allowing subjective factors to creep into your calculations. You might be guessing the probability of something that–unlike cancer—does not even exist, such as strings, multiverses, inflation or God. You might then cite dubious evidence to support your dubious belief. In this way, Bayes’ theorem can promote pseudoscience and superstition as well as reason.

The problem he’s talking about is, to use a cliche, not a bug but a feature. When the evidence doesn’t prove, with mathematical certainty, whether a statement is true or false (i.e., pretty much always), your conclusions must depend on your subjective assessment of the prior probability. To expect the evidence to do more than that is to expect the impossible.

In the example Horgan is using, suppose that a cancer test is given with known rates of false positives and false negatives. The patient tests positive. In order to interpret that result and decide how likely the patient is to have cancer, you need a prior probability. If you don’t have one based on data from prior studies, you have to use a subjective one.
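Here’s that example worked through in a few lines (the 1 percent prior is the one quoted above; the test error rates are placeholders of my own, not necessarily the ones Horgan used). The point is that the same positive test result yields very different conclusions depending on the prior:

```python
# P(cancer | positive test) from Bayes' theorem.
def posterior(prior, p_pos_given_cancer, p_pos_given_healthy):
    p_pos = prior * p_pos_given_cancer + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_cancer / p_pos

# Illustrative test: 90% sensitivity, 5% false-positive rate (my assumptions).
for prior in (0.01, 0.001, 0.1):
    print(f"prior {prior:5.3f} -> P(cancer | positive) = "
          f"{posterior(prior, 0.90, 0.05):.3f}")
```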

The doctor and patient in such a situation will, inevitably, decide what to do next based on some combination of the test result and their subjective prior probabilities. The only choice they have is whether to do it unconsciously or consciously.

The second paragraph quoted above is simply nonsense. If you apply Bayesian reasoning to any of those things that may or may not exist, you will reach conclusions that combine your prior belief with the evidence. I have no idea in what sense doing this “promote[s] pseudoscience.” More importantly, I have no idea what alternative Horgan would have us choose.

Here’s the worst part of the piece:

Embedded in Bayes’ theorem is a moral message: If you aren’t scrupulous in seeking alternative explanations for your evidence, the evidence will just confirm what you already believe. Scientists often fail to heed this dictum, which helps explains why so many scientific claims turn out to be erroneous. Bayesians claim that their methods can help scientists overcome confirmation bias and produce more reliable results, but I have my doubts.

Horgan doesn’t cite any examples of erroneous claims that can be blamed on Bayesian reasoning. In fact, this statement seems to me to be nearly the exact opposite of the truth.

There’s been a lot of angst in the past few years about non-replicable scientific findings. One of the main contributors to this problem, as far as I can tell, is that scientists are not using Bayesian reasoning: they are interpreting p-values as if they told us whether various hypotheses are true or not, without folding in any prior information.
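A simple calculation shows how this goes wrong (the numbers are illustrative, not a claim about any particular field). If only a modest fraction of the hypotheses being tested are actually true, then even a properly run test with p < 0.05 leaves a substantial chance that a “significant” result is a false positive:

```python
prior_true = 0.10   # fraction of tested hypotheses that are really true (assumption)
power = 0.80        # P(p < 0.05 | real effect) (assumption)
alpha = 0.05        # P(p < 0.05 | no effect)

p_sig = prior_true * power + (1 - prior_true) * alpha
p_true_given_sig = prior_true * power / p_sig
print(f"P(effect is real | p < 0.05) = {p_true_given_sig:.2f}")  # about 0.64, not 0.95
```

The p-value by itself can’t tell you how likely the effect is to be real; that depends on the prior, whether you acknowledge it or not.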

The world is getting better, in at least one way

I’m checking the page proofs for an article that my student Haonan Liu and I have had accepted for publication in Physical Review D. I’ve worked with dozens of undergraduates on research projects over the years, and this is by far the most substantial work ever done by any of them. Huge congratulations to Haonan! (And to my friends on admissions committees for physics Ph.D. programs: look out for this guy’s application.)

Incidentally, we submitted this article before the Open Journal for Astrophysics opened its doors, so this isn’t the one I referred to in my last post. That one isn’t finished yet.

Along with the page proofs come a few comments and queries from the editors, to make sure that the published version of the article looks correct. That document says, in part,

The editors now encourage insertion of article titles in references to journal articles and e-prints.

If you’re not in physics or astronomy, this sentence probably seems strange: how could you possibly not include the titles of articles? If you do work in physics or astronomy, you’ve probably gotten used to the fact that we generally don’t give titles in citations, but this is an incredibly stupid thing. When you’re reading a paper, and you have to decide if it’s worth the bother of looking up a cited article, the title might actually be useful information! Other disciplines include titles. I’ve never understood why we don’t. Thank you to Physical Review for this bit of sanity.

Here’s what a bit of the bibliography originally looked like:

[Screenshot: bibliography entries in the original format, without article titles]

 

Now it’ll be

[Screenshot: the same entries with article titles included]

Much better!

Of course, the standard LaTeX styles used for formatting articles for publication in physics journals don’t include article titles, so including them at this stage actually took a bit of effort on my part, but I was glad to do it. I hope other journals follow this practice. Maybe I’ll mention it to someone on the board of the Open Journal.

The Open Journal of Astrophysics

I’m pleased to point out that the Open Journal of Astrophysics is now open for submissions. Editor-in-chief Peter Coles has all the details. I’m a member of the editorial board of this new journal, although I confess that I have done nothing to help with it so far.

The journal performs peer review like other scholarly journals. Articles in it are published on the arXiv (where most astrophysics articles get posted anyway). Because the journal does not have the overhead of traditional publishing, it is free for both authors and readers.

I find the economics of traditional scholarly journals utterly baffling. As Coles observes, “The only useful function that journals provide is peer review, and we in the research community do that (usually for free) anyway.” I hope that efforts like this one will point the way to a more efficient system. I urge my astrophysics colleagues to submit articles to it.

Now let me confess to a bit of hypocrisy, or at least timidity. I’m hoping to submit an article for publication in the next few weeks, and I’m planning to send it to an established journal, not the Open Journal. The only reason is that I expect to apply for promotion (from Associate Professor to Full Professor) this summer, and I think there’s a significant possibility that some of the people evaluating my application will be more impressed by an established journal, with all the various accoutrements such as impact factors that go along with it.

This is quite possibly the last time in my career that I’ll have to worry about this sort of thing. In general, I care about the opinions of people who have actually read my work and formed a judgment based on its merits. Such people don’t need to rely on things like impact factors, which are a terribly stupid way to evaluate quality. So after this one, I promise to submit future articles to the Open Journal (unless I have coauthors whom I can’t persuade to do it, I guess).