10 Scientific Ideas That Scientists Wish You Would Stop Misusing

That’s the headline of a piece on io9.  I find the headline a bit obnoxious —  we scientists are lecturing you, the unwashed masses, about what you’re doing wrong, when in fact scientists themselves are to blame for at least some of the misunderstandings described. But the actual content is very good.

Sean Carroll says very sensible things about “proof”. Science is mostly about accumulation of evidence, which allows us to update our model of the world via Bayesian reasoning (or as I like to call it “reasoning”).

Jordan Ellenberg takes aim at “statistically significant”:

“Statistically significant” is one of those phrases scientists would love to have a chance to take back and rename. “Significant” suggests importance; but the test of statistical significance, developed by the British statistician R.A. Fisher, doesn’t measure the importance or size of an effect; only whether we are able to distinguish it, using our keenest statistical tools, from zero. “Statistically noticeable” or “Statistically discernable” would be much better.

Well said. The fact that something can be “statistically significant” and simultaneously utterly unimportant is very often lost, particularly in descriptions of medical findings.

This item illustrates what bothers me about the headline of the piece, by the way. It smacks of blaming the victim. Scientists are at least as much to blame as anyone else for talking about “statistically significant” results in a misleading way.

The other items are well worth reading too. I particularly recommend the ones on quantum weirdness and “natural”.


Female hurricane names

Supposedly, hurricanes with feminine-sounding names are more deadly than those with male-sounding names. That’s the conclusion of a study published in PNAS. Put me down on the side of skepticism.

I should point out that the published paper discusses two different sets of results: an analysis of the death toll from past hurricanes, and a set of surveys of people’s perceptions of hurricanes based only on their names. The latter study shows that people do perceive hurricanes as milder when they have feminine-sounding names (in the absence of other information). I’m quite prepared to believe that one. The first finding, about actual people actually dying, is the one I don’t believe.

First, this is precisely the sort of result that’s most susceptible to publication bias. (If you checked for this and found nothing, you wouldn’t publish, but if you found something, you would.) This is the main reason that some people claim that “most published research findings are false.”
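You can see why with a little arithmetic. Here's a minimal sketch (the three input numbers are assumptions I picked for illustration, not estimates of any real field):

```python
# Ioannidis-style back-of-the-envelope: if only "significant" results get
# published, what fraction of published positive findings are false?

prior_true = 0.10   # assumed fraction of tested hypotheses that are really true
power      = 0.80   # assumed chance of detecting a true effect
alpha      = 0.05   # significance threshold

true_pos  = prior_true * power          # true effects that reach significance
false_pos = (1 - prior_true) * alpha    # null effects significant by luck

print(f"False fraction of published findings: {false_pos / (true_pos + false_pos):.0%}")
# ~36% with these numbers; drop prior_true to 0.01 and it's ~86%.
```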

Add to that the closely related problem that’s sometimes known as p-hacking. This is the practice of testing out multiple hypotheses, data sets, or statistical methods, and only reporting the ones that yield interesting results. P-hacking artificially inflates the statistical significance of your results. From the PNAS paper:

The analyses showed that the change in hurricane fatalities as a function of MFI [a measure of how masculine or feminine a name is perceived to be] was marginal for hurricanes lower in normalized damage, indicating no effect of masculinity-femininity of name for less severe storms. For hurricanes higher in normalized damage, however, this change was substantial, such that hurricanes with feminine names were much deadlier than those with masculine names.

To summarize, “the first thing we tried didn’t yield an interesting result, but we kept trying until we found something that did.” (To give the authors credit, at least they acknowledge that they did this. That doesn’t make the result right, but it’s more honest than the alternative.)
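To see how much p-hacking can inflate things, here's a quick simulation (parameters made up): there is no effect at all, but the analyst runs five analyses and reports whichever gives the smallest p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_tries, n_per_group = 10_000, 5, 50

false_positives = 0
for _ in range(n_studies):
    # No real effect: both groups are drawn from the same distribution.
    pvals = [stats.ttest_ind(rng.normal(size=n_per_group),
                             rng.normal(size=n_per_group)).pvalue
             for _ in range(n_tries)]
    if min(pvals) < 0.05:          # report only the "best" analysis
        false_positives += 1

print("Nominal false-positive rate: 5%")
print(f"Actual rate after {n_tries} tries: {false_positives / n_studies:.0%}")
# Roughly 1 - 0.95**5, i.e. about 23%.
```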

But here’s the biggest problem. The study used hurricanes from 1950 onwards. Hurricanes were essentially all given female names until 1978, so the masculine names are heavily weighted toward late times. So something else that produces a change in hurricane deadliness over time would mimic the effect that is seen. An obvious candidate: better early warning systems, which reduce fatalities for any given level of storm severity.

Not surprisingly, other people have pointed out problems like these. The authors of the study have responded. If you want to decide what you think about this, you should definitely read what they have to say. I’ll just point out a few things.

On the most important issue (the question of whether to use pre-1979 all-female-named hurricanes), the authors point out that their data used degree of perceived femininity of names, not simply a male-female binary, so there is information in the pre-1979 names. This is not sufficient to resolve the problem, which is that the pre-1979 data are, on average, very different in MFI from the later data. They then say

Although it is true that if we model the data using only hurricanes since 1979 (n=54) this is too small a sample to obtain a significant interaction, when we model the fatalities of all hurricanes since 1950 using their names’ degree of femininity, the interaction between name-femininity and damage is statistically significant.

This response, of course, simply digs them deeper into the p-hacking hole. Essentially, they’re saying that they used the less reliable data set because the more reliable one didn’t give an interesting result.

The authors give a second, better response to this problem:

We included elapsed years (years since the hurricane) in our modeling and this did not have any significant effect in predicting fatalities. In other words, how long ago the storm occurred did not predict its death toll.

This is potentially a good argument, although the details depend on precisely what tests they performed. The relevant question is whether replacing the MFI with the time since the hurricane gives a model that fits reasonably well (in which case assuming an MFI effect is not necessary). Unfortunately, these results are not presented in the paper, so there’s no way to tell if that’s what was done. The authors do mention that they tried including elapsed time as one of their variables, but without enough specifics to tell what was actually done.
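Concretely, the comparison I'd want to see is something like the following sketch. The file and column names are hypothetical, and I'm assuming a negative-binomial regression, which I believe is the kind of model the paper used:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical file and column names: one row per hurricane, with the
# variables the paper discusses (deaths, normalized damage, MFI, elapsed years).
df = pd.read_csv("hurricanes.csv")

# Model A: fatalities as a function of damage and name femininity (MFI).
model_mfi = smf.glm("deaths ~ damage * mfi", data=df,
                    family=sm.families.NegativeBinomial()).fit()

# Model B: the same, with elapsed time in place of MFI.
model_time = smf.glm("deaths ~ damage * years_elapsed", data=df,
                     family=sm.families.NegativeBinomial()).fit()

# If Model B fits about as well (or better), an MFI effect isn't needed.
print("AIC with MFI:         ", model_mfi.aic)
print("AIC with elapsed time:", model_time.aic)
```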

The more I think about it, the stranger it is that this issue is not addressed in detail in the paper. Usually, in this sort of study, you control for other variables that might mimic the signal you’re looking for. In this case, an overall drift in hurricane deadliness with time is by far the most obvious such variable, and yet it’s not included in any of the models for which they compute goodness of fit. It’s very strange that the authors would not include this in their models, and even stranger that the reviewers let them.

Even if BICEP2 is wrong, inflation is still science

Paul Steinhardt played a major role in developing the theory behind cosmological inflation, but he has since turned into one of the theory’s biggest detractors. Sometimes, theorists get so attached to their theories that they become blind proponents of them, so it’s quite commendable for someone to become a critic of a theory that he pioneered. But of course that doesn’t mean that Steinhardt’s specific criticisms are correct.

He’s got a short and fiery manifesto in Nature (not behind a paywall, I think, but if you can’t get to it, let me know). The title and subheading:

Big Bang blunder bursts the multiverse bubble

Premature hype over gravitational waves highlights gaping holes in models for the origins and evolution of the Universe.

For a column like this (as opposed to a research article), the author isn’t necessarily responsible for the title, but in this case the headlines pretty accurately capture the tone of the piece.

The hook for the piece is the controversy surrounding the BICEP2 claim to have detected the signature of gravitational waves from inflation in the cosmic microwave background (CMB) radiation. Since my last post on this, the reasons for doubt have gotten stronger: two preprints have come out giving detailed arguments that the BICEP team has not made a convincing case against the possibility that their signal is due to dust contamination. The BICEP team continues to say everything is fine, but, as far as I know, they have not provided a detailed rebuttal of the arguments in the preprints.

For what it’s worth, I find the doubts raised in these preprints to be significant. I’m not saying the BICEP2 result is definitely not CMB, but there’s significant doubt in my mind. At this point, I would place an even-odds bet that they have not seen CMB, but I wouldn’t make the bet at 5-1 odds.

So I share Steinhardt’s skepticism about the BICEP2 claim, at least to some extent. But he leaps from this to a bunch of ridiculously overblown statements about the validity of inflation as a scientific theory.

The common view is that [inflation] is a highly predictive theory. If that was the case and the detection of gravitational waves was the ‘smoking gun’ proof of inflation, one would think that non-detection means that the theory fails. Such is the nature of normal science. Yet some proponents of inflation who celebrated the BICEP2 announcement already insist that the theory is equally valid whether or not gravitational waves are detected. How is this possible?

The “smoking gun” is a terribly overused metaphor in this context, but here it’s actually helpful to take it quite seriously. A smoking gun is strong evidence that a crime has been committed, but the absence of a smoking gun doesn’t mean there was no crime. That’s exactly the way it is with inflation, and despite what Steinhardt says, this is perfectly consistent with “normal” science. People searched for the Higgs boson for decades before they found it. When a search failed to find it, that didn’t mean that the Higgs didn’t exist or that the standard model (which predicted the existence of the Higgs) wasn’t “normal science.”

Steinhardt knows this perfectly well, and by pretending otherwise he is behaving shamefully.

Steinhardt goes on to say

The answer given by proponents is alarming: the inflationary paradigm is so flexible that it is immune to experimental and observational tests.

Whenever someone attributes an opinion to unnamed people and provides no citation to back up the claim, you should assume you’re being swindled. I know of no “proponent” of inflation who bases his or her support on this rationale.

There is a true statement underlying this claim: inflation is not a unique theory but rather a family of theories. There are many different versions of inflation, which make different predictions. To put it another way, the theory has adjustable parameters. Again, this is a perfectly well-accepted part of “normal science.” If BICEP2 turns out to be right, they will have measured some of the important parameters of the theory.

It’s certainly not true that inflation is immune to tests. To cite just one obvious example, inflation predicts a spatially flat Universe. If we measured the curvature of the Universe and found it to be significantly different from zero, that would be, essentially, a falsification of inflation. As it turns out, inflation passed this test.

I put the word “essentially” in there because what inflation actually predicts is that the probability of getting a curved Universe is extremely low, not that it’s zero. So a measurement of nonzero curvature wouldn’t constitute a mathematical proof that inflation was false. Once again, that’s science. No matter what Popper says, what we get in science is (probabilistic) evidence for or against theories, not black-and-white proof. We use Bayesian reasoning (or as I like to call it, “reasoning”) to draw conclusions from this evidence. A curved Universe would have been extremely strong evidence against inflation.
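To put toy numbers on this: suppose inflation makes a flat Universe 99.9% probable, while without inflation flatness is a coin flip. (Both numbers are invented for illustration.) Then:

```python
# Invented probabilities, just to illustrate "probabilistic falsification".
p_flat_if_inflation = 0.999
p_flat_if_not       = 0.5

# A measurement of significant curvature would multiply the odds for
# inflation by the likelihood ratio:
print((1 - p_flat_if_inflation) / (1 - p_flat_if_not))   # 0.002 -- damning
# The flatness we actually measured is modest evidence in favor:
print(p_flat_if_inflation / p_flat_if_not)               # ~2.0
```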

Part of Steinhardt’s objection to inflation stems from the fact that inflationary models often predict a multiverse. That is, in these theories there are different patches of the Universe with different properties.

Scanning over all possible bubbles in the multiverse, everything that can physically happen does happen an infinite number of times. No experiment can rule out a theory that allows for all possible outcomes. Hence, the paradigm of inflation is unfalsifiable.

This once again ignores the fact that essentially all scientific tests are probabilistic in nature. Because measurements always have some uncertainty, you pretty much never measure anything that allows you to conclude “X is impossible.” Instead, you get measurements that, by means of [Bayesian] reasoning, lead to the conclusion “X is extremely unlikely.” Even if anything can happen in these bubbles, some things happen much more than others, and hence are much more likely. So by observing whether our patch of the Universe fits in with the likely outcomes of inflation or the unlikely ones, we build up evidence for or against the theory. Normal science.

To be fair, I should say that there are technical issues associated with this paradigm. Because inflation often predicts an infinite number of bubbles, there are nontrivial questions about how to calculate probabilities. The buzzword for this is the “measure problem.” To be as charitable as possible to Steinhardt, I suppose I should allow for the possibility that that’s what he’s referring to here, but I don’t think that that’s the most natural reading of the text, and in any case it’s far from clear that the measure problem is as serious as all that.

One final note. As Steinhardt says, future experiments will shed light on the BICEP2 situation, and these experiments will justifiably face heightened scrutiny:

This time, the teams can be assured that the world will be paying close attention. This time, acceptance will require measurements over a range of frequencies to discriminate from foreground effects, as well as tests to rule out other sources of confusion. And this time, the announcements should be made after submission to journals and vetting by expert referees. If there must be a press conference, hopefully the scientific community and the media will demand that it is accompanied by a complete set of documents, including details of the systematic analysis and sufficient data to enable objective verification.

For what it’s worth, I don’t think that people should necessarily wait until results have been refereed before announcing them publicly. In astrophysics, it’s become standard to release preprints publicly before peer review. I know that lots of scientists disagree about this, but on balance I think that that’s a good thing. The doubts that have been raised about BICEP2 could very easily not have been caught by the journal referees. If they’d waited to announce the result publicly until after peer review, we could easily be having the same argument months later, about something that had undergone peer review. Errors are much more likely to be caught when the entire community is scrutinizing the results rather than one or two referees.

I should add that Steinhardt is completely right about the “accompanied by a complete set of documents” part.

Update: Peter Coles has a very nice post (as usual) on this. His views and mine are extremely similar.

AAS journals going electronic-only

The American Astronomical Society just announced that they’ll stop producing paper copies of their journals. The Society publishes some of the leading journals in astronomy and astrophysics — the several different flavors of Astrophysical Journal (main journal, letters, supplement series) and the Astronomical Journal — so they’re not exactly a bit player.

The days when people actually looked things up in paper copies of journals are long gone, so this change makes a lot of sense to me. One good consequence: if there’s still a stigma associated with online-only journals (i.e., the notion that something online-only can’t be a “real” journal), the conversion of high-profile journals to online-only has to help combat it.

I’ve heard people say that paper copies are the best way to create a permanent archive of the scholarly record — maybe in 100 years, nobody will be able to read all the electronic copies that are out there. Maybe that’s right, but I doubt it. It’s true that old digital information eventually becomes practically unreadable — I threw out a bunch of floppy disks not too long ago, for instance — but the reason I lost that information is that it’s material I never tried to preserve in a useful form. Whatever future changes in data storage technology come along, I bet that we can update our electronic scholarly journals accordingly.

The AAS has offered electronic-only subscriptions for a while now, at about 60% the cost of a full (paper+electronic) subscription. The price is not bad compared to other journals, and the profits go to benefit the Society, which I think is a good thing. Still, it’s hard for me to see what value the journal is supplying that justifies the costs. The most important thing a journal does is provide peer review, but the actual peer reviewers do it for free.

Replication

I heard (via Sean Carroll) about this piece in Science headlined “Replication Effort Provokes Praise—And ‘Bullying’ Charges.” It’s about efforts to replicate published results in certain areas of psychology.

In general, I think that publication bias and dodgy statistics are real problems in science, so I’d bet that lots of results, particularly those that are called “significant” because they clear the ridiculously weak threshold of 5%, are wrong. Apparently lots of people, particularly in certain parts of psychology, are worried about this. I think it’s great for people to try to replicate past results and find out. (Medical researchers are on the case too, particularly John Ioannidis, who claims that “most published research findings are false.”)

The most striking part of the Science piece is the “bullying” claim. It seems ridiculous on its face for a scientist to complain about other people trying to replicate their results. Isn’t that what science is all about? But I can understand in part what they’re worrying about. You can easily imagine someone trying to replicate your work, doing something wrong (or perhaps just different from what you did), and then publicly shaming you because your results couldn’t be replicated. For instance,

Schnall [the original researcher] contends that Donnellan’s effort [to replicate Schnall’s results] was flawed by a “ceiling effect” that, essentially, discounted subjects’ most severe moral sentiments. “We tried a number of strategies to deal with her ceiling effect concern,” Donnellan counters, “but it did not change the conclusions.” Donnellan and his supporters say that Schnall simply tested too few people to avoid a false positive result. (A colleague of Schnall’s, Oliver Genschow, a psychologist at Ghent University in Belgium, told Science in an e-mail that he has successfully replicated Schnall’s study and plans to publish it.)

The solution, of course, is for Donnellan to describe clearly what he did and how it differs from Schnall’s work. The readers can then decide (using Bayesian reasoning, or as I like to call it, “reasoning”) whether those differences matter and hence how much to discount the original work.

The piece quotes Daniel Kahneman giving an utterly sane point of view:

To reduce professional damage, Kahneman calls for a “replication etiquette,” which he describes in a commentary published with the replications in Social Psychology. For example, he says, “the original authors of papers should be actively involved in replication efforts” and “a demonstrable good-faith effort to achieve the collaboration of the original authors should be a requirement for publishing replications.”

If the two groups work in good faith to do a good replication, it’ll make the final results easier to interpret. If the original group refuses to work with people who are trying to replicate their results, well, everyone is entitled to take that into account when performing (Bayesian) reasoning about whether to believe the original results.


Dust or not?

Following the recent rumor, some more useful information has been coming out about questions that some people are raising about whether the BICEP experiment really has seen signs of gravitational waves from inflation in the polarization of the cosmic microwave background radiation. The Washington Post has by far the best news article I’ve seen on the subject: it actually quotes people on the record, rather than repeating vague anonymous speculation.

The original rumor seems to be generally true, in the sense that it accurately described some criticisms that cosmologists were making about the BICEP analysis. The rumor does seem to have exaggerated and/or oversimplified things, and of course whether those criticisms are valid or not remains to be seen.

The best place I know of to get the technical details is this talk by Raphael Flauger. (Unfortunately, the video doesn’t show the slides as he’s talking, so if you want to follow it, download the slides first and try to follow along as he talks.) He argues that the dust models used by the BICEP team are inaccurate for a few reasons, mostly having to do with the issue raised in the original rumor: the BICEP team appears to have built part of their dust model from an image in a slide from a talk, and they seem (he claims) to have misinterpreted what was in that slide. In addition (he claims), there are other errors associated with digitizing the image rather than using the real data (which BICEP doesn’t have access to). Flauger further claims that when you use a different (better?) dust model, the possible contribution of dust to what BICEP saw gets significantly larger, possibly large enough to explain their signal.

If BICEP has offered a detailed, technical rebuttal to this criticism, I haven’t seen it yet.

My personal assessment, based on obviously incomplete information: Flauger’s arguments seem to me to need serious consideration. BICEP needs to supply a detailed response. As of now, I don’t know whether he’s right or not, but my view has changed somewhat since the original rumor. The available information now does seem to me sufficient to substantially lower my own estimate of the probability that BICEP has seen primordial gravitational waves. I was fairly skeptical all along, but now I’m more skeptical. If you must know, I’d put the probability significantly below 50%.


Rumors

The story so far:

  • BICEP2 announces a detection of B modes in the cosmic microwave background (CMB) polarization on large angular scales. If this result is correct, it’s very strong evidence that inflation happened in the very early Universe and is a really big deal. But that “if” part is important: we shouldn’t place too much confidence in this result until it’s independently confirmed.
  • In the blog Résonaances, Adam Falkowski publishes a rumor that an error has been found in the BICEP2 analysis.
  • Various science news outlets pick up the story (particularly this one and this one). They ask the BICEP2 people what they think, and the BICEP2 people vehemently stand by their results.

So what are we supposed to think?

The key claim in the Résonaances post is that the BICEP2 team made an error in modeling Galactic dust. This is potentially a big deal: a key part of the analysis is testing to make sure that the signal seen in the data is due to the CMB and not to boring, nearby sources such as dust.

Résonaances:

To estimate polarized emission from the galactic dust, BICEP digitized an unpublished 353 GHz map shown by the Planck collaboration at a conference.  However, it seems they misinterpreted the Planck results: that map shows the polarization fraction for all foregrounds, not for the galactic dust only (see the “not CIB subtracted” caveat in the slide). Once you correct for that and rescale the Planck results appropriately, some experts claim that the polarized galactic dust emission can account for most of the BICEP signal.

This looks to me like it might be at least partially true.

There is not a definitive map of polarized Galactic dust emission, so the BICEP team had to cobble together models of dust from different sources. They did so in several different ways: section 9.1 of their paper lists six different dust models. One of these models is based on data from the Planck satellite. It appears that they created the model using a digitized image of a slide from a talk by the Planck people, because the relevant data hadn’t been released in any other form. (Footnote 33 of the paper is the evidence for this last statement, in case you want to check it out.) The evidence does seem to me to support Falkowski’s statement: the image in question explicitly says “not CIB subtracted,” meaning that the data that went into that image includes other stuff besides what the BICEP team wanted. This does seem like a flaw in the construction of this particular model.

But it seems to me that Falkowski greatly overstates the significance of this flaw. For one thing, this is just one of six dust models used in the analysis. It was regarded as in some sense the “best” of them, but the more important point is that the other models yielded similar results. The BICEP team’s claim, as I understand it, is that the entire analysis, taking into account all the models, makes it implausible that dust is the source of the signal. Even if you throw out this model, I don’t think that that claim is significantly weakened.

As I’ve said before, I don’t think that the BICEP team has made a thoroughly convincing case that what they’ve seen can’t be foreground contamination. I think we need more data to answer that question. But even if Falkowski has correctly identified an error in the analysis, I don’t think that it changes the level of doubt all that much.

In the past, I’ve found Résonaances to be a good source of information, but I can’t say I’m thrilled with the way Falkowski handled this.

Animal magnetism

Interesting piece in Nature:

Interference from electronics and AM radio signals can disrupt the internal magnetic compasses of migratory birds, researchers report today in Nature. The work raises the possibility that cities have significant effects on bird migration patterns.

That’s from a news item. The actual paper is here (possibly paywalled).

There’s strong evidence that some animals (birds, sharks, and bacteria, among others) respond to the Earth’s magnetic field, but the mechanisms by which they sense the field are still quite uncertain in many cases. Physics Today did a nice overview of this about six years ago. I think it’s fascinating that such a simple question remains unsolved.

The new result appears to be that robins do poorly at orienting themselves to Earth’s magnetic field when they’re in an environment with human-generated radio frequency electromagnetic fields, but when you shield them from those fields, they get better. Here’s a figure from the paper:

The dots around the two blue circles show the way the birds oriented themselves when they were inside of a grounded metal shield. The two red circles show what happened when the shield was not grounded. In each case, the arrow at the center is the average of all the directions, and the dashed circle shows the threshold for a significant deviation from random (5% significance, I believe). The graphs below are the field strengths with and without grounding, as functions of frequency.

These results barely exceed the 5% threshold, but the paper gives results of other similar experiments that show the same pattern.
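For circular data like this, I assume the dashed circle comes from something like the standard Rayleigh test: compute the length of the mean resultant vector of the birds' direction vectors and ask how long it would have to be before uniform randomness could be rejected. A sketch (the sample size is hypothetical):

```python
import numpy as np

def rayleigh_test(angles_rad):
    """Rayleigh test for non-uniformity of directions on a circle."""
    n = len(angles_rad)
    # Mean resultant vector length: 0 for uniform directions, 1 for alignment.
    r = np.hypot(np.mean(np.cos(angles_rad)), np.mean(np.sin(angles_rad)))
    p = np.exp(-n * r**2)     # large-n approximation to the p-value
    return r, p

# The dashed circle would then sit at the r for which p = 0.05:
n = 12                                    # hypothetical number of birds
print(f"5% threshold: r = {np.sqrt(-np.log(0.05) / n):.2f}")
```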

Although the experiment seems to have been well-designed, I have to admit I’m skeptical, for a familiar reason: you should never believe an experiment until it’s been confirmed by a theory. I find it hard to imagine a mechanism for birds to sense magnetic fields that would be disrupted by the weak, low-frequency fields involved here.

The authors acknowledge this:

Any report of an effect of low-frequency electromagnetic fields on a biological system should be subjected to particular scrutiny for at least three reasons. First, such claims in the past have often proved difficult to reproduce. Second, animal studies are commonly used to evaluate human health risks and have contributed to guidelines for human exposures. Third, “seemingly implausible effects require stronger proof”.

Here’s what they say about mechanisms:

The biophysical mechanism that would allow such extraordinarily weak, broadband electromagnetic noise to affect a biological system is far from clear. The energies involved are tiny compared to the thermal energy, k_BT, but the effects might be explained if hyperfine interactions in light-induced radical pairs or large clusters of iron-containing particles are involved.

The “tiny compared to the thermal energy” part is the really puzzling thing. If these electromagnetic fields are having an effect inside the system, they must do it by something absorbing photons (since that’s all an electromagnetic field is). But the energy of a photon at these frequencies is tiny in comparison to the thermal energy sloshing around a biological system anyway, so how could there be an effect?
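To make the mismatch concrete (the 1 MHz figure is my stand-in for the broadband radio noise in the experiment):

```python
h  = 6.626e-34    # Planck constant (J s)
kB = 1.381e-23    # Boltzmann constant (J/K)

f = 1e6           # representative radio frequency, 1 MHz (my assumption)
T = 310           # roughly body temperature (K)

print(f"Photon energy h*f:   {h * f:.2e} J")
print(f"Thermal energy kB*T: {kB * T:.2e} J")
print(f"Ratio:               {h * f / (kB * T):.1e}")   # ~1.5e-7
```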

The first of these two mechanisms seems to refer to one of the proposed mechanisms for magnetoreception in birds, which Wikipedia describes as follows:

According to one model, cryptochrome, when exposed to blue light, becomes activated to form a pair of two radicals (molecules with a single unpaired electron) where the spins of the two unpaired electrons are correlated. The surrounding magnetic field affects the kind of this correlation (parallel or anti-parallel), and this in turn affects the length of time cryptochrome stays in its activated state.

I think that this mechanism involves tiny energy differences between quantum states of a system, depending on how the electron spins are oriented. If the energy differences are tiny enough, then I guess low-frequency EM fields could disrupt the effect. But if the energy differences are that small, then I don’t understand why normal thermal fluctuations don’t mess it up all the time. I guess that for this mechanism to work, the electrons have to be shielded from thermal fluctuations, but external EM fields could still sneak in and mess them up. I guess that might be possible, but I’d want to see the details.

I completely don’t get what the authors are talking about when they refer to “large clusters of iron-containing particles”. I can’t see any conceivable way such particles could be affected by weak oscillating fields of the sort described here.

I have no idea whether you should believe this result or not. I hope that others will attempt to replicate it. If it’s real, it’s got to be a big clue about the interesting puzzle of how birds feel magnetic fields.

What are you going to do with that?

Check out the new blog by the noted cosmologist Lloyd Knox. Once per month, he’ll interview someone with a degree in physics about their career choices and experiences. The first interview, with a medical device physicist, is up.

I think this is a great idea. People don’t have a clear sense of the variety of things that you can do with a physics education. I frequently have parents of prospective students ask me, in a worried tone, whether their children will be able to get jobs after majoring in physics. The answer, of course, is yes: people with physics degrees do very well in the job market. The American Institute of Physics has lots of statistics on this, such as these:


[Two charts of employment statistics for physics degree holders, from the AIP]

(both from this document). But the statistics don’t give a good sense of the variety of things people do with their physics degrees, so Lloyd’s plan to let people tell their stories sounds great.


Important If True

Some 19th-century skeptic is supposed to have said that all churches should be required to bear the inscription “Important If True” above their doors. (Google seems to credit Alexander William Kinglake, whoever he was.) That’s pretty much what I think about the big announcement yesterday of the measurements of cosmic microwave background polarization by BICEP2.

This result has gotten a lot of news coverage, which is fair enough: if it holds up, it’s a very big deal. But personally, I’m still laying quite a bit of stress on the “if true” part of “important if true.” I don’t mean this as any sort of criticism of the people behind the experiment: they’ve accomplished an amazing feat. But this is an incredibly difficult observation, and at this point I can’t manage to regard the results as more than an extremely exciting suggestion of something that might turn out to be true.

Incidentally, I have to point out the most striking quotation I saw in any of the news reports. My old friend Max Tegmark is quoted in the New York Times as saying

I think that if this stays true, it will go down as one of the greatest discoveries in the history of science.

A big thumbs-up to Max for the first clause: lots of people who should know better are leaving that out (unless nefarious editors are to blame). But the main clause of the sentence is frankly ludicrous. It’s natural (and even endearing) that Max is excited about this result, but this isn’t natural selection, or quantum mechanics, or conservation of energy, or the existence of atoms, to name just a few of the “greatest discoveries in the history of science.”

I’ll say a bit about why this result is important, then a bit about why I’m still skeptical. Finally, since the only way to think coherently about any of this stuff is with Bayesian reasoning, I’ll say something about that.

Important

I’m not going to try to explain the science in detail right now. (Other people have.) But briefly, it goes like this. For about 30 years now, cosmologists have suspected that the Universe went through a brief period known as “inflation” at very early times, perhaps as early as 10^-35 seconds after the Big Bang. During inflation, the Universe expanded extremely — almost inconceivably — rapidly. According to the theory, many of the most important properties of the Universe as it exists today originate during inflation.

Quite a bit of indirect evidence supporting the idea of inflation has accumulated over the years. It’s the best theory anyone has come up with for the early Universe. But we’re still far from certain that inflation actually happened.

For quite a while now, people have known about a potentially testable prediction of inflation. During the inflationary period, there should have been gravitational waves (ripples in spacetime) flying around. Those gravitational waves should leave an imprint that can still be seen much later, specifically in observations of the cosmic microwave background radiation (the oldest light in the Universe). To be specific, the polarization of this radiation (i.e., the orientation of the electromagnetic waves we observe) should vary across the sky in a way that has a particular sort of geometric pattern. In the jargon of the field, we should expect to see B-mode microwave background polarization on large angular scales.

That’s what BICEP2 appears to have observed.

If this is correct, it’s a much more direct confirmation of inflation than anything we’ve seen before. It’s very hard to think of any alternative scenario that would produce the same pattern as inflation, so if this pattern is really seen, then it’s very strong evidence in favor of inflation. (The standard metaphor here is “smoking gun.”)

If True

(Let me repeat that I don’t mean the following as any sort of criticism of the BICEP2 team. I don’t think they’ve done anything wrong; I just think that these experiments are hard! It’s pretty much inevitable that the first detection of something like this would leave room for doubt. It’s very possible that these doubts will turn out to be unfounded.)

One big worry in this field is foreground contamination. We look at the microwave background through a haze of nearby stuff, mostly stuff in our own Galaxy. An essential part of this business is to distinguish the primordial radiation from these local contaminants. One of the best ways to do this is to observe the radiation at multiple frequencies. The microwave background signal has a known spectrum — that is, the relative amplitudes at different frequencies are fixed — which is different from the spectra of various contaminants.
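As a toy illustration of the idea (this is not the BICEP analysis, and all the numbers are invented): if you know the spectral scaling of each component, maps at two frequencies turn component separation into a 2×2 linear solve.

```python
import numpy as np

# Invented mixing matrix: how a unit of each component shows up at each frequency.
#                 CMB   dust
A = np.array([[  1.0,  0.3 ],    # e.g. a 150 GHz channel
              [  1.0,  3.0 ]])   # e.g. a 353 GHz channel (dust much brighter)

true_sky = np.array([1.0, 0.5])  # fake CMB and dust amplitudes at one pixel
maps = A @ true_sky              # what the two frequency channels measure

print(np.linalg.solve(A, maps))  # recovers [1.0, 0.5]

# With only one frequency, maps[0] = 1.15 is degenerate: it could equally be
# (CMB=1.15, dust=0) -- no way to tell the components apart.
```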

The (main) data set used to derive the new results was taken at one frequency, which doesn’t allow for this sort of spectral discrimination. The authors of the paper do use additional data at other frequencies, but I’ll be much happier once those data get stronger.

I should say that the authors do give several lines of argument suggesting that foregrounds aren’t the main source of the signal they see, and at least some other people I respect don’t seem as worried about foregrounds as I am, so maybe I’m wrong to be worried about this. We will get more foreground information soon, e.g., from the Planck satellite, so time will tell.

There are other hints of odd things in the data, which may not mean anything. Matt Strassler lays out a couple. One more thing someone pointed out (I can’t immediately track down who): the E-type polarization significantly exceeds predictions in precisely the region (l=50 or so) where the B signal is most significant. The E signal is larger and easier to measure than the B signal. Is this a hint of something wrong?

I’m actually more worried about the problem of “unknown unknowns.” The team has done an excellent job of testing for a wide variety of systematic errors and biases, but I worry that there’s something they (and we) haven’t thought of yet. That seems unfair: how can I ding them for something that nobody’s even thought of? But nonetheless I worry about it.

The solution to that last problem is for another experiment to confirm the results using different equipment and analysis techniques. That’ll happen eventually, so once again, time will tell.

(Digression: I always thought it odd that people mocked Donald Rumsfeld for talking about “unknown unknowns.” I think it was the smartest thing he ever said.)

What Bayes has to say

This section is probably mostly for Allen Downey, but if you’re not Allen, you’re welcome to read it anyway.

My campaign to rename “Bayesian reasoning” with the more accurate label “correct reasoning” hasn’t gotten off the ground for some reason, but the fact remains that Bayesian probabilities are the only coherent way to think about situations like this (and practically everything else!) where we don’t have enough information to be 100% sure.

This paper is definitely evidence in favor of inflation.

P1 = P(BICEP2 observes what it did | inflation happened)

is significantly greater than

P2 = P(BICEP2 observes what it did | inflation didn’t happen)

so your estimate of the probability that inflation happened should go up based on this new information.
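The mechanics of the update are just Bayes’ theorem in odds form. A sketch, with invented numbers:

```python
def update(prior_prob, p1, p2):
    """Posterior probability from a prior and the likelihood ratio P1/P2."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * (p1 / p2)
    return post_odds / (1 + post_odds)

# Invented values of P1 and P2, just to show how the knob turns:
print(update(0.5, p1=0.5, p2=0.05))   # 0.91: a 10:1 ratio takes 50% to ~91%
print(update(0.1, p1=0.5, p2=0.05))   # 0.53: the same data take 10% to ~53%
```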

The question is how much it should go up. I’m not going to try to be quantitative here, but I do think there are a couple of observations worth making.

First, all the stuff in the previous section goes into one’s assessment of P2. Without the possibility of foregrounds or undiagnosed systematic errors messing things up, P2 would be extremely tiny. Your assessment of how likely those problems are is what determines your value of P2 and hence the strength of the evidence.

But there’s more to it than just that. “Inflation” is not just a theory; it’s a family of theories. In particular, it’s possible for inflation to have happened at different energy scales (essentially, different times after the Big Bang), which leads to different predictions for the B-mode amplitude. The amplitude BICEP2 detected is very close to the upper limit on what would have been possible, based on previous information. In fact, in the simplest models, the amplitude BICEP2 sees is inconsistent with previous data; to make everything fit, you have to go to slightly more complicated models. (For the cognoscenti, I’m saying that you seem to need some running of the spectral index to make BICEP2’s amplitude consistent with TT observations.) That makes P1 effectively smaller, reducing the strength of the evidence for inflation.

What I’m saying here is that the tension between BICEP2 and other sources of information makes it more likely that there’s something wrong.

Formally, instead of talking about a single number P1, you should talk about

P1(r,…) = P(BICEP2 observes what it did | r, …).

Here r is the amplitude of the signal produced in inflation and … refer to the additional parameters introduced by the fact that you have to make the model more complicated.

Then the probability that shows up in a Bayesian evidence calculation is the integral of P1(r,…) times a prior probability on the parameters. The thing is that the values of r where P1(r,…) is large are precisely those that have low prior probability (because they’re disfavored by previous data). Also, the more complicated models (with those extra “…” parameters) are in my opinion less likely a priori than simple models of inflation.
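To illustrate the mechanics with a one-parameter cartoon (all the numbers here are invented): if the likelihood peaks where the prior is small, the prior-weighted integral comes out far below the likelihood’s peak value.

```python
import numpy as np

r = np.linspace(0, 0.4, 2000)
dr = r[1] - r[0]

# Invented prior: previous data disfavor large r.
prior = np.exp(-r / 0.05)
prior /= prior.sum() * dr                        # normalize to integrate to 1

# Invented likelihood: data peaked at large r (around 0.2).
likelihood = np.exp(-0.5 * ((r - 0.2) / 0.04) ** 2)

evidence = (likelihood * prior).sum() * dr       # integral of likelihood x prior
print(f"Peak likelihood:     {likelihood.max():.2f}")   # 1.00
print(f"Marginal likelihood: {evidence:.3f}")           # ~0.05, ~20x smaller
```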

So I claim that, when properly integrated over the priors, P1 isn’t as large as you might have thought, and so the evidence for inflation isn’t as strong as it might seem.

Of course, it’s hard to be quantitative about this. I could make up some numbers, but they’d just be illustrative, so I don’t think they’d add much to the argument.