538 on p-hacking

August 24th, 2015

Christie Aschwanden has a piece on fivethirtyeight.com about the supposed “crisis” in science. She writes about recent high-profile results that turned out to be wrong, flaws in peer review, and bad statistics in published papers.

By far the best part of the article is the applet that lets you engage in your own p-hacking. It’s a great way to illustrate what p-hacking is and why it’s a problem. The idea is to take a bunch of data on performance of the US economy over time, and examine whether it has done better under Democrats or Republicans. There are multiple different measures of the economy one might choose to focus on, and multiple ways one might quantify levels of Democratic or Republican power. The applet lets you make different choices and determines whether there’s a statistically significant effect. By fiddling around for a few minutes, you can easily get a “significant” result in either direction.

Go and play around with it for a few minutes.
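If you’d rather see the mechanism in code, here is a minimal sketch of the same game played on pure noise. It is not the applet’s data or method. The 60 “years,” the four outcome measures, the five party-power measures, and the known-variance z-test are all assumptions made up for illustration. Every data set is random, so any “significant” result is a false positive by construction.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b, sigma=1.0):
    """p-value of a z-test for equal means, assuming the noise level sigma is known."""
    std_err = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def smallest_p(rng, n_years=60, n_outcomes=4, n_party_measures=5):
    """Test every (outcome, party measure) pair on pure noise; return the best p."""
    # Hypothetical economic outcomes: nothing but Gaussian noise.
    outcomes = [[rng.gauss(0, 1) for _ in range(n_years)] for _ in range(n_outcomes)]
    party_measures = []
    for _ in range(n_party_measures):
        labels = [0] * (n_years // 2) + [1] * (n_years - n_years // 2)
        rng.shuffle(labels)  # random labels, so there is no real effect to find
        party_measures.append(labels)
    p_values = []
    for series in outcomes:
        for labels in party_measures:
            dem = [x for x, lab in zip(series, labels) if lab == 0]
            rep = [x for x, lab in zip(series, labels) if lab == 1]
            p_values.append(two_sided_p(dem, rep))
    return min(p_values)


def hacked_success_rate(n_worlds=1000, seed=0):
    """Fraction of noise-only data sets in which some analysis gives p < 0.05."""
    rng = random.Random(seed)
    return sum(smallest_p(rng) < 0.05 for _ in range(n_worlds)) / n_worlds


if __name__ == "__main__":
    rate = hacked_success_rate()
    print(f"some analysis is 'significant' in {rate:.0%} of noise-only data sets")
```

With 20 analyses to choose from, the rate should land well above the nominal 5%. If the 20 tests were exactly independent it would be 1 - 0.95^20, or about 64%. The tests in this sketch are only approximately independent, so the simulated figure will be close to that but not identical.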

The rest of the article has some valuable observations, but it’s a bit of a hodgepodge. Curmudgeon that I am, I have to complain about a couple of things.

Here’s a longish quote from the article:

P-hacking is generally thought of as cheating, but what if we made it compulsory instead? If the purpose of studies is to push the frontiers of knowledge, then perhaps playing around with different methods shouldn’t be thought of as a dirty trick, but encouraged as a way of exploring boundaries. A recent project spearheaded by Brian Nosek, a founder of the nonprofit Center for Open Science, offered a clever way to do this.

Nosek’s team invited researchers to take part in a crowdsourcing data analysis project. The setup was simple. Participants were all given the same data set and prompt: Do soccer referees give more red cards to dark-skinned players than light-skinned ones? They were then asked to submit their analytical approach for feedback from other teams before diving into the analysis.

Twenty-nine teams with a total of 61 analysts took part. The researchers used a wide variety of methods, ranging — for those of you interested in the methodological gore — from simple linear regression techniques to complex multilevel regressions and Bayesian approaches. They also made different decisions about which secondary variables to use in their analyses.

Despite analyzing the same data, the researchers got a variety of results. Twenty teams concluded that soccer referees gave more red cards to dark-skinned players, and nine teams found no significant relationship between skin color and red cards.

 

truth-vigilantes-soccer-calls2

The variability in results wasn’t due to fraud or sloppy work. These were highly competent analysts who were motivated to find the truth, said Eric Luis Uhlmann, a psychologist at the Insead business school in Singapore and one of the project leaders. Even the most skilled researchers must make subjective choices that have a huge impact on the result they find.

But these disparate results don’t mean that studies can’t inch us toward truth. “On the one hand, our study shows that results are heavily reliant on analytic choices,” Uhlmann told me. “On the other hand, it also suggests there’s a there there. It’s hard to look at that data and say there’s no bias against dark-skinned players.” Similarly, most of the permutations you could test in the study of politics and the economy produced, at best, only weak effects, which suggests that if there’s a relationship between the number of Democrats or Republicans in office and the economy, it’s not a strong one.

The last paragraph is simply appalling. This is precisely the sort of conclusion you can’t draw. Some methods got marginally “significant” results — if you define “significance” by the ridiculously weak 5% threshold — and others didn’t. The reason p-hacking is a problem is that people may be choosing their methods (either consciously or otherwise) to lead to their preferred conclusion. If that’s really a problem, then you can’t draw any valid conclusion from the fact that these analyses tended to go one way.

As long as I’m whining, there’s this:

Take, for instance, naive realism — the idea that whatever belief you hold, you believe it because it’s true.

Naive realism means different things to psychologists and philosophers, but this isn’t either of them.

Anyway, despite my complaining, there’s some good stuff in here.

How strong is the scientific consensus on climate change?

August 4th, 2015

There is overwhelming consensus among scientists that climate change is real and is caused largely by human activity. When people want to emphasize this fact, they often cite the finding that 97% of climate scientists agree with the consensus view.

I’ve always been dubious about that claim, but in the opposite way from the climate-change skeptics: I find it hard to believe that the figure can be as low as 97%. I’m not a climate scientist myself, but whenever I talk to one, I get the impression that the consensus is much stronger than that.

97% may sound like a large number, but a scientific paradigm strongly supported by a wide variety of forms of evidence will typically garner essentially 100% consensus among experts. That’s one of the great things about science: you can gather evidence that, for all practical purposes, completely settles a question. My impression was that human-caused climate change was pretty much there.

So I was interested to see this article by James Powell, which claims that the 97% figure is a gross understatement. The author estimates the actual figure to be well over 99.9%. A few caveats: I haven’t gone over the paper in great detail, this is not my area of expertise, and the paper is still under peer review. But I find this result quite easy to believe, both because the final number sounds plausible to me and because the main methodological point in the paper strikes me as unquestionably correct.

Powell points out that the 97% figure was derived in a way that excluded a large number of articles from consideration:

 

Cook et al. (2013) used the Web of Science to review the titles and abstracts of peer-reviewed articles from 1991-2011 with the keywords “global climate change” and “global warming.” With no reason to suppose otherwise, the reader assumes from the title of their article, “Quantifying the consensus on anthropogenic global warming [AGW] in the scientific literature,” that they had measured the scientific consensus on AGW using consensus in its dictionary definition and common understanding: the extent to which scientists agree with or accept the theory. But that is not how CEA used consensus.

Instead, CEA defined consensus to include only abstracts in which the authors “express[ed] an opinion on AGW (p. 1).” No matter how clearly an abstract revealed that its author accepts AGW, if it did “not address or mention the cause of global warming (p. 3),” CEA classified the abstract as having “no position” and omitted it from their calculation of the consensus. Of the 11944 articles in their database, 7930 (66.4%) were labeled as taking no position. If AGW is the ruling paradigm of climate science, as CEA set out to show, then rather than having no position, the vast majority of authors in that category must accept the theory. I return to this point below.
CEA went on to report that “Among abstracts expressing a position on AGW, 97.1% endorsed the consensus position that humans are causing global warming (p. 6, italics added).” Thus, the now widely adopted “97% consensus” refers not to what scientists accept, the conventional meaning, but to whether they used language that met the more restrictive CEA definition.

This strikes me as a completely valid critique. When there is a consensus on something, people tend not to mention it at all. Articles in physics journals pretty much never state explicitly in the abstract that, say, quantum mechanics is a good theory, or that gravity is what makes things fall.
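To see how much the excluded abstracts matter, here is a small worked example using only the three numbers quoted above (11944 abstracts, 7930 with “no position,” 97.1% endorsement among the rest). This is not Powell’s calculation, which I haven’t reproduced. The function `consensus` and its single knob, the fraction of silent authors who accept AGW, are my own illustrative assumptions.

```python
TOTAL_ABSTRACTS = 11944   # from the passage quoted above
NO_POSITION = 7930        # from the passage quoted above
ENDORSE_SHARE = 0.971     # endorsement among abstracts that took a position


def share_no_position():
    """Fraction of abstracts that were dropped from the 97% calculation."""
    return NO_POSITION / TOTAL_ABSTRACTS


def consensus(accepting_fraction_of_silent):
    """Overall acceptance if this fraction of the 'no position' authors accept AGW.

    Assumption (mine, not the paper's): every abstract not counted as
    accepting is counted against the consensus, the least favorable reading.
    """
    took_position = TOTAL_ABSTRACTS - NO_POSITION
    endorsing = ENDORSE_SHARE * took_position
    return (endorsing + accepting_fraction_of_silent * NO_POSITION) / TOTAL_ABSTRACTS


if __name__ == "__main__":
    print(f"no position: {share_no_position():.1%}")
    for frac in (0.971, 0.99, 1.0):
        print(f"silent authors accepting = {frac:.1%} -> overall {consensus(frac):.2%}")
```

The overall figure stays at 97.1% only if the silent two-thirds accept AGW at exactly the same rate as the authors who said so explicitly. If the silent authors accept it at a higher rate, as you’d expect for a ruling paradigm, the overall figure rises above 97%.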

I particularly recommend the paper’s section on plate tectonics as an explicit example of how the above methodology would lead to a false conclusion.

Let me be clear: I haven’t studied this paper enough to vouch for the complete correctness of the methodology contained in it. But it looks to me like the author has made a convincing case that the 97% figure is a great understatement.

 

Prescription: retire the words “prescriptivist” and “descriptivist”

July 29th, 2015

If, like me, you like arguments about English grammar and usage, you frequently come across the distinction between “prescriptivists” and “descriptivists.” Supposedly, prescriptivists are people who think that grammatical rules are absolute and unchanging. Descriptivists, on the other hand, supposedly think that there’s no use in rules that tell people how they should speak and write, and that all that matters is the way people actually do speak and write. As the story goes, descriptivists regard prescriptivists as persnickety schoolmarms, while prescriptivists regard descriptivists as people with no standards who are causing the language to go to hell in a handbasket.

In fact, of course, these terms are mostly mere invective. If you call someone either of these names, you’re almost always trying to tar that person with an absurd set of beliefs that nobody actually holds.

A prescriptivist, in the usual telling, is someone who thinks that all grammatical rules are set in stone forever. (Anyone who actually believed this would presumably have to speak only in Old English, or better yet, proto-Indo-European.) A descriptivist, on the other hand, is supposedly someone who thinks that all ways of writing and speaking are equally valid, and that we therefore shouldn’t teach students to write standard English grammar.

I can’t claim that there is nobody who believes either of these caricatures — I’m old enough to have learned that there’s someone out there who believes pretty much any foolish thing you can think of — but I do claim that practically nobody you’re likely to encounter when wading into a usage debate fits them.

I was reminded of all this when listening to an interview with Bryan Garner, author of Modern American Usage, which is by far the best usage guide I know of. Incidentally, I’m not the only one in my household who likes Garner’s book: my dog Nora also devoured it.

nora-garner

Like many people, I first learned about Garner in David Foster Wallace’s essay “Authority and American Usage,” which originally appeared in Harper’s and was reprinted in his book Consider the Lobster. Wallace’s essay is very long and shaggy, but it’s quite entertaining (if you like that sort of thing) and has some insightful observations. The essay is nominally a review of Garner’s usage guide, but it turns into a meditation on the nature of grammatical rules and their role in society and education.

From his book, Garner strikes me as a clear thinker about lexicographic matters, so I was disappointed to hear him go in for the most simpleminded straw-man caricatures of the hated descriptivists in that interview.

Garner’s scorn is mostly reserved for Steven Pinker. Pinker is pretty much the arch-descriptivist, in the minds of those people for whom that is a term of invective. But that hasn’t stopped him from writing a usage guide, in which he espouses some (but not all) of the standard prescriptivist rules. Pinker’s and Garner’s approaches actually have something in common: both try to give reasons for the various rules they advocate, rather than simply issuing fiats. But because Garner thinks of Pinker as a loosey-goosey descriptivist, he can’t bring himself to engage with Pinker’s actual arguments.

Garner says that Pinker has “flip-flopped,” and that his new book is “a confused book, because he’s trying to be prescriptivist while at the same time being descriptivist.” As it turns out, what he means by this is that Pinker has declined to nestle himself into the pigeonhole that Garner has designated for him. I’ve read all of Pinker’s general-audience books on language (his first such book, The Language Instinct, may be the best pop-science book I’ve ever read), and I don’t see the new one as contradicting the previous ones. Pinker has never espoused the straw-man position that all ways of writing are equally good, or that there’s no point in trying to teach people to write more effectively. Garner thinks that that’s what a descriptivist believes, and so he can’t be bothered to check.

The Language Instinct has a chapter on “language mavens,” which is the place to go for Pinker’s views on prescriptive grammatical rules. (That chapter is essentially reproduced as an essay published in the New Republic.) Garner has evidently read this chapter, as he mockingly summarizes Pinker’s view as “You shouldn’t criticize the way people use language any more than you should criticize how whales emit their moans,” which is a direct reference to an analogy found in this chapter. But he either deliberately or carelessly misleads the reader about its meaning.

Pinker is not saying that there are no rules that can help people improve their writing. Rather, he’s making the simple but important point that scientists are more interested in studying language as a natural system (how do people talk) than in how they should talk.

So there is no contradiction, after all, in saying that every normal person can speak grammatically (in the sense of systematically) and ungrammatically (in the sense of nonprescriptively), just as there is no contradiction in saying that a taxi obeys the laws of physics but breaks the laws of Massachusetts.

Pinker is a descriptivist because, as a scientist, he’s more interested in the first kind of rules than the second kind. It doesn’t follow from this that he thinks the second kind don’t or shouldn’t exist. A physicist is more interested in studying the laws of physics than the laws of Massachusetts, but you can’t conclude from this that he’s an anarchist.

(Pinker does unleash a great deal of scorn on the language mavens, not for saying that there are guidelines that can help you improve your writing, but for saying specific stupid things, which he documents thoroughly and, in my opinion, convincingly.)

Although Pinker is the only so-called descriptivist Garner mentions by name, he does tar other people by saying, “There was this view, in the mid-20th century, that we should not try to change the dialect into which somebody was born.” He doesn’t indicate who those people were, but my best guess is that this is a reference to the controversy that arose over Webster’s Third in the early 1960s. If so, it sounds as if Garner (like Wallace) is buying into a mythologized version of that controversy.

It seems to me that the habit of lumping people into the “prescriptive” and “descriptive” categories is responsible for Garner’s inability to pay attention to what Pinker et al. are actually saying (and for various other silly things he says in this interview). All sane people agree with the prescriptivists that some ways of writing are more effective than others and that it’s worthwhile to try to teach people what those ways are. All sane people agree with the descriptivists that some specific things written by some language “experts” are stupid, and that at least some prescriptive rules are mere shibboleths, signaling membership in an elite group rather than enhancing clarity. All of the interesting questions arise after you acknowledge that common ground, but if you start by dividing people up according to a false dichotomy, you never get to them.

Hence my prescription.

 

It’s still not rocket science

July 28th, 2015

In the last couple of days, I’ve seen a little flareup of interest on social media in the “reactionless drive” that supposedly generates thrust without expelling any sort of propellant. This was impossible a year ago, and it’s still impossible.

OK, it’s not literally impossible in the mathematical sense, but it’s close enough. Such a device would violate the law of conservation of momentum, which is an incredibly well-tested part of physics. Any reasonable application of reasoning (or as some people insist on calling it, Bayesian reasoning) says, with overwhelmingly high probability, that conservation of momentum is right and this result is wrong.

Extraordinary claims require extraordinary evidence, never believe an experiment until it’s been confirmed by a theory, etc.

The reason for the recent flareup seems to be that another group has replicated the original group’s results. They actually do seem to have done a better job. In particular, they did the experiment in a vacuum. Bizarrely, the original experimenters went to great lengths to describe the vacuum chamber in which they did their experiment, and then noted, in a way that was easy for a reader to miss, that the experiments were done “at ambient pressure.” That’s important, because stray air currents were a plausible source of error that could have explained the tiny thrust they found.

The main thing to note about the new experiment is that they are appropriately circumspect in describing their results. In particular, they make clear that what they’re seeing is almost certainly some sort of undiagnosed effect of ordinary (momentum-conserving) physics, not a revolutionary reactionless drive.

We identified the magnetic interaction of the power feeding lines going to and from the liquid metal contacts as the most important possible side-effect that is not fully characterized yet. Our test campaign can not confirm or refute the claims of the EMDrive …

Just because I like it, let me repeat what my old friend John Baez said about the original claim a year ago. The original researchers speculated that they were seeing some sort of effect due to interactions with the “quantum vacuum virtual plasma.” As John put it,

 “Quantum vacuum virtual plasma” is something you’d say if you failed a course in quantum field theory and then smoked too much weed.

I’ll take Bayes over Popper any day

July 26th, 2015

A provocative article appeared on the arXiv last month:

 

Inflation, evidence and falsifiability

Giulia Gubitosi, Macarena Lagos, Joao Magueijo, Rupert Allison

(Submitted on 30 Jun 2015)
In this paper we consider the issue of paradigm evaluation by applying Bayes’ theorem along the following nested chain of progressively more complex structures: i) parameter estimation (within a model), ii) model selection and comparison (within a paradigm), iii) paradigm evaluation … Whilst raising no objections to the standard application of the procedure at the two lowest levels, we argue that it should receive an essential modification when evaluating paradigms, in view of the issue of falsifiability. By considering toy models we illustrate how unfalsifiable models and paradigms are always favoured by the Bayes factor … We propose a measure of falsifiability (which we term predictivity), and a prior to be incorporated into the Bayesian framework, suitably penalising unfalsifiability …

(I’ve abbreviated the abstract.)

Ewan Cameron and Peter Coles have good critiques of the article. Cameron notes specific problems with the details, while Coles takes a broader view. Personally, I’m more interested in the sort of issues that Coles raises, although I recommend reading both.

The nub of the paper’s argument is that the method of Bayesian inference does not “suitably penalise” theories that are unfalsifiable. My first reaction, like Coles’s, is not to care much, because the idea that falsifiability is essential to science is largely a fairy tale. As Coles puts it,

In fact, evidence neither confirms nor discounts a theory; it either makes the theory more probable (supports it) or makes it less probable (undermines it). For a theory to be scientific it must be capable of having its probability influenced in this way, i.e. amenable to being altered by incoming data “i.e. evidence”. The right criterion for a scientific theory is therefore not falsifiability but testability.

Here’s pretty much the same thing, in my words:

For rhetorical purposes if nothing else, it’s nice to have a clean way of describing what makes a hypothesis scientific, so that we can state succinctly why, say, astrology doesn’t count.  Popperian falsifiability nicely meets that need, which is probably part of the reason scientists like it.  Since I’m asking you to reject it, I should offer up a replacement.  The Bayesian way of looking at things does supply a natural replacement for falsifiability, although I don’t know of a catchy one-word name for it.  To me, what makes a hypothesis scientific is that it is amenable to evidence.  That just means that we can imagine experiments whose results would drive the probability of the hypothesis arbitrarily close to one, and (possibly different) experiments that would drive the probability arbitrarily close to zero.

Sean Carroll is also worth reading on this point.
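To make “amenable to evidence” concrete, here is a toy sketch of Bayesian updating. The setup is entirely hypothetical: the hypothesis H says a coin lands heads 70% of the time, the only alternative says 50%, and we start from a prior of 1/2. The point is just that some imaginable data sets push P(H) toward one and others push it toward zero.

```python
import math
import random


def posterior(flips, prior=0.5, p_heads_h=0.7, p_heads_alt=0.5):
    """P(H | flips), where H says 'heads probability is p_heads_h'."""
    log_odds = math.log(prior / (1 - prior))
    for heads in flips:
        like_h = p_heads_h if heads else 1 - p_heads_h
        like_alt = p_heads_alt if heads else 1 - p_heads_alt
        log_odds += math.log(like_h / like_alt)  # Bayes's theorem in odds form
    # Convert log-odds back to a probability without overflowing exp().
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    return math.exp(log_odds) / (1 + math.exp(log_odds))


def simulate_flips(p_heads, n, seed=0):
    """n coin flips (True = heads) from a coin with the given heads probability."""
    rng = random.Random(seed)
    return [rng.random() < p_heads for _ in range(n)]


if __name__ == "__main__":
    for true_p in (0.7, 0.5):
        for n in (10, 100, 1000):
            p_h = posterior(simulate_flips(true_p, n))
            print(f"true heads prob {true_p}, {n:4d} flips: P(H | data) = {p_h:.4f}")
```

A hypothesis that assigned the same likelihood to every outcome as its rival would leave the log-odds unchanged no matter how many flips you fed in. In this toy setting, that is what failing to be amenable to evidence looks like.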

The problem with the Gubitosi et al. article is not merely that the emphasis on falsifiability is misplaced, but that the authors reason backwards from the conclusion they want to reach, rather than letting logic guide them to a conclusion. Because Bayesian inference doesn’t “suitably” penalize the theories they want to penalize, it “should” be replaced by something that does.

Bayes’s theorem is undisputedly true (that’s what the word “theorem” means), and conclusions derived from it are therefore also true. (That’s what I mean when I use the phrase “Bayesian reasoning, or as I like to call it, ‘reasoning.’”) To be precise, Bayesian inference is the provably correct way to draw probabilistic conclusions in cases where your data do not provide a conclusion with 100% logical certainty (i.e., pretty much all cases outside of pure mathematics and logic).

When reading this paper, it’s worthwhile keeping track of all of the places where words like “should” appear, and asking yourself what is meant by those statements. Are they moral statements? Aesthetic ones? And in any case, recall Hume’s famous dictum that you can’t reason from “is” to “ought”: those “should” statements are not, and by their nature cannot be, supported by the reasoning that leads up to them.

In particular, Gubitosi et al. are sad that the data don’t sufficiently disfavor the inflationary paradigm, which they regard as unfalsifiable. But their sadness is irrelevant. The Universe may have been born in an inflationary epoch, even if the inflation paradigm does not meet their desired falsifiability criterion. The way you should decide how likely that is is Bayesian inference.

How did I not know about this?

June 26th, 2015

I’ve been programming in IDL for a couple of decades. How did I not know about this bizarre behavior of its random number generator?

 

Screenshot 2015-06-25 10.57.07

Apparently, this is something people know about, but somehow I’d missed it for all this time.

That fake chocolate weight-loss study

June 17th, 2015

I’m a little late getting to this, but in case you haven’t heard about it, here it is.

A journalist named John Bohannon did a stunt recently in which he and some coauthors “published” a “study” that “showed” that chocolate caused weight loss. (The reasons for the scare quotes will become apparent.) The work was picked up by a bunch of news outlets. Bohannon wrote about the whole thing on io9. It’s also been picked up in a bunch of other places, including the BBC radio program More or Less (which I’ve mentioned before a few times).

The idea was to do a study that was shoddy in precisely the ways that many “real” studies are, get it published in a low-quality journal, and see if they could get it picked up by credulous journalists.

My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

There is an interesting question of journalistic ethics here. Bohannon calls himself a journalist, but he deliberately introduced bad science into the mediasphere with the specific intent of deception. Is it OK for a journalist to do that, if his motives are pure? I don’t know.

I don’t want to focus on that sort of issue, because I don’t have anything non-obvious to say. Instead, I want to dig a bit into the details of what Bohannon et al. did. Although Bohannon’s io9 post is well worth reading and gets the big picture largely right, it’s wrong or misleading in a few ways, which happen to be the sort of thing I care about.

Bohannon et al. recruited a group of subjects and divided them into three groups: a control group, a group that was put on a low-carb diet, and a group that was put on a low-carb diet but also told to eat a certain amount of chocolate each day. The chocolate group lost weight faster than the other groups. The result was “statistically significant,” in the usual meaning of that term: the p-value was below 0.05.

So what was wrong with this study? As Bohannon explains it in his io9 post,

Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.

The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?

P(winning) = 1 – (1 – p)^n

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works.” Or they drop “outlier” data points.

Sadly, even in this piece, whose purpose is to debunk bad statistics, Bohannon repeats the usual, incredibly common error. A p-value of 0.05 does not mean that “there is just a 5 percent chance that your result is a random fluctuation.” It means that, if you assume that nothing but random fluctuations are at work, there’s a 5% chance of getting results at least as extreme as the ones you got. A (frequentist) p-value is incapable of telling you anything about the probability of any given hypothesis (such as “your result is a random fluctuation”).
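To see why the two statements differ, here is a small sketch that uses Bayes’s theorem to compute the probability that a “significant” result is a random fluctuation. It needs two inputs that the p-value doesn’t supply, and both are assumptions invented for illustration: the prior probability that there is no real effect, and the power of the test when there is one. It also conditions on “p < 0.05” rather than on the exact p-value observed, which keeps the arithmetic simple.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.8):
    """P(no real effect | p < alpha), by Bayes's theorem.

    prior_null: assumed prior probability that there is no real effect
    power:      assumed probability of reaching p < alpha when the effect is real
    """
    false_pos = alpha * prior_null        # no effect, but "significant" anyway
    true_pos = power * (1 - prior_null)   # real effect, and the test detects it
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    for prior_null, power in ((0.5, 0.8), (0.9, 0.5), (0.99, 0.5)):
        prob = prob_null_given_significant(prior_null, power=power)
        print(f"prior P(no effect) = {prior_null}, power = {power}: "
              f"P(fluctuation | significant) = {prob:.2f}")
```

The same 0.05 threshold gives very different answers depending on those inputs, which is why the p-value by itself can’t be read as “the chance that the result is a fluctuation.”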

One other quibble: the parenthetical remark about the measurements not being independent is literally true but misleading. The fact that the measurements aren’t independent means that the probability of a false positive “could be even higher”, but it could also be lower. In fact, the latter seems more likely to me. (The probability goes down if the measurements are positively correlated with each other, and up if they’re anticorrelated.)
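Here is a rough simulation of the positively correlated case. It is a sketch under assumptions of my own, not a model of the actual study: each of the 18 measurements is reduced to a standard normal test statistic, and every pair shares the same correlation rho, generated by mixing one common noise term with independent ones. It doesn’t attempt the anticorrelated case.

```python
import random
from statistics import NormalDist


def familywise_rate(n_tests=18, rho=0.0, alpha=0.05, n_trials=20000, seed=0):
    """Estimate P(at least one of n_tests no-effect results has p < alpha).

    rho is the assumed common correlation between any two test statistics
    (0 <= rho <= 1); rho = 0 recovers the independent case.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff on |z|
    shared_weight = rho ** 0.5
    own_weight = (1 - rho) ** 0.5
    hits = 0
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # noise common to all measurements
        if any(abs(shared_weight * shared + own_weight * rng.gauss(0, 1)) > threshold
               for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print(f"formula for independent tests: {1 - 0.95 ** 18:.3f}")
    for rho in (0.0, 0.3, 0.8, 1.0):
        print(f"rho = {rho}: simulated rate = {familywise_rate(rho=rho):.3f}")
```

At rho = 0 the simulation should agree with Bohannon’s formula, and at rho = 1 all 18 measurements are really one measurement, so the rate falls back to 5%. Values in between interpolate, which is the sense in which positive correlation lowers the chance of a false positive.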

The other thing that’s worth focusing on is the number of subjects in the study, which was incredibly small (15 across all three groups). Bohannon suggests (in the first sentence quoted above) that this is part of the reason they got a false positive, and other pieces I’ve read on this say the same thing. But it’s not true. The reason they got a false positive was p-hacking (buying many lottery tickets), which would have worked just as well with a larger number of subjects. If you had more subjects, the random fluctuations would have gotten smaller, but the level of fluctuation required for statistical “significance” would have gone down as well. By definition, the probability of any one “lottery ticket” winning is 5%, whether you have a lot of subjects or a few.
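A quick simulation makes the point. This is a minimal sketch with simplifying assumptions of my own: two groups with no real difference between them, Gaussian noise of known size, and a z-test in place of whatever test the study actually used. It reports both the false-positive rate and the size of the difference needed to reach “significance.”

```python
import random
from statistics import NormalDist


def critical_difference(n_per_group, alpha=0.05, sigma=1.0):
    """Smallest difference in group means that reaches p < alpha (known sigma)."""
    std_err = sigma * (2 / n_per_group) ** 0.5
    return NormalDist().inv_cdf(1 - alpha / 2) * std_err


def false_positive_rate(n_per_group, alpha=0.05, n_trials=2000, seed=0):
    """Fraction of no-effect experiments that come out 'significant'."""
    rng = random.Random(seed)
    cutoff = critical_difference(n_per_group, alpha)
    hits = 0
    for _ in range(n_trials):
        # Both groups are drawn from the same distribution: no real effect.
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        if abs(mean_a - mean_b) > cutoff:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(f"n = {n:3d} per group: cutoff = {critical_difference(n):.2f}, "
              f"false-positive rate = {false_positive_rate(n):.3f}")
```

The cutoff shrinks as the groups grow, but the false-positive rate should hover around 5% at every sample size.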

It’s true that with fewer subjects the effect size (i.e., the number of extra pounds lost, on average) is likely to be larger, but the published article went to great lengths to downplay the effect size (e.g., not mentioning it in the abstract, which is often all anyone reads).

Let me repeat that I think that Bohannon’s description of what he did is well worth reading and has a lot that’s right at the macro-scale, even though I wish that he’d gotten the above details right.

Does the inside of a brick exist?

June 6th, 2015

In Surely You’re Joking, Mr. Feynman, Richard Feynman tells a story of sitting in on a philosophy seminar and being asked by the instructor whether he thought that an electron was an “essential object.”

Well, now I was in trouble. I admitted that I hadn’t read the book, so I had no idea of what Whitehead meant by the phrase; I had only come to watch. “But,” I said, “I’ll try to answer the professor’s question if you will first answer a question from me, so I can have a better idea of what ‘essential object’ means. Is a brick an essential object?”

What I had intended to do was to find out whether they thought theoretical constructs were essential objects. The electron is a theory that we use; it is so useful in understanding the way nature works that we can almost call it real. I wanted to make the idea of a theory clear by analogy. In the case of the brick, my next question was going to be, “What about the inside of the brick?”–and I would then point out that no one has ever seen the inside of a brick. Every time you break the brick, you only see the surface. That the brick has an inside is a simple theory which helps us understand things better. The theory of electrons is analogous. So I began by asking, “Is a brick an essential object?”

The way he tells the story (which, of course, need not be presumed to be 100% accurate), he never got to the followup question, because the philosophers got bogged down in an argument over the first question.

I was reminded of this when I read A Crisis at the Edge of Physics, by Adam Frank and Marcelo Gleiser, in tomorrow’s New York Times. The article is a pretty good overview of some of the recent hand-wringing over certain areas of theoretical physics that seem, to some people, to be straying too far from experimental testability. (Frank and Gleiser mention a silly article by my old Ph.D. adviser that waxes particularly melodramatic on this subject.)

From the Times piece:

If a theory successfully explains what we can detect but does so by positing entities that we can’t detect (like other universes or the hyperdimensional superstrings of string theory) then what is the status of these posited entities? Should we consider them as real as the verified particles of the standard model? How are scientific claims about them any different from any other untestable — but useful — explanations of reality?

These entities are, it seems to me, not fundamentally different from the inside of Feynman’s brick, or from an electron for that matter. No one has ever seen an electron, or the inside of a brick, or the core of the Earth. We believe that those things are real, because they’re essential parts of a theory that we believe in. We believe in that theory because it makes a lot of successful predictions. If string theory or theories that predict a multiverse someday produce a rich set of confirmed predictions, then the entities contained in those theories will have as much claim to reality as electrons do.

Just to be clear, that hasn’t happened yet, and it may never happen. But it’s wrong to say that these theories represent a fundamental retreat from the scientific method just because they contain unobservable entities. (To be fair, Frank and Gleiser don’t say this, but many other people do.) Most interesting theories contain unobservable entities!

Are political scientists really this delicate?

May 29th, 2015

I’ve been reading some news coverage about the now-retracted paper published in Science, which purported to show that voters’ opinions on same-sex marriage could be altered by conversations with gay canvassers. Some of the things the senior author, Donald Green, said in one article struck me as very odd, from my perspective in the natural sciences. I wonder if the culture in political science is really that different?

Here’s the first quote:

“It’s a very delicate situation when a senior scholar makes a move to look at a junior scholar’s data set,” Dr. Green said. “This is his career, and if I reach in and grab it, it may seem like I’m boxing him out.”

In case you don’t have your scorecard handy, Dr. Green is the senior author of the paper. There’s only one other author, a graduate student named Michael LaCour. LaCour did all of the actual work on the study (or at least he said he did — that’s the point of the retraction).

In physics, it’s bizarre to imagine that one of the two authors of a paper would feel any delicacy about asking to see the data the paper is based on. If a paper has many authors, then of course not every author will actually look at the data, but with only two authors, it would be extremely strange for one of them not to look. Is it really that different in political science?

Later in the same article, we find this:

Money seemed ample for the undertaking — and Dr. Green did not ask where exactly it was coming from.

“Michael said he had hundreds of thousands in grant money, and, yes, in retrospect, I could have asked about that,” Dr. Green said. “But it’s a delicate matter to ask another scholar the exact method through which they’re paying for their work.”

Delicacy again! This one is, if anything, even more incomprehensible to me. I can’t imagine having my name on a paper presenting the results of research without knowing where the funding came from. For one thing, in my field the funding source is always acknowledged in the paper.

In both cases, Green is treating this as someone else’s work that he has nothing to do with. If that were true, then asking to see the raw data would be presumptuous (although in my world asking about the funding source would not). But he’s one of only two authors on the paper — it’s (supposedly) his work too.

It seems to me that there are two possibilities:

  1. The folkways of political scientists are even more different from those of natural scientists than I had realized.
  2. Green is saying ridiculous things to pretend that he wasn’t grossly negligent.

I don’t know which one is right.

 

A significant level of snark

May 27th, 2015

I learned via Peter Coles of this list of ways that scientists try to spin results that don’t reach the standard-but-arbitrary threshold of statistical significance. The compiler, Matthew Hankins, says

You don’t need to play the significance testing game – there are better methods, like quoting the effect size with a confidence interval – but if you do, the rules are simple: the result is either significant or it isn’t.

The following list is culled from peer-reviewed journal articles in which (a) the authors set themselves the threshold of 0.05 for significance, (b) failed to achieve that threshold value for p and (c) described it in such a way as to make it seem more interesting.

The list begins like this:

(barely) not statistically significant (p=0.052)
a barely detectable statistically significant difference (p=0.073)
a borderline significant trend (p=0.09)
a certain trend toward significance (p=0.08)
a clear tendency to significance (p=0.052)
a clear trend (p<0.09)
a clear, strong trend (p=0.09)
a considerable trend toward significance (p=0.069)
a decreasing trend (p=0.09)
a definite trend (p=0.08)
a distinct trend toward significance (p=0.07)

And goes on at considerable length.

Hankins doesn’t provide sources for these, so I can’t rule out the possibility that some are quoted out of context in a way that makes them sound worse than they are. Still, if you like snickering at statistical solecisms, snicker away.

I would like to note one quasi-serious point. The ones that talk about a “trend,” and especially “a trend toward significance,” are much worse than the ones that merely use language such as “marginally significant.” In the latter case, the authors are merely acknowledging that the usual threshold for “significance” (p=0.05) is arbitrary. Hankins says that, having agreed to play the significance game, you have to follow its rules, but that seems like excessive pedantry to me. The “trend” language, on the other hand, suggests either a deep misunderstanding of how statistics work or an active attempt to mislead.

Hankins:

For example, “a trend towards significance” expresses non-significance as some sort of motion towards significance, which it isn’t: there is no ‘trend’, in any direction, and nowhere for the trend to be ‘towards’.

This is exactly right. The only thing a p-value does is tell you the probability that results at least as extreme as the ones you saw would occur by chance if there were no real effect. Under that no-effect hypothesis, a low p-value is just a chance fluctuation and will (with high probability) revert to higher values if you gather more data.

The “trend” language suggests, either deliberately or accidentally, that the results are marching toward significance and will get there if only we can gather more data. But that’s only true if the effect you’re looking for is really there, which is precisely what we don’t know yet. (If we knew that, we wouldn’t need the data.) If it’s not there, then there will be no trend; rather, you’ll get regression to more typical (higher / less “significant”) p-values.
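To make that concrete, here is a rough simulation sketch. Everything in it is my own assumption rather than anything from Hankins's list: Gaussian data with known unit variance, a two-sided z-test, 50 data points per experiment, and an arbitrary "real" effect size of 0.3. It picks out the experiments that land at 0.05 < p < 0.10, doubles their data, and asks how often they then cross p < 0.05.

```python
import math
import random


def two_sided_p(samples):
    """Two-sided z-test p-value for 'true mean is zero', assuming unit variance."""
    n = len(samples)
    z = sum(samples) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def followup_rate(true_mean, n=50, trials=20000, seed=0):
    """Among experiments with 0.05 < p < 0.10 after n points, return the
    fraction that reach p < 0.05 once n more points are added."""
    rng = random.Random(seed)
    marginal = 0
    reached = 0
    for _ in range(trials):
        data = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if 0.05 < two_sided_p(data) < 0.10:
            marginal += 1
            # "Gather more data": double the sample and test again.
            data += [rng.gauss(true_mean, 1.0) for _ in range(n)]
            if two_sided_p(data) < 0.05:
                reached += 1
    return reached / marginal if marginal else float("nan")


if __name__ == "__main__":
    print("no real effect:        ", followup_rate(0.0))
    print("real effect (mean 0.3):", followup_rate(0.3))
```

With no real effect, only a minority of the "trending" experiments should reach significance after doubling the data; with a real effect, most should. The exact fractions depend on the assumed sample size, effect size, and seed, so treat them as illustrative. The point is that a marginal p-value by itself can't tell you which of the two situations you're in.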