I should point out that the published paper discusses two different sets of results: an analysis of the death toll from past hurricanes, and a set of surveys of people’s perceptions of hurricanes based only on their names. The latter study shows that people do perceive hurricanes as milder when they have feminine-sounding names (in the absence of other information). I’m quite prepared to believe that one. The first finding, about actual people actually dying, is the one I don’t believe.
First, this is precisely the sort of result that’s most susceptible to publication bias. (If you checked for this and found nothing, you wouldn’t publish, but if you found something, you would.) This is the main reason that some people claim that “most published research findings are false.”
Add to that the closely related problem that’s sometimes known as p-hacking. This is the practice of testing out multiple hypotheses, data sets, or statistical methods, and only reporting the ones that yield interesting results. P-hacking artificially inflates the statistical significance of your results. From the PNAS paper:
The analyses showed that the change in hurricane fatalities as a function of MFI [a measure of how masculine or feminine a name is perceived to be] was marginal for hurricanes lower in normalized damage, indicating no effect of masculinity-femininity of name for less severe storms. For hurricanes higher in normalized damage, however, this change was substantial, such that hurricanes with feminine names were much deadlier than those with masculine names.
To summarize, “the first thing we tried didn’t yield an interesting result, but we kept trying until we found something that did.” (To give the authors credit, at least they acknowledge that they did this. That doesn’t make the result right, but it’s more honest than the alternative.)
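The multiple-testing mechanics behind p-hacking are easy to demonstrate with a toy simulation (pure Python, all data synthetic and unrelated to the hurricane study): run several tests on data where nothing is going on, keep the best p-value, and the nominal 5% false-positive rate balloons.

```python
import math
import random

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, for N(0,1) data (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # Standard normal CDF via erf; two-sided tail probability.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def false_positive_rate(n_tests, trials=2000, n=30, seed=0):
    """Fraction of trials in which at least one of n_tests tests on
    pure-noise data comes out 'significant' at p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ps = [z_test_p([rng.gauss(0, 1) for _ in range(n)])
              for _ in range(n_tests)]
        if min(ps) < 0.05:
            hits += 1
    return hits / trials

# One pre-registered test holds the nominal 5% rate; trying ten
# specifications and reporting only the best inflates it to ~40%
# (1 - 0.95**10), even though there is no real effect anywhere.
print(false_positive_rate(1))
print(false_positive_rate(10))
```

Nothing here depends on the specific test used; any analysis pipeline with enough free choices (subgroups, covariates, cutoffs) behaves the same way.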
But here’s the biggest problem. The study used hurricanes from 1950 onwards. Hurricanes were essentially all given female names until 1978, so the masculine names are heavily weighted toward later years. As a result, anything else that changed hurricane deadliness over time would mimic the effect that is seen. An obvious candidate: better early-warning systems, which reduce fatalities for any given level of storm severity.
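Here is a toy illustration of that confounding, with entirely synthetic data and a made-up femininity scale (low = masculine, high = feminine): deaths decline steadily with time, names have no causal effect at all, and yet a naive regression of deaths on name-femininity finds a positive slope, simply because the high-femininity storms cluster in the deadlier early years.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

rng = random.Random(1)
years, mfi, deaths = [], [], []
for year in range(1950, 2013):
    # Hypothetical femininity index: pre-1979 storms all female-named
    # (high values); afterwards, names span the whole range.
    f = rng.uniform(8, 11) if year < 1979 else rng.uniform(1, 11)
    # Deaths decline over time (better warnings); the name plays NO role.
    d = max(0.0, 100 - 1.2 * (year - 1950) + rng.gauss(0, 10))
    years.append(year); mfi.append(f); deaths.append(d)

# Spurious positive "femininity effect", driven entirely by the trend.
print("slope of deaths on femininity:", ols_slope(mfi, deaths))
print("slope of deaths on year:      ", ols_slope(years, deaths))
```

The femininity slope comes out positive even though the data were generated with no name effect whatsoever; that is exactly the signature a time-driven confound would leave in the real data.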
Not surprisingly, other people have pointed out problems like these. The authors of the study have responded. If you want to decide what you think about this, you should definitely read what they have to say. I’ll just point out a few things.
On the most important issue (the question of whether to use the pre-1979 all-female-named hurricanes), the authors point out that their data used the degree of perceived femininity of names, not simply a male-female binary, so there is information in the pre-1979 names. That is not sufficient to resolve the problem, which is that the pre-1979 data are, on average, very different in MFI from the later data. They then say:
Although it is true that if we model the data using only hurricanes since 1979 (n=54) this is too small a sample to obtain a significant interaction, when we model the fatalities of all hurricanes since 1950 using their names’ degree of femininity, the interaction between name-femininity and damage is statistically significant.
This response, of course, simply digs them deeper into the p-hacking hole. Essentially, they’re saying that they used the less reliable data set because the more reliable one didn’t give an interesting result.
The authors give a second, better response to this problem:
We included elapsed years (years since the hurricane) in our modeling and this did not have any significant effect in predicting fatalities. In other words, how long ago the storm occurred did not predict its death toll.
This is potentially a good argument, although its force depends on precisely what tests they performed. The relevant question is whether replacing the MFI with the time since the hurricane gives a model that fits about as well (in which case positing an MFI effect is unnecessary). Unfortunately, the paper doesn’t present these results: the authors mention that they tried including elapsed time as one of their variables, but without enough specifics to tell exactly what was done.
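The comparison I have in mind can be sketched in a few lines (synthetic data again, generated so that elapsed time is the only real driver of deaths): fit deaths against elapsed years and against the femininity index separately, and compare the goodness of fit.

```python
import random

def r_squared(xs, ys):
    """R^2 of the simple least-squares regression of ys on xs
    (equal to the squared sample correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

rng = random.Random(2)
elapsed, mfi, deaths = [], [], []
for year in range(1950, 2013):
    t = 2012 - year                      # elapsed years since the storm
    # Made-up femininity index, high before 1979 (all-female era).
    f = rng.uniform(8, 11) if year < 1979 else rng.uniform(1, 11)
    d = 20 + 1.0 * t + rng.gauss(0, 8)   # time is the ONLY real driver
    elapsed.append(t); mfi.append(f); deaths.append(d)

# If elapsed time alone explains the data at least as well as the
# femininity index does, the MFI "effect" adds nothing to the fit.
print("R^2, deaths ~ elapsed years:", r_squared(elapsed, deaths))
print("R^2, deaths ~ femininity:   ", r_squared(mfi, deaths))
```

On data built this way, elapsed time fits far better than femininity, and a model-comparison table of exactly this kind (for the real data, with the paper’s actual covariates) is what I would have expected the authors to report.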
The more I think about it, the stranger it is that this issue is not addressed in detail in the paper. Usually, in this sort of study, you control for other variables that might mimic the signal you’re looking for. In this case, an overall drift in hurricane deadliness with time is by far the most obvious such variable, and yet it’s not included in any of the models for which they compute goodness of fit. It’s very strange that the authors would not include this in their models, and even stranger that the reviewers let them.