A study of gender bias claimed that orchestras that held “blind auditions,” in which performers were hidden behind a screen so judges couldn’t determine their sex, were 50 percent more likely to hire women. Here’s how the Guardian reported the results back in 2013:

Bias cannot be avoided, we just can’t help ourselves. Research shows that we apply different standards when we compare men and women. While explicit discrimination certainly exists, perhaps the more arduous task is to eliminate our implicit biases — the ones we don’t even realise we have…

In the 1970s and 1980s, orchestras began using blind auditions. Candidates are situated on a stage behind a screen to play for a jury that cannot see them. In some orchestras, blind auditions are used just for the preliminary selection while others use it all the way to the end, until a hiring decision is made.

Even when the screen is only used for the preliminary round, it has a powerful impact; researchers have determined that this step alone makes it 50% more likely that a woman will advance to the finals. And the screen has also been demonstrated to be the source of a surge in the number of women being offered positions.

The study by two economists from Harvard and Princeton was published in 2000 and has been cited around 1500 times since, making it one of the most-cited papers in the field of gender bias. It has appeared in newspapers and TED talks. It was discussed by Malcolm Gladwell and, according to Christina Hoff Sommers, was even referenced in a dissent by Justice Ruth Bader Ginsburg.

Earlier this year, a data scientist named Jonatan Schaumburg-Müller Pallesen wrote a post on Medium criticizing the study. Here was his conclusion:

So, in conclusion, this study presents no statistically significant evidence that blind auditions increase the chances of female applicants. In my reading, the unadjusted results seem to weakly indicate the opposite, that male applicants have a slightly increased chance in blind auditions; but this advantage disappears with controls.

His post caught the attention of Andrew Gelman, a statistician at Columbia University, who went back and read through the study himself. Gelman confirmed Pallesen's conclusion: the study's big claims don't hold up to scrutiny.

This is not very impressive at all. Some fine words but the punchline seems to be that the data are too noisy to form any strong conclusions. And the bit about the point estimates being “economically significant”—that doesn’t mean anything at all. That’s just what you get when you have a small sample and noisy data, you get noisy estimates so you can get big numbers.
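Gelman's point that small, noisy samples routinely produce big numbers is easy to demonstrate with a simulation. The sketch below is illustrative only (the sample sizes and rates are made up, not the study's data): it draws two groups from the *same* true advancement rate and counts how often pure chance produces an apparent difference of 50 percent or more.

```python
import random

random.seed(0)

TRUE_RATE = 0.10    # identical true advancement rate for both groups
N_PER_GROUP = 30    # a deliberately small sample
N_SIMS = 10_000

big_estimates = 0
for _ in range(N_SIMS):
    # Count how many candidates in each group advance, by chance alone
    women = sum(random.random() < TRUE_RATE for _ in range(N_PER_GROUP))
    men = sum(random.random() < TRUE_RATE for _ in range(N_PER_GROUP))
    if men == 0:
        continue  # relative change is undefined when the baseline is zero
    relative_change = (women - men) / men
    if abs(relative_change) >= 0.5:  # an apparent "50% difference" or more
        big_estimates += 1

print(f"{big_estimates / N_SIMS:.0%} of simulations show a 50%+ difference")
```

Even though the two groups are statistically identical here, a large share of simulated "studies" show a relative gap of 50 percent or more in one direction or the other, which is exactly what Gelman means by noisy estimates yielding big numbers.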

As for the claim that women’s chances of being hired improved by 50 percent in blind auditions, Gelman initially couldn’t find any support for it in the study, but later, in the comments on his own post, he tracked it down: “there’s no way you should put that sort of claim in the conclusion of your paper unless you give the standard error. And if you look at the numbers in that table, you’ll see that the standard errors for these differences are large.” He also adds:

And one problem with the paper is the expectation, among research articles, to present strong conclusions. You can see it right there: the authors make some statements and then immediately qualify them (the results are not statistically significant, they go in both directions, etc.), but then they can’t resist making strong conclusions and presenting mysterious numbers like that “50 percent” thing. And of course the generally positive reception on this paper would just seem to retroactively validate the strategy of taking weak or noisy data and making strong claims.
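Gelman's complaint about reporting a "50 percent" figure without its standard error can be made concrete. The numbers below are hypothetical, chosen only for illustration (they are not the study's actual data): a jump from a 20% to a 30% advancement rate is a "50 percent relative increase," yet with small groups the standard error of that difference is about as large as the difference itself.

```python
import math

# Hypothetical figures for illustration only -- not the study's data.
# Suppose 20% of men and 30% of women advance: a "50% relative increase."
p_men, n_men = 0.20, 40
p_women, n_women = 0.30, 40

# Standard error of the difference between two independent proportions
diff = p_women - p_men
se = math.sqrt(p_men * (1 - p_men) / n_men
               + p_women * (1 - p_women) / n_women)

print(f"difference = {diff:.2f}, standard error = {se:.2f}")
print(f"95% CI: [{diff - 1.96 * se:.2f}, {diff + 1.96 * se:.2f}]")
```

The 95% confidence interval comfortably includes zero, so a headline-grabbing "50 percent" point estimate like this one is, on its own, compatible with no effect at all.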

Someone in the comments on Gelman’s post suggested that political correctness could be why the journal published the paper in 2000. Gelman countered that economics journals are known for being willing to favor results that undercut politically expected findings, at least more so than journals in other fields. That may be true, but the real issue is how this study became a leading example of gender bias, cited 1500 times. In other words, political correctness may not explain the publication of the paper, but I’d bet it explains the frequency of citations.

Christina Hoff Sommers wrote a piece about all of this for the Wall Street Journal earlier this week. Her piece is behind the WSJ paywall but her AEI video on the same topic is not.