Tech Firms Hire ‘Red Teams.’ Scientists Should, Too

Another botched peer review—this one involving a controversial study of police killings—shows how devil’s advocates could improve the scientific process.

The recent retraction of a research paper which claimed to find no link between police killings and the race of the victims was a story tailor-made for today’s fights over cancel culture.

First, the authors asked for the paper to be withdrawn, both because they’d been “careless when describing the inferences that could be made from our data” and because of how others had interpreted the work. (In particular they pointed to a recent op-ed in The Wall Street Journal with the headline, “The Myth of Systemic Police Racism.”) Then, after two days of predictable blowback from those decrying what they saw as left-wing censorship, the authors tried to clarify: “People were incorrectly concluding that we retracted due to either political pressure or the political views of those citing the paper,” they wrote in an amended statement.

No, the authors said, the real reason they retracted the paper was that it contained a serious mistake. In fact, that mistake—a misstatement of its central finding—had been caught soon after the paper’s initial publication in the Proceedings of the National Academy of Sciences in July 2019, and was formally corrected in April of this year. At that point, the authors acknowledged their error—sort of—while insisting that their main conclusions held. That the eventual retraction came only after the paper became a flashpoint in the debate over race and policing in the wake of George Floyd’s murder … well, let’s agree that the retraction happened.

The real culprit here, however, is not woke politics but inept peer review. The publication process at PNAS failed to catch a glaring problem; if the reviewers had spotted it, the paper on police killings would have turned out much differently—and led to far less controversy.

The missed error, as Princeton researchers Dean Knox and Jonathan Mummolo have written, amounted to what social scientists call “selection on the dependent variable,” which they describe as “only examining cases where events of interest occur.” According to Knox and Mummolo, the PNAS paper failed to account for the possibility (a strong one, as it happens) that Blacks are much more likely than whites to experience non-fatal encounters with police that escalate to deadly force. Basically, the article was a numerator without a denominator.
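
To see why a numerator without a denominator misleads, consider a minimal sketch in Python. The group names and counts below are invented purely for illustration and come from no study; the point is only that two groups can make up equal shares of fatal shootings while facing very different risks per police encounter.

```python
# Hypothetical counts, invented for illustration only (not from any study).
# Denominator: all police encounters experienced by each group.
encounters = {"group_a": 1_000, "group_b": 4_000}
# Numerator: encounters that ended in a fatal shooting.
fatal = {"group_a": 10, "group_b": 10}


def share_of_fatalities(fatal_counts):
    """Each group's share of all fatal shootings (uses the numerator only)."""
    total = sum(fatal_counts.values())
    return {group: n / total for group, n in fatal_counts.items()}


def rate_per_encounter(fatal_counts, encounter_counts):
    """Fatal shootings per encounter (requires the denominator)."""
    return {group: fatal_counts[group] / encounter_counts[group]
            for group in fatal_counts}


shares = share_of_fatalities(fatal)
rates = rate_per_encounter(fatal, encounters)

for group in fatal:
    print(f"{group}: share of fatalities = {shares[group]:.2f}, "
          f"rate per encounter = {rates[group]:.4f}")
```

With these made-up numbers, each group accounts for half of the fatalities, yet the per-encounter rate for group_a is four times that for group_b. A dataset restricted to fatal shootings contains only the first kind of figure, so it cannot distinguish this scenario from one in which the rates are equal.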

When the paper’s authors finally called for its retraction, they admitted that their study couldn’t make any claims about race and fatal shootings at the hands of police. “The mistake we made was drawing inferences about the broader population of civilians who interact with police rather than restricting our conclusions to the population of civilians who were fatally shot by the police,” they wrote.

They also lamented the fact that conservatives—particularly Heather Mac Donald, of the Manhattan Institute—had seized on their work to argue a point that the flawed science didn’t support. That admission prompted Mac Donald and others to declare victory in victimhood. (The paper’s senior author, Joseph Cesario, has pushed back on Mac Donald’s characterization, claiming in The Wall Street Journal that, despite the wording of his original statement, the decision to retract “had nothing to do with [her] claims.”)

As we and others have written many times, peer review—the way journals ask researchers to perform it, anyway—is not designed to catch fraud. It’s also vulnerable to rigging and doesn’t go so well when done in haste. Editors and publishers tend to admit these problems only under duress—i.e., when a well-publicized retraction happens—and then hope that we believe their claims that such colossal blunders somehow show that “the system is working the way it should.” But their protestations only serve as an acknowledgement that the standard system doesn’t work, and that we must instead rely upon the more informal sort of peer review that happens to a paper after it gets published. The internet has enabled such post-publication peer review, as it is known, to happen with more speed, on sites like PubPeer.com. In some cases, though—as with the PNAS paper described above—the resolution of this after-the-fact assessment comes much too late, after a mistaken claim has already made the rounds.

So how might journals do things better? As Daniël Lakens, of Eindhoven University of Technology in the Netherlands, and his colleagues have argued, researchers should embrace a “Red Team challenge” approach to peer review. Just as software companies hire hackers to probe their products for potential security gaps, a journal might recruit a team of scientific devil’s advocates: subject-matter specialists and methodologists who will look for “holes and errors in ongoing work and … challenge dominant assumptions, with the goal of improving project quality,” Lakens wrote in Nature recently. After all, he added, science is only as robust as the strongest critique it can handle.

So here’s some advice for scientists and journals: If you’re thinking of publishing a paper on a controversial topic, don’t simply rely on your conventional review process—bring in a Red Team to probe for vulnerabilities. The study-hackers should be experts in the given field, with a stronger-than-usual background in statistics and a nose for identifying potential problems before publication, when they can be addressed. They should be, whenever possible—and, researchers, get ready to clutch your pearls—likely to disagree with your paper’s conclusions. Anticipating the responses of your critics is op-ed writing 101.

Until then, scientists can do what Lakens and his colleagues have done: in May, they launched a red team challenge for a manuscript by a colleague, Nicholas Coles, a social psychologist at Harvard, with each of five scientists given a $200 stipend to hunt for potential problems with the unpublished article, plus an additional $100 for each “critical problem” they uncovered. The project, which wrapped up this month, was meant to serve as a useful case study of the role red teams might play in science.

The five critics came back with 107 potential errors, of which 18 were judged (by a neutral arbiter) to be significant. Of those, Lakens says, five were major problems, including “two previously unknown limitations of a key manipulation, inadequacies in the design and description of the power analysis, an incorrectly reported statistical test in the supplemental materials, and a lack of information about the sample in the manuscript.” Problems, in other words, that would have been deeply troubling had they surfaced after publication.

In light of the comments, Coles has decided to shelve the paper for the moment. “Instead of putting the final touches on my submission cover letter, I am back at the drawing board—fixing the fixable, designing a follow-up study to address the unfixable, and considering what role Red Teams can play in science more broadly,” he wrote recently.

Lakens says he’s planning to employ a Red Team to vet his own meta-analysis (a study of studies) on the topic of gender discrimination. It’s with controversial topics, in particular, that he sees the approach as being most useful for journals and researchers. “You would not insure a trip to the grocery store tomorrow, but you would consider travel insurance for a round the world trip,” he said. “It is all about the cost-benefit analysis for us as well. I leave it to others to decide whose research is important enough for a Red Team.”

That’s a critical point. Even before the murder of George Floyd, it was entirely predictable that a study of whether police officers kill Blacks more often than whites was likely to garner a lot of scrutiny. Given that resources are always scarce, it makes more sense to deploy the most comprehensive, time-consuming forms of peer review in cases where the findings matter most.

Researchers joke about the hated Reviewer #2 (or #3, depending on your meme): the one who’s always asking for more experiments, recommending vast revisions, and in general holding up your progress, whether toward publishing a paper or getting tenure. Without a doubt, there are jerks in science, and not all critiques are well-intentioned. But if we strip away the nastiness of Reviewer #2s, and the notion that their quibbles amount to spiteful sabotage, they start to look a bit like Red-Team leaders. Their more vigorous approach to doing peer review could help clean up the scientific record by making sure fewer incorrect conclusions are published. Isn’t that worth the effort?