Just how much wisdom is there in the scientific crowd?
If science is an objective means of seeking truth, it’s also one that requires human judgments. Let’s say you’re a psychologist with a hypothesis: People understand that they may be biased in unconscious ways against stigmatized groups; they will admit this if you ask them. That might seem like a pretty straightforward idea—one that’s either true or not. But the best way to test it isn’t necessarily obvious. First, what do you mean by negative stereotypes? Which stigmatized groups are you talking about? How would you measure the extent to which people are aware of their implicit attitudes, and how would you gauge their willingness to disclose them?
These questions could be answered in many different ways, and those choices, in turn, may lead to vastly different findings. A new crowdsourced experiment—involving more than 15,000 subjects and 200 researchers in more than two dozen countries—demonstrates that point. When various research teams designed their own means of testing the very same set of research questions, they came up with divergent, and in some cases opposing, results.
The crowdsourced study is a dramatic demonstration of an idea that’s been widely discussed in light of the reproducibility crisis—the notion that subjective decisions researchers make while designing their studies can have an enormous impact on their observed results. Whether through p-hacking or via the choices they make as they wander the garden of forking paths, researchers may intentionally or inadvertently nudge their results toward a particular conclusion.
The new paper’s senior author, psychologist Eric Uhlmann at INSEAD in Singapore, had previously spearheaded a study that gave a single data set to 29 research teams and asked them to use it to answer a simple research question: “Do soccer referees give more red cards to dark-skinned players than light-skinned ones?” Despite analyzing identical data, none of the teams came up with exactly the same answer. In that case, though, the groups’ findings did generally point in the same direction.
The red card study showed how decisions about how to analyze data could influence the results, but Uhlmann also wondered about the many other decisions that go into a study’s design. So he initiated this latest study, an even larger and more ambitious one, which will be published in Psychological Bulletin (data and materials are shared openly online). The project started with five hypotheses that had already been tested experimentally but on which results had not yet been published.
Aside from the hypothesis about implicit associations described above, these concerned things like how people respond to aggressive negotiating tactics or what factors could make them more willing to accept the use of performance-enhancing drugs among athletes. Uhlmann and his colleagues presented the same research questions to more than a dozen research teams without telling them anything about the original study or what it had found.
The teams then independently created their own experiments to test the hypotheses under some common parameters. The studies would have to be carried out online, with participants in each drawn at random from a common pool. Each research design was run twice: once on subjects pulled from Amazon’s Mechanical Turk and then again on a fresh set of subjects found through a survey company called Pureprofile.
The published study materials show how much variation there was across research designs. In testing the first hypothesis, for example, that people are aware of their unconscious biases, one team simply asked participants to rate their agreement with the following statement: “Regardless of my explicit (i.e. conscious) beliefs about social equality, I believe I possess automatic (i.e. unconscious) negative associations towards members of stigmatized social groups.” Based on responses to this question, they concluded that the hypothesis was false: People do not report an awareness of having implicit negative stereotypes.
Another team tested the same hypothesis by asking subjects to self-identify with a political party and then to rate their feelings about a hypothetical member of the opposition party. Using this approach, they found that people are very willing to report their own negative stereotypes. Meanwhile, a third team showed subjects photos of men and women who were white, black, or overweight (as well as of puppies or kittens) and asked them to rate their “immediate ‘gut level’ reaction towards this person.” Their results also showed that people did indeed cop to having negative associations with people from stigmatized groups.
When the study was over, seven groups had found evidence in favor of the hypothesis, while six had found evidence against it. Taken all together, these data would not support the idea that people recognize and report their own implicit associations. But if you’d seen results from only one group’s design, it would have been easy to come to a different conclusion.
The study found a similar pattern for four out of five hypotheses: Different research teams had produced statistically significant effects in opposite directions. Even when a research question produced answers in the same direction, the sizes of the reported effects were all over the map. Eleven of 13 research teams produced data that clearly supported the hypothesis that extreme offers make people less trusted in a negotiation, for example, while findings from the other two were suggestive of the same idea. But some groups found that an extreme offer had a very large effect on trust, while others found that the effect was only minor.
The moral of the story here is that one specific study doesn’t mean very much, says Anna Dreber, an economist at the Stockholm School of Economics and an author on the project. “We researchers need to be way more careful now in how we say, ‘I’ve tested the hypothesis.’ You need to say, ‘I’ve tested it in this very specific way.’ Whether it generalizes to other settings is up to more research to show.”
This problem—and this approach to demonstrating it—isn’t unique to social psychology. One recent project similarly asked 70 teams to test nine hypotheses using the same data set of functional magnetic resonance images. No two teams used the exact same approach, and their results varied as you might expect.
If one were judging only by the outcomes of these projects, it might be reasonable to guess that the scientific literature would be a thicket of opposing findings. (If different research groups often arrive at different answers to the same questions, then the journals should be filled with contradictions.) Instead, the opposite is true. Journals are full of studies that confirm the existence of a hypothesized effect, while null results are squirreled away in a file drawer. Think of the results described above on the implicit-bias hypothesis: Half the groups found evidence in favor and half found evidence against. If this work had been carried out in the wilds of scientific publishing, the former would have taken root in formal papers, while the rest would have been buried and ignored.
The demonstration from Uhlmann and colleagues suggests that hypotheses should be tested in diverse and transparent ways. “We need to do more studies trying to look at the same idea with different methods,” says Dorothy Bishop, a psychologist at the University of Oxford. That way, you can “really clarify how solid it is before you’re jumping up and down and making a big dance about it.”
The results certainly argue for humility, Uhlmann says. “We have to be careful what we say in the article, what our university says in the press release, what we say in the media interviews. We need to be cautious about what we claim.” The incentives push toward making big claims, but good science probably means slowing down and exercising more caution.
Slowing down is something that University College London psychologist Uta Frith advocates in a recent essay in Trends in Cognitive Sciences. Frith writes that “the current ‘publish or perish’ culture has a corrupting effect on scientists as well as on science itself.” Pressure to publish many papers, rather than focusing on publishing high-quality ones, stresses researchers and shortchanges the science, she says. “Fast science leads to cutting corners and has almost certainly contributed to the reproducibility crisis,” she writes. Her antidote? “Slow science,” which focuses on the “bigger aims of science” as a method of truth seeking. One way to promote slow science, she says, would be for researchers to find inspiration in the practice of grand cru viticulturists, who take pains to limit their own wine production so as to maintain its maximum quality.
Bishop has made a similar proposal, for scientists to restrict their own output. “In order to develop a theory, you need a mountain of observations, and I think we’ve often had rather small numbers of observations,” she says. “And then we’ve leapt ahead to theorize prematurely when it would have been better if we had explored the range of situations under which those observations were obtained.” If there’s one lesson to be drawn from the five-hypothesis study, it’s that science is a process, and it’s one that takes time.
All Rights Reserved for Christie Aschwanden