To Clean Up Comments, Let AI Tell Users Their Words Are Trash

It won’t solve everything, but a new study suggests real-time automated feedback could help make the internet a less toxic place.

Comment sections have long acted like the wiry garbage cans of news websites, collecting the worst and slimiest of human thought. Thoughtful reactions get mixed in with off-topic offal, personal attacks, and the enticing suggestions to “learn how to make over $7,000 a month by working from home online!” (So goes the old adage: never read the comments.) Things got so bad in the last decade that many websites put the kibosh on comments altogether, trading the hope of lively, interactive debate for the promise of peace and quiet.

But while some people ran away screaming, others leapt in with a mission to make the comment section better. Today, dozens of newsrooms use commenting platforms like Coral and OpenWeb that aim to keep problematic discourse at bay with a combination of human chaperones and algorithmic tools. (When WIRED added comments back to the website earlier this year, we turned to Coral.) These tools work to flag and categorize potentially harmful comments before a human can review them, helping to manage the workload and reduce the visibility of toxic content.

Another approach that’s gained steam is to give commenters automated feedback, encouraging them to rethink a toxic comment before they hit publish. A new study looks at how effective these self-editing prompts can be. The study, conducted by OpenWeb and Google’s AI conversation platform, Perspective API, involved over 400,000 comments on news websites, like AOL, RT, and Newsweek, which tested a real-time feedback feature in their comment sections. Rather than automatically rejecting a comment that violated community standards, the algorithm would first prompt commenters with a warning message: “Let’s keep the conversation civil. Please remove any inappropriate language from your comment,” or “Some members of the community may find your comment inappropriate. Try Again?” Another group of commenters served as a control, and saw no such intervention message.
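In rough terms, the flow described above (score the comment, nudge once, then let the commenter decide) might look like the sketch below. It is an illustration under stated assumptions, not OpenWeb's or Perspective's actual code: the `review_comment` function, the `toy_scorer` stand-in, and the 0.8 threshold are all hypothetical, and a real deployment would call a trained toxicity model instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Warning text quoted in the study; everything else here is hypothetical.
WARNING = (
    "Let's keep the conversation civil. "
    "Please remove any inappropriate language from your comment."
)


@dataclass
class Decision:
    publish: bool
    warning: Optional[str] = None


def review_comment(
    text: str,
    score_toxicity: Callable[[str], float],
    threshold: float = 0.8,        # assumed cutoff, not a figure from the study
    already_warned: bool = False,
) -> Decision:
    """Nudge once; a commenter who has already seen the warning may post anyway."""
    if already_warned or score_toxicity(text) < threshold:
        return Decision(publish=True)
    return Decision(publish=False, warning=WARNING)


def toy_scorer(text: str) -> float:
    # Stand-in for a real toxicity model: flags a single sample insult.
    return 1.0 if "idiot" in text.lower() else 0.0
```

A first call such as `review_comment("You idiot", toy_scorer)` holds the comment and returns the warning; the same call with `already_warned=True` publishes it, which mirrors the study's design of warning without blocking.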

The study found that for about a third of commenters, seeing the intervention did cause them to revise their comments. Jigsaw, the group at Google that makes Perspective API, says that jibes with previous research, including a study it did with Coral, which found that 36 percent of people edited toxic language in a comment when prompted. Another experiment—from The Southeast Missourian, which also uses Perspective’s software—found that giving real-time feedback to commenters reduced the number of comments considered “very toxic” by 96 percent.

The ways people revised their comments weren’t always positive, though. In the OpenWeb study, about half of the people who chose to edit their comment did so to remove or replace the toxic language, or to reshape the comment entirely. Those people seemed both to understand why the original comment got flagged and to acknowledge that they could rewrite it in a nicer way. But about a quarter of those who revised their comment did so to navigate around the toxicity filter, changing the spelling or spacing of an offensive word to try to skirt algorithmic detection. The rest changed the wrong part of the comment, seemingly not understanding what was wrong with the original version, or revised their comment to respond directly to the feature itself (e.g. “Take your censorship and stuff it”).

As algorithmic moderation has become more common, language adaptations have followed in its footsteps. People learn that specific words (say, “cuck”) trip up the filter, and start to write them differently (“c u c k”) or invent new words altogether. After the death of Ahmaud Arbery in February, for example, Vice reported that some white supremacist groups online began to use the word “jogger” in place of better-known racial slurs. Those patterns largely escape algorithmic filters, and can make it harder to police intentionally offensive language online.
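The spacing trick is the easier of the two evasions to counter. One common defense, sketched here under assumptions, is to normalize text before matching it against a word list. The `BLOCKLIST` entry is a placeholder, the helper names are hypothetical, and nothing here is taken from Perspective or OpenWeb.

```python
import re

BLOCKLIST = {"badword"}  # placeholder term, purely illustrative

# Three or more single letters separated by spaces or light punctuation,
# e.g. "b a d" or "b.a.d". Real words never match: their letters are adjacent.
SPACED_RUN = re.compile(r"(?<![A-Za-z])(?:[A-Za-z][\s._*-]+){2,}[A-Za-z](?![A-Za-z])")
SEPARATORS = re.compile(r"[\s._*-]+")


def collapse_spaced_letters(text: str) -> str:
    """Rejoin spaced-out letters so that "b a d" becomes "bad"."""
    return SPACED_RUN.sub(lambda m: SEPARATORS.sub("", m.group()), text)


def is_flagged(text: str) -> bool:
    """Check the normalized, lowercased words against the blocklist."""
    words = re.findall(r"[a-z]+", collapse_spaced_letters(text).lower())
    return any(word in BLOCKLIST for word in words)
```

Normalization of this kind only catches respellings of a known word. It does nothing about coded substitutes like the “jogger” example, where an innocuous word carries the offensive meaning.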

Ido Goldberg, OpenWeb’s SVP of product, says this kind of adaptive behavior was one of the main concerns in designing their real-time feedback feature. “There’s this window for abuse that’s open to try to trick the system,” he says. “Obviously we did see some of that, but not as much as we thought.” Rather than use the warning messages as a way to game the moderation system, most users who saw interventions didn’t change their comments at all. Thirty-six percent of users who saw the intervention posted their comment anyway, without making any edits. (The intervention message acted as a warning, not a barrier to posting.) Another 18 percent posted their comment, unedited, after refreshing the page, suggesting that they took the warning as a block. Another 12 percent simply gave up, abandoning the effort and not posting at all.
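The figures above can be cross-checked with some quick arithmetic. The sketch below assumes that all of the percentages share one denominator (users who saw the intervention), which the study's summary implies but does not state outright. The split among revisers uses the article's approximate "about half" and "about a quarter," so the derived numbers are rough.

```python
# Shares of users who saw the warning, as reported in the study.
posted_unedited = 36
posted_after_refresh = 18
abandoned = 12

# Whatever remains should be the users who revised their comment.
revised = 100 - (posted_unedited + posted_after_refresh + abandoned)

# Approximate split among revisers ("about half", "about a quarter", the rest).
genuine_rewrites = revised * 0.50
filter_evasions = revised * 0.25
other_revisions = revised - genuine_rewrites - filter_evasions

print(f"revised: {revised}% of prompted users")
print(f"genuine rewrites: ~{genuine_rewrites}%")
print(f"filter evasions: ~{filter_evasions}%")
print(f"other revisions: ~{other_revisions}%")
```

The remainder comes to 34 percent, consistent with the "about a third" who revised. On these assumptions, only around 17 percent of all prompted users actually cleaned up their language, and roughly 8 or 9 percent tried to dodge the filter.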

While gentle nudges work for some, they do little to influence those who show up in the comments to intentionally write something racist, sexist, violent, or extreme. Flagging those comments won’t make a troll stop, scratch their head, and reconsider if they could say it a little more nicely. But Nadav Shoval, OpenWeb’s cofounder and CEO, believes that the number of genuine trolls—that is, people who write nasty things on the internet like it’s their calling—has been greatly exaggerated. He believes that most offensive comments come from people who are usually well-intentioned but occasionally have a flare-up of emotion that, when amplified, incentivizes more inflammatory behavior. There’s some evidence to support that, too: In a blog post published on Monday, Jigsaw referenced an earlier study it did with Wikipedia, where it found that the majority of offensive content came from people who did not have a history of trolling.

The subjects of OpenWeb’s study aren’t representative of the wider internet, and 400,000 comments is a fraction of what gets posted daily to platforms like Facebook or Reddit. But this kind of pre-emptive approach has caught on among those bigger platforms, too. Instagram, for example, built a machine learning model to detect messages on its platform that look like bullying. Before someone posts a mean comment, the platform can prompt them to write it more nicely; it can also proactively hide these types of toxic comments from users who have turned on its offensive comment filter.

Pre-emptive approaches also relieve some of the pressure on moderators and other community members to clean up the mess of comments. Many websites rely on community policing to flag problematic comments, in addition to algorithmic and human moderation. An approach that puts more emphasis on convincing people to edit themselves before they post takes a step toward changing behavioral norms on a particular website over the long term.

While the real-time feedback feature is still an experiment, OpenWeb has started rolling it out to more news organizations to see if the approach can work across different platforms with different needs. Shoval believes that by giving people the chance to police themselves, their behavior will start to change so that less strenuous moderation is needed in the future. It’s a rosy view of the internet. But his approach could leave room for people to make their voices heard without reaching for the most extreme, hurtful, and toxic language first.