Machines Beat Humans on a Reading Test. But Do They Understand?

A tool known as BERT can now beat humans on advanced reading-comprehension tests. But it’s also revealed how far AI has to go.

In the fall of 2017, Sam Bowman, a computational linguist at New York University, figured that computers still weren’t very good at understanding the written word. Sure, they had become decent at simulating that understanding in certain narrow domains, like automatic translation or sentiment analysis (for example, determining if a sentence sounds “mean or nice,” he said). But Bowman wanted measurable evidence of the genuine article: bona fide, human-style reading comprehension in English. So he came up with a test.

In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.

The machines bombed. Even state-of-the-art neural networks scored no higher than 69 out of 100 across all nine tasks: a D-plus, in letter grade terms. Bowman and his coauthors weren’t surprised. Neural networks — layers of computational connections built in a crude approximation of how neurons communicate within mammalian brains — had shown promise in the field of “natural language processing” (NLP), but the researchers weren’t convinced that these systems were learning anything substantial about language itself. And GLUE seemed to prove it. “These early results indicate that solving GLUE is beyond the capabilities of current models and methods,” Bowman and his coauthors wrote.

Their appraisal would be short-lived. In October of 2018, Google introduced a new method nicknamed BERT (Bidirectional Encoder Representations from Transformers). It produced a GLUE score of 80.5. On this brand-new benchmark designed to measure machines’ real understanding of natural language — or to expose their lack thereof — the machines had jumped from a D-plus to a B-minus in just six months.

“That was definitely the ‘oh, crap’ moment,” Bowman recalled, using a more colorful interjection. “The general reaction in the field was incredulity. BERT was getting numbers on many of the tasks that were close to what we thought would be the limit of how well you could do.” Indeed, GLUE didn’t even bother to include human baseline scores before BERT; by the time Bowman and one of his Ph.D. students added them to GLUE in February 2019, they lasted just a few months before a BERT-based system from Microsoft beat them.

As of this writing, nearly every position on the GLUE leaderboard is occupied by a system that incorporates, extends or optimizes BERT. Five of these systems outrank human performance.

But is AI actually starting to understand our language — or is it just getting better at gaming our systems? As BERT-based neural networks have taken benchmarks like GLUE by storm, new evaluation methods have emerged that seem to paint these powerful NLP systems as computational versions of Clever Hans, the early 20th-century horse who seemed smart enough to do arithmetic, but who was actually just following unconscious cues from his trainer.

“We know we’re somewhere in the gray area between solving language in a very boring, narrow sense, and solving AI,” Bowman said. “The general reaction of the field was: Why did this happen? What does this mean? What do we do now?”

Writing Their Own Rules

In the famous Chinese Room thought experiment, a non-Chinese-speaking person sits in a room furnished with many rulebooks. Taken together, these rulebooks perfectly specify how to take any incoming sequence of Chinese symbols and craft an appropriate response. A person outside slips questions written in Chinese under the door. The person inside consults the rulebooks, then sends back perfectly coherent answers in Chinese.

The thought experiment has been used to argue that, no matter how it might appear from the outside, the person inside the room can’t be said to have any true understanding of Chinese. Still, even a simulacrum of understanding has been a good enough goal for natural language processing.

The only problem is that perfect rulebooks don’t exist, because natural language is far too complex and haphazard to be reduced to a rigid set of specifications. Take syntax, for example: the rules (and rules of thumb) that define how words group into meaningful sentences. The phrase “colorless green ideas sleep furiously” has perfect syntax, but any natural speaker knows it’s nonsense. What prewritten rulebook could capture this “unwritten” fact about natural language — or innumerable others?

NLP researchers have tried to square this circle by having neural networks write their own makeshift rulebooks, in a process called pretraining.

Before 2018, one of NLP’s main pretraining tools was something like a dictionary. Known as word embeddings, this dictionary encoded associations between words as numbers in a way that deep neural networks could accept as input — akin to giving the person inside a Chinese room a crude vocabulary book to work with. But a neural network pretrained with word embeddings is still blind to the meaning of words at the sentence level. “It would think that ‘a man bit the dog’ and ‘a dog bit the man’ are exactly the same thing,” said Tal Linzen, a computational linguist at Johns Hopkins University.
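
Linzen's point can be seen in a toy sketch. It assumes a sentence is represented by averaging the vectors of its words, which is one simple scheme and not necessarily what any particular system did. The two-dimensional vectors and the `sentence_vector` helper are invented for illustration.

```python
# Toy illustration: averaging word vectors discards word order.
# The 2-dimensional vectors below are invented for this example.
TOY_EMBEDDINGS = {
    "a": [0.1, 0.0],
    "the": [0.0, 0.1],
    "man": [0.9, 0.2],
    "dog": [0.3, 0.8],
    "bit": [0.5, 0.5],
}

def sentence_vector(sentence):
    """Average the vectors of the words in a sentence."""
    words = sentence.lower().split()
    dims = len(next(iter(TOY_EMBEDDINGS.values())))
    totals = [0.0] * dims
    for word in words:
        for i, value in enumerate(TOY_EMBEDDINGS[word]):
            totals[i] += value
    return [total / len(words) for total in totals]

first = sentence_vector("a man bit the dog")
second = sentence_vector("a dog bit the man")

# Compare with a small tolerance, since floating-point sums can differ
# in their last bits depending on the order of addition.
same = all(abs(x - y) < 1e-12 for x, y in zip(first, second))
print(same)
```

Because both sentences contain the same five words, their averaged vectors match, and anything built on top of that average cannot tell who bit whom.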

A better method would use pretraining to equip the network with richer rulebooks — not just for vocabulary, but for syntax and context as well — before training it to perform a specific NLP task. In early 2018, researchers at OpenAI, the University of San Francisco, the Allen Institute for Artificial Intelligence and the University of Washington simultaneously discovered a clever way to approximate this feat. Instead of pretraining just the first layer of a network with word embeddings, the researchers began training entire neural networks on a broader basic task called language modeling.

“The simplest kind of language model is: I’m going to read a bunch of words and then try to predict the next word,” explained Myle Ott, a research scientist at Facebook. “If I say, ‘George Bush was born in,’ the model now has to predict the next word in that sentence.”

These deep pretrained language models could be produced relatively efficiently. Researchers simply fed their neural networks massive amounts of written text copied from freely available sources like Wikipedia — billions of words, preformatted into grammatically correct sentences — and let the networks derive next-word predictions on their own. In essence, it was like asking the person inside a Chinese room to write all his own rules, using only the incoming Chinese messages for reference.
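
A minimal sketch of next-word prediction follows. It swaps in a far simpler technique than a neural network: it only counts which word follows which in a tiny invented corpus. The corpus and the names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real models train on billions of words.
CORPUS = [
    "george bush was born in connecticut",
    "the senator was born in texas",
    "she was born in connecticut",
]

def train_bigram_model(sentences):
    """Count, for each word, which words follow it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

model = train_bigram_model(CORPUS)
print(predict_next(model, "in"))  # prints "connecticut"
```

No one wrote a rule saying that "in" is often followed by a place name. The counts came entirely from the text, which is the sense in which the model writes its own rulebook.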

“The great thing about this approach is it turns out that the model learns a ton of stuff about syntax,” Ott said.

What’s more, these pretrained neural networks could then apply their richer representations of language to the job of learning an unrelated, more specific NLP task, a process called fine-tuning.

“You can take the model from the pretraining stage and kind of adapt it for whatever actual task you care about,” Ott explained. “And when you do that, you get much better results than if you had just started with your end task in the first place.”

Indeed, in June of 2018, OpenAI unveiled a neural network called GPT, which included a language model pretrained for an entire month on nearly a billion words (sourced from 11,038 digital books). Its GLUE score of 72.8 immediately took the top spot on the leaderboard. Still, Sam Bowman assumed that the field had a long way to go before any system could even begin to approach human-level performance.

Then BERT appeared.

A Powerful Recipe

So what exactly is BERT?

First, it’s not a fully trained neural network capable of besting human performance right out of the box. Instead, said Bowman, BERT is “a very precise recipe for pretraining a neural network.” Just as a baker can follow a recipe to reliably produce a delicious prebaked pie crust — which can then be used to make many different kinds of pie, from blueberry to spinach quiche — Google researchers developed BERT’s recipe to serve as an ideal foundation for “baking” neural networks (that is, fine-tuning them) to do well on many different natural language processing tasks. Google also open-sourced BERT’s code, which means that other researchers don’t have to repeat the recipe from scratch — they can just download BERT as-is, like buying a prebaked pie crust from the supermarket.

If BERT is essentially a recipe, what’s the ingredient list? “It’s the result of three things coming together to really make things click,” said Omer Levy, a research scientist at Facebook who has analyzed BERT’s inner workings.

The first is a pretrained language model, those reference books in our Chinese room. The second is the ability to figure out which features of a sentence are most important.

In 2017, an engineer at Google Brain named Jakob Uszkoreit was working on ways to accelerate Google’s language-understanding efforts. He noticed that state-of-the-art neural networks also suffered from a built-in constraint: They all looked through the sequence of words one by one. This “sequentiality” seemed to match intuitions of how humans actually read written sentences. But Uszkoreit wondered if “it might be the case that understanding language in a linear, sequential fashion is suboptimal,” he said.

Uszkoreit and his collaborators devised a new architecture for neural networks focused on “attention,” a mechanism that lets each layer of the network assign more weight to some specific features of the input than to others. This new attention-focused architecture, called a transformer, could take a sentence like “a dog bites the man” as input and encode each word in many different ways in parallel. For example, a transformer might connect “bites” and “man” together as verb and object, while ignoring “a”; at the same time, it could connect “bites” and “dog” together as verb and subject, while mostly ignoring “the.”
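
The weighting idea can be sketched in a few lines. This is a simplified form of dot-product attention under stated assumptions: the word vectors are invented, and the same vectors serve as queries, keys and values, whereas real transformers learn separate projections for each and run many such computations in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dims = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dims)]
    return weights, mixed

# Invented 2-dimensional vectors for the words in "a dog bites the man".
WORDS = ["a", "dog", "bites", "the", "man"]
VECTORS = [[0.1, 0.0], [2.0, 0.5], [1.5, 1.5], [0.0, 0.1], [0.5, 2.0]]

# Let "bites" query every word, using the same vectors as keys and values.
weights, mixed = attend(VECTORS[2], VECTORS, VECTORS)
for word, weight in zip(WORDS, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up vectors, "a" and "the" receive much smaller weights than "dog" and "man," so the new representation of "bites" is built mostly from the words that matter to it.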

The nonsequential nature of the transformer represented sentences in a more expressive form, which Uszkoreit calls treelike. Each layer of the neural network makes multiple, parallel connections between certain words while ignoring others — akin to a student diagramming a sentence in elementary school. These connections are often drawn between words that may not actually sit next to each other in the sentence. “Those structures effectively look like a number of trees that are overlaid,” Uszkoreit explained.

This treelike representation of sentences gave transformers a powerful way to model contextual meaning, and also to efficiently learn associations between words that might be far away from each other in complex sentences. “It’s a bit counterintuitive,” Uszkoreit said, “but it is rooted in results from linguistics, which has for a long time looked at treelike models of language.”

Finally, the third ingredient in BERT’s recipe takes nonlinear reading one step further.

Unlike other pretrained language models, many of which are created by having neural networks read terabytes of text from left to right, BERT’s model reads left to right and right to left at the same time, and learns to predict words in the middle that have been randomly masked from view. For example, BERT might accept as input a sentence like “George Bush was [……..] in Connecticut in 1946” and predict the masked word in the middle of the sentence (in this case, “born”) by parsing the text from both directions. “This bidirectionality is conditioning a neural network to try to get as much information as it can out of any subset of words,” Uszkoreit said.
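
Preparing such a fill-in-the-blank exercise might look like the sketch below. It covers only the masking step, not the prediction, and the `[MASK]` placeholder, the `mask_tokens` helper and the 15 percent default rate are parameters chosen for the sketch.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens and record the hidden originals."""
    rng = random.Random(seed)
    # Always hide at least one token so there is something to predict.
    count = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), count))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets

tokens = "George Bush was born in Connecticut in 1946".split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
print(targets)
```

During pretraining, the network would see only the masked sentence and be scored on how well it recovers the words stored in `targets`, using the context on both sides of each blank.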

The Mad-Libs-esque pretraining task that BERT uses — called masked-language modeling — isn’t new. In fact, it’s been used as a tool for assessing language comprehension in humans for decades. For Google, it also offered a practical way of enabling bidirectionality in neural networks, as opposed to the unidirectional pretraining methods that had previously dominated the field. “Before BERT, unidirectional language modeling was the standard, even though it is an unnecessarily restrictive constraint,” said Kenton Lee, a research scientist at Google.

Each of these three ingredients — a deep pretrained language model, attention and bidirectionality — existed independently before BERT. But until Google released its recipe in late 2018, no one had combined them in such a powerful way.

Refining the Recipe

Like any good recipe, BERT was soon adapted by cooks to their own tastes. In the spring of 2019, there was a period “when Microsoft and Alibaba were leapfrogging each other week by week, continuing to tune their models and trade places at the number one spot on the leaderboard,” Bowman recalled. When an improved version of BERT called RoBERTa first came on the scene in August, the DeepMind researcher Sebastian Ruder dryly noted the occasion in his widely read NLP newsletter: “Another month, another state-of-the-art pretrained language model.”

BERT’s “pie crust” incorporates a number of structural design decisions that affect how well it works. These include the size of the neural network being baked, the amount of pretraining data, how that pretraining data is masked and how long the neural network gets to train on it. Subsequent recipes like RoBERTa result from researchers tweaking these design decisions, much like chefs refining a dish.

In RoBERTa’s case, researchers at Facebook and the University of Washington increased some ingredients (more pretraining data, longer input sequences, more training time), took one away (a “next sentence prediction” task, originally included in BERT, that actually degraded performance) and modified another (they made the masked-language pretraining task harder). The result? First place on GLUE — briefly. Six weeks later, researchers from Microsoft and the University of Maryland added their own tweaks to RoBERTa and eked out a new win. As of this writing, yet another model called ALBERT, short for “A Lite BERT,” has taken GLUE’s top spot by further adjusting BERT’s basic design.

“We’re still figuring out what recipes work and which ones don’t,” said Facebook’s Ott, who worked on RoBERTa.

Still, just as perfecting your pie-baking technique isn’t likely to teach you the principles of chemistry, incrementally optimizing BERT doesn’t necessarily impart much theoretical knowledge about advancing NLP. “I’ll be perfectly honest with you: I don’t follow these papers, because they are extremely boring to me,” said Linzen, the computational linguist from Johns Hopkins. “There is a scientific puzzle there,” he granted, but it doesn’t lie in figuring out how to make BERT and all its spawn smarter, or even in figuring out how they got smart in the first place. Instead, “we are trying to understand to what extent these models are really understanding language,” he said, and not “picking up weird tricks that happen to work on the data sets that we commonly evaluate our models on.”

In other words: BERT is doing something right. But what if it’s for the wrong reasons?

Clever but Not Smart

In July 2019, two researchers from Taiwan’s National Cheng Kung University used BERT to achieve an impressive result on a relatively obscure natural language understanding benchmark called the argument reasoning comprehension task. Performing the task requires selecting the appropriate implicit premise (called a warrant) that will back up a reason for arguing some claim. For example, to argue that “smoking causes cancer” (the claim) because “scientific studies have shown a link between smoking and cancer” (the reason), you need to presume that “scientific studies are credible” (the warrant), as opposed to “scientific studies are expensive” (which may be true, but makes no sense in the context of the argument). Got all that?

If not, don’t worry. Even human beings don’t do particularly well on this task without practice: The average baseline score for an untrained person is 80 out of 100. BERT got 77 — “surprising,” in the authors’ understated opinion.

But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing. An article in The Gradient, a machine-learning magazine published out of the Stanford Artificial Intelligence Laboratory, compared BERT to Clever Hans, the horse with the phony powers of arithmetic.
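
To see how a spurious cue can stand in for reasoning, consider a toy sketch of a "model" that never reads the claim or the reason. The four examples are invented, not drawn from the actual data set, and `pick_by_cue` and `cue_accuracy` are hypothetical helpers.

```python
# Invented toy examples: each has two candidate warrants and the index
# of the correct one. These are not taken from the real data set.
EXAMPLES = [
    (("studies are not always biased", "studies are expensive"), 0),
    (("taxes are high", "taxes do not fund this program"), 1),
    (("the law is not enforced", "the law is old"), 0),
    (("voters are informed", "voters are not consulted"), 0),
]

def pick_by_cue(warrants, cue="not"):
    """Choose the first warrant containing the cue word; else the first."""
    for index, warrant in enumerate(warrants):
        if cue in warrant.split():
            return index
    return 0

def cue_accuracy(examples):
    """Fraction of examples the cue-only heuristic answers correctly."""
    hits = sum(pick_by_cue(warrants) == answer for warrants, answer in examples)
    return hits / len(examples)

print(cue_accuracy(EXAMPLES))  # prints 0.75
```

On this made-up set the heuristic gets three of four answers right without any notion of what an argument is. A benchmark with a similar imbalance would reward the same shortcut.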

In another paper called “Right for the Wrong Reasons,” Linzen and his coauthors published evidence that BERT’s high performance on certain GLUE tasks might also be attributed to spurious cues in the training data for those tasks. (The paper included an alternative data set designed to specifically expose the kind of shortcut that Linzen suspected BERT was using on GLUE. The data set’s name: Heuristic Analysis for Natural-Language-Inference Systems, or HANS.)

So is BERT, and all of its benchmark-busting siblings, essentially a sham? Bowman agrees with Linzen that some of GLUE’s training data is messy — shot through with subtle biases introduced by the humans who created it, all of which are potentially exploitable by a powerful BERT-based neural network. “There’s no single ‘cheap trick’ that will let it solve everything [in GLUE], but there are lots of shortcuts it can take that will really help,” Bowman said, “and the model can pick up on those shortcuts.” But he doesn’t think BERT’s foundation is built on sand, either. “It seems like we have a model that has really learned something substantial about language,” he said. “But it’s definitely not understanding English in a comprehensive and robust way.”

According to Yejin Choi, a computer scientist at the University of Washington and the Allen Institute, one way to encourage progress toward robust understanding is to focus not just on building a better BERT, but also on designing better benchmarks and training data that lower the possibility of Clever Hans–style cheating. Her work explores an approach called adversarial filtering, which uses algorithms to scan NLP training data sets and remove examples that are overly repetitive or that otherwise introduce spurious cues for a neural network to pick up on. After this adversarial filtering, “BERT’s performance can reduce significantly,” she said, while “human performance does not drop so much.”

Still, some NLP researchers believe that even with better training, neural language models may still face a fundamental obstacle to real understanding. Even with its powerful pretraining, BERT is not designed to perfectly model language in general. Instead, after fine-tuning, it models “a specific NLP task, or even a specific data set for that task,” said Anna Rogers, a computational linguist at the Text Machine Lab at the University of Massachusetts, Lowell. And it’s likely that no training data set, no matter how comprehensively designed or carefully filtered, can capture all the edge cases and unforeseen inputs that humans effortlessly cope with when we use natural language.

Bowman points out that it’s hard to know how we would ever be fully convinced that a neural network achieves anything like real understanding. Standardized tests, after all, are supposed to reveal something intrinsic and generalizable about the test-taker’s knowledge. But as anyone who has taken an SAT prep course knows, tests can be gamed. “We have a hard time making tests that are hard enough and trick-proof enough that solving [them] really convinces us that we’ve fully solved some aspect of AI or language technology,” he said.

Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that’s specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does it just mean that science has gotten better at teaching machines to the test?

“That’s a good analogy,” Bowman said. “We figured out how to solve the LSAT and the MCAT, and we might not actually be qualified to be doctors and lawyers.” Still, he added, this seems to be the way that artificial intelligence research moves forward. “Chess felt like a serious test of intelligence until we figured out how to write a chess program,” he said. “We’re definitely in an era where the goal is to keep coming up with harder problems that represent language understanding, and keep figuring out how to solve those problems.”