Why We Shouldn’t Trust AI to Tell Us What We Feel

There is no good evidence that facial expressions reveal a person’s emotional state. But big tech companies want you to believe otherwise.

At a remote outpost in the mountainous highlands of Papua New Guinea, a young American psychologist named Paul Ekman arrived with a collection of flashcards and a new theory. It was 1967, and Ekman had heard that the Fore people of Okapa were so isolated from the wider world that they would be his ideal test subjects.

Like Western researchers before him, Ekman had come to Papua New Guinea to extract data from the indigenous community. He was gathering evidence to bolster a controversial hypothesis: that all humans exhibit a small number of universal emotions, or affects, that are innate and the same all over the world. For more than half a century, this claim has remained contentious, disputed among psychologists, anthropologists, and technologists. Nonetheless, it became a seed for a growing market that will be worth an estimated $56 billion by 2024. This is the story of how affect recognition came to be part of the artificial-intelligence industry, and the problems that presents.

When Ekman arrived in the tropics of Okapa, he ran experiments to assess how the Fore recognized emotions. Because the Fore had minimal contact with Westerners and mass media, Ekman had theorized that their recognition and display of core expressions would prove that such expressions were universal. His method was simple. He would show them flashcards of facial expressions and see if they described the emotion as he did. In Ekman’s own words, “All I was doing was showing funny pictures.” But Ekman had no training in Fore history, language, culture, or politics. His attempts to conduct his flashcard experiments using translators floundered; he and his subjects were exhausted by the process, which he described as like pulling teeth. Ekman left Papua New Guinea, frustrated by his first attempt at cross-cultural research on emotional expression. But this would be just the beginning.

*This article is adapted from Crawford’s recent book.*

Today affect-recognition tools can be found in national-security systems and at airports, in education and hiring start-ups, in software that purports to detect psychiatric illness and policing programs that claim to predict violence. The claim that a person’s interior state can be accurately assessed by analyzing that person’s face is premised on shaky evidence. A 2019 systematic review of the scientific literature on inferring emotions from facial movements, led by the psychologist and neuroscientist Lisa Feldman Barrett, found there is no reliable evidence that you can accurately predict someone’s emotional state in this manner. “It is not possible to confidently infer happiness from a smile, anger from a scowl, or sadness from a frown, as much of current technology tries to do when applying what are mistakenly believed to be the scientific facts,” the study concludes. So why has the idea that there is a small set of universal emotions, readily interpreted from a person’s face, become so accepted in the AI field?

To understand that requires tracing the complex history and incentives behind how these ideas developed, long before AI emotion-detection tools were built into the infrastructure of everyday life.

The idea of automated affect recognition is as compelling as it is lucrative. Technology companies have captured immense volumes of surface-level imagery of human expressions—including billions of Instagram selfies, Pinterest portraits, TikTok videos, and Flickr photos. Much like facial recognition, affect recognition has become part of the core infrastructure of many platforms, from the biggest tech companies to small start-ups.

Whereas facial recognition attempts to identify a particular individual, affect recognition aims to detect and classify emotions by analyzing any face. These systems already influence how people behave and how social institutions operate, despite a lack of substantial scientific evidence that they work. Automated affect-detection systems are now widely deployed, particularly in hiring. The AI hiring company HireVue, which can list Goldman Sachs, Intel, and Unilever among its clients, uses machine learning to infer people’s suitability for a job. In 2014, the company launched its AI system to extract microexpressions, tone of voice, and other variables from video job interviews, which it used to compare job applicants against a company’s top performers. After considerable criticism from scholars and civil-rights groups, it dropped facial analysis in 2021, but kept vocal tone as an assessment criterion. In January 2016, Apple acquired the start-up Emotient, which claimed to have produced software capable of detecting emotions from images of faces. Perhaps the largest of these start-ups is Affectiva, a company based in Boston that emerged from academic work done at MIT.

Affectiva has coded a variety of emotion-related applications, primarily using deep-learning techniques. These approaches include detecting distracted and “risky” drivers on roads and measuring consumers’ emotional responses to advertising. The company has built what it calls the world’s largest emotion database, made up of more than 10 million people’s expressions from 87 countries. Its monumental collection of videos was hand-labeled by crowdworkers based primarily in Cairo.

Outside the start-up sector, AI giants such as Amazon, Microsoft, and IBM have all designed systems for emotion detection. Microsoft offers perceived emotion detection in its Face API, identifying “anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise,” while Amazon’s Rekognition tool similarly proclaims that it can identify what it characterizes as “all seven emotions” and “measure how these things change over time, such as constructing a timeline of the emotions of an actor.”

Emotion-recognition systems share a similar set of blueprints and founding assumptions: that there is a small number of distinct and universal emotional categories, that we involuntarily reveal these emotions on our faces, and that they can be detected by machines. These articles of faith are so accepted in some fields that it can seem strange even to notice them, let alone question them. But if we look at how emotions came to be taxonomized—neatly ordered and labeled—we see that questions lie in wait at every corner.

Ekman’s research began with a fortunate encounter with Silvan Tomkins, then an established psychologist at Princeton who had published the first volume of his magnum opus, Affect Imagery Consciousness, in 1962. Tomkins’s work on affect had a huge influence on Ekman, who devoted much of his career to studying its implications. One aspect in particular played an outsize role: the idea that if affects are an innate set of evolutionary responses, they would be universal and thus recognizable across cultures. This desire for universality has an important bearing on why this theory is widely applied in AI emotion-recognition systems today. The theory could be applied everywhere, a simplification of complexity that was easily replicable at scale.

In the introduction to Affect Imagery Consciousness, Tomkins framed his theory of biologically based universal affects as one addressing an acute crisis of human sovereignty. He was challenging the development of behaviorism and psychoanalysis, two schools of thought that he believed treated consciousness as a mere by-product that was in service to other forces. He noted that human consciousness had “been challenged and reduced again and again, first by Copernicus”—who displaced man from the center of the universe—“then by Darwin”—whose theory of evolution shattered the idea that humans were created in the image of a Christian God—“and most of all by Freud”—who decentered human consciousness and reason as the driving forces behind our motivations. Tomkins continued, “The paradox of maximal control over nature and minimal control over human nature is in part a derivative of the neglect of the role of consciousness as a control mechanism.” To put it simply, consciousness tells us little about why we feel and act the way we do. This is a crucial claim for all sorts of later applications of affect theory, which stress the inability of humans to recognize both the feeling and the expression of affects. If we as humans are incapable of truly detecting what we are feeling, then perhaps AI systems can do it for us?

Tomkins’s theory of affects was his way to address the problem of human motivation. He argued that motivation was governed by two systems: affects and drives. Tomkins proposed that drives tend to be closely associated with immediate biological needs, such as hunger and thirst. They are instrumental; the pain of hunger can be remedied with food. But the primary system governing human motivation and behavior is that of affects, involving positive and negative feelings. Affects, which play the most important role in human motivation, amplify drive signals, but they are much more complex. For example, it is difficult to know the precise causes that lead a baby to cry, expressing the distress-anguish affect.

How can we know anything about a system in which the connections between cause and effect, stimulus and response, are so tenuous and uncertain? Tomkins proposed an answer: “The primary affects . . . seem to be innately related in a one-to-one fashion with an organ system which is extraordinarily visible”—namely, the face. He found precedents for this emphasis on facial expression in two works published in the 19th century: Charles Darwin’s The Expression of the Emotions in Man and Animals, from 1872, and an obscure volume by the French neurologist Guillaume-Benjamin-Amand Duchenne de Boulogne from 1862.

Tomkins assumed that the facial display of affects was a universal human trait. “Affects,” Tomkins believed, “are sets of muscle, vascular, and glandular responses located in the face and also widely distributed through the body, which generate sensory feedback . . . These organized sets of responses are triggered at subcortical centers where specific ‘programs’ for each distinct affect are stored”—a very early use of a computational metaphor for a human system. But Tomkins acknowledged that the interpretation of affective displays depends on individual, social, and cultural factors. He admitted that there were very different “dialects” of facial language in different societies. Even the forefather of affect research raised the possibility that interpreting facial displays depends on social and cultural context.

Given that facial expressions are culturally variable, using them to train machine-learning systems would inevitably mix together all sorts of different contexts, signals, and expectations. The problem for Ekman, and later for the field of computer vision, was how to reconcile these tensions.

During the mid-1960s, opportunity knocked at Ekman’s door in the form of a large grant from what is now called the Defense Advanced Research Projects Agency (DARPA), a research arm of the Department of Defense. DARPA’s sizable financial support allowed Ekman to begin his first studies to prove universality in facial expression. In general, these studies followed a design that would be copied in early AI labs. He largely duplicated Tomkins’s methods, even using Tomkins’s photographs to test subjects from Chile, Argentina, Brazil, the United States, and Japan. Subjects were presented with photographs of posed facial expressions, selected by the designers as exemplifying or expressing a particularly “pure” affect, such as fear, surprise, anger, happiness, sadness, and disgust. Subjects were then asked to choose among these affect categories and label the posed image. The analysis measured the degree to which the labels chosen by subjects correlated with those chosen by the designers.
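The scoring step in this design can be approximated as a simple agreement rate. Below is a minimal sketch in Python, using plain agreement as a stand-in for whatever statistic the original studies used. The function name and the sample labels are hypothetical and are not taken from Ekman's data. The sketch also computes the chance level for a forced choice among six categories, the baseline against which any agreement figure has to be read.

```python
# The six affect categories named in the study design.
AFFECTS = ["fear", "surprise", "anger", "happiness", "sadness", "disgust"]


def agreement_rate(designer_labels, subject_labels):
    """Return the fraction of photos where the subject's forced choice
    matches the label the designers assigned to the posed photograph."""
    if len(designer_labels) != len(subject_labels):
        raise ValueError("label lists must be the same length")
    if not designer_labels:
        raise ValueError("no labels to compare")
    matches = sum(d == s for d, s in zip(designer_labels, subject_labels))
    return matches / len(designer_labels)


# Hypothetical responses for six photographs (illustration only).
designer = ["fear", "surprise", "anger", "happiness", "sadness", "disgust"]
subject = ["surprise", "surprise", "anger", "happiness", "sadness", "anger"]

rate = agreement_rate(designer, subject)

# With six options and uniform guessing, a subject would match the
# designers about one time in six.
chance = 1 / len(AFFECTS)

print(f"agreement: {rate:.2f}, chance baseline: {chance:.2f}")
```

Note that a high agreement rate here only shows that subjects picked the same word from a fixed list as the designers did; it says nothing about whether the posed face reflects a felt emotion.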

From the start, the methodology had problems. Ekman’s forced-choice response format would be later criticized for alerting subjects to the connections that designers had already made between facial expressions and emotions. Further, the fact that these emotions were faked would raise questions about the validity of the results.

The idea that interior states can be reliably inferred from external signs has a long history. It stems in part from the history of physiognomy, which was premised on studying a person’s facial features for indications of his character. Aristotle believed that “it is possible to judge men’s character from their physical appearance . . . for it has been assumed that body and soul are affected together.” The Greeks also used physiognomy as an early form of racial classification, applied to “the genus man itself, dividing him into races, in so far as they differ in appearance and in character (for instance Egyptians, Thracians, and Scythians).”

Physiognomy in Western culture reached a high point during the 18th and 19th centuries, when it was seen as part of the anatomical sciences. A key figure in this tradition was the Swiss pastor Johann Kaspar Lavater, who wrote Essays on Physiognomy: For the Promotion of Knowledge and the Love of Mankind, originally published in German in 1789. Lavater took the approaches of physiognomy and blended them with the latest scientific knowledge. He believed that bone structure was an underlying connection between physical appearance and character type. If facial expressions were fleeting, skulls seemed to offer a more solid material for physiognomic inferences. Skull measurement was a popular technique in race science, and was used to support nationalism, white supremacy, and xenophobia. This work was infamously elaborated on throughout the 19th century by phrenologists such as Franz Joseph Gall and Johann Gaspar Spurzheim, as well as in scientific criminology through the work of Cesare Lombroso.

But it was the French neurologist Duchenne, described by Ekman as a “marvelously gifted observer,” who codified the use of photography and other technical means in the study of human faces. In Mécanisme de la physionomie humaine, Duchenne laid important foundations for both Darwin and Ekman, connecting older ideas from physiognomy and phrenology with more modern investigations into physiology and psychology. He replaced vague assertions about character with a more limited investigation into expression and interior mental and emotional states.

Duchenne worked in Paris at the Salpêtrière asylum, which housed up to 5,000 people with a wide range of mental illnesses and neurological conditions. Some would become his subjects for distressing experiments, part of the long tradition of medical and technological experimentation on the most vulnerable, those who cannot refuse. Duchenne, who was little known in the scientific community, decided to develop techniques of electrical shocks to stimulate isolated muscle movements in people’s faces. His aim was to build a more complete anatomical and physiological understanding of the face. Duchenne used these methods to bridge the new psychological science and the much older study of physiognomic signs, or passions. He relied on the latest photographic advancements, such as collodion processing, which allowed for much shorter exposure times, enabling Duchenne to freeze fleeting muscular movements and facial expressions in images.

Even at these early stages, the faces were never natural or socially occurring human expressions but simulations produced by the brute application of electricity to the muscles. Regardless, Duchenne believed that the use of photography and other technical systems would transform the squishy business of representation into something objective and evidentiary, more suitable for scientific study. Darwin praised Duchenne’s “magnificent photographs” and included reproductions in his own work.

Plates from *Mécanisme de la physionomie humaine*. (U.S. National Library of Medicine)

Ekman would follow Duchenne in placing photography at the center of his experimental practice. He believed that slow-motion photography was essential to his approach, because many facial expressions operate at the limits of human perception. The aim was to find so-called microexpressions—tiny muscle movements in the face.

One of Ekman’s ambitious plans in his early research was to codify a system for detecting and analyzing facial expressions. In 1971, he co-published a description of what he called the Facial Affect Scoring Technique (FAST).

Relying on posed photographs, the approach used six basic emotional types largely derived from Ekman’s intuitions. But FAST soon ran into problems when other scientists encountered facial expressions not included in its typology. So Ekman decided to ground his next measurement tool in facial musculature, harkening back to Duchenne’s original electroshock studies. Ekman identified roughly 40 distinct muscular contractions on the face and called the basic components of each facial expression an “action unit.” After some testing and validation, Ekman and Wallace Friesen published the Facial Action Coding System (FACS) in 1978; updated editions continue to be widely used.

Despite its financial success, FACS was very labor-intensive to use. Ekman wrote that it took 75 to 100 hours to train users in the FACS methodology, and an hour to score a single minute of facial footage. This challenge presented exactly the type of opportunity that the emerging field of computer vision was hungry to take on.
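To give a sense of scale, the figures Ekman cites can be turned into a rough labor estimate. The following is a back-of-the-envelope sketch in Python. The function name and the example footage length are illustrative assumptions; the rate of one hour per minute of footage comes from the figure quoted above.

```python
# Ekman's reported figure: about one hour of coder time per minute of footage.
SCORING_HOURS_PER_FOOTAGE_MINUTE = 1.0


def scoring_hours(footage_minutes, hours_per_minute=SCORING_HOURS_PER_FOOTAGE_MINUTE):
    """Estimate the coder hours needed to hand-score a length of facial footage."""
    if footage_minutes < 0:
        raise ValueError("footage length cannot be negative")
    return footage_minutes * hours_per_minute


# Hypothetical example: a single one-hour recording.
hours_for_one_hour_of_video = scoring_hours(60)

print(f"Scoring 60 minutes of footage takes about {hours_for_one_hour_of_video:.0f} coder hours")
```

At that ratio, one hour of video costs a trained coder roughly a week and a half of full-time work, before counting the 75 to 100 hours of training.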

As work into the use of computers in affect recognition began to take shape, researchers recognized the need for a collection of standardized images to experiment with. A 1992 National Science Foundation report co-written by Ekman recommended that “a readily accessible, multimedia database shared by the diverse facial research community would be an important resource for the resolution and extension of issues concerning facial understanding.” Within a year, the Department of Defense began funding a program to collect facial photographs. By the end of the decade, machine-learning researchers had started to assemble, label, and make public the data sets that drive much of today’s machine-learning research. Academic labs and companies worked on parallel projects, creating scores of photo databases. For example, researchers in a lab in Sweden created Karolinska Directed Emotional Faces. This database comprises images of individuals portraying posed emotional expressions corresponding to Ekman’s categories. They’ve made their faces into the shapes that accord with six basic emotional states: joy, anger, disgust, sadness, surprise, and fear. When looking at these training sets, it is difficult to not be struck by a sense of pantomime: Incredible surprise! Abundant joy! Paralyzing fear! These subjects are literally making machine-readable emotion.

Facial expressions from the Cohn-Kanade data set: joy, anger, disgust, sadness, surprise, and fear. (Courtesy of Jeffrey Cohn)

As the field grew in scale and complexity, so did the types of photographs used in affect recognition. Researchers began using the FACS system to label data generated not from posed expressions but rather from spontaneous facial expressions, sometimes gathered outside of laboratory conditions. Ekman’s work had a profound and wide-ranging influence. The New York Times described Ekman as “the world’s most famous face reader,” and Time named him one of the 100 most influential people in the world. He would eventually consult with clients as disparate as the Dalai Lama, the FBI, the CIA, the Secret Service, and the animation studio Pixar, which wanted to create more lifelike renderings of cartoon faces. His ideas became part of popular culture, included in best sellers such as Malcolm Gladwell’s Blink and a television drama, Lie to Me, on which Ekman was a consultant for the lead character’s role, apparently loosely based on him.

His business prospered: Ekman sold techniques of deception detection to agencies such as the Transportation Security Administration, which used them to develop the Screening of Passengers by Observation Techniques (SPOT) program. SPOT has been used to monitor air travelers’ facial expressions since the September 11 attacks, in an attempt to “automatically” detect terrorists. The system uses a set of 94 criteria, all of which are allegedly signs of stress, fear, or deception. But looking for these responses means that some groups are immediately disadvantaged. Anyone who is stressed, is uncomfortable under questioning, or has had negative experiences with police and border guards can score higher. This creates its own forms of racial profiling. The SPOT program has been criticized by the Government Accountability Office and civil-liberties groups for its racial bias and lack of scientific methodology. Despite its $900 million price tag, there is no evidence that it has produced clear successes.

As Ekman’s fame spread, so did skepticism about his work, with critiques emerging from a number of fields. An early critic was the cultural anthropologist Margaret Mead, who debated Ekman on the question of the universality of emotions in the late 1960s. Mead was unconvinced by Ekman’s belief in universal, biological determinants of behavior that exist separately from highly conditioned cultural factors.

Scientists from different fields joined the chorus over the decades. In more recent years, the psychologists James Russell and José-Miguel Fernández-Dols have shown that the most basic aspects of the science remain uncertain. Perhaps the foremost critic of Ekman’s theory is the historian of science Ruth Leys, who sees a fundamental circularity in Ekman’s method. The posed or simulated photographs he used were assumed to express a set of basic affective states that were, Leys wrote, “already free of cultural influence.” These photographs were then used to elicit labels from different populations to demonstrate the universality of facial expressions. The psychologist and neuroscientist Lisa Feldman Barrett puts it bluntly: “Companies can say whatever they want, but the data are clear. They can detect a scowl, but that’s not the same thing as detecting anger.”

More troubling still is that in the field of the study of emotions, researchers have not reached consensus about what an emotion actually is. What emotions are, how they are formulated within us and expressed, what their physiological or neurobiological functions could be, their relation to stimuli—all of this remains stubbornly unsettled. Why, with so many critiques, has the approach of “reading emotions” from a person’s face endured? Since the 1960s, driven by significant Department of Defense funding, multiple systems have been developed that are more and more accurate at measuring facial movements. Ekman’s theory seemed ideal for computer vision because it could be automated at scale. The theory fit what the tools could do.

Powerful institutional and corporate investments have been made based on the perceived validity of Ekman’s theories and methodologies. Recognizing that emotions are not easily classified, or that they’re not reliably detectable from facial expressions, could undermine an expanding industry. Many machine-learning papers cite Ekman as though these issues are resolved, before directly proceeding into engineering challenges. The more complex issues of context, conditioning, relationality, and culture are often ignored. Ekman himself has said he is concerned about how his ideas are being commercialized, but when he’s written to tech companies asking for evidence that their emotion-recognition programs work, he has received no reply.

Instead of trying to build more systems that group expressions into machine-readable categories, we should question the origins of those categories themselves, as well as their social and political consequences. For example, these systems are known to flag the speech affects of women, particularly Black women, differently from those of men. A study conducted at the University of Maryland has shown that some facial recognition software interprets Black faces as having more negative emotions than white faces, specifically registering them as angrier and more contemptuous, even when controlling for their degree of smiling.

This is the danger of automating emotion recognition. These tools can take us back to the phrenological past, when spurious claims were used to support existing systems of power. The decades of scientific controversy around inferring emotional states consistently from a person’s face underscores a central point: One-size-fits-all “detection” is not the right approach. Emotions are complicated, and they develop and change in relation to our cultures and histories—all the manifold contexts that live outside the AI frame.

But already, job applicants are judged unfairly because their facial expressions or vocal tones don’t match those of other employees. Students are flagged at school because their faces appear angry, and customers are questioned because their facial cues indicate they may be shoplifters. These are the people who will bear the costs of systems that are not just technically imperfect, but based on questionable methodologies. A narrow taxonomy of emotions—grown from Ekman’s initial experiments—is being coded into machine-learning systems as a proxy for the infinite complexity of emotional experience in the world.