A.I. Doctors Have a Trust Problem

Ethicists argue that A.I.-based medical services need to be evaluated and regulated in the same way as new drugs

Imagine you’re a 59-year-old man, and you go to your doctor with chest pains. The doctor thinks it might be a heart attack and orders further tests. Now, imagine you’re a 59-year-old woman with the same symptoms. The doctor tells you that you’re probably having a panic attack.

These strikingly different suggestions, however, didn’t come from a doctor, but from a popular health care app called GP at Hand, which uses artificial intelligence to tell you what might be wrong with you based on your symptoms. Babylon Health, which makes the app, is careful not to use the word “diagnose,” describing the app instead as a triage tool. The company batted away concerns about sexism by arguing that it bases its suggestions on “epidemiological data from a huge number of research studies.” Because women are much less likely to suffer heart attacks than men but twice as likely to suffer from anxiety disorders, it argued, the app’s suggestion was correct.

Nonetheless, the story raises difficult questions about how A.I. should be used in health care. The promise of A.I. is that, by analyzing large quantities of data (patient health care records, laboratory results, scan images, research studies, and DNA databases), it will be possible to create algorithms that make diagnoses and treatment recommendations that are as good as, or even better than, those of a human doctor.

In one example, Moorfields Eye Hospital in London has worked with Google-owned company DeepMind Health to develop an A.I. algorithm trained on thousands of anonymized eye scans. Its 94% accuracy in making correct referral decisions matches that of world-leading eye experts. As Irene Chen, a researcher in computer science at MIT, puts it: “The hope is that if we bring in artificial intelligence, maybe we could make rules that are more objective, and we don’t have to rely on a clinician who hasn’t slept in 24 hours and is just running off caffeine.”

But beneath that good-news story are numerous uncertainties. The first has to do with the data itself. Can we be sure that the raw data on which A.I. tools are trained is completely reliable? It could be that the Babylon app’s differing suggestions for men and women reproduce and reinforce an existing error made by human clinicians. Danton Char, an assistant professor in anesthesiology, perioperative, and pain medicine at Stanford University, who has written about the ethical issues relating to A.I. in health care, points out that the incidence of heart attacks in women has historically been underdiagnosed. Because disease symptoms can vary between men and women, between young and old, and between different ethnic groups, we have to be cautious about training A.I. apps on data sets that are skewed toward a particular demographic; of all the people who have so far had their genomes sequenced, for example, 96% are of white European ethnicity. “That baked-in disparity is probably then reflected in any algorithm that’s trained on that underlying data,” Char says.

Researchers are trying to address the problem. Chen worked on a study that drew on a widely used patient data set known as MIMIC-III to create predictive algorithms and found that the predictions were more accurate for some demographic groups than for others. By identifying these biases, she hopes, it will be possible to design an A.I. algorithm that eliminates them. Chen believes that when used properly, A.I. can tackle previously neglected problems affecting specific demographic groups, such as the greater maternal mortality rates among black women.

Correcting biases isn’t always straightforward, however. Char argues that data may contain biases that researchers are unaware of. “You have to assume that if you’re going to correct a bias that’s in a massive data set you comprehensively understand the bias and everywhere it manifests,” says Char. “Otherwise, you just create data skewed in a more surreptitious way.”

Any doctor or nurse using an A.I. app to support decision-making, therefore, could be nudged toward making a bad decision. If they disagree with the app’s reasoning — or maybe don’t even understand it — they then have to decide whether to rely on the app or trust their own judgment.

Chen is clear that a recommendation from an A.I. system should be treated as “one piece of information” and adds that “if the clinician doesn’t believe in the recommendation, they should override it.” Yet Ibrahim Habli, a senior lecturer at the University of York in the U.K. who specializes in the safety of digital health systems, warns of the risk of “automation bias,” whereby if a technology works most of the time, professionals learn to rely on it “despite the lack of explanation or understanding of how it works.” Doctors might also feel pressured to explain why they are rejecting the app’s recommendation. “You have these tired, shift-based working clinicians,” Char says. “It’s going to be very hard to muster the effort to contradict these things.”

Susan Leigh Anderson, professor emerita in philosophy at the University of Connecticut, argues that an A.I. tool should always “be able to state the rationale used to justify its behavior.” Without this, she points out, “the clinician would not be able to determine whether his or her view is better or worse than the A.I. system’s.” This would also equip the clinician with the information to explain their decision to the patient.

All this leads to the thorny question of who would be responsible if advice from an A.I. tool led to catastrophic outcomes. Traditionally, if a clinician uses technology to support decision-making, the answer has been simple: it’s the clinician, says Habli, who is “ultimately accountable.” But it’s a much harder call when the app and the clinician are making the decision jointly. Anderson believes that ethically it is the people “who developed the A.I. system and those who have sanctioned its use who should be held responsible.” In legal terms, however, the risk is that making technology companies liable for medical errors could lead to financially disabling medical malpractice suits, deterring them from future innovation.

This poses a headache for regulators, who are only now beginning to catch up with the new landscape of A.I. health care apps. While the Food and Drug Administration (FDA) in the U.S. and the Medicines and Healthcare Products Regulatory Agency in the U.K. have processes that work well for evaluating medical devices, they are not necessarily up to the task of evaluating complex A.I. tools, whether aimed at consumers or at clinicians. Habli is concerned that some technology companies are taking advantage of the regulatory lag to market A.I. apps to consumers that underdeliver. “Consumers think that because those products are on the market, then surely someone has approved them,” he says. “Sadly, that’s not the case.”

Recognizing the difficulties, the FDA is reviewing its approval process for A.I. apps, but questions remain. Should an A.I. tool that works well in one environment be approved for use in another, for example? How rigorously should such tools be evaluated?

Neelan Das, consultant cardiac and interventional radiologist and lead for A.I. at East Kent Hospitals University NHS Foundation Trust in the U.K., argues that a health care application using A.I. should be subject to the same kind of peer review scrutiny as a new drug “because it’s making a diagnosis and potentially changing your treatment.”

The risk is that enthusiasm for the benefits of A.I. may lead us to adopt it without proper consideration of the problems. Algorithmic systems have already “skewed the American political context,” says Char, arguing that we should put safeguards for A.I. in health care in place sooner rather than later: “The consequences of polarizing the civic debate are profound, but it’s quite another thing if it begins to have ramifications in life and death decisions.”