AI Needs Your Data—and You Should Get Paid for It

A new approach to training artificial intelligence algorithms involves paying people to submit medical data, and storing it in a blockchain-protected system.

Robert Chang, a Stanford ophthalmologist, normally stays busy prescribing drops and performing eye surgery. But a few years ago, he decided to jump on a hot new trend in his field: artificial intelligence. Doctors like Chang often rely on eye imaging to track the development of conditions like glaucoma. With enough scans, he reasoned, he might find patterns that could help him better interpret test results.

Gregory Barber covers cryptocurrency, blockchain, and artificial intelligence for WIRED.

That is, if he could get his hands on enough data. Chang embarked on a journey that’s familiar to many medical researchers looking to dabble in machine learning. He started with his own patients, but that wasn’t nearly enough, since training AI algorithms can require thousands or even millions of data points. He filled out grants and appealed to collaborators at other universities. He went to donor registries, where people voluntarily bring their data for researchers to use. But pretty soon he hit a wall. The data he needed was tied up in complicated sharing rules. “I was basically begging for data,” Chang says.

Chang thinks he might soon have a workaround to the data problem: patients. He’s working with Dawn Song, a professor at the University of California-Berkeley, to create a secure way for patients to share their data with researchers. It relies on a cloud computing network from Oasis Labs, founded by Song, and is designed so that researchers never see the data, even when it’s used to train AI. To encourage patients to participate, they’ll get paid when their data is used.

That design has implications well beyond healthcare. In California, Governor Gavin Newsom recently proposed a so-called “data dividend” that would transfer wealth from the state’s tech firms to its residents, and US Senator Mark Warner (D-Virginia) has introduced a bill that would require firms to put a price tag on each user’s personal data. The approach rests on a growing belief that the tech industry’s power is rooted in its vast stores of user data. These initiatives would upset that system by declaring that your data is yours, and that companies should pay you to use it, whether it’s your genome or your Facebook ad clicks.

In practice, though, the idea of owning your data quickly starts looking a little … fuzzy. Unlike physical assets like your car or house, your data is shared willy-nilly around the web, merged with other sources and, increasingly, fed through a Russian doll of machine learning models. As the data transmutes form and changes hands, its value becomes anybody’s guess. Plus, the current way data is handled is bound to create conflicting incentives. The priorities I have for valuing my data (say, personal privacy) conflict directly with Facebook’s (fueling ad algorithms).

Song thinks that for data ownership to work, the whole system needs a rethink. Data needs to be controlled by users, but still usable to others. “We can help users to maintain control of their data and at the same time to enable data to be utilized in a privacy preserving way for machine learning models,” she says. Health research, Song says, is a good way to start testing those ideas, in part because people are already often paid to participate in clinical studies.

This month, Song and Chang are starting a trial of the system, which they call Kara, at Stanford. Kara uses a technique known as differential privacy, where the ingredients for training an AI system come together with limited visibility to all parties involved. Patients upload pictures of their medical data—say, an eye scan—and medical researchers like Chang submit the AI systems they need data to train. That’s all stored on Oasis’s blockchain-based platform, which encrypts and anonymizes the data. Because all the computations happen within that black box, the researchers never see the data they’re using. The technique also draws on Song’s prior research to help ensure that the software can’t be reverse-engineered after the fact to extract the data used to train it.
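
For readers who want a feel for the math, here is a minimal sketch of one textbook form of differential privacy, the Laplace mechanism, applied to a simple count. It is an illustration only, not Kara's or Oasis's code: the function names, the toy `scans` records, and the `epsilon` setting are all assumptions made up for this example.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centered on zero.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so noise with scale 1/epsilon masks any single contribution.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical records: one entry per uploaded eye scan.
scans = [
    {"patient": "a", "glaucoma": True},
    {"patient": "b", "glaucoma": False},
    {"patient": "c", "glaucoma": True},
]

rng = random.Random(0)
noisy = private_count(scans, lambda r: r["glaucoma"], epsilon=1.0, rng=rng)
print(noisy)  # a noisy answer near the true count of 2
```

A smaller `epsilon` means more noise and stronger privacy. Averaged over many queries the answers stay close to the truth, but no single answer reveals whether any one patient's scan was in the set.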

Chang thinks that privacy-conscious design could help deal with medicine’s data silos, which prevent data from being shared across institutions. Patients and their doctors might be more willing to upload their data knowing it won’t be visible to anyone else. It would also prevent researchers from selling your data to a pharmaceutical company.

Sounds nice in theory, but how do you incentivize people to actually snap pictures of their health records? When it comes to training machine learning systems, not all data is equal. That presents a challenge when it comes to paying people for it. To value the data, Song’s system uses an idea developed by Lloyd Shapley, the Nobel Prize-winning economist, in 1953. Imagine a dataset as a team of players who need to cooperate to arrive at a particular goal. What did each player contribute? It’s not just a matter of picking the MVP, explains James Zou, a professor of biomedical data science at Stanford who isn’t involved in the project. Other data points might act more like team players. Their contribution to overall success may be conditioned on who else is playing.

In a medical study that uses machine learning, there are lots of reasons why your data might be worth more or less than mine, says Zou. Sometimes it’s the quality of the data—a poor quality eye scan might do a disease-detection algorithm more harm than good. Or perhaps your scan displays signs of a rare disease that’s relevant to a study. Other factors are more nebulous. If you want your algorithm to work well on a general population, for example, you’ll want an equally diverse mix of people in your research. So, the Shapley value for someone from a group often left out of clinical studies—say, women of color—might be relatively high in some cases. White men, who are often overrepresented in datasets, could be valued less.
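
The team-of-players idea can be made concrete with a toy calculation. The sketch below computes exact Shapley values by averaging each scan's marginal contribution over every possible order in which the scans could join the dataset. The four scans and the scoring function are invented for illustration (they are not drawn from Song's system), and the brute-force approach only works for a handful of data points.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    # Average each player's marginal contribution over all join orders.
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

def toy_accuracy_points(scans):
    # Hypothetical scoring: accuracy points a model gains from a set of scans.
    points = 0
    if "common_a" in scans or "common_b" in scans:
        points += 10  # two near-identical scans; the second adds nothing
    if "rare" in scans:
        points += 20  # the only example of a rare disease
    if "blurry" in scans:
        points -= 2   # a poor-quality scan that hurts the model
    return points

scans = ["common_a", "common_b", "rare", "blurry"]
values = shapley_values(scans, toy_accuracy_points)
print(values)
# {'common_a': 5.0, 'common_b': 5.0, 'rare': 20.0, 'blurry': -2.0}
```

The two redundant scans split the credit for what either could have supplied alone, the rare case keeps its full contribution, and the blurry scan comes out negative. The four values sum to 28, the score of the full dataset.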

Put it that way and things start to sound a little ethically hairy. It’s not uncommon for people to be paid differently in clinical research, says Govind Persad, a bioethicist at the University of Denver, especially if a study depends on bringing in hard-to-recruit subjects. But he cautions that the incentives need to be designed carefully. Patients will need to have a sense of what they’ll be paid so they don’t get low-balled, and receive solid justifications, grounded in valid research aims, for how their data was valued.

What’s more challenging, Persad notes, is getting the data market to function as intended. That’s been a problem for all sorts of blockchain companies promising user-controlled marketplaces—everything from selling your DNA sequence to “decentralized” forms of eBay. Medical researchers will have concerns about the quality of data and whether the right kinds are available. They’ll also have to navigate restrictions a user might put on how their data can be used. On the other side, patients will need to trust that Oasis’s technology and promised privacy guarantees work as advertised.

The clinical study, Song says, aims to start resolving some of those questions, with Chang’s patients testing the application first. As the marketplace expands, researchers might make calls for specific kinds of data, and Song envisions partnering with doctors or hospitals so that patients aren’t totally alone in figuring out what types of data to upload. Her team is also looking into ways of estimating the value of particular data before the AI systems are trained, so that users know roughly how much they’ll make by giving researchers access.

Wider adoption of the data ownership idea is a ways off, Song admits. Currently, companies mostly get to choose how they store user data, and their business models mostly depend on holding it directly. Companies including Apple have embraced differential privacy as a way to privately gather data from your iPhone and enable features like Smart Replies without revealing individual personal data. But Facebook’s core ad business, of course, doesn’t work like that. Before any smart math tricks for valuing data are useful, regulators need to sort out rules for how data is stored and shared, says Zou. “There is a gap between the policy community and the technical community on what exactly it means to value data,” he says. “We’re trying to inject more rigor into these policy decisions.”