Data Labeling is China’s Secret Weapon in the Connected Car Battle

It’s a literal arms race.

“All you’re seeing now — all these feats of AI like self-driving cars, interpreting medical images, beating the world champion at Go and so on — these are very narrow intelligences, and they’re really trained for a particular purpose. They’re situations where we can collect a lot of data.”

That’s according to Facebook’s head of AI research, Yann LeCun.

These words highlight the fact that behind the recent efflorescence of shiny AI products lies a much more banal, human reality.

The world’s tech giants often rely on large groups of people to label the data that will be used to train their machine learning algorithms.

‘Labeling data’ just means taking a set of unlabeled data (a phone transcript or a street image, for example) and adding informative, descriptive tags to individual elements, like a word or a car.

To help train a natural language processing system, data labelers might add tags to show what a certain word means in different contexts, for instance.
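To make that concrete, here is a minimal sketch in Python of what a labeled sentence might look like once a human annotator has finished with it. The `Token` class, the `label_sentence` helper, and the tag names (`BANK_RIVERSIDE`, `BANK_FINANCE`, `O` for "no tag") are all hypothetical, invented purely for illustration; real annotation tools use their own schemas.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    tag: str  # hypothetical label chosen by a human annotator


def label_sentence(words, tags):
    """Pair each word with the tag an annotator assigned to it."""
    if len(words) != len(tags):
        raise ValueError("every word needs exactly one tag")
    return [Token(word, tag) for word, tag in zip(words, tags)]


# The same word, "bank", gets a different tag depending on its context.
river = label_sentence(["the", "river", "bank"], ["O", "O", "BANK_RIVERSIDE"])
money = label_sentence(["the", "bank", "loan"], ["O", "BANK_FINANCE", "O"])

for sentence in (river, money):
    print([(token.text, token.tag) for token in sentence])
```

Multiply this by millions of sentences and the scale of the manual work becomes clear.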

Although such estimates will always be generic and reductive, data science types have helpfully broken down the allocation of time in machine learning projects as follows:

If a machine learning algorithm is fed a high volume of accurately labeled training data, it can be used in the “real world” for tasks such as computer vision in driverless cars.
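As a rough illustration of how labels turn into predictions, the toy sketch below "trains" on a handful of human-labeled examples and then classifies a new one. It is an assumption-laden stand-in, not how a driverless car works: short text descriptions take the place of images, the labels `stop_sign` and `speed_limit` are made up, and simple word counting takes the place of a neural network.

```python
from collections import Counter, defaultdict


def train(labeled_examples):
    """Count how often each word appears under each human-assigned label."""
    counts = defaultdict(Counter)
    for text, label in labeled_examples:
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Pick the label whose training words overlap most with the input."""
    words = text.lower().split()
    scores = {
        label: sum(word_counts[word] for word in words)
        for label, word_counts in counts.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training set: (description, label) pairs produced by annotators.
training_data = [
    ("red octagon sign", "stop_sign"),
    ("red sign with white letters", "stop_sign"),
    ("round sign with number", "speed_limit"),
    ("white round sign black number", "speed_limit"),
]

model = train(training_data)
print(predict(model, "red octagon"))
print(predict(model, "round number"))
```

Everything the model "knows" comes from the labels. Swap a few of them around and its predictions go wrong with them, which is why accuracy matters as much as volume.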

Data labeling takes a lot of time, and it’s an important task, especially when the data will be used to train an autonomous car. The acceptable margin of error is minuscule, as I’m sure we’ll all agree.

For all its advancements, AI is still very much as artificial as its moniker suggests.

Machine learning algorithms do not learn in the same ways as people. Some scientists are trying to teach AI to learn like a child (this Science Mag article is a good primer), but these efforts are still in their, well, infant stages.

We receive occasional, stark reminders of this fact through stunts like the minor street sign modifications that completely flummoxed computer vision systems, as below:

It’s clear what we see when we look at this vandalized STOP sign, but the neural network classifier thought it was a ‘Speed Limit 100’ sign in almost 100% of tests.

So, as it stands, machine learning needs good data and the most reliable way to procure this is to pay people to sit and annotate images all day.

The more people you have, the more data you can label.

It’s an arms race, but not as we know it.

In China, they speak of the Qiandian Houchang economic model — literally, “front shop, back factory”. Often, this is used to help with the division of labor, capital, and resources within a supply chain.

China has devalued its currency in the past (notably, in the 1980s and 1990s) to make its exports cheaper for foreign buyers and spur investment in factories producing mass consumer goods. This also made imports more expensive for Chinese companies, incentivizing them to purchase equipment locally.

Basically, China is the factory in the back and the West is the shop front in this scenario.

Now, China’s ambitions have grown since (and also because of) this period, to the point where Chinese companies want to ‘jump up’ the value chain and own the customer relationship as well as product creation.

Why mention this? Well, the Qiandian Houchang model still holds sway. The difference in modern China is that the robots are at the front and people are in the back.

People are putting in the hard labor so that Chinese cars, digital assistants, and in-store robots can flourish. In an ideal world, China would then export a superior product to the rest of the world.

It is a little reminiscent of the ‘mechanical Turk’ in the 18th century, that chess-playing automaton that wowed the punters and turned out to be a puppet controlled by a tiny man in a hidden compartment below.

I mention this particular example for a reason, believe it or not. Amazon made the quite telling decision to name its crowd-sourced work platform ‘Amazon Mechanical Turk’ in a droll reference to the charming chancer of yore.

In the 17th century, ‘computers’ were people who could perform arithmetical calculations. In the mid-20th century, computers were still people (mostly women) who handled the number-crunching within companies. It was only later that computers became programmed and digital, and we are still training them today.

We are all part of this same dynamic every day. We use those CAPTCHA forms to ‘prove’ that we are human, and the data is used to make the machines smarter.

One would be surprised at just how manual a lot of AI training still is today, although we do get occasional glimpses behind the curtain.

Last year, Apple, Google, Amazon, and Facebook all had to apologize after they were caught exporting user data and sharing it with third parties.

These tech giants give information, like user conversations with digital assistants, to data annotation companies to improve the accuracy of their AI systems.

At the time of publication, no one has found an accurate, cost-effective way to replace human labelers.

And so, back to China.

Rural areas of the country, such as Guizhou, are now home to cavernous data annotation factories.

For locals, it is a tempting profession; the average salary of 3,000 yuan ($425) per month is three times the average salary in the area. Guizhou’s economic output grew 10.2% last year, making it the fastest-growing province in the country.

Of course, this ‘arms race’ between tech companies isn’t simply about having more people to perform the labeling part of the process.

It is a good start, nonetheless.

As the owner of one data labeling company in Guizhou said in an interview with The New York Times:

“We’re the construction workers in the digital world. Our job is to lay one brick after another. But we play an important role in A.I. Without us, they can’t build the skyscrapers.”

Well-known products like Taobao’s visual search (discussed in this very newsletter recently) are trained on data labeled in Alibaba’s warehouses in these rural areas.

For its part, Tencent is working on this giant bunker to store, process, and analyze user data from its ever-popular WeChat app:

China has often availed itself of a bigger workforce than other countries, of course.

It has also lagged behind the US in a number of key technological areas and is locked in an ongoing battle with Mr Trump’s administration.

The US outsources this manual labor for a number of reasons. It’s expensive to set up these facilities, train a workforce, and then pay them the pesky minimum wage, for one thing. It is much easier to send the work somewhere cheaper, especially if the finished product (lots of labeled data) looks the same either way.

China may be able to turn its erstwhile weaknesses into strengths. Rural areas like Guizhou remain under-developed; data labeling companies bring much-needed jobs and relatively healthy salaries. Those salaries still pale in comparison to the ones offered in major cities like Beijing, which keeps labor costs down for the tech companies.

China has skipped some generations of technological development altogether, giving it a head-start on the next big things. Contactless credit cards never really took off there, and consumers moved straight on to smartphone payments. In the West, the incentive to move from contactless cards to smartphone payments is much weaker.

The same applies to autonomous cars; Chinese companies have shifted their focus to building driverless machines after failing to make a serious dent in the global, human-piloted car market.

This latest stage in China’s development only really poses a threat to the American tech giants if Chinese scientists learn to develop more sophisticated microprocessors in the process.

As The Economist reported this week, China is still playing catch-up in a crucial industry that will be worth $575bn by 2022.

While data labeling may seem like a lugubrious, monotonous task with only one useful purpose, it has a role to play at this macro level too.

By taking ownership of the machine learning supply chain from beginning to end, Chinese AI scientists remain close to the inner workings of these intricate, at times opaque, systems.

The sheer weight of numbers in the Chinese workforce will play a vital role in developing the precious commodity of Intellectual Property.