What Siri Could Learn From Thanos

Relying on one voice assistant to rule them all might be expecting too much

To borrow from J.R.R. Tolkien, voice assistants like Siri, Alexa, Cortana, and Google Assistant all aspire to be a little bit like Sauron from The Lord of the Rings. Distill their differing methodologies and they all come down to this: One ring (or, if you will, one A.I.) to rule them all, and in the darkness bind them.

That is to say, every voice assistant out there is competing to be the only voice assistant through which you route all of your requests. So if you want to use your Amazon Echo to turn off your Philips Hue smart lights, you say something like, “Alexa, turn off my living room lights,” and behind the scenes, Alexa translates the request and routes the command to Philips in a format it can understand.

As of iOS 12, Siri also works this way. Through the power of Siri Shortcuts, you can set up a macro so that, for example, when you say, “Hey Siri, play How Did This Get Made,” the most recent episode of the How Did This Get Made podcast automatically begins playing in your podcast app of choice. And even without setting up a shortcut, you can use Siri almost like a voice-only command line, to say things like, “Hey Siri, using Drafts, create a draft in Inbox using clipboard,” and it’ll save your clipboard contents into a new document in Drafts.

On the surface of things, this seems like a sensible way for us to interact with our devices by voice. If we want to control our phone using our voice, it only makes sense that we give it a human name like Siri, talk to it like a person, and ask it to do what we want, right?

I’d argue that the last three decades of user interface design have proven to us that this isn’t natural. And the reason voice assistants have been relatively slow to catch on to anything but the simplest tasks is that we’re not yet thinking about conversational user interfaces (UIs) in the right way.

To borrow yet another geeky analogy: Instead of thinking like Sauron, we need to start thinking like Thanos. Instead of using one ring to rule them all, we need to treat our voice assistants like Infinity Gems and wield as many at once as possible.

What is a voice assistant? Silicon Valley likes us to think of them as sexy A.I. that live in our phones, but if you strip all that away, they are modern-day iterations of the oldest and simplest user interfaces in computing. Married to speech transcription engines, they’re 21st-century command lines: text-only input fields like the ones used in Unix and disk operating system (DOS) that allow us to give instructions to our computers. When they first started popping up in the mid-1960s, command lines were one of the first ways users could give instructions to their computers without directly coding them. Command lines were revolutionary: instead of forcing us to talk to our computers in ones and zeros, we could suddenly talk to them using syntax that was similar to natural language.

Similar, but not identical. Type “delete word.doc” into an old DOS prompt and it wouldn’t do anything; type “del word.doc” and it would know what you wanted. And the more complicated the action you wanted to accomplish, the more you needed to be fluent in the exact syntax the command line was looking for, or else it would keel over with an error. This should sound familiar to anyone who has ever asked Siri to, say, create an appointment in their calendar for the third Tuesday in November to visit their doctor at a specific address between 9:15 and 9:35… or really, any other command that isn’t transparently simple.

In other words, command lines were sort of a worst-of-all-worlds approach to how we talk to computers, the Esperanto of the ASCII computing age. The commands you typed in sort of looked like English, but they weren’t. Instead, you could only fluently navigate a command line through lots of memorization and trial and error.

That’s a good description of the user experience of our voice assistants, too. Few of us are actually “fluent” at getting the likes of Siri or Alexa to do anything but the simplest tasks. Instead, we blunder through our interactions with these invisible A.I. until we finally figure out what syntax they expect from us, or we give up. (And if you don’t agree it’s that bad, ask yourself: What would you rather bet your life on? A coin toss, or Siri’s ability to play a specific album on Apple Music on the first command?)

This is, in essence, the uncanny valley of conversational UIs. They sound like humans, they talk like humans, and they have human names, but we can’t understand each other like humans. This is why command lines eventually fell out of favor and were replaced by operating systems built on graphical user interfaces (GUIs), like Windows and macOS. On these operating systems, discrete apps with their own custom-designed interfaces could all run side by side, each handling a different task: word processing, spreadsheets, movie watching, and so on. Having realized that our computers were bad at understanding us in ordinary language, it was almost like we designed GUIs to be a communication board. By just pointing or clicking at a mutually understood symbol, we and our computers could be in agreement about what we were trying to accomplish.

Perhaps it’s time for the designers of conversational UIs to learn something from the decline of the command line before we abandon our voice assistants (or relegate them to relative obscurity) as an idea before its time. But to get there, Silicon Valley companies are going to have to get over their obsession with creating the one voice assistant to rule them all, and instead, embrace a more diverse, app-like approach.

The problem with expecting the likes of Alexa or Cortana to understand us as a human would is that even humans aren’t great at following each other’s instructions. This is why we don’t ask our mailman to give us a mortgage and fix our toilets and lay down mousetraps and diagnose our infirmities. We call a banker, a plumber, an exterminator, and a doctor for those things because we understand that it is unreasonable to expect any one person to be an expert in every subject matter.

So why do we expect more from our voice assistants? Siri and Alexa shouldn’t be jack-of-all-trades A.I. They should be invisible operators, routing our calls to voice assistants specifically designed to engage with us on the tasks we’re trying to accomplish. So when we need to send a rent payment, we don’t expect Siri to know how to do it; we say, “Hey Venmo, send $2,000 to my landlord.” Doesn’t it make more sense that when we want to find a phone number in our email, we talk to Gmail directly, rather than trying to route the command through Alexa?
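
For the technically curious, the “invisible operator” idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the handler functions, the `SPECIALISTS` table, and the “Hey <name>, <request>” pattern are all hypothetical, and nothing here reflects how Siri, Alexa, Venmo, or Gmail actually work.

```python
from typing import Callable, Dict

# Hypothetical specialist assistants. These are stand-ins for illustration,
# not real Venmo or Gmail APIs.
def venmo_assistant(request: str) -> str:
    return f"[Venmo] handling: {request}"

def gmail_assistant(request: str) -> str:
    return f"[Gmail] handling: {request}"

# Fallback generalist, playing the role of a Siri- or Alexa-style assistant.
def general_assistant(request: str) -> str:
    return f"[General] handling: {request}"

# Assumed registry mapping a spoken name to the specialist that owns it.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "venmo": venmo_assistant,
    "gmail": gmail_assistant,
}

def route(utterance: str) -> str:
    """Send 'Hey <name>, <request>' to the named specialist, else to the generalist."""
    text = utterance.strip()
    if text.lower().startswith("hey "):
        # Split on the first comma: the name comes before it, the request after.
        name, sep, request = text[4:].partition(",")
        handler = SPECIALISTS.get(name.strip().lower())
        if handler is not None and sep:
            return handler(request.strip())
    # Unknown name or no wake phrase: the generalist gets the whole utterance.
    return general_assistant(text)

print(route("Hey Venmo, send $2,000 to my landlord"))
print(route("Hey Siri, what time is it?"))
```

The operator itself understands almost nothing. It only needs to recognize who is being addressed and hand the rest of the sentence to a specialist that expects that kind of request.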

It feels like a lot of the frustrating false returns of voice assistants would be solved if we stopped expecting one voice assistant to understand 100% of our requests. In real life, simply knowing who you are talking to makes it easier for both of you to understand each other and adjust how you communicate. Few people would talk to their accountant the same way they talk to their five-year-old, which is exactly why we’re able to make ourselves understood to both. Why should voice assistants be different?

They shouldn’t, which is why Apple and Amazon and Microsoft and Google should stop trying to lock people into their bots. There will always be a place for a jack-of-all-trades voice assistant like Siri or Cortana, but there’s far more opportunity in opening up their conversational platforms to smaller chatbots that can dedicate themselves to specific tasks, just like the Mac and the first IBM PCs opened themselves up from the command line to the world of apps. As Apple knows firsthand with its iOS App Store, there is plenty of profit and glory to be had as a gatekeeper of a computing revolution, which is what voice assistants could still be… if only Silicon Valley and its end users stopped holding our A.I. to a higher communication standard than we do our fellow humans.