Deep Sandwiches

Picture a room with grid lines on the walls, and look at the corner. If you’ve ever tried to learn some AI, you’ll have heard that deep neural networks are “stacks of layers built from linear transformations followed by rectified nonlinearities.” All that jargon hides a simple idea. Our room is just the “graph” of one layer. Each layer takes a grid of squares and bends it to look like the room. As intimidating as the mathematics of modern AI can seem from the outside, behind the scenes we’re mostly just taking the space where your data lives and then doing the room thing over and over.

Imagine that by some miracle of the fundamental laws of food, someone discovered that we can make anything we want just by building sandwiches with two ingredients: bread and ham. The phenomenon does not show up if we just stack bread alone. A big stack of bread is just, well, bread. But suppose we find that with the right cooking procedure, a sufficiently large sequence of bread + ham + bread + ham + etc. can become chicken soup, or lasagna or caviar — or foods that no human chef ever thought to create.

If this example seems implausible, good! Most AI researchers did not see it coming, either. But starting around 2012 (or 2009 (or the 1980s (or the 1960s (depending on when you start counting)))), a certain field focused on the mathematics of “how to learn stuff from data” had its own “impossible sandwich” moment. And since then, the “grocery stores” of that field, with all their varied ingredients for different specific purposes, have started to empty and change form under the steady march of the “universal ham.” In a very real sense, that is what all the recent hype surrounding deep learning is all about.

Artificial neural networks are our sandwiches. Deep neural networks are just sandwiches with lots of layers. What counts as “deep”? That doesn’t really matter. To understand the reason for all the recent hype around deep learning, what we really need to understand are (1) the “ingredients,” and (2) the “cooking” procedure.

The “bread” layers are what mathematicians call linear transformations. Imagine space as a stretchy sheet of graph paper with grid lines on it. Any way of messing with space, where things that are lines before keep being lines after, is an example of the bread. So, rotations are bread, because any line keeps being a line when we rotate space. Stretching space by making everything twice as big (or half as big, etc.) is another kind of bread. We could also grab two sides of the space and move our left hand down and our right hand up, so that vertical lines stay vertical, and horizontal lines get tilted and the little grid squares turn into parallelograms. That is called shearing, and it is a kind of bread, too. That is pretty much all the bread there is. And if we stack a bunch of things like that one after another, then since each piece of bread keeps lines as lines, the whole stack will keep lines as lines, too.
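
To make the bread concrete, here is a tiny sketch in Python with NumPy: each kind of bread is just a small matrix, and applying it to points on the grid is just a matrix multiplication (the particular angle and scale factors below are arbitrary choices for illustration).

```python
import numpy as np

# A few points on the grid: each column is an (x, y) corner of the unit square.
points = np.array([[0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

theta = np.pi / 4
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])  # rotation: lines stay lines

scale = np.array([[2.0, 0.0],
                  [0.0, 2.0]])                        # scaling: everything twice as big

shear = np.array([[1.0, 0.5],
                  [0.0, 1.0]])                        # shearing: squares become parallelograms

# Applying a piece of bread to space is just multiplying by its matrix.
print(rotate @ points)
print(scale @ points)
print(shear @ points)
```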

So, a stack of bread is not a sandwich. Stacking linear transformations only ever produces another linear transformation, so we do not get anywhere just by stacking bread by itself [1–3].
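
In the same sketch notation, here is the “bread on bread is still bread” point: applying one matrix after another gives exactly the same result as applying the single matrix you get by multiplying them together.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 2.0]])   # scale
B = np.array([[1.0, 0.5], [0.0, 1.0]])   # shear

x = np.array([1.0, 1.0])                 # some point in space

# Applying B and then A, one after the other...
one_after_another = A @ (B @ x)

# ...is exactly the same as applying the single matrix A @ B.
collapsed = (A @ B) @ x

print(np.allclose(one_after_another, collapsed))  # True: bread on bread is still bread
```

So, what is the “ham”?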

It turns out the ham can be almost anything. The important part is that it is some step that bends the grid lines and makes them not be lines anymore. So, a randomly chosen sandwich will rotate/scale/shear space, and then bend some grid lines, and then rotate/scale/shear again, and then bend some more grid lines. If we make the sandwich “deep” enough, we can imagine the original space might get bent around and folded over itself quite a bit [4–5].
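
To see what a whole sandwich looks like, here is a sketch in the same Python style, using ReLU (the “rectified nonlinearity” from the jargon earlier) as the ham. The bread matrices are random, which is to say the sandwich is still uncooked.

```python
import numpy as np

def ham(x):
    # ReLU: leave positive values alone, flatten negative values to zero.
    # This is the step that bends grid lines so they stop being lines.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)

# Three slices of bread: random linear transformations of 2D space.
breads = [rng.standard_normal((2, 2)) for _ in range(3)]

def sandwich(x):
    # bread, ham, bread, ham, bread: a tiny "deep" network with untrained weights.
    for W in breads[:-1]:
        x = ham(W @ x)
    return breads[-1] @ x

print(sandwich(np.array([1.0, -2.0])))
```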

What is not yet clear is how on earth we are supposed to accomplish anything with all this.

Deep Neural Nets in the Real World

The issue of real-world usage all comes down to the cooking procedure.

In practice, we do not actually know how to write good neural networks. We know how to train them. To train a neural network means to show it examples of the problem we want it to solve, and then tell it how to change itself a little bit to be more like a machine that would have given the right answer. We do that over and over until the network “learns” to solve the problem.
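
Here is a sketch of what “change itself a little bit” means in the simplest possible setting: a one-layer “network” with two knobs, nudged by gradient descent toward examples of the rule y = 3*x - 1 (the rule, the noise level, and the learning rate are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Examples of the problem we want solved: inputs x and the right answers y.
x = rng.standard_normal(100)
y = 3.0 * x - 1.0 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0        # the network's adjustable knobs, starting "uncooked"
learning_rate = 0.1

for step in range(200):
    prediction = w * x + b
    error = prediction - y
    # Tell the network how to change itself a little bit: nudge each knob
    # in the direction that shrinks the average squared error.
    w -= learning_rate * np.mean(2.0 * error * x)
    b -= learning_rate * np.mean(2.0 * error)

print(w, b)  # ends up close to 3 and -1: the network has "learned" the rule
```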

What this gives us is a new way of interacting with computers — a method that has often been referred to as “software 2.0” [6]. With software 2.0, it is as if you have no idea how to cook lasagna, but you have some old lasagnas sitting around, and you have a magic oven — one that lets you just put a big (i.e., “deep”) ham sandwich next to a bunch of lasagnas, then wait awhile, and voilà! It is lasagna. Did you just “cook” lasagna? Sort of — it is not clear. What is clear is that you now have one more lasagna, without having to know how to cook it.

What this means for computing at large is that we can now solve problems that we do not know how to solve. We can write programs we do not know how to write. To put it mildly, this changes everything.

The real-world use cases (and limitations) of this method are more or less what you would expect from the cooking analogy. If we did not have those example lasagnas to show to our universal ham, we could not have made a new one. What this means in practice is that the most directly applicable use cases of deep learning are those for which we have lots and lots of examples. That is, big datasets, cleaned and separated along the boundaries of whatever problem we want to solve. If we want to train a deep neural network to detect cats in pictures, we do not need to know how to write the code to implement a vision system that detects cats, line by line, rule by rule. What we do need are lots and lots of pictures of cats. Plus a big uncooked network, and some time.

Market Adoption of Deep Neural Nets

Deep learning has already been adopted by the market in every industry you can name, and its adoption shows no signs of slowing. There are deep networks in your phone for transcribing speech into text and for listening for you to say “Okay Google” or “Hey Siri.” There are networks for detecting human faces in the camera’s visual field and setting its autofocus based on their positions. They have learned how to read. They are behind systems like Google Translate, which once contained massive tables of phrases and their likely translations for each pair of languages. They have beaten not only the world champions at chess and Go, but, more impressively (and less often emphasized), they have beaten 30+ years of the best AI researchers’ attempts to devise hand-coded “intelligent” algorithms for playing those games using domain-specific knowledge. The latest versions of AlphaZero achieve superhuman performance not from built-in knowledge of the individual games, but just by letting the algorithm play against itself over and over and over until it surpasses both the most skilled humans and the best algorithms we humans have conceived.

Even in scientific research, previously insurmountable barriers have started to fall. One of the most fundamental problems in biochemistry is protein folding: predicting a protein’s three-dimensional shape from its sequence of amino acids. The question matters enormously for drug discovery, since a protein’s function in the body is determined by its three-dimensional shape. In December of 2018, Google’s DeepMind released a model called AlphaFold [7–8]. They trained a neural network to predict the pairwise distances between a protein’s residues from its sequence, then used those predicted distances to reconstruct the structure, and in doing so managed to far outstrip existing methods, leading one scientist to comment that, on first hearing the results, he felt like he and the other academics in his field had been made obsolete [9–10].

While it may be a long time before machines surpass human intelligence at every activity, one thing is certain: deep learning is here to stay. And no area of human life is likely to escape unchanged.

All Rights Reserved for Jason Wilkes
