Gradient-Based Methods and Back-Propagation
Machine learning starts to sound suspiciously like biology when we train neural networks or learn salient data properties. But how exactly does this type of learning work?
Many real-world problems have no practical exact solution, because computing one would take too long. That’s why an approximate solution often works just fine. Take mail sorting: we accept the fact that an automatic sorting system reading ZIP codes has an error rate of 1 in 10,000 letters.
Many machine learning problems can be framed as optimization problems. There’s input data – such as images of the address field of a letter – and desired output data – in this case the hopefully correct postal code – plus a function that transforms one into the other. This function can be a simple linear model or a more complex one like a neural network. In the final analysis, though, the modeling method is irrelevant. We can think of it as a black box that ingests input on the left and outputs results on the right. The crucial fact is that the method can be configured with parameters. Imagine a mixing deck with lots of knobs. What we’re looking for is the knob setting that produces a result with the fewest possible mistakes. The question is: how do we find these optimal settings?
Looking for parameters
One simple strategy would be to try out all possible parameter combinations and write down how many wrong ZIP codes our black box assigns with each respective set of parameters. The best setting of our array of control knobs would be the one with the fewest errors.
But there’s a catch: the more knobs a method has, the more configurations we’d have to test. Take an image with 1,000 x 1,000 pixels. A simple linear classification method or perceptron would have about one million parameters, each of which could take arbitrary values. If we allowed just ten values per parameter, we’re already looking at 10^1,000,000 combinations. Quick reality check: experts estimate the total number of atoms in the universe to be about 10^85. Even if evaluating a function were lightning fast and we were able to drastically reduce the number of parameters, analyzing every position in this parameter space would take too long. The good news is that there’s a much smarter way than the “brute force” approach.
The gradient-based method is a very popular way to optimize a function.
First, let’s look at the mathematical basis for this approach. Function graphs help us visualize the calculations. For starters, imagine a very simple black box:
Generally speaking, the black box could have a very complex function, but for this example, we’re sticking to a simple one. Data and parameter (x and w) are fed into the black box and processed via multiplication. Let’s further assume that we want to minimize the output dependent on parameter w.
Using the descending gradient-based method, we feel our way toward the function’s minimum step by step. (A variant of the algorithm ascends in the opposite direction to find the maximum.) The name implies which factor we use for each step: the gradient. In this simple example the gradient is the derivative do/dw, which tells us how output o changes when we change, or turn the knob for, w. Since do/dw = x, we can calculate for a given x (e.g. x = 1) that the derivative is always 1, no matter which parameter value we choose. This means in turn that the output of our black box increases by 1 if you increase w by 1 – a calculation you can easily replicate on a piece of paper.
This seemingly trivial example has a lot of potential. We can use the gradient do/dw for parameter setting w to calculate in which direction we have to move to minimize the output. In this example, the gradient do/dw was positive, meaning we have to turn the knob for w into the other – negative – direction to minimize the output. We are, in other words, moving down the gradient.
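This idea can be sketched in a few lines of code. The starting value for w and the step size are assumed values for illustration, not anything prescribed by the method:

```python
# A minimal sketch of one descending gradient step for the toy box o = x * w.
# Starting value of w and the step size (learning rate) are assumptions.
def box(x, w):
    return x * w

x = 1.0      # fixed input
w = 3.0      # initial knob setting
step = 0.5   # step size (learning rate)

grad = x             # do/dw = x for this box
w = w - step * grad  # move against the gradient, i.e. down the slope

print(box(x, w))  # output drops from 3.0 to 2.5
```

Repeating the update line over and over is exactly the "step by step" descent the text describes.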
The artificial neurons that make up a neural network are similar in design to our box:
Input x is now multiplied by parameter w (the so-called edge weight) and then fed into the neuron, where the activation function f generates output o. In order to calculate the derivative do/dw we have to apply the well-known chain rule (“outer derivative times inner derivative”), i.e. do/dw = f’*x. Taking the popular sigmoid activation, the outer derivative would be f’(x*w) = s(x*w)(1-s(x*w)), with s being the sigmoid.
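The chain rule for a single sigmoid neuron is easy to verify numerically. This sketch compares the analytic derivative with a finite-difference estimate; the chosen values of x and w are arbitrary:

```python
import math

# Chain rule for a single sigmoid neuron o = s(x * w):
#   do/dw = s(x*w) * (1 - s(x*w)) * x
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w):
    return sigmoid(x * w)

def grad_w(x, w):
    s = sigmoid(x * w)
    return s * (1.0 - s) * x   # outer derivative f' times inner derivative x

x, w, eps = 1.0, 0.5, 1e-6
analytic = grad_w(x, w)
# central finite difference as an independent check
numeric = (neuron(x, w + eps) - neuron(x, w - eps)) / (2 * eps)
```

Both values agree to many decimal places, which is a handy sanity check whenever you derive gradients by hand.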
Neural networks, though, are usually made up of more than one neuron. What does that mean for the derivative? Let’s look at a simple example:
Now our entire black box consists of two daisy-chained activation functions and total output o2 depends on two parameters w1 and w2. The derivative of output o2 also has to be calculated for those two dimensions, meaning the gradient becomes a vector value with G = (do2/dw1, do2/dw2).
First, let’s look at deriving the output with respect to w2. Regardless of how many neurons are chained together, the derivative is the product of the outer and inner derivative of g: do2/dw2 = g’*o1. The derivative of the black box with respect to w1 is a little more complex, but easily solved with the chain rule. We’re looking for the derivative of g(f(x*w1)*w2). Let’s move from the outside in:
do2/dw1 = g’*q’, where q’ stands for the derivative of the inner expression f(x*w1)*w2. This derivative has to be calculated via the chain rule again: q’ = w2*f’*x. Now we’ve found the desired derivative of the net function with respect to w1: do2/dw1 = g’*w2*f’*x.
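The derivation above can be checked in code. Assuming sigmoid for both f and g (any differentiable activation works the same way), this sketch computes the gradient vector G = (do2/dw1, do2/dw2) and verifies it against finite differences:

```python
import math

# Two daisy-chained neurons: o2 = g(f(x*w1) * w2), with f = g = sigmoid (assumed).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def net(x, w1, w2):
    o1 = sigmoid(x * w1)       # first neuron
    return sigmoid(o1 * w2)    # second neuron

# Gradient vector G = (do2/dw1, do2/dw2) from the chain rule:
#   do2/dw2 = g' * o1
#   do2/dw1 = g' * w2 * f' * x
def gradient(x, w1, w2):
    o1 = sigmoid(x * w1)
    fp = dsigmoid(x * w1)      # inner derivative f'
    gp = dsigmoid(o1 * w2)     # outer derivative g'
    return (gp * w2 * fp * x, gp * o1)

x, w1, w2, eps = 1.0, 0.3, -0.7, 1e-6
g1, g2 = gradient(x, w1, w2)
num1 = (net(x, w1 + eps, w2) - net(x, w1 - eps, w2)) / (2 * eps)
num2 = (net(x, w1, w2 + eps) - net(x, w1, w2 - eps)) / (2 * eps)
```

The analytic and numeric gradients match, confirming both applications of the chain rule.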
More complex functions require more analytical work to find the gradient, so a scrap of paper won’t be enough to write down the expression. Luckily, there’s a simple method to visualize the chain rule and build the derivative by traversing the function graph. The algorithm is called “back-propagation” and works like this: you calculate the net output for datum x in a “forward pass” and temporarily store the derivatives of the activation functions along the way. Next you traverse the graph from back to front in a “backward pass”.
You enter a “1” on the right side of the graph and then collect every edge weight and every stored neuronal derivative by way of multiplication. Once you reach a weight whose impact on the net output you want to calculate, you multiply the gathered product by the input in front of the respective edge weight (x for w1 or o1 for w2). The result is the partial derivative, which (scaled by a step size) is finally subtracted from the respective edge weight. This method works even when there are several entry points into a neuron; the collected derivative up to that particular neuron only has to be multiplied by the respective input xi.
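For the two-neuron chain, the traversal can be sketched directly. This is a simplified illustration of the idea for a single chain of sigmoid neurons, not a general graph implementation; weight values are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Back-propagation sketch for a chain o = s(...s(s(x*w1)*w2)...):
# the forward pass stores each edge's input and each neuron's local
# derivative; the backward pass starts with 1 and collects edge weights
# and stored derivatives by multiplication, from right to left.
def backprop(x, weights):
    inputs, derivs = [], []
    a = x
    for w in weights:              # forward pass
        inputs.append(a)           # input in front of this edge weight
        s = sigmoid(a * w)
        derivs.append(s * (1 - s)) # stored neuronal derivative s'(a*w)
        a = s
    grads = [0.0] * len(weights)
    upstream = 1.0                 # the "1" entered on the right
    for i in reversed(range(len(weights))):  # backward pass
        upstream *= derivs[i]            # pick up the neuron's derivative
        grads[i] = upstream * inputs[i]  # multiply by the input before the weight
        upstream *= weights[i]           # pick up the edge weight, continue left
    return grads

grads = backprop(1.0, [0.3, -0.7])  # [do2/dw1, do2/dw2]
```

One forward and one backward sweep yield every partial derivative at once, which is the entire appeal of the algorithm.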
This simple approach drastically reduces the time spent calculating optimal parameters. The example above called for finding the values of parameters w1 and w2 that minimize the net function. Picture yourself standing on a mountain, trying to find the way down into the valley. You could test a series of adjacent parameter combinations step by step and run the network with each new setting. Thanks to back-propagation, though, you only have to analyze the net function once and you know which way has the steepest downward (or upward) slope. Changing the weights accordingly moves you one step in the locally optimal direction. It’s impossible to know for sure whether you’ve arrived at the global optimum, since there are plateaus (where you can’t tell which way leads down) and local minima. In practice, though, this method is capable of processing large amounts of training data and optimizing giant models.
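Putting the pieces together gives the full descent loop. Starting weights, step size, and iteration count here are assumptions chosen for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two-neuron chain from the text, f = g = sigmoid (assumed).
def net(x, w1, w2):
    return sigmoid(sigmoid(x * w1) * w2)

def gradient(x, w1, w2):
    o1 = sigmoid(x * w1)
    o2 = sigmoid(o1 * w2)
    gp = o2 * (1.0 - o2)               # g'
    fp = o1 * (1.0 - o1)               # f'
    return gp * w2 * fp * x, gp * o1   # (do2/dw1, do2/dw2)

x, w1, w2, lr = 1.0, 0.5, 0.5, 1.0
before = net(x, w1, w2)
for _ in range(100):                   # walk downhill step by step
    g1, g2 = gradient(x, w1, w2)
    w1 -= lr * g1
    w2 -= lr * g2
after = net(x, w1, w2)                 # noticeably smaller than before
```

Each iteration analyzes the net once, then moves both knobs in the locally optimal direction, exactly as in the mountain analogy.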
Modern neural networks have several million parameters. Google recently presented an architecture with 137 billion parameters (more on openreview.net). Thanks to the back-propagation algorithm, these highly complex models can be optimized with a gradient-based method. Modern software libraries such as TensorFlow or Torch allow users to describe such architectures at a high level of abstraction. The libraries then handle building the function graph and deriving the gradient automatically.
Since the gradient-based method is so versatile, more and more deep learning systems can be trained “end to end.” Take modern systems for image classification: they typically use convolutional neural networks, while language processing systems increasingly use recurrent neural networks. Since both can be trained with the same back-propagation algorithm, they can be combined into highly complex image captioning systems that receive an image as input and return a fitting written description of it.
It’s worth noting that there are several refinements of the gradient-based method we didn’t have the time and space to go into. The optimization process can often be sped up with momentum techniques or methods that additionally use the second derivative.
You can also use regularization techniques such as weight decay to prevent models from simply memorizing training data. And finally, there are many other optimization methods such as reinforcement learning, Bayesian black-box optimization or evolutionary algorithms, which work well even when you can’t compute the gradient of the model.