
Yordan Arango

The advent of Artificial Intelligence (AI) and, specifically, Deep Learning (DL) is not a product of our post-pandemic times. The perceptron is a concept that dates back to the 1950s, to Frank Rosenblatt's pioneering work. So why is it only now, almost 70 years after Rosenblatt's work, that DL is spreading its wings so powerfully? The answer is brute force, or in other words, high computational power. Basically, DL relies on trial and error until the model achieves an acceptable performance. This technique implies billions or even trillions of computations that neither Rosenblatt nor any computer engineer could have imagined 20 years ago. In our times, that is not a problem at all.

Remember from previous posts in this blog (Introduction to NNs and Activation Functions in DL) that what a Neural Network (NN) model tries to do is, given some input data and its parameters, namely its weights, \(w's\), and biases, \(b's\), to produce a forecast of a certain variable. For example, a model that, given the height and mass of a person, retrieves the optimal amount of water that this person should consume per day. The question is, how do we get the right values for \(w's\) and \(b's\)? i.e., how do we train the model? Easy: what the NN algorithm does is, for a certain amount of input data, test a huge number of combinations of the parameters \(w's\) and \(b's\), then compare the outputs of the model to the expected results, and finally select the combination that gives the best performance, a.k.a. the lowest error. Nothing difficult, to be honest; we just need a good computer: brute force at its finest. That's all folks!

I am lying. That's not all, folks! Unfortunately, no computer exists that is capable of testing all the possible combinations of an NN's parameter values. The reason is simple: the number of possible combinations is infinite. But suppose we limit the values each parameter can take to just, say, three. Nowadays (as of 2024) we are speaking about models with billions of parameters, whose values must be saved in files of tens of gigabytes (insane!). Just imagine how many combinations of parameters would exist even when allowing just three possible values per parameter: the number would be simply unimaginable. What can we do at this point?

Fortunately, there exists a method that, although it requires a huge number of tests, substantially reduces the number of combinations of parameter values to try. This method is called Gradient Descent (GD). A popular analogy is used to better understand GD. Imagine a blind person hiking in a mountainous region and trying to reach the valley, where a river can provide drinkable water (see the photo below). Some entity warns him or her of the inclination of the terrain in every direction. Can this information be useful to reach the river valley? Yes, it can. But not just that. If the right decisions are taken based on the provided information, the valley, a.k.a. the lowest terrain possible, will hopefully be reached relatively quickly. The right decision is always to take tiny steps in the direction of the steepest descent. In the next section we will review how this analogy compares with the GD technique, why GD reduces the number of combinations of parameter values to try, and how it is expressed in mathematical terms.
Photo by National Park Service
Yosemite Valley Day Hikes. Gradient descent is a metaphor for reaching a mountain valley from a peak by taking steps along the steepest downhill slope. Photo taken from www.nps.gov.

Climbing the error

Remember from the previous section that what we are seeking is the best combination of parameters, \(w's\) and \(b's\), such that the forecasts of the model are as close as possible to the real values. We call this process training the model. The key word here, in GD, is error. The error is simply a measure of how far the forecasts are from the real values. So, as you can suppose, all we want is to decrease that error as much as we can. That doesn't necessarily imply reaching an error of 0, but achieving an error low enough for our purposes. Let's agree that the error is a function of the combination of the model's parameters, i.e., the error changes as the parameters change (while the data remain the same): \begin{equation} E = f (w^l, b^l) \label{error_function} \end{equation} where,

$$ w^l= \left[\begin{array}{ccccccc} w^{l}_{1,1} & w^{l}_{1,2} & \cdots & w^{l}_{1,k} & \cdots & w^{l}_{1,K-1} & w^{l}_{1,K}\\ w^{l}_{2,1} & w^{l}_{2,2} & \cdots & w^{l}_{2,k} & \cdots & w^{l}_{2,K-1} & w^{l}_{2,K}\\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots\\ w^{l}_{j,1} & w^{l}_{j,2} & \cdots & w^{l}_{j,k} & \cdots & w^{l}_{j,K-1} & w^{l}_{j,K}\\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots\\ w^{l}_{J-1,1} & w^{l}_{J-1,2} & \cdots & w^{l}_{J-1,k} & \cdots & w^{l}_{J-1,K-1} & w^{l}_{J-1,K}\\ w^{l}_{J,1} & w^{l}_{J,2} & \cdots & w^{l}_{J,k} & \cdots & w^{l}_{J,K-1} & w^{l}_{J,K}\\ \end{array}\right] $$

and

$$ b^l = \left[\begin{array}{c} b^{l}_1 \\ b^{l}_2 \\ \vdots \\ b^{l}_j \\ \vdots \\ b^{l}_{J-1} \\ b^{l}_J\\ \end{array}\right] $$

with \(l = \{1, \cdots, L\}\), where \(L\) is the number of layers in the net, \(J\) is the number of perceptrons in the \(l^{th}\) layer, and \(K\) is the number of perceptrons in the \((l-1)^{th}\) layer. Thus, we can express \(f\) in equation \eqref{error_function} as a function of every element of every \(w^l\) and \(b^l\): $$ E = f (w^1_{1,1}, w^1_{1,2}, \cdots, w^1_{J,K}, \cdots, w^l_{J,K}, \cdots, w^L_{J,K}, \cdots, b^1_{1}, b^1_{2}, \cdots, b^1_{J}, \cdots, b^l_{J}, \cdots, b^L_{J}) $$
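To make this concrete, here is a minimal sketch in Python of what "the error is a function of the parameters" means. The single linear layer, the mean-squared-error measure, and the data are illustrative assumptions of mine, reusing the water-intake example from the introduction:

```python
import numpy as np

def error(w, b, X, y):
    """Mean squared error of a one-layer model with a linear output.

    w : (J, K) weight matrix;  b : (J,) bias vector
    X : (N, K) input rows;     y : (N, J) target rows
    E is a function of w and b: changing them changes E
    while the data stay fixed.
    """
    predictions = X @ w.T + b          # forward pass
    return np.mean((predictions - y) ** 2)

# Made-up data: height (m) and mass (kg) -> litres of water per day
X = np.array([[1.70, 70.0], [1.60, 55.0], [1.80, 80.0]])
y = np.array([[2.5], [2.0], [3.0]])

w = np.random.randn(1, 2)              # one output perceptron, two inputs
b = np.zeros(1)
print(error(w, b, X, y))               # E for this particular (w, b)
```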

Note that \(f\) can be a function of billions of parameters, so it is practically impossible to write down a mathematical equation for such a function. For practical and graphical purposes, let's reduce our problem to a model with just two parameters, \(v_1\) and \(v_2\). Thus, the error of the model would look like:

\begin{equation} E = f (v_1, v_2) \label{error_function_2p} \end{equation} If you are familiar with simple linear regression, you will also find equation \eqref{error_function_2p} familiar. In simple linear regression we train a model with two parameters. Given its simplicity, it is possible to know the equation describing the function \(f\) in equation \eqref{error_function_2p}. This way, we can find the values for \(v_1\) and \(v_2\) by taking the derivatives of \(E\) with respect to \(v_1\) and \(v_2\), equating them to zero, and solving for each parameter: $$ \frac{\partial E}{\partial v_1} = 0 $$ $$ \frac{\partial E}{\partial v_2} = 0 $$ This would be the ideal approach to our problem in DL. Nevertheless, we don't have such an equation for \(E\), since our problem is neither linear, as we explained in the previous post (Activation Functions in DL), nor does it have just two parameters, but billions. However, let's continue with our two-parameter model, whose error also depends on its parameters (equation \eqref{error_function_2p}). As you will see, GD will reveal itself as a methodology to find the best combination of values for \(v_1\) and \(v_2\), one that extends to models with billions of parameters. Those who have taken a course on numerical methods will find this method quite similar to the Newton-Raphson method used to find the roots of a complex equation (Newton's method). Just take into account that GD doesn't seek the point of zero error of the model, but the lowest error possible.
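For contrast, here is a minimal sketch of that closed-form route for simple linear regression, assuming a model \(y \approx v_1 x + v_2\) fitted with mean squared error; the data points are made up for illustration. Setting both derivatives to zero yields the familiar least-squares solution, with no iteration at all:

```python
import numpy as np

# Simple linear regression y ~ v1 * x + v2, fitted by setting
# dE/dv1 = dE/dv2 = 0 for the mean-squared error E and solving.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

x_mean, y_mean = x.mean(), y.mean()
v1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
v2 = y_mean - v1 * x_mean
print(v1, v2)   # slope ~1.94, intercept ~0.15: no trial and error needed
```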

The Newton-Raphson method is applied in a 2D space, in which one variable depends on another. Let's build some intuition in a 3D space by using equation \eqref{error_function_2p}, where the error depends on two other variables, \(v_1\) and \(v_2\). Then we will generalize to a high-dimensional space. Let's suppose the function \(f\) of \(v_1\) and \(v_2\) has the shape shown in the figure below.

Photo by Michael Nielsen
Graphical example of the error (\(E\)) of a DL model in terms of its parameters, \(v_1\) and \(v_2\). In this example, the vertical axis depicts the error of the model. Photo by Michael Nielsen: taken from www.neuralnetworksanddeeplearning.com and edited by the author.
Remember: all we want is to find values for \(v_1\) and \(v_2\) whose error is as close as possible to the lowest error. As we explained in the introductory paragraphs with the analogy of the blind hiker, this is achieved by taking tiny steps along the steepest negative slope. So, suppose we guess some combination of values for \(v_1\) and \(v_2\), as depicted in the figure below.
Photo by Michael Nielsen
Initial value of \(E\), \(E(i)\), given \(v_1(i)\) and \(v_2(i)\). In green, the negative of the gradient at the point \((v_1(i), v_2(i))\). Photo by Michael Nielsen: taken from www.neuralnetworksanddeeplearning.com and edited by the author.
To be clear, the index \(i\) inside the parentheses indicates the iteration at which \(v_1\) and \(v_2\) were guessed, as well as the corresponding value of \(E\). Thus, \(E(i+1)\) means the error of the model evaluated at iteration \(i+1\). Or in mathematical notation:

$$ E(i+1) \equiv E(v_1(i+1), v_2(i+1)) $$

If we want to take a step in the direction of the steepest negative slope, we just have to know the gradient of \(E\). Indeed, the gradient tells us the rate of change of a function in the direction of the steepest positive slope. And yes! I know you are wondering how to compute that if we don't have an equation for \(E\). But what if I told you that we don't need to know a general equation for \(E = f(v_1, v_2)\) to compute such a gradient? Just to recall, the gradient of a scalar function is a vector function/field. In the case of the function \(f\) in equation \eqref{error_function}, we can define the gradient of \(E\), \(\nabla E\), as the vector function whose elements are the partial derivatives of the function \(f\) with respect to each independent variable:

$$ \nabla E = \left[\begin{array}{c} \frac{\partial E}{\partial w^1_{1,1}} \\ \vdots \\ \frac{\partial E}{\partial w^L_{J,K}} \\ \vdots \\ \frac{\partial E}{\partial b^1_{1}} \\ \vdots \\ \frac{\partial E}{\partial b^L_{J}} \\ \end{array}\right] $$

In the case of \(E = f(v_1, v_2)\), we have:

$$ \nabla E (v_1, v_2) = \left[\begin{array}{c} \frac{\partial E}{\partial v_1} \\ \frac{\partial E}{\partial v_2} \\ \end{array}\right] $$

Again: I know we don't have a mathematical equation for \(E = f(v_1, v_2)\). But trust me, we will resolve this dilemma soon. For now, let's move forward.
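In the meantime, here is one way to convince yourself that \(\nabla E\) can be computed without a closed-form equation: as long as we can evaluate \(E\) at any point, each partial derivative can be estimated numerically by finite differences. This is just a sketch to build intuition; the bowl-shaped error below is a made-up stand-in, and real DL frameworks compute gradients differently, as we will see:

```python
import numpy as np

def numerical_gradient(E, v, h=1e-6):
    """Estimate the gradient of E at v by central finite differences.

    E : callable taking a parameter vector and returning a scalar error
    v : current parameter values, e.g. np.array([v1, v2])
    Each component is approximated as (E(v + h*e_k) - E(v - h*e_k)) / (2h).
    """
    grad = np.zeros_like(v)
    for k in range(len(v)):
        step = np.zeros_like(v)
        step[k] = h
        grad[k] = (E(v + step) - E(v - step)) / (2 * h)
    return grad

# Made-up bowl-shaped error surface with its minimum at (1, -2)
E = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
print(numerical_gradient(E, np.array([0.0, 0.0])))  # ~[-2.0, 4.0]
```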

Returning to the earlier figure, note that we have drawn a green arrow depicting the negative gradient of \(E\) at the point \((v_1(i), v_2(i))\). In the figure below, the new values for \(v_1\) and \(v_2\) are shown after a tiny step taken in the direction of the negative gradient. Note that we are multiplying the gradient \(\nabla E\) by a factor \(-\eta\). The negative sign is there because the gradient always points in the direction in which the function increases; we need to move in the opposite direction, i.e., along the negative of the gradient. The factor \(\eta\) scales the size of the gradient and is typically selected between zero and one. We call this factor the learning rate. Note that the size of \(\eta\) affects how quickly the algorithm converges to an optimal combination of the parameters \(v_1\) and \(v_2\). That's why it is called the learning rate.

Photo by Michael Nielsen
New values for \(v_1\) and \(v_2\), \((v_1(i+1), v_2(i+1))\). New values are computed by using gradient descent in \((v_1(i), v_2(i))\). Photo by Michael Nielsen: taken from www.neuralnetworksanddeeplearning.com and edited by the author.
From the geometry of the figure we can infer some facts. First, we can see that:

$$ - \eta \nabla E(i)_1 = v_1(i+1) - v_1(i) $$

$$ - \eta \nabla E(i)_2 = v_2(i+1) - v_2(i) $$

where the subscript of \(\nabla E(i)\) indicates the element of the gradient vector, i.e., its magnitude along that axis. Solving for \(v_1(i+1)\) and \(v_2(i+1)\) we arrive at:

$$ v_1(i+1) = v_1(i) - \eta \nabla E(i)_1 $$

$$ v_2(i+1) = v_2(i) - \eta \nabla E(i)_2 $$

This way, we have a method to update the values of the parameters. The point here is that the new combination of parameters is linked to a lower error, \(E(i+1)\), than \(E(i)\) (see the figure below).
Photo by Michael Nielsen
Error at \(v_1(i+1)\) and \(v_2(i+1)\), \(E(i+1)\). Note that \(E(i+1) < E(i)\). Photo by Michael Nielsen: taken from www.neuralnetworksanddeeplearning.com and edited by the author.
To summarize, if we can compute \(\nabla E(i)\) at \((v_1(i), v_2(i))\), we will hopefully be able to get a new combination of parameters, \(v_1(i+1)\) and \(v_2(i+1)\), for which \(E\) is lower, i.e., \(E(i+1) < E(i)\).

At this point it's evident that we can iteratively apply the same formulae until we reach an error value that fits our purposes (see the figure below).

Photo by Michael Nielsen
Multiple iterations of the GD method for updating the parameters \(v_1\) and \(v_2\). Photo by Michael Nielsen: taken from www.neuralnetworksanddeeplearning.com and edited by the author.
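Putting the update rule into code makes the iteration explicit. Here is a minimal sketch for two parameters, reusing the made-up bowl-shaped error from the earlier snippet, whose gradient we happen to know analytically; the learning rate, starting point, and number of iterations are arbitrary illustrative choices:

```python
import numpy as np

# Gradient descent on E(v1, v2) = (v1 - 1)^2 + (v2 + 2)^2,
# whose minimum sits at (1, -2).
def grad_E(v):
    return np.array([2 * (v[0] - 1.0), 2 * (v[1] + 2.0)])

eta = 0.1                      # learning rate
v = np.array([4.0, 3.0])       # initial guess (v1(i), v2(i))
for i in range(100):
    v = v - eta * grad_E(v)    # v(i+1) = v(i) - eta * grad E(i)
print(v)                       # ~[1.0, -2.0]: the valley of E
```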

A general formulation of gradient descent

Now that we have gained some intuition about GD in a low-dimensional space, let's generalize it to a space of any dimension. Suppose we have a DL model with a large number of parameters to learn, among \(w's\) and \(b's\). Now, guess an initial value for every parameter of the model:

$$ w^1_{1,1}(i) = w^1_{1,1_{0}} $$ $$ \vdots $$ $$ w^L_{J,K}(i) = w^L_{J,K_{0}} $$ $$ b^1_{1}(i) = b^1_{1_0} $$ $$ \vdots $$ $$ b^L_{J}(i) = b^L_{J_0} $$ Then, apply the gradient descent technique iteratively: $$ w^1_{1,1}(i+1) = w^1_{1,1}(i) - \eta \nabla E(i)_{w^1_{1,1}} $$ $$ \vdots $$ $$ w^L_{J,K}(i+1) = w^L_{J,K}(i) - \eta \nabla E(i)_{w^L_{J,K}} $$ $$ b^1_{1}(i+1) = b^1_{1}(i) - \eta \nabla E(i)_{b^1_{1}} $$ $$ \vdots $$ $$ b^L_{J}(i+1) = b^L_{J}(i) - \eta \nabla E(i)_{b^L_{J}} $$ Note in the previous figure that, after applying the GD technique multiple times, we get closer and closer to the valley of the function \(E = f(v_1, v_2)\). After considering all the mathematical intricacies behind GD, it is easy to see why the analogy of the blind person trying to reach the valley is appropriate. The entity in the analogy that informs the person about the slope of the mountain is the gradient vector of the error in the GD technique, and the learning rate in GD, \(\eta\), determines the size of the steps taken by the person.
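In code, the general formulation is the same update rule applied to every layer's weight matrix and bias vector at once. A minimal sketch, assuming the gradients are already available (computing them is precisely the job of backpropagation, the subject of the next post):

```python
import numpy as np

def gd_step(weights, biases, grad_w, grad_b, eta):
    """One gradient-descent update over every layer's parameters.

    weights, biases : lists with one array per layer (w^l, b^l)
    grad_w, grad_b  : gradients of E with the same shapes; how to
                      compute them (backpropagation) is covered in
                      the next post, so here they are assumed given.
    """
    new_w = [w - eta * gw for w, gw in zip(weights, grad_w)]
    new_b = [b - eta * gb for b, gb in zip(biases, grad_b)]
    return new_w, new_b
```

Representing the parameters as lists of per-layer arrays mirrors the matrices \(w^l\) and vectors \(b^l\) defined earlier.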

There is a subtle detail to be careful about when applying the GD technique. We mentioned that the steps taken by the blind person must be tiny, which means the learning rate has to be small. Let's build some intuition about why by considering the opposite case: a large learning rate. The figure below depicts such a situation. Note that we multiply the gradient of \(E\) by the negative of the learning rate, resulting in a large vector. The problem is that such a vector can carry us away from the desired valley of \(E\), potentially increasing the error. In our case, we can observe that the error associated with \(v_1(i+1)\) and \(v_2(i+1)\) ends up larger: \(E(i+1) > E(i)\).

Photo by Michael Nielsen
Gradient descent with a large learning rate, \(\eta\). Photo by Michael Nielsen: taken from www.neuralnetworksanddeeplearning.com and edited by the author.
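You can reproduce this overshooting numerically. A small sketch with the same made-up bowl-shaped error as before, comparing a small and a large learning rate:

```python
import numpy as np

# Same bowl-shaped error as before; compare a small and a large eta.
grad_E = lambda v: np.array([2 * (v[0] - 1.0), 2 * (v[1] + 2.0)])
E = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2

for eta in (0.1, 1.1):         # eta > 1 overshoots on this surface
    v = np.array([4.0, 3.0])
    for i in range(20):
        v = v - eta * grad_E(v)
    print(f"eta={eta}: E={E(v):.3g}")
# eta=0.1 shrinks the error toward 0; eta=1.1 makes it blow up,
# i.e. E(i+1) > E(i) at every step.
```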

Final remarks

As we have seen, gradient descent is a technique that allows us to reduce the number of tests we run to find the optimal combination of the model's parameter values. That's why I consider GD a brute-force technique, but a smart one: it tries many combinations of parameter values, yet always converges toward a better one.

Do you remember that we said we didn't need a mathematical equation to compute the gradient of \(E\), \(\nabla E\)? In other words, how can we compute that gradient? That question remains, and here is some intuition to resolve it. Imagine a corporation where economic losses have been found. The owner of the corporation asks the company's director about those losses. The director explains that he made some mistakes, but that those mistakes do not explain all the losses. So, the director goes to the board of directors and asks for explanations. Each member of the board explains something similar: some mistakes were made, but they do not fully account for the losses. After that, each executive goes to his subordinates and asks for explanations, and a similar answer is given. This way, the losses are explained by a chain of errors. In the next post about Deep Learning we will revisit this metaphor for a technique called Backpropagation, used precisely to compute the gradient of the error of a model.

See you soon!!