
Yordan Arango

In the previous entry of this blog on Deep Learning (DL) we learned about Gradient Descent (GD). We saw that GD is a technique for updating the values of the model's parameters so that the error of the model is reduced at each step. In mathematical notation, GD can be expressed as follows:

$$ p(i+1) = p(i) - \eta \nabla E(i)_p $$ where \(p\) is any parameter of the model, be it a weight (\(w\)) or a bias (\(b\)); \(\nabla E(i)_p\) is the component of the error gradient in the \(p\) direction at iteration \(i\); and \(\eta\) is the learning rate.
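To make the update rule concrete, here is a minimal NumPy sketch of one GD step. The names `params`, `grads`, and `eta` are illustrative assumptions of mine, not part of any particular library:

```python
import numpy as np

def gradient_descent_step(params, grads, eta=0.01):
    """One GD update, p(i+1) = p(i) - eta * dE/dp, applied to every parameter.

    params and grads are lists of NumPy arrays with matching shapes
    (for a neural net: the weight matrices and bias vectors of each layer).
    """
    return [p - eta * g for p, g in zip(params, grads)]

# Example with a single 2x2 weight matrix and its (made-up) gradient:
W = np.array([[0.5, -0.3], [0.1, 0.8]])
dE_dW = np.array([[0.05, 0.02], [-0.01, 0.03]])
W_new, = gradient_descent_step([W], [dE_dW], eta=0.1)
```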

Remember from the post on GD that the error is any measure of how far the model's forecasts are from the real values. Since we do not have an analytical equation telling us how the error \(E\) varies with the model's parameters (here denoted \(p\)), it is not possible to compute the gradient of \(E\) analytically. This is where the backpropagation algorithm comes into play: it will allow us to do that calculation.

Specifically, what the backpropagation algorithm does is determine the responsibility each neuron has in the final error. It is like a company: if there are losses, the owner will demand explanations from the general manager. The manager, in turn, will ask for explanations from the next level down in the hierarchy, say the senior managers; the senior managers will question the supervisors, and the supervisors the operators. In short, each one attributes to themselves a piece of the final error that explains the losses the company's owner is seeing. Here, each employee represents a neuron and each level of the hierarchy a layer of the network.

Let's recall some terminology that we will be using in the backpropagation formulation. From the scheme in the figure below we can see that the activation at position \(j\) of layer \(l\), \(a_j^{l}\), is a function of (1) the weights coming from layer \(l-1\), i.e. \(w_{j1}^{l}, w_{j2}^{l}, \cdots, w_{jk}^{l}, \cdots, w_{jK}^{l}\); (2) the activations of layer \(l-1\), i.e. \(a_1^{l-1}, a_2^{l-1}, \cdots, a_k^{l-1}, \cdots, a_K^{l-1}\); and (3) the bias at position \(j\) of layer \(l\), \(b_j^{l}\).

Image by the author
Section of a neural net showing layers \(l-1\) and \(l\), the first of which has \(K\) neurons while the second has \(J\) neurons; also shown are the \(K\) weights, \(w_{jk}^l\), connecting the \(K\) neurons of layer \(l-1\) with neuron \(j\) of layer \(l\), \(a_j^l\), as well as the bias \(b_j^l\) attached to that neuron.
The above is: \begin{equation} a_j^l=f\left(a_1^{l-1} \cdot w_{j 1}^l+a_2^{l-1} \cdot w_{j 2}^l+\cdots+a_k^{l-1} \cdot w_{j k}^l+\cdots+a_K^{l-1} \cdot w_{j K}^l+b_j^l\right) \label{aj_function} \end{equation} Note that the function \(f\) is the activation function \(\sigma\) applied to the weighted input \(z_j^{l}\) of perceptron \(j\) in layer \(l\). So: \begin{equation} a_j^l = \sigma \left(z_j^{l}\right), \label{aj_function2} \end{equation} \begin{equation} z_j^l = a_1^{l-1} \cdot w_{j 1}^l+a_2^{l-1} \cdot w_{j 2}^l+\cdots+a_k^{l-1} \cdot w_{j k}^l+\cdots+a_K^{l-1} \cdot w_{j K}^l+b_j^l \label{zj} \end{equation} This way, \(z^l\) can be thought of as the input vector of layer \(l\), whose component \(j\) is the input \(z^l_j\) to perceptron \(j\) of layer \(l\), i.e.: $$ z^l \equiv\left[\begin{array}{c} z_1^l \\ z_2^l \\ \vdots \\ z_j^l \\ \vdots \\ z_J^l \end{array}\right] $$ This vector has \(J\) elements, \(J\) being the number of perceptrons in layer \(l\). In the same way, \(w^l\) can be thought of as the weight matrix of layer \(l\), whose entry in row \(j\) and column \(k\) is the weight going from the \(k^{th}\) activation of layer \(l-1\) to the \(j^{th}\) perceptron of layer \(l\). Thus, $$ w^l \equiv\left[\begin{array}{cccccc} w_{11}^l & w_{12}^l & \cdots & w_{1 k}^l & \cdots & w_{1 K}^l \\ w_{21}^l & w_{22}^l & \cdots & w_{2 k}^l & \cdots & w_{2 K}^l \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ w_{j1}^l & w_{j 2}^l & \cdots & w_{j k}^l & \cdots & w_{j K}^l \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ w_{J 1}^l & w_{J 2}^l & \cdots & w_{J k}^l & \cdots & w_{J K}^l \end{array}\right] $$ This matrix has shape \(J \times K\), where \(J\) is the number of perceptrons in layer \(l\) and \(K\) the number of perceptrons in layer \(l-1\). Note that column \(k\) holds all the weights leaving activation \(k\) of layer \(l-1\) toward the \(J\) perceptrons of layer \(l\), while row \(j\) holds all the weights arriving at the \(j^{th}\) perceptron of layer \(l\) from the \(K\) perceptrons of layer \(l-1\).
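As a quick illustration of equations \eqref{aj_function2} and \eqref{zj}, the following sketch computes the weighted input and activation of a single perceptron with NumPy. The function names are my own, and the sigmoid is used only as a placeholder activation:

```python
import numpy as np

def sigmoid(z):
    """Example activation function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(w_j, a_prev, b_j):
    """Weighted input z_j^l and activation a_j^l of one perceptron.

    w_j    : weights w_{j1}^l, ..., w_{jK}^l, shape (K,)
    a_prev : activations a_1^{l-1}, ..., a_K^{l-1}, shape (K,)
    b_j    : bias b_j^l, a scalar
    """
    z_j = np.dot(w_j, a_prev) + b_j   # z_j^l = sum_k w_{jk}^l a_k^{l-1} + b_j^l
    a_j = sigmoid(z_j)                # a_j^l = sigma(z_j^l)
    return z_j, a_j
```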

Similarly, we can write

$$ a^l \equiv \left[\begin{array}{c} a_1^l\\ a_2^l \\ \vdots \\ a_j^l \\ \vdots \\ a_J^l \end{array}\right]=\sigma\left(\left[\begin{array}{c} z_1^l \\ z_2^l \\ \vdots \\ z_j^l \\ \vdots \\ z_J^l \end{array}\right]\right)=\left[\begin{array}{c} \sigma\left(z_1^l\right) \\ \sigma\left(z_2^l\right) \\ \vdots \\ \sigma\left(z_j^l\right) \\ \vdots \\ \sigma\left(z_J^l\right) \end{array}\right] $$ With the previous notation we can write, in matrix form: \begin{equation} z^l = w^l \cdot a^{l-1} + b^l \label{z_matrix_notation} \end{equation} $$ = \left[\begin{array}{c} w_{11}^l \cdot a_1^{l-1}+w_{12}^l \cdot a_2^{l-1}+\cdots+w_{1 k}^l \cdot a_k^{l-1}+\cdots+w_{1 K}^l \cdot a_K^{l-1}+b_1^l \\ w_{21}^l \cdot a_1^{l-1}+w_{22}^l \cdot a_2^{l-1}+\cdots+w_{2 k}^l \cdot a_k^{l-1}+\cdots+w_{2K}^l \cdot a_K^{l-1}+b_2^l \\ \vdots\\ w_{j1}^l \cdot a_1^{l-1}+w_{j2}^l \cdot a_2^{l-1}+\cdots+w_{j k}^l \cdot a_k^{l-1}+\cdots+w_{jK}^l \cdot a_K^{l-1}+b_j^l \\ \vdots\\ w_{J1}^l \cdot a_1^{l-1}+w_{J2}^l \cdot a_2^{l-1}+\cdots+w_{J k}^l \cdot a_k^{l-1}+\cdots+w_{JK}^l \cdot a_K^{l-1}+b_J^l \\ \end{array}\right] $$ Remember that what we are looking for is the gradient of the loss function \(C\) (a synonym of the error \(E\)), given a set of parameters \(w^1\), \(w^2\), \(\cdots\), \(w^l\), \(\cdots\), \(w^L\) and \(b^1\), \(b^2\), \(\cdots\), \(b^l\), \(\cdots\), \(b^L\), and a training set \((x_1,y_1)\), \((x_2,y_2)\), \(\cdots\), \((x_n,y_n)\), \(\cdots\), \((x_N, y_N)\), with respect to those parameters. Note that \(x\) and \(y\) are vectors that do not need to have the same size, and \(N\) is the number of pairs in the training set. That is: \begin{equation} C=f\left(w_{11}^1, \cdots, w_{J_1 K_1}^1, \cdots, w_{J_lK_l}^l, \cdots, w_{J_L K_L}^L, \cdots, b_1^1, \cdots, b_{J_l}^l, \cdots, b_{J_L}^L\right) \label{error_function} \end{equation} So, we want: \begin{equation} \nabla C= \left[ \frac{\partial C}{\partial w_{11}^1}, \frac{\partial C}{\partial w_{12}^1}, \cdots, \frac{\partial C}{\partial w_{J_l K_l}^l}, \cdots, \frac{\partial C}{\partial w_{J_L K_L}^L}, \frac{\partial C}{\partial b_1^1}, \cdots, \frac{\partial C}{\partial b_{J_1}^1}, \cdots, \frac{\partial C}{\partial b_{J_l}^l}, \cdots, \frac{\partial C}{\partial b_{J_L}^L} \right]^T \label{error_gradient} \end{equation} To apply the backpropagation algorithm, we need two assumptions about the loss function \(C\). The first one is that it can be written as an average, \(C=\frac{1}{N}\sum_X C_X\), of some quantity \(C_X\) that depends on the individual input \(X\); remember that \(N\) is the total number of training samples. The second one is that \(C_X\) can be written as a function of the outputs of the network, \(a^L\). One example satisfying both is the quadratic loss function, expressed as the average of the squared errors multiplied by \(\frac{1}{2}\); this factor is included just for convenience, as will become clear. In math notation we write: $$ C=\frac{1}{N} \sum_n^N \frac{1}{2} \sum_j^J\left[y_{n_j}-a_j^L\left(x_n, \cdots\right)\right]^2 $$ In the case of the quadratic loss function, we have \(C_X = \frac{1}{2}\sum_j^J(y_j - a_j^L)^2\). Note the summation over \(j\): we compute the squared error at each perceptron of the output layer and then sum them up. The advantage of writing the loss function in this way is that, since the derivative of a sum is the sum of the derivatives, the gradient of \(C\) becomes the average of the per-sample gradients.
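The matrix form \(z^l = w^l \cdot a^{l-1} + b^l\) and the per-sample quadratic cost can be sketched in NumPy as follows. This is only a minimal sketch assuming sigmoid activations; the function and variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, a_prev, b):
    """z^l = w^l . a^{l-1} + b^l and a^l = sigma(z^l), vectorized per layer.

    W      : (J, K) weight matrix of layer l
    a_prev : (K,) activations of layer l-1
    b      : (J,) biases of layer l
    """
    z = W @ a_prev + b
    return z, sigmoid(z)

def quadratic_cost(a_out, y):
    """C_X = (1/2) * sum_j (y_j - a_j^L)^2 for a single training sample."""
    return 0.5 * np.sum((y - a_out) ** 2)
```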
If we compute the gradient in the \(b_j^l\) direction, it looks as follows: $$ \frac{\partial C}{\partial b_j^l}=\frac{1}{N} \sum_n^N \frac{\partial}{\partial b_j^l} \frac{1}{2} \sum_m^M\left[y_{n_m}-a_m^L\left(x_n, \cdots\right)\right]^2 $$ with \(M\) the number of perceptrons in the output layer. The inner sum can be written compactly as \([y_n - a^L]^2\), understood as the dot product of the vector \(y_n - a^L\) with itself. Then, $$ \frac{\partial C}{\partial b_j^l}=\frac{1}{N} \sum_n^N \frac{\partial}{\partial b_j^l} \left\{ \frac{1}{2} \left[y_{n}-a^L\left(x_n, \cdots, w_{jk}^l, \cdots\right)\right]^2 \right\} $$ What this equation tells us is that we have to compute the derivative of the quadratic error for every training sample and then average them all to obtain the overall derivative. From here on, therefore, we will refer to \(C\) as the loss associated with the \(n^{th}\) training sample. Note that everything above also applies to \(w_{jk}^{l}\).

As mentioned previously, the backpropagation algorithm is a method for computing these derivatives numerically by attributing some portion of the final error to each perceptron of the net. To do so, let's introduce the quantity \(\delta_j^l\), the error at the \(j^{th}\) perceptron of layer \(l\). It is called an error because it tells us how the loss \(C\) changes with the weighted input \(z_j^l\):

\begin{equation} \delta_j^l \equiv \frac{\partial C}{\partial z_j^l} \label{dC_dz} \end{equation} Assuming that the bias \(b_j^l\) affects the loss only through the weighted input \(z_j^l\) of its own perceptron, and not through any other weighted input, we can apply the chain rule as follows: $$ \frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial b_j^l} $$ The first factor, \(\frac{\partial C}{\partial z_j^l} = \delta_j^l\), is a derivative we will soon learn how to compute. The second factor, \(\frac{\partial z_j^l}{\partial b_j^l}\), can be computed as follows: $$ \frac{\partial z_j^l}{\partial b_j^l} = \frac{\partial }{\partial b_j^l} \left[\left(\sum_k^K w_{jk}^l a_k^{l-1}\right) + b_j^l\right] $$ This is equal to 1. So, replacing: \begin{equation} \frac{\partial C}{\partial b_j^l} = \delta_j^l \label{dC_db} \end{equation} Now, to complete the calculation of the gradient of the loss function \(C\) (equation \eqref{error_gradient}), we need to compute \(\frac{\partial C}{\partial w_{jk}^l}\). Again we use the chain rule, with an approach for \(w_{jk}^l\) similar to the one used for \(b_j^l\): we assume that \(w_{jk}^l\) affects the loss only through \(z_j^l\), and not through the weighted inputs of subsequent layers (\(z_j^{l+1}\), \(z_j^{l+2}\), \(\cdots\), \(z_j^{L-1}\), \(z_j^L\)). So we have: $$ \frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} $$ From equation \eqref{dC_dz} we know that the first factor is: $$ \frac{\partial C}{\partial z_j^l} = \delta_j^l $$ The second factor can be handled as: $$ \frac{\partial z_j^l}{\partial w_{jk}^l} = \frac{\partial }{\partial w_{jk}^l} \left[\left(\sum_i^K w_{ji}^l a_i^{l-1}\right) + b_j^l\right] $$ Only the term of the summation with \(i = k\) survives, so: $$ \frac{\partial z_j^l}{\partial w_{jk}^l} = a_k^{l-1} $$ Thus, \begin{equation} \frac{\partial C}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1} \label{dC_dw} \end{equation} With the basic equations for the gradient of the loss function in place (equations \eqref{dC_db} and \eqref{dC_dw}), we can now tackle the computation of the error \(\delta_j^l = \frac{\partial C}{\partial z_j^l}\) itself. Here we make a different consideration about the weighted inputs: in contrast with the previous assumptions, we now use the fact that every weighted input of layer \(l+1\) depends on all the elements of the vector \(z^l\). In words, the weighted inputs of a layer depend on all the weighted inputs of the previous layer. Given that, and applying the chain rule (with \(K\) now denoting the number of perceptrons in layer \(l+1\)): $$ \frac{\partial C}{\partial z_j^l} = \frac{\partial C}{\partial z_1^{l+1}} \frac{\partial z_1^{l+1}}{\partial z_j^l} + \frac{\partial C}{\partial z_2^{l+1}} \frac{\partial z_2^{l+1}}{\partial z_j^l} + \cdots + \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_j^l} + \cdots + \frac{\partial C}{\partial z_K^{l+1}} \frac{\partial z_K^{l+1}}{\partial z_j^l} $$ $$ = \sum_k^K \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_j^l} $$ Note that \(\frac{\partial C}{\partial z_k^{l+1}} = \delta_k^{l+1}\).
On the other hand, we can express \(z_k^{l+1}\) in terms of \(z^l\) as follows: $$ z_k^{l+1} = \sum_j^J w_{kj}^{l+1} a_j^l + b_k^{l+1} = \sum_j^J w_{kj}^{l+1} \sigma(z_j^l) + b_k^{l+1} $$ Replacing, $$ \frac{\partial C}{\partial z_j^l} = \sum_k^K \delta_k^{l+1} \frac{\partial}{\partial z_j^l} \left[\sum_i^J w_{ki}^{l+1} \sigma (z_i^l) + b_k^{l+1}\right] $$ Only the term with \(i = j\) survives the differentiation, and it turns into \(w_{kj}^{l+1} \sigma'(z_j^l)\), where \(\sigma'\) denotes the first derivative of the activation function. Thus we have: \begin{equation} \delta_j^l = \sigma'(z_j^l) \sum_k^K \delta_k^{l+1} w_{kj}^{l+1} \label{dC_dz_solved} \end{equation} Observe that the factor \(\sigma'(z_j^l)\) is common to all terms in the summation, so it can be taken outside the sum. As can be seen, we need the vector of errors of layer \(l+1\), \(\delta^{l+1}\), in order to compute the errors of the previous layer \(l\), \(\delta^{l}\). This is the reason why the algorithm is called backpropagation: we are propagating the error from later layers back to earlier ones.
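A minimal NumPy sketch of the backward recursion in equation \eqref{dC_dz_solved}, together with the per-layer gradients of equations \eqref{dC_db} and \eqref{dC_dw}, could look like this. It assumes the sigmoid activation (whose derivative is written out a bit further below), and the function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)), valid for the sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backpropagate_error(delta_next, W_next, z):
    """delta_j^l = sigma'(z_j^l) * sum_k delta_k^{l+1} w_{kj}^{l+1}.

    delta_next : errors delta^{l+1} of layer l+1
    W_next     : weight matrix w^{l+1}, one row per perceptron of layer l+1
    z          : weighted inputs z^l of layer l
    """
    return (W_next.T @ delta_next) * sigmoid_prime(z)

def layer_gradients(delta, a_prev):
    """dC/dw_{jk}^l = delta_j^l * a_k^{l-1}  and  dC/db_j^l = delta_j^l."""
    return np.outer(delta, a_prev), delta
```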
"This is the reason why this algorithm is called backpropagation, i.e we are propagating the error from advanced layers to previous ones."
The only value of \(l\) for which equation \eqref{dC_dz_solved} doesn't work is \(l = L\), because neither the error \(\delta_k^{L+1}\) nor the weight \(w_{kj}^{L+1}\) exists. For that reason we have to obtain an expression for \(\delta_j^L = \frac{\partial C}{\partial z_j^L}\). Since we know how to compute the loss function in terms of \(a_j^L\) (remember \(C = \frac{1}{2} \sum_j^J (y_j - a_j^L)^2\)), and how to express \(a_j^L\) in terms of its weighted input \(z_j^L\), we can write: $$ \delta_j^L = \frac{\partial C}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L} $$ The first factor gives: $$ \frac{\partial C}{\partial a_j^L} = \frac{\partial}{\partial a_j^L} \left[\frac{1}{2} \sum_i^J (y_i - a_i^L)^2\right] = \frac{\partial}{\partial a_j^L} \left[\frac{1}{2} (y_j - a_j^L)^2\right] $$ Note how the terms with \(i \neq j\) vanish. Then: $$ \frac{\partial C}{\partial a_j^L} = a_j^L - y_j $$ The second factor is simply \(\sigma'(z_j^L)\). Thus, replacing, we have: \begin{equation} \delta_j^L = (a_j^L - y_j) \sigma'(z_j^L) \label{dC_dz_solved_output} \end{equation} Keep in mind that equation \eqref{dC_dz_solved_output} only applies if we are using the quadratic loss function. Also, remember that the notation \(\sigma'\) refers to the first derivative of the activation function, which depends on the function we choose, as we saw in the entry on activation functions. For the sigmoid activation function, we have: $$ \sigma(z_j^l) \equiv \frac{1}{1 + e^{-z_j^l}} $$ $$ \sigma'(z_j^l) = \sigma(z_j^l) [1 - \sigma(z_j^l)] $$ You can verify this result by applying the basic rules of differentiation to the sigmoid function.
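For the output layer, equation \eqref{dC_dz_solved_output} together with the sigmoid derivative above translates into a short sketch like the following (again assuming the quadratic loss; the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_out, y, z_out):
    """delta^L = (a^L - y) * sigma'(z^L), valid for the quadratic loss."""
    return (a_out - y) * sigmoid_prime(z_out)
```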

The algorithm

We have derived the four fundamental equations of the backpropagation algorithm (equations \eqref{dC_db}, \eqref{dC_dw}, \eqref{dC_dz_solved} and \eqref{dC_dz_solved_output}). Let us state them again: $$ \frac{\partial C}{\partial b_j^l} = \delta_j^l $$ $$ \frac{\partial C}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1} $$ $$ \delta_j^l = \sigma'(z_j^l) \sum_k^K \delta_k^{l+1} w_{kj}^{l+1} $$ $$ \delta_j^L = (a_j^L - y_j) \sigma'(z_j^L) $$ The algorithm works this way:

1. Find \(\delta_j^L\) by using equation \eqref{dC_dz_solved_output} for every \(j\) from 1 to \(J\)

2. For every \(l\) from \(L-1\) to 1, compute \(\delta_j^l\) by using \eqref{dC_dz_solved} for every \(j\) from 1 to \(J\)

3. Compute the error gradient \eqref{error_gradient} by calculating \(\frac{\partial C}{\partial b_j^l}\) and \(\frac{\partial C}{\partial w_{jk}^l}\) as expressed in equations \eqref{dC_db} and \eqref{dC_dw}

It is important to highlight that the values of variables such as \(w_{jk}^l\), \(b_j^l\), \(z_j^l\), \(a_j^L\), \(y_j\), etc., are already available, since the model starts from an initial estimate of its parameters. So the backpropagation algorithm is only applied to find the values on the left-hand side of equations \eqref{dC_db}, \eqref{dC_dw}, \eqref{dC_dz_solved} and \eqref{dC_dz_solved_output}.
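Putting the three steps together, a minimal NumPy sketch of the whole algorithm for a single training pair \((x, y)\) might look as follows. It assumes sigmoid activations and the quadratic loss, and the data structures (`weights` and `biases` as lists of per-layer arrays) are my own choice, not the only possible one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients of the quadratic loss for ONE training pair (x, y).

    weights[l] has shape (J_l, K_l) and biases[l] has shape (J_l,),
    following the w^l and b^l notation used in this post.
    """
    # Forward pass: store every z^l and a^l, since the backward pass needs them.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Step 1: output error, delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_W = [np.outer(delta, activations[-2])]   # dC/dw^L = delta^L (a^{L-1})^T
    grad_b = [delta]                              # dC/db^L = delta^L

    # Step 2: propagate the error backwards through layers L-1, ..., 1.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_W.insert(0, np.outer(delta, activations[-l - 1]))
        grad_b.insert(0, delta)

    # Step 3: grad_W and grad_b hold the components of the gradient of C.
    return grad_W, grad_b
```

The returned lists `grad_W` and `grad_b` are the components of the gradient in equation \eqref{error_gradient} for that sample; averaging them over the training set (or a mini-batch) and plugging the result into the GD update rule closes the loop.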

Final remarks

As stated at the beginning of this entry, backpropagation is intended to compute the gradient of the error at each iteration of the gradient descent method, so as to arrive at a better combination of parameters \(b\)'s and \(w\)'s. So, remember! Gradient descent allows us to update the parameters of the model, and backpropagation feeds the gradient descent algorithm with the gradient of the error. That's all folks!