
Yordan Arango

Non-linearity! The core of this post. Think about the meeting of a river with the sea. How can such a simple phrase describe such a difficult process? Imagine traveling from the interior of Russia to the Arctic Ocean by navigating the Lena River (see the image below). It sounds like you simply board a boat and let it carry you with the flow, doesn't it? The question is, which path do you choose (look at the image again)? Magellan faced something similar when he tried to find a way from the Atlantic to the Pacific through the continental mass of America. After several months of fighting storms, monster waves, hunger, and even insurrections, he came to a gap in the continent, believing that a route had finally been found; but the ordeal of finding the right route had only just begun, because what on earth was the correct path to the Pacific (search for the Strait of Magellan in your browser)?

Photo by USGS
Lena river delta in the north of Russia. Image taken by the Landsat 7 satellite, operated by the U.S. Geological Survey and NASA.
The question remains: what makes the discharge of a river so complex? It helps to think about all the forces involved in the problem. Consider, for example, the ocean currents, the waves and tides, the water flow from the river, the sediment load, the hydrological cycle in the basin, climate change, the political decisions that prevent the area from being disturbed, among others. As you can see, many processes overlap, making it difficult to predict how a delta will behave. This complexity is present in the vast majority of problems we face as humankind, not only in nature but also in economics, politics, ethics, etc., where multiple phenomena act at once, making the problem difficult to describe and even more challenging to predict. We frequently refer to these kinds of problems as non-linear processes, to denote the complexity and challenging nature of the phenomena and to highlight the contrast with problems whose behavior is considered trivial, simple, or linear.

Linearity of Neural Networks (NNs)

Remember the last two equations from the post dedicated to introducing Neural Networks (NNs) and Deep Learning (DL):

$$ a = g\left(\underbrace{\left[\begin{array}{cccc} w_{11} & w_{12} & w_{13} & w_{14}\\ w_{21} & w_{22} & w_{23} & w_{24}\\ w_{31} & w_{32} & w_{33} & w_{34}\\ \end{array}\right]}_{w} \cdot \underbrace{\left[\begin{array}{c} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \end{array}\right]}_{x} + \underbrace{\left[\begin{array}{c} b_1 \\ b_2 \\ b_3 \\ \end{array}\right]}_{b}\right) $$

$$ a = g (w \cdot x + b) $$

In this case we are applying two linear transformations to the \(x\) vector: a matrix multiplication followed by a vector addition. Let's introduce a new notation to keep track of the layer we are working with:

$$ a^l = g (w^l \cdot a^{l-1} + b^l) $$

where \(a^l\) and \(a^{l-1}\) denote the vectors of activations in layers \(l\) and \(l-1\), respectively; \(w^l\), the matrix of weights connecting layer \(l-1\) with layer \(l\); and \(b^l\), the vector of biases of layer \(l\). This way, we multiply the activations from the previous layer by the current matrix of weights and add the current vector of biases. Notice that \(l = \{1,2,...,L\}\), with \(L\) the total number of layers in the network. Expanding this, we have:

$$ \left[\begin{array}{c} a^{l}_{1}\\ a^{l}_{2}\\ \vdots\\ a^{l}_{j}\\ \vdots\\ a^{l}_{J-1}\\ a^{l}_{J}\\ \end{array}\right] = g\left(\left[\begin{array}{ccccccc} w^{l}_{11} & w^{l}_{12} & \cdots & w^{l}_{1k} & \cdots & w^{l}_{1K-1} & w^{l}_{1K}\\ w^{l}_{21} & w^{l}_{22} & \cdots & w^{l}_{2k} & \cdots & w^{l}_{2K-1} & w^{l}_{2K}\\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots\\ w^{l}_{j1} & w^{l}_{j2} & \cdots & w^{l}_{jk} & \cdots & w^{l}_{jK-1} & w^{l}_{jK}\\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots\\ w^{l}_{J-11} & w^{l}_{J-12} & \cdots & w^{l}_{J-1k} & \cdots & w^{l}_{J-1K-1} & w^{l}_{J-1K}\\ w^{l}_{J1} & w^{l}_{J2} & \cdots & w^{l}_{Jk} & \cdots & w^{l}_{JK-1} & w^{l}_{JK}\\ \end{array}\right] \cdot \left[\begin{array}{c} a^{l-1}_1 \\ a^{l-1}_2 \\ \vdots \\ a^{l-1}_k \\ \vdots \\ a^{l-1}_{K-1} \\ a^{l-1}_K\\ \end{array}\right] + \left[\begin{array}{c} b^{l}_1 \\ b^{l}_2 \\ \vdots \\ b^{l}_j \\ \vdots \\ b^{l}_{J-1} \\ b^{l}_{J} \\ \end{array}\right] \right) $$
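If it helps to see this step in code, here is a minimal NumPy sketch of a single layer's forward pass; the sigmoid used for \(g()\), the layer sizes, and the random toy values are assumptions of mine, not anything fixed by the post.

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid, used here only as an example choice for g()
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b, g=sigmoid):
    # a^l = g(w^l · a^{l-1} + b^l)
    return g(W @ a_prev + b)

# Toy sizes (my assumption): K = 3 activations in layer l-1, J = 4 neurons in layer l
rng = np.random.default_rng(0)
a_prev = rng.normal(size=3)   # a^{l-1}
W = rng.normal(size=(4, 3))   # w^l, shape (J, K)
b = rng.normal(size=4)        # b^l
a_l = layer_forward(a_prev, W, b)
print(a_l.shape)              # (4,)
```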

Note that:

$$ a^1 = g (w^1 \cdot x + b^1) $$

where \(x = a^0\), with \(x\) the input layer to the net. For what follows, let's forget about the function \(g()\) we are applying to \(w^l \cdot a^{l-1} + b^l\), so:

$$ a^l = w^l \cdot a^{l-1} + b^l $$

It won't be difficult for you to see that, without \(g()\), \(a^L\) is just a linear transformation of \(x = a^0\). That is because, in the end, a linear transformation of a linear transformation of a linear transformation... of a vector is itself a linear transformation. At this point, you have probably already guessed that this is precisely the reason for the function \(g()\). And you are right!
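Here is a small numerical check of that claim: three stacked layers with no \(g()\) produce exactly the same output as one equivalent linear transformation. The layer sizes and random values below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three layers with the activation g() removed: a^l = w^l · a^{l-1} + b^l
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)
W3, b3 = rng.normal(size=(2, 5)), rng.normal(size=2)

x = rng.normal(size=3)        # x = a^0

# Layer-by-layer computation
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2
a3 = W3 @ a2 + b3

# A single linear transformation gives exactly the same result
W_eq = W3 @ W2 @ W1
b_eq = W3 @ W2 @ b1 + W3 @ b2 + b3
print(np.allclose(a3, W_eq @ x + b_eq))   # True
```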

As we saw at the beginning of this post, non-linearity is a feature of the vast majority of processes we investigate in different areas of knowledge. That is why the purely linear approach of Neural Networks (NNs) shown earlier is not a suitable approximation for addressing current challenges. This is the reason we introduce the functions \(g()\), also called activation functions, which give the desired non-linear behavior to every neuron's output. To be clear, this function is applied element-wise:

$$ g\left(\left[\begin{array}{c} a^{l}_{1}\\ a^{l}_{2}\\ \vdots\\ a^{l}_{j}\\ \vdots\\ a^{l}_{J-1}\\ a^{l}_{J}\\ \end{array}\right]\right) = \left[\begin{array}{c} g\left(a^{l}_{1}\right)\\ g\left(a^{l}_{2}\right)\\ \vdots\\ g\left(a^{l}_{j}\right)\\ \vdots\\ g\left(a^{l}_{J-1}\right)\\ g\left( a^{l}_{J}\right)\\ \end{array}\right] $$
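As a tiny illustration of that element-wise behavior, applying \(g()\) to a vector gives the same result as applying it to each component separately; the sigmoid below is just one possible choice for \(g()\), assumed for the example.

```python
import numpy as np

def g(z):
    # Sigmoid, chosen only as an example of a scalar activation
    return 1.0 / (1.0 + np.exp(-z))

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(g(a))                          # g applied to the whole vector at once
print(np.array([g(v) for v in a]))   # same result, component by component
```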

Activation functions

Illustration by the author.
Here is a list of different activation functions you can choose from according to your preferences and application.

Step activation function

$$ y = g(x) = \left\{\begin{matrix} 0 & & x \leq 0\\ 1 & & x > 0 \end{matrix}\right. $$

Step activation function. Illustration by the author.

Signum activation function

$$ y = g(x) = \left\{\begin{matrix} -1 & & x \leq 0\\ 1 & & x > 0 \end{matrix}\right. $$

Signum activation function. Illustration by the author.

Sigmoid activation function

$$ y = g(x) = \frac{1}{1+e^{-x}} $$

Sigmoid activation function. Illustration by the author.

Hyperbolic tangent activation function

$$ y = g(x) = \frac{e^x - e^{-x}}{e^x+e^{-x}} $$

Hyperbolic tangent activation function. Illustration by the author.

ReLU activation function

$$ y = g(x) = \left\{\begin{matrix} 0 & & x \leq 0\\ x & & x > 0 \end{matrix}\right. $$

ReLU activation function. Illustration by the author.

Softmax activation function

$$ y_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $$

This is a special activation function that we will cover in detail in later posts. Don't worry if you don't understand it yet; just keep it in mind.
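If you want to experiment with the functions listed above, here is a minimal NumPy sketch of all of them; the vectorized style and the max-subtraction inside the softmax are implementation choices of mine, not part of the definitions.

```python
import numpy as np

def step(x):
    # 0 for x <= 0, 1 for x > 0
    return np.where(x > 0, 1.0, 0.0)

def signum(x):
    # -1 for x <= 0, 1 for x > 0 (as defined above)
    return np.where(x > 0, 1.0, -1.0)

def sigmoid(x):
    # 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # (e^x - e^{-x}) / (e^x + e^{-x})
    return np.tanh(x)

def relu(x):
    # 0 for x <= 0, x for x > 0
    return np.maximum(0.0, x)

def softmax(z):
    # Subtracting max(z) before exponentiating avoids numerical overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.linspace(-3.0, 3.0, 7)
for f in (step, signum, sigmoid, tanh, relu):
    print(f.__name__, f(x))
print("softmax", softmax(x), "sums to", softmax(x).sum())
```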

Final remarks

What we have seen is how Artificial Intelligence, especially Deep Learning, has literally become part of our lives in the last 2 to 5 years. But, you know what? The first perceptron, a.k.a. the first Neural Network, was invented about 60 years ago. Thus, let's agree that what is new here is something different from Deep Learning and Artificial Intelligence. That new thing is brute force. And yes! It was not available until our days, at least not enough to develop Artificial Intelligence as we know it. Let me explain that in the next post of this series dedicated to the math of DL. There, we will begin to dive into a fascinating world of math, creative solutions to Deep Learning problems, and computers. See you soon!!!