
Yordan Arango

One of the most fascinating things happening in our era (as of 2024) is the irruption of Artificial Intelligence (AI). In the last 2 to 3 years, we have seen how incredible things can be done just by typing a few sentences into our devices: from asking a chatbot for a complete investment guide for the New York Stock Exchange ("The Big Board") to asking it (him/her?) to compose an overture in the style of Beethoven. How is that possible? In short: math and computers. The aim of this series of posts is to review the mathematical intricacies behind the Neural Network (NN) algorithms that are making these technologies possible. But first, let us start by clarifying some concepts about AI, NNs, Deep Learning (DL), and others.

Photo by Pavel Danilyuk on Pexels.
First, AI is a broad branch of computer science and informatics intended to create machines capable of doing human tasks. Machine Learning (ML) is a subarea of AI aimed at creating algorithms based on patterns learned from data. Here, the big difference between AI and ML is the use of data to learn the algorithms; i.e., while AI builds machines using rules from any source, ML makes machines learn those rules from data. Finally, DL is a subarea of ML that uses deep NNs to learn patterns from data and build algorithms.
AI-ML-DL
Illustration by the author.
Now that we know the difference between these common terms, let us dive into NNs. NNs are called so because of their similarity to the structure of the brain: NNs are constituted by perceptrons just as brains are by neurons. In fact, the way information flows through the perceptrons also resembles the electrical impulses that carry information through the brain's neurons. As we know, the information entering the brain (light, smells, flavors, sounds, etc.) is transformed into electrical signals which are somehow converted into data about the world that we can interpret. Both in NNs and in brains, information from previous perceptrons/neurons is transformed into a different kind of information that is passed to the next perceptron/neuron. The difference is that neurons process and pass electrical signals, whereas perceptrons literally process and pass numbers. Thus, perceptrons are to NNs as neurons are to the brain.
"...perceptrons are to NN's as neurons to brain."

How do perceptrons work?

In the previous metaphor, perceptrons are similar to neurons, at least as far as the transit of information is concerned. But we do not expect perceptrons to have dendrites or axons like neurons do. So, what do perceptrons look like? See the figure below. In a perceptron, information from previous perceptrons is taken and transformed into new information that is passed to the following perceptrons. The input information from previous perceptrons is illustrated here by the red squares \(x_1\) and \(x_2\). This information is converted into the output data in two steps: 1. a weighted sum is computed by multiplying every input, \(x_1\) and \(x_2\), by its corresponding weight, \(w_{x_1}\) and \(w_{x_2}\), summing them up, and then adding a bias \(b\); 2. the weighted sum is then passed through an activation function.
perceptron
Perceptron. The perceptron is the most basic unit of a NN: information from previous perceptrons is taken and transformed into new information that is passed to the following perceptrons. The input information from previous perceptrons is illustrated here by the red squares \(x_{1}\) and \(x_{2}\). This information is converted into the output data in two steps: 1. a weighted sum is computed by multiplying every input, \(x_1\) and \(x_2\), by its corresponding weight, \(w_{x_1}\) and \(w_{x_2}\), summing them up, and then adding a bias \(b\); 2. the weighted sum is then passed through an activation function. Illustration by the author.
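In code, those two steps fit in a few lines. Here is a minimal Python sketch (the function and argument names are purely illustrative, not taken from any particular library):

```python
def perceptron(inputs, weights, bias, activation):
    """Basic perceptron: weighted sum of the inputs plus a bias, then an activation."""
    # Step 1: weighted sum of the inputs plus the bias
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: pass the weighted sum through the activation function
    return activation(weighted_sum)
```

The activation function is left as an argument on purpose; we will define a concrete one just below.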
What we pursue when we build a NN is to find the values for the weights and biases, in this case \(w_{x_1}\), \(w_{x_2}\) and \(b\), that best solve the problem we are facing. Suppose we want a NN such that, given two inputs \(x_1\) and \(x_2\) of binary nature (i.e., 0 or 1), we get an output equal to 1 if both inputs satisfy \(x_1 = x_2 = 1\), and 0 otherwise. The following activation function is given:

$$F(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \\ \end{cases}$$

Let us try the following values for the weights \(w_{x_1}\) and \(w_{x_2}\) (written \(w_1\) and \(w_2\) from now on) and the bias \(b\):

$$w_1 = 2$$ $$w_2 = 4$$ $$b = 0$$

Possible outputs for different combinations of \(x_1\) and \(x_2\) are shown in the following table:
Possible outputs for different combinations of \(x_1\) and \(x_2\), given \(w_1 = 2\), \(w_2 = 4\) and \(b = 0\).
| \(x_1\) | \(x_2\) | \(\sum x_i \cdot w_i + b\) | activation | output |
|---|---|---|---|---|
| 1 | 0 | \(1 \cdot 2 + 0 \cdot 4 + 0 = 2\) | \(F(2) = 1\) | 1 |
| 0 | 1 | \(0 \cdot 2 + 1 \cdot 4 + 0 = 4\) | \(F(4) = 1\) | 1 |
| 0 | 0 | \(0 \cdot 2 + 0 \cdot 4 + 0 = 0\) | \(F(0) = 1\) | 1 |
| 1 | 1 | \(1 \cdot 2 + 1 \cdot 4 + 0 = 6\) | \(F(6) = 1\) | 1 |
Note that the net is not suitable, as the output is always 1; however, this should happen only in the case \(x_1 = x_2 = 1\). Now, let us assess the NN again with the following values:

$$w_1 = 2$$ $$w_2 = 1$$ $$b = -3$$

Possible outputs for different combinations of \(x_1\) and \(x_2\), given \(w_1 = 2\), \(w_2 = 1\) and \(b = -3\).
| \(x_1\) | \(x_2\) | \(\sum x_i \cdot w_i + b\) | activation | output |
|---|---|---|---|---|
| 1 | 0 | \(1 \cdot 2 + 0 \cdot 1 + (-3) = -1\) | \(F(-1) = 0\) | 0 |
| 0 | 1 | \(0 \cdot 2 + 1 \cdot 1 + (-3) = -2\) | \(F(-2) = 0\) | 0 |
| 0 | 0 | \(0 \cdot 2 + 0 \cdot 1 + (-3) = -3\) | \(F(-3) = 0\) | 0 |
| 1 | 1 | \(1 \cdot 2 + 1 \cdot 1 + (-3) = 0\) | \(F(0) = 1\) | 1 |
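As a quick check of both tables, here is a small Python sketch (a self-contained version of the perceptron above, with the step activation \(F\)) that evaluates every input combination for the two sets of parameters we tried:

```python
def step(x):
    """Step activation F: returns 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def perceptron(inputs, weights, bias, activation):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

# First attempt (w1 = 2, w2 = 4, b = 0): the output is always 1.
# Second attempt (w1 = 2, w2 = 1, b = -3): the output is 1 only when x1 = x2 = 1.
for weights, bias in [((2, 4), 0), ((2, 1), -3)]:
    print(f"weights = {weights}, bias = {bias}")
    for x1 in (0, 1):
        for x2 in (0, 1):
            out = perceptron((x1, x2), weights, bias, step)
            print(f"  x1 = {x1}, x2 = {x2} -> output {out}")
```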
We refer to the previous problem as an AND GATE problem, and its graphical representation is as follows:
and_gate
AND GATE problem. Illustration by the author.
Here, the diagonal line is the solution we were trying to find with the NN, characterized by the parameters \(w_1 = 2\), \(w_2 = 1\) and \(b = -3\): above this line we get an output equal to 1, and below it an output equal to 0.
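In equation form (this is just the boundary implied by the parameters above, written out explicitly), that diagonal line is the set of points where the weighted sum is exactly zero:

$$2 x_1 + 1 \cdot x_2 - 3 = 0 \quad \Longleftrightarrow \quad x_2 = 3 - 2 x_1$$

Among the four binary input combinations, only \((1, 1)\) satisfies \(2 x_1 + x_2 - 3 \geq 0\), which is why it is the only one mapped to 1.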

An OR GATE would be solved as follows:

or_gate
OR GATE problem. Illustration by the author.
And for a XOR GATE we would need more than one line:
xor_gate
XOR GATE problem. Illustration by the author.
The requirement for an additional line is reflected in a NN with a more complex architecture, with perhaps two or more perceptrons connected to each other, as in the sketch below. Let us then review how a regular NN, with two or more perceptrons, is structured.
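To make that idea concrete, here is one possible hand-picked (not learned) combination of perceptrons that solves the XOR GATE: an OR perceptron and a NAND perceptron feed an AND perceptron. The specific weights below are just one choice that happens to work with the step activation defined earlier; a real NN would learn such values from data.

```python
def step(x):
    """Step activation: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def perceptron(inputs, weights, bias):
    # Step activation is fixed here for brevity
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)

def xor_gate(x1, x2):
    """XOR built from three perceptrons: OR and NAND in a hidden layer, AND at the output."""
    or_out = perceptron((x1, x2), (2, 2), -1)      # 1 if at least one input is 1
    nand_out = perceptron((x1, x2), (-2, -2), 3)   # 1 unless both inputs are 1
    return perceptron((or_out, nand_out), (2, 2), -3)  # AND of the two hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_gate(x1, x2))  # prints 0, 1, 1, 0
```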

Neural Networks architecture

A typical NN is composed of multiple perceptrons, potentially millions, connected to one another. There are three types of perceptrons in a NN: those in the input layer, those in the hidden layers, and those in the output layer. A look at the figure below can make it easier to understand.
architecture
Typical Neural Network architecture with an input layer, multiple hidden layers, and an output layer. Typically, a Neural Network can have millions of perceptrons. Illustration by the author.
There are some ideas to highlight here. As we already showed, perceptrons are arranged in layers. The input layer doesn't perform any computations; instead, it represents the relevant information of the problem. These are the \(x_1\) and \(x_2\) variables in the previous examples. The hidden layers are where the information is transformed and where the network learns the relationship between the input data and the expected output. And the output layer represents the result of all the calculations the NN performed.

Perceptrons in a layer just pass their outputs to the next layer of perceptrons. There are no connections between the perceptrons that make up a layer, and no connections back to previous layers. That means the information flows in just one direction. See in the figure below how two adjacent layers share information.

two_layers
Two adjacent layers of a NN. The notation \(w_{jk}\) refers to the weight connecting the \(j^{th}\) perceptron from a layer with the \(k^{th}\) perceptron of the previous one. Illustration by the author.
Here we have two layers represented by circles: in the left one we have inputs \(x_k\); in the right one we have outputs \(a_j\). The weights, on the other hand, are represented by squares. Each perceptron in the left layer is connected to every perceptron in the right layer through a weight \(w_{jk}\). Yellow squares connect the perceptrons of the left layer with the first perceptron of the right layer, \(a_1\). Similarly, green and blue squares connect the left perceptrons with the second and third perceptrons on the right (\(a_2\) and \(a_3\)), respectively. Nevertheless, notice that these colors are not needed to know which weight connects two adjacent perceptrons: the subscripts of the weights are enough for this purpose. Thus, the weight \(w_{13}\) connects the perceptron \(x_3\) on the left with the perceptron \(a_1\) on the right. In general, the notation \(w_{jk}\) refers to the weight connecting the \(j^{th}\) perceptron of a layer with the \(k^{th}\) perceptron of the previous one. This way we can write the following equations:

$$a_1 = g(x_1w_{11} + x_2w_{12} + x_3w_{13} + x_4w_{14} + b_1) = g(\sum_{k=1}^K x_kw_{1k} + b_1)$$ $$a_2 = g(x_1w_{21} + x_2w_{22} + x_3w_{23} + x_4w_{24} + b_2) = g(\sum_{k=1}^K x_kw_{2k} + b_2)$$ $$a_3 = g(x_1w_{31} + x_2w_{32} + x_3w_{33} + x_4w_{34} + b_3) = g(\sum_{k=1}^K x_kw_{3k} + b_3)$$

We can write a more general equation for the outputs of the right layer as follows:

$$a_j = g(\sum_{k=1}^K x_kw_{jk} + b_j)$$

for \(j\) between 1 and \(J\), where \(J\) and \(K\) are the number of perceptrons in the right- and left-hand layers, respectively. In this case, \(J = 3\) and \(K = 4\). We can also summarize this in matrix notation as follows:

$$ a = g\left(\left[ \begin{array}{c} x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1 \\ x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2 \\ x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3 \\ \end{array}\right]\right) $$

$$ = g\left(\underbrace{\left[\begin{array}{cccc} w_{11} & w_{12} & w_{13} & w_{14}\\ w_{21} & w_{22} & w_{23} & w_{24}\\ w_{31} & w_{32} & w_{33} & w_{34}\\ \end{array}\right]}_{w} \cdot \underbrace{\left[\begin{array}{c} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \end{array}\right]}_{x} + \underbrace{\left[\begin{array}{c} b_1 \\ b_2 \\ b_3 \\ \end{array}\right]}_{b}\right) $$

Thus,

$$ a = g (w \cdot x + b) $$

where \(a\) is the vector whose elements are the outputs of the right-hand layer; \(w\) is the matrix of weights connecting both layers, whose element in row \(j\) and column \(k\) is the weight connecting the perceptron \(k\) in the left-hand layer with the perceptron \(j\) in the right-hand layer; \(x\) is the vector whose elements are the outputs of the left-hand layer; and \(b\) is the vector of biases of the right-hand layer.
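As a sanity check of the matrix form, the whole layer can be computed in a couple of lines with NumPy. The numbers below are arbitrary placeholders chosen only to match the shapes \(J = 3\) and \(K = 4\), and the step function stands in for a generic \(g\):

```python
import numpy as np

def step(z):
    """Element-wise step activation, standing in for g."""
    return (z >= 0).astype(float)

# Arbitrary example values, only meant to match the shapes in the text
w = np.array([[ 0.2, -0.5,  0.1,  0.7],   # J x K = 3 x 4 weight matrix
              [-0.3,  0.8, -0.6,  0.4],
              [ 0.5,  0.1,  0.9, -0.2]])
x = np.array([1.0, 0.0, 1.0, 1.0])        # K = 4 outputs from the left layer
b = np.array([0.1, -0.2, 0.3])            # J = 3 biases of the right layer

a = step(w @ x + b)                       # a = g(w . x + b)
print(a)                                  # three outputs, one per right-hand perceptron
```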

Final remarks

Notice that we have been using the notation \(g()\) to represent the activation function acting on the weighted sum \(w \cdot x + b\). Precisely, this will be the topic of the next entry of this blog, where we will review the concept of the activation function. But let us briefly introduce why it is important for NNs. If you review what we have written, you will notice that the equations rely on pure linear combinations of the form \(w \cdot x + b\). If there were no activation functions, the final output of the NN would end up being a simple linear transformation of the input data (input layer). Is that what we need when we try to model nature? Think about that and you should conclude that, in some way, we need to achieve a non-linear transformation of our inputs. And here is where activation functions come into play.
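A tiny numerical sketch of this last point: if \(g\) were just the identity (i.e., no activation at all), stacking two layers would be exactly equivalent to a single linear layer whose parameters we could compute in advance, so the extra depth would buy us nothing. The values below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation (g is the identity)
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
w2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Passing x through both layers...
two_layers = w2 @ (w1 @ x + b1) + b2

# ...is identical to a single linear layer with combined parameters
w_combined = w2 @ w1
b_combined = w2 @ b1 + b2
one_layer = w_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))  # True: the composition is still linear
```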