How a neural network works: algorithms, training, activation and loss functions


Hello to all Habrahabr readers! In this article I want to share my experience of studying neural networks and, as a result, implementing them in Java on the Android platform.

My acquaintance with neural networks happened when the Prisma application was released. It processes any photo using neural networks and reproduces it from scratch in a chosen style. Having become interested in this, I rushed to look for articles and tutorials, primarily on Habr. And to my great surprise, I did not find a single article that clearly and step by step described the algorithm behind neural networks. The information was scattered and missing key points. Also, most authors rush to show code in one programming language or another without giving detailed explanations. Therefore, now that I have mastered neural networks reasonably well and found a huge amount of information on various foreign portals, I would like to share it in a series of publications, collecting everything you will need if you are just starting to get acquainted with neural networks. In this article I will not place a strong emphasis on Java and will explain everything with examples, so that you can transfer it to any programming language you need. In subsequent articles I will talk about my application, written for Android, which predicts the movement of stocks and currencies. In other words, everyone who wants to plunge into the world of neural networks and craves a simple and accessible presentation of information, or simply wants to clear up something they did not fully understand, is welcome under the cut.

My first and most important discovery was the playlist of the American programmer Jeff Heaton, in which he explains in detail and clearly the principles of operation of neural networks and their classification. After watching this playlist, I decided to create my own neural network, starting with the simplest example. You probably know that when you first start learning a new language, your first program is Hello World; it is a kind of tradition. The world of machine learning has its own Hello World, and it is a neural network that solves the exclusive OR (XOR) problem. The XOR truth table looks like this:

a   b   c
0   0   0
0   1   1
1   0   1
1   1   0

Accordingly, the neural network takes two numbers (a and b) as input and must produce another number as output: the answer (c). Now about neural networks themselves.
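As a minimal sketch in Java (variable names are illustrative), the training data for this problem might look like this:

    // XOR training set: each row of inputs holds two numbers (a, b),
    // and ideal holds the expected answer (c) for that row.
    double[][] inputs = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };
    double[]   ideal  = {  0,      1,      1,      0     };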

What is a neural network?

A neural network is a sequence of neurons connected by synapses. The structure of a neural network came to the world of programming straight from biology. Thanks to this structure, the machine gains the ability to analyze and even remember various information. Neural networks are capable not only of analyzing incoming information, but also of reproducing it from their memory. (For those interested, be sure to watch two videos from TED Talks: Video 1 and Video 2.) In other words, a neural network is a machine interpretation of the human brain, which contains millions of neurons transmitting information in the form of electrical impulses.

Partial derivatives

Partial derivatives can be calculated, so we know what the contribution to the error was for each weight. The need for derivatives is obvious. Imagine a neural network trying to find the optimal speed for a self-driving car. If the car detects that it is driving faster or slower than the required speed, the neural network will change the speed, accelerating or slowing down the car. And what tells it by how much to accelerate or slow down? Derivatives of the speed.

Let's look at the need for partial derivatives using an example.

Suppose children were asked to throw a dart at a target, aiming for the center. Here are the results:

Now, if we found the overall error and simply subtracted it from all the weights equally, we would punish everyone for the mistakes made by a few. Say one child hits too low; if we ask all the children to aim higher in response, this leads to the following picture:

The error of a few children may decrease, but the overall error still increases.

Having found the partial derivatives, we find out the error corresponding to each weight separately. If we correct the weights selectively, we get the following:
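For illustration, here is a sketch of how each weight's individual contribution could be estimated numerically in Java; the error function passed in is assumed, and real networks obtain these derivatives analytically during backpropagation:

    import java.util.function.ToDoubleFunction;

    // Estimate the partial derivative dE/dw_i for every weight:
    // nudge one weight slightly, see how the error changes, restore it.
    static double[] numericGradient(double[] w, ToDoubleFunction<double[]> error) {
        double h = 1e-6;                       // small step
        double base = error.applyAsDouble(w);  // current overall error
        double[] grad = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double old = w[i];
            w[i] = old + h;
            grad[i] = (error.applyAsDouble(w) - base) / h; // this weight's own contribution
            w[i] = old;                        // restore the weight
        }
        return grad;
    }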

What are neural networks for?

Neural networks are used to solve complex problems that require analytical calculations similar to what the human brain does.
The most common applications of neural networks are:

Classification

Distribution of data by parameters. For example, you are given a set of people as input, and you need to decide which of them to give credit to and which not. This work can be done by a neural network analyzing information such as age, solvency, credit history, and so on.

Prediction

The ability to predict the next step. For example, predicting the rise or fall of a stock based on the situation in the stock market.

Recognition

Currently the most widespread use of neural networks. It is used in Google when you search by image, or in phone cameras, which detect the position of your face and highlight it, and much more.

Now, to understand how neural networks work, let's take a look at their components and their parameters.

Advantages and disadvantages of a neural network

The advantages of an artificial neural network include:

  • Tolerance to input noise. To understand what we are talking about, imagine a large stadium with spectators. Music plays around the perimeter; people talk, laugh, shout. At the same time, you are talking with a friend in the stands: you hear extraneous voices and sounds, but your brain sweeps them away, allowing you to concentrate only on the conversation. ANNs have a similar quality: having undergone training, a neural network perceives only what is needed, sweeping away everything unnecessary, i.e. extraneous noise.

What is a neuron?


A neuron is a computational unit that receives information, performs simple calculations on it, and transmits it further. Neurons are divided into three main types: input (blue), hidden (red), and output (green). There is also a bias neuron and a context neuron, which we will talk about in the next article. When a neural network consists of a large number of neurons, the term layer is introduced. Accordingly, there is an input layer that receives information, n hidden layers (usually no more than 3) that process it, and an output layer that outputs the result. Each neuron has two main fields: input data and output data. For an input neuron, input = output. For the rest, the input field contains the summed information from all neurons of the previous layer, after which it is normalized using the activation function (for now, just imagine it as f(x)) and ends up in the output field.
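A sketch of this computation in Java (here sigmoid stands in for f(x); activation functions are covered below):

    // A neuron's input is the weighted sum of the outputs of the
    // previous layer; its output is that sum passed through f(x).
    static double neuronOutput(double[] prevOutputs, double[] weights) {
        double input = 0;
        for (int i = 0; i < prevOutputs.length; i++) {
            input += prevOutputs[i] * weights[i];
        }
        return sigmoid(input); // the activation function f(x)
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }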


It is important to remember

that neurons operate with numbers in the range [0,1] or [-1,1]. But how, you ask, do we then process numbers that fall outside this range? At this point, the simplest answer is to divide 1 by the number. This process is called normalization, and it is used very often in neural networks. More on this a little later.
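In code, this simplest scheme might look as follows (a sketch; other normalization schemes, such as min-max scaling, are also widely used):

    // Map a number that falls outside [0,1] into the range by taking
    // its reciprocal, e.g. 500 -> 1/500 = 0.002 (assumes x > 1).
    static double normalize(double x) {
        return 1.0 / x;
    }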

Let's move on to training

To backpropagate the error, you need to know the values of the outputs and inputs, as well as the values of the derivatives of the network's activation function, layer by layer. Therefore, you need to create a LayerT structure with three vector values:

  • x is the layer input,
  • z is the layer output,
  • df is the derivative of the activation function.

For each layer we will also need delta vectors, so we add them to the class as well. As a result, the class will look like this:
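A minimal sketch of such a class in Java, following the description above (field names match the bullet list; delta holds the per-layer error vector):

    // One layer's bookkeeping for backpropagation.
    class LayerT {
        double[] x;     // layer input
        double[] z;     // layer output
        double[] df;    // derivative of the activation function at each neuron
        double[] delta; // error deltas for this layer

        LayerT(int size) {
            x = new double[size];
            z = new double[size];
            df = new double[size];
            delta = new double[size];
        }
    }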

What is a synapse?


A synapse is a connection between two neurons.
Synapses have one parameter: weight. Thanks to it, input information changes as it is transmitted from one neuron to another. Say there are three neurons transmitting information to the next one; then we have three weights, one for each of these neurons. The information coming through the synapse with the greater weight will be dominant in the next neuron (example: color mixing). In fact, the set of weights of a neural network, the weight matrix, is a kind of brain of the entire system. It is thanks to these weights that input information is processed and turned into a result.

It is important to remember that during the initialization of the neural network, the weights are assigned randomly.
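For example, a sketch in Java (the layer sizes neurons and prevNeurons are assumed; the range [-1,1] is one common choice):

    // Weights are assigned randomly when the network is initialized.
    int neurons = 3, prevNeurons = 3;          // illustrative layer sizes
    java.util.Random rnd = new java.util.Random();
    double[][] weights = new double[neurons][prevNeurons];
    for (int i = 0; i < neurons; i++) {
        for (int j = 0; j < prevNeurons; j++) {
            weights[i][j] = rnd.nextDouble() * 2 - 1; // uniform in [-1, 1]
        }
    }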

History

What is the history of the development of neural networks in science and technology? It originates with the advent of the first computers, or ECMs (electronic computing machines), as they were called in those days. Back in the late 1940s, Donald Hebb developed a neural network learning mechanism, which laid down the rules for teaching these "proto-computers."

The further chronology of events was as follows:

  • In 1954, the first practical use of neural networks in computer operation took place.
  • In 1958, Frank Rosenblatt developed a pattern recognition algorithm and the mathematical notation for it.
  • In the 1960s, interest in the development of neural networks faded somewhat due to the weak computing power of the time.
  • It was revived again in the 1980s: during this period, systems with a feedback mechanism appeared and self-learning algorithms were developed.
  • By 2000, computing power had grown enough to make the wildest dreams of scientists of the past come true. At this time, voice recognition programs, computer vision, and much more appeared.

How does a neural network work?


This example shows part of a neural network, where the letters I denote input neurons, the letter H denotes a hidden neuron, and the letter w denotes weights. The formula shows that the input information is the sum of all input data multiplied by their corresponding weights. Now let's give 1 and 0 as input, and let w1 = 0.4 and w2 = 0.7. The input of neuron H1 will then be: 1*0.4 + 0*0.7 = 0.4. Now that we have the input, we can get the output by plugging the input into the activation function (more on that later). Once we have the output, we pass it on. And so we repeat for all layers until we reach the output neuron. Having launched such a network for the first time, we will see that the answer is far from correct, because the network is not trained. To improve the results, we will train it. But before we learn how to do this, let's introduce a few terms and properties of a neural network.
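The same forward step in code (a sketch; sigmoid here is the activation function discussed below):

    double i1 = 1, i2 = 0;                // the inputs from the example
    double w1 = 0.4, w2 = 0.7;            // the weights from the example
    double h1input  = i1 * w1 + i2 * w2;  // 1*0.4 + 0*0.7 = 0.4
    double h1output = sigmoid(h1input);   // ≈ 0.5987 with sigmoid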

Activation function

An activation function is a way of normalizing input data (we talked about this earlier). That is, if you have a large number at the input, passing it through the activation function will give you an output in the range you need. There are quite a lot of activation functions, so we will consider the most basic ones: linear, sigmoid (logistic), and hyperbolic tangent. Their main difference is their range of values.

Linear function


f(x) = x. This function is almost never used, except when you need to test a neural network or pass a value through without transformation.

Sigmoid

f(x) = 1/(1 + e^(-x)). This is the most common activation function, and its range of values is [0,1]. Most of the examples on the web use it, and it is also sometimes called the logistic function. Accordingly, if in your case there are negative values (for example, stocks can go not only up but also down), then you will need a function that also captures negative values.

Hyperbolic tangent

f(x) = (e^(2x) - 1)/(e^(2x) + 1). It only makes sense to use the hyperbolic tangent when your values can be both negative and positive, since the range of the function is [-1,1]. It is not advisable to use this function with only positive values, as this will significantly worsen the results of your neural network.
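For reference, all three functions in Java (Math.tanh is part of the standard library):

    static double linear(double x) {
        return x;                          // f(x) = x, unbounded
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x)); // range [0, 1]
    }

    static double tanh(double x) {
        return Math.tanh(x);               // range [-1, 1]
    }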

An overview of neural network architectures

With new neural network architectures appearing all the time, it is becoming increasingly difficult to keep track of them all. Knowing all the networks behind the acronyms (DCIGN, BiLSTM, DCGAN, and so on) can seem impossible at the beginning.

The following chart shows the most commonly used models (mainly neural networks, plus a few other models). Although these architectures are all new and unique, drawing their node structures makes the underlying relationships between them clearer.

Obviously, these node graphs cannot show the internal workings of each model. For example, the node graphs of a variational autoencoder (VAE) and an autoencoder (AE) look similar, but their training processes are actually completely different, and the use cases for the trained models differ even more: a VAE is a generator into which you insert noise to obtain new samples, while an AE simply maps whatever input it receives to the closest training sample in its "memory." This article does not go into detail about how each individual architecture works internally.

Although most of the abbreviations are generally accepted, some conflicts exist. For example, RNN usually refers to recurrent neural networks, but sometimes to recursive neural networks, and in many places it simply refers to various recurrent architectures (including LSTMs, GRUs, and even bidirectional variants). AEs have the same problem: VAEs and DAEs are often simply called AEs. In addition, abbreviations of the same model can differ in the number of trailing Ns: the same model can be called a convolutional neural network or a convolutional network, so the abbreviation becomes CNN or CN.

It is almost impossible to treat this article as a complete list of neural networks, because new architectures are being invented all the time, and even already published architectures can be hard to find. So this article may give you some insight into the world of artificial intelligence, but definitely not all of it, especially if you are reading it many years after it was published.

For each architecture depicted in the chart, this article gives a very brief description. You may find some of them useful even if you are already familiar with a few architectures.

Perceptron (P) and Feedforward Neural Network (FF or FFNN). Very intuitive: they feed information in at the front end and output the result at the back end. Neural networks are often described as having layers (input, hidden, or output), where each layer consists of parallel units. Typically a layer has no connections within itself, and two adjacent layers are fully connected (every neuron in one layer is connected to every neuron in the other). The simplest practical network has two input units and one output unit, which can be used to model logic gates. FFNNs are typically trained using backpropagation, with a dataset consisting of paired inputs and outputs (this is called supervised learning). We give the network an input and let it produce an output, and the backpropagated error is usually some variation of the difference between the produced output and the expected output (such as MSE or just the linear difference). Given enough hidden neurons, the network can theoretically always model the relationship between input and output. In practice, FFNNs on their own are of limited use, and they are usually combined with other networks to form new ones.

Radial Basis Function (RBF) Network. This is an FFNN that uses radial basis functions as activation functions. However, RBF networks have their own use cases distinct from FFNNs (most FFNNs with other activation functions do not get their own name, largely for historical reasons).

Hopfield Network (HN). Every neuron is connected to every other neuron; the structure resembles a plate of fully entangled spaghetti. Each node serves as an input before training, is hidden during training, and serves as an output afterwards. The network is trained by setting the values of the neurons to the desired pattern, after which the weights remain unchanged. Once trained on one or more patterns, the network will always converge to one of the learned patterns, because the network is only stable in those states. Note that an HN does not always converge to the desired state. The network becomes stable in part because the overall "energy" or "temperature" is gradually reduced during training. Each neuron has an activation threshold that scales with this temperature; once the sum of its inputs exceeds it, the neuron takes one of two states (usually -1 or 1, sometimes 0 or 1). Updating the network can be done synchronously or one neuron at a time; the latter is more common. When updating one by one, a fair random sequence is generated and each unit is updated in that prescribed order. When every unit has been updated and none of them changes anymore, the network is stable (has converged). These networks are also called associative memories, because they converge to the state most similar to the input: if people see half of a table, we can imagine the other half. Similarly, if the input is half noise and half a table, the HN will converge to the table.

Markov Chain (MC, or Discrete Time Markov Chain, DTMC). This is the predecessor of BMs and HNs. It can be understood as follows: from my current node, what are the odds of going to any of my neighboring nodes? Markov chains are memoryless: the node you move to next depends entirely on the current node and has no relation to past nodes. Although this is not a true neural network, it resembles one and forms the theoretical basis of BMs and HNs. Like BMs, RBMs, and HNs, MCs are not always considered neural networks. Also, Markov chains are not always fully connected.

Boltzmann Machine (BM). Much like an HN, except that only some neurons are marked as input neurons and the others remain "hidden." The input neurons become output neurons at the end of a full network update. It starts with random weights and is trained through backpropagation or contrastive divergence (where a Markov chain is used to determine the gradients between two informational gains). Compared to an HN, the neurons of a BM mostly have binary activation patterns. Being trained via Markov chains, BMs are stochastic networks. The training and running process of a BM is quite similar to an HN: the input neurons are clamped to certain values, after which the network is set free. While free, the cells can take on any value, and we repeatedly iterate between the input and hidden neurons. Activation is controlled by a global temperature value; gradually lowering it reduces the global energy, which causes the network to eventually reach an equilibrium.

Restricted Boltzmann Machine (RBM). Very similar to a BM, and therefore also to an HN. The biggest difference between a BM and an RBM is that the RBM is more usable because it is more restricted. An RBM does not connect every neuron to every other neuron, but only connects each group of neurons to every other group, so no input neuron is directly connected to another input neuron, and no hidden neuron is directly connected to another hidden neuron. RBMs can be trained like FFNNs with a twist: instead of passing data forward and then backpropagating the error, you pass the data forward and then pass the data backward.

Autoencoders (AE). Similar to an FFNN; rather than a completely different network structure, it is a different application of feedforward neural networks. The basic idea of an autoencoder is to encode information automatically (in the sense of compression, not encryption), hence the name. The entire network resembles a funnel in shape: its hidden layers always have fewer units than the input and output layers. An autoencoder is always symmetric about the central layer (one or two central layers, depending on the number of layers in the network: an odd count means symmetry around the middle layer, an even count around the middle two). The smallest hidden layer is always the central one, where the information is most compressed; it is called the bottleneck of the network. Everything from the input layer to the center is called the encoding part, everything from the center to the output layer is the decoding part, and the central layer itself is the code. Autoencoders can be trained with backpropagation by feeding data in and setting the error to the difference between the input and the network's output. The weights of an autoencoder can also be made symmetric, i.e. the encoding weights are the same as the decoding weights.

Sparse Autoencoders (SAE). In a way the opposite of autoencoders. Instead of training a network to represent a bunch of information in a lower-dimensional space with fewer nodes, here we try to encode information in a higher-dimensional space. So at the central layer the network does not narrow, but expands. This type of network can be used to extract features from a dataset. If we trained a sparse autoencoder the way we train an ordinary autoencoder, we would in almost all cases end up with a completely useless identity network (whatever comes in comes out unchanged, without any transformation or decomposition). To avoid this, a sparsity driver is added to the error feedback process. This sparsity driver can take the form of threshold filtering: only certain errors are propagated and trained on, while the other errors are treated as irrelevant to training and set to zero. In some ways this resembles a spiking neural network, where not all neurons fire at every moment (which has some biological plausibility).

Variational Autoencoders (VAE). It has the same network structure as an autoencoder, but it learns something else: an approximate probability distribution of the input samples. In this respect it is closer to Boltzmann Machines (BM) and Restricted Boltzmann Machines (RBM). However, VAEs rely on Bayesian mathematics, involving probabilistic inference and independence, as well as a reparameterization trick to obtain the different representations. Probabilistic inference and independence are intuitive, but they rely on sophisticated mathematics. The basic principle is this: take influence into account. If something happens in one place and something else happens in another place, they are not necessarily related; and if they are not related, the backpropagation process should take this into account. This approach is useful because a neural network is (to some extent) a large graph, so when going into deeper layers you can rule out the influence of some nodes on other nodes.

Denoising Autoencoders (DAE). This is an autoencoder where, instead of feeding in the raw data, we feed in the data with noise added (for example, making an image grainier). Nevertheless, we calculate the error the same way as before: the network's output is compared to the original, noise-free input. This encourages the network to learn not fine details but broader features, because small features can turn out to be "wrong," as they constantly change with the noise.

Deep Belief Networks (DBN). This is a stacked architecture of restricted Boltzmann machines or variational autoencoders. These networks have proven to be effectively trainable stack by stack, where each autoencoder or Boltzmann machine only needs to learn to encode the output of the previous network. This technique is also called greedy training; greedy here means obtaining a locally optimal solution at each step of the descent, which may not be the globally optimal solution. Deep belief networks can be trained using contrastive divergence or backpropagation, and they learn to represent the data as a probabilistic model, just like a regular restricted Boltzmann machine or a variational autoencoder. Once the model is trained or brought into a (more) stable state through unsupervised learning, it can be used to generate new data. If trained with contrastive divergence, it can even classify existing data, because the neurons have learned to look for different features.

Convolutional Neural Networks (CNN, or Deep Convolutional Neural Networks, DCNN). Quite different from most other networks. They are primarily used for image processing, but can also be applied to other types of input, such as audio. A typical application is image classification: feed the network an image, and it labels it, for example displaying "cat" for a picture of a cat and "dog" for a picture of a dog. Convolutional networks typically use an input "scanner" rather than analyzing all of the training data at once. For example, to input a 200 x 200 pixel image, you do not need an input layer with 40,000 nodes. Instead, you create a scanning input layer of, say, 20 x 20 nodes and feed it the first 20 x 20 pixels of the image (usually starting from the top left corner). Once you have passed that 20 x 20 block (and possibly trained on it), you feed in the next block: you move the "scanner" one pixel to the right, not 20 pixels (or whatever the scanner width is); you are not chopping the image into separate 20 x 20 pieces, but crawling over it. This input is then passed through convolutional layers instead of normal layers. Nodes in a convolutional layer are not fully connected; each node is connected only to its neighboring cells (how close depends on the implementation, but usually no more than a few). These convolutional layers tend to shrink as the network gets deeper, usually by a factor that divides the input easily (so if the input is 20, the next convolutional layer might be 10, and the next 5). Powers of two are frequently used, because they divide cleanly: 32, 16, 8, 4, 2, 1. In addition to convolutional layers, there are also feature pooling layers. Pooling is a way of filtering out details: the most common pooling method is max pooling, where, for example, from a 2 x 2 block of pixels we take the one with the largest value. To apply a convolutional network to audio, feed in short clip-length chunks of the audio wave. Real-world applications of convolutional networks usually attach an FFNN at the end for further data processing, which allows for nonlinear feature mapping. Such networks are called DCNNs, but the names and abbreviations are often used interchangeably.
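As an illustration of the max pooling step described above, a sketch in Java (assuming even image dimensions):

    // Each output pixel is the largest of a 2 x 2 block of input
    // pixels, halving the width and height of the feature map.
    static double[][] maxPool2x2(double[][] in) {
        int h = in.length / 2, w = in[0].length / 2;
        double[][] out = new double[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out[y][x] = Math.max(
                    Math.max(in[2 * y][2 * x],     in[2 * y][2 * x + 1]),
                    Math.max(in[2 * y + 1][2 * x], in[2 * y + 1][2 * x + 1]));
            }
        }
        return out;
    }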

Deconvolutional Networks (DN). Also called inverse graphics networks (IGN), these are reversed convolutional neural networks. Imagine feeding the word "cat" into the network and training it by comparing its output to a real image of a cat, so that in the end it produces a picture that looks like a cat. Deconvolutional networks can be combined with FFNNs, just like regular convolutional networks, but this is where new abbreviations get drawn: they might be called deep deconvolutional neural networks, though you could argue that attaching an FFNN before or after a DN yields yet another architecture deserving a new name. It is worth noting that in real applications you cannot feed raw text into the network; instead you feed a binary classification vector, for example <0,1> for cat, <1,0> for dog, and <1,1> for cat and dog. Where a convolutional network has pooling layers, a deconvolutional network replaces them with similar inverse operations, usually biased interpolation or extrapolation (for example, if the pooling layer used max pooling, the inverse operation can generate new values lower than the maximum to fill the block).

Deep Convolutional Inverse Graphics Networks (DCIGN). This name is somewhat misleading, because these are actually variational autoencoders (VAEs) whose encoder and decoder are, respectively, a convolutional neural network (CNN) and a deconvolutional network (DN). These networks attempt to model "features" probabilistically during encoding. That way, the network can learn to produce a photo with a cat and a dog together, even if it has only ever seen them in separate photos. Likewise, you could feed it a photo of a cat with an annoying neighbor's dog next to it and ask the network to remove the dog, without it ever having performed such an operation. Experiments show that these networks can also learn to perform complex transformations on images, such as changing the light source of a 3D object or rotating it. These networks are usually trained with backpropagation.

Generative Adversarial Networks (GAN). This is a newer kind of network, and the networks come in pairs: two networks working together. A GAN can be composed of any two networks (though it is typically a pair of FFNNs and CNNs), with one network tasked with generating content and the other with judging content. The discriminating network receives either training data or content produced by the generating network; how well it predicts the source of the data is then used as part of the error of the generating network. This creates a kind of competition: the discriminator gets better and better at distinguishing real data from generated data, and the generator learns to produce data that the discriminator finds harder to recognize. GANs work relatively well, in part because even complex noise-like patterns eventually become predictable, while generated content with features similar to the input data is much harder to tell apart. GANs are difficult to train: not only do you have to train two networks (each of which has its own problems), you also have to maintain the dynamic balance between them. If the discriminating or the generating side becomes much better than the other, the network will never converge.

Recurrent Neural Networks (RNN). These are feedforward neural networks that take time into account: they are not stateless, and there are connections between passes, connections through time. Neurons receive information not only from the previous layer, but also from themselves on the previous pass. This means that the order in which you feed in data and train the network matters: feeding it "milk" then "cookies" can produce different results than feeding it "cookies" then "milk". The biggest problem with RNNs is the vanishing (or exploding) gradient problem, which depends on the activation function used: information rapidly gets lost over time, just as information is lost in very deep FFNNs as depth increases. Intuitively this does not seem like a big problem, because these are just weights, not neuron states, but over time the weights are where past information is stored; if a weight reaches 0 or 1,000,000, the previous state becomes uninformative. RNNs can be applied in many areas: although most data does not have an intrinsic time axis (unlike audio and video), it can be represented as a sequence. An image or a line of text can be fed in one pixel or one character at a time, so the time-dependent weights are used for what came earlier in the sequence, not for what happened some seconds earlier. In general, recurrent networks are a good option for tasks that advance or complete information, such as auto-completion.

Long Short-Term Memory (LSTM). These networks try to combat the vanishing or exploding gradient problem by introducing gates and an explicitly defined memory cell. The idea is mostly inspired by circuit design rather than biology. Each neuron has a memory cell and three gates: input, output, and forget. The function of these gates is to safeguard information by stopping or allowing its flow. The input gate determines how much information from the previous layer gets stored in the cell. The output gate works at the other end and determines how much of the cell's state the next layer gets to see. The forget gate seems strange at first, but sometimes it is necessary to forget:

if the network is studying a book and starts a new chapter, it may be necessary to forget some characters from the previous chapter.

LSTMs have proven able to learn complex sequences, such as writing in the style of Shakespeare or composing simple music. It is worth noting that each of these gates has a weight on the memory cell of the previous neuron, so LSTMs typically require more resources to run.

Gated Recurrent Units (GRU). A variant of the LSTM. The difference is that instead of an input gate, an output gate, and a forget gate, a GRU has an update gate and a reset gate. The update gate determines how much information is kept from the previous state and how much is let in from the previous layer. The reset gate works much like the forget gate of an LSTM, but it sits in a slightly different place. GRUs always send out their full state; they have no output gate. In most cases they function very much like LSTMs, the biggest difference being that GRUs are slightly faster and easier to run (but also slightly less expressive). In practice these tend to cancel each other out: when you need a bigger network to regain expressiveness, the performance advantage is often lost. In cases where the extra expressiveness is not needed, GRUs can outperform LSTMs.

Neural Turing Machines (NTM). This can be understood as an abstraction of an LSTM and an attempt to un-black-box neural networks (to give us insight into what is going on inside). Instead of encoding a memory cell directly into a neuron, the memory of a neural Turing machine is kept separate. It attempts to combine the efficiency and permanence of regular digital storage with the efficiency and expressiveness of neural networks. The idea is to have a content-addressable memory bank that a neural network can read from and write to. The "Turing" in neural Turing machine comes from Turing completeness: the ability to read, write, and change state depending on what it reads, meaning it can express everything a universal Turing machine can express.

Bidirectional recurrent neural networks (BiRNN), bidirectional long short-term memory networks (BiLSTM), and bidirectional gated recurrent units (BiGRU) are not shown in the chart, because they look the same as their one-directional counterparts. The difference is that these networks are connected not only to the past, but also to the future. For example, a one-directional LSTM might be trained to predict the word "fish" by being fed the letters one by one, where the recurrent connections through time remember the last value. A BiLSTM would also be fed the next letter on the backward pass, giving it access to future information. This trains the network to fill in gaps instead of predicting the future; in image processing, for instance, it can fill in a gap in the middle of an image rather than expand the image at its border.

Deep Residual Networks (DRN). These are very deep FFNNs with extra connections that pass input from one layer to a layer several levels further ahead (typically 2 to 5 layers), in addition to the adjacent-layer connections. Instead of mapping an input to an output through, say, a 5-layer network, a residual network learns to map the input to some output plus that input. Basically, it adds an identity shortcut, carrying the old input along as fresh input to a later layer. The results show that these networks are very effective at learning patterns even at depths of up to 150 layers, far more than the typical 2 to 5 layers. However, it has been argued that these networks are in essence RNNs without the explicit time-based construction, and they are often compared to LSTMs without gates.

Echo State Networks (ESN). Yet another type of (recurrent) network. This one differs in that the neurons are connected randomly (there is no neat set of layers), and they are trained differently. Instead of feeding data in and backpropagating the error, an ESN feeds the data in, forwards it, and updates the neurons for a while, observing the output over time. The input and output layers play a slightly unusual role here: the input layer is used to prime the network, and the output layer observes the activation patterns that unfold over time. During training, only the connections between the hidden units and the output units are changed.

Extreme Learning Machines (ELM). Essentially an FFNN with random connections. It resembles liquid state machines (LSM) and echo state networks (ESN), but it has neither spikes nor recurrence. ELMs do not use backpropagation: instead, they initialize the weights randomly and train the output weights in a single step by least-squares fitting (the smallest error across all functions). The result is a much less expressive model, but one that is much faster than backpropagation.

Liquid State Machines (LSM). Very similar to an ESN. The real difference is that an LSM is a kind of spiking neural network: the sigmoid activation functions are replaced with threshold functions, and each neuron is an accumulating memory cell. So when a neuron is updated, its value is not set to the sum of its neighbors; instead, it accumulates into itself, and once the threshold is reached, it releases its energy to the other neurons. This creates a spiking pattern: nothing happens until the threshold is suddenly reached.

Support Vector Machines (SVM). They find optimal solutions to classification problems. A classical SVM can handle linearly separable data, for example deciding which picture shows Garfield and which shows Snoopy, with no other outcomes possible. During training, an SVM can be thought of as plotting all the data points (Garfields and Snoopys) on a (2D) graph and figuring out how to draw a line that separates them, with all the Garfields on one side of the line and the Snoopys on the other. The best dividing line is the one that maximizes the margin between the points on either side and the line itself. Classifying new data is done by plotting the new point on the graph and simply checking which side of the line it falls on. Using the kernel trick, SVMs can be trained to classify n-dimensional data; this amounts to plotting points in a 3D (or higher) space, allowing the SVM to distinguish Snoopy, Garfield, and even more cartoon characters. SVMs are not always considered neural networks.

Kohonen Networks (KN, also called self-organizing (feature) maps, SOM or SOFM). They use competitive learning to classify data without supervision. The data is fed into the network, which then evaluates which of its neurons best matches that input. These neurons are then adjusted to match the input even better, dragging their neighbors along in the process. How much the neighbors move depends on their distance from the best-matching unit. Kohonen networks are sometimes not considered neural networks either.

Epoch

When the neural network is initialized, this value is set to 0 and has a manually set ceiling. The larger the epoch count, the better trained the network and, accordingly, the better its result. The epoch counter increases each time we go through the entire set of training sets, in our case 4 sets, i.e. 4 iterations.


It is important not to confuse iteration with epoch and to understand the sequence in which they increment. First the iteration counter increases n times, and only then the epoch counter, not vice versa. In other words, you cannot first train a neural network on only one set, then on another, and so on. You need to train on each set once per epoch. This way, you avoid errors in the calculations.
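In code, the increment order looks like this (a sketch: train() stands in for a single forward pass plus backpropagation on one training set, using the XOR data from the beginning of the article):

    for (int epoch = 0; epoch < maxEpochs; epoch++) {
        // all 4 XOR sets are seen once per epoch: 4 iterations, then the epoch ends
        for (int i = 0; i < inputs.length; i++) {
            train(inputs[i], ideal[i]); // one iteration
        }
    }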

Error

Error is a percentage value that reflects the discrepancy between the expected and received answers. The error is computed every epoch and must decline. If this does not happen, then you are doing something wrong. The error can be calculated in different ways, but we will consider only three main methods: Mean Squared Error (hereinafter MSE), Root MSE, and Arctan. There is no restriction on use like there is for the activation function, and you are free to choose whichever method gives you the best results. You just have to keep in mind that each method counts errors differently. With Arctan, the error will almost always be larger, since it works on the principle: the greater the difference, the greater the error. The Root MSE will have the smallest error, so it is most common to use MSE, which maintains balance in error calculation.

MSE

MSE = ((i1 - a1)^2 + ... + (in - an)^2) / n

Root MSE

Root MSE = sqrt(((i1 - a1)^2 + ... + (in - an)^2) / n)

Arctan

Arctan = (arctan^2(i1 - a1) + ... + arctan^2(in - an)) / n


Here i is the ideal (expected) answer, a is the actual answer of the network, and n is the number of training sets. The principle of calculating the error is the same in all cases: for each set, we subtract the actual result from the ideal answer. Then we either square this difference or take the squared arctangent of it, after which we divide the resulting sum by the number of sets.
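The same three methods in Java, where ideal[i] and actual[i] are the expected and received answers for set i (a direct transcription of the formulas above):

    static double mse(double[] ideal, double[] actual) {
        double sum = 0;
        for (int i = 0; i < ideal.length; i++) {
            double d = ideal[i] - actual[i];
            sum += d * d;                 // square the difference
        }
        return sum / ideal.length;        // divide by the number of sets
    }

    static double rootMse(double[] ideal, double[] actual) {
        return Math.sqrt(mse(ideal, actual));
    }

    static double arctan(double[] ideal, double[] actual) {
        double sum = 0;
        for (int i = 0; i < ideal.length; i++) {
            double a = Math.atan(ideal[i] - actual[i]);
            sum += a * a;                 // squared arctangent of the difference
        }
        return sum / ideal.length;
    }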
