Synaptic weights in neural networks - simple and accessible

  • Introduction
  • Objects and their characteristics
  • Supervised learning
  • Neuron
  • Network of neurons
  • Neuron as a hyperplane
  • Neuron Confidence
  • Neuron utility
  • Multiple outputs
  • When is a hidden layer needed?
  • Neurons as logic gates
  • Four XOR variants
  • Approximation of the function y=f(x)
  • Function approximation in Python
  • Fuzzy logic

Introduction

This document begins a series of materials devoted to neural networks. They are sometimes approached on the principle that, since the brain's neural network can solve any problem, a sufficiently large artificial neural network must be able to do so too. Most often, the network architecture and its training parameters become the subject of numerous experiments. The network ends up a black box whose inner workings are mysterious even to its creator.

We will try to combine empirical advice with a mathematical understanding of the nature of neurons as separating hyperplanes and as fuzzy-logic functions. First, various model examples in two-dimensional feature spaces will be considered. Our goal is to develop an intuitive understanding of network architecture choices. Later we will move on to multidimensional problems, image recognition, and convolutional and recurrent networks. The examples below can be reproduced while practicing the selection of training parameters.

Hyperparameters

A neural network is used to automate feature selection, but some parameters are manually configured.

learning rate

The learning rate is a very important hyperparameter. If the learning rate is too low, then even after training the neural network for a long time, it will remain far from optimal results.

On the other hand, if the learning rate is too high, each update overshoots the minimum, and the network never settles on good answers.
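The effect of the learning rate can be seen on a toy problem. Below is a minimal sketch (the function, rates, and step counts are invented for the demo) that minimizes f(w) = (w − 3)² by gradient descent at three different rates:

```python
# Hypothetical illustration: gradient descent on f(w) = (w - 3)^2,
# whose minimum is at w = 3, with different learning rates.

def descend(lr, steps=50, w=0.0):
    """Run gradient descent and return the final weight."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad
    return w

too_low  = descend(lr=0.001)  # creeps toward 3, still far away after 50 steps
good     = descend(lr=0.1)    # converges very close to 3
too_high = descend(lr=1.1)    # overshoots further each step and diverges
```

With the too-low rate the weight barely moves; with the too-high rate the error grows every step instead of shrinking.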

Objects and their characteristics

Let there be objects (bottles of wine, patients in a hospital, positions on a chessboard). Each object is characterized by a set (vector) of features x = {x1, x2, …, xn}. Features may be:

  • real (weight, height)
  • binary (woman/man)
  • non-numeric (red, blue,...)

Below we will consider features to be real numbers from the range [0…1]. This can always be achieved by normalization, for example x → (x − xmin)/(xmax − xmin). Binary features then take the value 0 or 1. Non-numeric features can be made binary (red/not red, blue/not blue). In addition, for now we will assume that objects are not causally related to each other and that their order is not significant.
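The min-max normalization mentioned above can be sketched in a few lines (the sample values are made up for illustration):

```python
# A minimal sketch of the normalization x -> (x - xmin) / (xmax - xmin).

def normalize(values):
    """Scale a list of real-valued features into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [150.0, 165.0, 180.0]   # hypothetical feature values
scaled = normalize(heights)        # smallest maps to 0.0, largest to 1.0
```

After this transformation every feature lies in [0…1], as assumed throughout the rest of the text.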

Let objects of this type be divided into classes (person: {healthy, sick}; wine: {Italian, French, Georgian}). Each object can also be associated with a number y (the degree of White's advantage in a chess position; the quality of a wine according to the average opinion of experts, etc.). The following two closely related problems are often solved:

  • Classification: which of the K classes the object belongs to.
  • Regression: what number y corresponds to the object.

Examples:

  • 1) 3 features: x = {temperature, hemoglobin level, cholesterol amount}; 2 classes: {0: healthy, 1: sick}.
  • 2) w*h features: x = {brightness of the pixels of an image with width w and height h}; 10 classes: {0-9: the digit in the picture}.
  • 3) 8*8*13 features: x = {codes of the chess pieces in the squares}; regression: {White's advantage over Black = [-1…1]}.

To successfully solve classification or regression problems, the features characterizing an object must be significant, and the feature vector must be complete (sufficient for classifying objects or determining the regression value y).

One day Jed and Ned wanted to tell their horses apart. Jed made a scratch on the horse's ear. But Ned's horse scratched the same ear on a thorn. Then Ned put a blue bow on his horse's tail, but Jed's horse ate it. Farmers thought long and hard and chose a trait that is not so easy to change. They carefully measured the height of the horses, and it turned out that Jed's black mare was one centimeter taller than Ned's white stallion.

Supervised learning

Let there be a set of objects, each belonging to one of k numbered classes (0, 1, 2, …, k−1). On this training set, provided by a “teacher” (usually a human), the system learns. Then, for objects unknown to the system (the test set), it performs classification, i.e. it tells which class each object belongs to. In this formulation, this is a problem of pattern recognition after supervised learning.

The number of features n is called the dimension of the feature space. Let the features lie in the range [0…1]. Then any object can be represented as a point inside the unit n-dimensional cube in feature space.

Let's imagine the recognition system as a black box. This box has n inputs, to which the feature values x = {x1, x2, …, xn} are fed, and k outputs y = {y1, …, yk} (one per class). We will also consider the output values to be real numbers from the range [0…1]. The system is considered correctly trained if, when features of an object of the i-th class are fed to the inputs, the value of the i-th output equals 1 and all others equal 0. In practice such a result is difficult to achieve, and all outputs turn out to be non-zero. Then the output with the maximum value is taken as the class number, and the proximity of this value to one indicates the system's “degree of confidence”.

When there are only two classes, the box can have a single output. If it equals 0, that is one class; if 1, the other. For fuzzy recognition, confidence thresholds are introduced. For example, if the output value lies in the range y = [0…0.3], it is the first class; if y = [0.7…1], the second; and if y = (0.3…0.7), the system “refuses to make a decision”.
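Both decision rules described above, argmax over k outputs and the refusal band for a single output, can be sketched as follows (the function names are invented, and the 0.3/0.7 thresholds follow the text):

```python
# Decision rules for the outputs of the black box.

def classify_argmax(outputs):
    """Return (class index, confidence) for k outputs in [0, 1]."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return best, outputs[best]

def classify_fuzzy(y, low=0.3, high=0.7):
    """Two-class rule with a refusal zone for a single output y."""
    if y <= low:
        return 0
    if y >= high:
        return 1
    return None  # the system "refuses to make a decision"
```

For example, `classify_argmax([0.1, 0.8, 0.2])` picks class 1 with confidence 0.8, while `classify_fuzzy(0.5)` refuses to decide.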

A box with one output can also approximate a function y = f(x1, …, xn) whose values are continuous and usually also normalized to unity, i.e. y = [0…1]. In this case the regression problem is solved.

History of neural networks

The history of neural networks is much longer than is commonly believed. The idea of a “thinking machine” dates back to ancient Greece, and the popularity of neural networks has waxed and waned over time. We will focus on the key events of their modern evolution:

1943: Warren McCulloch and Walter Pitts published “A Logical Calculus of the Ideas Immanent in Nervous Activity”. The purpose of this study was to understand the workings of the human brain, namely the creation of complex patterns through signals transmitted by brain cells, or neurons. One of the main ideas that emerged from it was the analogy between binary threshold neurons and Boolean logic (0/1 values, or true/false statements).

1958: Frank Rosenblatt, in “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”, described the perceptron model. He developed the ideas of McCulloch and Pitts by adding weighting coefficients to the formula. Using an IBM 704 computer, Rosenblatt was able to train the system to recognize cards marked on the left or on the right.

1974: Paul Werbos was the first scientist in the United States to describe, in his dissertation, the use of backpropagation in neural networks, although many researchers had been developing the idea.

1989: Yann LeCun published a paper describing the practical use of backpropagation, with constraints integrated into the network architecture, for training algorithms. In this study a neural network was successfully trained to recognize handwritten zip code digits provided by the US Postal Service.

Neuron

A neural network is one possible filling of the black box. A network node is a neuron with n inputs x = {x1, x2, …, xn} and one output y. Each input is associated with a real synaptic weight ω = {w1, w2, …, wn}. In addition, the neuron has a bias parameter w0. Thus, any neuron with n inputs is completely determined by n+1 parameters.

The output of a neuron is calculated as follows. The value of each input xi is multiplied by the corresponding synaptic weight wi, and these products are added together with the bias parameter w0. The result d is mapped into the range [0…1] by the nonlinear sigmoid function y = S(d):

d = w0 + w1 x1 + … + wn xn,    y = S(d) = 1/(1 + exp(−d)).

The sigmoid tends to 1 for large positive d and to 0 for large negative d. When d = 0, it equals S(0) = 0.5. Thus, a neuron is a nonlinear, threshold-like function of n variables.
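The formulas above translate directly into code; this sketch uses arbitrary sample weights:

```python
import math

# d = w0 + w1*x1 + ... + wn*xn,  y = S(d) = 1 / (1 + exp(-d))

def neuron(x, w, w0):
    """Compute the output of a single neuron with weights w and bias w0."""
    d = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-d))

y_mid  = neuron([0.0], [1.0], 0.0)     # d = 0  -> S(0) = 0.5
y_high = neuron([10.0], [10.0], 0.0)   # large positive d -> close to 1
y_low  = neuron([10.0], [-10.0], 0.0)  # large negative d -> close to 0
```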

activation function

The activation function is one of the most powerful tools available to a neural network. In part, it determines which neurons are activated, in other words, what information is passed on to subsequent layers.

Without activation functions, deep networks lose much of their learning ability: a stack of purely linear layers collapses into a single linear map. The nonlinearity of these functions is what gives the network its extra expressive power. The sigmoid, tanh, and ReLU are common examples.
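Minimal sketches of the activation functions just mentioned (only the sigmoid is used in the text above; the others are shown for comparison):

```python
import math

def sigmoid(d):
    """Squashes any real d into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-d))

def tanh(d):
    """Squashes any real d into (-1, 1)."""
    return math.tanh(d)

def relu(d):
    """Zero for negative inputs, identity for positive ones."""
    return max(0.0, d)
```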

Network of neurons

A network is a set of interconnected neurons. Different connection schemes, and therefore different network architectures, are possible. Let the neurons be arranged in layers, with the output values of the neurons of the i-th layer fed to the inputs of all neurons of the next, (i+1)-th, layer. Such a network is called a fully connected feed-forward network. We will denote the network inputs by squares and call them input neurons. Unlike “ordinary” neurons, an input neuron is simply the linear function y = x. The outputs of the neurons of the last layer are the outputs of the black box and are indicated by triangles.

In the first figure below, the network consists of three inputs (the zeroth layer) and two output neurons. We will code this architecture as [3,2], where the numbers are the numbers of neurons in each layer. The first number is always the number of inputs, and the last is the number of outputs. The next figure shows the network [2,3,1]. It contains one hidden layer with three neurons. It is hidden in the sense that it lies inside the black box (dotted line) between the input and output layers (the neurons of the output layer are also partially hidden: only their outputs “stick out”). The third figure shows a [2,3,3,2] network with two hidden layers.

These networks are called feed-forward networks because data (the features of an object) are supplied to the inputs and passed sequentially, without loops, to the outputs. The same family includes so-called convolutional networks, in which not all neurons of two adjacent layers are connected to each other (first figure below). Often the weights of all neurons in a convolutional layer are the same. Such networks will be discussed in more detail in connection with image recognition. The second figure below shows a network in which the concept of a layer is absent, yet it is still a feed-forward network.

The last picture is not a feed-forward network but a so-called recurrent network. In it, the signals of one or more output neurons are fed back to the inputs. Typically such recursion is run for several cycles until stationary values are established at the network outputs. Recurrent networks have memory, and the order of objects matters to them. This behavior is useful when objects are ordered in time (for example, when predicting time series).

Training any network consists of selecting the parameters w0, w1, …, wn of each neuron so that, for a given object (inputs x1, …, xn), the network outputs take values corresponding to the object's class.

Note that although a neuron always has only one output, it can be “supplied” to the inputs of different neurons. Similarly, in living neurons, the axon splits into separate processes, each of which affects the synapses (“junction points”) of the dendrites of other neurons. If a neuron is excited, then this excitation is transmitted along the axon to the dendrites of its neighbors.

Forward propagation

Let's set the initial weights randomly:

  • w1
  • w2
  • w3

Let's multiply the input data by weights to form a hidden layer:

  • h1 = (x1 * w1) + (x2 * w1)
  • h2 = (x1 * w2) + (x2 * w2)
  • h3 = (x1 * w3) + (x2 * w3)

The output from the hidden layer is passed through a nonlinear function (activation function) to obtain the network output:

  • y_ = fn(h1 , h2, h3)
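The steps above can be sketched in code. The text leaves fn unspecified, so this sketch assumes it sums the hidden values and applies a sigmoid; that choice, and the sample weights, are assumptions of the sketch, not the author's definition:

```python
import math

def sigmoid(d):
    return 1.0 / (1.0 + math.exp(-d))

def forward(x1, x2, w1, w2, w3):
    # Hidden layer, written exactly as in the formulas above.
    h1 = (x1 * w1) + (x2 * w1)
    h2 = (x1 * w2) + (x2 * w2)
    h3 = (x1 * w3) + (x2 * w3)
    # Assumed form of fn(h1, h2, h3): sum, then sigmoid.
    return sigmoid(h1 + h2 + h3)

y_ = forward(1.0, 0.0, w1=0.5, w2=-0.5, w3=0.0)
```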

Neuron as a hyperplane

To make the black box of the recognition system more transparent, let's consider the geometric interpretation of a neuron. In n-dimensional space, each point is specified by n coordinates (real numbers x = {x1, …, xn}). A plane (as in ordinary 3-dimensional space) is defined by its normal vector ω = {w1, …, wn} (perpendicular to the plane) and an arbitrary point x0 = {x01, …, x0n} lying in the plane. When n > 3, the plane is usually called a hyperplane.

The distance d from the hyperplane to some point x = {x1, …, xn} is calculated by the formula

d = w0 + w1 x1 + … + wn xn,    where w0 = −(w1 x01 + … + wn x0n).

Here d > 0 if the point x lies on the side of the plane toward which the vector ω points, and d < 0 if it lies on the opposite side. When d = 0, the point x lies in the plane. This is a key statement for what follows and is worth remembering.
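The signed distance formula can be checked numerically; the 2-D plane and points below are invented for illustration:

```python
# d = w0 + w1*x1 + ... + wn*xn, with w0 = -(w . x0) built from a point
# x0 lying on the plane.

def signed_distance(x, w, x0):
    """d > 0 on the side the normal w points to, d < 0 on the other side."""
    w0 = -sum(wi * x0i for wi, x0i in zip(w, x0))
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

w  = [0.0, 1.0]    # unit normal pointing along the x2 axis
x0 = [0.5, 0.5]    # a point on the line x2 = 0.5
d_above = signed_distance([0.2, 0.9], w, x0)  # positive: same side as w
d_below = signed_distance([0.2, 0.1], w, x0)  # negative: opposite side
d_on    = signed_distance([0.7, 0.5], w, x0)  # zero: the point is on the line
```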

Changing the parameter w0 shifts the plane parallel to itself in space. If w0 decreases, the plane shifts in the direction of the vector ω, and if w0 increases, the plane shifts against the vector ω. This follows directly from the formula above.

◄ We now derive this formula (this derivation can be skipped) in vector notation. Consider the vector x − x0, starting at the point x0 (lying in the plane) and directed toward the point x (see the figure on the right; vectors add according to the triangle rule). Choose the position of x0 at the base of the vector ω, so that ω and x − x0 are collinear (lie on the same straight line). If the vector ω is a unit vector (ω² = 1), then the scalar product of x − x0 and ω equals the distance from the point x to the plane:

d = ω·(x − x0) = −ω·x0 + ω·x = w0 + ω·x.

If the length w = |ω| of the normal vector differs from unity, then d is w times greater (w > 1) or smaller (w < 1) than the Euclidean distance in n-dimensional space. When the vectors x − x0 and ω point in opposite directions, d < 0. ►

If the space has n dimensions, then a hyperplane is an (n−1)-dimensional object. It divides the whole space into two parts. For clarity, consider a 2-dimensional space. A hyperplane in it is a straight line (a one-dimensional object). In the figure on the right, the circle represents one point of the space and the cross another. They lie on opposite sides of the line (hyperplane). If the length of the vector ω is much greater than one, then the distances d from the points to the plane will be significantly greater than one in absolute value.

Let's return to the neuron. It is easy to see that it calculates the distance d from the point with coordinates x = (x1, …, xn) (the input vector) to the hyperplane (w0, ω). The neuron parameters ω = (w1, …, wn) determine the direction of the normal of the hyperplane, and w0 is associated with the displacement of the plane along the vector ω. The neuron output is this distance passed through the sigmoid, S(d) ∈ [0…1]. For large wi, the object drawn above as a circle leads to a neuron output close to one, and the cross to an output close to zero. The ratio w0/|ω| equals the distance from the plane to the origin (0, …, 0); for the plane to intersect the unit cube, its modulus should not exceed the cube's diagonal √n.

1. A neuron is a hyperplane. The value of its output equals the normalized distance from the input vector to the hyperplane. During learning, the plane of each neuron changes its orientation and shifts in feature space.

Neural networks against resistant bacteria

We have looked at how neural networks can identify an object in medicine. But in medicine neural networks are also used to summarize information and find patterns in huge volumes of data. People may not see the patterns in mountains of data, but they are there, and a neural network will find them. Such networks are applied to very important and pressing problems. It is known that many antibiotics work worse and worse, because bacteria become resistant to them. When there are many bacteria, they multiply and mutate quickly. After a course of antibiotics, especially one taken improperly, without following the doctor's recommendations and prescriptions, there is a non-zero chance that some bacterium with some mutation will survive and give rise to a huge number of bacteria that also survive the antibiotic. The surviving bacterium founds a whole colony of resistant bacteria that are not afraid of the drug. Finding new antibiotics is the only way to combat this problem; nothing better has yet been invented. Each time it becomes harder to find new kinds of antibiotics: there are fewer and fewer low-hanging fruits, and more and more substances have to be screened. As often happens when gigantic volumes of information must be analyzed, programming methods come to biologists' aid.

Neural networks were able to find a new antibiotic, halicin, as well as groups of other substances that are potential antibiotics. As usual, the network was trained on a database of already known antibiotics and other substances. After that, a huge number of diverse substances, whose ability to kill bacteria was unknown, were passed through it, and the network produced candidates. One of them turned out to be halicin, a drug originally tried as a treatment for diabetes but shown to be ineffective in trials. Tests on bacteria, however, proved it to be a new broad-spectrum antibiotic [5]. Figure 5 briefly describes the process: the network received a million molecules, processed them in its black box, and produced candidate antibiotic molecules, which were then tested on bacteria. Read more about the search for new antibiotics with machine learning in the article “Searching for new antibiotics using machine learning” [6].


Figure 5. Scheme of the neural network [5].

Neuron Confidence

When training a network, a criterion is required by which the parameters of the neurons are selected. Typically this is the squared deviation of the network outputs from their target values. For two classes and one output, the network error is taken to be

Error² = (1/N) ∑ (y − yc)²,

where y is the actual output and yc is its correct value, equal to 0 for one class and 1 for the other. The sum runs over N training examples. Training makes this mean square error over all training objects minimal. We will discuss error-minimization methods (the selection of neuron parameters) later.
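The error formula above transcribes directly into code; the outputs and targets below are invented sample values:

```python
# Error^2 = (1/N) * sum over the training set of (y - yc)^2

def mean_squared_error(y, yc):
    """Average squared deviation of outputs y from target values yc."""
    n = len(y)
    return sum((yi - yci) ** 2 for yi, yci in zip(y, yc)) / n

outputs = [0.9, 0.2, 0.8, 0.1]   # actual network outputs
targets = [1.0, 0.0, 1.0, 0.0]   # correct class values (0 or 1)
err2 = mean_squared_error(outputs, targets)
```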

Consider a 2-dimensional feature space x1, x2 and two classes 0 and 1. In the figure below, objects of one class are shown as blue circles and objects of the second class as red crosses. To the right of the feature space is a network [2,1] of one neuron. Next to it, on a blue-red square, is a map of the neuron's output values for all inputs (x1 and x2 run over values from 0 to 1 in increments of 0.01). Blue corresponds to y = 0, red to y = 1, and white to y = 0.5.

[figure: σ = 0.2D]

To the right of the figures, below the line, the neuron parameters are given in square brackets [w0, w1, w2], with the length |ω| of the normal vector in parentheses, along with the average output value and its volatility σy (see below). The x1 axis of the feature space points to the right, and the x2 axis points downward. Therefore the vector ω with positive components {w1, w2} points diagonally downward (it is drawn next to the circle on the line containing neuron number 1).

Above the line, the table gives the root mean square error Error of the network. Learn denotes the training sequence of objects, and Test a testing sequence that did not participate in training (test objects are drawn semi-transparent in the feature-space plot). The Miss column contains the percentage of objects incorrectly recognized by the network (not assigned to their class). The last row, kNear, gives the error and the percentage of errors of the 10 nearest neighbors method (described later).

In this example the scatter of the features within each class is small. The classes are easily separated by a hyperplane (a line in 2 dimensions). The network strives to minimize the error toward the target values 0 or 1 at the output, so the magnitude of the vector, |ω| = 48, takes on a relatively large value. As a result, even objects located close to the plane (in the usual Euclidean sense) receive a large absolute value of d, and accordingly y = S(d) ≈ 0 or 1. We will call such a neuron confident. The larger |ω|, the more confident the neuron. On its output map, a thin white line (the region of uncertainty) separates deep blue (one class) from deep red (the other).

The situation is somewhat different in the second example, where there is a wide region of overlap between objects of the two classes. Now the neuron is not so confident, and the length of the vector, |ω| = 24, is 2 times smaller.

[figure: σ = D]

Consider the distance-deformation functions (sigmoids) for normal-vector lengths of 1, 2, 5, 10, and 100: the larger the length, the closer the sigmoid is to a step function.
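The sharpening effect of the normal-vector length can be verified numerically. In this sketch a point sits at a fixed Euclidean distance of 0.1 from the plane (an arbitrary choice), and the output is computed for the lengths listed in the text:

```python
import math

def sigmoid(d):
    return 1.0 / (1.0 + math.exp(-d))

euclidean_distance = 0.1  # fixed distance of a point from the plane
outputs = {length: sigmoid(length * euclidean_distance)
           for length in (1, 2, 5, 10, 100)}
# With |w| = 1 the output stays near 0.5 (uncertain);
# with |w| = 100 it is pushed almost to 1 (confident).
```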

Since the neuron inputs are normalized to unity, the maximum distance between points x = {x1, …, xn} in an n-dimensional cube (its diagonal) equals √n. In 2-dimensional feature space dmax = 1.4. If the plane passes through the middle of the cube, dmax ~ 0.5.

A confident neuron is not always a good neuron. If the dimension n of the feature space is large and the amount of training data N is small, a network of confident neurons may become overtrained and give a large recognition error on test objects. In addition, overconfident neurons learn more slowly. We will discuss these issues in more detail below.

In conclusion, we formulate the main takeaway, which is valid for spaces of any dimension:

2. If two classes in the feature space are separated by a hyperplane, then one neuron is enough to recognize them.

More complex options

Consider a perceptron with multiple outputs. The same situation occurs: the perceptron learns successfully in this case as well. This follows from the fact that a perceptron with k outputs can be viewed as k independent perceptrons, each with a single output. And perceptrons with one output, as we found out, can learn with their weights initialized to zero.

This holds both for linear activation functions and for nonlinear ones whose derivative is nowhere zero (sigmoid, tanh, etc.). With such activation functions, the perceptron learns successfully starting from zero weights.

The problem arises, however, when ReLU, or a leaky ReLU with a zero slope coefficient for negative arguments, is used as the activation function. In that case the left derivative at zero is zero while the right derivative is not, so the derivative is not mathematically defined there. Yet for gradient descent some value of the derivative must still be chosen, and in both the TensorFlow and PyTorch implementations the derivative at zero is taken to be zero. Because of this, when the weights are initialized to zero, the gradient for every weight is zero. The weights therefore do not change, and no learning occurs.

This problem can also appear with randomly initialized weights, even ones of different signs. For example, if the weights are initialized so that the pre-activation on the single training example described above is negative, training will not occur: the derivative of ReLU is zero for negative arguments, and a zero derivative of the activation function leads to zero weight updates.

That is, with ReLU, learning may fail both when the weights are initialized to zero and when they are initialized to random values. Therefore the case described above cannot be considered a strong argument against initializing weights with zeros and constants; rather, it is an argument against using ReLU.

So why not initialize the weights to zero, if it doesn't interfere with learning (unless you use ReLU)?

Neuron utility

If the hyperplane of a neuron does not intersect the unit hypercube in which the features (or the outputs of previous neurons) lie, then such a neuron is of little use: it does not split the input data, which always lie in the interval [0…1], into two parts. We will call such a neuron useless.

One should strive to ensure that all neurons in the network are useful. Uselessness can also arise for a plane that does intersect the hypercube, if the objects of all classes end up on the same side of it.

Before training begins, the neuron parameters are set to random values, and a neuron may be useless from the start. To prevent this, the following initialization algorithm can be used.

Set the components of the vector ω randomly, for example in the range [−w … w], where w ~ 1-10. Then pick a random point x0 = {x01, …, x0n} inside the unit hypercube (or in some central part of it), and set the bias as w0 = −ω·x0 = −(w1 x01 + … + wn x0n). As a result, the hyperplane is guaranteed to pass through the hypercube.
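The initialization algorithm above can be sketched as follows (the range w_max = 5 is one arbitrary choice within the text's suggested 1-10):

```python
import random

def init_neuron(n, w_max=5.0, rng=random):
    """Initialize a neuron so its hyperplane passes through the unit cube."""
    w = [rng.uniform(-w_max, w_max) for _ in range(n)]   # random normal vector
    x0 = [rng.uniform(0.0, 1.0) for _ in range(n)]       # point inside the cube
    w0 = -sum(wi * x0i for wi, x0i in zip(w, x0))        # plane passes through x0
    return w, w0, x0

random.seed(0)  # fixed seed so the sketch is reproducible
w, w0, x0 = init_neuron(3)
```

By construction, d = w0 + ω·x0 = 0 at the point x0, so the plane contains a point of the hypercube.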

The bias parameter should also be monitored during learning, so that the neuron remains useful at all times. Two methods are possible here: geometric and empirical. In the empirical method, the average output value of each neuron over the training objects is computed. If, after all training examples have passed through the network, the average values of some neurons are close to zero or one, those neurons are considered useless. They can then be randomly “shaken up” (possibly preserving the vector ω and changing only the bias w0).

In all the examples in this document, the neurons of the networks are colored according to their average output ⟨y⟩. If ⟨y⟩ = 0.5, the neuron is white; if ⟨y⟩ = 0, blue; and if ⟨y⟩ = 1, red. Saturated blue or red means the neuron is useless. In the first two examples, the single neuron turned out to be quite useful (white), since class objects were equally likely to lie to the right and to the left of the separating line (hyperplane).

An important role is played by the neuron's volatility σy, equal to the standard deviation of its output from its average value. The lower the volatility, the less useful the neuron: regardless of the input values, it produces nearly the same output. Such a neuron can therefore be discarded without changing the network's outputs, by appropriately shifting the bias parameters of the neurons it feeds into.
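The empirical usefulness check can be sketched like this (the sample outputs and the 0.01 cut-off are invented for illustration):

```python
import math

def mean_and_volatility(outputs):
    """Average output and its standard deviation over the training set."""
    m = sum(outputs) / len(outputs)
    var = sum((y - m) ** 2 for y in outputs) / len(outputs)
    return m, math.sqrt(var)

def looks_useless(mean, sigma, eps=0.01):
    """Flag a neuron whose output is nearly constant or saturated."""
    return sigma < eps or mean < eps or mean > 1 - eps

useful_stats  = mean_and_volatility([0.1, 0.5, 0.9, 0.4])  # varied outputs
useless_stats = mean_and_volatility([0.99, 0.99, 0.99])    # saturated, constant
```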

How many layers and nodes should I use?

With the MLPs preamble out of the way, let's get to your real question.

How many layers should you use in your multilayer perceptron and how many nodes per layer?

In this section, we list five approaches to solve this problem.

1) Experimentation

In general, when people ask me how many layers and nodes to use for MLP, I often answer:

I don't know. Use systematic experimentation to find out what works best for your specific data set.

I still stand by this answer.

In general, you cannot analytically calculate the number of layers or the number of nodes to use per layer in an artificial neural network to solve a particular predictive modeling problem in the real world.

The number of layers and the number of nodes in each layer are hyperparameters of the model that need to be specified.

Chances are you will be the first to try to solve your specific problem using a neural network. No one has solved it before you, so no one can tell you the answer for how to configure the network.

You must find the answer using a reliable test harness and controlled experiments. For example, see the post:

  • How to Assess the Skills of Deep Learning Models

Regardless of the heuristics you may come across, all answers come back to the need for careful experimentation to see what works best for your specific data set.

2) Intuition

The network can be configured through intuition.

For example, you might have an intuition that a particular predictive modeling problem requires a deep network.

The deep model provides a hierarchy of layers that create increasing levels of abstraction from the space of input variables to output variables

Given the understanding of the problem domain, we can believe that a deep hierarchical model is required to solve the forecasting problem. In this case, we can choose a network configuration that has many levels of depth.

The choice of a deep model encodes the very general belief that the function we want to learn must include a composition of several simpler functions. This can be interpreted from a representational learning perspective as saying that we believe that the learning problem is to discover a set of basic factors of variation, which in turn can be described in terms of other, simpler basic factors of variation.

— page 201, Deep Learning, 2016

This intuition can come from domain experience, experience modeling problems with neural networks, or a combination of both.

In my experience, intuitions are often falsified by experiments.

3) Go deep

In their important textbook on deep learning, Goodfellow, Bengio, and Courville emphasize that, empirically, deep neural networks perform better on problems of interest.

In particular, they state the choice of using deep neural networks as a statistical argument in cases where depth may be intuitively useful.

Empirically, greater depth appears to lead to better generalization across a wide range of problems. […] This suggests that the use of deep architectures does express a useful prior over the space of functions that the model learns.

— page 201, Deep Learning, 2016

We can use this argument to suggest that using deep networks, i.e. networks with many layers, can be a heuristic approach when setting up networks to solve predictive modeling problems.

This is similar to the advice for starting with Random Forests and Stochastic Gradient Boosting on a predictive modeling problem with tabular data to quickly gain insight into assessing model skill before testing other methods.

4) Borrow Ideas

A simple, but perhaps time-consuming, approach is to use results reported in the literature.

Find research papers that describe the use of MLP in cases of forecasting problems that are somewhat similar to your problem. Pay attention to the network configurations used in these documents and use them as a starting point for a configuration to test your problem.

The transferability of model hyperparameters that lead to skillful models from one problem to another is a difficult open problem, and the reason why model hyperparameter configuration is more art than science.

However, the network layers and number of nodes used to solve problems are a good starting point for testing ideas.

5) Search

Develop an automated search that checks various network configurations.

You can start your search with ideas from literature and intuition.

Some popular search strategies include:

  • random : try random configurations of layers and nodes per layer.
  • grid : try a systematic search over the number of layers and nodes per layer.
  • heuristic : try a directed search over configurations, such as a genetic algorithm or Bayesian optimization.
  • exhaustive : try all combinations of layers and numbers of nodes; this may be feasible for small networks and datasets.

This can be difficult with large models, large datasets, and combinations thereof. Some ideas for reducing or managing the computing load include:

  • Fit models on a smaller subset of the training dataset to speed up the search.
  • Aggressively bound the dimensions of the search space.
  • Parallelize the search across multiple server instances (for example, using the Amazon EC2 service).

I recommend being systematic if time and resources allow.
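For example, the random strategy above can be sketched as sampling configurations to evaluate; the depth limit and the choices of node counts here are arbitrary assumptions:

```python
import random

random.seed(0)

def random_configs(n_trials, max_layers=3, node_choices=(2, 4, 8, 16, 32)):
    """Sample random layer/node configurations to evaluate."""
    configs = []
    for _ in range(n_trials):
        depth = random.randint(1, max_layers)           # number of hidden layers
        configs.append([random.choice(node_choices) for _ in range(depth)])
    return configs

# each configuration lists hidden-layer sizes, e.g. [16, 4]
for cfg in random_configs(5):
    print(cfg)  # train and score a model with these hidden layers here
```

Each sampled configuration would then be trained and scored, keeping the best one found within the available budget.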

More

I've seen countless heuristics for estimating the number of layers and the total number of neurons or the number of neurons per layer.

I don't want to list them; I'm skeptical that they add practical value beyond the special cases in which they are demonstrated.

If this area interests you, perhaps start with "Section 4.4 Capacity versus Size" in the book Neural Smithing. It summarizes a wealth of results in this area. The book dates back to 1999, so there are almost 20 more years of ideas to be found in this area if you're willing to dig for them.

Also, look at some of the discussions linked in the Further Reading section (below).


Multiple exits

Let's now consider 3 classes. Using one neuron is not very convenient, therefore, as described at the beginning of the document, we will create a network [2,3] with three outputs. Let the classes be localized in the feature space as follows:

Each output neuron separates its “class” from the other two. For example, the first neuron from the top (in the figure its hyperplane is number 1) recognizes objects marked with blue circles, outputting 1 if the object is on the side the vector ω points to (the dash next to the plane number).

Similarly, the second neuron recognizes red crosses, and the third one recognizes green squares. In each case the normal vectors are directed towards “their” classes. All neurons of the network are quite confident and quite useful. Their slight blueness is due to the fact that there is always twice as much data (the two “foreign” classes) opposite the normal vector as along it. Therefore, the average value of each output is shifted below the neutral level 0.5.
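The described [2,3] network can be sketched as a forward pass with three one-vs-rest sigmoid outputs; all weight values here are hypothetical stand-ins for trained ones, chosen only to illustrate the idea:

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

# hypothetical weights: each row is the normal vector of one output neuron,
# b is its offset; real values would come from training
W = np.array([[ 3.0, -3.0],    # neuron 1: separates "circles"
              [-3.0,  3.0],    # neuron 2: separates "crosses"
              [ 3.0,  3.0]])   # neuron 3: separates "squares"
b = np.array([0.0, 0.0, -4.0])

def net(x):
    """[2,3] network: three one-vs-rest sigmoid outputs."""
    return sigmoid(W @ x + b)

y = net(np.array([1.0, 0.0]))
print(y, "-> class", y.argmax())   # the most confident output wins
```

Picking the output with the largest value turns the three independent “one against the rest” separators into a single 3-class classifier.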

Conclusions based on formulas

In this section, the formulas relate any two adjacent layers. The following notation is used: i is the number of a neuron on the current layer, and j is the number of a neuron on the next layer.

When calculating the error, each term of the sum is proportional to:

  • δj – the error on the next layer,
  • f′(netj) – the value of the derivative of the activation function on the next layer,
  • wij – the weight of the corresponding connection:

    δi = Σj wij · f′(netj) · δj

    Calculation of the error for the i-th element

The weight change is proportional to:

  • δj – the error on the next layer,
  • f′(netj) – the value of the derivative of the activation function on the next layer,
  • η – the learning rate coefficient,
  • xi – the value of the i-th element of the current layer:

    Δwij = η · xi · f′(netj) · δj

Calculation of the weight change w(i,j)
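These relations can be checked with a small numpy sketch; the layer sizes, the random values, and the use of a sigmoid activation are assumptions for illustration only:

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

rng = np.random.default_rng(0)

# toy sizes are assumptions: 3 neurons on the current layer feed 2 on the next
x = rng.random(3)            # outputs of the current layer
W = rng.random((3, 2))       # W[i, j] is the weight of connection i -> j
net_next = x @ W             # weighted inputs of the next layer
delta_next = rng.random(2)   # errors already known on the next layer

s = sigmoid(net_next)
ds = s * (1.0 - s)           # derivative of the sigmoid on the next layer

# error of the i-th element of the current layer:
# delta_i = sum_j W[i, j] * f'(net_j) * delta_j
delta = W @ (ds * delta_next)

# weight change: dW[i, j] = eta * x_i * f'(net_j) * delta_j
eta = 0.1
dW = eta * np.outer(x, ds * delta_next)
W = W + dW

print(delta.shape, dW.shape)
```

The matrix product and outer product reproduce exactly the sums written above, once for every neuron of the current layer.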

When is a hidden layer needed?

Let's now move on to a slightly more complex problem. Let objects of two classes (circles and crosses) be concentrated at the corners of the feature space as in the figure on the right. These two classes cannot be separated by one hyperplane (line). This problem is usually called exclusive OR (xor). This logical operation equals truth (one) only if one of the arguments is true and the second is false (zero): “Masha loves either Kolya or Vasya, but not both of them.” In the figure, for the class marked with circles the network should output zero, and for the class with crosses, one. If the objects are exactly at the corners, then xor(0,0) = xor(1,1) = 0 and xor(0,1) = xor(1,0) = 1.

To carry out the classification, a neural network [2,2,1] with one hidden layer is needed. Below, in the first graph (in the feature axes x1 and x2), two hyperplanes are shown (lines A and B), corresponding to the hidden neurons A and B. The neuron output values are shown in the second graph (yA and yB).

Both crosses lie in the directions of the normal vectors of the planes A and B. Therefore, the distance from them to the planes will be positive and, if the neurons are confident enough, their outputs yA and yB will produce one (the lower right corner of the plane yA, yB).

The circle with features (0,0) from the upper left corner of the plane x1, x2 will give yA~0, yB~1 at the neuron outputs (this object lies opposite the normal vector of plane A and along the normal vector of plane B). The second circle, with features (1,1), gives yA~1, yB~0 at the neuron outputs. The resulting “deformed” space of features yA and yB (second graph) can easily be divided by one plane C, which will be the output neuron of the network. If its normal vector is directed as indicated in the second graph, then for both crosses the result will be y~1, and for the circles y~0.
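The geometric construction above can be sketched as a tiny [2,2,1] network with hand-set (untrained) weights; the steepness w and the plane offsets are assumptions chosen to match the figure:

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def xor_net(x1, x2, w=10.0):
    """[2,2,1] network with hand-set weights (not trained)."""
    yA = sigmoid(w * ( x1 + x2 - 0.5))  # plane A: circle (0,0) on its negative side
    yB = sigmoid(w * (-x1 - x2 + 1.5))  # plane B: circle (1,1) on its negative side
    yC = sigmoid(w * ( yA + yB - 1.5))  # plane C: logical AND of yA and yB
    return yC

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", round(xor_net(a, b)))  # rounds to 0, 1, 1, 0
```

Both crosses give yA~1 and yB~1, so plane C, which fires only when both hidden outputs are close to one, separates them from the circles.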

Below is a real example of a neural network trained to recognize two classes of objects, each of which is divided into two clusters:

Each layer of the network transforms the input feature space into some other space, possibly of a different dimension. This nonlinear transformation occurs until the classes become linearly separable for the neurons of the output layer.

Loss function

The loss function is at the heart of the neural network. It is used to calculate the error between the expected and obtained responses. Our global goal is to minimize this error. Thus, the loss function steers the training of the neural network towards this goal.

The loss function measures “how good” the neural network is on a given training set with its expected responses. It also depends on variables such as the weights and biases.

The loss function is a scalar, not a vector, because it evaluates how well the neural network performs as a whole.

Some famous loss functions:

  • Quadratic (mean squared error);
  • Cross entropy;
  • Exponential (AdaBoost);
  • Kullback-Leibler distance or information gain.

The quadratic (mean squared error) loss is the simplest and the most commonly used. It is given as follows:

    E = (1/n) · Σk (yk − ak)²,

where yk are the expected responses and ak are the outputs of the network.

The loss function in a neural network must satisfy two conditions:

  • The loss function should be written as an average;
  • The loss function should not depend on any activation values of the neural network other than the values produced at the output.
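As an illustration, the quadratic loss written as an average can be sketched as follows; the function name and the sample values are hypothetical:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Quadratic (mean squared error) loss, written as an average."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

# small deviations from the expected responses give a small loss
print(mse_loss([0, 1, 1], [0.1, 0.8, 0.9]))
```

Because the loss is an average over the training examples and depends only on the output values, it satisfies both conditions listed above.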

Neurons as logic gates

The analysis of neuron behavior can be approached from the standpoint of mathematical logic. To do this, let's focus on one class of the xor problem, for example the crosses. Let's write a logical condition that is satisfied by all objects of this class. In the example above: “any cross lies along the normal vector of plane A and along the normal vector of plane B”. This can be briefly expressed by the formula A & B. The output neuron “C” implements such a logical “AND”. Indeed, its plane is pressed against the lower right corner of the square in the feature space, with coordinates (1,1). Therefore, for inputs (supplied by neurons “A” and “B”) close to one, the output of this neuron will be 1 (more precisely, its value is greater than 0.5). Therefore, as expected, 1 & 1 = 1. If at least one of the inputs differs from 1, the output will be zero (less than 0.5). This is also true in a space of arbitrary dimension, where the hyperplane of the neuron that implements the logical “AND” is pressed against the corner of the hypercube with coordinates (1,1,…,1) (cutting it off from the rest of the hypercube).

If the plane is shifted to the corner (0,0), keeping the normal directed towards the corner (1,1), then such a neuron will be a logical “OR”. Its function y=S(x1,x2) gives S(0,0)=0 and 1 in the other cases (the first picture below):

In the general case, the plane of a neuron implementing the logical “OR” cuts off the corner (0,0,…,0) of an n-dimensional cube, and its normal vector is directed towards the larger volume of the hypercube. In contrast, for the standard logical “AND” (second figure) the normal vector points towards the smaller volume.

The logical “AND” for a neuron with n inputs is described by the following function:

y = S( w·(x1+…+xn+α−n) ),    y = x1 & x2 & … & xn,

where α is a parameter lying in the range 0<α<1. The closer it is to zero, the more strongly the plane is pressed against the corner with coordinates (1,1,…,1). Indeed, when α=0 and x1=…=xn=1, we have x1+…+xn−n=0. For this neuron to provide the logical AND, it must give a negative distance for the “nearest” corner of the hypercube in which one coordinate equals zero: (1,…,1,0,1,…,1). This gives the constraint α<1. The common factor w characterizes the length of the normal (the larger it is, the more confident the neuron). The sigmoid function S(d) is given at the beginning of the document.

The logical “OR” function is written similarly (0<α<1):

y = S( w·(x1+…+xn−α) ),    y = x1 ∨ x2 ∨ … ∨ xn.

Another logical function, negation, is implemented using subtraction. Denoting it with an exclamation mark, !x = 1−x and, as usual, !0 = 1, !1 = 0. If one of the neuron inputs is negated, its output function has the form:

y = S( w·(−x1+x2+…+xn+α−n+1) ),    y = !x1 & x2 & … & xn.

Thus, one of the components of the normal vector changes its sign and the plane of the neuron shifts. Above, in the third and fourth figures, various negations of variables are shown. In these terms, one can obtain the logical OR from the logical AND using de Morgan's rule: !(!x1 & !x2) = x1 ∨ x2.

Approximation of the function y=f(x)

Using a neural network with one input, one output and a sufficiently large hidden layer, one can approximate any function y=f(x). To prove this, let's first create a network that outputs 1 if the input is in the range [a…b] and 0 otherwise.

Let σ(d) = S(ω·d) be a sigmoid function whose argument is multiplied by a large number ω, so that a nearly rectangular step is obtained. Using two such steps, one can create a column of unit height:
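A minimal sketch of such a column, assuming a steepness ω=100 and a hypothetical interval [a,b]=[0.3, 0.6]:

```python
import numpy as np

def S(d, w=100.0):                 # steep sigmoid ~ rectangular step
    return 1.0 / (1.0 + np.exp(-w * d))

def column(x, a, b):
    """Unit-height column on [a, b]: a step up at a minus a step up at b."""
    return S(x - a) - S(x - b)

a, b = 0.3, 0.6
for x in (0.1, 0.45, 0.9):
    print(x, round(column(x, a, b)))   # ~0 before, ~1 inside, ~0 after
```

The difference of the two steps is close to 1 strictly inside the interval and close to 0 outside it, which is exactly the building block used below.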

We normalize the approximated function y=f(x) to the interval [0…1], both in its argument x and in its value y. Divide the range x=[0…1] into a large number of intervals (not necessarily equal). On each interval the function should change only slightly. Below are two such intervals:

Each pair of neurons in the hidden layer implements a unit column. The value d equals w1 if x∈(a,b) and w2 if x∈(b,c). If the output neuron is a linear adder, then we can set wi=fi, where fi are the function values on each interval. If the output neuron is an ordinary nonlinear element, then the weights wi must be recalculated using the inverse of the sigmoid (the last formula).

Backpropagation

  • The total error (total_error) is calculated as the difference between the expected value “y” (from the training set) and the obtained value “y_” (computed during forward propagation), passed through the cost function.
  • The partial derivative of the error is calculated with respect to each weight (these partial derivatives reflect the contribution of each weight to the total error).
  • These derivatives are then multiplied by a number called the learning rate (η).

The result obtained is then subtracted from the corresponding weights.

The result will be the following updated weights:

  • w1 = w1 - (η * ∂(err) / ∂(w1))
  • w2 = w2 - (η * ∂(err) / ∂(w2))
  • w3 = w3 - (η * ∂(err) / ∂(w3))
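A single update step from the list above can be sketched as follows; the weight and gradient values are made up for illustration, and the derivatives are assumed to have been computed already by backpropagation:

```python
# one gradient-descent step: w <- w - eta * d(err)/dw
eta = 0.05                                   # learning rate
w = [0.5, -0.3, 0.8]                         # current weights w1, w2, w3
grad = [0.2, -0.1, 0.4]                      # partial derivatives of the error
w = [wi - eta * gi for wi, gi in zip(w, grad)]
print([round(wi, 3) for wi in w])            # [0.49, -0.295, 0.78]
```

Each weight moves a small step against its own gradient, so weights that contribute more to the error are corrected more strongly.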

The idea that we can initialize the weights with random guesses and that they will then converge to give accurate answers doesn't sound entirely reasonable, but it works well.


Popular meme about how Carlson became a Data Science developer

If you are familiar with Taylor series, backpropagation is similar in spirit: instead of the whole infinite series, we optimize using only its first-order term.

Biases are additional weights attached to the neurons of the hidden layers. They are also randomly initialized and updated in the same way as the other weights. The role of the hidden layer is to determine the shape of the underlying function in the data, while the role of the bias is to shift the found function so that it matches the target function.

Function approximation in Python

Below is the Python code that approximates the function y=sin(pi*x):

import numpy as np               # library of numerical methods
import matplotlib.pyplot as plt  # library for drawing graphs

def F(x):                        # this function is approximated
    return np.sin(np.pi*x)

n  = 10                          # number of intervals
x1 = np.arange(0, 1, 1/n)        # coordinates of the left borders
x2 = np.arange(1/n, 1+1/n, 1/n)  # coordinates of the right borders
print("x1:", x1, "\nx2:", x2)    # output these arrays

f  = F((x1+x2)/2)                # function values in the middle of each interval
fi = np.log(f/(1-f))             # inverse of the sigmoid applied to them

def S(z, z0, s):                 # steep sigmoid step
    return 1/(1+np.exp(-100*s*(z-z0)))

def Net(z):                      # network output
    return 1/(1+np.exp(-np.dot(fi, S(z, x1, 1) + S(z, x2, -1) - 1)))

x = np.arange(0.01, 1, 0.01)     # array of x-s
y = [Net(z) for z in x]          # array of y-s (network output)

plt.plot(x, y)                   # network results
plt.plot(x, F(x))                # initial function
plt.show()                       # show the picture

As a result, with n=10 and n=100 the following results are obtained:

Neural networks and IBM Cloud

IBM is at the forefront of the development of AI technologies and neural networks, as evidenced by the emergence and evolution of IBM Watson. Watson is a trusted solution for large enterprises that need to implement advanced deep learning and natural language processing technologies into their systems, backed by a proven, multi-layered approach to AI design and implementation.

The Apache Unstructured Information Management Architecture (UIMA) and IBM DeepQA software that powers Watson enable powerful deep learning capabilities to be integrated into applications. Using tools like IBM Watson Studio, your enterprise can efficiently bring open source AI projects into production with the ability to deploy and run models in any cloud environment.

For more information on how to get started using deep learning technology, visit the IBM Watson Studio and Deep Learning service pages.

