Stanford course: lecture 5. Convolutional neural networks


You've probably heard that neural networks can classify images (distinguish a cat from a dog, an airplane from a car, men from women, and so on), stylize and colorize images, generate new ones, and do many other interesting things with pictures. We will learn how to achieve some of these effects. When it comes to image processing, a special neural network architecture is used:

Convolutional Neural Networks (CNN)

They were originally proposed by Yann LeCun and designed for image classification problems. Their breakthrough came in 2012, when Alex Krizhevsky's team won the annual ImageNet image recognition competition. Their algorithm achieved a classification accuracy of 83.6% - a record at the time - and that record was set by a convolutional neural network, AlexNet.

The general idea of the architecture of such networks was borrowed from the biological visual system. Scientists found that the dendrites of each visual neuron do not connect to all receptors in the retina, but only to a small local area, while the dendrites of the whole group of visual neurons together cover the entire retina:

Mathematicians have generalized this structure and proposed the following solution. The input image signal is applied to the input of the neuron only within a limited area, usually square, for example, 3x3 pixels. Then, this area is shifted to the right by a given step, say 1 pixel, and the inputs are sent to the second neuron. This is how the entire image is scanned. Moreover, the weighting coefficients for all neurons in this group are the same.

After this, scanning the image is repeated, but with a different set of weighting coefficients. We get the second group of neurons. Then, the third, fourth, and in the general case we have n different groups. This is how the first hidden layer of convolutional neural networks is formed.

Let's now take a closer look at what happens at the inputs of the neurons of each individual group. Since the weighting coefficients within the group do not change, in fact, we have a window (in our example, its size is 3x3 pixels) with a set of certain numbers:

These numbers are multiplied by the values of the corresponding pixels in the image and summed up:

In addition, each such sum can have one more adjustable parameter - a bias (offset), which we will denote by b. Then the window is shifted to the right (in our example by one pixel) and the calculation is repeated:

In general, for the entire image, the sum can be determined as follows:
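In standard notation, for a 3x3 kernel with weights w, input pixels x and bias b, the sum at output position (i, j) is:

$$ s_{i,j} = \sum_{k=0}^{2} \sum_{l=0}^{2} w_{k,l}\, x_{i+k,\, j+l} + b $$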

That is, as the window moves, the weighting coefficients remain unchanged, so the total number of adjustable parameters for one group of neurons is just 3 · 3 + 1 = 10 (nine window weights plus the bias).

Accordingly, for n groups we get n · (3 · 3 + 1) = 10n parameters.

In digital signal processing this sum is called a convolution, and the window of weighting coefficients is the impulse response of the filter (or the filter kernel). What is the point of such a filter? Let's imagine that we have a schematic image of a house and pass it through these kernels:
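Typical kernels of this kind (one common choice; the exact numbers may differ from the figure) look like this:

import numpy as np

# responds strongly to vertical lines
vertical_kernel = np.array([[-1, 2, -1],
                            [-1, 2, -1],
                            [-1, 2, -1]])

# responds strongly to horizontal lines
horizontal_kernel = np.array([[-1, -1, -1],
                              [ 2,  2,  2],
                              [-1, -1, -1]])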

The output shows clear vertical lines in the first case and horizontal lines in the second; all other lines become paler. In other words, the filter highlights characteristic areas of the image according to the configuration of its weighting coefficients. Thanks to this, the neurons of each group are activated whenever a fragment matching their kernel appears in their area of the image:

And the output is a set of feature maps, which are called channels. Significant values in each map indicate the presence of a feature at a specific location in the image. If there are several such features (in different areas of the image), then several neurons associated with these areas will be activated at the output. Thanks to this, subsequent convolutional layers can generalize the found features into more complex ones, for example, ellipses, rectangles, various line intersections, etc.

Of course, the values of the feature maps are the outputs of the neurons' activation functions; everything here works as usual: the sum (the convolution) passes through the activation function and the output values are produced:

I hope the concept of CNN channels is generally clear. But we have considered the simplest case, when a single-channel image, for example in grayscale, is fed to the input. If a full-color image is processed, represented, say, by three RGB color components, then each color component is first convolved with its own separate, independent kernel, the resulting feature maps are added together, a bias is added, and a single final feature matrix is formed; it passes through the neurons' activation function to produce the output values of the corresponding channel.

This is what processing full-color images looks like on the first hidden layer of the CNN. Subsequent layers work on the same principle, only instead of RGB color channels, channels from feature maps generated on the previous layer are processed.
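As a rough sketch of what one output channel computes (shapes and function names here are illustrative, not from any particular library):

import numpy as np

def conv_single_channel(image, kernels, bias):
    # image:   (C, H, W) input, e.g. C=3 for an RGB picture
    # kernels: (C, k, k), one kernel per input channel
    # bias:    a single number added to the summed result
    C, H, W = image.shape
    _, k, _ = kernels.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[:, i:i+k, j:j+k]              # C x k x k window
            out[i, j] = np.sum(window * kernels) + bias  # sum over all input channels
    return np.maximum(out, 0)                            # ReLU activation

rng = np.random.default_rng(0)
feature_map = conv_single_channel(rng.random((3, 5, 5)), rng.random((3, 3, 3)), 0.1)
print(feature_map.shape)   # (3, 3)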

And one last important point. Notice that after processing the original image, each channel actually forms a new image of a slightly smaller size:

If the original image was, for example, 128x128 pixels, then with a 3x3 kernel the feature map will have a size of:

(128 - 2) x (128 - 2) = 126 x 126

This is not the best option. We would like the output size to equal the input size. To achieve this, the center of the transformation mask should be placed over the very first pixel of the image, and the kernel cells that extend beyond the boundaries should be filled with some values, most often zeros (this is known as padding):

Then the output dimensions of the feature map will exactly match the original dimensions of the image (provided, of course, that the window is moved one pixel at a time). If the offset step (stride) is increased to 2, then the output dimensions of the feature map on each channel will be half the dimensions of the original image:

128 / 2 x 128 / 2 = 64 x 64

Which step to choose depends on the specific problem being solved and the preferences of the CNN architecture designer. There are no hard and fast recommendations here.
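A convenient way to check these sizes is the standard output-size formula, sketched here in Python (variable names are just for illustration):

def conv_output_size(width, kernel, padding=0, stride=1):
    # floor((W - K + 2P) / S) + 1 for one spatial dimension
    return (width - kernel + 2 * padding) // stride + 1

print(conv_output_size(128, 3))                       # 126: 3x3 kernel, no padding, stride 1
print(conv_output_size(128, 3, padding=1))            # 128: zero padding keeps the size
print(conv_output_size(128, 3, padding=1, stride=2))  # 64:  stride 2 halves the size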

A little history

In 1957, Frank Rosenblatt invented the Mark-1 computing system, which was the first implementation of a perceptron. This algorithm also uses a linear classifier interpretation and a loss function, but the output is either 0 or 1, with no intermediate values.


Mark I indicators and switches

The weights w and the input data x are supplied to the perceptron, and their product is adjusted by the bias b.

In 1960, Bernard Widrow and Ted Hoff developed the single-layer ADALINE neural network and its improved version, the three-layer MADALINE. These were the first deep (for that time) architectures, but they did not yet use the backpropagation method.


ADALINE

The backpropagation algorithm became widely known in 1986 through the work of David Rumelhart and colleagues on multilayer perceptrons. It uses equations we are already familiar with, the chain rule of differentiation, and output values ranging from 0 to 1.

Then a period of stagnation began in the development of machine learning, since the computers of that time were not suitable for creating large-scale models. In 2006, Geoffrey Hinton and Ruslan Salakhutdinov published an article in which they showed how deep neural networks could be trained effectively. But even then they had not yet acquired their modern appearance.

Artificial intelligence researchers achieved their first truly impressive results in 2012, when successful solutions to speech recognition and image classification problems appeared almost simultaneously. Around the same time, the convolutional neural network AlexNet was presented, achieving record classification accuracy on the ImageNet dataset for its time. Since then, such architectures have been widely used in many fields.

How CNN works

Convolution is really the main thing to understand about convolutional neural networks. This intricate mathematical term describes a window, or filter, that moves across the image being examined. The moving window is applied to a specific area of nodes, as shown below, where the applied filter is (0.5 x node value):

The diagram shows only two output values, each computed from a 2x2 square of input nodes. As mentioned, the weight applied to each of the four inputs is 0.5, so each output is simply 0.5 times the sum of the four input values in its square.

In the convolutional neural network part, we can imagine such a moving 2 x 2 filter sliding over all available nodes or pixels of the input image. Such an operation can be illustrated using standard neural network node diagrams:

The first position of the moving filter connections is shown with a blue line, the second - with a green line. The weights for each such connection are 0.5.
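A minimal numeric sketch of this 2 x 2 filter with all weights equal to 0.5 (the pixel values are made up purely for illustration):

import numpy as np

image = np.array([[2.0, 4.0, 6.0],
                  [1.0, 3.0, 5.0]])       # made-up pixel values
weights = np.full((2, 2), 0.5)            # the constant 0.5 filter from the text

out_1 = np.sum(image[:, 0:2] * weights)   # 0.5 * (2 + 4 + 1 + 3) = 5.0
out_2 = np.sum(image[:, 1:3] * weights)   # window shifted right: 0.5 * (4 + 6 + 3 + 5) = 9.0
print(out_1, out_2)                       # 5.0 9.0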

Here are a few things in the convolution step that speed up the training process by reducing the number of parameters and weights:

  • Sparse connections—not every node in the first (input) layer is connected to every node in the second layer. This distinguishes CNN architecture from a fully connected neural network, where each node is connected to all others in the next layer.
  • Constant filter parameters. In other words, as the filter moves across the image, the same weights are applied to each 2 x 2 set of nodes. Each filter can be trained to perform specific transformations on the input space, so each filter has its own set of weights that is applied to every convolution operation. This reduces the number of parameters. Note that this does not mean every weight inside an individual filter has to be the same: in the example above the weights were [0.5, 0.5, 0.5, 0.5], but nothing prevented them from being [0.25, 0.1, 0.8, 0.001]. The specific values depend on how each filter is trained.

These two properties of convolutional neural networks significantly reduce the number of parameters for training compared to fully connected networks.
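To get a feel for the scale of the savings (the layer sizes below are chosen only for illustration): a fully connected layer from a 28 x 28 input to 128 nodes needs about a hundred thousand weights, while a convolutional layer with 32 filters of size 2 x 2 needs a few hundred, regardless of the image size:

dense_weights = 28 * 28 * 128        # every pixel connected to every node
conv_weights  = 32 * (2 * 2 + 1)     # 32 filters, 2x2 weights plus a bias each
print(dense_weights, conv_weights)   # 100352 160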

The next step in the CNN structure is to pass the output of the convolution operation through a nonlinear activation function, usually some version of ReLU. This is what gives the network its well-known nonlinear behavior.

The process used in the convolutional block is called feature mapping. The name is based on the idea that each convolutional filter can be trained to find different features in an image, which can then be used in classification. Before we talk about the next feature of CNNs, called pooling, let's look at the idea of ​​feature mapping and channels.

Neural networks are everywhere

Convolutional networks are good at handling huge data sets and train efficiently on GPUs through parallel computing. These features are the key to the fact that artificial intelligence is now used almost everywhere. It solves problems of image classification and search, object detection, segmentation, and is also used in more specialized fields of science and technology.


Examples of image classification and search problems


Examples of object detection and segmentation

Neural networks are actively being developed and are already used in autonomous cars, in facial recognition, video classification, and human pose or gesture estimation. By the way, it was convolutional architectures that learned to beat people at chess and Go. Other specific applications include medical image analysis, geographic map segmentation, image captioning, and transferring artists' styles to photographs.


Style Transfer Examples

This is just a small part of the examples of using convolutional networks. Let's take a look at how they work and what makes them so versatile.

Feature mapping and multi-channel

Because the weights of an individual filter remain constant as it is applied across the input nodes, the filter can learn to select specific features from the input data. In the case of images, the architecture is able to learn to distinguish common geometric elements - lines, edges and other shapes of the object under study. This is where the term feature mapping comes from. For this reason, any convolutional layer needs many filters, each trained to detect a different feature. Therefore, the previous moving-filter diagram should be extended as follows:

Now on the right side of the figure you can see several stacked outputs of the convolution operation. There are several of them because there are several trainable filters, each of which produces its own 2D output (in the case of a 2D image). Such a set of filters is often called channels in deep learning. Each channel must be trained to extract a specific key feature from the image. The output of a convolutional layer for a black-and-white image, as in the MNIST dataset, has 3 dimensions - 2D for each of the channels and another for their number.

If the input itself is multi-channel, as in the case of a color RGB image (one channel for each color), the filters and the data become four-dimensional tensors. Fortunately, any deep learning library, including PyTorch, handles this mapping easily. Finally, remember that the output of the convolution operation passes through an activation function at each node.

The next important part of convolutional neural networks is a concept called pooling.

How convolutional networks work

In the last lecture, we discussed the idea of fully connected linear layers. Suppose we have an original 3D image of size 32x32x3. Let's stretch it into one long 3072x1 vector and multiply it by a weight matrix of size, for example, 10x3072. To get the activations (the output with class scores), we take each of the 10 rows of the matrix and compute its dot product with the stretched vector.

Each dot product gives a number that plays the role of a neuron's value; in our case we get 10 such values. Fully connected layers work on this principle.
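A quick sketch of this computation (random numbers only, to make the shapes concrete):

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # original 3D image
x = image.reshape(3072)           # stretched into one long vector
W = rng.random((10, 3072))        # weight matrix: one row per class

scores = W @ x                    # 10 dot products -> 10 class scores
print(scores.shape)               # (10,)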

The main difference of convolutional layers is that they preserve the spatial structure of the image. Here we use weights in the form of small filters - spatial matrices that slide over the entire image and compute a dot product with each of its patches. The depth of the filter (not to be confused with its spatial size) always matches the depth of the input volume: for a 32x32x3 image, a 5x5 filter is actually 5x5x3.

As a result of walking over the image, we get an activation map, also known as a feature map. This process is called spatial convolution - you can read more about it in the article Convolution in Deep Learning in simple words, which also explains why the activation map is smaller than the original photo.

You can apply many filters to an image and get different activation maps as output. This way we will form one convolutional layer. To create an entire neural network, layers alternate one after another, and activation functions (for example, ReLU) and special pooling layers are added between them, reducing the size of feature maps.

Let's take a closer look at what convolution filters are. In the very first layers, they usually correspond to low-level features of the image, such as edges and boundaries. In the middle layers there are more complex features, such as corners and circles. And in the final layers, the filters respond to higher-level, more specific features that can be interpreted more broadly.


Examples of filters for convolutional layers of the VGG-16 neural network

The figure below shows examples of 5x5 filters and the activation maps that result from applying them to the original image (top left corner). The first filter (circled in red) looks like a small edge segment slanted to the right. If we apply it to a photograph, the highest values (white) will appear in places where there are edges with approximately the same orientation. You can verify this by looking at the first activation map.

Thus, one layer of the neural network finds the areas of the image that are most similar to the given filters. This process is very similar to the usual convolution of two functions. It shows how objects correlate with each other.

Putting everything together, we get something like the following picture: taking the original photo, we pass it through alternating convolutional layers, activation functions and pooling layers. At the very end we use a regular fully connected layer, connected to all the outputs of the last layer, which gives us the final scores for each class.


Scheme of operation of a convolutional neural network

Modern convolutional neural networks operate on this principle.

Features of CNN

A fully connected neural network with multiple layers can do a lot, but to show truly outstanding results in classification tasks, it needs to go deeper. In other words, it requires many more layers in the network. However, adding many new layers brings problems. First, we face the vanishing gradient problem, although this can be mitigated by a suitable activation function, such as the ReLU family. Another problem with deep fully connected networks is that the number of weights to train grows very quickly. This means that training becomes slower or practically impossible, and the model may overfit. However, there is a solution.

Convolutional neural networks attempt to solve the second problem by exploiting correlations between adjacent inputs in pictures or time series. For example, in a picture of a cat and a dog, pixels close to the cat's eyes are more likely to correlate with nearby pixels on its nose than with pixels on the dog's nose on the other side of the picture. This means that there is no need to connect every node to every node in the next layer. This technique reduces the number of weight parameters for training the model. CNNs also have other features that enhance the training process, which will be discussed in other chapters.

Subsampling (pooling) layer

This layer allows you to reduce the feature space while preserving the most important information. There are several different versions of the pooling layer, including maximum pooling, average pooling, and sum pooling. The maxpooling layer is most often used.

The subsampling layer requires only one hyperparameter - the pooling factor, that is, the factor by which the spatial dimensions are reduced. Most often a max pooling layer is used to halve the size of the input tensor. Some libraries allow separate reduction factors for height and width, but most often they are the same.
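A minimal numeric illustration of 2 x 2 max pooling with a factor of 2 (the values are arbitrary):

import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [7, 2, 9, 4],
              [3, 5, 8, 6]], dtype=float)

# take the maximum in each non-overlapping 2x2 block: 4x4 -> 2x2
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 2.]
                #  [7. 9.]]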

CoLab: Fashion MNIST Clothing Element Classification Using Convolutional Neural Network

Truth be told, this practical part only makes sense after completing the previous one - all the code, except for one block, remains the same. What changes is the structure of our neural network: four additional lines for the convolutional layers and the max-pooling subsampling layers.

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3,3), padding='same', activation=tf.nn.relu, input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Conv2D(64, (3,3), padding='same', activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

They promise to give us all the detailed explanations of how it works in the next part—part 4.

Oh yes. The accuracy of the model on the training set reached 97% (the model overfit at epochs=10), and on the test dataset it showed 91%. That is a noticeable increase in accuracy compared to the previous architecture, which used only fully connected layers - 88%.

Forward propagation

Having described the creation of the layer, let's move on to writing the forward pass. To obtain the output tensor, we need to go through all the scale x scale submatrices of the input volume, find the maximum in each, and write it into the output tensor:

// forward propagation
Tensor MaxPoolingLayer::Forward(const Tensor &X) {
    Tensor output(outputSize); // create an output tensor

    // go through each of the channels
    for (int d = 0; d < inputSize.depth; d++) {
        for (int i = 0; i < inputSize.height; i += scale) {
            for (int j = 0; j < inputSize.width; j += scale) {
                double max = X(d, i, j); // start from the first value of the submatrix

                // go through the scale x scale submatrix
                for (int y = i; y < i + scale; y++) {
                    for (int x = j; x < j + scale; x++) {
                        double value = X(d, y, x);

                        if (value > max)
                            max = value; // update the maximum
                    }
                }

                output(d, i / scale, j / scale) = max; // write the found maximum into the output tensor
            }
        }
    }

    return output; // return the output tensor
}

Implementation of a CNN in PyTorch

Any deep learning framework worth its salt can handle convolutional neural network operations with ease. PyTorch is such a framework. This section will show you how to create a CNN using PyTorch step by step. Ideally, you should have some knowledge of PyTorch, but this is not required. We want to develop a neural network to classify handwritten digits from the MNIST dataset. The complete code for this tutorial can be found in this GitHub repository.

We are going to implement the following convolutional network architecture:

At the very beginning, the input consists of grayscale images of digits, 28x28 pixels each. The first layer consists of 32 channels of 5x5 convolutional filters plus a ReLU activation function, followed by 2x2 max pooling with a stride of 2 (this layer outputs 14x14 data). The next layer takes the 14x14 output of the first layer and scans it again with 5x5 convolutional filters across 64 channels, followed by 2x2 max pooling to produce a 7x7 output.

After the convolutional part of the network follows:

  • a flattening operation that creates 7 x 7 x 64 = 3136 nodes
  • a hidden fully connected layer of 1000 nodes
  • a softmax operation on the final 10 output nodes to generate class probabilities.

Together, these layers form the final classifier.
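A sketch of this architecture in PyTorch might look roughly like the following (layer names are illustrative; the exact code lives in the linked repository):

import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 1 input channel -> 32 channels, 5x5 filters, padding keeps 28x28; 2x2 pooling -> 14x14
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # 32 -> 64 channels; another pooling step -> 7x7
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc1 = nn.Linear(7 * 7 * 64, 1000)   # flattened 3136 nodes -> 1000
        self.fc2 = nn.Linear(1000, 10)           # 10 class scores

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)       # flatten everything except the batch dimension
        out = self.fc1(out)
        return self.fc2(out)                     # softmax is usually folded into the loss function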

Introduction

In the last lesson, we learned how to develop deep neural networks that can classify images of clothing items from the Fashion MNIST dataset.

The results we achieved while working on the neural network were impressive - 88% classification accuracy. And this is in a few lines of code (not taking into account the code for creating graphs and images)!

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

NUM_EXAMPLES = 60000
train_dataset = train_dataset.repeat().shuffle(NUM_EXAMPLES).batch(32)
test_dataset = test_dataset.batch(32)

model.fit(train_dataset, epochs=5, steps_per_epoch=math.ceil(num_train_examples/32))

test_loss, test_accuracy = model.evaluate(test_dataset, steps=math.ceil(num_test_examples/32))
print('Accuracy: ', test_accuracy)

# Accuracy: 0.8782

We also experimented with the influence of the number of neurons in hidden layers and the number of training iterations on the accuracy of the model. But how can we make this model even better and more accurate? One way to achieve this is to use convolutional neural networks, or CNNs for short. CNNs show greater accuracy when solving image classification problems than the standard fully connected neural networks that we encountered in previous lessons. It is for this reason that CNNs have become so popular and it is thanks to them that a technological breakthrough in the field of computer vision has become possible.

In this lesson, we will learn how easy it is to develop a CNN classifier from scratch using TensorFlow and Keras. We will use the same Fashion MNIST dataset that we used in the previous lesson. At the end of this lesson, we will compare the clothing classification accuracy of the previous neural network with the convolutional neural network from this lesson.

Before diving into development, it’s worth delving a little deeper into the operating principle of convolutional neural networks.

Two main concepts in convolutional neural networks:

  • convolution
  • subsampling operation (pooling, max pooling)

Let's take a closer look at them.

Stride

Another hyperparameter we can specify in the convolutional layer is the stride, which is the number of pixels by which the filter window is moved at each step (in the previous example the stride was one).

Larger stride values reduce the amount of information passed on to the next layer. In the following image we see the same example as before, but now with a stride of 2:



As we can see, the 5 x 5 image has turned into a smaller 2 x 2 output. In practice, however, strides are rarely used to reduce the size inside the convolution; that is usually done with the pooling operations introduced earlier. In Keras, the stride of a Conv2D layer is configured with the strides argument, which defaults to strides=(1, 1) and specifies the step separately for the two dimensions.
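For example, a minimal Keras layer with a stride of 2 in both dimensions could look like this (the layer sizes are arbitrary):

import tensorflow as tf

# 3x3 filters moved 2 pixels at a time halve the spatial dimensions
layer = tf.keras.layers.Conv2D(32, (3, 3), strides=(2, 2), padding='same', activation='relu')
print(layer(tf.zeros((1, 28, 28, 1))).shape)   # (1, 14, 14, 32)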

Fully connected layer

This layer contains a matrix of weighting coefficients and a bias vector and is no different from the same layer in an ordinary fully connected network. The layer's only hyperparameter is the number of output values. The result of applying the layer is a vector, or a tensor whose matrix in each channel has size 1x1. More information about the operation of the layer can be found in the article about creating a feedforward neural network.


Fully connected layer

Activation layer

This layer applies a function to each element of its input. The most commonly used activation functions are ReLU, Sigmoid, Tanh, and LeakyReLU. Typically, the activation layer is placed immediately after the convolution layer, which is why some libraries even embed the ReLU function directly into the convolution layer. You can read more about activation functions here: activation functions.


Activation functions

General view

Below is an image from Wikipedia that shows the structure of a fully developed convolutional neural network:

If we look at the picture from the top left to the right, we first see an image of a robot. Then a series of convolutional filters scans the input image to extract features, and pooling operations downsample the outputs of those filters. This is followed by the next set of convolution and pooling operations, which takes the output of the first set as its input. Finally, a fully connected layer is attached at the output of the network; its purpose needs some explanation.

Fully connected layer

As previously discussed, a convolutional neural network takes high-resolution data and efficiently transforms it into representations of features. The fully connected layer can be viewed as a standard classifier applied to this information-rich output in order to interpret it and ultimately produce the classification result. To attach a fully connected layer to the network, the CNN's output dimensions need to be "flattened".

Considering the previous diagram, the output consists of several channels of x × y matrices. These channels must be flattened into a single (N x 1) tensor. For example, suppose we have 100 channels of 2x2 matrices holding the output of the last pooling operation in the network. In PyTorch you can easily convert them into 2 x 2 x 100 = 400 nodes, as shown below.
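A flattening step of this kind in PyTorch typically looks like the following sketch (tensor sizes match the 100-channel, 2 x 2 example above):

import torch

x = torch.randn(1, 100, 2, 2)    # a batch of one: 100 channels of 2x2 maps
flat = x.view(x.size(0), -1)     # flatten everything except the batch dimension
print(flat.shape)                # torch.Size([1, 400])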

Now that the basics of convolutional neural networks are laid, it's time to implement a CNN using PyTorch.
