Pattern recognition using neural networks. Basic Research

Introduction

The topic of this research is the development of an image recognition system based on artificial neural networks. The task of image recognition is very important, since the ability to automatically recognize images by a computer brings many new opportunities in the development of science and technology, such as the development of systems for searching for faces and other objects in photographs, quality control of manufactured products without human intervention, automatic transport control and many others.

As for artificial neural networks, this area of machine learning has developed rapidly in recent years thanks to a significant increase in the computing power of computers and the widespread use of graphics cards for computing. This makes it possible to train neural networks of much greater depth and more complex structure than before, which, in turn, show significantly better results than other algorithms on many tasks, especially image recognition. This direction of development of neural networks is called deep learning and is currently one of the most successful and rapidly developing. For example, according to the results of the annual image recognition competition ImageNet-2014, the vast majority of successful algorithms used deep convolutional networks.

Since the problem of image recognition is very broad and in most cases requires a separate approach for different types of images, it is almost impossible to consider it as a whole within a single study. It was therefore decided to consider one subtask of image recognition as an example: road sign recognition.

Thus, the main goal of this study was to develop an image recognition system based on artificial neural networks for images of road signs. To achieve this goal, the following tasks were formulated:

Performing an analytical review of the literature on the topic of artificial neural networks and their application to image recognition problems

Development of an algorithm for recognizing road signs using artificial neural networks

Development of a prototype image recognition system based on the developed algorithm. The result of this task should be a software package that allows the user to upload an image and get a prediction of the class of that image

Conducting experimental studies. It is necessary to conduct research and evaluate the accuracy of the resulting algorithm.

During the study, all assigned tasks were completed. Specific results for each of them will be described in the main part of the work.

1. Literature review

1.1 Machine learning

Neural networks, which are discussed in detail in this work, are one type of machine learning algorithm. Machine learning is one of the subfields of artificial intelligence. The main property of machine learning algorithms is their ability to learn while working. For example, an algorithm for constructing a decision tree has no preliminary information about what the data is or what patterns exist in it; it receives only an input set of objects with the values of some features for each of them, along with a class label. In the process of constructing the tree, it reveals hidden patterns, that is, it learns, and after training it is able to predict the class of new objects that it has not seen before.

There are two main types of machine learning: supervised learning and unsupervised learning. Supervised learning assumes that the algorithm, in addition to the original data itself, is provided with some additional information about it, which it can later use for training. The most popular supervised learning problems include classification and regression. For example, a classification problem can be formulated as follows: given a certain set of objects, each of which belongs to one of several classes, it is necessary to determine to which of these classes a new object belongs. The task of recognizing road signs, which was considered in this work, is a typical classification problem: there are several types of road signs (the classes), and the task of the algorithm is to “recognize” the sign, that is, to assign it to one of the existing classes.

Unsupervised learning differs from supervised learning in that the algorithm is not provided with any additional information other than the original data set itself. The most popular example of an unsupervised learning problem is clustering. The essence of the clustering problem is as follows: the input to the algorithm is a certain number of objects belonging to different classes (but which object belongs to which class is unknown, and the number of classes may also be unknown), and the goal of the algorithm is to split this set of objects into subsets of “similar” objects, that is, objects belonging to the same class.

Among all machine learning algorithms, there are several main families. When it comes to classification problems, the most popular such families include, for example:

· Rule-based Classifiers - the main idea of such classifiers is to search for rules for assigning objects to a particular class in the “IF - THEN” form. To find such rules, some statistical metrics are usually used, and the construction of rules based on a decision tree is also common.

· Logistic regression - the main idea is to find a hyperplane that most accurately divides the space into two half-spaces so that objects of different classes belong to different half-spaces. In this case, the equation of the target hyperplane is sought in the form of a linear combination of the input parameters. To train such a classifier, the gradient descent method can be used, for example.

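· Bayesian classifier - as the name suggests, the classifier is based on Bayes' theorem, which is written in the form

$$P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}$$

where C is a class and x is the vector of parameter values of the instance being classified.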

The idea of the classifier in this case is to find the class with the maximum posterior probability, given that all parameters have the values they have for the instance being classified. In the general case, this task requires prior knowledge of a very large number of conditional probabilities and, accordingly, a huge training sample and high computational complexity. Therefore, in practice, a type of Bayes classifier called the naive Bayes classifier is most often used. It assumes that all parameters are independent of each other, so the formula takes a much simpler form, and to use it you only need to know a small number of conditional probabilities.

Although this assumption is usually far from reality, a naive Bayes classifier often produces good results.

· Decision trees - in a simplified form, this algorithm consists of constructing a tree in which each node corresponds to some test that is performed on the parameters of the object, and the leaves are the final classes. There are many types of decision trees and algorithms for their construction. For example, one of the most popular algorithms is C4.5.

· Neural networks are a model represented as sets of elements (neurons) and connections between them, which in general can be directed or undirected and have some weights. During the operation of a neural network, a part of its neurons, called input, receives a signal (input data), which is somehow propagated and transformed, and at the output of the network (output neurons) you can see the result of the network’s operation, for example, the probabilities of individual classes. Neural networks will be discussed in this work in more detail in the next section.

· Support vector machines - the concept of the algorithm is the same as in the case of logistic regression: searching for a separating plane (or several planes). However, the method of searching for this plane is different: the plane is chosen so that the distance from it to the nearest points representing both classes is as large as possible, for which quadratic optimization methods are usually used.

· Lazy learners are a special type of classification algorithm that, instead of first building a model and then using it to assign objects to classes, relies on the idea that similar objects most often have the same class. When such an algorithm receives an object for classification as input, it searches for objects similar to it among previously seen objects and, using information about their classes, forms its prediction regarding the class of the target object.
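The lazy-learner idea from the last item of the list can be illustrated with a k-nearest-neighbours classifier. This is only a minimal sketch: the function name `knn_predict`, the choice of Euclidean distance, the value k = 3 and the toy data are assumptions made for illustration and do not come from the study.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples.

    `train` is a list of (feature_vector, label) pairs. No model is built in
    advance; all work happens at prediction time, hence "lazy".
    """
    # Order the stored examples by Euclidean distance to the query object.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k closest examples.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: two numeric features per object.
train = [
    ((1.0, 1.0), "circle"),
    ((1.2, 0.8), "circle"),
    ((0.9, 1.1), "circle"),
    ((5.0, 5.0), "triangle"),
    ((5.2, 4.9), "triangle"),
    ((4.8, 5.1), "triangle"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

The quality of such a classifier depends almost entirely on how well the chosen distance reflects the similarity of objects, which is exactly why it is hard to apply directly to raw images.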

It can be seen that classification algorithms can be based on very different ideas and, of course, show different effectiveness for different types of problems. Thus, for problems with a small number of input features, rule-based systems may be useful; if a similarity metric can be calculated quickly and conveniently for the input objects, lazy classifiers may be useful. But for problems with a very large number of parameters that are also difficult to identify or interpret, such as image or speech recognition, neural networks become the most suitable classification method.

1.2 Neural networks

Artificial neural networks are one of the widely known and used machine learning models. The idea of artificial neural networks is based on simulating the nervous system of animals and humans.

A simplified model of the animal nervous system is represented as a system of cells, each of which has a body and branches of two types: dendrites and axons. At a certain moment, a cell receives signals from other cells through dendrites and, if these signals are of sufficient strength, it becomes excited and transmits this excitation to other cells with which it is connected through axons. Thus, the signal (excitation) spreads throughout the nervous system. The neural network model is structured in a similar way. A neural network consists of neurons and directed connections between them, with each connection having some weight. In this case, some of the neurons are input - they receive data from the external environment. Then, at each step, the neuron receives a signal from all the input neurons, calculates the weighted sum of the signals, applies some function to it, and passes the result to each of its outputs. The network also has a number of output neurons that form the result of the network. So, for a classification task, the output values of these neurons can mean the predicted probabilities of each of the classes for the input object. Accordingly, training a neural network consists of selecting such weights for connections between neurons so that the output values for all input data are as close as possible to the actual ones.
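The computation performed by a single neuron, as described above, can be sketched in a few lines. The sigmoid activation and the bias term are assumptions chosen for illustration; the text only requires "some function" applied to the weighted sum, and the names `neuron_output` and `layer_output` are hypothetical.

```python
import math


def sigmoid(z):
    """A common choice of activation function (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the incoming signals followed by the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer_output(inputs, weight_rows, biases):
    """Outputs of a whole layer: one row of weights per neuron."""
    return [neuron_output(inputs, row, b) for row, b in zip(weight_rows, biases)]


# Two input signals feeding a layer of two neurons (values are made up).
signals = [1.0, 2.0]
print(layer_output(signals, [[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0]))
```

Training then amounts to choosing the numbers in `weight_rows` and `biases` so that the outputs are as close as possible to the desired values.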

There are several main types of neural network architectures:

· Feed-forward network - implies that neurons and connections between them form an acyclic graph, where signals propagate in only one direction. It is these networks that are the most popular and widely studied, and their training presents the least difficulties.

· Recurrent neural networks - in such networks, unlike feedforward networks, signals can be transmitted in both directions, and can arrive at the same neuron several times during the processing of one input value. A particular type of recurrent neural networks is, for example, the Boltzmann machine. The main difficulty in working with such networks is training them, since creating an effective algorithm for this in the general case is a difficult task and still does not have a universal solution.

· Self-organizing Kohonen maps - a neural network designed primarily for clustering and data visualization.

In the history of the development of neural networks, there are three main periods of growth. The first research in the field of artificial neural networks dates back to the 1940s. In 1943, W. McCulloch and W. Pitts published the work “A Logical Calculus of the Ideas Immanent in Nervous Activity,” which outlined the basic principles for constructing artificial neural networks. In 1949, D. Hebb’s book “The Organization of Behavior” was published, where the author examined the theoretical foundations of training neural networks and for the first time formulated the concept of training neural networks as adjusting weights between neurons. In 1954, W. Clark made the first attempt to implement an analogue of the Hebb network using a computer. In 1958, F. Rosenblatt proposed a model of the perceptron, which was essentially a neural network with one hidden layer. The general structure of the Rosenblatt perceptron is shown in Figure 1.

Figure 1. Rosenblatt Perceptron

This model was trained using the error correction method, which consisted in the fact that the weights remain unchanged as long as the output value of the perceptron is correct, but in case of an error, the connection weight changes by 1 in the direction opposite to the sign of the error that occurred. This algorithm, as proven by Rosenblatt, always converges. Using such a model, it was possible to create a computer that recognized some letters of the Latin alphabet, which, undoubtedly, was a great success at that time.
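The error correction rule described above can be sketched as follows. This is a simplified illustration, not Rosenblatt's original setup: it assumes a single threshold unit with binary inputs, a constant first input that plays the role of a bias, and the logical AND function as a made-up, linearly separable training task.

```python
def predict(weights, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def train_perceptron(samples, n_inputs, max_epochs=100):
    """Error correction training: weights change only when an error occurs."""
    weights = [0] * n_inputs
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            error = predict(weights, x) - target  # -1, 0 or +1
            if error != 0:
                errors += 1
                # Shift each active connection by 1 against the sign of the error.
                weights = [w - error * xi for w, xi in zip(weights, x)]
        if errors == 0:  # a full pass without mistakes: training has converged
            break
    return weights


# Logical AND; the first component of every input is the constant 1 (bias).
samples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights = train_perceptron(samples, n_inputs=3)
print(weights, [predict(weights, x) for x, _ in samples])
```

Convergence is guaranteed only when the classes are linearly separable, as they are for AND; for a function such as exclusive or, the loop above would simply run until `max_epochs` is reached.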

However, interest in neural networks decreased significantly after the publication of the book “Perceptrons” by M. Minsky and S. Papert in 1969, where they described the significant limitations that the perceptron model has, in particular, the impossibility of representing the exclusive or function, and also pointed out that the requirements for the required computing power of computers for training neural networks are too high. Since these scientists had very high authority in the scientific community, neural networks were recognized for some time as an unpromising technology. The situation changed only after the creation of the backpropagation algorithm in 1974.

The backpropagation algorithm was proposed in 1974 simultaneously and independently by two scientists, P. Werbos and A. Galushkin. This algorithm is based on the gradient descent method. The main idea of the algorithm is to propagate error information from the network outputs to its inputs, that is, in the direction opposite to the normal flow of signals through the network. The weights of the connections are then adjusted based on the error information that has reached them. The main requirement that this algorithm imposes is that the activation function of the neurons must be differentiable, since gradient descent relies on computing the gradient.
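To make the idea of propagating the error backwards more concrete, here is a minimal sketch for a tiny network with two inputs, two hidden neurons and one output. The sigmoid activation, the squared-error loss, the flat parameter layout, the learning rate and the starting weights are all assumptions made for illustration. The `numeric_gradient` helper is not part of backpropagation itself; it is included only as a sanity check that the analytic gradients are right.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unpack(p):
    """Split the flat list of 9 parameters into weights and biases."""
    return [p[0:2], p[2:4]], p[4:6], p[6:8], p[8]


def forward(p, x):
    """Forward pass: hidden activations and the network output."""
    W1, b1, W2, b2 = unpack(p)
    hidden = [sigmoid(row[0] * x[0] + row[1] * x[1] + b) for row, b in zip(W1, b1)]
    output = sigmoid(W2[0] * hidden[0] + W2[1] * hidden[1] + b2)
    return hidden, output


def loss(p, x, target):
    return 0.5 * (forward(p, x)[1] - target) ** 2


def backprop(p, x, target):
    """Gradient of the loss for one example, computed from output to input."""
    _, _, W2, _ = unpack(p)
    hidden, output = forward(p, x)
    # Error signal at the output neuron; sigmoid'(z) equals y * (1 - y).
    delta_out = (output - target) * output * (1.0 - output)
    # Error signal passed back to each hidden neuron through its output weight.
    delta_hid = [delta_out * W2[j] * hidden[j] * (1.0 - hidden[j]) for j in range(2)]
    grad_W1 = [delta_hid[j] * x[i] for j in range(2) for i in range(2)]
    grad_W2 = [delta_out * h for h in hidden]
    return grad_W1 + delta_hid + grad_W2 + [delta_out]


def numeric_gradient(p, x, target, eps=1e-6):
    """Central finite differences, used only to check `backprop`."""
    grads = []
    for k in range(len(p)):
        plus = p[:k] + [p[k] + eps] + p[k + 1:]
        minus = p[:k] + [p[k] - eps] + p[k + 1:]
        grads.append((loss(plus, x, target) - loss(minus, x, target)) / (2 * eps))
    return grads


# Gradient descent on exclusive or, the function a single perceptron cannot represent.
xor = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
params = [0.5, -0.4, 0.3, 0.8, 0.1, -0.2, 0.7, -0.6, 0.05]  # arbitrary start
for epoch in range(5000):
    for x, t in xor:
        grads = backprop(params, x, t)
        params = [w - 0.5 * g for w, g in zip(params, grads)]
print(sum(loss(params, x, t) for x, t in xor))
```

Whether such a small network actually reaches a good solution depends on the starting weights and the learning rate, so no particular final loss is claimed here.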

The backpropagation algorithm makes it easy to train a network that has several hidden layers, which makes it possible to bypass the limitations of the perceptron that had previously blocked the development of this field. From a mathematical point of view, this algorithm comes down to sequential matrix multiplication, which is a fairly well studied and optimized problem. In addition, this algorithm is highly parallelizable, which can significantly reduce the network training time. All this together led to a new blossoming of neural networks and a lot of active research in this direction.

The backpropagation algorithm, at the same time, has a number of problems. Thus, the use of gradient descent involves the risk of convergence to a local minimum. Another important problem is the long training time of the algorithm in the presence of a large number of layers, since the error in the backpropagation process tends to decrease more and more as it approaches the beginning of the network; accordingly, training of the initial layers of the network will occur extremely slowly. Another disadvantage inherent to neural networks in general is the difficulty in interpreting the results of their work. The trained model of a neural network is something like a black box, the input of which is an object and the output is a forecast, but determining which features of the input object were taken into account and which neuron is responsible for what is usually quite problematic. This makes neural networks in many ways less attractive compared to, for example, decision trees, in which the trained model itself represents some quintessence of knowledge about the subject area under consideration and it is easy for the researcher to understand why a given object was assigned to a particular class.
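The shrinking of the error as it travels towards the first layers can be illustrated with a rough calculation. The sketch below assumes a deliberately simplified chain with one sigmoid neuron per layer, all weights equal to 1 and all pre-activations equal to 0, which is the most favourable case for the sigmoid, since its derivative never exceeds 0.25. The function name `backpropagated_factor` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backpropagated_factor(depth, weight=1.0, z=0.0):
    """Factor by which an error signal is scaled after passing back
    through `depth` layers of the simplified chain described above."""
    return (weight * sigmoid_derivative(z)) ** depth


for depth in (1, 2, 5, 10):
    print(depth, backpropagated_factor(depth))
```

Even in this best case the factor falls below one millionth after ten layers, so the weights of the initial layers receive almost no correction.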

Neural networks showed good results, but these results were comparable to those of other classifiers, for example, the increasingly popular support vector machines, whose results were much easier to interpret and whose training required less time. Together with the shortcomings described above, this led to another decline in the development of neural networks.

This decline ended only in the 2000s, when the concept of deep learning appeared and began to spread. The revival of neural networks was facilitated by the emergence of new architectures, such as convolutional networks, restricted Boltzmann machines, stacked autoencoders, etc., which made it possible to achieve significantly better results in such areas of machine learning as image and speech recognition. A significant factor in their development was also the emergence and spread of powerful video cards and their use for computing tasks. Video cards, which have a significantly larger number of cores than a processor, albeit each of lower power, are ideal for training neural networks. This, combined with the recent significant increase in computer performance in general and the proliferation of computing clusters, has made it possible to train significantly more complex and deeper neural network architectures than before.

1.3 Deep learning

One of the most important problems faced when using machine learning algorithms is the problem of choosing the right features on which to train. This problem becomes especially significant when considering tasks such as image recognition, speech recognition, natural language processing and the like, that is, those where there is no obvious set of features that can be used for training. Typically, the selection of a set of features for training is carried out by the researcher himself through some analytical work, and it is the selected set of features that largely determines the success of the algorithm. So, for the task of image recognition, such features can be the predominant color in the image, the degree of its change, the presence of clear boundaries in the image, or something else. The issue of image recognition and choosing the right features for this will be discussed in more detail in the corresponding chapter.

However, this approach has significant drawbacks. Firstly, it involves a significant amount of work to identify features, and this work is carried out manually by the researcher and can be time-consuming. Secondly, finding features on the basis of which a high-quality algorithm can be obtained becomes largely a matter of chance; moreover, features that are important to the internal structure of the image but are not obvious to humans are unlikely to be taken into account. Thus, the idea of automatically identifying features that can later be used by machine learning algorithms looks especially attractive. And this is precisely the opportunity provided by the deep learning approach.

From the point of view of machine learning theory, deep learning is a subset of the so-called representation learning. The main concept of representation learning is precisely the automatic search for features, on the basis of which some algorithm, for example, classification, will work in the future.

On the other hand, another important problem that one has to face when using machine learning is the presence of factors of variation that can have a significant impact on the appearance of the source data but are not related to the essence that the researcher is trying to analyze. Thus, in an image recognition task, such factors may be the angle at which the object in the image is turned towards the observer, the time of day, lighting, etc. So, depending on the point of view and the weather, a red car may have a different shade and shape in the photo. Therefore, for tasks such as identifying an object in a photograph, it seems reasonable to consider not specific low-level facts, such as the color of a particular pixel, but higher-level abstract characteristics, such as the presence of wheels. However, it is obvious that determining from the original image whether it contains wheels is a non-trivial task, and solving it directly can be very difficult. Moreover, the presence of wheels is only one of a huge number of possible features, and identifying them all and writing algorithms to check an image for their presence does not seem very realistic. This is where researchers can take full advantage of the deep learning approach. Deep learning is based on representing the original object as a hierarchical structure of features, in such a way that each next level of features is built on the basis of elements of the previous level. So, if we are talking about images, the lowest level will be the original pixels of the image, the next level will be the line segments that can be distinguished among these pixels, then the corners and other geometric shapes into which the segments combine. At the next level, these shapes form objects that are already recognizable to humans, for example, wheels, and finally, the last level of the hierarchy is responsible for specific objects in the image, for example, a car.

To implement the deep learning approach, multilayer neural networks of various architectures are used. Neural networks are well suited to the problem of extracting a hierarchical set of features from data, since, in essence, a neural network is a collection of neurons, each of which is activated only if the input data meets certain criteria, that is, it represents a certain feature, while the rules for neuron activation, which determine this feature, are learned automatically. At the same time, neural networks in their most common form themselves represent a hierarchical structure, where each next layer of neurons uses the outputs of the neurons of the previous layer as its input, or, in other words, higher-level features are formed on the basis of lower-level features.

The spread of this approach and, in connection with this, the next flowering of neural networks, was due to three interrelated reasons:

· The emergence of new neural network architectures tailored to solve specific problems (convolutional networks, Boltzmann machines, etc.)

· Development and availability of GPU computing and of parallel computing in general

· The emergence and spread of the layer-by-layer training approach for neural networks. In this approach, each layer is first trained separately using the standard backpropagation algorithm (usually on unlabeled data, that is, in essence, an autoencoder is trained), which makes it possible to identify significant features at that level. Then all layers are combined into a single network, and the network is further trained on labeled data to solve a specific problem (fine-tuning). This approach has two significant advantages. Firstly, the efficiency of network training is significantly increased, since at each moment it is not the deep structure that is trained but a network with one hidden layer. As a result, the problems of error values shrinking as the network depth increases, and the corresponding decrease in the learning rate, disappear. Secondly, this approach makes it possible to use unlabeled data for training, which is usually much more plentiful than labeled data, and this makes network training simpler and more accessible for researchers. Labeled data is required only at the very end to fine-tune the network for a specific classification problem. Since the general structure of the features describing the data has already been created during the previous training, significantly less data is required for fine-tuning than for the initial training that identifies the features. In addition to reducing the required amount of labeled data, this approach makes it possible to train the network once on a large amount of unlabeled data and then reuse the resulting feature structure for various classification problems, refining the network on different data sets in much less time than would be required to fully train the network every time.

Let's look in a little more detail at the basic neural network architectures commonly used in the context of deep learning.

· Multilayer perceptron - a regular fully connected neural network with a large number of layers. The question of how many layers count as many does not have a clear answer, but networks with 5-7 layers are usually already considered “deep”. Although this architecture does not differ fundamentally from the networks used before the concept of deep learning spread, it can be very effective if the problem of training it, which used to be the main difficulty in working with such networks, is successfully solved. Currently, this problem is solved either by using graphics cards for network training, which allows faster training and, accordingly, a greater number of training iterations, or by the layer-by-layer training mentioned earlier. Thus, in 2012, Ciresan and colleagues published the article “Deep big multilayer perceptrons for digit recognition”, in which they suggested that a multilayer perceptron with a large number of layers, given sufficient training (which is achieved in a reasonable time using parallel computing on a GPU) and a sufficient amount of training data (which is achieved by applying various random transformations to the original data set), can perform no worse than other, more complex models. Their model, a neural network with 5 hidden layers, showed an error rate of 0.35% when classifying digits from the MNIST dataset, which is better than previously published results from more complex models. Also, by combining several networks trained in this way into a single model, they managed to reduce the error rate to 0.31%. Thus, despite its apparent simplicity, the multilayer perceptron is a quite successful representative of deep learning algorithms.

· Stacked autoencoder - this model is closely related to the multilayer perceptron and, in general, to the task of training deep neural networks. It is with the help of a stacked autoencoder that layer-by-layer training of deep networks is implemented. However, this model is used not only for training other models but often has great practical significance in itself. To describe the essence of a stacked autoencoder, let's first consider the concept of a regular autoencoder. An autoencoder is an unsupervised learning algorithm in which the expected output values of a neural network are its own input values. The autoencoder model is shown schematically in Figure 2:

Figure 2. Classic autoencoder

Obviously, the task of training such a model has a trivial solution if the number of neurons in the hidden layer is equal to the number of input neurons: the hidden layer then just needs to pass its input values on to the output. Therefore, when training autoencoders, additional restrictions are introduced: for example, the number of neurons in the hidden layer is set to be significantly smaller than in the input layer, or special regularization techniques are used that aim for a high degree of sparsity in the hidden layer neurons. One of the most common uses of pure autoencoders is obtaining a compressed representation of the original data. For example, an autoencoder with 30 neurons in the hidden layer, trained on the MNIST dataset, can restore the original images of digits on the output layer practically without changes, which means that each of the original images can be described quite accurately by only 30 numbers. In this application, autoencoders are often considered as an alternative to principal component analysis. A stacked autoencoder is essentially a combination of several ordinary autoencoders trained layer by layer. In this case, the output values of the trained hidden-layer neurons of the first autoencoder act as input values for the second one, and so on.

· Convolutional networks are one of the most popular deep learning models recently, used primarily for image recognition. The concept of convolutional networks is built on three main ideas:

o Local sensitivity (local receptive fields) - if we talk about the problem of image recognition, this means that the recognition of an element in an image should primarily be influenced by its immediate surroundings, while pixels located in another part of the image most likely are not associated with this element in any way and do not contain information that would help to correctly identify it

o Shared weights - the presence of shared weights in a model actually represents the assumption that the same object can be found in any part of the image, while the same pattern (set of weights) is used to search for it in all parts of the image

o Subsampling is a concept that allows you to make a model more resistant to minor deviations from the desired pattern - including those associated with minor deformations, changes in lighting, etc. The idea of subsampling is that when matching a pattern, it is not the exact value for a given pixel or region of pixels that is taken into account, but its aggregation in a certain neighborhood, for example, the average or maximum value.

From a mathematical point of view, the basis of convolutional neural networks is the operation of matrix convolution, which consists of element-wise multiplication of a matrix representing a small area of the original image (for example, 7 * 7 pixels) with a matrix of the same size, called the convolution kernel, and subsequent summation of the resulting values. In this case, the convolution kernel is essentially a certain template, and the number obtained as a result of the summation characterizes the degree of similarity of a given area of the image to this template. Accordingly, each layer of the convolutional network consists of a number of templates, and the task of training the network is to select the correct values in these templates so that they reflect the most significant characteristics of the original images. Each template is compared sequentially with all parts of the image, and this is where the idea of shared weights finds expression. Layers of this type in a convolutional network are called convolution layers. In addition to convolution layers, convolutional networks contain subsampling layers that replace small regions of the image with a single number, thereby simultaneously reducing the amount of data the next layer has to work with and making the network more robust to small changes in the data. The last layers of a convolutional network are usually one or more fully connected layers that are trained to classify objects directly. In recent years, the use of convolutional networks has become the de facto standard in image classification and achieves the best results in this area.

· Restricted Boltzmann Machines are another type of deep learning model, used, unlike convolutional networks, primarily for the task of speech recognition. A Boltzmann machine in its classical sense is an undirected graph in which the edges reflect the dependencies between nodes (neurons). Some neurons are visible, and some are hidden. From the point of view of neural networks, the Boltzmann machine is essentially a recurrent neural network; from the point of view of statistics, it is a Markov random field. Important concepts for Boltzmann machines are network energy and equilibrium states. The energy of the network depends on how many strongly interconnected neurons are simultaneously in an activated state, and the task of training such a network is to converge to an equilibrium state in which its energy is minimal. The main disadvantage of such networks is the great difficulty of training them in the general case. To solve this problem, G. Hinton and his colleagues proposed the model of Restricted Boltzmann Machines, which imposes restrictions on the structure of the network, representing it as a bipartite graph in which one part contains only visible neurons and the other only hidden ones, so that connections are present only between visible and hidden neurons. This restriction made it possible to develop effective algorithms for training networks of this type, due to which significant progress was made in solving speech recognition problems, where this model practically replaced the previously popular hidden Markov models.
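Returning to the convolutional networks described in the list above, the two operations at their core, convolution with a shared kernel and subsampling, can be sketched in plain Python. The tiny 5 * 5 image, the 2 * 2 edge-detecting kernel and the choice of the maximum as the aggregation are assumptions made for illustration; in a real network the kernel values are learned rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and, at each position,
    sum the element-wise products of the kernel and the image region.
    As is usual for convolutional networks, the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Subsampling: replace each non-overlapping size x size region
    with its maximum value."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Template for a vertical dark-to-bright edge; the same weights are
# applied at every position of the image (shared weights).
kernel = [[-1, 1],
          [-1, 1]]

image = [[0, 0, 1, 1, 1] for _ in range(5)]    # edge between columns 1 and 2
shifted = [[0, 1, 1, 1, 1] for _ in range(5)]  # same edge, one pixel to the left

print(max_pool(convolve2d(image, kernel)))
print(max_pool(convolve2d(shifted, kernel)))
```

The feature maps of the two images differ, but after subsampling they coincide, which illustrates how subsampling makes the network less sensitive to small shifts of the pattern.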

Now, having examined the basic concepts and principles of deep learning, let us briefly consider the basic principles and evolution of the development of image recognition and the place deep learning occupies in it.

1.4 Image recognition

There are many formulations for the image recognition problem, and it is quite difficult to define it unambiguously. For example, one can consider image recognition as the task of searching and identifying certain logical objects in the original image.

Image recognition is usually a challenging task for a computer algorithm. This is due, first of all, to the high variability of images of individual objects. For example, finding a car in an image is simple for the human brain, which can automatically identify features that are important for a car (wheels, a specific shape) and, if necessary, complete the picture in the imagination by filling in the missing details. For a computer the same task is extremely difficult, since there is a huge variety of cars of different brands and models with largely different shapes. In addition, the final shape of the object in the image greatly depends on the shooting point, the angle at which it is photographed and other parameters. Lighting also plays an important role, as it affects the colors of the resulting image and can make individual details invisible or distort them.

Thus, the main difficulties in image recognition are caused by:

· Variability of objects within a class

· Variability of shape, size, orientation and position in the image

· Variability of lighting

To combat these difficulties, a variety of methods have been proposed throughout the history of image recognition, and significant progress has already been made in this area.

The first research in the field of image recognition was published in 1963 by L. Roberts in the article “Machine Perception Of Three-Dimensional Solids”, where the author set aside possible changes in the shape of an object and concentrated on recognizing images of simple geometric shapes under different lighting conditions and under rotation. The computer program he developed was capable of identifying geometric objects of some simple shapes in an image and generating a three-dimensional model of them on the computer.

In 1987, S. Ullman and D. Huttenlocher published the article “Object Recognition Using Alignment”, where they also attempted to recognize objects of relatively simple shapes. The recognition process was organized in two stages: first, finding the area of the image where the target object is located and determining its possible size and orientation (“alignment”) using a small set of characteristic features, and then comparing the potential image of the object with the expected one pixel by pixel.

However, pixel-by-pixel comparison of images has many significant disadvantages, such as its complexity, the need to have a template for each of the objects of possible classes, and also the fact that in the case of pixel-by-pixel comparison, only a search for a specific object can be carried out, and not for an entire class of objects. In some situations this is applicable, but in most cases it is still necessary to search not for one specific object, but for many objects of some class.

One of the important directions in the further development of image recognition has become image recognition based on contour identification. In many cases, it is the contours that contain most of the information about the image, and at the same time, considering the image as a set of contours allows it to be significantly simplified. To solve the problem of finding edges in an image, the classic and most well-known approach is the Canny Edge Detector, whose operation is based on searching for a local gradient maximum.

Another important direction in the field of image analysis is the application of mathematical methods such as frequency filtering and spectral analysis. These methods are used, for example, to compress images (JPEG compression) or to improve their quality (Gaussian filter). However, since these methods are not directly related to image recognition, they will not be discussed in more detail here.

Another task that is often considered in connection with the problem of image recognition is the segmentation problem. The main goal of segmentation is to separate out individual objects in an image, each of which can then be studied and classified separately. The segmentation task is greatly simplified if the source image is binary, that is, if it consists of pixels of only two colors. In this case, the segmentation problem is often solved using methods of mathematical morphology. The essence of these methods is to represent an image as a set of binary values and apply logical operations to this set, the main ones being translation, dilation (logical addition) and erosion (logical multiplication). Using these operations and their derivatives, such as closing and opening, it becomes possible, for example, to eliminate noise in an image or to highlight boundaries. When such methods are used for segmentation, their most important task is precisely to eliminate noise and form more or less homogeneous areas in the image, which can then be easily found using algorithms similar to searching for connected components in a graph. These areas are the desired segments of the image.
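The morphological approach to segmentation described above can be sketched as follows. This is an illustrative example only: the 3 × 3 square structuring element, the treatment of pixels outside the image as background, the use of 4-connectivity and all function names are assumptions, not details taken from the text.

```python
from collections import deque


def _window(img, r, c):
    """Values in the 3x3 window around (r, c); pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    return [img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
            for rr in range(r - 1, r + 2) for cc in range(c - 1, c + 2)]


def erode(img):
    """Erosion (logical multiplication): a pixel stays 1 only if its whole window is 1."""
    return [[int(all(_window(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def dilate(img):
    """Dilation (logical addition): a pixel becomes 1 if any pixel in its window is 1."""
    return [[int(any(_window(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def opening(img):
    """Opening = erosion followed by dilation; removes small isolated noise."""
    return dilate(erode(img))


def count_segments(img):
    """Count 4-connected regions of 1-pixels with a breadth-first search."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    segments = 0
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                segments += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return segments


# A 3x3 object plus one isolated noise pixel at row 3, column 5.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(noisy)
```

Before opening, the connected-component search finds two segments (the object and the noise pixel); after opening only the object remains.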

Regarding RGB image segmentation, one of the important sources of information about image segments can be its texture. To determine the texture of an image, a Gabor filter is often used, which was created in an attempt to reproduce the features of texture perception by human vision. The operation of this filter is based on the frequency conversion function of the image.

Another important family of algorithms used for image recognition is based on local feature search. Local features are clearly distinguishable areas of the image that make it possible to match the image against a model (the desired object), determine whether the image corresponds to the model and, if it does, determine the model parameters (for example, tilt angle, applied compression, etc.). To perform their function efficiently, local features must be robust to affine transformations, shifts, etc. A classic example of local features is corners, which are often present at the boundaries of various objects. The most popular algorithm for finding corners is the Harris detector.

Recently, image recognition methods based on neural networks and deep learning have become increasingly popular. These methods began to flourish after the emergence of convolutional networks (LeCun, 2015) at the end of the 20th century, which show significantly better results in image recognition than other methods. For example, most of the algorithms in the annual ImageNet-2014 image recognition competition, including the leading ones, used convolutional networks in one form or another.

1.5 Road sign recognition

Recognition of road signs in general is one of the many tasks of recognizing images or, in some cases, video recordings. This task is of great practical importance, since road sign recognition is used, for example, in driving automation programs. The task of recognizing road signs has many variations - for example, identifying the presence of road signs in a photograph, highlighting an area in an image that represents a road sign, determining which specific sign is depicted in a photograph that is obviously an image of a road sign, etc. Typically, there are three global tasks associated with the recognition of road signs - their identification among the surrounding landscape, direct recognition, or classification, and the so-called tracking - this implies the ability of the algorithm to “follow”, that is, to keep the road sign in focus in the video sequence. Each of these subtasks in itself is a separate subject for research and usually has its own circle of researchers and traditional approaches. In this work, attention was focused on the problem of classifying the road sign depicted in the photograph, so we will consider it in more detail.

This problem is a classification problem for frequency imbalanced classes. This means that the probability of an image belonging to different classes is different, since some classes are more common than others - for example, on Russian roads the “40” speed limit sign is much more common than the “No through passage” sign. In addition, road signs form several groups of classes such that the classes within one group are very similar to each other - for example, all speed limit signs look very similar and differ only in the numbers within them, which, of course, significantly complicates the classification task. On the other hand, road signs have a clear geometric shape and a small set of possible colors, which could greatly simplify the classification procedure - if not for the fact that real photographs of road signs can be taken from different angles and under different lighting. Thus, the task of classifying road signs, although it can be considered as a typical image recognition task, requires a special approach to achieve the best result.

Until a certain point in time, research on this topic was quite chaotic and fragmented, since each researcher set his own tasks and used his own data set, so it was not possible to compare and generalize the existing results. For example, in 2005 Bahlmann and colleagues, as part of a comprehensive road sign recognition system that supports all 3 previously mentioned subtasks of road sign recognition, implemented a sign recognition algorithm that works with an accuracy of 94% for road signs belonging to 23 different classes. Training was carried out on 40,000 images, with the number of images per class varying from 30 to 600. To detect road signs, this system used the AdaBoost algorithm and Haar wavelets, and to classify the detected signs it used an approach based on the Expectation Maximization algorithm. The speed limit sign recognition system developed by Moutarde in 2007 had an accuracy of up to 90% and was trained on a set of 281 images. This system used circle and square detectors to detect traffic signs in images (for European and American signs, respectively), after which each digit was extracted and classified using a neural network. In 2010, Ruta and colleagues developed a system to detect and classify 48 different types of traffic signs with a classification accuracy of 85.3%. Their approach was based on searching images for circles and polygons and identifying a small number of special regions in them that make it possible to distinguish a given sign from all others. A special transformation of image colors, called the Color Distance Transform by the authors, was applied; it reduces the number of colors present in the image, which makes images easier to compare and reduces the size of the processed data. Broggie and Collen in 2007 proposed a three-stage algorithm for detecting and classifying road signs, consisting of color segmentation, shape detection and a neural network, but their publication does not provide quantitative results for the algorithm. Gao et al. in 2006 proposed a traffic sign recognition system based on the analysis of the color and shape of the candidate sign and showed a recognition accuracy of 95% on 98 traffic sign instances.

The fragmentation of research in the field of road sign recognition ended in 2011, when a road sign recognition competition was held as part of the IJCNN (International Joint Conference on Neural Networks) conference. For this competition, the GTSRB (German Traffic Sign Recognition Benchmark) dataset was developed, containing more than 50,000 images of traffic signs located on German roads and belonging to 43 different classes. The competition based on this data set consisted of two stages. Following the second stage, the article “Man vs. Computer: Benchmarking Machine Learning Algorithms for Traffic Sign Recognition” was published, which provides an overview of the competition results and a description of the approaches used by the most successful teams. In the wake of this event, a number of articles were also published by the competition participants themselves, and this data set subsequently became the basic benchmark for algorithms related to the recognition of road signs, similar to the well-known MNIST for recognizing handwritten digits.

The most successful algorithms in this competition include the Committee of Convolutional Networks (IDSIA team), Multi-Scale CNN (Sermanet team) and Random Forests (CAOR team). Let's look at each of these algorithms in a little more detail.

The neural network committee proposed by the IDSIA team from the Italian Dalle Molle Institute for Artificial Intelligence Research, led by D. Ciresan, achieved a sign classification accuracy of 99.46%, which is higher than human accuracy (99.22%) as measured in the same competition. This algorithm was subsequently described in more detail in the article “Multi-Column Deep Neural Network for Traffic Sign Classification”. The main idea of the approach is that 4 different normalization methods were applied to the source data: Image Adjustment, Histogram Equalization, Adaptive Histogram Equalization, and Contrast Normalization. Then, for each data set obtained as a result of normalization, as well as for the original data set, 5 convolutional networks of 8 layers each, with randomly initialized weights, were built and trained. During training, various random transformations were applied to the network inputs, which made it possible to increase the size and variability of the training sample. The final prediction was formed by averaging the predictions of the individual convolutional networks. These networks were trained using a GPU-based implementation.
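The averaging step of such a committee can be sketched as follows. The probability vectors below are made-up illustrative values for three networks and three classes, not results from the cited work, and the function names are hypothetical.

```python
def average_predictions(predictions):
    """Average the per-class probability vectors produced by several networks."""
    n_models = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / n_models for k in range(n_classes)]


def predict_class(predictions):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_predictions(predictions)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical outputs of three networks for one image (each row sums to 1).
outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
committee_class = predict_class(outputs)
```

In this toy case the second network on its own prefers class 1, but the averaged prediction of the committee is class 0.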

An algorithm using a multi-scale convolutional network was proposed by a team consisting of P. Sermanet and Y. LeCun from New York University. This algorithm was described in detail in the article “Traffic Sign Recognition with Multi-Scale Convolutional Networks”. In this algorithm, all source images were scaled to a size of 32 × 32 pixels and converted to grayscale, after which contrast normalization was applied to them. The size of the original training set was also increased 5 times by applying small random transformations to the original images. The resulting network was composed of two stages, as shown in Figure 3, and the outputs of both the first and the second stage were used in the final classification. This network showed an accuracy of 98.31%.

Figure 3. Multiscale neural network

The third successful algorithm, based on a random forest, was developed by the CAOR team from MINES ParisTech. A detailed description of their algorithm was published in the article “Real-time traffic sign recognition using spatially weighted HOG trees”. This algorithm is based on constructing a forest of 500 random decision trees, each of which is trained on a randomly selected subset of the training set; the final output of the classifier is the class that received the most votes. Unlike the previous classifiers, this one did not use the original images as sets of pixels, but rather the HOG (histogram of oriented gradients) representations of the images provided by the organizers of the competition along with them. The final result of the algorithm was 96.14% of correctly classified images, which shows that methods not related to neural networks and deep learning can also be used quite successfully for road sign recognition, although their effectiveness still lags behind that of convolutional networks.

1.6 Analysis of existing libraries

To implement algorithms for working with neural networks in the system being developed, it was decided to use one of the existing libraries. Therefore, an analysis of existing software solutions for implementing deep learning algorithms was carried out, and based on the results of this analysis, a choice was made. The analysis of existing solutions consisted of two phases: theoretical and practical.

During the theoretical phase, libraries such as Deeplearning4j, Theano, Pylearn2, Torch and Caffe were considered. Let's look at each of them in more detail.

· Deeplearning4j (www.deeplearning4j.org) is an open source library for implementing neural networks and deep learning algorithms, written in Java. It can be used from Java, Scala and Clojure, and supports integration with Hadoop, Apache Spark, Akka and AWS. The library is developed and maintained by Skymind, which also provides commercial support for it. Internally it uses ND4J, a library for fast work with n-dimensional arrays developed by the same company. Deeplearning4j supports many types of networks, including multilayer perceptrons, convolutional networks, Restricted Boltzmann Machines, Stacked Denoising Autoencoders, Deep Autoencoders, Recursive autoencoders, Deep-belief networks, recurrent networks and some others. An important feature of this library is its ability to work in a cluster. The library also supports training networks using GPUs.

· Theano (www.github.com/Theano/Theano) is an open source Python library that allows you to efficiently create, evaluate, and optimize mathematical expressions using multidimensional arrays. The NumPy library is used to represent multidimensional arrays and operations on them. Theano is intended primarily for scientific research and was created by a group of scientists from the University of Montreal. Its capabilities are very wide, and working with neural networks is only a small part of them. At the same time, this library is the most popular one and is the one most often mentioned when it comes to working with deep learning.

· Pylearn2 (www.github.com/lisa-lab/pylearn2) is an open source Python library built on top of Theano that provides a simpler and more convenient interface for researchers, a ready-made set of algorithms, and simple configuration of networks in YAML files. It is developed by a group of scientists from the LISA laboratory at the University of Montreal.

· Torch (www.torch.ch) is a library for computing and implementing machine learning algorithms, implemented in C, but allowing researchers to use the much more convenient Lua scripting language to work with it. This library provides its own efficient implementation of operations on matrices, multidimensional arrays, and supports GPU calculations. Allows you to implement fully connected and convolutional networks. It is open source.

· Caffe (www.caffe.berkeleyvision.org) is a library focused on the efficient implementation of deep learning algorithms, developed primarily by the Berkeley Vision and Learning Center; like all the previous ones, it is open source. The library is implemented in C++, but also provides a convenient interface for Python and Matlab. It supports fully connected and convolutional networks, allows networks to be described as a set of layers in the .prototxt format, and supports GPU calculations. The advantages of the library also include a large number of pre-trained models and examples, which, combined with its other characteristics, makes it the easiest of the above libraries to get started with.

Based on a set of criteria, 3 libraries were selected for further consideration: Deeplearning4j, Theano and Caffe. These 3 libraries have been installed and tested in practice.

Among these libraries, Deeplearning4j turned out to be the most problematic to install. In addition, errors were discovered in the demo examples supplied with the library, which raised questions about its reliability and made it extremely difficult to study it further. Taking into account also the lower performance of Java compared to C++, in which Caffe is implemented, it was decided to abandon further consideration of this library.

The Theano library also turned out to be quite difficult to install and configure, but it has a large amount of high-quality, well-structured documentation and examples of working code, so in the end we managed to get the library working, including on a graphics card. However, as it turned out, implementing even an elementary neural network in this library requires writing a large amount of your own code, and describing and modifying the network structure is correspondingly difficult. Therefore, despite the potentially much broader capabilities of this library compared with Caffe, it was decided to focus on the latter in this study, as it is the most appropriate for the tasks set.

1.7 The Caffe library

The Caffe library provides a fairly simple and researcher-friendly interface, allowing you to easily configure and train neural networks. To work with the library, you need to create a network description in the prototxt format (protocol buffer definition file - a data description language created by Google), which is somewhat similar to the JSON format, well structured and human-readable. The description of the network is essentially a description of each of its layers in turn. The library can work with a database (leveldb or lmdb), in-memory data, HDF5 files and images as input data. It is also possible to use a special type of data called DummyData for development and testing purposes.

The library supports the creation of layers of the following types: InnerProduct (fully connected layer), Splitting (converts data for transmission to several output layers at once), Flattening (converts data from a multidimensional matrix into a vector), Reshape (allows you to change the dimensions of the data), Concatenation (combines data from several input layers into one output), Slicing and several others. For convolutional networks, special types of layers are also supported: Convolution (convolution layer), Pooling (subsampling layer) and Local Response Normalization (layer for local data normalization). In addition, several types of loss functions used in network training (Softmax, Euclidean, Hinge, Sigmoid Cross-Entropy, Infogain and Accuracy) and neuron activation functions (Rectified-Linear, Sigmoid, Hyperbolic Tangent, Absolute Value, Power and BNLL) are supported, which are also configured as separate network layers.

Thus, the network is described declaratively in a fairly simple form. Examples of network configurations used in this study can be seen in Appendix 1. Also, for the library to work using standard scripts, it is necessary to create a solver.prototxt file, which describes the network training configuration - the number of iterations for training, learning rate, computing platform - cpu or gpu etc.

Model training can be implemented using built-in scripts (after they have been modified for the current task) or manually, by writing code using the provided API in Python or Matlab. There are scripts that allow you not only to train the network, but also, for example, to create a database from a provided list of images; in this case, the images are brought to a fixed size and normalized before being added to the database. The scripts used for training also encapsulate some auxiliary actions: for example, they evaluate the current accuracy of the model after a certain number of iterations and save the current state of the trained model to a snapshot file. Snapshot files make it possible to continue training the model later instead of starting over, if such a need arises. They also make it possible to change the configuration of the model after a certain number of iterations, for example by adding a new layer, while the weights of previously trained layers retain their values, which allows the previously described layer-by-layer learning mechanism to be implemented.

In general, the library turned out to be quite convenient to use and allowed us to implement all the desired models, as well as obtain classification accuracy values for these models.

2. Development of a prototype image recognition system

2.1 Image classification algorithm

In the course of studying theoretical material on the topic and practical experiments, the following set of ideas was formed that should be embodied in the final algorithm:

· Using deep convolutional neural networks. Convolutional networks consistently show the best results in image recognition, including road signs, so their use in the developed algorithm seems logical

· Use of multilayer perceptrons. Despite the generally greater efficiency of convolutional networks, there are types of images for which a multilayer perceptron shows better results, so it was decided to use this algorithm as well

· Combining the results of several models using an additional classifier. Since it has been decided to use at least two types of neural networks, a way is required to generate some overall classification result based on the results of each of them. For this, it is planned to use an additional classifier not associated with neural networks, the input values for which are the classification results of each of the networks, and the output is the final predicted image class

· Applying additional transformations to input data. To increase the suitability of input images for recognition and, accordingly, improve the performance of the classifier, several types of transformations must be applied to the input data, and the results of each of them must be processed by a separate network trained to recognize images with this type of transformation.

Based on all the above ideas, the following concept of an image classifier was formed. The classifier is an ensemble of 6 neural networks operating independently: 2 multilayer perceptrons and 4 convolutional networks. Networks of the same type differ from each other in the type of transformation applied to the input data. The input data is scaled so that each network always receives input of the same size, although this size may differ between networks. To aggregate the results of all networks, an additional classical classifier is used, of which 2 options were tried: the J48 algorithm, based on a decision tree, and the kStar algorithm, which is a “lazy” classifier. Transformations used in the classifier:

· Binarization - the image is replaced with a new one, consisting of pixels of only black and white colors. To perform binarization, the adaptive thresholding method is used. The essence of the method is that for each pixel of the image, the average value of a certain neighborhood of its pixels is calculated (it is assumed that the image contains only shades of gray; for this, the original images were previously converted accordingly), and then, based on the calculated average value, it is determined whether the pixel should be considered black or white.

· Histogram equalization - the essence of the method is to apply a function to the histogram of an image such that the values in the resulting histogram are distributed as evenly as possible. The target function is calculated from the intensity distribution function of the original image. An example of applying such a function to an image histogram is shown in Figure 4. This method can be used both for grayscale images and for color images, in the latter case separately for each color component. Both options were used in this study.

Figure 4. Results of applying histogram equalization to an image

· Contrast enhancement - for each pixel of the image, a local minimum and maximum are found in some neighborhood of the pixel, and the pixel is then replaced by the local maximum if its original value is closer to the maximum, or by the local minimum otherwise. It is applied to grayscale images.
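The three transformations listed above can be sketched in plain Python for a grayscale image stored as a list of rows of integers in the range 0-255. This is a minimal illustration rather than the implementation used in the study: the neighbourhood radius, the handling of image borders, the exact equalization formula and all function names are assumptions.

```python
def local_window(img, r, c, radius):
    """Pixel values in the square neighbourhood of (r, c), clipped to the image."""
    h, w = len(img), len(img[0])
    return [img[rr][cc]
            for rr in range(max(0, r - radius), min(h, r + radius + 1))
            for cc in range(max(0, c - radius), min(w, c + radius + 1))]


def binarize_adaptive(img, radius=1):
    """Adaptive thresholding: white (255) if the pixel is brighter than its local mean."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            window = local_window(img, r, c, radius)
            mean = sum(window) / len(window)
            row.append(255 if img[r][c] > mean else 0)
        out.append(row)
    return out


def equalize_histogram(img, levels=256):
    """Map intensities through the normalised cumulative histogram of the image."""
    hist = [0] * levels
    for row in img:
        for value in row:
            hist[value] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(v for v in cdf if v > 0)
    if total == cdf_min:  # single-intensity image: nothing to equalise
        return [row[:] for row in img]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[value] - cdf_min) * scale) for value in row] for row in img]


def enhance_contrast(img, radius=1):
    """Replace each pixel by the local maximum if it is closer to it, else by the local minimum."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            window = local_window(img, r, c, radius)
            lo, hi = min(window), max(window)
            row.append(hi if hi - img[r][c] < img[r][c] - lo else lo)
        out.append(row)
    return out


# Toy 3x3 grayscale image: dark on the left, bright on the right.
gray = [
    [50, 60, 200],
    [55, 65, 210],
    [52, 58, 205],
]
binary = binarize_adaptive(gray)
equalized = equalize_histogram(gray)
enhanced = enhance_contrast(gray)
```

For a color image, `equalize_histogram` would be applied to each color component separately, as described in the text.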

The overall structure of the resulting classifier is shown in Figure 5.

Figure 5. Final classifier diagram

To implement the part of the model responsible for transforming input data and neural networks, the Python language and the Caffe library are used. Let us describe the structure of each of the networks in more detail.

Both multilayer perceptrons contain 4 hidden layers, and their overall configuration is described as follows:

· Input layer

· Layer 1, 1500 neurons

· Layer 2, 500 neurons

· Layer 3, 300 neurons

· Layer 4, 150 neurons

· Output layer
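To make the layer list above more concrete, the following sketch generates a fragment of a Caffe network description for a perceptron with these layer sizes. It is only an illustration under stated assumptions: the layer names, the input blob name `data` and the ReLU activation are hypothetical, the data and loss layers are omitted, and the output size of 43 is taken from the number of GTSRB classes. The actual configuration used in the study is the one given in Appendix 1.

```python
HIDDEN_SIZES = [1500, 500, 300, 150]  # hidden layer sizes from the list above
NUM_CLASSES = 43                      # number of classes in GTSRB


def inner_product_layer(name, bottom, num_output):
    """Text of one fully connected (InnerProduct) layer in prototxt syntax."""
    return (
        'layer {\n'
        f'  name: "{name}"\n'
        '  type: "InnerProduct"\n'
        f'  bottom: "{bottom}"\n'
        f'  top: "{name}"\n'
        f'  inner_product_param {{ num_output: {num_output} }}\n'
        '}\n'
    )


def relu_layer(name, blob):
    """Text of an in-place ReLU activation layer (the activation is an assumption)."""
    return (
        'layer {\n'
        f'  name: "{name}"\n'
        '  type: "ReLU"\n'
        f'  bottom: "{blob}"\n'
        f'  top: "{blob}"\n'
        '}\n'
    )


def build_mlp_prototxt(hidden_sizes, num_classes, input_blob="data"):
    """Chain the hidden layers and the output layer into one description."""
    parts, bottom = [], input_blob
    for index, size in enumerate(hidden_sizes, start=1):
        name = f"fc{index}"
        parts.append(inner_product_layer(name, bottom, size))
        parts.append(relu_layer(f"relu{index}", name))
        bottom = name
    parts.append(inner_product_layer("output", bottom, num_classes))
    return "".join(parts)


mlp_prototxt = build_mlp_prototxt(HIDDEN_SIZES, NUM_CLASSES)
```

Generating the description from a list of sizes is merely a convenience for this sketch; in practice the prototxt file can equally be written by hand.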

An example Caffe configuration file describing this network can be seen in Appendix 1. As for the convolutional networks, their architecture was based on the well-known LeNet network, developed for classifying images from the ImageNet dataset. However, the network was modified to accommodate the images in question, which are significantly smaller. Its brief description looks like this:

The diagram of this network is presented in Figure 6.

Figure 6. Convolutional network diagram

Each of the neural networks that make up the model is trained separately. After the networks have been trained, a special Python script obtains, for each network and each image of the training set, the classification result as a list of class probabilities, selects the two most likely classes, and writes the resulting values to a file together with the true class of the image. The resulting file is then passed as a training set to a classifier (J48 or kStar) implemented in the Weka library. Further classification is performed using this library.
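The step that turns network outputs into a training row for the final classifier might look roughly like the sketch below. It assumes that the indices of the two most likely classes (rather than their probabilities) are written and that the file is comma-separated; the text does not specify either detail, and the function names and sample probabilities are hypothetical.

```python
import csv
import io


def top_two(probabilities):
    """Indices of the two most probable classes, most probable first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda k: probabilities[k], reverse=True)
    return ranked[:2]


def feature_row(per_network_probabilities, true_class):
    """Concatenate the top-2 classes of every network and append the true class."""
    row = []
    for probabilities in per_network_probabilities:
        row.extend(top_two(probabilities))
    row.append(true_class)
    return row


# Hypothetical outputs of two networks for one training image with 4 classes.
network_outputs = [
    [0.05, 0.70, 0.20, 0.05],
    [0.10, 0.30, 0.55, 0.05],
]
row = feature_row(network_outputs, true_class=1)

# Write the row in CSV form; an in-memory buffer stands in for the output file.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
```

With 6 networks, each row would contain 12 predicted class values followed by the true class.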

2.2 System architecture

Now, having considered the algorithm for recognizing road signs using neural networks and an additional classifier, let’s move on directly to the description of the developed system that uses this algorithm.

The developed system is an application with a web interface that allows the user to upload an image of a road sign and obtain a classification result for this sign using the described algorithm. This application consists of 4 modules: web application, neural network module, classification module and administrator interface. A schematic diagram of the interaction of modules is presented in Figure 7.

Figure 7. Diagram of the classification system

The numbers in the diagram indicate the sequence of actions when the user works with the system. The user uploads an image. The request is processed by the web server, and the uploaded image is passed to the neural network module, where all the necessary transformations are applied to it (scaling, changing the color scheme, etc.), after which each of the neural networks generates its own prediction. The control logic of this module then selects the two most likely predictions for each network and returns this data to the web server. The web server passes the network predictions to the classification module, where they are processed and a final answer about the predicted image class is generated; this answer is returned to the web server and from there to the user. The interaction between the user and the web server, and between the web server and the neural network and classification modules, is carried out through REST requests over HTTP. The image is transmitted in multipart form data format, and the results of the classifiers are transmitted in JSON format. This design keeps the modules sufficiently isolated from each other, which allows them to be developed independently, including in different programming languages, and makes it easy to change the logic of one module without affecting the others.
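As an illustration of the JSON exchange between the modules, the sketch below serializes the top-2 predictions of each network and then flattens them again on the receiving side. The field names `predictions`, `network` and `top_classes`, as well as the network names, are hypothetical; the text does not describe the actual message layout, and in the described system the receiving side is implemented in Java rather than Python.

```python
import json


def build_prediction_payload(top_two_by_network):
    """Serialise the top-2 classes of every network into a JSON string."""
    return json.dumps({"predictions": [
        {"network": name, "top_classes": classes}
        for name, classes in top_two_by_network.items()
    ]})


def parse_prediction_payload(payload):
    """Flatten the payload into the feature list passed to the final classifier."""
    document = json.loads(payload)
    features = []
    for entry in document["predictions"]:
        features.extend(entry["top_classes"])
    return features


# Hypothetical predictions of two of the networks for one uploaded image.
payload = build_prediction_payload({
    "mlp_binarized": [1, 2],
    "cnn_equalized": [2, 1],
})
features = parse_prediction_payload(payload)
```

Because the payload is plain JSON, either side of the exchange can be reimplemented in another language without changing the other module.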

To implement the user interface, HTML and JavaScript were used; Java was used to implement the web server and the classification module, and Python was used to implement the neural network module. The appearance of the user interface is shown in Figure 8.

Figure 8. System user interface

The use of this system assumes that the neural network and classification modules already contain trained models. An administrator interface is provided for training the models; it is essentially a set of Python scripts for training the neural networks and a console utility in Java for training the final classifier. These tools are not intended to be used frequently or by non-professional users, so they do not require a more advanced interface.

In general, the developed application successfully fulfills all the tasks assigned to it, including allowing the user to conveniently obtain a class prediction for a selected image. The only question that remains open is the practical performance of the classifier used in this algorithm; it is discussed in Chapter 3.

3. Results of experimental studies

3.1 Initial data

The previously mentioned GTSRB (German Traffic Sign Recognition Benchmark) dataset was used as input data in this study. This dataset consists of 51840 images belonging to 43 classes. The classes are not balanced: the number of images differs from class to class. The distribution of the number of images by class is presented in Figure 9.

Figure 9. Distribution of the number of images by class

The sizes of the input images also vary. The smallest image has a width of 15 pixels, and the largest one has a width of 250 pixels. The general distribution of image sizes is presented in Figure 10.

Figure 10. Image size distribution

The source images are stored in PPM format, that is, as files in which each pixel is represented by three numbers: the intensity values of the red, green and blue color components.

3.2 Data preprocessing

Before starting work, the source data was prepared: converted from PPM to JPEG format, which the Caffe library can work with, randomly divided into training and test sets in an 80:20 ratio, and scaled. The classification algorithm uses images of two sizes, 45*45 (for training the multilayer perceptron on binarized data) and 60*60 (for training the other networks), so instances of both sizes were created for each image of the training and test sets. The previously mentioned transformations (binarization, histogram equalization, contrast enhancement) were also applied to each image, and the resulting images were saved in LMDB (Lightning Memory-Mapped Database), a fast and efficient key-value store. This storage method provides the fastest and most convenient operation of the Caffe library. The Python Imaging Library (PIL) and scikit-image were used to convert the images. Examples of images obtained after each of the transformations are presented in Figure 11. The images stored in the database were later used directly for training the neural networks.

Figure 11. Results of applying transformations to an image
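The transformations described above can be sketched as follows. This is a minimal illustration written with NumPy only, not the PIL and scikit-image code used in the study; the luma weights, the mean-intensity binarization threshold and the function names are assumptions made for the example.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an (H, W, 3) uint8 RGB image to an (H, W) uint8 grayscale image."""
    weights = np.array([0.299, 0.587, 0.114])  # common luma weights (assumed)
    return np.round(rgb.astype(np.float64) @ weights).astype(np.uint8)


def equalize_histogram(gray):
    """Spread the intensities of a uint8 grayscale image over the range 0..255."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # cumulative count at the darkest present level
    total = gray.size
    if total == cdf_min:           # constant image: nothing to equalize
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = lut.clip(0, 255).astype(np.uint8)
    return lut[gray]               # apply the lookup table pixel by pixel


def binarize(gray):
    """Threshold at the mean intensity (assumed rule); result contains 0 and 255."""
    return np.where(gray > gray.mean(), 255, 0).astype(np.uint8)


def split_indices(n, train_fraction=0.8, seed=0):
    """Random split of n sample indices into training and test parts."""
    order = np.random.default_rng(seed).permutation(n)
    cut = round(n * train_fraction)
    return order[:cut], order[cut:]
```

Scaling to 45*45 and 60*60 and writing the results to LMDB are omitted here, since they depend on the image and database libraries used.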

Each of the networks was trained separately and its results were evaluated, and then the final classifier was built and trained. Before this, however, a simple network was built and trained: a perceptron with one hidden layer. This network served two purposes: to learn how to work with the Caffe library on a simple example, and to provide a benchmark against which the results of the other networks could be assessed more substantively. The next section considers each of the network models and its results in more detail.

3.3 Results of individual models

The models implemented in this study include:

· Neural network with one hidden layer

· Multilayer neural network built on the source data

· Multilayer neural network built on binarized data

· Convolutional network built on the source data

· Convolutional network built on RGB data after histogram equalization

· Convolutional network built on grayscale data after histogram equalization

· Convolutional network built on grayscale data after contrast enhancement

· A combined model consisting of two multilayer neural networks and 4 convolutional networks

Let's look at each of them in more detail.

Although the neural network with one hidden layer is not a deep learning model, it is nevertheless very useful to implement, firstly, as training material for working with the library, and secondly, as a baseline algorithm for comparison with the other models. The undoubted advantages of this model are the ease of its construction and its high training speed.

This model was built for initial color images of size 45*45 pixels, while the hidden layer contained 500 neurons. Training the network took about 30 minutes, and the resulting prediction accuracy was 59.7%.

The second model built is a multilayer fully connected neural network. This model was built for binarized and color versions of smaller format images and contained 4 hidden layers. The network configuration is described as follows:

· Input layer

· Layer 1, 1500 neurons

· Layer 2, 500 neurons

· Layer 3, 300 neurons

· Layer 4, 150 neurons

· Output layer

The model of this network is shown schematically in Figure 12.

Figure 12. Schematic of a multilayer perceptron
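To give a sense of the size of this configuration, the number of trainable parameters can be estimated with a short calculation. The sketch assumes that the binarized input has one channel and the color input three channels at 45*45 pixels, that the output layer has 43 neurons (one per class), and that every layer has a bias term; none of these details is stated explicitly in the text.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


hidden = [1500, 500, 300, 150]   # hidden layer sizes from the configuration above
n_classes = 43                   # assumed size of the output layer

binarized_params = count_parameters([45 * 45] + hidden + [n_classes])
color_params = count_parameters([45 * 45 * 3] + hidden + [n_classes])
```

Under these assumptions the first layer dominates the total, so the color variant is considerably larger than the binarized one.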

The final accuracy of the resulting model is 66.1% for binarized images and 81.5% for color ones. However, there were a number of images for which only the binarized model was able to determine the correct class, which justifies building a model for binarized images despite its lower accuracy. In addition, the model based on color images required significantly more training time: about 5 hours compared to 1.5 hours for the binarized version.

The rest of the constructed models are based in one way or another on convolutional networks, since such networks have shown the greatest efficiency in tasks such as image recognition. The architecture was based on the well-known LeNet network, which was originally developed for handwritten digit recognition. To accommodate the images in question, which differ in size and in the number of classes, the network was modified. A short description of the network architecture:

· 3 convolution layers with kernel sizes 9, 3 and 3 respectively

· 3 layers of subsampling

· 3 fully connected layers with sizes of 100, 100 and 43 neurons
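The spatial size of the feature maps in such an architecture can be traced with a short calculation. The sketch assumes a 60*60 input, convolutions with stride 1 and no padding, and 2*2 subsampling with stride 2 after each convolution; the text gives only the kernel sizes, so the other settings are assumptions.

```python
def conv_output(size, kernel, stride=1, padding=0):
    """Side length of a feature map after a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output(size, window=2, stride=2):
    """Side length of a feature map after square subsampling."""
    return (size - window) // stride + 1


size = 60                      # side of the larger input images
sizes = [size]
for kernel in (9, 3, 3):       # kernel sizes of the three convolution layers
    size = pool_output(conv_output(size, kernel))
    sizes.append(size)
```

Under these assumptions the last subsampling layer produces small feature maps, which are then flattened and passed to the fully connected layers.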

This network was trained separately on the larger source images, on histogram-equalized images (color preserved), on histogram-equalized images converted to grayscale, and finally on contrast-enhanced grayscale images. The training results are presented in Table 1:

Table 1. Convolutional network training results

You can see that the best results were shown by the network trained on grayscale images after histogram equalization. This can be explained by the fact that histogram equalization improves image quality, for example the contrast between the sign and the background and the overall brightness, while the conversion to grayscale removes color information that carries no significant semantic load (a person can easily recognize the same signs in black and white) but adds noise to the image and complicates classification.

1. Train each network on the training set using the backpropagation method (the set is the same for all networks, but different transformations are applied to the images)

2. For each instance of the training set, obtain the two most likely classes in descending order of probability from each network, save the resulting set (12 values in total) and the actual class label

3. Use the resulting data set - 12 attributes and a class label - as a training set for the final classifier

4. Assess the accuracy of the resulting model: for each instance of the test set, obtain the two most likely classes in descending order of probability from each network and the final class prediction based on this data set
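The construction of the 12 attributes from the outputs of six networks can be sketched as follows. The function names are hypothetical, and the nearest-instance classifier with a simple mismatch count is only a stand-in for illustration: the study used the J48 and KStar algorithms, which are not reproduced here.

```python
import numpy as np


def top2_features(prob_list):
    """Build the attribute matrix for the final classifier.

    prob_list holds one (n_samples, n_classes) probability array per network.
    Returns an (n_samples, 2 * n_networks) array of class indices, two per
    network, ordered by descending probability.
    """
    columns = []
    for probs in prob_list:
        order = np.argsort(-np.asarray(probs), axis=1)  # most likely class first
        columns.append(order[:, :2])
    return np.hstack(columns)


def predict_nearest(train_X, train_y, X):
    """Stand-in final classifier: label of the training instance that differs
    from the query in the fewest attributes."""
    mismatches = (X[:, None, :] != train_X[None, :, :]).sum(axis=2)
    return train_y[mismatches.argmin(axis=1)]


# Random probabilities in place of real network outputs (illustration only).
rng = np.random.default_rng(0)
n_networks, n_samples, n_classes = 6, 5, 43
probabilities = [rng.dirichlet(np.ones(n_classes), size=n_samples)
                 for _ in range(n_networks)]
features = top2_features(probabilities)  # 12 attributes per instance
```

Because the attributes are class labels rather than numeric quantities, a classifier that handles nominal attributes fits this representation naturally.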

Following the steps of this scheme, the final accuracy of the combined algorithm was calculated: 93% when using the J48 algorithm and 94.8% when using KStar. The decision-tree algorithm shows slightly worse results but has two important advantages. Firstly, the resulting tree clearly demonstrates the logic of classification and helps to understand the real structure of the data (for example, which of the networks gives the most accurate predictions for a certain type of sign, so that its prediction alone determines the result). Secondly, once the model is built, new instances are classified very quickly, since classification requires only one pass through the tree from top to bottom. The KStar algorithm, by contrast, does not actually build a model: classification is based on searching for the most similar instances in the training set. Therefore this algorithm, although it classifies instances, provides no additional information about them, and, most importantly, the classification of each instance may take a significant amount of time, which may be unacceptable for tasks where results are needed very quickly, for example when recognizing road signs during automated driving.

Table 2 presents a general comparison of the results of all the considered algorithms.

Table 2. Comparison of algorithm results

Figure 13 shows the network training graph using the example of a convolutional network for greyscale data with histogram equalization (number of iterations on the x-axis, accuracy on the y-axis).

Figure 13. Convolutional network training graph

To summarize the results of the study, it is also useful to examine the classification results and identify which signs turned out to be the easiest to classify and which, on the contrary, are difficult to recognize. To do this, consider the output of the J48 algorithm and the resulting contingency table (see Appendix 3). For some signs the classification accuracy is 100% or very close to it, for example the signs “Stop” (class 14), “Give way” (class 13), “Main road” (class 12), “End of all restrictions” (class 32) and “Through passage is prohibited” (class 15) (Figure 14). Most of these signs have a distinctive shape (“Main road”) or special graphic elements that have no analogues on other signs (“End of all restrictions”).

Figure 14. Examples of easily recognizable road signs

Other signs are often confused with each other, for example the signs for passing on the left and passing on the right, or the various speed limit signs (Figure 15).

Figure 15. Examples of commonly confused signs

A striking pattern is that neural networks often confuse signs that are mirror images of each other. This is especially true of convolutional networks, which look for local features in an image and do not analyze the image as a whole; multilayer perceptrons are better suited to classifying such images.

To summarize, convolutional neural networks and the combined algorithm built on them gave good results in the classification of road signs: the accuracy of the resulting classifier is almost 95%, which makes the results practically usable. In addition, the proposed approach of using an additional classifier to combine the results of the neural networks leaves many opportunities for further improvement.

Conclusion

In this work, the problem of image recognition was studied in detail using artificial neural networks. The most currently relevant approaches to image recognition, including those using deep neural networks, were reviewed, and our own algorithm for image recognition was developed using the example of the problem of recognizing road signs using deep networks. Based on the results of the work, we can say that all the tasks set at the beginning of the work were completed:

An analytical review of the literature was conducted on the topic of using artificial neural networks for image recognition. According to the results of this review, the most effective and widespread approaches to image recognition in recent years are those based on deep convolutional networks.

An algorithm was developed for image recognition using the example of the task of recognizing road signs, using an ensemble of neural networks consisting of two multilayer perceptrons and 4 deep convolutional networks, and using two types of additional classifier - J48 and KStar - to combine the results of individual networks and form the final prediction

A prototype system for image recognition was developed using the example of road signs based on the algorithm from clause 3, which provides the user with a web interface for loading the image and, using pre-trained models, classifies this image and displays the classification result to the user

The algorithm developed in step 3 was trained using the GTSRB dataset, and the results of each of its constituent networks and the final accuracy of the algorithm for two types of additional classifier were assessed separately. According to the results of experiments, the highest recognition accuracy, equal to 94.8%, is achieved when using an ensemble of neural networks and the KStar classifier, and among individual networks, the best results - an accuracy of 89.1% - were shown by a convolutional network that uses preliminary image conversion to grayscale and performs image histogram equalization.

Overall, this study confirms that deep artificial neural networks, especially convolutional networks, are currently the most effective and promising approach for image classification, as confirmed by the results of numerous studies and image recognition competitions.

List of used literature

1. Al-Azawi M. A. N. Neural Network Based Automatic Traffic Signs Recognition //International Journal of Digital Information and Wireless Communications (IJDIWC). - 2011. - T. 1. - No. 4. - pp. 753-766.

2. Baldi P. Autoencoders, Unsupervised Learning, and Deep Architectures //ICML Unsupervised and Transfer Learning. - 2012. - T. 27. - P. 37-50.

Bahlmann C. et al. A system for traffic sign detection, tracking, and recognition using color, shape, and motion information // Intelligent Vehicles Symposium, 2005. Proceedings. IEEE. - IEEE, 2005. - pp. 255-260.

Bastien F. et al. Theano: new features and speed improvements //arXiv preprint arXiv:1211.5590. - 2012.

Bengio Y., Goodfellow I., Courville A. Deep Learning. - MIT Press, book in preparation

Bergstra J. et al. Theano: A CPU and GPU math compiler in Python //Proc. 9th Python in Science Conf. - 2010. - P. 1-7.

Broggi A. et al. Real time road signs recognition // Intelligent Vehicles Symposium, 2007 IEEE. - IEEE, 2007. - pp. 981-986.

Canny J. A computational approach to edge detection // Pattern Analysis and Machine Intelligence, IEEE Transactions on. - 1986. - No. 6. - pp. 679-698.

Ciresan D., Meier U., Schmidhuber J. Multi-column deep neural networks for image classification //Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. - IEEE, 2012. - pp. 3642-3649.

Ciresan D. et al. A committee of neural networks for traffic sign classification //Neural Networks (IJCNN), The 2011 International Joint Conference on. - IEEE, 2011. - pp. 1918-1921.

11. Ciresan D. C. et al. Deep big multilayer perceptrons for digit recognition //Neural Networks: Tricks of the Trade. - Springer Berlin Heidelberg, 2012. - pp. 581-598.

Daugman J. G. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression // Acoustics, Speech and Signal Processing, IEEE Transactions on. - 1988. - T. 36. - No. 7. - pp. 1169-1179.

Gao X. W. et al. Recognition of traffic signs based on their color and shape features extracted using human vision models // Journal of Visual Communication and Image Representation. - 2006. - T. 17. - No. 4. - pp. 675-685.

Goodfellow I. J. et al. Pylearn2: a machine learning research library //arXiv preprint arXiv:1308.4214. - 2013.

Han J., Kamber M., Pei J. Data mining: Concepts and techniques. - Morgan Kaufmann, 2006.

Harris C., Stephens M. A combined corner and edge detector //Alvey vision conference. - 1988. - T. 15. - P. 50.

Houben S. et al. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark //Neural Networks (IJCNN), The 2013 International Joint Conference on. - IEEE, 2013. - pp. 1-8.

Huang F. J., LeCun Y. Large-scale learning with svm and convolutional netw for generic object recognition //2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. - 2006.

Huttenlocher D. P., Ullman S. Object recognition using alignment //Proc. ICCV. - 1987. - T. 87. - P. 102-111.

Jia, Yangqing. "Caffe: An open source convolutional architecture for fast feature embedding." h ttp://caffe. berkeleyvision. org (2013).

Krizhevsky A., Sutskever I., Hinton G. E. Imagenet classification with deep convolutional neural networks //Advances in neural information processing systems. - 2012. - P. 1097-1105.

Lafuente-Arroyo S. et al. Traffic sign classification invariant to rotations using support vector machines //Proceedings of Advabced Concepts for Intelligent Vision Systems, Brussels, Belgium. - 2004.

LeCun Y., Bengio Y. Convolutional networks for images, speech, and time series //The handbook of brain theory and neural networks. - 1995. - T. 3361. - P. 310.

LeCun Y. et al. Learning algorithms for classification: A comparison on handwritten digit recognition //Neural networks: the statistical mechanics perspective. - 1995. - T. 261. - P. 276.

Masci J. et al. Stacked convolutional auto-encoders for hierarchical feature extraction //Artificial Neural Networks and Machine Learning-ICANN 2011. - Springer Berlin Heidelberg, 2011. - pp. 52-59.

Matan O. et al. Handwritten character recognition using neural network architectures //Proceedings of the 4th USPS Advanced technology Conference. - 1990. - S. 1003-1011.

McCulloch W. S., Pitts W. A logical calculus of the ideas immanent in nervous activity //The bulletin of mathematical biophysics. - 1943. - T. 5. - No. 4. - pp. 115-133.

Minsky M., Seymour P. Perceptrons. - 1969.

Mitchell T. Generative and discriminative classifiers: naive Bayes and logistic regression, 2005 //Manuscript available at #"897281.files/image021.gif">

A review of neural network methods used in image recognition is provided. Neural network methods are methods based on the use of various types of neural networks (NN). The main areas of application of various neural networks for pattern and image recognition:

application for extracting key characteristics or features of given images,
classification of the images themselves or the characteristics already extracted from them (in the first case, the extraction of key characteristics occurs implicitly within the network),
solving optimization problems.

The architecture of artificial neural networks has some similarities with natural neural networks. NNs designed to solve various problems may differ significantly in their operating algorithms, but their main properties are as follows.

A neural network consists of elements called formal neurons, which are themselves very simple and are connected to other neurons. Each neuron converts the set of signals received at its inputs into an output signal. It is the connections between neurons, encoded by weights, that play the key role. One of the advantages of neural networks (and a disadvantage when they are implemented on a sequential architecture) is that all elements can function in parallel, which significantly increases the efficiency of solving a problem, especially in image processing. Besides solving many problems effectively, neural networks provide powerful, flexible and universal learning mechanisms, which is their main advantage over other methods (probabilistic methods, linear separators, decision trees, etc.). Learning eliminates the need to select key features, their significance, and the relationships between features by hand. Nevertheless, the choice of the initial representation of the input data (a vector in n-dimensional space, frequency characteristics, wavelets, etc.) significantly affects the quality of the solution and is a separate topic. NNs have good generalization ability (better than that of decision trees), i.e. they can successfully extend the experience gained on a finite training set to the entire set of images.

Let us describe the use of neural networks for image recognition, noting the possibilities of application for human recognition from a face image.

1. Multilayer neural networks

The architecture of a multilayer neural network (MNN) consists of sequentially connected layers, where each neuron of a layer is connected by its inputs to all the neurons of the previous layer and by its outputs to the neurons of the next one. A neural network with two decision layers can approximate any multidimensional function with any accuracy. A neural network with one decision layer can form only linear separating surfaces, which greatly narrows the range of problems it can solve; in particular, such a network cannot solve an “exclusive or” type problem. A neural network with a nonlinear activation function and two decision layers can form any convex regions in the solution space, and with three decision layers, regions of any complexity, including non-convex ones. At the same time, the MNN does not lose its generalizing ability. MNNs are trained using the backpropagation algorithm, which is a gradient descent method in the space of weights that minimizes the total network error. In this case, errors (more precisely, the correction values of the weights) propagate in the reverse direction, from outputs to inputs, through the weights connecting the neurons.
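The “exclusive or” problem mentioned above makes a compact worked example. The sketch below shows a network with two decision layers and a threshold activation that computes XOR; the weights are set by hand for illustration rather than learned by backpropagation.

```python
import numpy as np


def step(x):
    """Threshold activation: 1 where the input is positive, otherwise 0."""
    return (x > 0).astype(int)


def xor_network(inputs):
    """Two-layer network with hand-chosen weights that computes XOR."""
    x = np.asarray(inputs)
    w_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])
    hidden = step(x @ w_hidden + b_hidden)  # column 0 acts as OR, column 1 as AND
    w_out = np.array([1.0, -1.0])
    b_out = -0.5
    return step(hidden @ w_out + b_out)     # OR and not AND


outputs = xor_network([[0, 0], [0, 1], [1, 0], [1, 1]])
```

No single threshold neuron can reproduce this mapping, because the two inputs with output 1 cannot be separated from the other two by one straight line.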

The simplest application of a single-layer neural network (called auto-associative memory) is to train the network to reconstruct fed images. By feeding a test image as input and calculating the quality of the reconstructed image, you can evaluate how well the network recognized the input image. The positive properties of this method are that the network can restore distorted and noisy images, but it is not suitable for more serious purposes.

Fig. 1. Multilayer neural network for image classification. The neuron with the maximum activity (here the first) indicates membership in the recognized class.

MNN is also used for direct image classification - either the image itself in some form or a set of previously extracted key characteristics of the image is supplied as input; at the output, the neuron with maximum activity indicates membership in the recognized class (Fig. 1). If this activity is below a certain threshold, then it is considered that the submitted image does not belong to any of the known classes. The learning process establishes the correspondence of the images supplied to the input with belonging to a certain class. This is called supervised learning. When applied to human recognition from a facial image, this approach is good for access control tasks for a small group of people. This approach ensures that the network directly compares the images themselves, but with an increase in the number of classes, the training and operation time of the network increases exponentially. Therefore, tasks such as finding a similar person in a large database require extracting a compact set of key characteristics on which to base the search.
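The decision rule described above, maximum activity with rejection below a threshold, can be sketched in a few lines. The function name and the threshold value of 0.5 are assumptions chosen for the example.

```python
import numpy as np


def classify_with_rejection(activations, threshold=0.5):
    """Return the index of the most active output neuron, or None if its
    activity is below the threshold (image belongs to no known class)."""
    activations = np.asarray(activations, dtype=float)
    best = int(np.argmax(activations))
    return best if activations[best] >= threshold else None


known = classify_with_rejection([0.9, 0.05, 0.05])   # confident answer
unknown = classify_with_rejection([0.3, 0.4, 0.3])   # no neuron is active enough
```

In an access control setting, the rejected case corresponds to a person who is not in the group the network was trained on.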

An approach to classification using frequency characteristics of the entire image is described in. A single-layer neural network based on multi-valued neurons was used. 100% recognition was noted on the MIT database, but recognition was carried out among images for which the network was trained.

The use of MNNs for classifying face images based on characteristics such as distances between certain specific parts of the face (nose, mouth, eyes) is described in. In this case, these distances were fed to the input of the NN. Hybrid methods were also used: in the first, the results of processing by a hidden Markov model were fed to the input of the NN, and in the second, the result of the NN’s operation was fed to the input of the Markov model. In the second case, no advantages were observed, which suggests that the result of the NN classification is sufficient.

A neural network has also been applied to image classification in the case where the network input is the result of decomposing the image by principal component analysis.

In a classical MNN, the connections between layers are fully connected, and the image is represented as a one-dimensional vector, although it is two-dimensional. The convolutional neural network architecture aims to overcome these shortcomings. It uses local receptive fields (which provide local two-dimensional connectivity of neurons), shared weights (which provide detection of certain features anywhere in the image) and a hierarchical organization with spatial subsampling. A convolutional neural network (CNN) provides partial resistance to scale changes, displacements, rotations, and distortions. The architecture of a CNN consists of many layers, each of which has several planes, and the neurons of the next layer are connected only to a small number of neurons of the previous layer from a local neighbourhood (as in the human visual cortex). The weights at each point of one plane are the same (convolutional layers). A convolutional layer is followed by a layer that reduces its dimension by local averaging. Then comes another convolutional layer, and so on. In this way, a hierarchical organization is achieved. Later layers extract more general characteristics that depend less on image distortions. The CNN is trained using the standard backpropagation method. A comparison of MNNs and CNNs showed significant advantages of the latter in both speed and reliability of classification. A useful property of CNNs is that the characteristics generated at the outputs of the upper layers of the hierarchy can be used for classification by the nearest neighbor method (for example, by calculating the Euclidean distance), and the CNN can successfully extract such characteristics for images that are not in the training set. CNNs are characterized by fast learning and operating speeds.
Testing a CNN on the ORL database, which contains images of faces with slight changes in lighting, scale, spatial rotation, position and various emotions, showed approximately 98% recognition accuracy; for known faces, variants of their images that were not in the training set were presented. This result makes this architecture promising for further developments in the field of recognizing images of spatial objects.
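The two building blocks described above, a plane with shared weights and a subsampling layer based on local averaging, can be sketched in NumPy. The function names are hypothetical, and the convolution is written as a sliding dot product without kernel flipping, as is common in CNN libraries.

```python
import numpy as np


def convolve_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every output neuron sees only a local region and uses the same weights.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def average_subsample(feature_map, factor=2):
    """Reduce the map by averaging non-overlapping factor x factor blocks."""
    h = feature_map.shape[0] // factor * factor
    w = feature_map.shape[1] // factor * factor
    blocks = feature_map[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


image = np.arange(36, dtype=float).reshape(6, 6)
feature_map = convolve_valid(image, np.ones((3, 3)) / 9.0)  # 4 x 4 plane
pooled = average_subsample(feature_map)                      # 2 x 2 plane
```

Stacking several such pairs of layers gives the hierarchical organization described in the text, with each level covering a larger part of the original image.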

MNNs are also used to detect objects of a certain type. In addition to the fact that any trained MNN can, to some extent, determine whether images belong to “their” classes, it can be specially trained to reliably detect certain classes. In this case, the output classes are “belongs to the given image type” and “does not belong to it”. A neural network detector was used to detect a face in the input image. The image was scanned by a window of 20x20 pixels, which was fed to the input of the network, which decided whether the given area belongs to the class of faces. Training was carried out using both positive examples (various images of faces) and negative examples (images that are not faces). To increase the reliability of detection, a committee of neural networks was used, trained with different initial weights, so that the networks made different errors, and the final decision was made by a vote of the entire committee.
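The scanning and voting scheme can be illustrated with a minimal sketch. The window step of 10 pixels, the function names and the simple majority rule are assumptions; the individual networks are represented only by their binary decisions.

```python
import numpy as np


def sliding_windows(image, size=20, step=10):
    """Yield (top, left, window) for every size x size window of the image."""
    for top in range(0, image.shape[0] - size + 1, step):
        for left in range(0, image.shape[1] - size + 1, step):
            yield top, left, image[top:top + size, left:left + size]


def committee_vote(decisions):
    """Majority vote over an (n_networks, n_windows) array of 0/1 decisions."""
    decisions = np.asarray(decisions)
    return decisions.sum(axis=0) * 2 > decisions.shape[0]


image = np.zeros((40, 40))
windows = list(sliding_windows(image))

# Decisions of three hypothetical networks for three windows (made-up values).
votes = committee_vote([[1, 0, 1],
                        [1, 0, 0],
                        [0, 0, 1]])
```

A window is reported as a face only when more than half of the networks agree, which suppresses errors made by a single network.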

Fig. 2. Principal components (eigenfaces) and decomposition of the image into principal components.

A neural network is also used to extract key image characteristics, which are then used for subsequent classification. In , a neural network implementation of principal component analysis is shown. The essence of principal component analysis is to obtain maximally decorrelated coefficients that characterize the input images. These coefficients are called principal components and are used for statistical image compression, in which a small number of coefficients represents the entire image. A neural network with one hidden layer containing N neurons (where N is much smaller than the dimension of the image), trained by backpropagation to reconstruct at its output the image fed to its input, generates the coefficients of the first N principal components at the outputs of the hidden neurons, and these coefficients are used for comparison. Typically, from 10 to 200 principal components are used. As the index of a component increases, its representativeness decreases greatly, and it makes no sense to use components with large indices. When nonlinear activation functions are used, a nonlinear decomposition into principal components is possible. Nonlinearity allows variations in the input data to be reflected more accurately. Applying principal component analysis to the decomposition of facial images yields principal components called eigenfaces (holons in the work), which also have a useful property: some components mainly reflect such essential characteristics of a face as gender, race and emotions. When reconstructed, the components have a face-like appearance, with the first ones reflecting the most general shape of the face and the last ones representing various small differences between faces (Fig. 2). This method is well suited for finding similar images of faces in large databases. The possibility of further reducing the dimension of the principal components using NNs is also shown.
By assessing the quality of the reconstruction of the input image, one can very accurately determine whether it belongs to the class of faces.
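The decomposition into principal components and the reconstruction-quality test can be sketched as follows. Note that this example computes the components directly with a singular value decomposition rather than with the neural network described in the text; the function names and the use of the root-mean-square error as the quality measure are assumptions.

```python
import numpy as np


def fit_pca(images, n_components):
    """images: (n_samples, n_pixels). Returns the mean image and the first
    n_components principal directions as rows."""
    mean = images.mean(axis=0)
    _, _, vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(images, mean, components):
    """Coefficients of the images in the principal component basis."""
    return (images - mean) @ components.T


def reconstruct(coefficients, mean, components):
    """Restore images from their principal component coefficients."""
    return coefficients @ components + mean


def reconstruction_error(images, mean, components):
    """Root-mean-square reconstruction error of each image."""
    restored = reconstruct(project(images, mean, components), mean, components)
    return np.sqrt(((images - restored) ** 2).mean(axis=1))


# Random data in place of real face images (illustration only).
rng = np.random.default_rng(0)
faces = rng.normal(size=(20, 50))
mean, components = fit_pca(faces, n_components=5)
coefficients = project(faces, mean, components)
errors = reconstruction_error(faces, mean, components)
```

An image whose reconstruction error is much larger than the errors observed on the training faces would be judged not to belong to the class of faces.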

A neural network is a mathematical model, together with its software or hardware-software implementation, built on the basis of modeling the activity of biological neural networks, that is, networks of neurons in a biological organism. Scientific interest in this structure arose because the study of its model allows one to obtain information about a certain system. That is, such a model can have practical implementation in a number of branches of modern science and technology. The article discusses issues related to the use of neural networks to build image identification systems, which are widely used in security systems. Issues related to the image recognition algorithm and its application are explored in detail. Information about the methodology for training neural networks is provided briefly.

neural networks

learning using neural networks

image recognition

local perception paradigm

security systems


Today, technological and research progress is advancing rapidly and reaching new horizons. One of them is modeling the surrounding natural world using mathematical algorithms. Here there are trivial tasks, such as modeling sea oscillations, and extremely complex, non-trivial, multi-component tasks, such as modeling the functioning of the human brain. Research on this issue gave rise to a separate concept: the neural network. A neural network is a mathematical model, together with its software or hardware-software implementation, built by modeling the activity of biological neural networks, that is, the networks of neurons in a biological organism. Scientific interest in this structure arose because studying its model yields information about a certain system, so such a model can find practical use in a number of branches of modern science and technology.

A Brief History of the Development of Neural Networks

The concept of a “neural network” originates in the 1943 work of the American researchers W. McCulloch and W. Pitts, who first mentioned it, defined it, and made the first attempt to build a model of a neural network. As early as 1949, D. Hebb proposed the first learning algorithm. A series of studies in the field of neural learning followed, and the first working prototypes appeared around 1990-1991. However, the computing power of the equipment of that time was not enough for neural networks to run sufficiently fast. By 2010, the power of GPUs had greatly increased and the concept of programming directly on video cards had appeared, which significantly (3-4 times) increased the performance of computers. In 2012, neural networks won the ImageNet competition for the first time, which marked the start of their rapid development and the emergence of the term Deep Learning.

In the modern world, neural networks have a colossal scope of application; scientists consider research into the behavior and states of neural networks to be extremely promising. The list of areas in which neural networks are used is enormous. It includes pattern recognition and classification, forecasting, approximation problems, some aspects of data compression, data analysis, and, of course, various security systems.

Neural networks are being actively studied today by scientific communities in different countries. Viewed in this way, the approach is a special case of a number of pattern recognition methods, discriminant analysis, and clustering methods.

It should also be noted that over the past year, startups in the field of image recognition systems received more funding than in the previous 5 years, which indicates fairly high demand for this type of development in the end market.

Application of neural networks for image recognition

Let's consider standard problems solved by neural networks in application to images:

● identification of objects;

● recognition of parts of objects (for example, faces, hands, feet, etc.);

● semantic definition of object boundaries (allows you to leave only the boundaries of objects in the picture);

● semantic segmentation (allows you to divide an image into various individual objects);

● estimation of surface normals (allows you to convert two-dimensional images into three-dimensional images);

● highlighting objects of attention (allows you to determine what a person would pay attention to in a given image).

It is worth noting that image recognition is a striking example of a hard problem: solving it is a complex and non-trivial process. The object to be recognized can be a human face, a handwritten digit, or many other objects, each characterized by a number of unique features, which significantly complicates identification.

This study will examine an algorithm for creating and training a neural network to recognize handwritten characters. The image will be fed to the inputs of the neural network, and one of the outputs will be used to report the result.

At this stage, it is necessary to briefly dwell on the classification of neural networks. Today there are three main types:

● convolutional neural networks (CNN);

● recurrent networks (deep learning);

● reinforcement learning.

One of the most common examples of a neural network is the classical topology. Such a network can be represented as a fully connected graph; its characteristic features are the forward propagation of information and the backward propagation of the error signal. This architecture has no recurrent connections. An illustrative neural network with the classical topology is shown in Fig. 1.

Fig. 1. Neural network with the simplest topology

Fig. 2. Neural network with 4 layers of hidden neurons

One clearly significant disadvantage of this topology is redundancy. When the data has the form of, for example, a two-dimensional matrix, it has to be fed to the input as a one-dimensional vector. Thus, the image of a handwritten Latin letter described by a 34x34 matrix requires 1156 inputs. This means that the computing power spent on a hardware and software implementation of this algorithm will be too large.

The problem was solved by the American scientist Yann LeCun, who analyzed the work of the Nobel laureates in medicine T. Wiesel and D. Hubel. The object of their study was the visual cortex of the cat's brain. Their results showed that the cortex contains simple cells and complex cells. Simple cells responded to images of straight lines received from the visual receptors, and complex cells responded to movement in one direction. As a result, a principle for constructing neural networks was developed, called convolutional. Its idea is to alternate convolutional layers (usually denoted C-layers) and subsampling layers (S-layers), followed by fully connected layers (F-layers) at the output of the network.

The construction of a network of this kind is based on three paradigms: the local perception paradigm, the shared weights paradigm, and the subsampling paradigm.

The essence of the local perception paradigm is that each input neuron receives not the entire image matrix but only a part of it; the remaining parts are fed to other input neurons. This allows parallelization. It also makes it possible to preserve the topology of the image from layer to layer and to process it multidimensionally, that is, several neural networks can be used in processing.

The shared weights paradigm says that a small set of weights can be used for many connections. These sets are also called “kernels”. Shared weights have a positive effect on the properties of the neural network: they improve its ability to find invariants in images and to filter out noise components without processing them.

Based on the above, convolving an image with a kernel produces an output image whose elements characterize the degree of match with the filter; that is, a feature map is generated. This algorithm is shown in Fig. 3.

Fig. 3. Algorithm for generating a feature map
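
Generating a feature map can be sketched in a few lines. This is a minimal illustration under stated assumptions: no padding, a stride of 1, and, as is common in convolutional networks, no flipping of the kernel; the 4x4 image and the 2x2 edge kernel are hypothetical.

```python
def feature_map(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the map of weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    result = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        result.append(row)
    return result

# A 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A kernel that responds to a dark-to-bright vertical transition.
kernel = [
    [-1, 1],
    [-1, 1],
]
fmap = feature_map(image, kernel)
# fmap == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four weights are reused at every position, and the map is large only where the image matches the filter, here along the edge.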

The subsampling paradigm is that the input image is reduced by decreasing the spatial dimensions of its mathematical equivalent, an n-dimensional matrix. Subsampling is needed to provide invariance to the scale of the original image. Alternating the layers makes it possible to generate new feature maps from existing ones; in practice this means that a multidimensional matrix can be reduced to a vector and then to a scalar.
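
A minimal sketch of subsampling, assuming that averaging over non-overlapping 2x2 blocks is used (the text does not say which reduction is meant; taking the maximum is another common choice). The input values are hypothetical.

```python
def subsample(fmap, size=2):
    """Average non-overlapping size x size blocks of a feature map."""
    out = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            block = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

fmap = [
    [1, 3, 0, 0],
    [5, 7, 0, 4],
    [2, 2, 8, 8],
    [2, 2, 8, 8],
]
pooled = subsample(fmap)      # 4x4 -> 2x2: [[4.0, 1.0], [2.0, 8.0]]
scalar = subsample(pooled)    # 2x2 -> 1x1: [[3.75]]
```

Applying the step repeatedly shrinks the map until a single value remains, which is the reduction of a matrix toward a scalar that the paragraph above describes.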

Implementation of neural network training

Existing networks are divided into 3 classes of architectures from a training point of view:

● supervised learning (perceptron);

● unsupervised learning (adaptive resonance networks);

● mixed learning (networks of radial basis functions).

One of the most important criteria for evaluating a neural network in image recognition is the quality of recognition. To quantify it, the mean squared error is most often used:

(1)

In this relationship, Ep is the recognition error for the p-th training pair, Dp is the expected output of the neural network (ideally the network should strive for 100% recognition, but this does not happen in practice yet), and O(Ip,W), which appears squared in the formula, is the network output, which depends on the p-th input Ip and the set of weight coefficients W. This set includes both the convolution kernels and the weight coefficients of all layers. The overall error is the arithmetic mean over all training pairs.
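
The formula numbered (1) did not survive in this copy of the text. A form consistent with the definitions given here can be written as follows; the factor of 1/2 and the symbol P for the number of pairs are assumptions based on the conventional definition, not taken from the source.

```latex
E^{p} = \frac{1}{2}\left(D^{p} - O(I^{p}, W)\right)^{2},
\qquad
E = \frac{1}{P}\sum_{p=1}^{P} E^{p}
```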

The analysis shows that the value of the weight at which the error is minimal can be calculated from relationship (2):

(2)

According to this relationship, the optimal weight is obtained by subtracting from the current weight the first derivative of the error function with respect to the weight, divided by its second derivative.
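
Relationship (2) is also missing from this copy. Read as a Newton-type update, the verbal description can be sketched as below; the function name, the starting weight and the quadratic error function are hypothetical and chosen only so that the result is easy to check.

```python
def newton_step(w_current, d_error, d2_error):
    """One update: current weight minus first derivative over second derivative."""
    return w_current - d_error(w_current) / d2_error(w_current)

# Hypothetical error E(w) = (w - 3)^2 + 1, whose minimum is at w = 3.
def d_error(w):
    return 2 * (w - 3)   # dE/dw

def d2_error(w):
    return 2.0           # d^2E/dw^2

w_min = newton_step(10.0, d_error, d2_error)
# w_min == 3.0
```

For a quadratic error one step lands exactly on the minimum; for a real network the error is only locally quadratic, so the step is repeated and usually damped.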

The relationships above make it straightforward to calculate the error at the output layer. The error in the hidden layers can be calculated using the backpropagation method. Its main idea is to propagate information, in the form of an error signal, from the output neurons toward the input neurons, that is, in the direction opposite to the propagation of signals through the network.
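
Backward propagation of the error can be shown on the smallest possible network. The sketch below assumes one input, one tanh hidden neuron, one linear output neuron and a half squared error; none of these choices come from the source. The gradients obtained by propagating the error backward are compared with numerical derivatives.

```python
import math

def forward(x, w1, w2):
    h = math.tanh(w1 * x)   # hidden neuron
    o = w2 * h              # linear output neuron
    return h, o

def error(x, d, w1, w2):
    _, o = forward(x, w1, w2)
    return 0.5 * (d - o) ** 2

def backprop(x, d, w1, w2):
    """Return (dE/dw1, dE/dw2) by passing the error from output to input."""
    h, o = forward(x, w1, w2)
    delta_out = o - d                              # error signal at the output
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * (1 - h * h)    # passed back through tanh
    grad_w1 = delta_hidden * x
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only as a check."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

x, d, w1, w2 = 0.5, 1.0, 0.8, -0.3
g1, g2 = backprop(x, d, w1, w2)
n1 = numeric_grad(lambda w: error(x, d, w, w2), w1)
n2 = numeric_grad(lambda w: error(x, d, w1, w), w2)
```

The hidden-layer gradient reuses the output error signal instead of recomputing anything from scratch, which is what makes the method practical for networks with many layers.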

It is also worth noting that the network is trained on specially prepared image databases classified into a large number of classes, and that training takes quite a long time. Today the largest database is ImageNet (www.image_net.org). It is freely accessible to academic institutions.

Conclusion

In conclusion, neural networks and algorithms built on the principle of their operation can be applied in fingerprint card recognition systems for internal affairs bodies. Often it is precisely the software component of a hardware and software complex for recognizing such a unique, complex image as a fingerprint pattern, which serves as identification data, that does not fully solve the tasks assigned to it. A program based on neural network algorithms will be much more effective.

To summarize:

● neural networks can find application in both image and text recognition;

● this theory makes it possible to talk about the creation of a new promising class of models, namely models based on intelligent modeling;

● neural networks are capable of learning, which indicates the possibility of optimizing the process of functioning. This feature is an extremely important option for the practical implementation of the algorithm;

● the quality of a pattern recognition algorithm based on a neural network can be evaluated quantitatively; accordingly, there are mechanisms for adjusting parameters to the required value by calculating the required weight coefficients.

Today, further research into neural networks appears promising, and its results will be applied in even more branches of science and technology, as well as other human activity. The main focus in the development of modern recognition systems is now shifting to semantic segmentation of 3D images in geodesy, medicine, prototyping and other areas of human activity. These are quite complex algorithms, and the difficulty is due to:

● lack of a sufficient number of reference image databases;

● lack of a sufficient number of free experts for initial training of the system;

● images are not stored in pixels, which requires additional resources from both the computer and the developers.

It should also be noted that today there are a large number of standard architectures for building neural networks, which significantly simplifies the task of building a neural network from scratch and reduces it to selecting a network structure suitable for a specific task.

Currently, there are quite a large number of innovative companies on the market engaged in image recognition using neural network technologies for system training. It is known for certain that they achieved an image recognition accuracy of around 95% when using a database of 10,000 images. However, all achievements relate to static images; with video, at the moment everything is much more complicated.

Bibliographic link

Markova S.V., Zhigalov K.Yu. APPLICATION OF A NEURAL NETWORK TO CREATE AN IMAGE RECOGNITION SYSTEM // Fundamental Research. – 2017. – No. 8-1. – P. 60-64;
URL: http://fundamental-research.ru/ru/article/view?id=41621 (access date: 03/24/2020).

Friends, we continue the story about neural networks, which we started last time, and about...

What is a neural network

A neural network in its simplest case is a mathematical model consisting of several layers of elements that perform parallel calculations. Initially, this architecture was created by analogy with the smallest computing elements of the human brain, neurons. The minimal computational elements of an artificial neural network are also called neurons. Neural networks usually consist of three or more layers: an input layer, a hidden layer (or layers) and an output layer (Fig. 1). In some cases the input and output layers are not counted, and then the number of layers in the network is the number of hidden layers. This type of neural network is called a perceptron.

Fig. 1. The simplest perceptron

An important feature of a neural network is its ability to learn from examples; this is called supervised learning. The neural network is trained on a large number of examples consisting of input-output pairs (an input and the output corresponding to it). In object recognition tasks, such a pair is the input image and the corresponding label, the name of the object. Training a neural network is an iterative process that reduces the deviation of the network output from a given “teacher response”, the label corresponding to a given image (Fig. 2). This process consists of steps called training epochs (they usually number in the thousands), at each of which the “weights” of the neural network, the parameters of the hidden layers of the network, are adjusted. At the end of the training process, the performance of the neural network is usually good enough to perform the task for which it was trained, although the optimal set of parameters that perfectly recognizes all images is often impossible to find.

Fig. 2. Neural network training
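
The iterative nature of supervised training can be seen in a deliberately tiny example. The sketch uses the classic perceptron learning rule on a single neuron with no hidden layer, which is a simplification of the networks discussed here; the logical OR data set and the function names are hypothetical.

```python
def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

def train(samples, epochs=10, lr=1.0):
    """Adjust weights after every example by the deviation from the label."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in samples:
            err = label - predict(weights, bias, x)   # deviation from the teacher response
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias

# Input-output pairs for logical OR: (input, label).
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(samples)
outputs = [predict(weights, bias, x) for x, _ in samples]
```

On this small, linearly separable set the weights stop changing after a few epochs; real image data needs far more epochs and a gradient-based rule.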

What are deep neural networks

Deep neural networks are neural networks with several hidden layers (Fig. 3). The figure gives the reader a general idea of what a deep neural network looks like; the actual architecture of deep neural networks is much more complex.

Fig. 3. Neural network with many hidden layers

The creators of convolutional neural networks were, of course, initially inspired by the biological structures of the visual system. The first computational model based on the concept of the hierarchical organization of the primate visual stream is known as Fukushima's Neocognitron (Fig. 4). The current understanding of the physiology of the visual system is consistent with the type of information processing in convolutional networks, at least for fast object recognition.

Fig. 4. Diagram showing connections between layers in the Neocognitron model.

This concept was later implemented by the French-born researcher Yann LeCun in his convolutional neural network, which he created for handwritten character recognition. This neural network consisted of two types of layers: convolutional layers and subsampling layers (also called pooling layers). In it, each layer has a topographic structure, that is, each neuron is associated with a fixed point in the source image, as well as with a receptive field (the area of the input image that is processed by a given neuron). At each location in each layer there are a number of different neurons, each with its own set of input weights, connected to the neurons in a rectangular patch of the previous layer. Neurons at different locations are connected to different rectangular input patches but use the same set of weights.

The general architecture of a deep neural network for pattern recognition is shown in Figure 5. The input image is represented as a set of pixels or small areas of the image (for example, 5-by-5 pixels).

Fig. 5. Convolutional neural network diagram

Typically, deep neural networks are depicted in a simplified form: as processing stages, which are sometimes called filters. Each stage differs from the other in a number of characteristics, such as the size of the receptive field, the type of features the network learns to recognize in a given layer, and the type of computation performed at each stage.

The areas of application of deep neural networks, including convolutional networks, are not limited to face recognition. They are widely used for recognizing speech and audio signals, processing readings from various types of sensors, and segmenting complex multi-layer images (such as satellite maps) or medical images (X-rays, fMRI images).

Neural networks in biometrics and face recognition

To achieve high recognition accuracy, the neural network is pretrained on a large array of images, for example, such as in the MegaFace database. This is the main training method for face recognition.

Fig. 6. MegaFace database contains 1 million images of more than 690 thousand people

Once the network is trained to recognize faces, the face recognition process can be described as follows (Fig. 7). First, the image is processed by a face detector: an algorithm that identifies a rectangular portion of the image containing a face. This fragment is normalized to make it easier for the neural network to process: the best result is achieved if all input images have the same size, color, etc. The normalized image is fed to the input of the neural network for processing by the algorithm. This algorithm is usually a company's own development aimed at improving recognition quality, but there are also “standard” solutions for this task. The neural network builds a unique feature vector, which is then passed to the database. The search engine compares it with all feature vectors stored in the database and produces a search result in the form of a certain number of names or user profiles with similar facial features, each of which is assigned a number. This number represents the degree of similarity between our feature vector and the one found in the database.

Fig. 7. Face recognition process
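
The last step of this pipeline, searching the database for similar feature vectors, can be sketched as follows. Cosine similarity is an assumption here, since the text does not name the measure; the three-element vectors, the names and the `search` function are hypothetical, and real feature vectors have hundreds of components.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, database, top_k=2):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical stored feature vectors.
database = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.0, 1.0, 0.2],
    "carol": [0.1, 0.1, 0.9],
}
# Feature vector built by the network for the query image.
results = search([1.0, 0.2, 0.0], database)
```

Each returned pair carries the number mentioned above: the degree of similarity between the query vector and a stored one.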

Determining the quality of the algorithm

Accuracy

When we choose which algorithm to apply to an object or face recognition task, we must have a means of comparing the performance of different algorithms. In this part we will describe the tools with which this is done.

The performance of a facial recognition system is assessed using a set of metrics that correspond to typical scenarios for using the system for authentication using biometrics.

Typically, the performance of any neural network can be measured in terms of accuracy: after setting the parameters and completing the training process, the network is tested on a test set for which we have the teacher's response, but which is separate from the training set. Typically, this parameter is a quantitative measure: a number (often a percentage) that shows how well the system is able to recognize new objects. Another typical measure is the error (can be expressed as either a percentage or a numerical equivalent). However, there are more precise measures for biometrics.
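
As a small illustration of these two measures, the sketch below computes accuracy and error on a held-out test set; the labels are hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of test examples whose predicted label matches the teacher's response."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)

# Network outputs and known labels for five test images.
predicted = ["cat", "dog", "dog", "cat", "bird"]
expected = ["cat", "dog", "cat", "cat", "bird"]

acc = accuracy(predicted, expected)   # 4 of 5 correct
err = 1 - acc                         # error rate
```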

In biometrics in general and facial recognition in particular, there are two types of applications: verification and identification. Verification is the process of confirming a specific identity by comparing an image of an individual (a facial feature vector or another feature vector, such as one for a retina or fingerprint) with one or more previously stored templates. Identification is the process of determining the identity of an individual: biometric samples are collected and compared with all templates in the database. Identification is said to be closed-set if the person is assumed to exist in the database. Recognition thus covers verification, identification, or both.

Often, in addition to the direct result of the comparison, it is necessary to assess the level of “confidence” of the system in its decision. This value is called the similarity score (or level of similarity). A higher similarity score indicates that the two biometric samples being compared are more similar.
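
The difference between the two modes can be made concrete with a short sketch. The threshold of 0.8, the scores and the function names are hypothetical; in practice the threshold is tuned on evaluation data.

```python
def verify(similarity, threshold):
    """1:1 check: accept the claimed identity if the score reaches the threshold."""
    return similarity >= threshold

def identify_closed_set(scores):
    """1:N search in a closed set: the person is assumed to be enrolled,
    so the best-scoring template wins."""
    return max(scores, key=scores.get)

# Similarity scores of one probe image against three stored templates.
scores = {"alice": 0.91, "bob": 0.34, "carol": 0.12}

accepted = verify(scores["alice"], threshold=0.8)
rejected = verify(scores["bob"], threshold=0.8)
best = identify_closed_set(scores)
```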

There are a number of methods for assessing the quality of system operation (both for verification and identification tasks). We will tell you about them next time. Stay with us, and feel free to leave comments and ask questions.

NOTES

Fukushima (1980) “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics.
LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard and L.D. Jackel (1989) “Backpropagation Applied to Handwritten Zip Code Recognition”, Neural Computation, vol. 1, pp. 541−551.
Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, Stefano Ermon Deep Gaussian Process for Crop Yield Prediction Based on Remote Sensing Data.
Ian Goodfellow, Yoshua Bengio, Aaron Courville (2016) Deep Learning. MIT press.
Poh, C-H. Chan, J. Kittler, Julian Fierrez (UAM), and Javier Galbally (UAM) (2012) Description of Metrics For the Evaluation of Biometric Performance.

Quite a lot has already been said about neural networks as one of the tools for solving difficult-to-formalize problems. And here, on the hub, it was shown how to use these networks for image recognition, in relation to the task of cracking captcha. However, there are quite a few types of neural networks. And is a classic fully connected neural network (FNN) so good for the task of image recognition (classification)?

1. Problem

So, we are going to solve the problem of image recognition. This could be recognition of faces, objects, symbols, etc. I propose to start by considering the problem of recognizing handwritten numbers. This task is good for a number of reasons:

To recognize a handwritten character, it is quite difficult to create a formalized (not intellectual) algorithm, and this becomes clear once you look at the same number written by different people

The task is quite relevant and relates to OCR (optical character recognition)

There is a freely distributed database of handwritten symbols available for download and experimentation

There are quite a few articles on this topic and it is very easy and convenient to compare different approaches

It is proposed to use the MNIST database as input data. This database contains 60,000 training pairs (image - label) and 10,000 test pairs. Images are normalized by size and centered. Each digit is no larger than 20x20 but is inscribed in a 28x28 square. An example of the first 12 digits from the MNIST training set is shown in the figure:

Thus the problem is formulated as follows: create and train a neural network to recognize handwritten characters, taking their images as input and activating one of 10 outputs. By activation we mean the value 1 at the output. The values of the remaining outputs should (ideally) be equal to -1. I will explain later why a scale is not used.
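
The output convention just described can be written down directly. This is a minimal sketch; the function names are hypothetical.

```python
def encode_target(digit, classes=10):
    """Desired output vector: +1 at the position of the digit, -1 elsewhere."""
    target = [-1.0] * classes
    target[digit] = 1.0
    return target

def decode_output(outputs):
    """The recognized digit is the index of the most strongly activated output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

target = encode_target(3)
# [-1.0, -1.0, -1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
digit = decode_output(target)
```

Decoding by the largest output also works when the trained network produces values that only approximate +1 and -1.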

2. “Ordinary” neural networks.

Most people understand “conventional” or “classical” neural networks as fully connected feedforward neural networks with backpropagation:

As the name suggests, in such a network each neuron is connected to all neurons of the neighboring layers, the signal goes only in the direction from the input layer to the output layer, and there are no recursions. We will abbreviate such a network as FNN.

First you need to decide how to present the input data. The simplest, and for an FNN almost the only, solution is to express the two-dimensional image matrix as a one-dimensional vector. That is, for a 28x28 image of a handwritten digit we will have 784 inputs, which is already quite a lot. What happens next is what many conservative scientists dislike about neural network researchers and their methods: the choice of architecture. They dislike it because the choice of architecture is pure shamanism. There are still no methods that unambiguously determine the structure and composition of a neural network from the description of the problem. In defense, I will say that for problems that are difficult to formalize, such a method is unlikely ever to be created. In addition, there are many network reduction techniques (for example, OBD), as well as various heuristics and rules of thumb. One of these rules states that the number of neurons in the hidden layer must be at least an order of magnitude greater than the number of inputs. If we take into account that the transformation from an image to a class indicator is quite complex and significantly non-linear, one layer is not enough. Based on all of the above, we roughly estimate that the number of neurons in the hidden layers will be of the order of 15,000 (10,000 in the 2nd layer and 5,000 in the third). For a configuration with two hidden layers, the number of adjustable, trainable connections will then be about 8 million between the inputs and the first hidden layer (784 × 10,000), plus 50 million between the first and second hidden layers, plus 50 thousand between the second hidden layer and the output, assuming 10 outputs, each representing a digit from 0 to 9. In total, roughly 60,000,000 connections. It is not for nothing that I mentioned that they are adjustable: during training, the error gradient must be calculated for each of them.
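
Multiplying out the layer sizes named above gives the exact counts behind this estimate. The sketch ignores bias terms, which is an assumption; the first product comes to 7.84 million and the total is close to the roughly 60 million quoted.

```python
# Layer sizes: 784 inputs, two hidden layers, 10 outputs.
layers = [784, 10_000, 5_000, 10]

# In a fully connected network every neuron of one layer is linked
# to every neuron of the next, so the counts are simple products.
connections = [a * b for a, b in zip(layers, layers[1:])]
total = sum(connections)

# connections == [7_840_000, 50_000_000, 50_000]
# total == 57_890_000
```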

What can you do, the beauty of artificial intelligence requires sacrifice. But if you think about it, it comes to mind that when we convert an image into a linear chain of bytes, we irretrievably lose something. Moreover, with each layer this loss only gets worse. That’s right - we lose the topology of the image, i.e. the relationship between its individual parts. In addition, the recognition task implies the ability of the neural network to be resistant to small shifts, rotations and changes in image scale, i.e. it must extract from the data certain invariants that do not depend on the handwriting of a particular person. So what should a neural network be like in order to be not very computationally complex and, at the same time, more invariant to various image distortions?

3. Convolutional neural networks

A solution to this problem was found by an American scientist of French origin, Yann LeCun, inspired by the work of Nobel laureates in the field of medicine Torsten Nils Wiesel and David H. Hubel. These scientists examined the visual cortex of the cat's brain and found that there are so-called simple cells that respond particularly strongly to straight lines at different angles and complex cells that respond to lines moving in one direction. Yann LeCun proposed the use of so-called convolutional neural networks.

6. Results

The program on matlabcentral includes a file of an already trained neural network, as well as a GUI to demonstrate the results of the work. Below are examples of recognition:

The link contains a comparison table of recognition methods on MNIST. First place goes to convolutional neural networks, with a recognition error of 0.39%. Most of the misrecognized images would not be recognized correctly by every person either. In addition, that work used elastic distortions of the input images, as well as unsupervised pre-training. But these methods will be discussed in another article.

Links.

Yann LeCun, J. S. Denker, S. Solla, R. E. Howard and L. D. Jackel: Optimal Brain Damage, in Touretzky, David (Eds), Advances in Neural Information Processing Systems 2 (NIPS*89), Morgan Kaufman, Denver, CO, 1990
Y. LeCun and Y. Bengio: Convolutional Networks for Images, Speech, and Time-Series, in Arbib, M. A. (Eds), The Handbook of Brain Theory and Neural Networks, MIT Press, 1995
Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp, in Orr, G. and Muller K. (Eds), Neural Networks: Tricks of the trade, Springer, 1998
Ranzato Marc'Aurelio, Christopher Poultney, Sumit Chopra and Yann LeCun: Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2006