NatClass – a program for finding patterns in data. INDEX and MATCH functions in Excel – the best alternative to VLOOKUP

User guide.

NatClass

Name of the operation

Construction of classification and analysis of genomic sequences.

Conditions under which the operation can be performed

Operation order
Preparatory actions

Basic actions in the required sequence.

1. Loading input data.

The input data for the program consists of two samples of sequences in FASTA format: Positive Sequences (a sample of genomic sequences) and Negative Sequences (a sample of random sequences, or of contrasting genomic sequences).

To load the training data, use the menu command Source -> AddPositiveSequences (Fig. 1) or the corresponding toolbar button. A wizard appears on the screen and prompts you to specify the name of the file containing the positive/negative sample of sequences.

Project files previously saved using this program can serve as input data. Such project files can store all the data that has been loaded or received at the time of saving.
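If you want to inspect such FASTA samples outside the program, a minimal Python sketch of a reader might look as follows (generic parsing code, not part of NatClass; the file names are taken from the test case described below):

    def read_fasta(path):
        """Read a FASTA file into a list of (header, sequence) pairs."""
        records, header, parts = [], None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    if header is not None:
                        records.append((header, "".join(parts)))
                    header, parts = line[1:], []
                else:
                    parts.append(line.lower())
        if header is not None:
            records.append((header, "".join(parts)))
        return records

    positive = read_fasta("EGR1_pos.seq")   # positive sample
    negative = read_fasta("neg_2200.seq")   # negative sample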

2. Setting program parameters. Starting the pattern generation process.

The first tab, "Rules", contains the controls for the pattern search (Fig. 2). Set the search parameters and click the "Start" button.

The search parameters are (a sketch of how the two statistical thresholds might be applied follows the list):

Confidence interval: threshold for the Fisher test value;

Min. Level of CP: minimum level of conditional probability;

Size of finish Buffer: number of patterns to detect;

Size of Sub Buffers: size of the auxiliary pattern buffer.
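To make the meaning of the two statistical thresholds concrete, here is a sketch in Python (this is not NatClass code; the function name and the counting of pattern matches are assumptions for illustration):

    from scipy.stats import fisher_exact

    def passes_thresholds(pos_hits, n_pos, neg_hits, n_neg,
                          min_cp=0.8, max_fisher_p=0.05):
        """Keep a candidate pattern only if the conditional probability of
        the positive class is high enough and the Fisher exact test on the
        2x2 contingency table is significant."""
        hits = pos_hits + neg_hits
        if hits == 0:
            return False
        cp = pos_hits / hits                      # P(positive | pattern matches)
        table = [[pos_hits, n_pos - pos_hits],
                 [neg_hits, n_neg - neg_hits]]
        _, p_value = fisher_exact(table, alternative="greater")
        return cp >= min_cp and p_value <= max_fisher_p

    # a pattern matching 40 of 100 positive and 2 of 1000 negative sequences:
    print(passes_thresholds(40, 100, 2, 1000))    # True

The threshold values 0.8 and 0.05 mirror the settings used in the test case below.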

The operating mode is also selected: fixed positions (Fixed positions) or sliding-window mode (Shift positions). The latter is used for recognition along a long genomic sequence and requires specifying the window size (Width of scanning frame).

The program allows you to pause the pattern generation process (by clicking the “Pause” button), or stop the process (by clicking the “Stop” button).


Fig. 2. Controls of the pattern search tab.

When the pattern search finishes, the program displays the message "The process of searching for patterns has been successfully completed." The found patterns are then presented to the user in the order in which they were discovered (Fig. 3).

Fig. 3. Discovered patterns.

3. Construction of ideal class objects.

In addition to the detected patterns, NatClass also outputs ideal representatives of the classes. They are constructed on the "Objects" tab of the program (Fig. 4). Ideal objects can be constructed either from the initial objects of the positive training sample (option "original objects") or from the patterns (option "regularities"). You can also choose one of three variants of the construction algorithm (idealization type), which set the priority between removing and adding features. After constructing the ideal objects, the program assigns each training object to one of the detected classes or recognizes it as belonging to a new class, "New". As with pattern generation, the idealization process can be paused (the "Pause" button) or stopped (the "Stop" button).

Fig. 4. Controls for constructing ideal objects.

When idealization finishes, the program displays the message "The idealization process has been successfully completed."

4. Application of the obtained patterns. Calculation of recognition errors.

The "Classes" tab contains functions for processing the obtained output data (Fig. 5).

The following functions are available for analyzing the results: classification of control samples ("Classification"), recognition against the available classes ("Recognition Control Data"), counting recognition errors ("Recognition Errors Count"), and the Bootstrap procedure.

To load control sequences, use the menu command Control -> AddControlPositive.

When counting recognition errors, the program produces the optimal result (and builds a histogram), but the user can adjust it independently by setting either the recognition threshold ("Recogn Level") or the value of the first-type error ("1st level error").
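A sketch of this trade-off in Python (assuming recognition produces a numeric score per control sequence; the scores here are simulated):

    import numpy as np

    def threshold_for_type1_error(negative_scores, alpha=0.05):
        """Choose the recognition threshold so that at most a fraction
        `alpha` of the negative control scores exceeds it, i.e. fix the
        first-type error and let the threshold follow."""
        return float(np.quantile(negative_scores, 1.0 - alpha))

    rng = np.random.default_rng(0)
    neg_scores = rng.normal(0.0, 1.0, 1000)        # stand-in control scores
    print(threshold_for_type1_error(neg_scores))   # about 1.6 for alpha = 0.05

Setting the threshold directly and setting the first-type error are thus two views of the same histogram.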


Fig. 5. The "Classes" tab.

Right-clicking an ideal object brings up options to delete the object ("Delete", Fig. 6), show the objects of its class ("Show Objects", Fig. 7), show its patterns ("Show Regularities"), its prediction matrix ("Prediction Matrix"), and its recognition matrix ("Recognition Matrix", Fig. 8).

Fig. 6. Operations available for an ideal object.

Fig. 7. Displaying the objects of a class ("Show Objects").


Fig. 8. Displaying the recognition matrix of the objects of a class ("Recognition Matrix").

Recognition results and errors are saved by the program as HTML tables.

Test case: construction of a classification and analysis of transcription factor binding sites (TFBS) of EGR1.

1. Loading input data.

The positive sample given to the program consists of TFBS EGR1 sequences:

>S1916;

gtccgtgggt

>S4809;

ttggggggcga

>S6067;

gagggggcgg

file EGR1_pos.seq.

As the negative sample, random sequences were generated with the same nucleotide frequencies as the positive sequences:

>S1916;_N1_H1_W1;

gggtcttggc

>S1916;_N1_H2_W1;

gggcgtttcg

>S1916;_N1_H3_W1;

ggtggggctct

file neg_2200.seq
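A sketch of how such a negative sample could be generated in Python (an assumed procedure for illustration; it is not claimed to be the generator that produced neg_2200.seq):

    import random
    from collections import Counter

    def random_like(sequences, seed=42):
        """Generate one random sequence per input sequence, with the same
        lengths and the same overall nucleotide frequencies as the input
        sample."""
        rng = random.Random(seed)
        counts = Counter("".join(sequences))
        letters = sorted(counts)
        weights = [counts[c] for c in letters]
        return ["".join(rng.choices(letters, weights=weights, k=len(s)))
                for s in sequences]

    print(random_like(["gtccgtgggt", "gagggggcgg"]))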

To load the input data, see point 1 of the user guide.

2. Setting program parameters. Starting the pattern generation process.

The search parameters were set as follows:

Confidence interval: 0.05;

Min. Level of CP: 0.8;

Size of finish Buffer: 2000;

Size of Sub Buffers: 100.

The program discovered 2000 patterns (Fig. 9).


Fig. 9. Patterns that satisfy the search parameters.

3. Construction of ideal class objects.

As a result, the program discovered one class. The ideal class object and the prediction matrix are shown in Fig. 10.


Fig. 10. Ideal class object and prediction matrix for TFBS EGR1.

4. Application of the obtained patterns. Calculation of recognition errors.

Sequences generated with the same nucleotide frequencies as the positive ones were taken as negative controls (file control_neg_1000.seq). The program carried out classification, calculated the weight of each object, and performed recognition (Fig. 11).


Fig. 11. Classification and recognition of control objects for TFBS EGR1.

To combat a combinatorial explosion, a "combinatorial crowbar" is required. There are two tools that make it possible to solve complex combinatorial problems in practice. The first is massive parallelization of computations. Here it is important not only to have a large number of parallel processors, but also to choose an algorithm that lets the task be parallelized so that all the available computing power is loaded.

The second tool is the principle of limitation. The main method using the principle of boundedness is the method of “random subspaces”. Sometimes combinatorial problems allow for strong restrictions on the initial conditions and at the same time retain the hope that even after these restrictions, enough information will remain in the data so that the required solution can be found. There can be many options for how to limit the initial conditions. Not all of them can be successful. But if, nevertheless, there is a possibility that there are successful options for restrictions, then a complex problem can be divided into a large number of limited problems, each of which can be solved much more simply than the original one.

By combining these two principles, we can construct a solution to our problem.

Combinatorial space

Let's take the input bit vector and number its bits. Let's create combinatorial "points" and wire several random bits of the input vector to each point (figure below). Observing the input, each of these points sees not the whole picture but only a small part of it, determined by which bits converge at that point. For example, in the figure below the leftmost point, with index 0, monitors only bits 1, 6, 10 and 21 of the original input signal. Let's create quite a lot of such points and call their set a combinatorial space.


Combinatorial space
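A sketch of such a space in Python, using the default settings listed later in the article (the variable names are mine):

    import numpy as np

    rng = np.random.default_rng(0)

    INPUT_BITS = 256        # length of the input bit vector
    N_POINTS = 60000        # size of the combinatorial space
    BITS_PER_POINT = 32     # how many input bits converge at one point

    # each row lists the input-bit indices visible to one point
    points = np.stack([rng.choice(INPUT_BITS, BITS_PER_POINT, replace=False)
                       for _ in range(N_POINTS)])

    x = rng.integers(0, 2, INPUT_BITS)   # some input vector
    visible = x[points[0]]               # the fragment that point 0 sees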

What is the meaning of this space? We assume that the input signal is not random, but contains certain patterns. Patterns can be of two main types. Something in the input description may appear slightly more often than others. For example, in our case, individual letters appear more often than their combinations. In bit coding, this means that certain combinations of bits occur more often than others.

Another type of pattern is when, in addition to the input signal, there is an accompanying learning signal, and something contained in the input signal turns out to be associated with something contained in the learning signal. In our case, active output bits are a response to a combination of certain input bits.

If you look for patterns “head-on”, that is, looking at the entire input and entire output vectors, then it is not very clear what to do and where to move. If you start to build hypotheses about what may depend on what, then a combinatorial explosion immediately occurs. The number of possible hypotheses turns out to be monstrous.

A classic method widely used in neural networks is gradient descent. For gradient descent it is important to know which direction to move in. This is usually not difficult when there is only one output goal. For example, if we want to train a neural network to recognize digits, we show it images of digits and indicate which digit it sees. The network understands "how and where to descend." If we show pictures with several digits at once and name all these digits at the same time, without indicating where each one is, the situation becomes much more complicated.

When points of a combinatorial space with a very limited "view" (random subspaces) are created, some points may be lucky and see the pattern, if not entirely pure, then at least in a significantly purified form. Such a limited view makes it possible, for example, to carry out gradient descent and obtain a pure pattern. The probability that an individual point stumbles upon a pattern may not be very high, but one can always choose the number of points so that any pattern is guaranteed to "pop up somewhere."

Of course, if the points' view is made too narrow, that is, if the number of bits per point is chosen roughly equal to the number of bits expected in the pattern, then the size of the combinatorial space begins to approach the number of variants of a complete enumeration of the possible hypotheses, which brings back the combinatorial explosion. Fortunately, we can widen the points' view while reducing their total number. This reduction is not free, the combinatorics is "transferred into the points," but up to a certain limit it is not fatal.

Let's create an output vector. We simply wire several points of the combinatorial space to each output bit, choosing the points at random. The number of points per output bit corresponds to how many times we want to reduce the combinatorial space. Such an output vector is a hash function of the state vector of the combinatorial space. We'll discuss how this state is computed a little later.

In general, for example, as shown in the figure above, the size of the input and output may be different. In our example with string recoding, these sizes are the same.
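A sketch of this wiring (self-contained Python; the potentials that points produce are discussed later, here they are simply given):

    import numpy as np

    rng = np.random.default_rng(1)
    N_POINTS, OUTPUT_BITS = 60000, 256

    # each point is wired to one random output bit
    point_to_out = rng.integers(0, OUTPUT_BITS, N_POINTS)

    def output_vector(point_potentials, threshold=1.0):
        """Sum the potentials of the points wired to each output bit and
        fire the bit when the sum exceeds the threshold: the output is a
        hash of the combinatorial space's current state."""
        acc = np.zeros(OUTPUT_BITS)
        np.add.at(acc, point_to_out, point_potentials)
        return (acc > threshold).astype(int)

    potentials = np.zeros(N_POINTS)
    potentials[:5] = 2.0                      # five strongly firing points
    print(output_vector(potentials).sum())    # only the bits wired to them fire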

Receptor clusters

How do we look for patterns in the combinatorial space? Each point sees its own fragment of the input vector. If what a point sees contains quite a lot of active bits, we can assume that it is looking at some kind of pattern. That is, a set of active bits arriving at a point can be called a hypothesis about the presence of a pattern. Let us remember this hypothesis, that is, fix the set of active bits visible at the point. In the situation shown in the figure below, it is clear that point 0 must fix bits 1, 6 and 21.


Fixing bits in a cluster

We will call the stored index of a single bit a receptor for that bit. The receptor monitors the state of the corresponding bit of the input vector and reacts when a one appears there.

We will call a set of receptors a receptor cluster or a receptive cluster. When an input vector is presented, the receptors of the cluster respond if the corresponding positions of the vector contain ones. For a cluster, you can count the number of triggered receptors.
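As a data structure, a receptor cluster can be sketched like this (the names are mine, not from the attached program):

    class ReceptorCluster:
        """A fixed set of input-bit indices (receptors) stored at one point."""

        def __init__(self, bits):
            self.bits = frozenset(bits)

        def fired(self, input_vector):
            """Count the receptors whose input bit is currently 1."""
            return sum(input_vector[b] for b in self.bits)

    x = [0] * 32
    for i in (1, 6, 21):                 # active input bits
        x[i] = 1
    cluster = ReceptorCluster({1, 6, 10, 21})
    print(cluster.fired(x))              # 3 of the 4 receptors respond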

Since our information is encoded not by individual bits, but by a code, the accuracy with which we formulate a hypothesis depends on how many bits we take into the cluster. Attached to the article is the text of a program that solves the problem of string conversion. By default, the program has the following settings:

  • input vector length – 256 bits;
  • output vector length – 256 bits;
  • a single letter is encoded with 8 bits;
  • line length – 5 characters;
  • number of offset contexts – 10;
  • combinatorial space size – 60000;
  • number of bits intersecting at a point – 32;
  • cluster creation threshold – 6;
  • threshold for partial cluster activation – 4.

With such settings, almost every bit that is in the code of one letter is repeated in the code of another letter, or even in the codes of several letters. Therefore, a single receptor cannot reliably indicate a pattern. Two receptors indicate a letter much better, but they can also indicate a combination of completely different letters. We can introduce a certain length threshold, starting from which we can reliably judge whether the code fragment we need is in the cluster.

Let's introduce a minimum threshold for the number of receptors required to form a hypothesis (in the example it is 6) and start learning. We present the source code and the code we want to get as output. For the source code, it is easy to calculate how many active bits fall into each point of the combinatorial space. We select only those points that are connected to active bits of the output code and that see at least the cluster-creation threshold of active input bits. At such points we create clusters of receptors with the corresponding sets of bits and store these clusters at the points where they were created. To avoid duplicates, we first check that the points do not already contain exactly the same clusters.

Let's say the same thing in other words. From the output vector we know which bits should be active. Accordingly, we can select points in the combinatorial space associated with them. For each such point, we can formulate a hypothesis that what it now sees on the input vector is the pattern that is responsible for the activity of the bit to which this point is connected. We cannot say from one example whether this hypothesis is true or not, but no one is stopping us from putting forward an assumption.
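A sketch of one training step under these rules (illustrative Python; `points` maps each point to the input bits it watches, `point_to_out` to its output bit, and `clusters` keeps one set of stored hypotheses per point):

    def training_step(x_bits, y_bits, points, point_to_out, clusters,
                      creation_threshold=6):
        """For every point wired to an active output bit, record the active
        input bits it sees as a new hypothesis cluster, provided there are
        at least `creation_threshold` of them and the point does not
        already hold an identical cluster."""
        for p, watched in enumerate(points):
            if not y_bits[point_to_out[p]]:
                continue                           # output bit inactive
            active = frozenset(b for b in watched if x_bits[b])
            if len(active) >= creation_threshold and active not in clusters[p]:
                clusters[p].add(active)            # remember the hypothesis

    # clusters = [set() for _ in range(len(points))]  # one store per point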

Learning. Memory consolidation

During the learning process, each new example creates a huge number of hypotheses, most of which are incorrect. All these hypotheses have to be tested and the false ones weeded out. We can do this by observing whether they are confirmed by subsequent examples. In addition, when creating a new cluster we memorize all the bits the point sees, and these, even if they contain a pattern, also include random bits that got there from other concepts, which do not affect our output and in our case are noise. Accordingly, we must not only confirm or refute that the memorized combination of bits contains the desired pattern, but also clear this combination of noise, leaving only a "pure" rule.

There are different approaches to solving the problem. I will describe one of them without claiming that it is the best. I went through many options, this one captivated me with its quality of work and simplicity, but this does not mean that it cannot be improved.

It is convenient to think of clusters as autonomous computers. If each cluster can test its own hypothesis and make decisions independently of the others, this is very good for potential parallelization of the calculations. After creation, each receptor cluster begins an independent life: it monitors incoming signals, accumulates experience, changes itself and, if necessary, decides on self-destruction.

A cluster is a set of bits about which we assumed that there is a pattern inside it associated with the operation of the output bit to which the point containing this cluster is connected. If there is a pattern, then most likely it affects only part of the bits, and we do not know in advance which one. Therefore, we will record all moments when a significant number of receptors are activated in the cluster (in the example, at least 4). It is possible that at these moments the pattern, if there is one, manifests itself. When certain statistics accumulate, we can try to determine whether there is something natural in such partial cluster activations or not.

An example of statistics is shown in the figure below. The plus at the beginning of the line shows that at the moment the cluster was partially triggered, the output bit was also active. The cluster bits are formed from the corresponding bits of the input vector.


Chronicle of partial activation of a receptor cluster

What should interest us in these statistics? We care about which bits fire together more often than others. Do not confuse this with the most frequent bits: if we computed the frequency of occurrence of each bit and took the most frequent ones, we would get an averaging, which is not at all what we need. If several stable patterns converge at a point, averaging yields a mean "non-pattern" across them. In our example, it is clear that lines 1, 2 and 4 are similar to each other, and lines 3, 4 and 6 are also similar. We need to choose one of these patterns, preferably the strongest, and clear it of extraneous bits.

The combination of bits that most often fire together is the first principal component of these statistics. To compute the principal component, you can use a Hebbian filter. To do this, start with a weight vector with unit initial weights. Then obtain the cluster activity by multiplying the weight vector by the current state of the cluster, and shift the weights toward the current state, the more strongly the higher this activity. To keep the weights from growing without bound, after each change they must be normalized, for example, to the maximum value in the weight vector.

This procedure is repeated over all available examples. As a result, the weight vector gets closer and closer to the first principal component. If the available examples are not enough for convergence, the process can be repeated several times on the same examples, gradually reducing the learning rate.
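A sketch of this Hebbian procedure together with the trimming step described below (toy statistics; the 0.75 trimming threshold is the one mentioned later in the text):

    import numpy as np

    def hebbian_pc1(samples, epochs=10, lr=0.1, decay=0.7):
        """Shift the weights toward each sample in proportion to the
        cluster's response, normalize to the maximum weight, and slowly
        lower the learning rate; the weights approach the first principal
        component of the samples."""
        w = np.ones(samples.shape[1])
        for _ in range(epochs):
            for s in samples:
                activity = w @ s           # response to this activation
                w += lr * activity * s     # move toward the sample
                w /= w.max()               # keep the weights bounded
            lr *= decay
        return w

    # rows = recorded partial activations of one cluster (1 = receptor fired)
    stats = np.array([[1, 1, 0, 1, 0, 1],
                      [1, 1, 0, 1, 0, 0],
                      [0, 0, 1, 1, 1, 0],
                      [1, 1, 0, 1, 0, 1]])
    w = hebbian_pc1(stats)
    print([i for i, wi in enumerate(w) if wi > 0.75])   # receptors to keep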

The main idea is that as the weights approach the principal component, the cluster reacts more and more strongly to samples similar to it and less and less to the others; because of this, learning in the right direction proceeds faster than "bad" examples can spoil it. The result of this algorithm after several iterations is shown below.


The result obtained after several iterations of isolating the first principal component

If we now trim the cluster, that is, leave only those receptors with high weights (for example, above 0.75), then we will get a pattern cleared of unnecessary noise bits. This procedure can be repeated several times as statistics accumulate. As a result, we can understand whether there is any pattern in the cluster, or whether we have collected a random set of bits together. If there is no pattern, then trimming the cluster will result in a fragment that is too short. In this case, such a cluster can be removed as a failed hypothesis.

In addition to trimming the cluster, you need to make sure that exactly the desired pattern has been caught. The source string contains the codes of several letters, each of which is a pattern. Any of these codes can be "caught" by the cluster, but we are only interested in the code of the letter that affects the formation of the output bit. For this reason, most hypotheses will be false and must be rejected. This can be done based on the criterion that partial, or even complete, activation of the cluster too often fails to coincide with the activity of the desired output bit. Such clusters must be deleted. This process of checking and removing unnecessary clusters, along with their "trimming," can be called memory consolidation.

The accumulation of new clusters is quite fast; each new experience generates several thousand new hypothesis clusters. It is advisable to carry out training in stages with a break for "sleep." When a critical number of clusters has been created, it is necessary to switch to "idle" operation. In this mode, previously remembered experience is replayed, but no new hypotheses are created; only the old ones are tested. As a result of "sleep," a huge percentage of false hypotheses can be removed, leaving only those that have passed the test. After "sleep," the combinatorial space is not only cleared and ready to receive new information, it is also much more confident about what was learned "yesterday."

Combinatorial space output

As clusters accumulate statistics and undergo consolidation, clusters appear that are sufficiently confident that their hypothesis is true or close to true. We will take such clusters and monitor when they are fully activated, that is, when all the receptors of the cluster are active.

Next, we will form the output from this activity as a hash of the combinatorial space. In doing so, we take into account that the longer the cluster, the higher the chance that we have caught a real pattern; for short clusters, there is a possibility that the combination of bits arose by chance as a combination of other concepts. To increase noise immunity we use the idea of boosting, that is, we require that for short clusters the activation of the output bit occur only when several such clusters fire, while for long clusters a single firing is considered sufficient. This can be expressed through a potential that arises when clusters are triggered: the potential is higher the longer the cluster. The potentials of the points connected to one output bit are added, and if the resulting potential exceeds a certain threshold, the bit is activated.
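A sketch of this potential rule (the threshold is an illustrative number):

    def point_potential(fully_active_clusters):
        """Longer clusters contribute more potential: a simple boosting
        rule that trusts long patterns more than short ones."""
        return sum(len(cluster) for cluster in fully_active_clusters)

    def output_bit_fires(potentials, threshold=12):
        """An output bit fires when the summed potential of its points is
        high enough: one long cluster, or several short ones together."""
        return sum(potentials) > threshold

    short = point_potential([{1, 2, 3, 4, 5}])     # a single 5-bit cluster
    print(output_bit_fires([short]))               # False: too weak alone
    print(output_bit_fires([short, short, short])) # True: boosted by repetition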

After some training, the output begins to partially reproduce what we want to get (picture below).


An example of how a combinatorial space works during the learning process (about 200 steps). Above is the source code, in the middle is the required code, below is the code predicted by the combinatorial space.

Gradually, the output of the combinatorial space begins to better reproduce the required output code. After several thousand training steps, the output is reproduced with fairly high accuracy (figure below).


An example of how a trained combinatorial space works. Above is the source code, in the middle is the required code, below is the code predicted by the combinatorial space.

To visualize how it all works, I recorded a video of the learning process. Perhaps my explanations will also help you better understand the inner workings.

Strengthening the rules

Inhibitory receptors can be used to identify more complex patterns, that is, to introduce rules that block the operation of certain affirmative rules when a certain combination of input bits appears. This looks like the creation, under certain conditions, of a receptor cluster with inhibitory properties. When such a cluster fires, it decreases rather than increases the potential of its point.

It is not difficult to come up with rules for testing inhibitory hypotheses and trigger the consolidation of inhibitory receptive clusters.

Since inhibitory clusters are created at specific points, they do not affect the blocking of the output bit in general, but block its operation from the rules detected at this particular point. It is possible to complicate the connection architecture and introduce inhibitory rules that are common to a group of points or to all points connected to the output bit. It looks like you can come up with a lot more interesting things, but for now let’s focus on the simple model described.

Random Forest

The described mechanism allows you to find patterns that in Data Mining are usually called “if-then” type rules. Accordingly, one can find something in common between our model and all those methods that are traditionally used to solve such problems. Perhaps the closest to us is “random forest”.

This method starts from the idea of "random subspaces." If there are too many variables in the source data and these variables are only weakly correlated, it becomes difficult to isolate individual patterns from the full volume of data. In this case, subspaces can be created in which both the variables used and the training examples are limited. That is, each subspace contains only part of the input data, and this data is represented not by all the variables but by a random limited set of them. For some of these subspaces, the chances of detecting a pattern that is hard to see in the full data are greatly increased.

Then, in each subspace, a decision tree is trained on the limited set of variables and training examples. A decision tree is a tree-like structure (figure below) whose nodes check the input variables (attributes). The results of the checks at the nodes determine the path from the root to a terminal node, usually called a leaf of the tree. The leaf contains the result, which can be the value of some quantity or a class number.


Example of a decision tree

For decision trees, there are various learning algorithms that allow you to build a tree with more or less optimal attributes in its nodes.

At the final stage, the idea of boosting is applied. The decision trees form a voting committee, and the most plausible answer is formed from the collective opinion. The main advantage of this approach is the ability to combine many "weak" algorithms (each only slightly better than random) into an arbitrarily "good" final result.
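For comparison, here is the classical form of the same two ideas in scikit-learn (illustrative parameters: max_features limits the variables each split may look at, and the hundred trees vote):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=40,
                               n_informative=5, random_state=0)
    forest = RandomForestClassifier(n_estimators=100,    # committee size
                                    max_features="sqrt", # random subspace
                                    random_state=0).fit(X, y)
    print(forest.score(X, y))   # accuracy of the voting committee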

Our algorithm, which exploits combinatorial space and receptor clusters, uses the same fundamental ideas as the random forest method. Therefore, it is not surprising that our algorithm works and produces good results.

Biology of learning

Actually, this article describes the software implementation of the mechanisms that were described in the previous parts of the series, so we will not repeat everything from the very beginning; we will only note the main points. If you have forgotten how a neuron works, you can re-read that part.

There are many different receptors located on the neuron membrane. Most of these receptors are “free floating”. The membrane creates an environment for receptors in which they can move freely, easily changing their position on the surface of the neuron (Sheng, M., Nakagawa, T., 2002) (Tovar K. R., Westbrook G. L., 2002).


Membrane and receptors

In the classical approach, the reasons for such “freedom” of receptors are usually not emphasized. When a synapse increases its sensitivity, this is accompanied by the movement of receptors from the extrasynaptic space into the synaptic cleft (Malenka R.C., Nicoll R.A., 1999). This fact is tacitly perceived as justification for the mobility of receptors.

In our model, we can assume that the main reason for the mobility of receptors is the need to form clusters from them "on the fly." The picture looks like this. A variety of receptors, sensitive to various neurotransmitters, drift freely along the membrane. The information signal generated in the minicolumn causes the release of neurotransmitters by the axon endings of neurons and astrocytes. Each synapse where neurotransmitters are emitted has, in addition to the main neurotransmitter, its own unique additive that identifies this particular synapse. Neurotransmitters spill out of the synaptic clefts into the surrounding space, so that a specific cocktail of neurotransmitters appears at each place on the dendrite (the points of the combinatorial space); the ingredients of the cocktail indicate the bits that hit the point. Those freely wandering receptors that at this moment find their neurotransmitter in this cocktail (the receptors of specific bits of the input signal) pass into a new state, the search state. In this state they have a short time (until the next tick) during which they can meet other "active" receptors and create a common cluster, a cluster of receptors sensitive to a certain combination of bits.

Metabotropic receptors, and we are talking about them, have a rather complex shape (figure below). They consist of seven transmembrane domains that are connected by loops. In addition, they have two free ends. Due to electrostatic charges of different signs, the free ends can “stick” to each other through the membrane. Due to such connections, receptors are combined into clusters.


Single metabotropic receptor

After unification, the joint life of the receptors in the cluster begins. It can be assumed that the position of the receptors relative to each other can vary widely and the cluster can take on bizarre shapes. If we assume that receptors that fire together tend to take places closer to each other, for example due to electrostatic forces, then an interesting consequence arises. The closer such "jointly firing" receptors are, the stronger their mutual attraction, and as they get closer they begin to strengthen each other's influence. This reproduces the behavior of the Hebbian filter, which selects the first principal component: the more precisely the filter is tuned to the principal component, the stronger its reaction when that component appears in an example. Thus, if after a number of iterations the jointly firing receptors end up together in the conditional "center" of the cluster, with the "extra" receptors at a distance, at its edges, then such "extra" receptors can at some point self-destruct, that is, simply break away from the cluster. We then get cluster behavior similar to that described above in our computational model.

Clusters that have undergone consolidation can move somewhere "to a safe haven," for example, into a synaptic cleft. There the postsynaptic density offers an anchor to which clusters of receptors can attach, losing the mobility they no longer need. Ion channels are nearby, which the receptors can control through G proteins. Now these receptors begin to influence the formation of the local postsynaptic potential (the point potential).

The local potential consists of the joint influence of nearby activating and inhibitory receptors. In our approach, activators are responsible for recognizing patterns that call for activating the output bit, while inhibitors are responsible for identifying patterns that block the action of local rules.

Synapses (points) are located on the dendritic tree. If somewhere on this tree there is a place where several activating receptors fire at once in a small area and this is not blocked by inhibitory receptors, then a dendritic spike occurs, which spreads to the body of the neuron and, upon reaching the axon hillock, causes a spike in the neuron itself. A dendritic tree connects many synapses to a single neuron, much like generating the output bit of a combinatorial space.

Combining signals from different synapses of the same dendritic tree may not be a simple logical addition, but may be more complex and implement some kind of tricky boosting algorithm.

Let me remind you that the basic element of the cortex is the cortical minicolumn. In a mini-column, about a hundred neurons are located one below the other. At the same time, they are tightly enveloped in connections, which are much more abundant inside the minicolumn than the connections going to neighboring minicolumns. The entire cerebral cortex is a space of such mini-columns. One minicolumn neuron can correspond to one output bit, all neurons of one cortical minicolumn can be an analogue of the output binary vector.

The receptor clusters described in this chapter create the memory responsible for pattern-seeking. Previously, we described how to create a holographic event memory using receptor clusters. These are two different types of memory that perform different functions, although they are based on common mechanisms.

Sleep

In a healthy person, sleep begins with the first stage of slow-wave sleep, which lasts 5-10 minutes, followed by the second stage, which lasts about 20 minutes. The third and fourth stages take another 30-45 minutes. After this, the sleeper returns to the second stage of slow-wave sleep, after which the first episode of REM sleep occurs, with a short duration of about 5 minutes. During REM sleep, the eyeballs periodically make rapid movements under the closed eyelids. If you wake a person at this time, then in 90% of cases you will hear a story about a vivid dream. This entire sequence is called a cycle. The first cycle lasts 90-100 minutes. The cycles then repeat, with the proportion of slow-wave sleep decreasing and the proportion of REM sleep gradually increasing; the last REM episode can in some cases reach 1 hour. On average, full healthy sleep comprises five complete cycles.

It can be assumed that the main work of clearing the clusters of receptors accumulated during the day occurs in sleep. In the computational model, we described the "idle" training procedure: old experience is presented to the brain without causing the formation of new clusters, the goal being to test the existing hypotheses. This testing consists of two stages. The first is calculating the principal component of the pattern and checking that the number of bits responsible for it is sufficient for clear identification. The second is checking the truth of the hypothesis, that is, that the pattern has turned up at the right point, the one associated with the desired output bit. It can be assumed that some of the stages of night sleep are associated with such procedures.

All processes associated with changes in cells are accompanied by the expression of certain proteins and transcription factors. There are proteins and factors that have been shown to be involved in the formation of new experiences. So, it turns out that their number increases greatly during wakefulness and decreases sharply during sleep.

The concentration of proteins can be seen and assessed by staining a section of brain tissue with a dye that selectively reacts to the required protein. Similar observations have shown that the most widespread changes for proteins associated with memory occur during sleep (Chiara Cirelli, Giulio Tononi, 1998) (Cirelli, 2002) (figures below).


Arc protein distribution in the rat parietal cortex after three hours of sleep (S) and after three hours of spontaneous wakefulness (W) (Cirelli, 2002)


Distribution of the transcription factor P-CREB in the coronal regions of the rat parietal cortex after three hours of sleep (S) and in the case of three hours of sleep deprivation (SD) (Cirelli, 2002)

Such reasoning about the role of sleep fits well with the well-known saying that "the morning is wiser than the evening." In the morning we understand much better what was not particularly clear yesterday; everything becomes clearer and more obvious. It is possible that we owe this precisely to the large-scale clearing of receptor clusters that occurred during sleep. False and dubious hypotheses are removed, while reliable ones undergo consolidation and begin to participate more actively in information processes.

During the simulation, it was clear that the number of false hypotheses was many thousands of times greater than the number of true ones. Since one can be distinguished from the other only by time and experience, the brain has no choice but to accumulate all this informational ore in the hope of extracting grams of radium from it over time. As new experience is gained, the number of clusters with hypotheses that require testing grows constantly. The number of clusters formed in a day, containing ore that has yet to be processed, may exceed the number of clusters encoding the proven experience accumulated over the entire previous life.

The brain's resource for storing raw hypotheses that require testing must be limited. It seems that during the 16 hours of daytime wakefulness, clusters of receptors almost completely fill all the available space. When this moment comes, the brain begins to force us into sleep so that it can perform consolidation and clear free space. Apparently, the complete clearing process takes about 8 hours. If we are woken earlier, some of the clusters remain unprocessed, and this is where the phenomenon of accumulating fatigue arises. If you do not get enough sleep for several days, you have to make up for the lost sleep. Otherwise, the brain starts deleting clusters on an "emergency" basis, which leads to nothing good, since it deprives us of the opportunity to extract knowledge from the experience gained. Event memory is likely to be preserved, but the patterns will remain undetected.
By the way, my personal advice: do not neglect quality sleep, especially if you are studying. Don't try to skimp on sleep so you can get more done. Sleep is no less important for learning than attending lectures and reviewing material in practical classes. It is not for nothing that children, during those periods of development when the accumulation and synthesis of information is most active, spend most of their time sleeping.

Brain performance

The assumption about the role of receptive clusters allows us to take a fresh look at the issue of brain speed. Earlier we said that each mini-column of the cortex, consisting of hundreds of neurons, is an independent computing module that considers the interpretation of incoming information in a separate context. This allows one cortical zone to consider up to a million possible interpretation options simultaneously.

Now we can assume that each receptor cluster can work as an autonomous computational element, performing the entire cycle of calculations to test its own hypothesis. There can be hundreds of millions of such clusters in a cortical column alone. This means that although the frequencies at which the brain operates are far from the frequencies of modern computers, there is no need to worry about the brain's speed. Hundreds of millions of receptor clusters working in parallel in each minicolumn of the cortex make it possible to successfully solve complex problems that border on a combinatorial explosion. There will be no miracles, but you can learn to walk on the edge.


    Let's look at one of the useful options offered by Microsoft Excel.

    Today we will talk about Conditional Formatting. It is intended to highlight table cells that share common features: the same values, font, background, and so on. The operation can be configured in various ways: the strictness of the check, the kind of matches sought, and whether they must be exact or may vary.

    1. Start by launching Microsoft Excel with the table you need. Then select the range of cells to process: a set of columns and cells forming part of the table, or several unrelated areas of it.
    2. Next, go through the following path: HOME -> CONDITIONAL FORMATTING -> RULES FOR SELECTING CELLS -> DUPLICATE VALUES.
    3. The program offers a wide range of capabilities; in particular, you can choose how the selected cells are highlighted, pick a background fill (the program provides 6 color options), and vary the fonts and table borders. You can also select "CUSTOM FORMAT," which lets you create your own cell style. To apply the highlighting of matching cells, click OK.

    Using the EQUAL item

    If the cells you need to highlight contain one specific value, use the "EQUAL" item in the "CONDITIONAL FORMATTING" list, in the "RULES FOR SELECTING CELLS" section. In the dialog box that opens, point to the cells that need duplicate detection, and their address will appear in the adjacent field. Having mastered these simple skills, you can significantly reduce the time spent processing tabular data and grouping common values.

    Video: Finding matches in Excel

    This tutorial explains the main benefits of the INDEX and MATCH functions in Excel, which make them more attractive than VLOOKUP. You will see several examples of formulas that easily cope with many complex tasks against which VLOOKUP is powerless.

    In several recent articles, we made every effort to explain the basics of the VLOOKUP function to novice users and to show examples of more complex formulas to advanced users. Now we will try, if not to dissuade you from using VLOOKUP, then at least to show alternative ways of implementing vertical lookup in Excel.

    "Why do we need this?" you may ask. Because VLOOKUP is not the only lookup function in Excel, and its many limitations can prevent you from getting the desired result in many situations. The INDEX and MATCH functions, on the other hand, are more flexible and have a number of features that make them more attractive than VLOOKUP.

    Basic information about INDEX and MATCH

    Since the purpose of this tutorial is to show the capabilities of the INDEX and MATCH functions for implementing vertical lookup in Excel, we will not dwell on their syntax and application in detail.

    Here we present the minimum necessary to understand the essence, and then examine in detail examples of formulas that show the advantages of INDEX and MATCH over VLOOKUP.

    INDEX – function syntax and usage

    The INDEX function in Excel returns a value from an array at the given row and column numbers. The function has the following syntax:

    INDEX(array, row_num, [column_num])

    Each argument has a very simple explanation:

    • array is the range of cells from which you want to extract a value.
    • row_num is the number of the row in the array from which to extract a value. If it is not specified, the column_num argument is required.
    • column_num is the number of the column in the array from which to extract a value. If it is not specified, the row_num argument is required.

    If both arguments are specified, the INDEX function returns the value from the cell at the intersection of the specified row and column.

    Here is a simple example of the INDEX function:

    =INDEX(A1:C10,2,3)

    The formula looks through the range A1:C10 and returns the value of the cell in the 2nd row and 3rd column, that is, cell C2.

    Very simple, right? However, in practice you do not always know which row and column you need, and that is where the MATCH function helps.

    MATCH - function syntax and usage

    The MATCH function in Excel searches for a specified value in a range of cells and returns the relative position of that value within the range.

    For example, if the range B1:B3 contains the values New York, Paris, London, then the following formula will return the number 3, since "London" is the third element in the list.

    =MATCH("London",B1:B3,0)

    The MATCH function has the following syntax:

    MATCH(lookup_value, lookup_array, [match_type])

    • lookup_value is the number or text you are looking for. The argument can be a value, including a boolean, or a cell reference.
    • lookup_array is the range of cells in which the search takes place.
    • match_type tells MATCH whether to find an exact or an approximate match:
      • 1 or omitted – finds the largest value that is less than or equal to the lookup value. The lookup array must be sorted in ascending order, from smallest to largest.
      • 0 – finds the first value exactly equal to the lookup value. For the INDEX/MATCH combination you almost always need an exact match, so the third argument of MATCH should be 0.
      • -1 – finds the smallest value that is greater than or equal to the lookup value. The lookup array must be sorted in descending order, from largest to smallest.

    At first glance, the benefit of the MATCH function may seem doubtful. Who needs to know the position of an element in a range? We want to know the value of that element!

    Recall that the relative position of the value we are looking for (i.e., the row and/or column number) is exactly what we must supply to the row_num and/or column_num arguments of the INDEX function. As you remember, INDEX can return the value at the intersection of a given row and column, but it cannot determine which row and column we are interested in.

    How to Use INDEX and MATCH in Excel

    Now that you know the basics of these two functions, I believe it is becoming clear how MATCH and INDEX can work together: MATCH determines the relative position of the lookup value in a given range of cells, and INDEX uses that number (or numbers) to return the result from the corresponding cell.

    Still not entirely clear? Think of INDEX and MATCH in this form:

    =INDEX(the column from which we extract the value, MATCH(the lookup value, the column in which we search, 0))

    I think an example will make it even easier to understand. Suppose you have the following list of capitals:

    Let's find the population of one of the capitals, for example that of Japan, using the following formula:

    =INDEX($D$2:$D$10,MATCH("Japan",$B$2:$B$10,0))

    Now let's look at what each element of this formula does:

    • The MATCH function looks for the value "Japan" in column B, specifically in cells B2:B10, and returns the number 3, since "Japan" is third in the list.
    • The INDEX function uses 3 for the row_num argument, which specifies the row from which the value should be returned. That is, we get the simple formula:

      =INDEX($D$2:$D$10,3)

      The formula says, roughly: look through cells D2 to D10 and extract the value from the third row of the range, that is, from cell D4, since counting starts from the second row of the sheet.

    This is the result you get in Excel:

    Important! The number of rows and columns in the array used by INDEX must match the values of the row_num and column_num arguments produced by the MATCH functions. Otherwise, the result of the formula will be erroneous.

    Wait, wait... why can't we just use the VLOOKUP function? Is there any point in wasting time figuring out the mazes of MATCH and INDEX?

    =VLOOKUP("Japan",$B$2:$D$10,3)

    In this case, there is none! This example is purely for demonstration, so that you can understand how the MATCH and INDEX functions work in a pair. The following examples will show you the true power of the INDEX and MATCH combination, which easily copes with many difficult situations where VLOOKUP finds itself at a dead end.

    Why is INDEX/MATCH better than VLOOKUP?

    When deciding which formula to use for vertical lookup, most Excel gurus agree that INDEX/MATCH is much better than VLOOKUP. Nevertheless, many Excel users still resort to VLOOKUP because it is the simpler function. This happens because very few people fully understand all the benefits of switching from VLOOKUP to the INDEX/MATCH combination, and no one wants to spend time studying a more complex formula.

    4 Main Benefits of Using MATCH/INDEX in Excel:

    1. Search from right to left. As any competent Excel user knows, VLOOKUP cannot look to the left, which means the lookup value must be in the leftmost column of the examined range. With MATCH/INDEX, the lookup column can be on either the left or the right side of the search range. An example below will show this feature in action.

    2. Safely add or remove columns. Formulas with VLOOKUP stop working or return erroneous values if you remove a column from, or add a column to, the lookup table. For VLOOKUP, any inserted or removed column changes the result of the formula, because VLOOKUP's syntax requires you to specify the entire range and the specific column number from which to extract the data.

    For example, if you have the table A1:C10 and want to retrieve data from column B, you need to set the value 2 for the col_index_num argument of VLOOKUP, like this:

    =VLOOKUP("lookup value",A1:C10,2)

    If you later insert a new column between columns A and B, the value of this argument will have to be changed from 2 to 3, otherwise the formula will return the result from the newly inserted column.

    With MATCH/INDEX, you can remove or add columns in the examined range without distorting the result, since the column containing the desired value is referenced directly. This is a big advantage, especially when working with large amounts of data: you can add and remove columns without worrying about fixing every VLOOKUP formula you use.

    3. No limit on the size of the lookup value. When using VLOOKUP, remember that the length of the lookup value is limited to 255 characters; otherwise you risk getting a #VALUE! error. So, if the table contains long strings, the only workable solution is INDEX/MATCH.

    Let's say you use this VLOOKUP formula, which searches cells B5 through D10 for the value specified in cell A2:

    =VLOOKUP(A2,B5:D10,3,FALSE)

    The formula will not work if the value in cell A2 is longer than 255 characters. Instead, you need a similar INDEX/MATCH formula:

    =INDEX(D5:D10,MATCH(TRUE,INDEX(B5:B10=A2,0),0))

    4. Higher speed. If you work with small tables, the difference in Excel's performance will most likely be unnoticeable, especially in recent versions. But if you work with large tables containing thousands of rows and hundreds of lookup formulas, Excel will work much faster if you use MATCH and INDEX instead of VLOOKUP. In general, this replacement increases the speed of Excel by about 13%.

    The impact of VLOOKUP on Excel's performance is especially noticeable if the workbook contains hundreds of complex array formulas such as VLOOKUP+SUM. The point is that checking each value in the array requires a separate call to VLOOKUP. So the more values the array contains and the more array formulas your table holds, the slower Excel works.

    A formula with MATCH and INDEX, on the other hand, simply performs the lookup and returns the result, doing the same job noticeably faster.

    INDEX and MATCH - examples of formulas

    Now that you understand why the MATCH and INDEX functions are worth learning, let's get to the fun part and see how to apply this theoretical knowledge in practice.

    How to search from the left side using MATCH and INDEX

    Any VLOOKUP tutorial says that this function cannot look to the left. That is, if the lookup column is not the leftmost one in the search range, there is no chance of getting the desired result from VLOOKUP.

    The MATCH and INDEX functions in Excel are much more flexible and do not care where the column with the value to retrieve is located. For example, let's return to the table with capitals and populations. This time we will write a MATCH/INDEX formula that shows what place the capital of Russia (Moscow) occupies by population.

    As you can see in the figure below, the formula handles this task perfectly:

    =INDEX($A$2:$A$10,MATCH("Russia",$B$2:$B$10,0))

    Now you should have no problem understanding how this formula works:

    • First, we use the MATCH function, which finds the position of "Russia" in the list:

      =MATCH("Russia",$B$2:$B$10,0)

    • Next, we set the range for the INDEX function from which to extract the value. In our case it is A2:A10.
    • Then we combine both parts and get the formula:

      =INDEX($A$2:$A$10,MATCH("Russia",$B$2:$B$10,0))

    Tip: it is wise to always use absolute references with INDEX and MATCH, so that the search ranges do not shift when the formula is copied to other cells.

    Calculations using INDEX and MATCH in Excel (AVERAGE, MAX, MIN)

    You can nest other Excel functions inside INDEX and MATCH, for example to find the minimum, maximum, or the value closest to the average. Here are several formula variants applied to the same capitals table:

    1. MAX. The formula finds the maximum in column D and returns the value from column C in the same row:

    =INDEX($C$2:$C$10,MATCH(MAX($D$2:$D$10),$D$2:$D$10,0))

    Result: Beijing

    2. MIN. The formula finds the minimum in column D and returns the value from column C in the same row:

    =INDEX($C$2:$C$10,MATCH(MIN($D$2:$D$10),$D$2:$D$10,0))

    Result: Lima

    3. AVERAGE. The formula calculates the average of the range D2:D10, then finds the value closest to it and returns the value from column C in the same row:

    =INDEX($C$2:$C$10,MATCH(AVERAGE($D$2:$D$10),$D$2:$D$10,1))

    Result: Moscow

    Things to remember when using the AVERAGE function with INDEX and MATCH

    When using AVERAGE in combination with INDEX and MATCH, you will most often need 1 or -1 as the third argument of MATCH, unless you are sure that the examined range contains a value exactly equal to the average. If you are sure such a value exists, specify 0 to find an exact match.

    • If you specify 1, the values in the lookup column must be sorted in ascending order, and the formula returns the largest value that is less than or equal to the average.
    • If you specify -1, the values in the lookup column must be sorted in descending order, and the smallest value that is greater than or equal to the average is returned.

    In our example, the values in column D are sorted in ascending order, so we use match type 1. The INDEX/MATCH formula returns "Moscow," since the population of Moscow is the closest value below the average (12,269,006).

    How to use INDEX and MATCH to search a known row and column

    This formula is the equivalent of a two-dimensional VLOOKUP and allows you to find the value at the intersection of a specific row and column.

    In this example, the INDEX/MATCH formula will be very similar to the formulas we have already discussed in this lesson, with only one difference. Guess which one?

    As you remember, the syntax of the INDEX function allows three arguments:

    INDEX(array, row_num, [column_num])

    And I congratulate those of you who guessed it!

    Let's start by writing down the formula template. Take the already familiar INDEX/MATCH formula and add to it a second MATCH function, which will return the column number:

    =INDEX(your table, MATCH(value for vertical lookup, column to search in, 0), MATCH(value for horizontal lookup, row to search in, 0))

    Note that for a two-dimensional lookup you must specify the entire table in the array argument of the INDEX function.

    Now let's try this template in practice. Below you see a list of the most populated countries in the world. Suppose our task is to find the population of the United States in 2015.

    Okay, let's write down the formula. When I need to create a complex formula in Excel with nested functions, I first write down each nested function separately.

    So let's start with the two MATCH functions that will return the row and column numbers for INDEX:

    • MATCH for the column – we look in column B, more precisely in the range B2:B11, for the value specified in cell H2 ("USA"). The function looks like this:

      =MATCH($H$2,$B$1:$B$11,0)

      The result of this formula will be 4, since "USA" is the 4th element of the list in column B (including the title).

    • MATCH for the row – we look for the value of cell H3 ("2015") in row 1, that is, in cells A1:E1:

      =MATCH($H$3,$A$1:$E$1,0)

      The result of this formula will be 5, since "2015" is in the 5th column.

    Now we insert these formulas into the INDEX function and voilà:

    =INDEX($A$1:$E$11,MATCH($H$2,$B$1:$B$11,0),MATCH($H$3,$A$1:$E$1,0))

    If you replace the MATCH functions with the values they return, the formula becomes easy to understand:

    =INDEX($A$1:$E$11,4,5)

    This formula returns the value at the intersection of the 4th row and 5th column of the range A1:E11, that is, the value of cell E4. Simple? Yes!

    Multi-criteria search with INDEX and MATCH

    In the VLOOKUP tutorial we showed an example of a formula that searches by multiple criteria. However, a significant limitation of that solution was the need to add an auxiliary column. Good news: an INDEX/MATCH formula can search by the values in two columns without any need to create a helper column!

    Suppose we have a list of orders and want to find the amount by two criteria: the buyer's name (Customer) and the product (Product). The matter is complicated by the fact that one buyer can buy several different products at once, and the buyers' names in the table on the Lookup table sheet are arranged in random order.

    Here is the INDEX/MATCH formula that solves the problem:

    {=INDEX('Lookup table'!$A$2:$C$13,MATCH(1,(A2='Lookup table'!$A$2:$A$13)*
    (B2='Lookup table'!$B$2:$B$13),0),3)}

    This formula is more complex than the others we discussed earlier, but armed with knowledge of INDEX and MATCH you will defeat it. The hardest part is the MATCH function, so let's sort it out first:

    MATCH(1,(A2='Lookup table'!$A$2:$A$13)*(B2='Lookup table'!$B$2:$B$13),0)

    In the formula above, the lookup value is 1, and the lookup array is the result of a multiplication. So what do we multiply, and why? Let's take it in order:

    • Take the first value in column A (Customer) on the Main table sheet and compare it with all the buyers' names in the table on the Lookup table sheet (A2:A13).
    • If a match is found, the expression returns 1 (TRUE); if not, 0 (FALSE).
    • Next, we do the same for the values in column B (Product).
    • Then we multiply the results obtained (the 1s and 0s). Only if matches are found in both columns (i.e., both criteria are met) do you get 1. If both criteria fail, or only one of them is met, you get 0.

    Now you understand why we set 1 as the lookup value? Right: so that MATCH returns a position only when both criteria are met.

    Note: in this case you must use the third, optional argument of the INDEX function. It is needed because the first argument specifies the entire table, and you must tell the function from which column to retrieve the value. In our case it is column C (Sum), so we entered 3.

    And finally, because the formula must check every cell in the array, it must be an array formula. You can recognize this by the curly braces it is enclosed in. So when you finish entering the formula, don't forget to press Ctrl+Shift+Enter.

    If everything is done correctly, you will get the result as in the figure below:

    INDEX and MATCH combined with IFERROR in Excel

    As you have probably noticed (more than once), if you enter an incorrect value, for example one that is not in the examined array, an INDEX/MATCH formula reports the #N/A or #VALUE! error. If you want to replace such a message with something more understandable, you can wrap your INDEX and MATCH formula in the IFERROR function.

    The syntax of IFERROR is very simple:

    IFERROR(value, value_if_error)

    Here the value argument is the value being checked for an error (in our case, the result of the INDEX/MATCH formula), and value_if_error is the value to return if the formula throws an error.

    For example, you can wrap the formula in IFERROR like this:

    =IFERROR(INDEX($A$1:$E$11,MATCH($G$2,$B$1:$B$11,0),MATCH($G$3,$A$1:$E$1,0)),
    "No matches found. Try again!")

    And now, if someone enters an incorrect value, the formula will produce this result:

    If you prefer to leave the cell empty in case of an error, you can use an empty string ("") as the value of the second argument of IFERROR. Like this:

    =IFERROR(INDEX(array,MATCH(lookup_value,lookup_array,0)),"")

    I hope you found at least one formula described in this tutorial useful. If you have encountered other lookup problems for which you could not find a suitable solution here, feel free to describe your problem in the comments, and we will all try to solve it together.