N-grams in linguistics. Programs using letter N-grams

Using N-grams

General use of N-grams

  • extracting data to cluster series of satellite images of the Earth, in order to determine which parts of the Earth appear in a particular image,
  • searching for genetic sequences,
  • in genetics, determining which animal species a DNA sample was collected from,
  • in data compression,
  • indexing data associated with audio.

N-grams are also widely used in natural language processing.

Using N-Grams for Natural Language Processing Needs

In the field of natural language processing, N-grams are used primarily for prediction based on probabilistic models. An N-gram model estimates the probability of the last word of an N-gram given all of the preceding ones. When this approach is used to model a language, it is assumed that the occurrence of each word depends only on the preceding words.

Another application of N-grams is plagiarism detection. If a text is divided into many small fragments represented by N-grams, these fragments can easily be compared with one another, yielding a measure of similarity between the documents under examination. N-grams are also used successfully for text and language categorization. In addition, they can be used to build features for extracting knowledge from text data, and to efficiently find candidate replacements for misspelled words.
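As a rough sketch of the similarity idea, two texts can be compared by the overlap of their character trigram sets; the Jaccard measure and the function names below are illustrative choices, not a prescribed recipe:

```php
<?php
// Split a string into its set of character n-grams (trigrams by default).
function charNgrams(string $text, int $n = 3): array {
    $text = mb_strtolower($text);
    $grams = [];
    $len = mb_strlen($text);
    for ($i = 0; $i <= $len - $n; $i++) {
        $grams[] = mb_substr($text, $i, $n);
    }
    return array_unique($grams);
}

// Jaccard similarity of the two n-gram sets: |A ∩ B| / |A ∪ B|.
function ngramSimilarity(string $a, string $b, int $n = 3): float {
    $ga = charNgrams($a, $n);
    $gb = charNgrams($b, $n);
    $intersection = count(array_intersect($ga, $gb));
    $union = count(array_unique(array_merge($ga, $gb)));
    return $union > 0 ? $intersection / $union : 0.0;
}

echo ngramSimilarity("the quick brown fox", "the quick brown dog"); // 0.7
```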

Google Research Projects

Google's research centers have used N-gram models for a wide range of research and development work, including statistical machine translation, speech recognition, spelling correction, information extraction, and much more. For these projects, text corpora containing several trillion words were used.

Google eventually decided to build its own training corpus. The project, called the Google teracorpus, contains 1,024,908,267,229 words collected from public websites.

Methods for extracting n-grams

Because N-grams are used so often to solve various problems, a reliable and fast algorithm for extracting them from text is needed. A suitable N-gram extraction tool should be able to handle text of unlimited size, work quickly, and make efficient use of the available resources. There are several methods for extracting N-grams from text, based on different principles.


N-grams are considered both as a means of capturing linguistic reality and as a model construct, and the connection between the N-gram model and formal grammars is examined. Attention is drawn to the shortcomings and contradictions associated with the use of probabilistic models.

Introduction

Let us start with a formal definition. Let a finite alphabet V_T = {w_i} be given, where w_i is an individual symbol. The set of finite-length chains (strings) composed of symbols of the alphabet V_T is called a language over the alphabet V_T and is denoted L(V_T). An individual chain of the language L(V_T) will be called an utterance in this language. In turn, an N-gram over the alphabet V_T is a chain of length N. An N-gram may coincide with some utterance, be a substring of an utterance, or not occur in any utterance of L(V_T) at all.

Here are some examples of N-grams: over the alphabet V_T = {a, b}, the chains ab and ba are 2-grams (bigrams), and aba is a 3-gram.


Fuzzy search algorithms without indexing (Online)

These algorithms are designed to search in previously unknown text and can be used, for example, in text editors, document viewers, or web browsers for searching within a page. They require no preprocessing of the text and can operate on a continuous stream of data.

Linear search

Simple sequential application of a given metric (for example, the Levenshtein metric) to the words of the input text. When a metric with a cutoff is used, this method allows optimal speed, but the running time still grows with the number of allowed errors k. Asymptotic time estimate: O(kn).
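A minimal sketch of such a linear scan in PHP, using the built-in levenshtein() function (which works byte-wise, so this is only suitable for single-byte encodings; the function name is illustrative):

```php
<?php
// Return every word of $text whose Levenshtein distance to $query is at most $k.
function linearFuzzySearch(string $text, string $query, int $k): array {
    $matches = [];
    foreach (preg_split('/\W+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (levenshtein(strtolower($word), strtolower($query)) <= $k) {
            $matches[] = $word;
        }
    }
    return $matches;
}

print_r(linearFuzzySearch("The quick brown fox jumps over the lazy dog", "brwon", 2));
// ["brown"]
```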

Bitap (also known as Shift-Or or Baeza-Yates-Gonnet, with the Wu-Manber modification)

The Bitap algorithm and its various modifications are most often used for fuzzy search without indexing. A variant of it is used, for example, in the Unix utility agrep, which performs functions similar to the standard grep but supports errors in the search query and even provides limited facilities for regular expressions.

The idea of this algorithm was first proposed by Ricardo Baeza-Yates and Gaston Gonnet, who published an article about it in 1992.
The original version of the algorithm handles only character substitutions and, in effect, computes the Hamming distance. A little later, Sun Wu and Udi Manber proposed a modification of the algorithm for computing the Levenshtein distance, i.e. with support for insertions and deletions, and developed the first version of the agrep utility on its basis.






The resulting vector R is computed iteratively over the text using bitwise shifts, ORs, and ANDs, where k is the number of errors, j is the index of the current text character, and s_x is the character mask of the character x (in the mask, the set bits are located at the positions where this character occurs in the query).
Whether the query matches at the current position is determined by the very last bit of the resulting vector R.

The high speed of this algorithm is ensured by the bit parallelism of the computation: a single operation performs calculations on 32 or more bits simultaneously.
At the same time, a trivial implementation supports searching only for words of at most 32 characters, a limit determined by the width of the standard int type (on 32-bit architectures). Wider types can be used, but this may slow the algorithm down somewhat.

Although the asymptotic running time of this algorithm, O(kn), coincides with that of the linear method, it is much faster for long queries and for numbers of errors k greater than 2.
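A minimal PHP sketch of the Shift-And scheme with the Wu-Manber extension, following the standard formulation of the recurrences (the function name and the example strings are illustrative):

```php
<?php
// Bitap (Shift-And) with the Wu-Manber extension: find the first position in $text
// where an occurrence of $pattern with at most $k Levenshtein errors ends, or -1.
// Byte-oriented; the pattern length is limited by the integer width.
function bitapFuzzy(string $text, string $pattern, int $k): int {
    $m = strlen($pattern);
    if ($m === 0 || $m > PHP_INT_SIZE * 8 - 1) {
        return -1;
    }
    // Character masks: bit $i of $mask[$c] is set iff $pattern[$i] === $c.
    $mask = [];
    for ($i = 0; $i < $m; $i++) {
        $c = $pattern[$i];
        $mask[$c] = ($mask[$c] ?? 0) | (1 << $i);
    }
    // $r[$d]: bit $i set means the pattern prefix of length $i + 1 matches
    // a suffix of the processed text with at most $d errors.
    $r = [];
    for ($d = 0; $d <= $k; $d++) {
        $r[$d] = (1 << $d) - 1;
    }
    $accept = 1 << ($m - 1);
    for ($j = 0, $n = strlen($text); $j < $n; $j++) {
        $s = $mask[$text[$j]] ?? 0;
        $old = $r;
        $r[0] = (($old[0] << 1) | 1) & $s;
        for ($d = 1; $d <= $k; $d++) {
            $r[$d] = ((($old[$d] << 1) & $s)  // exact extension (match)
                | $old[$d - 1]                // extra character in the text (insertion)
                | ($old[$d - 1] << 1)         // substitution
                | ($r[$d - 1] << 1))          // missing character in the text (deletion)
                | 1;
        }
        if ($r[$k] & $accept) {
            return $j; // an approximate occurrence ends here
        }
    }
    return -1;
}

echo bitapFuzzy("hello wrold", "world", 2); // 10
```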

Testing

Testing was carried out on a text of 3.2 million words, the average word length was 10.
Exact search
Search time: 3562 ms
Search using the Levenshtein metric
Search time at k=2: 5728 ms
Search time at k=5: 8385 ms
Search using the Bitap algorithm with Wu-Manber modifications
Search time at k=2: 5499 ms
Search time at k=5: 5928 ms

It is obvious that a simple search using a metric, unlike the Bitap algorithm, is highly dependent on the number of errors k.

However, when it comes to searching large, unchanging texts, the search time can be significantly reduced by pre-processing such text, also called indexing.

Fuzzy search algorithms with indexing (Offline)

A feature of all fuzzy search algorithms with indexing is that the index is built using a dictionary compiled from the source text or a list of records in a database.

These algorithms use different approaches to solving the problem - some of them use reduction to exact search, others use the properties of the metric to construct various spatial structures, and so on.

In the first step, a dictionary is built from the source text, containing the words and their positions in the text. Word and phrase frequencies can also be counted to improve the quality of the search results.

It is assumed that the index, like the dictionary, is entirely loaded into memory.

Tactical and technical characteristics of the dictionary:

  • Source text - 8.2 gigabytes of materials from the Moshkov library (lib.ru), 680 million words;
  • Dictionary size - 65 megabytes;
  • Number of words - 3.2 million;
  • The average word length is 9.5 characters;
  • Mean square word length (can be useful when evaluating some algorithms) - 10.0 characters;
  • Alphabet - uppercase Russian letters А-Я, without Ё (to simplify some operations). Words containing non-alphabetic characters are not included in the dictionary.
The dependence of the dictionary size on the volume of text is not strictly linear: up to a certain volume a basic core of words is formed, amounting to about 15% at 500 thousand words and 5% at 5 million words; after that the dependence becomes nearly linear, slowly decreasing to about 0.5% at 680 million words. Subsequent growth is sustained mostly by rare words.

Sampling expansion algorithm

This algorithm is often used in spell-checking systems, where the dictionary is small or where speed is not the main criterion.
It is based on reducing the fuzzy search problem to an exact search problem.

From the original query, a set of “erroneous” words is constructed, for each of which an exact search is then performed in the dictionary.

Its running time depends strongly on the number of errors k and on the size of the alphabet A; when a binary search over the dictionary is used, each generated variant additionally costs a logarithmic lookup.

For example, with k = 1 and a word of length 7 (for example, "Crocodile") over the Russian alphabet, the set of erroneous words will contain about 450 entries, i.e. about 450 dictionary queries are needed, which is quite acceptable.
But already at k = 2 the size of the set exceeds 115 thousand variants, which corresponds to a full scan of a small dictionary (1/27 of the dictionary in our case), so the running time becomes quite long. And for each of these words an exact-match lookup in the dictionary still has to be performed.
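A sketch of the variant-generation step for k = 1 (the function name is illustrative; every generated variant is then looked up exactly in the dictionary):

```php
<?php
// Generate every string within Levenshtein distance 1 of $word
// (deletions, substitutions, insertions) over the given alphabet.
function editVariants(string $word, array $alphabet): array {
    $variants = [];
    $len = strlen($word);
    for ($i = 0; $i < $len; $i++) {
        $variants[] = substr($word, 0, $i) . substr($word, $i + 1);          // deletion
        foreach ($alphabet as $c) {
            $variants[] = substr($word, 0, $i) . $c . substr($word, $i + 1); // substitution
        }
    }
    for ($i = 0; $i <= $len; $i++) {
        foreach ($alphabet as $c) {
            $variants[] = substr($word, 0, $i) . $c . substr($word, $i);     // insertion
        }
    }
    return array_unique($variants);
}

$candidates = editVariants('crocodile', range('a', 'z'));
echo count($candidates); // several hundred variants, each then looked up exactly in the dictionary
```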

Peculiarities:
The algorithm can easily be modified to generate "erroneous" variants according to arbitrary rules and, moreover, requires no preliminary processing of the dictionary and hence no additional memory.
Possible improvements:
Instead of generating the full set of "erroneous" words, one can generate only those that are most likely to occur in practice, for example words with common spelling mistakes or typos.

N-gram method

This method was invented quite a long time ago and is the most widely used, since its implementation is extremely simple and it provides fairly good performance. The algorithm is based on the principle:
"If word A matches word B up to several errors, then with a high degree of probability they have at least one common substring of length N."
These substrings of length N are called N-grams.
During indexing, a word is broken down into its N-grams, and the word is added to the list for each of these N-grams. During the search, the query is likewise divided into N-grams, and for each of them the list of words containing that substring is scanned sequentially.

In practice, trigrams - substrings of length 3 - are used most often. Choosing a larger value of N limits the minimum word length at which errors can still be detected.
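A minimal sketch of a trigram index and the corresponding lookup (illustrative names; candidates are verified with the Levenshtein metric, as described above):

```php
<?php
// Build an inverted index: trigram => set of dictionary words containing it.
function buildTrigramIndex(array $dictionary): array {
    $index = [];
    foreach ($dictionary as $word) {
        for ($i = 0, $len = strlen($word); $i <= $len - 3; $i++) {
            $index[substr($word, $i, 3)][$word] = true;
        }
    }
    return $index;
}

// Collect candidates sharing at least one trigram with the query,
// then verify each candidate with the Levenshtein metric.
function trigramSearch(array $index, string $query, int $k): array {
    $candidates = [];
    for ($i = 0, $len = strlen($query); $i <= $len - 3; $i++) {
        $candidates += $index[substr($query, $i, 3)] ?? [];
    }
    $result = [];
    foreach (array_keys($candidates) as $word) {
        if (levenshtein($word, $query) <= $k) {
            $result[] = $word;
        }
    }
    return $result;
}

$index = buildTrigramIndex(['vodka', 'votka', 'vine', 'water']);
print_r(trigramSearch($index, 'wodka', 2));
// ['vodka'] — 'votka' is missed because it shares no trigram with the query
```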

Peculiarities:
The N-gram algorithm does not find all possible misspelled words. Take, for example, the word VOTKA and decompose it into trigrams: VOTKA → VOT OTK TKA. All of these trigrams contain the erroneous letter T. Thus, the word "VODKA" will not be found, since it contains none of these trigrams and therefore does not appear in the corresponding lists. The shorter the word and the more errors it contains, the higher the chance that it will not appear in the lists corresponding to the query's N-grams and will be absent from the result.

At the same time, the N-gram method leaves full freedom to use custom metrics with arbitrary properties and complexity, but there is a price: this method still requires sequentially scanning about 15% of the dictionary, which is quite a lot for large dictionaries.

Possible improvements:
The N-gram hash tables can be partitioned by word length and by the position of the N-gram within the word (modification 1). Since the lengths of a matching word and of the query cannot differ by more than k, and the positions of an N-gram in the two words can differ by at most k, it suffices to check only the table corresponding to the position of this N-gram in the word, plus the k tables to the left and the k tables to the right, i.e. 2k + 1 neighbouring tables in total.

The size of the set that has to be scanned can be reduced further by also partitioning the tables by word length and, similarly, looking only at the 2k + 1 neighbouring tables (modification 2).

Signature hashing method

This algorithm is described in L. M. Boytsov's article "Signature hashing". It is based on a fairly obvious representation of the "structure" of a word as a bit vector, used as a hash (signature) in a hash table.

During indexing, such a hash is computed for each word, and the list of dictionary words corresponding to each hash value is stored in the table. Then, during the search, the hash of the query is computed, and all neighbouring hashes that differ from it in no more than k bits are enumerated; for each such hash, the corresponding list of words is scanned.

The hash is computed as follows: each bit of the hash is associated with a group of characters of the alphabet. A 1 in position i of the hash means that the source word contains a character from the i-th alphabet group. The order of the letters in the word has no significance whatsoever.
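A sketch of such a signature computation (the way the alphabet is split into groups here is an arbitrary illustrative choice):

```php
<?php
// Signature hash: each bit corresponds to a group of alphabet letters and is set
// if the word contains at least one letter from that group. Here the 26 Latin
// letters are folded into 16 groups by position in the alphabet.
function signature(string $word, int $bits = 16): int {
    $hash = 0;
    foreach (str_split(strtolower($word)) as $c) {
        $pos = ord($c) - ord('a');
        if ($pos >= 0 && $pos < 26) {
            $hash |= 1 << ($pos % $bits);
        }
    }
    return $hash;
}

printf("%016b\n", signature("vodka")); // five different bits set, one per distinct letter group
```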

Deleting a character either leaves the hash unchanged (if the word still contains characters from the same alphabet group) or sets the bit corresponding to that group to 0. Insertion, likewise, either sets one bit to 1 or changes nothing. Substitution is a little more complicated: the hash can either stay the same or change in 1 or 2 positions. Transpositions change nothing at all, because the order of symbols is not taken into account when the hash is built. Thus, to fully cover k errors, at least 2k bits of the hash must be allowed to change.

On average, for k "incomplete" errors (insertions, deletions and transpositions, as well as a small share of substitutions), the running time is determined by the number of neighbouring hashes, within k bits of the query hash, that have to be enumerated.

Peculiarities:
Because replacing a single character can change two bits at once, an algorithm that enumerates, say, distortions of at most 2 bits will not actually return the complete set of results: a noticeable fraction of words with two substitutions will be missing (the fraction depends on the ratio of the hash size to the alphabet size - the larger the hash, the more often a substitution changes two bits at once and the less complete the result). In addition, this algorithm does not allow prefix search.

BK-trees

Burkhard-Keller trees are metric trees; algorithms for constructing such trees rely on the property that the metric satisfies the triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$.

This property allows metrics to define metric spaces of arbitrary dimension. Such metric spaces are not necessarily Euclidean; for example, the Levenshtein and Damerau-Levenshtein metrics define non-Euclidean spaces. Based on these properties, one can build a data structure for searching in such a metric space - the Burkhard-Keller tree.

Improvements:
Some metrics can compute the distance with a cutoff: an upper limit equal to the sum of the maximum distance to the vertex's children and the distance already obtained is set, which speeds up the process somewhat.
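A minimal BK-tree sketch over the Levenshtein metric (illustrative; without the cutoff improvement mentioned above):

```php
<?php
// A minimal BK-tree over the Levenshtein metric.
class BKTree {
    private $word = null;
    private $children = []; // distance => subtree

    public function add(string $word): void {
        if ($this->word === null) {
            $this->word = $word;
            return;
        }
        $d = levenshtein($word, $this->word);
        if (!isset($this->children[$d])) {
            $this->children[$d] = new BKTree();
        }
        $this->children[$d]->add($word);
    }

    // All stored words within distance $k of $query.
    public function search(string $query, int $k): array {
        if ($this->word === null) {
            return [];
        }
        $d = levenshtein($query, $this->word);
        $result = $d <= $k ? [$this->word] : [];
        // Triangle inequality: only subtrees whose edge label lies in [d - k, d + k]
        // can contain words within distance k of the query.
        for ($i = max(0, $d - $k); $i <= $d + $k; $i++) {
            if (isset($this->children[$i])) {
                $result = array_merge($result, $this->children[$i]->search($query, $k));
            }
        }
        return $result;
    }
}

$tree = new BKTree();
foreach (['book', 'books', 'cake', 'boo', 'cape', 'cart'] as $w) {
    $tree->add($w);
}
print_r($tree->search('bo0k', 1)); // ['book']
```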

Testing

Testing was carried out on a laptop with Intel Core Duo T2500 (2GHz/667MHz FSB/2MB), 2Gb RAM, OS - Ubuntu 10.10 Desktop i686, JRE - OpenJDK 6 Update 20.

Testing was carried out using the Damerau-Levenshtein distance with k = 2 errors. Index sizes are given including the dictionary (65 MB).

Sampling expansion algorithm
Index size: 65 MB
Search time: 320ms / 330ms
Completeness of results: 100%

N-gram (original)
Index size: 170 MB
Index creation time: 32 s
Search time: 71ms / 110ms
Completeness of results: 65%
N-gram (modification 1)
Index size: 170 MB
Index creation time: 32 s
Search time: 39ms / 46ms
Completeness of results: 63%
N-gram (modification 2)
Index size: 170 MB
Index creation time: 32 s
Search time: 37ms / 45ms
Completeness of results: 62%

Signature hashing method
Index size: 85 MB
Index creation time: 0.6 s
Search time: 55ms
Completeness of results: 56.5%

BK-trees
Index size: 150 MB
Index creation time: 120 s
Search time: 540ms
Completeness of results: 63%

Summary

Most indexed fuzzy search algorithms are not truly sublinear (i.e., they do not have an asymptotic running time of O(log n) or better), and their speed usually depends directly on N. Nevertheless, numerous improvements and refinements make it possible to achieve sufficiently short running times even with very large dictionaries.

There are also many other, less effective methods, based, among other things, on adapting various techniques already used elsewhere to this subject area. Among them is the adaptation of prefix trees (tries) to fuzzy search problems, which I set aside because of its low efficiency. There are also algorithms based on original approaches, for example the Maass-Nowak algorithm: although it has a sublinear asymptotic running time, it is extremely inefficient in practice because of the huge constants hidden behind that estimate, which manifest themselves as a huge index size.

In real search engines, the practical use of fuzzy search algorithms is closely tied to phonetic algorithms, lexical stemming algorithms (extracting the base form of the different word forms of a single word; this functionality is provided, for example, by Snowball and Yandex mystem), ranking based on statistical information, and complex, sophisticated metrics.

The following were implemented:

  • Levenshtein distance (with cutoff and prefix option);
  • Damerau-Levenshtein distance (with cutoff and prefix option);
  • Bitap algorithm (Shift-OR / Shift-AND with Wu-Manber modifications);
  • Sampling expansion algorithm;
  • N-gram method (original and with modifications);
  • Signature hashing method;
  • BK-trees.
I wanted to make the code easy to understand and at the same time efficient enough for practical use; squeezing the last drops of performance out of the JVM was not the goal. Enjoy.

It is worth noting that while studying this topic I came up with some developments of my own that make it possible to reduce the search time by an order of magnitude, at the cost of a moderate increase in index size and some restrictions on the choice of metric. But that is a completely different story.

I want to implement some n-gram applications (preferably in PHP).

Which type of n-grams is more suitable for most purposes: word-level or character-level n-grams? How can one implement an n-gram tokenizer in PHP?

First of all, I would like to know whether I understand N-grams correctly. This is how I understand them:

Sentence: "I live in New York."

word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"

character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

Once you have this array of n-gram parts, you toss the duplicates and add a counter for each frequency part:

word level bigrams:

character level bigrams:

Is it correct?

Also, I'd like to know more about what you can do with n-grams:

  • How can I determine the language of a text using n-grams?
  • Is it possible to do machine translation using n-grams even if you don't have a bilingual corpus?
  • How to create a spam filter (spam, ham)? Combine n-grams with Bayesian filter?
  • How can I find a topic? For example: is there a text about basketball or dogs? My approach (do the following with the Wikipedia article for "dogs" and "basketball"): plot n-gram vectors for both documents, normalize them, calculate the Manhattan/Euclidean distance, the closer the result is to 1, the higher the similarity will be

How do you feel about my application, especially the last one?

I hope you can help me. Thanks in advance!

2 answers

Word n-grams will generally be more useful for most of the text-analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would build an n-gram vector for a text corpus in each language you are interested in, and then compare the trigram frequencies in each corpus with the trigram frequencies in the document you are classifying. For example, the trigram "the" probably appears much more often in English than in German, which provides some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis: Bayesian filters, nearest-neighbour classifiers, support vector machines, and so on.
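As a sketch of that idea, character-trigram frequency profiles can be compared with, say, cosine similarity; the reference corpora below are tiny placeholders, and a real profile would be built from much larger text:

```php
<?php
// Character-trigram frequency profile of a text (relative frequencies).
function trigramProfile(string $text): array {
    $text = mb_strtolower(preg_replace('/\s+/u', ' ', $text));
    $profile = [];
    $len = mb_strlen($text);
    for ($i = 0; $i <= $len - 3; $i++) {
        $g = mb_substr($text, $i, 3);
        $profile[$g] = ($profile[$g] ?? 0) + 1;
    }
    $total = array_sum($profile);
    if ($total === 0) {
        return [];
    }
    foreach ($profile as $g => $c) {
        $profile[$g] = $c / $total;
    }
    return $profile;
}

// Cosine similarity between two trigram profiles.
function cosine(array $a, array $b): float {
    $dot = 0.0;
    foreach ($a as $g => $fa) {
        $dot += $fa * ($b[$g] ?? 0.0);
    }
    $na = sqrt(array_sum(array_map(fn($x) => $x * $x, $a)));
    $nb = sqrt(array_sum(array_map(fn($x) => $x * $x, $b)));
    return ($na > 0 && $nb > 0) ? $dot / ($na * $nb) : 0.0;
}

// Reference corpora per language (placeholders).
$corpora = [
    'en' => 'the quick brown fox jumps over the lazy dog',
    'de' => 'der schnelle braune fuchs springt über den faulen hund',
];
$doc = 'the fox and the hound';
$scores = [];
foreach ($corpora as $lang => $sample) {
    $scores[$lang] = cosine(trigramProfile($sample), trigramProfile($doc));
}
arsort($scores);
echo array_key_first($scores); // the language whose profile is closest (here: en)
```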

Of the applications you mention, machine translation is probably the most far-fetched, since n-grams alone will not get you very far down that road. Converting an input file to an n-gram representation is just a way of putting the data into a format for further feature analysis, but since a lot of contextual information is lost, it may not be very useful for translation.

One thing to note is that it is not enough to create a vector for one document and a vector for another document if their dimensions do not match. That is, the first entry in the vector cannot be "the" in one document and "is" in the other, or the algorithms will not work. Your vectors will also be sparse, since most documents will not contain most of the n-grams you are interested in. This "padding" also requires that you decide in advance which n-grams to include in the analysis. This is often implemented as a two-pass algorithm: a first pass decides the statistical significance of the various n-grams and determines which ones to keep. Google "feature selection" for more information.

Word-based n-grams plus support vector machines are a great way to identify topics, but to train the classifier you need a large corpus of text pre-classified as on-topic and off-topic. You will find a large number of research papers explaining different approaches to this problem on a site such as citeseerx. I would not recommend the Euclidean-distance approach for this problem, since it does not weight individual n-grams by statistical significance: two documents that both include "the", "a", "is" and "of" would be considered a better match than two documents that both include "Bayesian". Removing stop words from the n-grams of interest would improve this somewhat.

You are correct about the definition of n-grams.

You can use word-level n-grams for search-type applications. Character-level n-grams can be used more for analysing the text itself. For example, to identify the language of a text, I would use letter frequencies compared against the established frequencies of each language; that is, the text should roughly match the frequency of occurrence of letters in that language.

An n-gram tokenizer for words in PHP can be done using strtok:
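A minimal sketch of that approach:

```php
<?php
// Word-level bigrams using strtok (no boundary '#' markers; add them if needed).
function wordBigrams(string $text): array {
    $delims = " \n\t.,;:!?";
    $words = [];
    $token = strtok($text, $delims);
    while ($token !== false) {
        $words[] = $token;
        $token = strtok($delims);
    }
    $bigrams = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $bigrams[] = $words[$i] . ' ' . $words[$i + 1];
    }
    return $bigrams;
}

print_r(wordBigrams("I live in NY."));
// ["I live", "live in", "in NY"]
```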

For characters, use str_split:
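A corresponding sketch for character bigrams:

```php
<?php
// Character-level bigrams using str_split (byte-oriented; use mb_str_split for multibyte text).
function charBigrams(string $text): array {
    $chars = str_split($text);
    $bigrams = [];
    for ($i = 0; $i < count($chars) - 1; $i++) {
        $bigrams[] = $chars[$i] . $chars[$i + 1];
    }
    return $bigrams;
}

print_r(charBigrams("live"));
// ["li", "iv", "ve"]
```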

You can then simply split the array however you like into any number of n-grams.

Bayesian filters need to be trained before they can be used as spam filters, and they can be combined with n-grams. However, you need to give the filter a lot of input for it to learn.

Your last approach sounds decent, since it learns the context of the page... it is still quite difficult to do, but n-grams seem like a good starting point for it.


Contents:
  • Definition
  • Examples of applied problems
  • Creating an n-gram language model
  • Calculating the probability of n-grams
  • Eliminating the sparsity of the training corpus: Add-one Smoothing, Witten-Bell Discounting, Good-Turing Discounting, Katz's Backoff, Deleted Interpolation
  • Estimating the n-gram language model using entropy


Definition. An N-gram is a subsequence of N elements of some sequence. Consider sequences of words. Unigrams: cat, dog, horse, ... Bigrams: little cat, big dog, strong horse, ... Trigrams: little cat eats, big dog barks, strong horse runs, ...


Examples of applied problems. Speech recognition: some words with different spellings are pronounced identically, and the task is to choose the correct word for the context. Generation of texts on a given topic (example: Yandex.Abstracts). Search for semantic errors: "He is trying to fine out" is valid syntactically but not semantically, whereas "He is trying to find out" is correct; "trying to find out" occurs in English texts far more often than "trying to fine out", so given such statistics an error of this kind can be found and eliminated.


Creating an n-gram language model. To solve the applied problems listed above, an N-gram language model must be created. To create the model one needs to: 1. Calculate the probabilities of the n-grams in the training corpus. 2. Deal with the sparsity of the corpus using one of the smoothing methods. 3. Assess the quality of the resulting n-gram language model using entropy.


Calculating the probability of N-grams (1). Different n-grams occur in the training corpus with different frequencies. For each n-gram we can count how many times it occurs in the corpus. From these counts a probabilistic model can be built, which can then be used to estimate the probability of n-grams in some test corpus.


Calculating the probability of N-grams (2). Consider an example. Let the corpus consist of a single sentence: "They picnicked by the pool, then lay back on the grass and looked at the stars." Let us extract its n-grams. Unigrams: They, picnicked, by, ... Bigrams: They picnicked, picnicked by, by the, ... Trigrams: They picnicked by, picnicked by the, by the pool, ...


Calculating the probability of N-grams (3). Now the n-grams can be counted. All of the extracted bi- and trigrams occur exactly once in the corpus. All unigrams, with the exception of the word "the", also occur once; the word "the" occurs three times. Now that we know how many times each n-gram occurs, we can build a probabilistic model of n-grams. For unigrams, the probability of a word u is estimated as $P(u) = C(u)/N$, where $C(u)$ is the number of occurrences of the word u in the training corpus and N is the total number of words. For example, for the word "the" the probability is 3/16 (the corpus contains 16 words, 3 of which are "the").


Calculating the probability of N-grams (4). For n-grams with n > 1, the probability is calculated somewhat differently. Consider the case of bigrams: suppose we need to calculate the probability of the bigram "the pool". If we treat each word of the bigram as an event, the probability of the pair of events is $P(\text{the pool}) = P(\text{the}) \cdot P(\text{pool} \mid \text{the})$, where the conditional probability is estimated as $P(\text{pool} \mid \text{the}) = C(\text{the pool})/C(\text{the})$. Thus the probability of the bigram "the pool" is $3/16 \cdot 1/3 = 1/16$.


Calculating the probability of N-grams (5). Now consider the probability of an arbitrary n-gram (or of a sentence of length n). Extending the bigram case, we obtain the chain-rule formula $P(w_1 \dots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \dots w_{n-1})$. Computing the probability with this formula directly is impractical, so a simplification is introduced: a history of fixed length is used, i.e. $P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-N+1} \dots w_{i-1})$. Thus, calculating the probability of a sentence reduces to calculating the conditional probabilities of the N-grams that make up that sentence.
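To make the counting concrete, here is a small PHP sketch that estimates unigram and bigram probabilities from the example corpus by maximum likelihood (illustrative only):

```php
<?php
// Maximum-likelihood unigram and bigram estimates from the example corpus.
$corpus = "they picnicked by the pool then lay back on the grass and looked at the stars";
$words  = explode(' ', $corpus);

$unigrams = array_count_values($words);
$bigrams  = [];
for ($i = 0; $i < count($words) - 1; $i++) {
    $key = $words[$i] . ' ' . $words[$i + 1];
    $bigrams[$key] = ($bigrams[$key] ?? 0) + 1;
}

// P(w) = C(w) / N
echo $unigrams['the'] / count($words), "\n";        // 0.1875 (= 3/16)
// P(w2 | w1) = C(w1 w2) / C(w1)
echo $bigrams['the pool'] / $unigrams['the'], "\n"; // 0.3333... (= 1/3)
```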




Eliminating corpus sparsity (1). The problem with an unsmoothed n-gram language model is that for some n-grams the probability can be greatly underestimated (or even zero), even though in reality (in the test corpus) these n-grams may occur quite often. The reason is the limited size and the specificity of the training corpus. The solution: by lowering the probability of some n-grams slightly, raise the probability of the n-grams that were not encountered (or were encountered only rarely) in the training corpus.




Eliminating corpus sparsity (3). Sparsity-removal algorithms use the following concepts: types are the distinct words (or word sequences) in the text; tokens are all words (word sequences) in the text. "They picnicked by the pool, then lay back on the grass and looked at the stars" - 14 types, 16 tokens.





Add-one smoothing (4). The method introduces a large error into the calculations (for example, the previous slides showed that for the word "Chinese" the bigram count was reduced by a factor of 8). Tests have shown that an unsmoothed model often gives more accurate results, so the method is interesting mainly from a theoretical point of view.
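For reference, the standard add-one (Laplace) estimates for unigrams and bigrams are as follows, where N is the number of tokens and V the number of types (vocabulary size):

```latex
P_{\text{add-1}}(w_i) = \frac{C(w_i) + 1}{N + V}
\qquad
P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i) + 1}{C(w_{i-1}) + V}
```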


Witten-Bell Discounting (1). Based on a simple idea: use data about the n-grams that do occur in the training corpus to estimate the probability of the missing n-grams. The idea comes from compression algorithms, where two kinds of events are considered: a new symbol (type) is encountered, or an already-seen symbol (token) is encountered. The total probability of all missing n-grams (i.e. the probability of meeting, in the test corpus, an n-gram that did not occur in the training corpus) is $T/(N + T)$, where N is the number of tokens in the training corpus and T is the number of types already encountered in the training corpus.






Witten-Bell Discounting (4)




Good-Turing Discounting (1). The idea: for n-grams that occurred zero times (or s times), the estimate is made proportional to the number of n-grams that occurred once (or s + 1 times). An example: suppose 18 fish were caught, of 6 different species, and three of the species are represented by a single fish each. We want the probability that the next fish caught belongs to a new species; there are 7 possible species in total (6 have already been caught). By the Good-Turing estimate this probability is $N_1/N = 3/18$, where $N_1$ is the number of species seen exactly once.








Katz's Backoff (2). The coefficient α is needed to distribute the residual probability of the N-grams correctly, in accordance with the probability distribution of the (N-1)-grams. If α is not introduced, the estimate will be erroneous, because the required equality will not hold. The calculation of α is given at the end of the report.

Evaluating a language model using entropy (1). Entropy is a measure of uncertainty. Using entropy, one can choose the N-gram language model best suited to a given applied task. The binary entropy formula is $H(X) = -\sum_i p(x_i)\log_2 p(x_i)$. Example: compute the entropy of a coin-toss experiment. Answer: 1 bit, provided the outcomes are equally probable (each side comes up with probability 1/2).




Evaluating a language model using entropy (3). To compare different language models, cross-entropy is used: the closer the cross-entropy H(p, m) is to the true entropy H(p), the better the language model m. Here H(p) is the entropy of the test corpus and m(w) is the language model (for example, an N-gram model).


Evaluating a language model using entropy (4). There is another way to assess the quality of a language model, based on the so-called perplexity measure. The idea is to compute the probability of the entire test corpus: a better model assigns it a higher probability. The perplexity of a test set $W = w_1 \dots w_N$ is $PP(W) = P(w_1 \dots w_N)^{-1/N}$; thus, the lower the perplexity, the better the model. Perplexity can be interpreted as the average number of words that can follow a given word (i.e. the higher the perplexity, the higher the ambiguity and the worse the language model). The relationship between perplexity and binary entropy is $PP = 2^{H}$.


Evaluating a language model using entropy (5). As an example, consider the perplexity values obtained for a certain corpus with trained unigram, bigram and trigram models: the trigram model has the lowest perplexity, because disambiguation is helped by the longest history of all the models (length 2) used when computing the conditional probabilities of the trigrams.