This document discusses methods for evaluating language models, including intrinsic and extrinsic evaluation. Intrinsic evaluation measures a model's performance on a held-out test set using metrics such as perplexity, which reflects how well the model predicts the test set. Extrinsic evaluation embeds the model in an application and measures the application's performance. The document also covers techniques for dealing with unknown words, such as replacing low-frequency words with <UNK> and estimating its probability from training data.
2. Extrinsic Evaluation
• The best way to evaluate the performance of a language
model is to embed it in an application and measure how
much the application improves.
• Such end-to-end evaluation is called extrinsic evaluation.
• Extrinsic evaluation is the only way to know if a particular
improvement in a component is really going to help the task at
hand.
• Thus, for speech recognition, we can compare the
performance of two language models by running the
speech recognizer twice, once with each language model,
and seeing which gives the more accurate transcription.
3. Intrinsic Evaluation
• Unfortunately, running big NLP systems end-
to-end is often very expensive.
• Instead, it would be nice to have a metric that
can be used to quickly evaluate potential
improvements in a language model.
• An intrinsic evaluation metric is one that
measures the quality of a model independent
of any application.
4. Intrinsic Evaluation (Cont…)
• For an intrinsic evaluation of a language model we need a test set.
• The probabilities of an N-gram model come from the corpus it is
trained on, called the training set or training corpus.
• We can then measure the quality of an N-gram model by its
performance on some unseen test set data called the test set or
test corpus.
• We will also sometimes call test sets and other datasets that are
not in our training sets held out corpora because we hold them out
from the training data.
• So if we are given a corpus of text and want to compare two
different N-gram models, we divide the data into training and test
sets, train the parameters of both models on the training set, and
then compare how well the two trained models fit the test set.
5. Intrinsic Evaluation (Cont…)
• But what does it mean to “fit the test set”?
– Whichever model assigns a higher probability to
the test set—meaning it more accurately predicts
the test set—is a better model.
• Given two probabilistic models, the better
model is the one that has a tighter fit to the
test data or that better predicts the details of
the test data, and hence will assign a higher
probability to the test data.
6. Intrinsic Evaluation (Cont…)
• Since our evaluation metric is based on test set probability,
it’s important not to let the test sentences into the training
set.
• Suppose we are trying to compute the probability of a
particular “test” sentence.
• If our test sentence is part of the training corpus, we will
mistakenly assign it an artificially high probability when it
occurs in the test set.
• We call this situation training on the test set.
• Training on the test set introduces a bias that makes the
probabilities all look too high, and causes huge inaccuracies in
perplexity, the probability-based evaluation metric.
7. Development Test
• Sometimes we use a particular test set so often that we implicitly tune to its
characteristics.
• We then need a fresh test set that is truly unseen.
• In such cases, we call the initial test set the development test set or, devset.
• How do we divide our data into training, development, and test sets?
• We want our test set to be as large as possible, since a small test set may be
accidentally unrepresentative, but we also want as much training data as possible.
• At the minimum, we would want to pick the smallest test set that gives us enough
statistical power to measure a statistically significant difference between two
potential models.
• In practice, we often just divide our data into 80% training, 10% development, and
10% test.
• Given a large corpus that we want to divide into training and test, test data can
either be taken from some continuous sequence of text inside the corpus, or we
can remove smaller “stripes” of text from randomly selected parts of our corpus
and combine them into a test set.
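The 80/10/10 split described above can be sketched in Python (the function name, the shuffling strategy, and the toy corpus are illustrative, not from the slides):

```python
import random

def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=13):
    """Shuffle sentences, then split into train/dev/test (80/10/10 by default)."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    train = sentences[:n_train]
    dev = sentences[n_train:n_train + n_dev]
    test = sentences[n_train + n_dev:]
    return train, dev, test

corpus = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))  # 80 10 10
```

Shuffling before splitting corresponds to the "stripes from randomly selected parts" option; taking the slices without shuffling corresponds to a continuous-sequence split.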
8. Perplexity
• In practice we don’t use raw probability as our
metric for evaluating language models, but a
variant called perplexity.
• The perplexity (sometimes called PP for short) of a language
model on a test set is the inverse probability of the test set,
normalized by the number of words. For a test set W = w1, w2, ……,
wN:

PP(W) = P(w1 w2 … wN)^(−1/N)
9. Perplexity (Cont…)
• We can use the chain rule to expand the probability of W:

PP(W) = [ Π(i=1..N) 1 / P(wi | w1 … wi−1) ]^(1/N)

• Thus, if we are computing the perplexity of W with a bigram
language model, we get:

PP(W) = [ Π(i=1..N) 1 / P(wi | wi−1) ]^(1/N)
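The bigram perplexity computation can be sketched in Python as follows (the function name and tiny corpus are illustrative; maximum-likelihood estimates, no smoothing):

```python
import math
from collections import Counter

def bigram_perplexity(train_sents, test_sents):
    """Perplexity of a maximum-likelihood bigram model on a test set.

    Sentences are lists of tokens; <s> and </s> markers are added here.
    N counts each </s> but no <s>, as the slides prescribe.
    No smoothing, so an unseen test bigram raises an error.
    """
    bigrams, contexts = Counter(), Counter()
    for sent in train_sents:
        toks = ["<s>"] + sent + ["</s>"]
        contexts.update(toks[:-1])                # tokens acting as contexts
        bigrams.update(zip(toks[:-1], toks[1:]))
    log_prob, N = 0.0, 0
    for sent in test_sents:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(toks[:-1], toks[1:]):
            log_prob += math.log(bigrams[(prev, w)] / contexts[prev])
            N += 1                                # counts </s>, not <s>
    return math.exp(-log_prob / N)

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(bigram_perplexity(train, [["the", "cat", "sat"]]))  # ≈ 1.1892
```

Working in log space, as here, avoids underflow when the product runs over millions of test words.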
10. Perplexity (Cont…)
• Note that because of the inverse in previous equations, the
higher the conditional probability of the word sequence,
the lower the perplexity.
• Thus, minimizing perplexity is equivalent to maximizing the
test set probability according to the language model.
• What we generally use for word sequence in those
equations is the entire sequence of words in some test set.
• Since this sequence will cross many sentence boundaries,
we need to include the begin- and end-sentence markers
<s> and </s> in the probability computation.
• We also need to include the end-of-sentence marker </s>
(but not the beginning-of-sentence marker <s>) in the total
count of word tokens N.
11. Perplexity (Cont…)
• There is another way to think about perplexity: as the weighted
average branching factor of a language.
• The branching factor of a language is the number of possible next
words that can follow any word.
• Consider the task of recognizing the digits in English (zero, one,
two, …, nine), given that each of the 10 digits occurs with equal
probability P = 1/10.
• The perplexity of this mini-language is in fact 10.
• To see that, imagine a string of digits of length N. Since each
digit has probability 1/10, P(W) = (1/10)^N, and so
PP(W) = P(W)^(−1/N) = ((1/10)^N)^(−1/N) = 10.
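The digit-language calculation above can be checked numerically (the choice N = 7 is arbitrary; the result is 10 for any N):

```python
# Each of the 10 digit words is equally likely: P(w) = 1/10.
# For a digit string of length N, P(W) = (1/10)**N, so
# PP(W) = P(W)**(-1/N) = ((1/10)**N)**(-1/N) = 10, whatever N is.
N = 7
p_W = (1 / 10) ** N
pp = p_W ** (-1 / N)
print(round(pp, 6))  # 10.0
```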
12. Perplexity for Comparing Different N-gram Models
• We trained unigram, bigram, and trigram grammars on 38
million words (including start-of-sentence tokens) from the
Wall Street Journal, using a 19,979-word vocabulary.
• We then computed the perplexity of each of these models
on a test set of 1.5 million words with the perplexity
equation above.
• The table below shows the perplexity of a 1.5 million word
WSJ test set according to each of these grammars.

N-gram order: Unigram  Bigram  Trigram
Perplexity:       962     170      109
13. Perplexity for Comparing Different N-gram Models (Cont…)
• As we see above, the more information the N-
gram gives us about the word sequence, the
lower the perplexity.
• Note that in computing perplexities, the N-gram
model P must be constructed without any
knowledge of the test set or any prior knowledge
of the vocabulary of the test set.
• Any kind of knowledge of the test set can cause
the perplexity to be artificially low.
• The perplexity of two language models is only
comparable if they use identical vocabularies.
14. Generalization and Zeros
• Statistical models are likely to be pretty useless as
predictors if the training set and the test set are very
different.
• How should we deal with this problem when we
build N-gram models?
• One way is to be sure to use a training corpus
that has a similar genre to whatever task we are
trying to accomplish.
• To build a language model for translating legal
documents, we need a training corpus of legal
documents.
15. Generalization and Zeros (Cont…)
• Matching genres is still not sufficient.
• Our models may still be subject to the problem of sparsity.
• For any N-gram that occurred a sufficient number of times,
we might have a good estimate of its probability.
• But because any corpus is limited, some perfectly
acceptable English word sequences are bound to be
missing from it.
• That is, we’ll have many cases of putative “zero probability
N-grams” that should really have some non-zero
probability.
• Consider the words that follow the bigram denied the in
the WSJ Treebank3 corpus, together with their counts:
denied the allegations (5), denied the speculation (2),
denied the rumors (1), denied the report (1).
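This sketch shows why such zeros arise: a maximum-likelihood bigram estimate divides observed counts by the total, so any continuation unseen in training gets probability exactly 0. The counts here are illustrative values for words following "denied the":

```python
from collections import Counter

# Illustrative counts of words following "denied the" in training data.
train_counts = Counter({
    "allegations": 5, "speculation": 2, "rumors": 1, "report": 1,
})
total = sum(train_counts.values())

def mle_prob(word):
    """Maximum-likelihood estimate P(word | 'denied the').
    Counter returns 0 for unseen words, so unseen continuations
    get probability exactly 0 under MLE."""
    return train_counts[word] / total

print(mle_prob("allegations"))  # 5/9
print(mle_prob("offer"))        # 0.0: a "zero" bigram
```

Any test sentence containing "denied the offer" therefore gets probability 0 under this model, which is exactly the problem the next slides address.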
16. Generalization and Zeros (Cont…)
• To build a language model for a question-
answering system, we need a training corpus
of questions.
17. The Zeros
• Zeros— things that don’t ever occur in the training set
but do occur in the test set—are a problem for two
reasons.
– First, their presence means we are underestimating the
probability of all sorts of words that might occur, which will
hurt the performance of any application we want to run on
this data.
– Second, if the probability of any word in the test set is 0,
the entire probability of the test set is 0.
• By definition, perplexity is based on the inverse probability of
the test set.
• Thus if some words have zero probability, we can’t compute
perplexity at all, since we can’t divide by 0!
18. Unknown Words
• The previous section discussed the problem of words
whose bigram probability is zero.
• But what about words we simply have never seen before?
• Closed Vocabulary
– Sometimes we have a language task in which this can’t happen
because we know all the words that can occur.
– In such a closed vocabulary system the test set can only contain
words from this lexicon, and there will be no unknown words.
– This is a reasonable assumption in some domains, such as
speech recognition or machine translation, where we have a
pronunciation dictionary or a phrase table that is fixed in
advance, and so the language model can only use the words in
that dictionary or phrase table.
19. Unknown Words (Cont…)
• Open Vocabulary
– In other cases we have to deal with words we
haven’t seen before, which we’ll call unknown
words, or out of vocabulary (OOV) words.
– The percentage of OOV words that appear in the
test set is called the OOV rate.
– An open vocabulary system is one in which we
model these potential unknown words in the test
set by adding a pseudo-word called <UNK>.
20. Train the Probabilities of Unknown
Words
• There are two common ways to train the
probabilities of the unknown word model <UNK>.
• 1st Method:
– Turn the problem back into a closed vocabulary one
by choosing a fixed vocabulary in advance:
– In a text normalization step, convert any training-set word
that is not in this vocabulary (any OOV word) to the
unknown-word token <UNK>.
– Estimate the probabilities for <UNK> from its counts
just like any other regular word in the training set.
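The first method amounts to a simple mapping step. A minimal sketch, assuming an illustrative fixed vocabulary (the words chosen here are not from the source):

```python
# Closed-vocabulary normalization: any token outside a fixed,
# pre-chosen vocabulary becomes <UNK>. The vocabulary below is
# an illustrative assumption.
VOCAB = {"the", "cat", "sat", "on", "mat"}

def normalize(tokens, vocab=VOCAB):
    """Replace OOV tokens with <UNK> so its probability can later be
    estimated from counts just like any other word."""
    return [tok if tok in vocab else "<UNK>" for tok in tokens]

print(normalize(["the", "aardvark", "sat", "on", "the", "mat"]))
# ['the', '<UNK>', 'sat', 'on', 'the', 'mat']
```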
21. Train the Probabilities of Unknown
Words (Cont…)
• 2nd Method
– The second alternative, in situations where we don’t
have a prior vocabulary in advance, is to create such a
vocabulary implicitly, replacing words in the training
data by <UNK> based on their frequency.
– For example, we can replace with <UNK> all words that
occur fewer than n times in the training set, where n is
some small number; equivalently, we can select a
vocabulary size V in advance (say 50,000), keep the top V
words by frequency, and replace the rest with <UNK>.
– In either case we then proceed to train the language
model as before, treating <UNK> like a regular word.
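Both variants of the second method (a frequency threshold n, or a top-V cutoff) can be sketched as one helper; the parameter names and defaults here are assumptions for illustration:

```python
from collections import Counter

def build_vocab(tokens, min_count=2, max_size=None):
    """Implicit vocabulary from training data: keep words occurring
    at least min_count times, or (alternatively) the top max_size
    words by frequency. Everything else gets mapped to <UNK>."""
    counts = Counter(tokens)
    if max_size is not None:
        keep = [w for w, _ in counts.most_common(max_size)]
    else:
        keep = [w for w, c in counts.items() if c >= min_count]
    return set(keep)

tokens = ["a", "a", "a", "b", "b", "c"]
vocab = build_vocab(tokens, min_count=2)
print(sorted(vocab))                                   # ['a', 'b']
print([t if t in vocab else "<UNK>" for t in tokens])  # 'c' -> <UNK>
```

After this step the language model is trained as before, with <UNK> treated as a regular word.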
22. Train the Probabilities of Unknown
Words (Cont…)
• The exact choice of <UNK> model does
have an effect on metrics like perplexity.
• A language model can achieve low perplexity
by choosing a small vocabulary and assigning
the unknown word a high probability.
• For this reason, perplexities should only be
compared across language models with the
same vocabularies.