1. Chapter 2: N-gram Language Models
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
2. Outline
Introduction
The role of language models
Simple N-gram models
Estimating parameters and smoothing
Evaluating language models
3. Introduction
Language models assign a probability that a sentence is a legal
string in a language.
Language models are a useful component of many NLP systems, such as:
Automatic Speech Recognition (ASR),
Optical Character Recognition (OCR), and
Machine Translation (MT).
4. Introduction …
Language model definition:
It is impossible to recover the word string W successfully in all cases – ambiguity.
Instead, minimize the probability of error.
Choose an estimate Ŵ of W out of a number of options:
Ŵ – the candidate for which the probability given the signal Y is greatest.
Ŵ = argmax_i p(W_i | Y)
A language model is a computational mechanism for obtaining these
conditional probabilities.
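A minimal sketch of this decision rule in Python; the candidate word strings and their posterior probabilities are made up for illustration and do not come from a real recognizer:

```python
# Decision rule: W-hat = argmax_i p(W_i | Y).
# Candidates and posteriors p(W | Y) are hypothetical.
candidates = {
    "I ate a cherry": 0.62,
    "Eye eight uh Jerry": 0.03,
    "I eight a cherry": 0.10,
}

# Choose the word string with the highest probability given the signal Y.
w_hat = max(candidates, key=candidates.get)
print(w_hat)  # -> "I ate a cherry"
```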
5. Introduction …
Language models answer the question:
How likely is it that a string of English words is good English?
Help with reordering:
P_LM(the house is small) > P_LM(small the is house)
Help with word choice:
P_LM(I am going home) > P_LM(I am going house)
6. Introduction …
What is a statistical language model?
A stochastic process model for word sequences. A mechanism for
computing the probability of:
p(w1, ..., wT)
Statistical language modeling
Goal: create a statistical model so that one can calculate the
probability of a sequence of tokens s = w1, w2,…, wn in a language.
General approach:
[Diagram: a training corpus yields probabilities of the observed elements; given an input string s, the model outputs P(s).]
7. Role of Language Models
Why are language models interesting?
Important component of a speech recognition system.
Helps discriminate between similar sounding words.
Helps reduce search costs.
In statistical machine translation, a language model characterizes
the target language, captures fluency.
For selecting alternatives in summarization, generation.
Text classification (style, reading level, language, topic, …)
Language models can be used for more than just words:
Letter sequences (language identification)
Speech act sequence modeling
Case and punctuation restoration
8. Role of Language Models…
Uses of Language Models:
Speech recognition
“I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”
OCR & Handwriting recognition
More probable sentences are more likely correct readings.
Machine translation
More likely sentences are probably better translations.
Generation
More likely sentences are probably better NL generations.
Context sensitive spelling correction
“Their are problems wit this sentence.”
9. Role of Language Models…
Completion Prediction
A language model also supports predicting the completion of a
sentence.
Please turn off your cell _____
Your program does not ______
Predictive text input systems can guess what you are typing and
give choices on how to complete it.
10. Simple N-Gram Models
An n-gram model is a type of probabilistic language model for
predicting the next item in a sequence, in the form of an
(n−1)-order Markov model.
N-gram models are now widely used in probability, communication
theory, computational linguistics (for instance, statistical natural
language processing), computational biology (for instance, biological
sequence analysis), and data compression.
Two benefits of n-gram models (and algorithms that use them) are
simplicity and scalability – with larger n, a model can store more
context with a well-understood space–time tradeoff, enabling small
experiments to scale up efficiently.
Simple n-gram models are easy to train on unsupervised corpora and
can provide useful estimates of sentence likelihood.
11. Simple N-Gram Models…
Estimate probability of each word given prior context.
P(phone | Please turn off your cell)
Number of parameters required grows exponentially with the
number of words of prior context.
An n-gram model uses only N−1 words of prior context.
Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)
12. Simple N-Gram Models…
The Markov assumption is the presumption that the future
behavior of a dynamical system only depends on its recent
history.
In particular, in a kth-order Markov model, the next state depends
only on the k most recent states; therefore an n-gram model
is an (N−1)-order Markov model.
Use the previous N-1 words in a sequence to predict the next
word.
Language Model (LM)
unigrams, bigrams, trigrams, 4-grams, 5-grams, …
How do we train these models?
Very large corpora
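A tiny sketch of the Markov truncation described above: the model conditions only on the last N−1 words of the history (the function and example sentence are illustrative):

```python
def markov_context(history, n):
    """Return the N-1 most recent words -- all that an n-gram model conditions on."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

history = "please turn off your cell".split()
print(markov_context(history, 2))  # bigram:  ('cell',)
print(markov_context(history, 3))  # trigram: ('your', 'cell')
```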
14. Estimating Probabilities
N-gram conditional probabilities can be estimated from raw
text based on the relative frequency of word sequences.
To have a consistent probabilistic model, append a unique start
(<s>) and end (</s>) symbol to every sentence and treat these
as additional words.
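A minimal sketch of relative-frequency (MLE) estimation for bigrams, padding each sentence with <s> and </s> as described; the two-sentence corpus is invented for illustration:

```python
from collections import Counter

corpus = ["i want english food", "i want chinese food"]  # toy corpus

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_mle(w, prev):
    """P(w | prev) = C(prev w) / C(prev): the relative-frequency estimate."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("want", "i"))        # 2/2 = 1.0 in this toy corpus
print(p_mle("english", "want"))  # 1/2 = 0.5
```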
15. Example
Here are some text-normalized sample user queries (a sample of
9332 sentences is on the website):
Berkeley Restaurant Project sentences:
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
17. Example
Bigram estimates of sentence probabilities:
P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
= 0.25 x 0.33 x 0.0011 x 0.5 x 0.68 ≈ 0.000031
P(<s> i want chinese food </s>)
= P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
= 0.25 x 0.33 x 0.0065 x 0.52 x 0.68 ≈ 0.00019
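The same computation as a small sketch, plugging in the bigram estimates from the slide:

```python
# Bigram probabilities taken from the slide (Berkeley Restaurant Project estimates).
bigram_p = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33,
    ("want", "english"): 0.0011, ("english", "food"): 0.5,
    ("want", "chinese"): 0.0065, ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words):
    """Multiply bigram probabilities across the <s>...</s>-padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, w in zip(padded, padded[1:]):
        p *= bigram_p[(prev, w)]
    return p

print(sentence_prob("i want english food".split()))  # ~0.000031
print(sentence_prob("i want chinese food".split()))  # ~0.00019
```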
18. Example
What kinds of knowledge?
P(english | want) = 0.0011
P(chinese | want) = 0.0065 (more about the world)
P(to | want) = 0.66 (more about the grammar)
P(eat | to) = 0.28
P(food | to) = 0.0 (contingent zero)
P(want | spend) = 0.0 (more about the grammar)
P(i | <s>) = 0.25
19. Example
Practical issues: We do everything in log space,
To avoid arithmetic underflow,
And to make computation easy (adding is faster than multiplying).
log(p1 x p2 x p3 x p4) = log p1 + log p2 + log p3 + log p4
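A quick sketch of the same product computed in log space:

```python
import math

# The five bigram probabilities from the "i want english food" example.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

log_p = sum(math.log(p) for p in probs)  # sum of logs: no underflow risk
print(log_p)            # log probability (natural log)
print(math.exp(log_p))  # back to probability space: ~0.000031
```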
20. Simple N-Gram Models …
Train and Test Corpora
A language model must be trained on a large corpus of text to
estimate good parameter values.
Model can be evaluated based on its ability to predict a high
probability for a disjoint (held-out) test corpus (testing on the
training corpus would give an optimistically biased estimate).
Ideally, the training (and test) corpus should be representative of
the actual application data.
May need to adapt a general model to a small amount of new (in-domain)
data by adding the small, highly weighted in-domain corpus to the
original training data.
21. Simple N-Gram Models …
Train and Test Corpora…
Unknown Words:
How to handle words in the test corpus that did not occur in
the training data, i.e., out-of-vocabulary (OOV) words?
Train a model that includes an explicit symbol for an unknown
word (<UNK>).
Choose a vocabulary in advance and replace other words in the
training corpus with <UNK>.
Replace the first occurrence of each word in the training data
with <UNK>.
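A minimal sketch of the choose-a-vocabulary-in-advance strategy; the vocabulary here is a toy:

```python
def replace_oov(tokens, vocab):
    """Map any token outside the chosen vocabulary to the <UNK> symbol."""
    return [t if t in vocab else "<UNK>" for t in tokens]

vocab = {"i", "want", "english", "chinese", "food"}  # chosen in advance (toy)
print(replace_oov("i want ethiopian food".split(), vocab))
# -> ['i', 'want', '<UNK>', 'food']
```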
22. Estimating Parameters and
Smoothing
Estimating Parameters
Parameter estimation is fundamental to many statistical
approaches to NLP.
Because of the high-dimensional nature of natural language, it is
often easy to generate an extremely large number of features.
The challenge of parameter estimation is to find a combination of
the typically noisy, redundant features that accurately predicts the
target output variable and avoids overfitting.
List of potential parameter estimators:
Maximum Entropy (ME) estimation with L2 regularization, the
Averaged Perceptron (AP), Boosting, ME estimation with L1
regularization using a novel optimization algorithm, and BLasso,
a version of Boosting with Lasso (L1) regularization, etc.
23. Estimating Parameters and
Smoothing…
Estimating Parameters…
Intuitively, this can be achieved either
By selecting a small number of highly-effective features and
ignoring the others, or
By averaging over a large number of weakly informative features.
The first intuition motivates feature-selection methods such as
Boosting and BLasso, which usually work best when many features
are completely irrelevant.
L1 (Lasso) regularization of linear models embeds feature selection
into regularization, so that both the assessment of a feature's
reliability and the decision about whether to remove it are done in
the same framework; it has recently generated a large amount of
interest in the NLP community.
24. Estimating Parameters and
Smoothing…
Estimating Parameters…
If on the other hand most features are noisy but at least weakly
correlated with the target, it may be reasonable to attempt to
reduce noise by averaging over all of the features.
ME estimators with L2 regularization, which have been widely
used in NLP tasks, tend to produce models that have this property.
In addition, the perceptron algorithm and its variants, e.g., the
voted or averaged perceptron, are becoming increasingly popular
due to their competitive performance, simplicity of
implementation, and low computational cost in training.
25. Estimating Parameters and
Smoothing…
Smoothing (to keep the MLE from assigning zero probability)
Since there is a combinatorial number of possible word
sequences, many rare (but not impossible) combinations never
occur in training, so maximum likelihood estimation (MLE)
incorrectly assigns zero to many parameters (a.k.a. sparse data).
If a new combination occurs during testing, it is given a
probability of zero and the entire sequence gets a probability of
zero (i.e., infinite perplexity).
In practice, parameters are smoothed (a.k.a. regularized) to
reassign some probability mass to unseen events.
Adding probability mass to unseen events requires removing it from
seen ones (discounting) in order to maintain a joint distribution that
sums to 1.
26. Estimating Parameters and
Smoothing…
Smoothing…
“Hallucinate” additional training data in which each possible N-gram
occurs exactly once and adjust estimates accordingly (add-one, or
Laplace, smoothing). For bigrams:
P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
where V is the total number of possible (N−1)-grams (i.e., the
vocabulary size for a bigram model).
Tends to reassign too much mass to unseen events, so it can be
adjusted to add 0 < δ < 1 instead of 1 (normalized by δV instead of V).
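A sketch of both estimates, with toy counts and an assumed vocabulary size:

```python
from collections import Counter

def p_add_delta(w, prev, bigrams, unigrams, V, delta=1.0):
    """Add-delta estimate: (C(prev w) + delta) / (C(prev) + delta*V).
    delta = 1.0 gives add-one (Laplace) smoothing."""
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * V)

# Toy counts and vocabulary size, for illustration only.
unigrams = Counter({"want": 2})
bigrams = Counter({("want", "english"): 1})
V = 5

print(p_add_delta("english", "want", bigrams, unigrams, V))  # (1+1)/(2+5) ~ 0.29
print(p_add_delta("thai", "want", bigrams, unigrams, V))     # (0+1)/(2+5) ~ 0.14, unseen but nonzero
print(p_add_delta("thai", "want", bigrams, unigrams, V, delta=0.5))  # smaller discount
```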
27. Estimating Parameters and
Smoothing…
Smoothing…
Advanced Smoothing (discounting)
Many advanced techniques have been developed to improve
smoothing for language models.
Laplace smoothing (simple approach)
Add-k smoothing
Backoff and Interpolation
Kneser-Ney smoothing
Class-based (cluster) N-grams
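As one concrete instance from the list above, a sketch of simple linear interpolation, mixing unigram, bigram, and trigram estimates with weights that sum to 1; the weights here are assumed, not tuned on held-out data as they would be in practice:

```python
def p_interp(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation: P = l1*P(w) + l2*P(w|prev) + l3*P(w|prev2 prev1)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical component estimates for one word in context:
# the trigram was never seen, yet the interpolated estimate stays nonzero.
print(p_interp(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # 0.0031
```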
28. Evaluating Language Models
Ideally, evaluate the use of the model in an end application (extrinsic
evaluation):
Realistic approach
Expensive (time consuming)
Evaluate the ability of the model using a test corpus and metrics
(intrinsic evaluation: independent of any application).
Less realistic
Cheaper
Verify at least once that intrinsic evaluation correlates with an
extrinsic one.
29. Evaluating Language Models…
Perplexity
A measure of how well a model “fits” the test data.
Uses the probability that the model assigns to the test corpus,
Normalizes for the number of words in the test corpus, and takes
the inverse:
PP(W) = P(w1 w2 … wN)^(−1/N)
Measures the weighted average branching factor in predicting
the next word (lower PP(W) is better).
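A minimal sketch of the computation in log space, assuming a function that returns each word's conditional probability under the model:

```python
import math

def perplexity(words, cond_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space for stability.
    cond_prob(i, words) should return the model's P(words[i] | history)."""
    log_p = sum(math.log(cond_prob(i, words)) for i in range(len(words)))
    return math.exp(-log_p / len(words))

# Toy model: every word is one of 10 equally likely choices,
# so the branching factor -- and hence the perplexity -- is exactly 10.
uniform = lambda i, words: 0.1
print(perplexity("the house is small".split(), uniform))  # -> 10.0
```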
30. Evaluating Language Models…
Sample Perplexity Evaluation (for different n-gram models)
Models trained on 38 million words from the Wall Street Journal
(WSJ) using a 19,979 word vocabulary.
Evaluate on a disjoint test set of 1.5 million WSJ words.
Model:       Unigram   Bigram   Trigram
Perplexity:  962       170      109
31. Summary of Language Models
Limitations of (n-gram) LMs so far:
P(word | full history) is too expensive.
P(word | previous few words) is feasible.
The approach gives us only local context; it lacks global context.
Other approaches:
Neural networks
Recurrent Neural Network (RNN – most recent words)
Long Short-Term Memory (LSTM – limited to a few hundred words
due to the inherently sequential path from the previous unit to the
current unit)
Transformer (new model, introduced in a 2017 Google paper)
35. Speech Processing …
Speech Analysis/Synthesis:
Speech analysis makes it possible to identify words and to analyze
audio patterns to detect emotion and stress in a speaker's voice.
Speech Synthesis is the artificial production of human speech.
The modern task of speech synthesis, also called text-to-speech or
TTS, is to produce speech (acoustic waveforms) from text input.
Speech Recognition is the process by which a computer (or a
machine) converts the voice signal into the corresponding text
or command through identification and understanding.
Speech Coding is the process of obtaining a compact
representation of voice signals for efficient transmission over
band-limited wired and wireless channels.
36. Speech Recognition
Application areas:
Human-computer interaction:
While many tasks are better solved with visual or pointing interfaces,
speech has the potential to be a better interface than the keyboard
for tasks where full natural language communication is useful, or for
which keyboards are not appropriate. This includes hands-busy or
eyes-busy applications, such as where the user has objects to
manipulate or equipment to control.
Telephony:
Where speech recognition is already used for example in spoken
dialogue systems for entering digits, recognizing “yes” to accept
collect calls, finding out airplane or train information, and call-
routing (“Accounting, please”, “Prof. Regier, please”).
In some applications, a multimodal interface combining speech and
pointing can be more efficient than a graphical user interface without
speech.
37. Speech Recognition…
Application areas…
Dictation
Automatic Speech Recognition (ASR) can also be applied to
dictation, that is, transcription of extended monologue by a single
specific speaker.
Dictation is common in fields such as law and is also important as
part of augmentative communication (interaction between computers
and humans with a disability that results in an inability to type or
to speak).
38. Speech Recognition…
The general problem of automatic transcription of speech by
any speaker in any environment is still far from solved.
The SR process is essentially one of pattern recognition and
matching.
Speech features are extracted from the original speech signal,
which has been pre-processed and analysed, and finally an SR
template is constructed.
During recognition, the voice templates stored in the system are
compared with the characteristics of the input voice signal,
according to certain algorithms and strategies, to identify the
template that best matches the input voice, and finally the
recognition result is output.
39. Speech Recognition Process
Speech Recognition (SR) process generally involves the
following several key modules:
Signal pre-processing,
Speech feature extraction,
Matching against the training-library templates, and
Outputting the matching results.
40. Speech Recognition Process…
SR process generally involves the following several key
modules:
Signal pre-processing module: samples the voice signal, removes
the noise caused by the equipment and the environment, and
involves the selection of the speech recognition unit.
Speech feature-extraction module: extracts the acoustic parameters
that reflect the essential characteristics of the voice, such as
frequency and amplitude.
Matching module: computes the likelihood between the input
features and the stored templates according to certain criteria,
such as word-formation rules, grammar rules, and semantic rules,
and determines the semantic information of the input voice.
41. Parameters of Speech Recognition
Task
Vocabulary size
Digit Recognition
Large Vocabulary
How fluent, natural, or conversational the speech is:
Isolated Word
Continuous Speech
Read Speech
Conversational Speech
Channel and Noise
Accent or Speaker-class Characteristics
42. Parameters of Speech Recognition
Task…
One dimension of variation in speech recognition tasks is
Vocabulary size:
Speech recognition is easier if the number of distinct words we
need to recognize is smaller. So tasks with a two-word
vocabulary, like yes-versus-no detection, or an eleven-word
vocabulary, like recognizing sequences of digits (what is called
the digits task), are relatively easy.
On the other end, tasks with large vocabularies, like transcribing
human-human telephone conversations, or transcribing broadcast
news, tasks with vocabularies of 64,000 words or more, are much
harder.
43. Parameters of Speech Recognition
Task…
A second dimension of variation is how fluent, natural, or
conversational the speech is.
Isolated word recognition, in which each word is surrounded by
some sort of pause, is much easier than recognizing continuous
speech, in which words run into each other and have to be
segmented.
Continuous speech tasks themselves vary greatly in difficulty.
44. Parameters of Speech Recognition
Task…
Continuous speech in which words run into each other and have to
be segmented. Continuous speech tasks themselves vary greatly in
difficulty.
For example, human-to-machine speech turns out to be far easier to
recognize than human-to-human speech. That is, recognizing speech
of humans talking to machines, either reading out loud in read
speech (which simulates the dictation task), or conversing with
speech dialogue systems, is relatively easy.
Recognizing the speech of two humans talking to each other, in
conversational speech recognition, for example for transcribing a
business meeting or a telephone conversation, is much harder.
It seems that when humans talk to machines, they simplify their
speech quite a bit, talking more slowly and more clearly.
45. Parameters of Speech Recognition
Task…
A third dimension of variation is channel and noise.
The dictation task (and much laboratory research in speech
recognition) is done with high-quality, head-mounted
microphones.
Head-mounted microphones eliminate the distortion that occurs
with a table microphone as the speaker's head moves around.
Noise of any kind also makes recognition harder.
Thus recognizing a speaker dictating in a quiet office is much
easier than recognizing a speaker in a noisy car on the highway
with the window open.
46. Parameters of Speech Recognition
Task…
A final dimension of variation is accent or speaker-class
characteristics.
Speech is easier to recognize if the speaker is speaking a standard
dialect, or in general one that matches the data the system was
trained on.
Recognition is thus harder on foreign accented speech, or speech
of children (unless the system was specifically trained on exactly
these kinds of speech).
47. Parameters of Speech Recognition
Task…
The table shows the rough percentage of incorrect words (the
word error rate, or WER) from state-of-the-art systems on
different ASR tasks.
48. Parameters of Speech Recognition
Task…
A final dimension of variation is accent or speaker-class
characteristics…
Variation due to noise and accent increases the error rates quite
a bit.
The word error rate on strongly Japanese-accented or Spanish-
accented English has been reported to be about 3 to 4 times
higher than for native speakers on the same task.
Adding automobile noise with a 10 dB SNR (signal-to-noise
ratio) can cause error rates to go up by 2 to 4 times.
49. Large-Vocabulary Continuous
Speech Recognition (LVCSR)
Large vocabulary generally means that the systems have a
vocabulary of roughly 20,000 to 60,000 words.
Continuous means that the words are run together naturally.
Algorithms can be speaker-independent; that is, they are able to
recognize speech from people whose speech the system has
never been exposed to before. (Speaker-independent algorithms
are assumed in this discussion.)
50. Speech Recognition Architecture
The task of speech recognition is to take as input an acoustic
waveform and produce as output a string of words.
HMM-based speech recognition systems view this task using
the metaphor of the noisy channel.
The intuition of the noisy channel model is to treat the acoustic
waveform as a “noisy” version of the string of words, i.e., a
version that has been passed through a noisy communications
channel.
51. Speech Recognition Architecture…
This channel introduces “noise” which makes it hard to
recognize the “true” string of words. The goal is then to build a
model of the channel so that one can figure out how it modified
this “true” sentence and hence recover it.
The insight of the noisy channel model is that if we know how
the channel distorts the source, we could find the correct source
sentence for a waveform by taking every possible sentence in
the language, running each sentence through our noisy channel
model, and seeing if it matches the output.
It is then possible to select the best-matching source sentence as
the desired source sentence.
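A sketch of that search under the standard Bayes decomposition Ŵ = argmax_W P(Y | W) P(W), which the slide alludes to but does not spell out; all scores are made up for illustration, and real systems search efficiently rather than enumerating every sentence:

```python
# W-hat = argmax_W P(Y | W) * P(W): acoustic likelihood times LM prior.
# Hypothetical scores for two candidate transcriptions of one waveform.
hypotheses = {
    "I ate a cherry":     {"acoustic": 0.020, "lm": 1e-6},
    "Eye eight uh Jerry": {"acoustic": 0.025, "lm": 1e-11},
}

def channel_score(h):
    s = hypotheses[h]
    return s["acoustic"] * s["lm"]  # P(Y | W) * P(W)

w_hat = max(hypotheses, key=channel_score)
print(w_hat)  # the LM prior outweighs the slightly better acoustic match
```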
56. Individual Assignment - Two
Review the paper given below
Paper-2: A Comparative Study of Parameter Estimation
Methods for Statistical Natural Language Processing