Chapter 2: N-gram Language Models
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
Outline
 Introduction
 The role of language models
 Simple N-gram models
 Estimating parameters and smoothing
 Evaluating language models
3/29/2024 2
Introduction
 Language models assign a probability to a sentence, reflecting how likely it is to be a legal
string in the language.
 Language models are a useful component of many NLP systems,
such as:
Automatic Speech Recognition (ASR),
Optical Character Recognition (OCR), and
Machine Translation (MT).
3/29/2024 3
Introduction …
 Language Models Definition:
Impossible to recover W successfully in all cases – ambiguity.
Instead, minimize probability of error.
Choosing estimate of W out of a number of options.
Ŵ – for which the probability given signal Y is greatest.
Ŵ = argmax_i p(Wi | Y)
 Language model – computational mechanism for obtaining these
conditional probabilities.
3/29/2024 4
Introduction …
 Language models answer the question:
 How likely is it that a string of English words is good English?
 Help with reordering:
 PLM(the house is small) > PLM(small the is house)
 Help with word choice:
 PLM(I am going home) > PLM(I am going house)
3/29/2024 5
Introduction …
 What is a statistical language model?
 A stochastic process model for word sequences. A mechanism for
computing the probability of:
p(w1, . . . ,wT )
 Statistical language modeling
 Goal: create a statistical model so that one can calculate the
probability of a sequence of tokens s = w1, w2,…, wn in a language.
 General approach:
3/29/2024 6
[Diagram: a training corpus yields probabilities of the observed elements; given an input string s, the model outputs P(s).]
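For concreteness, the sequence probability above factors by the chain rule into per-word conditionals, which are what the model must estimate (standard decomposition, stated here for reference):

\[
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\]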
Role of Language Models
 Why are language models interesting?
 Important component of a speech recognition system.
 Helps discriminate between similar sounding words.
 Helps reduce search costs.
 In statistical machine translation, a language model characterizes
the target language, captures fluency.
 For selecting alternatives in summarization, generation.
 Text classification (style, reading level, language, topic, . . . )
 Language models can be used for more than just words:
 Letter sequences (language identification)
 Speech act sequence modeling
 Case and punctuation restoration
3/29/2024 7
Role of Language Models…
 Uses of Language Models:
 Speech recognition
 “I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”
 OCR & Handwriting recognition
 More probable sentences are more likely correct readings.
 Machine translation
 More likely sentences are probably better translations.
 Generation
 More likely sentences are probably better NL generations.
 Context sensitive spelling correction
 “Their are problems wit this sentence.”
3/29/2024 8
Role of Language Models…
 Completion Prediction
 A language model also supports predicting the completion of a
sentence.
 Please turn off your cell _____
 Your program does not ______
 Predictive text input systems can guess what you are typing and
give choices on how to complete it.
3/29/2024 9
Simple N-Gram Models
 An n-gram model is a type of probabilistic language model for
predicting the next item in a sequence, in the form of an (n−1)th-
order Markov model.
 N-gram models are now widely used in probability, communication
theory, computational linguistics (for instance, statistical natural
language processing), computational biology (for instance, biological
sequence analysis), and data compression.
 Two benefits of n-gram models (and algorithms that use them) are
simplicity and scalability – with larger n, a model can store more
context with a well-understood space–time tradeoff, enabling small
experiments to scale up efficiently.
 Simple n-gram models are easy to train from unannotated corpora and
can provide useful estimates of sentence likelihood.
3/29/2024 10
Simple N-Gram Models…
 Estimate probability of each word given prior context.
 P(phone | Please turn off your cell)
 Number of parameters required grows exponentially with the
number of words of prior context.
 An n-gram model uses only N−1 words of prior context.
 Unigram: P(phone)
 Bigram: P(phone | cell)
 Trigram: P(phone | your cell)
3/29/2024 11
Simple N-Gram Models…
 The Markov assumption is the presumption that the future
behavior of a dynamical system only depends on its recent
history.
 In particular, in a kth-order Markov model, the next state depends
only on the k most recent states; therefore an n-gram model is an
(N−1)th-order Markov model.
 Use the previous N-1 words in a sequence to predict the next
word.
 Language Model (LM)
 unigrams, bigrams, trigrams, 4-grams, 5-grams, …
 How do we train these models?
 Very large corpora
3/29/2024 12
Simple N-Gram Models…
 N-Gram Model Formulas
3/29/2024 13
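A minimal restatement of the standard N-gram formulas (the chain rule, the N-gram Markov approximation, and its bigram special case), assuming these are what the slide's figure showed:

\[
P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \;\approx\; \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}),
\qquad
\text{bigram: } P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})
\]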
Estimating Probabilities
 N-gram conditional probabilities can be estimated from raw
text based on the relative frequency of word sequences.
 To have a consistent probabilistic model, append a unique start
(<s>) and end (</s>) symbol to every sentence and treat these
as additional words.
3/29/2024 14
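A minimal sketch of relative-frequency (MLE) bigram estimation as described above; the function name and interface are illustrative, not from the slides:

```python
from collections import defaultdict

def train_bigram_mle(sentences):
    """Relative-frequency (MLE) bigram estimates: P(w2 | w1) = C(w1 w2) / C(w1).
    `sentences` is a list of token lists; <s> and </s> are appended as on this slide."""
    context_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in bigram_counts.items()}

# Example: probs = train_bigram_mle([["i", "want", "english", "food"]])
```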
Example
 Here are some text-normalized sample user queries (a sample of
9332 sentences is on the website):
 Berkeley Restaurant Project Sentences:
 can you tell me about any good cantonese restaurants close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day
3/29/2024 15
Example
3/29/2024 16
Example
 Bigram estimates of sentence probabilities:
 P(<s> i want english food </s>)
 = P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
 = 0.25 x .33 x .0011 x .5 x .68 = .000031
 P(<s> i want chinese food </s>)
 = P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
 = 0.25 x .33 x .0065 x .52 x .68 = .00019
3/29/2024 17
Example
 What kinds of knowledge?
 P(english | want) = 0.0011
 P(chinese | want) = 0.0065 (more about the world)
 P(to | want) = 0.66 (more about the grammar)
 P(eat | to) = 0.28
 P(food | to) = 0.0 (contingent zero)
 P(want | spend) = 0.0 (more about the grammar)
 P(i | <s>) = 0.25
3/29/2024 18
Example
 Practical Issues: We do everything in log space,
 To avoid underflow (arithmetic underflow),
 To make computation easier (adding is faster than multiplying).
 log(P1 x P2 x P3 x P4) = logP1 + logP2 + logP3 + logP4
3/29/2024 19
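A tiny sketch of the log-space computation, reusing the bigram factors from the sentence-probability example on slide 17:

```python
import math

factors = [0.25, 0.33, 0.0011, 0.5, 0.68]     # P(i|<s>) ... P(</s>|food) from slide 17
log_prob = sum(math.log(p) for p in factors)  # add logs instead of multiplying probabilities
print(log_prob)                               # about -10.39
print(math.exp(log_prob))                     # about 3.1e-05, matching the slide
```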
Simple N-Gram Models …
 Train and Test Corpora
 A language model must be trained on a large corpus of text to
estimate good parameter values.
 Model can be evaluated based on its ability to predict a high
probability for a disjoint (held-out) test corpus (testing on the
training corpus would give an optimistically biased estimate).
 Ideally, the training (and test) corpus should be representative of
the actual application data.
 May need to adapt a general model to a small amount of new (in-
domain) data by adding highly weighted small corpus to original
training data.
3/29/2024 20
Simple N-Gram Models …
 Train and Test Corpora…
 Unknown Words:
 How to handle words in the test corpus that did not occur in
the training data, i.e. out of vocabulary (OOV) words?
 Train a model that includes an explicit symbol for an unknown
word (<UNK>).
 Choose a vocabulary in advance and replace other words in the
training corpus with <UNK>.
 Replace the first occurrence of each word in the training data
with <UNK>.
3/29/2024 21
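One common way to realize the <UNK> strategy is a frequency threshold; the cutoff value below is an assumption (the slides also mention the fixed-vocabulary and first-occurrence variants):

```python
from collections import Counter

def replace_rare_with_unk(sentences, min_count=2):
    """Map words seen fewer than `min_count` times in training to <UNK>."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_count else "<UNK>" for w in sent]
            for sent in sentences]
```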
Estimating Parameters and
Smoothing
 Estimating Parameters
 Parameter estimation is fundamental to many statistical
approaches to NLP.
 Because of the high-dimensional nature of natural language, it is
often easy to generate an extremely large number of features.
 The challenge of parameter estimation is to find a combination of
the typically noisy, redundant features that accurately predicts the
target output variable and avoids overfitting.
 List of potential parameter estimators:
 Maximum Entropy (ME) estimation with L2 regularization, the
Averaged Perceptron (AP), Boosting, ME estimation with L1
regularization using a novel optimization algorithm, and BLasso,
which is a version of Boosting with Lasso (L1) regularization, etc
3/29/2024 22
Estimating Parameters and
Smoothing…
 Estimating Parameters…
 Intuitively, this can be achieved either
 By selecting a small number of highly-effective features and
ignoring the others, or
 By averaging over a large number of weakly informative features.
 The first intuition motivates feature selection methods such as
Boosting and Blasso which usually work best when many features
are completely irrelevant.
 L1 or Lasso regularization of linear models embeds feature
selection into regularization so that both an assessment of the
reliability of a feature and the decision about whether to remove it
are done in the same framework, and has generated a large
amount of interest in the NLP community recently.
3/29/2024 23
Estimating Parameters and
Smoothing…
 Estimating Parameters…
 If on the other hand most features are noisy but at least weakly
correlated with the target, it may be reasonable to attempt to
reduce noise by averaging over all of the features.
 ME estimators with L2 regularization, which have been widely
used in NLP tasks tend to produce models that have this property.
 In addition, the perceptron algorithm and its variants, e.g., the
voted or averaged perceptron, are becoming increasingly popular
due to their competitive performance, simplicity of
implementation, and low computational cost in training.
3/29/2024 24
Estimating Parameters and
Smoothing…
 Smoothing (to keep the MLE from assigning zero probability)
 Since there are a combinatorial number of possible word
sequences, many rare (but not impossible) combinations never
occur in training, so maximum likelihood estimation (MLE)
incorrectly assigns zero to many parameters (a.k.a. sparse data).
 If a new combination occurs during testing, it is given a
probability of zero and the entire sequence gets a probability of
zero (i.e. infinite perplexity).
 In practice, parameters are smoothed (a.k.a. regularized) to
reassign some probability mass to unseen events.
 Adding probability mass to unseen events requires removing it from
seen ones (discounting) in order to maintain a joint distribution that
sums to 1.
3/29/2024 25
Estimating Parameters and
Smoothing…
 Smoothing…
 “Hallucinate” additional training data in which each possible N-
gram occurs exactly once and adjust estimates accordingly.
3/29/2024 26
 Smoothing…
 where V is the total number of possible (N−1)-grams (i.e. the
vocabulary size for a bigram model).
 Tends to reassign too much mass to unseen events, so it can be
adjusted to add 0 < δ < 1 instead of 1 (normalized by δV instead of V); see the formulas below.
Estimating Parameters and
Smoothing…
 Smoothing…
 Advanced Smoothing (discounting)
 Many advanced techniques have been developed to improve
smoothing for language models.
 Laplace smoothing (simple approach)
 Add-k smoothing
 Backoff and Interpolation (see the interpolation sketch after this list)
 Kneser-Ney smoothing
 Class-based (cluster) N-grams
3/29/2024 27
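As one example from the backoff/interpolation family listed above, simple linear interpolation mixes higher- and lower-order estimates; this is the standard form, with the λ weights tuned on held-out data:

\[
\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_1 P(w_n),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
\]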
Evaluating Language Model
 Ideally, evaluate use of model in end application (extrinsic
evaluation)
 Realistic approach
 Expensive (time consuming)
 Evaluate the ability of the model using a test corpus and metrics
(intrinsic evaluation: independent of any application).
 Less realistic
 Cheaper
 Verify at least once that intrinsic evaluation correlates with an
extrinsic one.
3/29/2024 28
Evaluating Language Model …
 Perplexity
 Measure of how well a model “fits” the test data.
 Uses the probability that the model assigns to the test corpus.
 Normalizes for the number of words in the test corpus and takes
the inverse.
3/29/2024 29
 Measures the weighted average branching factor in predicting
the next word (lower pp(w) is better).
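In symbols, the standard definition consistent with the description above: for a test set W = w1 w2 … wN,

\[
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
\]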
Evaluating Language Model …
 Sample Perplexity Evaluation (for different n-gram models)
 Models trained on 38 million words from the Wall Street Journal
(WSJ) using a 19,979 word vocabulary.
 Evaluate on a disjoint set of 1.5 million WSJ words. (test set)
3/29/2024 30
Unigram Bigram Trigram
Perplexity 962 170 109
Summary of Language Model
 Limitations of the n-gram LM approach so far:
 P(word | full history) is too expensive.
 P(word | previous few words) is feasible.
 The approach gives us only local context; it lacks global
context.
 Other approaches:
 Neural Networks
 Recurrent Neural Network (RNN – Most recent words)
 Long Short Term Memory (LSTM – limited to a few hundred
words due to their inherently complex sequential paths from the
previous unit to the current unit)
 Transformer (introduced in the 2017 Google paper “Attention Is All You Need”)
3/29/2024 31
NLP Applications: Speech
Recognition
3/29/2024 32
Outline
 Speech Processing
 Speech Recognition
 Speech Recognition Process
 Parameters of Speech Recognition Task
 Large-Vocabulary Continuous Speech Recognition
 Speech Recognition Architecture
 Feature Extraction
3/29/2024 33
Speech Processing
3/29/2024 34
Speech Processing …
 Speech Analysis/Synthesis:
Speech analysis makes it possible to identify words and analyze audio
patterns to detect emotions and stress in a speaker's voice.
Speech Synthesis is the artificial production of human speech.
The modern task of speech synthesis, also called text-to-speech or
TTS, is to produce speech (acoustic waveforms) from text input.
 Speech Recognition is the process by which a computer (or a
machine) converts the voice signal into the corresponding text
or command through identification and understanding.
 Speech Coding is the process of obtaining a compact
representation of voice signals for efficient transmission over
band-limited wired and wireless channels.
3/29/2024 35
Speech Recognition
 Application areas:
Human-computer interaction:
While many tasks are better solved with visual or pointing interfaces,
speech has the potential to be a better interface than the keyboard
for tasks where full natural language communication is useful, or for
which keyboards are not appropriate. This includes hands-busy or
eyes-busy applications, such as where the user has objects to
manipulate or equipment to control.
Telephony:
Where speech recognition is already used for example in spoken
dialogue systems for entering digits, recognizing “yes” to accept
collect calls, finding out airplane or train information, and call-
routing (“Accounting, please”, “Prof. Regier, please”).
In some applications, a multimodal interface combining speech and
pointing can be more efficient than a graphical user interface without
speech
3/29/2024 36
Speech Recognition…
 Application areas…
Dictation
Automatic Speech Recognition (ASR) can also be applied to
dictation, that is, transcription of extended monologue by a single
specific speaker.
Dictation is common in fields such as law and is also important as
part of augmentative communication (interaction between computers
and humans with some disability resulting in the inability to type, or
the inability to speak)
3/29/2024 37
Speech Recognition…
 The general problem of automatic transcription of speech by
any speaker in any environment is still far from solved.
 The SR process is essentially one of pattern recognition and
matching.
 Speech features are extracted from the original speech signal,
which has been pre-processed and analysed, and finally an
SR template is constructed.
 During recognition, the voice templates stored in the system
are compared to the characteristics of the input voice signal,
according to certain algorithms and strategies, to identify the
optimal template matching the input voice, and finally to
output the recognition results.
3/29/2024 38
Speech Recognition Process
 Speech Recognition (SR) process generally involves the
following several key modules:
 Signal pre-processing,
 Speech feature extraction,
 Matching training-library template, and
 Outputting the matching results
3/29/2024 39
Speech Recognition Process…
 SR process generally involves the following several key
modules:
 Signal pre-processing module: samples the voice signal, removes
noise caused by the equipment and the environment, and selects
the speech recognition unit.
 Speech feature-extraction module: extracts the acoustic parameters
that reflect the essential characteristics of the voice, such as
frequency and amplitude.
 The matching module: computes the likelihood between the input
features and the training-library templates according to certain
criteria such as word-formation rules, grammar rules, and
semantic rules, and determines the semantic information of the
input voice.
3/29/2024 40
Parameters of Speech Recognition
Task
 Vocabulary size
 Digit Recognition
 Large Vocabulary
 How fluent, natural, or conversational the speech is:
 Isolated Word
 Continuous Speech
 Read Speech
 Conversational Speech
 Channel and Noise
 Accent or Speaker-class Characteristics
3/29/2024 41
Parameters of Speech Recognition
Task…
 One dimension of variation in speech recognition tasks is
Vocabulary size:
 Speech recognition is easier if the number of distinct words we
need to recognize is smaller. So tasks with a two-word
vocabulary, like yes versus no detection, or an eleven-word
vocabulary, like recognizing sequences of digits (what is called
the digits task), are relatively easy.
 On the other end, tasks with large vocabularies, like transcribing
human-human telephone conversations, or transcribing broadcast
news, tasks with vocabularies of 64,000 words or more, are much
harder.
3/29/2024 42
Parameters of Speech Recognition
Task…
 A second dimension of variation is how fluent, natural, or
conversational the speech is.
 Isolated word recognition, in which each word is surrounded by
some sort of pause, is much easier than recognizing continuous
speech
 Continuous speech in which words run into each other and have to
be segmented. Continuous speech tasks themselves vary greatly in
difficulty.
3/29/2024 43
Parameters of Speech Recognition
Task…
 Continuous speech in which words run into each other and have to
be segmented. Continuous speech tasks themselves vary greatly in
difficulty.
 For example, human-to-machine speech turns out to be far easier to
recognize than human-to-human speech. That is, recognizing speech
of humans talking to machines, either reading out loud in read
speech (which simulates the dictation task), or conversing with
speech dialogue systems, is relatively easy.
 Recognizing the speech of two humans talking to each other, in
conversational speech recognition, for example for transcribing a
business meeting or a telephone conversation, is much harder.
 It seems that when humans talk to machines, they simplify their
speech quite a bit, talking more slowly and more clearly.
3/29/2024 44
Parameters of Speech Recognition
Task…
 A third dimension of variation is channel and noise.
 The dictation task (and much laboratory research in speech
recognition) is done with high quality, head mounted
microphones.
 Head mounted microphones eliminate the distortion that occurs in
a table microphone as the speaker’s head moves around.
 Noise of any kind also makes recognition harder.
 Thus recognizing a speaker dictating in a quiet office is much
easier than recognizing a speaker in a noisy car on the highway
with the window open.
3/29/2024 45
Parameters of Speech Recognition
Task…
 A final dimension of variation is accent or speaker-class
characteristics.
 Speech is easier to recognize if the speaker is speaking a standard
dialect, or in general one that matches the data the system was
trained on.
 Recognition is thus harder on foreign accented speech, or speech
of children (unless the system was specifically trained on exactly
these kinds of speech).
3/29/2024 46
Parameters of Speech Recognition
Task…
 The table shows the rough percentage of incorrect words (the
word error rate, or WER) from state-of-the-art systems on
different ASR tasks.
3/29/2024 47
Parameters of Speech Recognition
Task…
 A final dimension of variation is accent or speaker-class
characteristics…
 Variation due to noise and accent increases the error rates quite
a bit.
 The word error rate on strongly Japanese-accented or Spanish
accented English has been reported to be about 3 to 4 times
higher than for native speakers on the same task.
 Adding automobile noise with a 10dB SNR (signal-to-noise
ratio) can cause error rates to go up by 2 to 4 times.
3/29/2024 48
Large-Vocabulary Continuous
Speech Recognition (LVCSR)
 Large vocabulary generally means that the systems have a
vocabulary of roughly 20,000 to 60,000 words.
 Continuous means that the words are run together naturally.
 Algorithms can be speaker independent; that is, they are able to
recognize speech from people whose speech the system has
never been exposed to before. (Speaker independent algorithms are considered
for this discussion)
3/29/2024 49
Speech Recognition Architecture
 The task of speech recognition is to take as input an acoustic
waveform and produce as output a string of words.
 HMM-based speech recognition systems view this task using
the metaphor of the noisy channel.
 The intuition of the noisy channel model is to treat the acoustic
waveform as a “noisy” version of the string of words, i.e., a
version that has been passed through a noisy communications
channel.
3/29/2024 50
Speech Recognition Architecture…
 This channel introduces “noise” which makes it hard to
recognize the “true” string of words. The goal is then to build a
model of the channel so that one can figure out how it modified
this “true” sentence and hence recover it.
 The insight of the noisy channel model is that if we know how
the channel distorts the source, we could find the correct source
sentence for a waveform by taking every possible sentence in
the language, running each sentence through our noisy channel
model, and seeing if it matches the output.
 It is then possible to select the best matching source sentence as
desired source sentence.
3/29/2024 51
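In equation form, the noisy-channel search described above picks the word string with the highest posterior probability given the acoustic observations O, which Bayes' rule factors into an acoustic model and a language model (the standard HMM-ASR formulation):

\[
\hat{W} = \operatorname*{argmax}_{W} P(W \mid O) = \operatorname*{argmax}_{W} P(O \mid W)\, P(W)
\]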
Speech Recognition Architecture…
3/29/2024 52
Feature Extraction
3/29/2024 53
 MFCC = mel frequency cepstral coefficients
 DFT = Discrete Fourier Transform
 IDFT = Inverse Discrete Fourier Transform (used to compute the cepstrum)
Question & Answer
3/29/2024 54
Thank You !!!
3/29/2024 55
Individual Assignment - Two
 Review the paper given below
 Paper-2: A Comparative Study of Parameter Estimation
Methods for Statistical Natural Language Processing
3/29/2024 56

More Related Content

Similar to 2-Chapter Two-N-gram Language Models.ppt

Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.pptbutest
 
Spelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platformsSpelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platformsAnjan Goswami
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEAbdurrahimDerric
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERTAbdurrahimDerric
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introductionThennarasuSakkan
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
 
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELSAUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELSijfcstjournal
 
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELSAUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELSijfcstjournal
 
A Neural Probabilistic Language Model.pptx
A Neural Probabilistic Language Model.pptxA Neural Probabilistic Language Model.pptx
A Neural Probabilistic Language Model.pptxRama Irsheidat
 
Audit report[rollno 49]
Audit report[rollno 49]Audit report[rollno 49]
Audit report[rollno 49]RAHULROHAM2
 
Lexicon base approch
Lexicon base approchLexicon base approch
Lexicon base approchanil maurya
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
Langauage model
Langauage modelLangauage model
Langauage modelc sharada
 
LARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERSLARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERSgerogepatton
 
LARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERSLARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERSgerogepatton
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsR Systems International
 

Similar to 2-Chapter Two-N-gram Language Models.ppt (20)

Language Modeling.docx
Language Modeling.docxLanguage Modeling.docx
Language Modeling.docx
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
 
Spelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platformsSpelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platforms
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERT
 
7 probability and statistics an introduction
7 probability and statistics an introduction7 probability and statistics an introduction
7 probability and statistics an introduction
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
 
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELSAUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
 
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELSAUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS
 
A Neural Probabilistic Language Model.pptx
A Neural Probabilistic Language Model.pptxA Neural Probabilistic Language Model.pptx
A Neural Probabilistic Language Model.pptx
 
Audit report[rollno 49]
Audit report[rollno 49]Audit report[rollno 49]
Audit report[rollno 49]
 
Lexicon base approch
Lexicon base approchLexicon base approch
Lexicon base approch
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
Langauage model
Langauage modelLangauage model
Langauage model
 
LARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERSLARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERS
 
LARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERSLARGE LANGUAGE MODELS FOR CIPHERS
LARGE LANGUAGE MODELS FOR CIPHERS
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analytics
 

More from milkesa13

5-Information Extraction (IE) and Machine Translation (MT).ppt
5-Information Extraction (IE) and Machine Translation (MT).ppt5-Information Extraction (IE) and Machine Translation (MT).ppt
5-Information Extraction (IE) and Machine Translation (MT).pptmilkesa13
 
4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt
4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt
4-Chapter Four-Syntactic Parsing and Semantic Analysis.pptmilkesa13
 
distributed system concerned lab sessions
distributed system concerned lab sessionsdistributed system concerned lab sessions
distributed system concerned lab sessionsmilkesa13
 
distributed system lab materials about ad
distributed system lab materials about addistributed system lab materials about ad
distributed system lab materials about admilkesa13
 
distributed system with lap practices at
distributed system with lap practices atdistributed system with lap practices at
distributed system with lap practices atmilkesa13
 
introduction to advanced distributed system
introduction to advanced distributed systemintroduction to advanced distributed system
introduction to advanced distributed systemmilkesa13
 
distributed system relation mapping (ORM)
distributed system relation mapping  (ORM)distributed system relation mapping  (ORM)
distributed system relation mapping (ORM)milkesa13
 
decision support system in management information
decision support system in management informationdecision support system in management information
decision support system in management informationmilkesa13
 
management system development and planning
management system development and planningmanagement system development and planning
management system development and planningmilkesa13
 
trends of information systems and artificial technology
trends of information systems and artificial technologytrends of information systems and artificial technology
trends of information systems and artificial technologymilkesa13
 

More from milkesa13 (10)

5-Information Extraction (IE) and Machine Translation (MT).ppt
5-Information Extraction (IE) and Machine Translation (MT).ppt5-Information Extraction (IE) and Machine Translation (MT).ppt
5-Information Extraction (IE) and Machine Translation (MT).ppt
 
4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt
4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt
4-Chapter Four-Syntactic Parsing and Semantic Analysis.ppt
 
distributed system concerned lab sessions
distributed system concerned lab sessionsdistributed system concerned lab sessions
distributed system concerned lab sessions
 
distributed system lab materials about ad
distributed system lab materials about addistributed system lab materials about ad
distributed system lab materials about ad
 
distributed system with lap practices at
distributed system with lap practices atdistributed system with lap practices at
distributed system with lap practices at
 
introduction to advanced distributed system
introduction to advanced distributed systemintroduction to advanced distributed system
introduction to advanced distributed system
 
distributed system relation mapping (ORM)
distributed system relation mapping  (ORM)distributed system relation mapping  (ORM)
distributed system relation mapping (ORM)
 
decision support system in management information
decision support system in management informationdecision support system in management information
decision support system in management information
 
management system development and planning
management system development and planningmanagement system development and planning
management system development and planning
 
trends of information systems and artificial technology
trends of information systems and artificial technologytrends of information systems and artificial technology
trends of information systems and artificial technology
 

Recently uploaded

Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 

Recently uploaded (20)

Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 

2-Chapter Two-N-gram Language Models.ppt

  • 1. Chapter 2 : N – gram Language Models Adama Science and Technology University School of Electrical Engineering and Computing Department of CSE Dr. Mesfin Abebe Haile (2021)
  • 2. Outline  Introduction  The role of language models  Simple N-gram models  Estimating parameters and smoothing  Evaluating language models 3/29/2024 2
  • 3. Introduction  Language models assign a probability that a sentence is a legal string in a language.  Language Models are useful component of many NLP systems, such as: Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), and Machine Translation (MT). 3/29/2024 3
  • 4. Introduction …  Language Models Definition: Impossible to recover W successfully in all cases – ambiguity. Instead, minimize probability of error. Choosing estimate of W out of a number of options. Ŵ – for which the probability given signal Y is greatest. Ŵ = max(∀ i : p(Ŵi | Y) )  Language model – computational mechanism for obtaining these conditional probabilities. 3/29/2024 4
  • 5. Introduction …  Language models answer the question:  How likely is a string of English words good English?  Help with reordering:  PLM(the house is small) > PLM(small the is house)  Help with word choice:  PLM(I am going home) > PLM(I am going house) 3/29/2024 5
  • 6. Introduction …  What is a statistical language model?  A stochastic process model for word sequences. A mechanism for computing the probability of: p(w1, . . . ,wT )  Statistical language modeling  Goal: create a statistical model so that one can calculate the probability of a sequence of tokens s = w1, w2,…, wn in a language.  General approach: 3/29/2024 6 Training corpus Probabilities of the observed elements s P(s)
  • 7. Role of Language Models  Why are language models interesting?  Important component of a speech recognition system.  Helps discriminate between similar sounding words.  Helps reduce search costs.  In statistical machine translation, a language model characterizes the target language, captures fluency.  For selecting alternatives in summarization, generation.  Text classification (style, reading level, language, topic, . . . )  Language models can be used for more than just words:  Letter sequences (language identification)  Speech act sequence modeling  Case and punctuation restoration 3/29/2024 7
  • 8. Role of Language Models…  Uses of Language Models:  Speech recognition  “I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”  OCR & Handwriting recognition  More probable sentences are more likely correct readings.  Machine translation  More likely sentences are probably better translations.  Generation  More likely sentences are probably better NL generations.  Context sensitive spelling correction  “Their are problems wit this sentence.” 3/29/2024 8
  • 9. Role of Language Models…  Completion Prediction  A language model also supports predicting the completion of a sentence.  Please turn off your cell _____  Your program does not ______  Predictive text input systems can guess what you are typing and give choices on how to complete it. 3/29/2024 9
  • 10. Simple N-Gram Models  An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n-1) - order Markov model.  N-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.  Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.  Simple n-gram models are easy to train on unsupervised corpora and can provide useful estimates of sentence likelihood. 3/29/2024 10
  • 11. Simple N-Gram Models…  Estimate probability of each word given prior context.  P(phone | Please turn off your cell)  Number of parameters required grows exponentially with the number of words of prior context.  An n-gram model uses only N1 words of prior context.  Unigram: P(phone)  Bigram: P(phone | cell)  Trigram: P(phone | your cell) 3/29/2024 11
  • 12. Simple N-Gram Models…  The Markov assumption is the presumption that the future behavior of a dynamical system only depends on its recent history.  In particular, in a kth-order Markov model, the next state only depends on the k most recent states, therefore an n-gram model is a (N1)-order Markov model.  Use the previous N-1 words in a sequence to predict the next word.  Language Model (LM)  unigrams, bigrams, trigrams, 4 grams, 5 grams…  How do we train these models?  Very large corpora 3/29/2024 12
  • 13. Simple N-Gram Models…  N-Gram Model Formulas 3/29/2024 13
  • 14. Estimating Probabilities  N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.  To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words. 3/29/2024 14
  • 15. Example  Here are some text normalized sample user queries (a sample of 9332 sentences is on the website):  Berkeley Restaurant Project Senetences:  can you tell me about any good cantonese restaurants close by  mid priced thai food is what i’m looking for  tell me about chez panisse  can you give me a listing of the kinds of food that are available  i’m looking for a good place to eat breakfast  when is caffe venezia open during the day 3/29/2024 15
  • 17. Example  Bigram estimates of sentence probabilities:  P(<s> i want english food </s>)  = P(i | <s>) P(want | i) P(english | want)  = P(food | english) P(</s> | food)  = 0.25 x .33 x .0011 x .5 x .68 = .000031  P(<s> i want chinese food </s>)  = P(i | <s>) P(want | i) P(chinese | want)  = P(food | chinese) P(</s> | food)  = 0.25 x .33 x .0065 x .52 x .68 = .00019 3/29/2024 17
  • 18. Example  What kinds of knowledge?  P(english | want) = 0.0011  P(chinese | want) = 0.0065 (More of the world)  P(to | want) = 0.66 (more about the grammar)  P(eat | to) = 0.28  P(food | to) = 0.0 (contingent zero)  P(want | spend) = 0.0 (more about the grammar)  P(i | <s>) = 0.25 3/29/2024 18
  • 19. Example  Practical Issues: We do every thing in log space,  To avoid underflow, (Arithmetic under flow)  To make easy computing (Adding is faster than Multiplication).  P1 x P2 x P3 x P4 = logP1 + logP2 + logP3 + logP4 3/29/2024 19
  • 20. Simple N-Gram Models …  Train and Test Corpora  A language model must be trained on a large corpus of text to estimate good parameter values.  Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).  Ideally, the training (and test) corpus should be representative of the actual application data.  May need to adapt a general model to a small amount of new (in- domain) data by adding highly weighted small corpus to original training data. 3/29/2024 20
  • 21. Simple N-Gram Models …  Train and Test Corpora…  Unknown Words:  How to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words?  Train a model that includes an explicit symbol for an unknown word (<UNK>).  Choose a vocabulary in advance and replace other words in the training corpus with <UNK>.  Replace the first occurrence of each word in the training data with <UNK>. 3/29/2024 21
  • 22. Estimating Parameters and Smoothing  Estimating Parameters  Parameter estimation is fundamental to many statistical approaches to NLP.  Because of the high-dimensional nature of natural language, it is often easy to generate an extremely large number of features.  The challenge of parameter estimation is to find a combination of the typically noisy, redundant features that accurately predicts the target output variable and avoids over fitting.  List of potential parameter estimators:  Maximum Entropy (ME) estimation with L2 regularization, the Averaged Perceptron (AP), Boosting, ME estimation with L1 regularization using a novel optimization algorithm, and BLasso, which is a version of Boosting with Lasso (L1) regularization, etc 3/29/2024 22
  • 23. Estimating Parameters and Smoothing…  Estimating Parameters…  Intuitively, this can be achieved either  By selecting a small number of highly-effective features and ignoring the others, or  By averaging over a large number of weakly informative features.  The first intuition motivates feature selection methods such as Boosting and Blasso which usually work best when many features are completely irrelevant.  L1 or Lasso regularization of linear models embeds feature selection into regularization so that both an assessment of the reliability of a feature and the decision about whether to remove it are done in the same framework, and has generated a large amount of interest in the NLP community recently. 3/29/2024 23
  • 24. Estimating Parameters and Smoothing…  Estimating Parameters…  If on the other hand most features are noisy but at least weakly correlated with the target, it may be reasonable to attempt to reduce noise by averaging over all of the features.  ME estimators with L2 regularization, which have been widely used in NLP tasks tend to produce models that have this property.  In addition, the perceptron algorithm and its variants, e.g., the voted or averaged perceptron, is becoming increasingly popular due to their competitive performance, simplicity in implementation and low computational cost in training. 3/29/2024 24
  • 25. Estimating Parameters and Smoothing…  Smoothing (to keep the ML from assigning zero probability)  Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so maximum likelihood estimates (MLE) incorrectly assigns zero to many parameters (a.k.a. sparse data).  If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).  In practice, parameters are smoothed (a.k.a. regularized) to reassign some probability mass to unseen events.  Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1. 3/29/2024 25
  • 26. Estimating Parameters and Smoothing…  Smoothing…  “Hallucinate” additional training data in which each possible N- gram occurs exactly once and adjust estimates accordingly. 3/29/2024 26  Smoothing…  where V is the total number of possible (N1)-grams (i.e. the vocabulary size for a bigram model).  Tends to reassign too much mass to unseen events, so can be adjusted to add 0<<1 (normalized by V instead of V).
  • 27. Estimating Parameters and Smoothing…  Smoothing…  Advanced Smoothing (discounting)  Many advanced techniques have been developed to improve smoothing for language models.  Laplace smoothing (simple approach)  Add-k smoothing  Backoff and Interpolation  Kneser-Ney smoothing  Class-based (cluster) N-grams 3/29/2024 27
  • 28. Evaluating Language Model  Ideally, evaluate use of model in end application (extrinsic evaluation)  Realistic approach  Expensive (time consuming)  Evaluate the ability of the model using test corpus and metrics (intrinsic evaluation: independent of any application).  Less realistic  Cheaper  Verify at least once that intrinsic evaluation correlates with an extrinsic one. 3/29/2024 28
  • 29. Evaluating Language Model …  Perplexity  Measure of how well a model “fits” the test data.  Uses the probability that the model assigns to the test corpus.  Normalizes for the number of words in the test corpus and takes the inverse. 3/29/2024 29  Measures the weighted average branching factor in predicting the next word (lower pp(w) is better).
  • 30. Evaluating Language Model …  Sample Perplexity Evaluation (for different n-gram models)  Models trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979 word vocabulary.  Evaluate on a disjoint set of 1.5 million WSJ words. (test set) 3/29/2024 30 Unigram Bigram Trigram Perplexity 962 170 109
  • 31. Summary of Language Model  Limitations of LM (n-gram) so far:  P(word / full history) is too expensive.  P(word / previous few words) is feasible  The approach give us the local context only! It has lack of the global context.  Other approaches:  Neural Networks  Recurrent Neural Network (RNN – Most recent words)  Long Short Term Memory (LSTM – limited to a few hundred words due to their inherently complex sequential paths from the previous unit to the current unit)  Transformer (new model – in 2017 Google paper) 3/29/2024 31
  • 33. Outline  Speech Processing  Speech Recognition  Speech Recognition Process  Parameters of Speech Recognition Task  Large-Vocabulary Continuous Speech Recognition  Speech Recognition Architecture  Feature Extraction 3/29/2024 33
  • 35. Speech Processing …  Speech Analysis/Synthesis: Speech analysis enables to identify words and analyze audio patterns to detect emotions and stress in a speaker's voice. Speech Synthesis is the artificial production of human speech. The modern task of speech synthesis, also called text-to-speech or TTS, is to produce speech (acoustic waveforms) from text input.  Speech Recognition is the process by which a computer (or a machine) converts the voice signal into the corresponding text or command through identification and understanding.  Speech Coding: is the process of obtaining a compact representation of voice signals for efficient transmission over band-limited wired and wireless channels. 3/29/2024 35
  • 36. Speech Recognition  Application areas: Human-computer interaction: While many tasks are better solved with visual or pointing interfaces, speech has the potential to be a better interface than the keyboard for tasks where full natural language communication is useful, or for which keyboards are not appropriate. This includes hands-busy or eyes-busy applications, such as when the user has objects to manipulate or equipment to control. Telephony: Speech recognition is already used here, for example in spoken dialogue systems for entering digits, recognizing “yes” to accept collect calls, finding out airplane or train information, and call routing (“Accounting, please”, “Prof. Regier, please”). In some applications, a multimodal interface combining speech and pointing can be more efficient than a graphical user interface without speech. 3/29/2024 36
  • 37. Speech Recognition…  Application areas… Dictation: Automatic Speech Recognition (ASR) can also be applied to dictation, that is, the transcription of an extended monologue by a single specific speaker. Dictation is common in fields such as law and is also important as part of augmentative communication (interaction between computers and humans with a disability that results in the inability to type or to speak). 3/29/2024 37
  • 38. Speech Recognition…  The general problem of automatic transcription of speech by any speaker in any environment is still far from solved.  The SR process can be viewed as pattern recognition and matching.  Speech features are extracted from the original speech signal, which should first have been pre-processed and analysed, and finally an SR template is constructed.  During recognition, the voice templates stored in the system are compared with the characteristics of the input voice signal, according to certain algorithms and strategies, in order to identify the template that best matches the input voice, and finally the recognition results are output. 3/29/2024 38
  • 39. Speech Recognition Process  The Speech Recognition (SR) process generally involves the following key modules:  Signal pre-processing,  Speech feature extraction,  Matching against the training-library templates, and  Outputting the matching results. 3/29/2024 39
  • 40. Speech Recognition Process…  The SR process generally involves the following key modules (a highly simplified matching sketch follows below):  Signal pre-processing module: samples the voice signal, removes the noise introduced by the equipment and the environment, and selects the speech recognition unit.  Speech feature-extraction module: extracts the acoustic parameters that reflect the essential characteristics of the voice, such as frequency and amplitude.  Matching module: computes the likelihood between the input features and the stored templates according to criteria such as word-formation rules, grammar rules, and semantic rules, and determines the semantic content of the input voice. 3/29/2024 40
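As a highly simplified, illustrative sketch of the matching idea (real systems align sequences with dynamic time warping or HMMs rather than comparing averaged vectors), the function below picks the stored template whose features are closest to the input; all names are hypothetical.

```python
import numpy as np

def match_template(input_features, templates):
    """Return the label of the stored template closest to the input.
    input_features: (T, D) array of per-frame acoustic features.
    templates: dict mapping a word label to a (T', D) feature array.
    Each utterance is summarized by its mean feature vector, which is a
    drastic simplification of real DTW/HMM-based matching."""
    query = input_features.mean(axis=0)
    best_label, best_dist = None, float("inf")
    for label, feats in templates.items():
        dist = np.linalg.norm(query - feats.mean(axis=0))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```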
  • 41. Parameters of Speech Recognition Task  Vocabulary size  Digit Recognition  Large Vocabulary  How fluent, natural, or conversational the speech is:  Isolated Word  Continuous Speech  Read Speech  Conversational Speech  Channel and Noise  Accent or Speaker-class Characteristics 3/29/2024 41
  • 42. Parameters of Speech Recognition Task…  One dimension of variation in speech recognition tasks is vocabulary size:  Speech recognition is easier if the number of distinct words to be recognized is smaller. Tasks with a two-word vocabulary, like yes-versus-no detection, or an eleven-word vocabulary, like recognizing sequences of digits (the so-called digits task), are relatively easy.  At the other end, tasks with large vocabularies, like transcribing human-human telephone conversations or transcribing broadcast news, with vocabularies of 64,000 words or more, are much harder. 3/29/2024 42
  • 43. Parameters of Speech Recognition Task…  A second dimension of variation is how fluent, natural, or conversational the speech is.  Isolated word recognition, in which each word is surrounded by some sort of pause, is much easier than recognizing continuous speech.  In continuous speech the words run into each other and have to be segmented, and continuous speech tasks themselves vary greatly in difficulty. 3/29/2024 43
  • 44. Parameters of Speech Recognition Task…  Continuous speech in which words run into each other and have to be segmented. Continuous speech tasks themselves vary greatly in difficulty.  For example, human-to-machine speech turns out to be far easier to recognize than human-to-human speech. That is, recognizing speech of humans talking to machines, either reading out loud in read speech (which simulates the dictation task), or conversing with speech dialogue systems, is relatively easy.  Recognizing the speech of two humans talking to each other, in conversational speech recognition, for example for transcribing a business meeting or a telephone conversation, is much harder.  It seems that when humans talk to machines, they simplify their speech quite a bit, talking more slowly and more clearly. 3/29/2024 44
  • 45. Parameters of Speech Recognition Task…  A third dimension of variation is channel and noise.  The dictation task (and much laboratory research in speech recognition) is done with high quality, head mounted microphones.  Head mounted microphones eliminate the distortion that occurs in a table microphone as the speaker’s head moves around.  Noise of any kind also makes recognition harder.  Thus recognizing a speaker dictating in a quiet office is much easier than recognizing a speaker in a noisy car on the highway with the window open. 3/29/2024 45
  • 46. Parameters of Speech Recognition Task…  A final dimension of variation is accent or speaker-class characteristics.  Speech is easier to recognize if the speaker is speaking a standard dialect, or in general one that matches the data the system was trained on.  Recognition is thus harder on foreign accented speech, or speech of children (unless the system was specifically trained on exactly these kinds of speech). 3/29/2024 46
  • 47. Parameters of Speech Recognition Task…  The table shows the rough percentage of incorrect words (the word error rate, or WER) from state-of-the-art systems on different ASR tasks. 3/29/2024 47
  • 48. Parameters of Speech Recognition Task…  A final dimension of variation is accent or speaker-class characteristics…  Variation due to noise and accent increases the error rates quite a bit.  The word error rate on strongly Japanese-accented or Spanish-accented English has been reported to be about 3 to 4 times higher than for native speakers on the same task.  Adding automobile noise at a 10 dB SNR (signal-to-noise ratio) can cause error rates to go up by 2 to 4 times. 3/29/2024 48
  • 49. Large-Vocabulary Continuous Speech Recognition (LVCSR)  Large vocabulary generally means that the systems have a vocabulary of roughly 20,000 to 60,000 words.  Continuous means that the words are run together naturally.  Algorithms can be speaker independent; that is, they are able to recognize speech from people whose speech the system has never been exposed to before. (Speaker independent algorithms are considered for this discussion) 3/29/2024 49
  • 50. Speech Recognition Architecture  The task of speech recognition is to take an acoustic waveform as input and produce a string of words as output.  HMM-based speech recognition systems view this task using the metaphor of the noisy channel.  The intuition of the noisy channel model is to treat the acoustic waveform as a “noisy” version of the string of words, i.e., a version that has been passed through a noisy communications channel. 3/29/2024 50
  • 51. Speech Recognition Architecture…  This channel introduces “noise” which makes it hard to recognize the “true” string of words. The goal is then to build a model of the channel so that one can figure out how it modified this “true” sentence and hence recover it.  The insight of the noisy channel model is that, if we know how the channel distorts the source, we could find the correct source sentence for a waveform by taking every possible sentence in the language, running each sentence through our noisy channel model, and seeing how well it matches the output.  It is then possible to select the best-matching source sentence as the desired source sentence. 3/29/2024 51
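In equation form this is the usual Bayes decision rule, Ŵ = argmax over W of P(O | W) · P(W), where P(O | W) comes from the acoustic model and P(W) from the language model. The snippet below is only a schematic sketch of that rule (real decoders search a lattice rather than enumerating sentences); the function names and arguments are hypothetical.

```python
def noisy_channel_decode(candidate_sentences, acoustic_score, lm_prob):
    """Pick the word string W maximizing P(O | W) * P(W).
    acoustic_score(W) and lm_prob(W) are assumed to return probabilities
    for a fixed observed waveform O; enumerating candidates like this is
    only feasible for toy examples."""
    return max(candidate_sentences,
               key=lambda w: acoustic_score(w) * lm_prob(w))
```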
  • 53. Feature Extraction 3/29/2024 53  MFCC = Mel-Frequency Cepstral Coefficients  DFT = Discrete Fourier Transform  IDFT = Inverse Discrete Fourier Transform (used to compute the cepstrum)
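As an illustrative sketch (not from the slides), MFCC features can be computed with the librosa library; the file path and parameter values below are placeholders.

```python
import librosa  # a commonly used audio-analysis library

def extract_mfcc(wav_path, n_mfcc=13):
    """Load a waveform and compute MFCC features.
    The library follows the standard pipeline: framing and windowing,
    DFT, mel filterbank, log compression, and a discrete cosine
    transform to obtain the cepstral coefficients.
    Returns an array of shape (n_mfcc, n_frames)."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the native sample rate
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```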
  • 56. Individual Assignment - Two  Review the paper given below  Paper-2: A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing 3/29/2024 56