Deep Learning Srihari
NLP: N-Gram Models
Sargur N. Srihari
srihari@cedar.buffalo.edu
This is part of lecture slides on Deep Learning:
http://www.cedar.buffalo.edu/~srihari/CSE676
Topics in NLP
1. Overview
2. N-gram Models
3. Neural Language Models
4. High-dimensional Outputs
5. Combining Neural LMs with n-grams
6. Neural Machine Translation
7. Attention Models
8. Historical Perspective
Use of probability in NLP
• Some tasks involving probability
1. Predicting the next word:
   all of a sudden I notice three guys standing on the sidewalk
2. Which is more probable? The same set of words in a different order is nonsensical:
   on guys all I of notice sidewalk three a sudden standing the
Probability essential in tasks with ambiguous input:
• Speech recognition
• Spelling correction, grammatical error correction
• Machine translation
Computing the probability of a word
• Consider the task of computing P(w | h)
– the probability of a word w given some history h
– Suppose the history h is
“its water is so transparent that”
– We want the probability that the next word is "the":
P(the | its water is so transparent that)
– Estimate the probability from frequency counts C:
P(the | its water is so transparent that) =
C(its water is so transparent that the) / C(its water is so transparent that)
• Even the web is not big enough to estimate probability from frequency counts
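A minimal sketch of this count-based estimate, on a toy corpus invented for illustration:

```python
# Estimate P(the | "its water is so transparent that") from raw frequency counts.
# Toy corpus, invented for illustration.
corpus = ("its water is so transparent that the fish are visible "
          "its water is so transparent that you can see the bottom")
words = corpus.split()

history = "its water is so transparent that".split()
h = len(history)

count_h = 0      # C(its water is so transparent that)
count_h_the = 0  # C(its water is so transparent that the)
for i in range(len(words) - h + 1):
    if words[i:i + h] == history:
        count_h += 1
        if i + h < len(words) and words[i + h] == "the":
            count_h_the += 1

p = count_h_the / count_h  # history seen twice, followed by "the" once
print(p)  # 0.5
```

The slide's point stands: for most histories, even an enormous corpus gives counts of zero, so this estimator is unusable in general.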
A better method for probabilities
• Chain rule of probability
P(X1..Xn) = P(X1) P(X2|X1) P(X3|X1:2) … P(Xn|X1:n−1) = ∏k=1..n P(Xk|X1:k−1)
(here Xi:j denotes the subsequence Xi..Xj)
– Applying it to words
P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) … P(wn|w1:n−1) = ∏k=1..n P(wk|w1:k−1)
• This shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given the previous words
• The intuition of the n-gram model is that instead
of computing the probability of a word given its
entire history, we can approximate the history
by just the last few words
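The chain-rule decomposition can be written out directly; the conditional probabilities below are invented for illustration:

```python
# Chain rule: P(w1..wn) = P(w1) P(w2|w1) ... P(wn | w1..w(n-1)).
# Each key is (history tuple, next word); the probability values are made up.
cond = {
    ((), "its"): 0.2,
    (("its",), "water"): 0.5,
    (("its", "water"), "is"): 0.9,
}

sentence = ["its", "water", "is"]
p = 1.0
for k, w in enumerate(sentence):
    p *= cond[(tuple(sentence[:k]), w)]  # condition on the full history
print(p)  # 0.2 * 0.5 * 0.9
```

An n-gram model replaces `tuple(sentence[:k])` with only the last n−1 words, which is what makes the table of conditionals finite.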
Bigram Probabilities
• Bigram
– instead of computing the probability
P(the|Walden Pond’s water is so transparent that)
– we approximate it with the probability
P(the|that)
• We are making the assumption that
– P(wn | w1:n−1) ≈ P(wn | wn−1)
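Under the bigram assumption, each conditional can be estimated by relative frequency, C(wn−1 wn) / C(wn−1); a sketch on a toy corpus invented for illustration:

```python
from collections import Counter

# Bigram MLE: P(w | v) = C(v w) / C(v). Toy corpus for illustration.
tokens = "i am sam sam i am i do not like green eggs".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(w, v):
    return bigrams[(v, w)] / unigrams[v]

print(p_bigram("am", "i"))  # C(i am)=2, C(i)=3 -> 2/3
```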
N-Gram Models
• An n-gram is a sequence of tokens, e.g., words
– n-gram models define the conditional probability of
the nth token given the previous n-1 tokens
• Products of conditional distributions define
probability distributions of longer sequences
• Comes from chain rule of probability
– P(x1,..,xn) = P(xn | x1,..,xn−1) P(x1,..,xn−1)
• Distribution of P(x1,..,xn-1) may be defined by a different
model with a smaller value of n
P(x1,..,xτ) = P(x1,..,xn−1) ∏t=n..τ P(xt | xt−n+1,..,xt−1)
– the first factor is the marginal probability of the initial sequence of length n−1; each factor in the product involves a sequence of length n
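The formula above can be implemented directly; `sequence_prob` and its two model components are hypothetical names, and the toy numbers are invented:

```python
# P(x1..xT) = P(x1..x(n-1)) * prod_{t=n..T} P(xt | x(t-n+1)..x(t-1))
def sequence_prob(xs, n, initial_prob, cond_prob):
    p = initial_prob(tuple(xs[:n - 1]))   # marginal over the first n-1 tokens
    for t in range(n - 1, len(xs)):       # conditionals for the rest
        context = tuple(xs[t - n + 1:t])
        p *= cond_prob(xs[t], context)
    return p

# Toy bigram (n=2) model with made-up probabilities.
init = {("a",): 0.6}                              # P(a)
cond = {("b", ("a",)): 0.5, ("b", ("b",)): 0.25}  # P(b|a), P(b|b)
p = sequence_prob(["a", "b", "b"], 2, lambda c: init[c], lambda w, c: cond[(w, c)])
print(p)  # 0.6 * 0.5 * 0.25
```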
Common n-grams
• Count how many times each possible n-gram
occurs in the training set
– Models based on n-grams have long been a core building block of NLP
• For small values of n, we have
n=1 : unigram P(x1)
n=2 : bigram P(x1, x2)
n=3 : trigram P(x1, x2, x3)
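Counting n-grams in a training set is a short exercise with `collections.Counter`; toy corpus for illustration:

```python
from collections import Counter

tokens = "to be or not to be".split()

def ngrams(seq, n):
    """All length-n windows of seq, as tuples."""
    return list(zip(*(seq[i:] for i in range(n))))

unigrams = Counter(ngrams(tokens, 1))  # n=1
bigrams = Counter(ngrams(tokens, 2))   # n=2
trigrams = Counter(ngrams(tokens, 3))  # n=3
print(bigrams[("to", "be")])  # "to be" occurs twice -> 2
```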
Examples of n-grams
1. Frequencies of 4-grams and 3-grams
2. Protein Sequencing
3. DNA Sequencing
4. Computational linguistics (character)
5. Computational Linguistics (word)
Unigram and Bigram Probabilities
[Tables in the original slides: unigram counts, bigram counts, and the resulting bigram probabilities]
Training n-Gram Models
• Usually we train both an n-gram model and an (n−1)-gram model, making the conditional probability easy to compute
– Simply by looking up two stored probabilities
– For this to exactly reproduce inference in Pn, we must omit the final character from each sequence when we train Pn−1
P(xt | xt−n+1,..,xt−1) = Pn(xt−n+1,..,xt) / Pn−1(xt−n+1,..,xt−1)
– the numerator is the n-gram probability, the denominator the (n−1)-gram probability
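A sketch of this two-table lookup, with both models estimated by relative frequency on the same toy corpus (when both come from the same counts, the ratio of probabilities reduces to a ratio of counts):

```python
from collections import Counter

# Toy corpus, invented for illustration.
tokens = "the dog ran away the dog ran home the cat ran away".split()

def ngram_counts(seq, n):
    return Counter(zip(*(seq[i:] for i in range(n))))

c3 = ngram_counts(tokens, 3)  # stands in for the stored trigram table Pn
c2 = ngram_counts(tokens, 2)  # stands in for the stored bigram table Pn-1

# P(ran | the dog) = C(the dog ran) / C(the dog)
p = c3[("the", "dog", "ran")] / c2[("the", "dog")]
print(p)  # both counts are 2 -> 1.0
```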
Example of Trigram Model Computation
• How to compute probability of “THE DOG RAN AWAY”
• The first words of the sentence cannot be handled by the default conditional-probability formula because there is no context at the beginning of the sentence
• Instead we must use the marginal probability over words at the start of the sentence; we thus evaluate P3(THE DOG RAN)
• The last word may be predicted using the typical case, with the conditional distribution P(AWAY | DOG RAN)
• Putting this together
P(THE DOG RAN AWAY) = P3(THE DOG RAN) P3(DOG RAN AWAY) / P2(DOG RAN)
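Numerically, with probabilities invented purely for illustration:

```python
# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P(AWAY | DOG RAN)
#                     = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
p3_the_dog_ran  = 0.001   # assumed value
p3_dog_ran_away = 0.0004  # assumed value
p2_dog_ran      = 0.002   # assumed value

p = p3_the_dog_ran * p3_dog_ran_away / p2_dog_ran
print(p)  # 0.001 * (0.0004 / 0.002), i.e. about 2e-4
```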
Limitation of Maximum Likelihood for n-gram models
• Pn estimated from training samples is very likely
to be zero in many cases even though the tuple
xt-n+1,..,xt may appear in test set
– When Pn−1 is zero, the ratio is undefined
– When Pn−1 is non-zero but Pn is zero, the log-likelihood is −∞
• To avoid such catastrophic outcomes, n-gram
models employ smoothing
– Shift probability mass from observed tuples to unobserved similar ones
Smoothing techniques
1. Add non-zero mass to next symbol values
• Justified as Bayesian inference with a uniform or Dirichlet
prior over count parameters
2. Mixture of higher/lower-order n-gram models
• with higher-order models providing more capacity and
lower-order models more likely to avoid counts of zero
3. Back-off methods look up lower-order n-grams if the frequency of the context xt−1, . . . , xt−n+1 is too small to use the higher-order model
• More formally, they estimate the distribution over xt by using contexts xt−n+k, . . . , xt−1, for increasing k, until a sufficiently reliable estimate is found
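A sketch of technique 1 in its simplest form, add-one (Laplace) smoothing of bigram estimates; toy corpus for illustration:

```python
from collections import Counter

tokens = "i am sam sam i am".split()
V = len(set(tokens))  # vocabulary size
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(w, v):
    # Every bigram gets at least count 1, so no estimate is ever zero;
    # seen bigrams are correspondingly discounted.
    return (bigrams[(v, w)] + 1) / (unigrams[v] + V)

print(p_laplace("am", "i"))   # seen:   (2+1)/(2+3) = 0.6
print(p_laplace("sam", "i"))  # unseen: (0+1)/(2+3) = 0.2
```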
N-gram with Backoff
• For high-order models, e.g., N = 5, only a small fraction of N-grams appear in the training corpus
– a problem of data sparsity
• with 0 probability assigned to almost all sentences
• To counteract this, several back-off techniques have been suggested; in the most popular formulation (the formula appears as a figure in the original slides), α and p are called back-off coefficients and discounted probabilities, respectively
Source: https://arxiv.org/pdf/1804.07705.pdf
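The exact formulation with learned α and p is in the cited source; a much-simplified variant ("stupid backoff", with one fixed weight in place of learned coefficients) conveys the idea:

```python
from collections import Counter

# Toy corpus, invented for illustration.
tokens = "the dog ran away the cat sat down".split()

def counts(seq, n):
    return Counter(zip(*(seq[i:] for i in range(n))))

c1, c2, c3 = counts(tokens, 1), counts(tokens, 2), counts(tokens, 3)
ALPHA = 0.4  # fixed back-off weight, a simplification of the learned alpha

def score(w, context):
    """Trigram score for w, backing off to bigram then unigram on zero counts."""
    if len(context) == 2 and c3[context + (w,)] > 0:
        return c3[context + (w,)] / c2[context]
    if len(context) >= 1 and c2[(context[-1], w)] > 0:
        return ALPHA * c2[(context[-1], w)] / c1[(context[-1],)]
    return ALPHA * ALPHA * c1[(w,)] / len(tokens)

print(score("ran", ("the", "dog")))  # trigram seen -> 1.0
print(score("sat", ("the", "dog")))  # backs off to the unigram estimate
```

Note these back-off scores are not normalized probabilities; proper back-off (e.g. Katz) uses the α and p above to renormalize.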
Shortcomings of n-gram models
• Vulnerable to curse of dimensionality
• There are |V|^n possible n-grams and |V| is large
• Even with a massive training set most n-grams
will not occur
• One way to view a classical n-gram model is
that it is performing nearest-neighbor lookup
– In other words, it can be viewed as a local non-parametric predictor, similar to k-nearest neighbors
– Any two words are at the same distance from each other
Class-based language models
• To improve statistical efficiency of n-gram
models
– Introduce notion of word categories
• Share statistics of words in same categories
• Idea: use a clustering algorithm to partition words into clusters based on their co-occurrence frequencies with other words
– Model can then use word-class IDs rather than
individual word-IDs to represent context
• Still much information is lost in this process
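A minimal class-based bigram sketch; the word-to-class mapping is hand-made here, standing in for the output of a clustering algorithm:

```python
from collections import Counter

word2class = {"the": "DET", "dog": "NOUN", "cat": "NOUN",
              "ran": "VERB", "sat": "VERB", "monday": "DAY", "tuesday": "DAY"}

tokens = "the dog ran monday the cat sat tuesday".split()
classes = [word2class[w] for w in tokens]

class_counts = Counter(classes)
# Count (previous class, current word): words of the same class share context
# statistics, so "dog" and "cat" contribute to the same NOUN counts.
cw = Counter(zip(classes, tokens[1:]))

def p_word_given_prev(w, prev_word):
    c = word2class[prev_word]
    return cw[(c, w)] / class_counts[c]

# "cat ran" never occurs, yet gets mass because "dog ran" did:
print(p_word_given_prev("ran", "cat"))  # 1/2
```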