Language Model
Natural Language Processing Series 4: Machine Translation (自然言語処理シリーズ 4 機械翻訳), pp. 62-80
Koichi Akabe
MT study
NAIST
2014-05-08
Fluency of Machine Translation

Machine Translation: f −→ e

Which translation e is correct?
▶ e1 = he is big
▶ e2 = is big he −→ broken syntax
▶ e3 = this is a purple dog −→ grammatical, but a sentence we have never seen

We can tell the answer without looking at f

Language model (LM)

A language model assigns a score P(e) to each sentence without looking at f:
▶ P(e = he is big)
▶ P(e = is big he)
▶ P(e = this is a purple dog)

Using this, we can compare sentences:
P(e = e1) > P(e = e3) > P(e = e2) ?

MT systems use an LM to improve translation accuracy
We call P(e) the “language model probability”

How to calculate P(e)?

We want the probability of a sentence:
P(e = he is big)

Direct method: count the frequency of whole sentences in the training data

P_{ML}(e) = \frac{c_{train}(e)}{\sum_{e'} c_{train}(e')}

Almost all possible sentences never appear in the training data
(−→ P_{ML}(e) = 0 for almost all sentences)

To solve this problem, we focus on words

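As a minimal sketch of this direct estimate (the toy corpus below is invented for illustration), counting whole sentences immediately shows the zero-probability problem:

    from collections import Counter

    # Hypothetical training corpus of whole sentences.
    train = ["he is big", "she is small", "he is big", "he is tall"]
    counts = Counter(train)
    total = sum(counts.values())

    def p_ml_sentence(e):
        # P_ML(e) = c_train(e) / sum_{e'} c_train(e')
        return counts[e] / total

    print(p_ml_sentence("he is big"))   # 0.5
    print(p_ml_sentence("she is big"))  # 0.0 -- any unseen sentence gets zero
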
Rewrite P using words

P(e = he is big)

First, we split the variable e into words and the sentence length I:
P(I = 3, e1 = he, e2 = is, e3 = big)

To keep a uniform variable type, we replace I with a sentence-end symbol e_{I+1} = ⟨/s⟩:
P(e1 = he, e2 = is, e3 = big, e4 = ⟨/s⟩)

We also add a prefix symbol used as context (described later):
P(e0 = ⟨s⟩, e1 = he, e2 = is, e3 = big, e4 = ⟨/s⟩)

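In code, this rewrite is just padding the token sequence with boundary symbols (a small sketch; ⟨s⟩ and ⟨/s⟩ are spelled <s> and </s> here):

    def pad(sentence):
        # e_0 = <s> (context prefix), e_{I+1} = </s> (sentence end).
        return ["<s>"] + sentence.split() + ["</s>"]

    print(pad("he is big"))  # ['<s>', 'he', 'is', 'big', '</s>']
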
Rewrite P using the conditional probability P(word|context)

Chain rule:
P(e0 = ⟨s⟩, e1 = he, e2 = is, e3 = big, e4 = ⟨/s⟩)
= P(e4 = ⟨/s⟩ | e0 = ⟨s⟩, e1 = he, e2 = is, e3 = big)
× P(e3 = big | e0 = ⟨s⟩, e1 = he, e2 = is)
× P(e2 = is | e0 = ⟨s⟩, e1 = he)
× P(e1 = he | e0 = ⟨s⟩) × P(e0 = ⟨s⟩)

Generalized:

P(e_1^I) = \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_0^{i-1}) = \prod_{i=1}^{I+1} \frac{c_{train}(e_0^i)}{c_{train}(e_0^{i-1})}

where e_i^j = e_i e_{i+1} · · · e_j is a subsequence of the word sequence e_0 e_1 · · · e_{I+1}

However, c_{train}(e_0^i) becomes 0 for large i

n-gram language model

So, we avoid long word sequences!

An n-gram model uses only the last n − 1 words as context:

P(e_1^I) \approx \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_{i-n+1}^{i-1}) = \prod_{i=1}^{I+1} \frac{c_{train}(e_{i-n+1}^i)}{c_{train}(e_{i-n+1}^{i-1})}

The n-gram model alleviates the zero-probability problem

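A sketch of the n-gram ML estimate (a 2-gram model over a toy corpus; note that an unseen context still breaks it, which is exactly what the smoothing methods below address):

    from collections import Counter

    n = 2
    corpus = ["he is big", "she is small", "he is tall"]

    ngrams, contexts = Counter(), Counter()
    for s in corpus:
        tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            ngrams[tuple(tokens[i - n + 1 : i + 1])] += 1   # c_train(e_{i-n+1}^i)
            contexts[tuple(tokens[i - n + 1 : i])] += 1     # c_train(e_{i-n+1}^{i-1})

    def p_ml(word, context):
        # Raises ZeroDivisionError for an unseen context -- no smoothing yet.
        return ngrams[context + (word,)] / contexts[context]

    def p_sentence(s):
        tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
        p = 1.0
        for i in range(n - 1, len(tokens)):
            p *= p_ml(tokens[i], tuple(tokens[i - n + 1 : i]))
        return p

    print(p_sentence("he is big"))  # 2/3 * 1 * 1/3 * 1 = 0.222...
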
Example of exact / 2-gram probabilities

Exact probability:
P(e = he is big) = P_{ML}(⟨/s⟩ | ⟨s⟩ he is big)
× P_{ML}(big | ⟨s⟩ he is)
× P_{ML}(is | ⟨s⟩ he)
× P_{ML}(he | ⟨s⟩)

2-gram probability:
P(e = he is big) ≈ P_{ML}(⟨/s⟩ | big)
× P_{ML}(big | is)
× P_{ML}(is | he)
× P_{ML}(he | ⟨s⟩)

Smoothing

Smoothing makes the LM robust to unseen linguistic phenomena

The basic idea: estimate the n-gram probability with help from (n − 1)-gram or shorter contexts

Linear interpolation

Interpolate the probability with shorter n-grams:

P(e_i \mid e_{i-n+1}^{i-1}) = (1 - \alpha) P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + \alpha P(e_i \mid e_{i-n+2}^{i-1})

[figure: bar charts contrasting the mixed distribution for large α vs. small α]

At the unigram level, give a constant probability to unknown words:

P(e_i) = (1 - \alpha) P_{ML}(e_i) + \alpha \frac{1}{|V|}

where |V| is the vocabulary size

How should we choose α?

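A recursive sketch of the interpolated probability with a single fixed α, bottoming out at the uniform 1/|V| term (2-gram tables built as in the earlier sketch; the α and |V| values are arbitrary placeholders):

    from collections import Counter

    corpus = ["he is big", "she is small", "he is tall"]
    ngrams, contexts, unigrams = Counter(), Counter(), Counter()
    for s in corpus:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[1:])              # unigram counts (excluding <s>)
        for i in range(1, len(tokens)):
            ngrams[(tokens[i - 1], tokens[i])] += 1
            contexts[(tokens[i - 1],)] += 1
    total_words = sum(unigrams.values())

    def p_interp(word, context, alpha=0.3, vocab=10000):
        if not context:
            # Unigram level: interpolate with the uniform 1/|V| floor, so
            # even unknown words receive probability alpha / |V|.
            return (1 - alpha) * unigrams[word] / total_words + alpha / vocab
        c = contexts[context]
        p_ml = ngrams[context + (word,)] / c if c else 0.0
        return (1 - alpha) * p_ml + alpha * p_interp(word, context[1:], alpha, vocab)

    print(p_interp("big", ("is",)))      # mixes 2-gram, 1-gram, and uniform terms
    print(p_interp("dog", ("purple",)))  # unseen context -> still nonzero
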
Idea of the Witten-Bell method

Table: comparison of two n-gram contexts

  president was            president ronald
  elected    5             reagan      38
  the        3             caza         1
  in         3             venetiaan    1
  first      3
  · · ·
  52 unique words,         3 unique words,
  110 occurrences          40 occurrences

▶ “president was” may be followed by previously unseen words
  −→ we cannot trust P(·|president was)
▶ “president ronald” is almost always followed by “reagan”
  −→ we can trust P(·|president ronald)

Witten-Bell method

α depends on the reliability of each n-gram context:

\alpha_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1}, \cdot)}{u(e_{i-n+1}^{i-1}, \cdot) + c(e_{i-n+1}^{i-1})}

P(e_i \mid e_{i-n+1}^{i-1}) = (1 - \alpha_{e_{i-n+1}^{i-1}}) P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + \alpha_{e_{i-n+1}^{i-1}} P(e_i \mid e_{i-n+2}^{i-1})

e.g.

\alpha_{president\ was} = \frac{u(president\ was, \cdot)}{u(president\ was, \cdot) + c(president\ was)} = \frac{52}{52 + 110}

α_{president was} = 0.32 −→ do not trust (fall back to shorter contexts)
α_{president ronald} = 0.07 −→ trust

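The α computation itself is one line; a sketch using the counts from the table:

    def witten_bell_alpha(unique_followers, context_count):
        # alpha = u / (u + c): more unique followers -> trust P_ML less.
        return unique_followers / (unique_followers + context_count)

    print(witten_bell_alpha(52, 110))  # "president was"    -> ~0.32
    print(witten_bell_alpha(3, 40))    # "president ronald" -> ~0.07
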
Absolute discounting method

Subtract a constant discount d from the n-gram frequencies:

P_d(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max(c_{train}(e_{i-n+1}^i) - d, 0)}{c_{train}(e_{i-n+1}^{i-1})}

This downweights rare n-grams

[figure: ratio of discounted to raw counts as a function of the count, for d = 0.1, 0.5, 1.0, 2.0]

Absolute discounting method

The discounted probability mass is given to shorter n-grams:

\alpha_{e_{i-n+1}^{i-1}} = 1 - \sum_{e_i} P_d(e_i \mid e_{i-n+1}^{i-1})

P(e_i \mid e_{i-n+1}^{i-1}) = P_d(e_i \mid e_{i-n+1}^{i-1}) + \alpha_{e_{i-n+1}^{i-1}} P(e_i \mid e_{i-n+2}^{i-1})

e.g. d := 0.5 (normally chosen to maximize the likelihood of a development set)

P_d(reagan | president ronald) = (38 − 0.5) / 40 = 0.9375
P_d(caza | president ronald) = (1 − 0.5) / 40 = 0.0125
P_d(venetiaan | president ronald) = (1 − 0.5) / 40 = 0.0125

\alpha_{president\ ronald} = 1 - \sum_{e_i} P_d(e_i \mid president\ ronald) = 0.0375

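The same arithmetic as a short sketch (counts taken from the “president ronald” column of the earlier table):

    def discounted(count, context_count, d=0.5):
        # P_d = max(c - d, 0) / c(context)
        return max(count - d, 0.0) / context_count

    counts = {"reagan": 38, "caza": 1, "venetiaan": 1}  # words after "president ronald"
    context_count = 40
    p_d = {w: discounted(c, context_count) for w, c in counts.items()}
    alpha = 1.0 - sum(p_d.values())
    print(p_d)    # {'reagan': 0.9375, 'caza': 0.0125, 'venetiaan': 0.0125}
    print(alpha)  # 0.0375 -> probability mass handed to the shorter context
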
Kneser-Ney method

Idea: “ronald reagan” and “president reagan” appear frequently in corpora, so ordinary smoothing methods assign a large lower-order probability to “ronald” and “reagan”. However, “reagan” hardly appears in any other context.

Kneser and Ney replace the raw counts in absolute discounting with the unique-context counter u:

P_{kn}(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max(u(\cdot, e_{i-n+1}^i) - d, 0)}{u(\cdot, e_{i-n+1}^{i-1}, \cdot)}

\alpha_{e_{i-n+1}^{i-1}} = 1 - \sum_{e_i} P_{kn}(e_i \mid e_{i-n+1}^{i-1})

Kneser-Ney method (example)

u(·, reagan) = 2    u(·, ronald reagan) = 10
u(·, ronald smith) = 1    u(·, ronald, ·) = 11
u(·, ·) = 2000    d = 0.5

P_{kn}(reagan | ronald) = \frac{\max(u(\cdot, ronald\ reagan) - d, 0)}{u(\cdot, ronald, \cdot)} = 0.864

P_{kn}(smith | ronald) = \frac{\max(u(\cdot, ronald\ smith) - d, 0)}{u(\cdot, ronald, \cdot)} = 0.045

P_{kn}(reagan) = \frac{\max(u(\cdot, reagan) - d, 0)}{u(\cdot, \cdot)} = 0.00075

\alpha_{ronald} = 1 - \sum_{e_i} P_{kn}(e_i \mid ronald) = 0.091

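A sketch reproducing these numbers from the unique-count table:

    d = 0.5
    u = {                            # unique-count table from the slide
        ("ronald", "reagan"): 10,    # u(., ronald reagan)
        ("ronald", "smith"): 1,      # u(., ronald smith)
        ("reagan",): 2,              # u(., reagan)
    }
    u_ronald_dot = 11                # u(., ronald, .)
    u_dot_dot = 2000                 # u(., .)

    p_reagan = max(u[("ronald", "reagan")] - d, 0) / u_ronald_dot  # ~0.864
    p_smith = max(u[("ronald", "smith")] - d, 0) / u_ronald_dot    # ~0.045
    p_uni_reagan = max(u[("reagan",)] - d, 0) / u_dot_dot          # 0.00075
    alpha_ronald = 1 - (p_reagan + p_smith)                        # ~0.091
    print(p_reagan, p_smith, p_uni_reagan, alpha_ronald)
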
Other methods

Additive smoothing:

P_\delta(e_i \mid e_{i-n+1}^{i-1}) = \frac{c_{train}(e_{i-n+1}^i) + \delta}{c_{train}(e_{i-n+1}^{i-1}) + \delta |W|}

where |W| is the number of word types (for normalization)

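A sketch of add-δ smoothing (δ and |W| are placeholder values; the counts reuse the “president ronald” example):

    def p_additive(ngram_count, context_count, delta=0.1, vocab_size=10000):
        # (c(ngram) + delta) / (c(context) + delta * |W|)
        return (ngram_count + delta) / (context_count + delta * vocab_size)

    print(p_additive(0, 40))   # unseen n-gram still gets a small probability
    print(p_additive(38, 40))  # frequent n-gram; note how much mass delta*|W| removes
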
Other methods

Good-Turing (named after the statistician I. J. Good)

The Turing estimator replaces each raw count r with a revised count r*:

r^* = (r + 1) \frac{N_{r+1}}{N_r}

where N_r is the number of words occurring r times

If N_r = 0, r^* is undefined

The Good-Turing estimator solves this with a linear regression based on Zipf's law:

Z_{r_i} := \frac{2 N_{r_i}}{r_{i+1} - r_{i-1}}

where r_i is the i-th non-zero count (r_1 < r_2 < r_3 < · · ·)

Other methods

Good-Turing (cont'd)

Estimate a and b by regression:

\log Z_{r_i} \sim a + b \log r_i

r^* = (r + 1) \frac{Z_{r+1}}{Z_r} = r \left(1 + \frac{1}{r}\right)^{b+1}

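A sketch of the whole Good-Turing fit: compute Z_r (using the common boundary conventions r_0 = 0 and r_{k+1} = 2 r_k − r_{k−1}, an assumption here), fit log Z_r = a + b log r by least squares, and apply the closed form for r*. The count-of-counts table is invented:

    import math

    # N_r: number of word types occurring r times (hypothetical values).
    count_of_counts = {1: 120, 2: 40, 5: 12, 10: 4}
    rs = sorted(count_of_counts)

    # Z_{r_i} = 2 N_{r_i} / (r_{i+1} - r_{i-1}).
    z = {}
    for i, r in enumerate(rs):
        lo = rs[i - 1] if i > 0 else 0
        hi = rs[i + 1] if i + 1 < len(rs) else 2 * r - lo
        z[r] = 2 * count_of_counts[r] / (hi - lo)

    # Least-squares fit of log Z = a + b log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(z[r]) for r in rs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    def r_star(r):
        # r* = r (1 + 1/r)^(b+1), well defined even where N_{r+1} = 0.
        return r * (1 + 1 / r) ** (b + 1)

    print(b, r_star(1), r_star(5))
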
Other methods

Back-off: use shorter n-grams only when the longer n-gram does not appear in the training data

Absolute discounting with back-off:

P(e_i \mid e_{i-n+1}^{i-1}) =
\begin{cases}
P_d(e_i \mid e_{i-n+1}^{i-1}) & (c(e_{i-n+1}^i) > 0) \\
\beta_{e_{i-n+1}^{i-1}} P(e_i \mid e_{i-n+2}^{i-1}) & (\text{otherwise})
\end{cases}

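A structural sketch of the back-off recursion; the count tables and β normalizers are assumed to be precomputed, and the tiny tables below exist only to exercise the seen-n-gram branch:

    def p_backoff(word, context, ngrams, contexts, beta, d=0.5):
        c_full = ngrams.get(context + (word,), 0)
        if c_full > 0:
            # The longer n-gram was seen: use its discounted probability.
            return (c_full - d) / contexts[context]
        if not context:
            return 0.0  # in practice, an unknown-word floor such as 1/|V|
        # Otherwise back off to the shorter context, scaled by beta.
        return beta.get(context, 1.0) * p_backoff(word, context[1:], ngrams, contexts, beta, d)

    ngrams = {("president", "ronald", "reagan"): 38}
    contexts = {("president", "ronald"): 40}
    print(p_backoff("reagan", ("president", "ronald"), ngrams, contexts, beta={}))  # 0.9375
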