Language Model
Natural Language Processing Series 4: Machine Translation (自然言語処理シリーズ 4 機械翻訳), pp. 62-80
Koichi Akabe
MT study
NAIST
2014-05-08
Fluency of Machine Translation

Machine Translation: f −→ e

Which translation e is correct?
▶ e1 = he is big
▶ e2 = is big he −→ broken syntax
▶ e3 = this is a purple dog −→ grammatical, but a sentence we have never seen

We can tell the answer without looking at f

Language model (LM)

A language model assigns a score P(e) to each sentence without looking at f:
▶ P(e = he is big)
▶ P(e = is big he)
▶ P(e = this is a purple dog)

Using this, we can compare sentences:
P(e = e1) > P(e = e3) > P(e = e2) ?

MT systems use an LM to improve translation accuracy
We call P(e) the “language model probability”

How to calculate P(e)?

We want the probability of a sentence:
P(e = he is big)

Direct method: count the frequency of whole sentences in the training data

P_{ML}(e) = \frac{c_{train}(e)}{\sum_{e'} c_{train}(e')}

Almost all possible sentences never appear in the training data
(−→ P_{ML}(e) = 0 for almost all sentences)

To solve this problem, we focus on words

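As a minimal sketch of this direct estimate (the toy corpus below is invented for illustration), counting whole sentences immediately shows the zero-probability problem:

    from collections import Counter

    # Hypothetical training corpus of whole sentences.
    train = ["he is big", "she is small", "he is big", "he is tall"]
    counts = Counter(train)
    total = sum(counts.values())

    def p_ml_sentence(e):
        # P_ML(e) = c_train(e) / sum_{e'} c_train(e')
        return counts[e] / total

    print(p_ml_sentence("he is big"))   # 0.5
    print(p_ml_sentence("she is big"))  # 0.0 -- any unseen sentence gets zero
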
Rewrite P using words

P(e = he is big)

First, we split the variable e into words and the sentence length I:
P(I = 3, e1 = he, e2 = is, e3 = big)

To keep a uniform variable type, we replace I with a sentence-end symbol e_{I+1} = ⟨/s⟩:
P(e1 = he, e2 = is, e3 = big, e4 = ⟨/s⟩)

We also add a prefix symbol used as context (described later):
P(e0 = ⟨s⟩, e1 = he, e2 = is, e3 = big, e4 = ⟨/s⟩)

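In code, this rewrite is just padding the token sequence with boundary symbols (a small sketch; ⟨s⟩ and ⟨/s⟩ are spelled <s> and </s> here):

    def pad(sentence):
        # e_0 = <s> (context prefix), e_{I+1} = </s> (sentence end).
        return ["<s>"] + sentence.split() + ["</s>"]

    print(pad("he is big"))  # ['<s>', 'he', 'is', 'big', '</s>']
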
Rewrite P using the conditional probability P(word|context)

Chain rule:
P(e0 = ⟨s⟩, e1 = he, e2 = is, e3 = big, e4 = ⟨/s⟩)
= P(e4 = ⟨/s⟩ | e0 = ⟨s⟩, e1 = he, e2 = is, e3 = big)
× P(e3 = big | e0 = ⟨s⟩, e1 = he, e2 = is)
× P(e2 = is | e0 = ⟨s⟩, e1 = he)
× P(e1 = he | e0 = ⟨s⟩) × P(e0 = ⟨s⟩)

Generalized:

P(e_1^I) = \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_0^{i-1}) = \prod_{i=1}^{I+1} \frac{c_{train}(e_0^i)}{c_{train}(e_0^{i-1})}

where e_i^j = e_i e_{i+1} · · · e_j is a subsequence of the word sequence e_0 e_1 · · · e_{I+1}

However, c_{train}(e_0^i) becomes 0 for large i

n-gram language model

So, we avoid long word sequences!

An n-gram model uses only the last n − 1 words as context:

P(e_1^I) \approx \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_{i-n+1}^{i-1}) = \prod_{i=1}^{I+1} \frac{c_{train}(e_{i-n+1}^i)}{c_{train}(e_{i-n+1}^{i-1})}

The n-gram model alleviates the zero-probability problem

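A sketch of the n-gram ML estimate (a 2-gram model over a toy corpus; note that an unseen context still breaks it, which is exactly what the smoothing methods below address):

    from collections import Counter

    n = 2
    corpus = ["he is big", "she is small", "he is tall"]

    ngrams, contexts = Counter(), Counter()
    for s in corpus:
        tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            ngrams[tuple(tokens[i - n + 1 : i + 1])] += 1   # c_train(e_{i-n+1}^i)
            contexts[tuple(tokens[i - n + 1 : i])] += 1     # c_train(e_{i-n+1}^{i-1})

    def p_ml(word, context):
        # Raises ZeroDivisionError for an unseen context -- no smoothing yet.
        return ngrams[context + (word,)] / contexts[context]

    def p_sentence(s):
        tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
        p = 1.0
        for i in range(n - 1, len(tokens)):
            p *= p_ml(tokens[i], tuple(tokens[i - n + 1 : i]))
        return p

    print(p_sentence("he is big"))  # 2/3 * 1 * 1/3 * 1 = 0.222...
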
Example of exact / 2-gram probabilities

Exact probability:
P(e = he is big) = P_{ML}(⟨/s⟩ | ⟨s⟩ he is big)
× P_{ML}(big | ⟨s⟩ he is)
× P_{ML}(is | ⟨s⟩ he)
× P_{ML}(he | ⟨s⟩)

2-gram probability:
P(e = he is big) ≈ P_{ML}(⟨/s⟩ | big)
× P_{ML}(big | is)
× P_{ML}(is | he)
× P_{ML}(he | ⟨s⟩)

Smoothing

Smoothing makes the LM robust to unseen linguistic phenomena

The basic idea: estimate the n-gram probability with help from (n − 1)-gram or shorter contexts

Linear interpolation

Interpolate the probability with shorter n-grams:

P(e_i \mid e_{i-n+1}^{i-1}) = (1 - \alpha) P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + \alpha P(e_i \mid e_{i-n+2}^{i-1})

[figure: bar charts contrasting the mixed distribution for large α vs. small α]

At the unigram level, give a constant probability to unknown words:

P(e_i) = (1 - \alpha) P_{ML}(e_i) + \alpha \frac{1}{|V|}

where |V| is the vocabulary size

How should we choose α?

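A recursive sketch of the interpolated probability with a single fixed α, bottoming out at the uniform 1/|V| term (2-gram tables built as in the earlier sketch; the α and |V| values are arbitrary placeholders):

    from collections import Counter

    corpus = ["he is big", "she is small", "he is tall"]
    ngrams, contexts, unigrams = Counter(), Counter(), Counter()
    for s in corpus:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[1:])              # unigram counts (excluding <s>)
        for i in range(1, len(tokens)):
            ngrams[(tokens[i - 1], tokens[i])] += 1
            contexts[(tokens[i - 1],)] += 1
    total_words = sum(unigrams.values())

    def p_interp(word, context, alpha=0.3, vocab=10000):
        if not context:
            # Unigram level: interpolate with the uniform 1/|V| floor, so
            # even unknown words receive probability alpha / |V|.
            return (1 - alpha) * unigrams[word] / total_words + alpha / vocab
        c = contexts[context]
        p_ml = ngrams[context + (word,)] / c if c else 0.0
        return (1 - alpha) * p_ml + alpha * p_interp(word, context[1:], alpha, vocab)

    print(p_interp("big", ("is",)))      # mixes 2-gram, 1-gram, and uniform terms
    print(p_interp("dog", ("purple",)))  # unseen context -> still nonzero
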
Idea of the Witten-Bell method

Table: comparison of two n-gram contexts

  president was            president ronald
  elected    5             reagan      38
  the        3             caza         1
  in         3             venetiaan    1
  first      3
  · · ·
  52 unique words,         3 unique words,
  110 occurrences          40 occurrences

▶ “president was” may be followed by previously unseen words
  −→ we cannot trust P(·|president was)
▶ “president ronald” is almost always followed by “reagan”
  −→ we can trust P(·|president ronald)

Witten-Bell method

α depends on the reliability of each n-gram context:

\alpha_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1}, \cdot)}{u(e_{i-n+1}^{i-1}, \cdot) + c(e_{i-n+1}^{i-1})}

P(e_i \mid e_{i-n+1}^{i-1}) = (1 - \alpha_{e_{i-n+1}^{i-1}}) P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + \alpha_{e_{i-n+1}^{i-1}} P(e_i \mid e_{i-n+2}^{i-1})

e.g.

\alpha_{president\ was} = \frac{u(president\ was, \cdot)}{u(president\ was, \cdot) + c(president\ was)} = \frac{52}{52 + 110}

α_{president was} = 0.32 −→ do not trust (fall back to shorter contexts)
α_{president ronald} = 0.07 −→ trust

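The α computation itself is one line; a sketch using the counts from the table:

    def witten_bell_alpha(unique_followers, context_count):
        # alpha = u / (u + c): more unique followers -> trust P_ML less.
        return unique_followers / (unique_followers + context_count)

    print(witten_bell_alpha(52, 110))  # "president was"    -> ~0.32
    print(witten_bell_alpha(3, 40))    # "president ronald" -> ~0.07
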
Absolute discounting method

Subtract a constant discount d from the n-gram frequencies:

P_d(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max(c_{train}(e_{i-n+1}^i) - d, 0)}{c_{train}(e_{i-n+1}^{i-1})}

This downweights rare n-grams

[figure: ratio of discounted to raw counts as a function of the count, for d = 0.1, 0.5, 1.0, 2.0]

Absolute discounting method

The discounted probability mass is given to shorter n-grams:

\alpha_{e_{i-n+1}^{i-1}} = 1 - \sum_{e_i} P_d(e_i \mid e_{i-n+1}^{i-1})

P(e_i \mid e_{i-n+1}^{i-1}) = P_d(e_i \mid e_{i-n+1}^{i-1}) + \alpha_{e_{i-n+1}^{i-1}} P(e_i \mid e_{i-n+2}^{i-1})

e.g. d := 0.5 (normally chosen to maximize the likelihood of a development set)

P_d(reagan | president ronald) = (38 − 0.5) / 40 = 0.9375
P_d(caza | president ronald) = (1 − 0.5) / 40 = 0.0125
P_d(venetiaan | president ronald) = (1 − 0.5) / 40 = 0.0125

\alpha_{president\ ronald} = 1 - \sum_{e_i} P_d(e_i \mid president\ ronald) = 0.0375

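The same arithmetic as a short sketch (counts taken from the “president ronald” column of the earlier table):

    def discounted(count, context_count, d=0.5):
        # P_d = max(c - d, 0) / c(context)
        return max(count - d, 0.0) / context_count

    counts = {"reagan": 38, "caza": 1, "venetiaan": 1}  # words after "president ronald"
    context_count = 40
    p_d = {w: discounted(c, context_count) for w, c in counts.items()}
    alpha = 1.0 - sum(p_d.values())
    print(p_d)    # {'reagan': 0.9375, 'caza': 0.0125, 'venetiaan': 0.0125}
    print(alpha)  # 0.0375 -> probability mass handed to the shorter context
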
Kneser-Ney method

Idea: “ronald reagan” and “president reagan” appear frequently in corpora, so ordinary smoothing methods assign a large lower-order probability to “ronald” and “reagan”. However, “reagan” hardly appears in any other context.

Kneser and Ney replace the raw counts in absolute discounting with the unique-context counter u:

P_{kn}(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max(u(\cdot, e_{i-n+1}^i) - d, 0)}{u(\cdot, e_{i-n+1}^{i-1}, \cdot)}

\alpha_{e_{i-n+1}^{i-1}} = 1 - \sum_{e_i} P_{kn}(e_i \mid e_{i-n+1}^{i-1})

Kneser-Ney method (example)

u(·, reagan) = 2    u(·, ronald reagan) = 10
u(·, ronald smith) = 1    u(·, ronald, ·) = 11
u(·, ·) = 2000    d = 0.5

P_{kn}(reagan | ronald) = \frac{\max(u(\cdot, ronald\ reagan) - d, 0)}{u(\cdot, ronald, \cdot)} = 0.864

P_{kn}(smith | ronald) = \frac{\max(u(\cdot, ronald\ smith) - d, 0)}{u(\cdot, ronald, \cdot)} = 0.045

P_{kn}(reagan) = \frac{\max(u(\cdot, reagan) - d, 0)}{u(\cdot, \cdot)} = 0.00075

\alpha_{ronald} = 1 - \sum_{e_i} P_{kn}(e_i \mid ronald) = 0.091

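A sketch reproducing these numbers from the unique-count table:

    d = 0.5
    u = {                            # unique-count table from the slide
        ("ronald", "reagan"): 10,    # u(., ronald reagan)
        ("ronald", "smith"): 1,      # u(., ronald smith)
        ("reagan",): 2,              # u(., reagan)
    }
    u_ronald_dot = 11                # u(., ronald, .)
    u_dot_dot = 2000                 # u(., .)

    p_reagan = max(u[("ronald", "reagan")] - d, 0) / u_ronald_dot  # ~0.864
    p_smith = max(u[("ronald", "smith")] - d, 0) / u_ronald_dot    # ~0.045
    p_uni_reagan = max(u[("reagan",)] - d, 0) / u_dot_dot          # 0.00075
    alpha_ronald = 1 - (p_reagan + p_smith)                        # ~0.091
    print(p_reagan, p_smith, p_uni_reagan, alpha_ronald)
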
Other methods

Additive smoothing:

P_\delta(e_i \mid e_{i-n+1}^{i-1}) = \frac{c_{train}(e_{i-n+1}^i) + \delta}{c_{train}(e_{i-n+1}^{i-1}) + \delta |W|}

where |W| is the number of word types (for normalization)

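A sketch of add-δ smoothing (δ and |W| are placeholder values; the counts reuse the “president ronald” example):

    def p_additive(ngram_count, context_count, delta=0.1, vocab_size=10000):
        # (c(ngram) + delta) / (c(context) + delta * |W|)
        return (ngram_count + delta) / (context_count + delta * vocab_size)

    print(p_additive(0, 40))   # unseen n-gram still gets a small probability
    print(p_additive(38, 40))  # frequent n-gram; note how much mass delta*|W| removes
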
Other methods

Good-Turing (named after the statistician I. J. Good)

The Turing estimator replaces each raw count r with a revised count r*:

r^* = (r + 1) \frac{N_{r+1}}{N_r}

where N_r is the number of words occurring r times

If N_r = 0, r^* is undefined

The Good-Turing estimator solves this with a linear regression based on Zipf's law:

Z_{r_i} := \frac{2 N_{r_i}}{r_{i+1} - r_{i-1}}

where r_i is the i-th non-zero count (r_1 < r_2 < r_3 < · · ·)

Other methods

Good-Turing (cont'd)

Estimate a and b by regression:

\log Z_{r_i} \sim a + b \log r_i

r^* = (r + 1) \frac{Z_{r+1}}{Z_r} = r \left(1 + \frac{1}{r}\right)^{b+1}

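A sketch of the whole Good-Turing fit: compute Z_r (using the common boundary conventions r_0 = 0 and r_{k+1} = 2 r_k − r_{k−1}, an assumption here), fit log Z_r = a + b log r by least squares, and apply the closed form for r*. The count-of-counts table is invented:

    import math

    # N_r: number of word types occurring r times (hypothetical values).
    count_of_counts = {1: 120, 2: 40, 5: 12, 10: 4}
    rs = sorted(count_of_counts)

    # Z_{r_i} = 2 N_{r_i} / (r_{i+1} - r_{i-1}).
    z = {}
    for i, r in enumerate(rs):
        lo = rs[i - 1] if i > 0 else 0
        hi = rs[i + 1] if i + 1 < len(rs) else 2 * r - lo
        z[r] = 2 * count_of_counts[r] / (hi - lo)

    # Least-squares fit of log Z = a + b log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(z[r]) for r in rs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    def r_star(r):
        # r* = r (1 + 1/r)^(b+1), well defined even where N_{r+1} = 0.
        return r * (1 + 1 / r) ** (b + 1)

    print(b, r_star(1), r_star(5))
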
Other methods

Back-off: use shorter n-grams only when the longer n-gram does not appear in the training data

Absolute discounting with back-off:

P(e_i \mid e_{i-n+1}^{i-1}) =
\begin{cases}
P_d(e_i \mid e_{i-n+1}^{i-1}) & (c(e_{i-n+1}^i) > 0) \\
\beta_{e_{i-n+1}^{i-1}} P(e_i \mid e_{i-n+2}^{i-1}) & (\text{otherwise})
\end{cases}

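A structural sketch of the back-off recursion; the count tables and β normalizers are assumed to be precomputed, and the tiny tables below exist only to exercise the seen-n-gram branch:

    def p_backoff(word, context, ngrams, contexts, beta, d=0.5):
        c_full = ngrams.get(context + (word,), 0)
        if c_full > 0:
            # The longer n-gram was seen: use its discounted probability.
            return (c_full - d) / contexts[context]
        if not context:
            return 0.0  # in practice, an unknown-word floor such as 1/|V|
        # Otherwise back off to the shorter context, scaled by beta.
        return beta.get(context, 1.0) * p_backoff(word, context[1:], ngrams, contexts, beta, d)

    ngrams = {("president", "ronald", "reagan"): 38}
    contexts = {("president", "ronald"): 40}
    print(p_backoff("reagan", ("president", "ronald"), ngrams, contexts, beta={}))  # 0.9375
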