If Beam Search is the Answer, What was the Question?
NLP Team: 박희수 (presenter), 백지윤, 신동진, 진명훈
Motivation
Maximum A Posteriori (MAP) for Machine Translation

Korean input sequence x: "나는 오늘 저녁에 파스타를 먹었다" ("I had pasta this evening")
→ What is the most probable English output sequence?
How can we measure the probability?
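Written out, the MAP question asks which output sequence y has the highest probability under the model given the source x. In the notation used later in this deck (a standard formulation, not copied verbatim from the slides):

y^{\star} \;=\; \operatorname*{argmax}_{y \in \mathcal{Y}} \; \log p_{\theta}(y \mid x)
        \;=\; \operatorname*{argmax}_{y \in \mathcal{Y}} \; \sum_{t=1}^{|y|} \log p_{\theta}\bigl(y_t \mid x, y_{<t}\bigr)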
Motivation
Maximum A Posteriori (MAP) for Machine Translation — this is an NP-hard problem!

Encoder input: "나는 오늘 저녁에 파스타를 먹었다" → Decoder
Exact decoding would have to compute every possible combination of vocabulary items over every possible output length (1…n) to find the sentence that maximizes $\log p_\theta(y \mid x)$.
[Figure: encoder–decoder diagram enumerating the full vocabulary (I, We, Am, …, Dinner, pasta) at every output position]
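To see why exhaustive MAP decoding is intractable, count the candidates: every vocabulary item can appear at every position, for every possible length up to n (the numbers below are illustrative, not from the slides):

|\mathcal{Y}| \;=\; \sum_{\ell=1}^{n} |\mathcal{V}|^{\ell} \;\approx\; |\mathcal{V}|^{n}
\qquad \text{e.g. } |\mathcal{V}| = 30{,}000,\; n = 20 \;\Rightarrow\; \text{on the order of } 10^{89} \text{ candidates}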
Greedy Search
Selects the single best candidate at each time step and feeds it back as the input for the next step.

Encoder input: "나는 오늘 저녁에 파스타를 먹었다" → Decoder
At every step, select the word that maximizes $p_\theta(y_t \mid x, y_{<t})$:
• Step 1: <START> → "I"
• Step 2: <START> I → "am"
• … Step n: continue until <END> is produced.
[Figure: the full vocabulary is scored at each step; only the single highest-probability word is kept]
It may be a sub-optimal choice: a locally best word can lead to a globally worse sentence.
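As a minimal sketch, greedy decoding can be written as follows. The scoring function next_token_logprobs is a hypothetical stand-in for a trained encoder–decoder model (it is not a fairseq API); everything else is plain Python.

# Hypothetical stand-in for a trained encoder-decoder model: given the source
# sentence and the partial hypothesis so far, return {token: log-probability}
# for the next position.
def next_token_logprobs(src, prefix):
    raise NotImplementedError

def greedy_decode(src, max_len=50):
    """Greedy search: at every step keep only the single most probable token."""
    hyp, score = ["<START>"], 0.0
    for _ in range(max_len):
        logprobs = next_token_logprobs(src, hyp)
        token, lp = max(logprobs.items(), key=lambda kv: kv[1])  # locally best word
        hyp.append(token)
        score += lp
        if token == "<END>":
            break
    return hyp, score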
Beam Search
Selects multiple alternatives for the input sequence at each time step, based on conditional probability.

Encoder input: "나는 오늘 저녁에 파스타를 먹었다" → Decoder
Beam size: 2. At every step, select the beam-size words that maximize $p_\theta(y_t \mid x, y_{<t})$:
• Step 1: <START> → "I", "We"
[Figure: the full vocabulary is scored at step 1; the two highest-probability words are kept]
Beam Search: worked example (beam size 2)
1. Take the two words selected in step 1 ("I", "We") as inputs to the second step, and find the best continuation of each ("I had", "We ate").
2. Also evaluate the probability of the other pairings ("We had", "I ate").
3. Keep the top two first- and second-word pair combinations ("I had", "I ate").
4. Drop any first word that no longer appears in a kept pair ("We").
5. Continue this process until <END>, keeping the candidate sentences with the highest probability:
• I ate pasta this evening → 0.57
• I had pasta this evening → 0.63
6. Pick the sentence with the highest probability:
• I had pasta this evening
[Figure: step-by-step encoder–decoder diagram of beam search with beam size 2, scoring the full vocabulary for each kept prefix at every step]
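The procedure above can be sketched in a few lines of Python, reusing the hypothetical next_token_logprobs(src, prefix) scoring function from the greedy-search sketch; this is an illustrative implementation, not the one used in the paper or in fairseq.

def beam_search_decode(src, k=2, max_len=50):
    """Beam search: keep the k highest-scoring partial hypotheses at every step."""
    beams = [(["<START>"], 0.0)]   # (hypothesis, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every kept prefix with every vocabulary item and score it.
        candidates = []
        for hyp, score in beams:
            for token, lp in next_token_logprobs(src, hyp).items():
                candidates.append((hyp + [token], score + lp))
        # Keep only the k best prefix/continuation combinations overall;
        # prefixes that appear in no kept pair are dropped automatically.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:k]:
            (finished if hyp[-1] == "<END>" else beams).append((hyp, score))
        if not beams:
            break
    # Pick the completed sentence with the highest probability.
    return max(finished or beams, key=lambda c: c[1])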
Beam Search
Beam search incurs a large search error and scores poorly on the MAP objective. However:
• Increasing the beam size beyond 5 (which yields a more exact answer to the objective) can hurt performance on downstream evaluation metrics.
• In many cases the global optimum of the MAP objective is the empty string, so exact search may fail to produce any text at all.
• Despite this, beam search typically generates well-formed, coherent, human-like text from probabilistic models.
Beam Search is the Answer!!! Why?? What was the Question?
Alternative Decoding Objectives
• An objective term that rewards longer outputs
• An objective term that rewards coverage of the input words, computed from the attention mechanism of an encoder–decoder model
➔ While such methods help obtain state-of-the-art results in neural MT, text quality still degrades with increased beam sizes even when these rewards are used, which suggests they do not address the inherent issues with text generation systems. One common instance of such rewards is sketched below.
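As a concrete (and widely used) example of these rewards, the length normalization and coverage penalty of Wu et al. (2016) can be written as follows; the exact variants referenced by the paper may differ in detail:

s(y, x) \;=\; \frac{\log p_{\theta}(y \mid x)}{\mathrm{lp}(y)} \;+\; \mathrm{cp}(x, y),
\qquad
\mathrm{lp}(y) \;=\; \frac{(5 + |y|)^{\alpha}}{(5 + 1)^{\alpha}},
\qquad
\mathrm{cp}(x, y) \;=\; \beta \sum_{i=1}^{|x|} \log\Bigl(\min\Bigl(\sum_{j=1}^{|y|} a_{i,j},\; 1.0\Bigr)\Bigr)

where $a_{i,j}$ is the attention weight of the $j$-th target word on the $i$-th source word; a larger $\alpha$ rewards longer outputs and a larger $\beta$ rewards covering more of the source.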
Deriving Beam Search
Regularized decoding framework: insert the search error into the objective function.

Surprisal — an information-theoretic quantity that characterizes the amount of new information expressed at time t; we can use it to analyze the properties that beam search enforces on generated text:
$u_0(\mathrm{BOS}) = 0, \qquad u_t(y) = -\log p_\theta(y \mid x, y_{<t}) \quad \forall t \ge 1$

Regularized decoding objective:
$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \bigl(\log p_\theta(y \mid x) - \lambda \cdot R(y)\bigr)$

The error of greedy search results:
$R_{\mathrm{greedy}}(y) = \sum_{t=1}^{|y|} \Bigl(u_t(y_t) - \min_{y' \in \mathcal{V}} u_t(y')\Bigr)^2$

The error of beam search results, for set-valued hypotheses $Y$ of size $k$:
$R_{\mathrm{beam}}(Y) = \sum_{t=1}^{n_{\max}} \Bigl(u_t(Y_t) - \min_{Y' \subseteq \mathcal{B}_t,\, |Y'| = k} u_t(Y')\Bigr)^2,
\qquad
Y^\star = \operatorname*{argmax}_{Y \in \mathcal{Y},\, |Y| = k} \bigl(\log p_\theta(Y \mid x) - \lambda \cdot R(Y)\bigr)$

➔ Maximize the posterior and minimize the search error simultaneously.
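The point of inserting the search error into the objective is that greedy search (and, for the set-valued variant, beam search) falls out as a limiting case: as the regularizer weight grows, the penalty term dominates, and it is zero exactly when every step picks the locally most probable token. A sketch of the limit (my paraphrase of the derivation, not a quote from the paper):

\lim_{\lambda \to \infty}\;
\operatorname*{argmax}_{y \in \mathcal{Y}}
\bigl(\log p_{\theta}(y \mid x) \;-\; \lambda \cdot R_{\mathrm{greedy}}(y)\bigr)
\;=\; y^{\mathrm{greedy}},
\qquad\text{since } R_{\mathrm{greedy}}(y) = 0 \iff y_t = \operatorname*{argmin}_{y' \in \mathcal{V}} u_t(y') \;\; \forall t.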
From Beam Search to UID
Uniform Information Density (UID): speakers prefer to distribute information as evenly as possible across an utterance.

Example (grammaticality and information content are held constant):
• "How big is the family you cook for?" — the word "family" must simultaneously convey part of the noun phrase's internal contents and signal the onset of a relative clause.
• "How big is the family that you cook for?" — the same information is split across two words ("family" and "that"), so it is spread more uniformly.
From Beam Search to UID
The UID Bias in Beam Search
Experiments
Dataset & Model
• Train: IWSLT '14 De-En, WMT '14 En-Fr
• Test: Newstest '14 for WMT
• Model & training method: fairseq (https://github.com/pytorch/fairseq)

Generalized UID Decoding — candidate regularizers (sketched in code below):
• Variance regularizer: $R_{\mathrm{var}}(y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \bigl(u_t(y_t) - \mu\bigr)^2, \quad \mu = \frac{1}{|y|} \sum_{t=1}^{|y|} u_t(y_t)$
• Local consistency: $R_{\mathrm{local}}(y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \bigl(u_t(y_t) - u_{t-1}(y_{t-1})\bigr)^2$
• Max regularizer: $R_{\mathrm{max}}(y) = \max_{t \in \{1, \dots, |y|\}} u_t(y_t)$
• Squared regularizer: $R_{\mathrm{square}}(y) = \sum_{t=1}^{|y|} u_t(y_t)^2$
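A minimal sketch of the four regularizers, computed from a list of per-token surprisals $u_t(y_t) = -\log p_\theta(y_t \mid x, y_{<t})$ with $u_0(\mathrm{BOS}) = 0$; the function names are my own, not the paper's or fairseq's.

def variance_reg(surprisals):
    """R_var: variance of the per-token surprisals u_t(y_t)."""
    mu = sum(surprisals) / len(surprisals)
    return sum((u - mu) ** 2 for u in surprisals) / len(surprisals)

def local_consistency_reg(surprisals):
    """R_local: mean squared difference between consecutive surprisals,
    taking u_0(BOS) = 0 for the first step."""
    prev, total = 0.0, 0.0
    for u in surprisals:
        total += (u - prev) ** 2
        prev = u
    return total / len(surprisals)

def max_reg(surprisals):
    """R_max: the single largest surprisal in the sequence."""
    return max(surprisals)

def squared_reg(surprisals):
    """R_square: sum of squared surprisals; also penalizes globally high surprisal."""
    return sum(u ** 2 for u in surprisals)

# Example: an uneven surprisal profile gets a larger variance penalty than a flat one,
# e.g. variance_reg([0.5, 4.0, 0.5, 4.0]) > variance_reg([2.2, 2.3, 2.2, 2.3]).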
Experiments
Cognitive Motivation for Beam Search
BLEU ∝ 1 / (std. dev. of surprisals): translations whose surprisal is spread more uniformly tend to score higher.

Exact decoding over the regularized objective
$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \bigl(\log p_\theta(y \mid x) - \lambda \cdot R_{\mathrm{greedy}}(y)\bigr)$
versus beam search decoding over the unregularized objective
$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \log p_\theta(y \mid x)$
• Increasing the strength of the regularizer appears to alleviate the text-quality degradation seen with exact search, leading to results that approach the BLEU of text generated using optimal beam search.
Experiments
Regularized Beam Search
• The greedy and squared regularizers aid performance at larger beam sizes more than the other regularizers.
• Although variance and local consistency are the purest encodings of UID, they perform the poorest of the regularizers.
• This may be because, unlike the other regularizers, they do not simultaneously penalize high surprisal.
Experiments
Combination of UID regularizers (Greedy + Squared)
• Combining multiple UID regularizers does not lead to as great an increase in performance as one might expect, which hints that a single method for enforcing UID is sufficient for promoting quality in generated text.
Thank you