If Beam Search is the Answer, What was the Question?
NLP Team: 박희수 (presenter), 백지윤, 신동진, 진명훈
Motivation
Maximum A Posteriori (MAP) for Machine Translation

Korean input sequence x: "나는 오늘 저녁에 파스타를 먹었다" ("I had pasta this evening")
→ What is the most probable English output sequence?
How can we measure the probability?
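Written out, the MAP question asks which output sequence y has the highest probability under the model given the source x. In the notation used later in this deck (a standard formulation, not copied verbatim from the slides):

y^{\star} \;=\; \operatorname*{argmax}_{y \in \mathcal{Y}} \; \log p_{\theta}(y \mid x)
        \;=\; \operatorname*{argmax}_{y \in \mathcal{Y}} \; \sum_{t=1}^{|y|} \log p_{\theta}\bigl(y_t \mid x, y_{<t}\bigr)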
Motivation
Maximum A Posteriori (MAP) for Machine Translation — this is an NP-hard problem!

Encoder input: "나는 오늘 저녁에 파스타를 먹었다" → Decoder
Exact decoding would have to compute every possible combination of vocabulary items over every possible output length (1…n) to find the sentence that maximizes $\log p_\theta(y \mid x)$.
[Figure: encoder–decoder diagram enumerating the full vocabulary (I, We, Am, …, Dinner, pasta) at every output position]
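To see why exhaustive MAP decoding is intractable, count the candidates: every vocabulary item can appear at every position, for every possible length up to n (the numbers below are illustrative, not from the slides):

|\mathcal{Y}| \;=\; \sum_{\ell=1}^{n} |\mathcal{V}|^{\ell} \;\approx\; |\mathcal{V}|^{n}
\qquad \text{e.g. } |\mathcal{V}| = 30{,}000,\; n = 20 \;\Rightarrow\; \text{on the order of } 10^{89} \text{ candidates}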
Greedy Search
Selects the single best candidate at each time step and feeds it back as the input for the next step.

Encoder input: "나는 오늘 저녁에 파스타를 먹었다" → Decoder
At every step, select the word that maximizes $p_\theta(y_t \mid x, y_{<t})$:
• Step 1: <START> → "I"
• Step 2: <START> I → "am"
• … Step n: continue until <END> is produced.
[Figure: the full vocabulary is scored at each step; only the single highest-probability word is kept]
It may be a sub-optimal choice: a locally best word can lead to a globally worse sentence.
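As a minimal sketch, greedy decoding can be written as follows. The scoring function next_token_logprobs is a hypothetical stand-in for a trained encoder–decoder model (it is not a fairseq API); everything else is plain Python.

# Hypothetical stand-in for a trained encoder-decoder model: given the source
# sentence and the partial hypothesis so far, return {token: log-probability}
# for the next position.
def next_token_logprobs(src, prefix):
    raise NotImplementedError

def greedy_decode(src, max_len=50):
    """Greedy search: at every step keep only the single most probable token."""
    hyp, score = ["<START>"], 0.0
    for _ in range(max_len):
        logprobs = next_token_logprobs(src, hyp)
        token, lp = max(logprobs.items(), key=lambda kv: kv[1])  # locally best word
        hyp.append(token)
        score += lp
        if token == "<END>":
            break
    return hyp, score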
Beam Search
Selects multiple alternatives for the input sequence at each time step, based on conditional probability.

Encoder input: "나는 오늘 저녁에 파스타를 먹었다" → Decoder
Beam size: 2. At every step, select the beam-size words that maximize $p_\theta(y_t \mid x, y_{<t})$:
• Step 1: <START> → "I", "We"
[Figure: the full vocabulary is scored at step 1; the two highest-probability words are kept]
Beam Search: worked example (beam size 2)
1. Take the two words selected in step 1 ("I", "We") as inputs to the second step, and find the best continuation of each ("I had", "We ate").
2. Also evaluate the probability of the other pairings ("We had", "I ate").
3. Keep the top two first- and second-word pair combinations ("I had", "I ate").
4. Drop any first word that no longer appears in a kept pair ("We").
5. Continue this process until <END>, keeping the candidate sentences with the highest probability:
• I ate pasta this evening → 0.57
• I had pasta this evening → 0.63
6. Pick the sentence with the highest probability:
• I had pasta this evening
[Figure: step-by-step encoder–decoder diagram of beam search with beam size 2, scoring the full vocabulary for each kept prefix at every step]
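The procedure above can be sketched in a few lines of Python, reusing the hypothetical next_token_logprobs(src, prefix) scoring function from the greedy-search sketch; this is an illustrative implementation, not the one used in the paper or in fairseq.

def beam_search_decode(src, k=2, max_len=50):
    """Beam search: keep the k highest-scoring partial hypotheses at every step."""
    beams = [(["<START>"], 0.0)]   # (hypothesis, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every kept prefix with every vocabulary item and score it.
        candidates = []
        for hyp, score in beams:
            for token, lp in next_token_logprobs(src, hyp).items():
                candidates.append((hyp + [token], score + lp))
        # Keep only the k best prefix/continuation combinations overall;
        # prefixes that appear in no kept pair are dropped automatically.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:k]:
            (finished if hyp[-1] == "<END>" else beams).append((hyp, score))
        if not beams:
            break
    # Pick the completed sentence with the highest probability.
    return max(finished or beams, key=lambda c: c[1])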
Beam Search
Beam search incurs a large search error and scores poorly on the MAP objective. However:
• Increasing the beam size beyond 5 (which yields a more exact answer to the objective) can hurt performance on downstream evaluation metrics.
• In many cases the global optimum of the MAP objective is the empty string, so exact search may fail to produce any text at all.
• Despite this, beam search typically generates well-formed, coherent, human-like text from probabilistic models.
Beam Search is the Answer!!! Why?? What was the Question?
Alternative Decoding Objectives
• An objective term that rewards longer outputs
• An objective term that rewards coverage of the input words, computed from the attention mechanism of an encoder–decoder model
➔ While such methods help obtain state-of-the-art results in neural MT, text quality still degrades with increased beam sizes even when these rewards are used, which suggests they do not address the inherent issues with text generation systems. One common instance of such rewards is sketched below.
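As a concrete (and widely used) example of these rewards, the length normalization and coverage penalty of Wu et al. (2016) can be written as follows; the exact variants referenced by the paper may differ in detail:

s(y, x) \;=\; \frac{\log p_{\theta}(y \mid x)}{\mathrm{lp}(y)} \;+\; \mathrm{cp}(x, y),
\qquad
\mathrm{lp}(y) \;=\; \frac{(5 + |y|)^{\alpha}}{(5 + 1)^{\alpha}},
\qquad
\mathrm{cp}(x, y) \;=\; \beta \sum_{i=1}^{|x|} \log\Bigl(\min\Bigl(\sum_{j=1}^{|y|} a_{i,j},\; 1.0\Bigr)\Bigr)

where $a_{i,j}$ is the attention weight of the $j$-th target word on the $i$-th source word; a larger $\alpha$ rewards longer outputs and a larger $\beta$ rewards covering more of the source.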
Deriving Beam Search
Regularized decoding framework: insert the search error into the objective function.

Surprisal — an information-theoretic quantity that characterizes the amount of new information expressed at time t; we can use it to analyze the properties that beam search enforces on generated text:
$u_0(\mathrm{BOS}) = 0, \qquad u_t(y) = -\log p_\theta(y \mid x, y_{<t}) \quad \forall t \ge 1$

Regularized decoding objective:
$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \bigl(\log p_\theta(y \mid x) - \lambda \cdot R(y)\bigr)$

The error of greedy search results:
$R_{\mathrm{greedy}}(y) = \sum_{t=1}^{|y|} \Bigl(u_t(y_t) - \min_{y' \in \mathcal{V}} u_t(y')\Bigr)^2$

The error of beam search results, for set-valued hypotheses $Y$ of size $k$:
$R_{\mathrm{beam}}(Y) = \sum_{t=1}^{n_{\max}} \Bigl(u_t(Y_t) - \min_{Y' \subseteq \mathcal{B}_t,\, |Y'| = k} u_t(Y')\Bigr)^2,
\qquad
Y^\star = \operatorname*{argmax}_{Y \in \mathcal{Y},\, |Y| = k} \bigl(\log p_\theta(Y \mid x) - \lambda \cdot R(Y)\bigr)$

➔ Maximize the posterior and minimize the search error simultaneously.
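The point of inserting the search error into the objective is that greedy search (and, for the set-valued variant, beam search) falls out as a limiting case: as the regularizer weight grows, the penalty term dominates, and it is zero exactly when every step picks the locally most probable token. A sketch of the limit (my paraphrase of the derivation, not a quote from the paper):

\lim_{\lambda \to \infty}\;
\operatorname*{argmax}_{y \in \mathcal{Y}}
\bigl(\log p_{\theta}(y \mid x) \;-\; \lambda \cdot R_{\mathrm{greedy}}(y)\bigr)
\;=\; y^{\mathrm{greedy}},
\qquad\text{since } R_{\mathrm{greedy}}(y) = 0 \iff y_t = \operatorname*{argmin}_{y' \in \mathcal{V}} u_t(y') \;\; \forall t.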
From Beam Search to UID
Uniform Information Density (UID): speakers prefer to distribute information as evenly as possible across an utterance.

Example (grammaticality and information content are held constant):
• "How big is the family you cook for?" — the word "family" must simultaneously convey part of the noun phrase's internal contents and signal the onset of a relative clause.
• "How big is the family that you cook for?" — the same information is split across two words ("family" and "that"), so it is spread more uniformly.
From Beam Search to UID
The UID Bias in Beam Search
Experiments
Dataset & Model
• Train: IWSLT '14 De-En, WMT '14 En-Fr
• Test: Newstest '14 for WMT
• Model & training method: fairseq (https://github.com/pytorch/fairseq)

Generalized UID Decoding — candidate regularizers (sketched in code below):
• Variance regularizer: $R_{\mathrm{var}}(y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \bigl(u_t(y_t) - \mu\bigr)^2, \quad \mu = \frac{1}{|y|} \sum_{t=1}^{|y|} u_t(y_t)$
• Local consistency: $R_{\mathrm{local}}(y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \bigl(u_t(y_t) - u_{t-1}(y_{t-1})\bigr)^2$
• Max regularizer: $R_{\mathrm{max}}(y) = \max_{t \in \{1, \dots, |y|\}} u_t(y_t)$
• Squared regularizer: $R_{\mathrm{square}}(y) = \sum_{t=1}^{|y|} u_t(y_t)^2$
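A minimal sketch of the four regularizers, computed from a list of per-token surprisals $u_t(y_t) = -\log p_\theta(y_t \mid x, y_{<t})$ with $u_0(\mathrm{BOS}) = 0$; the function names are my own, not the paper's or fairseq's.

def variance_reg(surprisals):
    """R_var: variance of the per-token surprisals u_t(y_t)."""
    mu = sum(surprisals) / len(surprisals)
    return sum((u - mu) ** 2 for u in surprisals) / len(surprisals)

def local_consistency_reg(surprisals):
    """R_local: mean squared difference between consecutive surprisals,
    taking u_0(BOS) = 0 for the first step."""
    prev, total = 0.0, 0.0
    for u in surprisals:
        total += (u - prev) ** 2
        prev = u
    return total / len(surprisals)

def max_reg(surprisals):
    """R_max: the single largest surprisal in the sequence."""
    return max(surprisals)

def squared_reg(surprisals):
    """R_square: sum of squared surprisals; also penalizes globally high surprisal."""
    return sum(u ** 2 for u in surprisals)

# Example: an uneven surprisal profile gets a larger variance penalty than a flat one,
# e.g. variance_reg([0.5, 4.0, 0.5, 4.0]) > variance_reg([2.2, 2.3, 2.2, 2.3]).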
Experiments
Cognitive Motivation for Beam Search
BLEU ∝ 1 / (std. dev. of surprisals): translations whose surprisal is spread more uniformly tend to score higher.

Exact decoding over the regularized objective
$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \bigl(\log p_\theta(y \mid x) - \lambda \cdot R_{\mathrm{greedy}}(y)\bigr)$
versus beam search decoding over the unregularized objective
$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \log p_\theta(y \mid x)$
• Increasing the strength of the regularizer appears to alleviate the text-quality degradation seen with exact search, leading to results that approach the BLEU of text generated using optimal beam search.
Experiments
Regularized Beam Search
• The greedy and squared regularizers aid performance at larger beam sizes more than the other regularizers.
• Although variance and local consistency are the purest encodings of UID, they perform the poorest of the regularizers.
• This may be because, unlike the other regularizers, they do not simultaneously penalize high surprisal.
Experiments
Combination of UID regularizers (Greedy + Squared)
• Combining multiple UID regularizers does not lead to as great an increase in performance as one might expect, which hints that a single method for enforcing UID is sufficient for promoting quality in generated text.
Thank you