2. Automatic Speech Recognition is widely used in industry
● Teleconferencing, simultaneous translation, AI speakers, AI assistants, etc.
[Figure: (a) Teleconference (b) AI Speaker ("Alexa!")]
3. Online Automatic Speech Recognition
● Offline Automatic Speech Recognition (ASR) must wait for the speech to end before recognizing it.
● Online ASR recognizes the speech simultaneously, while it is being spoken.
[Figure: time axis comparing offline and online ASR on the utterance "Play songs in the playlist" — online ASR emits its output faster]
4. Motivation
● Monotonic Multihead Attention (MMA) shows performance comparable to the SOTA online ASR methods, but there is room for reducing its latency.
● HeadDrop and Head-Synchronous Beam Search Decoding reduce the latency of MMA, but they leave a gap between the training and testing phases.
In this work,
● We propose Mutually-Constrained MMA (MCMMA) to fill this gap.
● We improve performance with only a small increase in latency.
6. Monotonic Attention (Inference)
● Monotonic Attention (MA) is an online attention mechanism that learns monotonic alignments in an end-to-end manner.
● Notation: the input sequence (x_1, …, x_T′), the encoder states (h_1, …, h_T), the output sequence (y_1, …, y_L), and the (i−1)-th decoder state s_{i−1}.
[1] Colin Raffel et al., “Online and linear-time attention by enforcing monotonic alignments,” in Proc. of ICML (PMLR), 2017.
To compute the next decoder state s_i:
e_{i,j} = Energy(s_{i−1}, h_j)
p_{i,j} = σ(e_{i,j})
If p_{i,j} ≥ 0.5 at frame j (scanning from the left), then c_i = h_j.
With c_i, produce s_i.
[Figure: testing phase — the decoder scans memory h (axis j) to produce output y (axis i); p_{1,1} < 0.5 is skipped, p_{1,2} ≥ 0.5 selects frame 2]
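The test-time selection rule above can be sketched in a few lines of Python. This is an illustrative sketch with precomputed energies, not the authors' implementation; the function name is ours:

```python
import numpy as np

def monotonic_attention_inference(energy, start=0):
    """Hard monotonic selection at test time (illustrative sketch).

    energy: 1-D array of energies e_{i,j} for one decoder step i,
            already computed from s_{i-1} and the encoder states h_j.
    Scanning rightwards from the previously selected frame `start`,
    stop at the first frame j with p_{i,j} = sigmoid(e_{i,j}) >= 0.5
    and attend to h_j (i.e., c_i = h_j). Returns the selected index,
    or None if no frame is selected.
    """
    for j in range(start, len(energy)):
        p = 1.0 / (1.0 + np.exp(-energy[j]))  # p_{i,j} = sigma(e_{i,j})
        if p >= 0.5:  # equivalent to e_{i,j} >= 0
            return j
    return None
```

Since σ(e) ≥ 0.5 exactly when e ≥ 0, the comparison could equally be done on the raw energy.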
7. Monotonic Attention (Training)
To compute the next decoder state s_i:
e_{i,j} = Energy(s_{i−1}, h_j)
p_{i,j} = σ(e_{i,j})
Calculate the attention distribution α_{i,j}:
α_{i,j} = p_{i,j} Σ_{k=1}^{j} α_{i−1,k} Π_{l=k}^{j−1} (1 − p_{i,l})
c_i = Σ_{j=1}^{T} α_{i,j} h_j
[Figure: training phase — instead of a hard selection, a soft expected alignment over memory h (axis j) is used to produce output y (axis i)]
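The training-time expected alignment can be computed directly from the recurrence above. A minimal NumPy sketch with our own naming, unoptimized (O(T²) per step; real implementations use a cumulative-product reformulation):

```python
import numpy as np

def expected_alignment(p, alpha_prev):
    """Expected monotonic alignment for one decoder step (sketch).

    p:          (T,) selection probabilities p_{i,j}
    alpha_prev: (T,) previous-step alignment alpha_{i-1,:}
    Implements alpha_{i,j} = p_{i,j} * sum_{k<=j} alpha_{i-1,k}
                             * prod_{l=k}^{j-1} (1 - p_{i,l}).
    """
    T = len(p)
    alpha = np.zeros(T)
    for j in range(T):
        mass = sum(alpha_prev[k] * np.prod(1.0 - p[k:j]) for k in range(j + 1))
        alpha[j] = p[j] * mass
    return alpha

def context(alpha, h):
    """Soft context vector c_i = sum_j alpha_{i,j} h_j."""
    return alpha @ h
```

Note that when every p_{i,j} is 1, the alignment simply stays where it was, matching the hard test-time behavior.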
8. Monotonic Multihead Attention
● Monotonic Multihead Attention (MMA) extends MA to multihead attention by letting each head learn its own alignment.
● MMA incurs unnecessary delay because it cannot produce an output token until every head has made its selection.
[2] Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu,
“Monotonic multihead attention,” in Proc. of ICLR, 2020.
α^m_{i,j} = p^m_{i,j} Σ_{k=1}^{j} α^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l})
c^m_i = Σ_{j=1}^{T} α^m_{i,j} h_j
c_i = Concat(c^1_i, …, c^H_i)
[Figure: four heads scanning memory h — p^1_{1,1} < 0.5, p^2_{1,1} ≥ 0.5, p^3_{1,1} < 0.5, p^4_{1,1} < 0.5]
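Stacking the single-head recurrence over heads gives the MMA context. A sketch with our own naming, reusing the training-time recurrence per head:

```python
import numpy as np

def mma_context(p_heads, alpha_prev_heads, h):
    """Training-time MMA context (illustrative sketch).

    p_heads:          (H, T) per-head selection probabilities p^m_{i,j}
    alpha_prev_heads: (H, T) per-head previous alignments alpha^m_{i-1,:}
    h:                (T, d) encoder states
    Each head runs the monotonic-attention recurrence independently;
    the per-head contexts c^m_i are then concatenated into c_i.
    """
    H, T = p_heads.shape
    contexts = []
    for m in range(H):
        alpha = np.zeros(T)
        for j in range(T):
            mass = sum(alpha_prev_heads[m, k] * np.prod(1.0 - p_heads[m, k:j])
                       for k in range(j + 1))
            alpha[j] = p_heads[m, j] * mass
        contexts.append(alpha @ h)  # c^m_i = sum_j alpha^m_{i,j} h_j
    return np.concatenate(contexts)  # c_i = Concat(c^1_i, ..., c^H_i)
```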
9. HeadDrop & Head-Synchronous Beam Search Decoding
● To decrease the latency of MMA, HeadDrop and Head-Synchronous Beam Search Decoding (HSD) are introduced.
ꟷ HeadDrop drops heads stochastically, as in Dropout.
ꟷ HSD forces late heads to select the rightmost selected frame.
● However, there is still a gap between the training and testing phases.
[3] Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, “Enhancing monotonic multihead attention for streaming ASR,” in Proc. of Interspeech, 2020, pp. 2137–2141.
[Figure: (a) MMA vs. (b) Head-Synchronous Decoding — head m, right bound, and activation over memory h, with waiting threshold ε = 3]
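The head-synchronous rule can be illustrated with a much-simplified test-time sketch. This is our own simplification of HSD; beam search and per-step bookkeeping are omitted:

```python
import numpy as np

def head_synchronous_select(p_heads, start=0, eps=3):
    """Head-synchronous frame selection for one output token (sketch).

    p_heads: (H, T) per-head selection probabilities at this decoder step.
    start:   frame index to scan from.
    eps:     waiting threshold.
    Each head scans rightwards for the first frame with p >= 0.5. Once the
    fastest head fires at frame j_fast, the others may scan only up to
    j_fast + eps; any head still unfired there is forced to the rightmost
    frame selected by the other heads, so decoding never stalls on a
    slow head.
    """
    H, T = p_heads.shape
    picks = [next((j for j in range(start, T) if p_heads[m, j] >= 0.5), None)
             for m in range(H)]
    fired = [j for j in picks if j is not None]
    if not fired:
        return picks
    bound = min(fired) + eps                        # fastest head + threshold
    rightmost = max(j for j in fired if j <= bound)
    return [rightmost if (j is None or j > bound) else j for j in picks]
```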
11. The Overview of Mutually-Constrained MMA
[Figure: model overview — encoder states feed a decoder built from token embedding, 1D-convolution, SAN, MMA, and FFN blocks (×2 and ×4 repeats), followed by Linear & Softmax for prediction; MMA uses head-synchronous selection with waiting threshold ε = 3]
12. Mutually-Constrained Monotonic Multihead Attention
● Mutually-Constrained MMA (MCMMA) brings HSD into the training phase to fill the gap between the training and testing processes.
● In detail, we modify the attention distribution α of MMA to reflect HSD.
● Two cases need to be considered to bring HSD into the training process.
13. Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution, when none of the other heads has selected any frame up to the (j−ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD, when the fastest head selects the (j−ε)-th frame.
For Case 1:
The m-th head: p^m_{i,j} Σ_{k=1}^{j} δ^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l})
The other heads: Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
Probability: p^m_{i,j} Σ_{k=1}^{j} δ^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l}) · Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
14. Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution, when none of the other heads has selected any frame up to the (j−ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD, when the fastest head selects the (j−ε)-th frame.
For Case 2:
The m-th head: (1 − Σ_{k=1}^{j−1} δ^m_{i,k})
The other heads: Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε−1} δ^{m′}_{i,k}) − Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
Probability: (1 − Σ_{k=1}^{j−1} δ^m_{i,k}) · [Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε−1} δ^{m′}_{i,k}) − Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})]
15. Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution, when none of the other heads has selected any frame up to the (j−ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD, when the fastest head selects the (j−ε)-th frame.
Combining Case 1 and Case 2:
δ^m_{i,j} = p^m_{i,j} Σ_{k=1}^{j} δ^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l}) · Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
          + (1 − Σ_{k=1}^{j−1} δ^m_{i,k}) · [Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε−1} δ^{m′}_{i,k}) − Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})]
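A direct, unoptimized reading of the combined equation can be written as follows. The code and names are our own; it assumes ε ≥ 1 so the right-hand side only touches already-computed frames, and real training code would vectorize this:

```python
import numpy as np

def mcmma_alignment(p, delta_prev, eps=3):
    """Mutually-constrained alignment delta^m_{i,j} (illustrative sketch).

    p:          (H, T) per-head selection probabilities p^m_{i,j}
    delta_prev: (H, T) previous-step alignments delta^m_{i-1,:}
    Frames are processed left to right because delta_{i,:} of the other
    heads (up to frame j - eps) and of the head itself (up to j - 1)
    appear on the right-hand side.
    """
    H, T = p.shape
    delta = np.zeros((H, T))
    for j in range(T):                      # 0-based frame index
        for m in range(H):
            others = [n for n in range(H) if n != m]
            # Case 1: head m fires at frame j by its own recurrence, gated
            # by no other head having fired up to frame j - eps.
            own = p[m, j] * sum(delta_prev[m, k] * np.prod(1.0 - p[m, k:j])
                                for k in range(j + 1))
            gate = np.prod([1.0 - delta[n, :max(j - eps + 1, 0)].sum()
                            for n in others])
            case1 = own * gate
            # Case 2: head m has not fired yet, and the fastest other head
            # reaches frame j - eps exactly now, forcing head m to fire.
            unfired = 1.0 - delta[m, :j].sum()
            forced = (np.prod([1.0 - delta[n, :max(j - eps, 0)].sum()
                               for n in others])
                      - np.prod([1.0 - delta[n, :max(j - eps + 1, 0)].sum()
                                 for n in others]))
            delta[m, j] = case1 + unfired * forced
    return delta
```

With a single head, or with ε larger than T, the Case 2 bracket vanishes and δ reduces to the plain per-head MA alignment.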
19. Results on Librispeech 100-hour and AISHELL-1
● Our approach shows better performance than MMA and HeadDrop.
20. Trade-offs between Performance and Latency
● Our approach shows better performance than MMA and HeadDrop with a slight increase in latency, where
latency = (1 / L_min) Σ_{i=1}^{L_min} (b^hyp_i − b^ref_i)
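The latency metric is just an average boundary difference; a sketch under the natural reading of the formula (names ours):

```python
def boundary_latency(b_hyp, b_ref):
    """Average token-emission latency (sketch of the slide's formula).

    b_hyp, b_ref: per-token boundary frame indices of the hypothesis and
    the reference. Averages b^hyp_i - b^ref_i over the first L_min tokens,
    where L_min is the length of the shorter sequence.
    """
    l_min = min(len(b_hyp), len(b_ref))
    return sum(b_hyp[i] - b_ref[i] for i in range(l_min)) / l_min
```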
22. Conclusion
● We proposed the Mutually-Constrained MMA (MCMMA) algorithm to fill the gap between the training and testing phases.
● We brought HSD into the training phase by modifying the attention distribution.
● We improved performance with a small increase in latency on Librispeech 100-hour and AISHELL-1.
23. References
1) Colin Raffel et al., “Online and linear-time attention by enforcing monotonic alignments,” in Proc. of ICML (PMLR), 2017.
2) Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu, “Monotonic multihead attention,” in Proc. of ICLR, 2020.
3) Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, “Enhancing monotonic multihead attention for streaming ASR,” in Proc. of Interspeech, 2020, pp. 2137–2141.