2. Automatic Speech Recognition is widely used in industry
● Teleconferencing, simultaneous translation, AI speakers, AI assistants, etc.
[Figure: (a) Teleconference (b) AI Speaker ("Alexa!")]
3. Online Automatic Speech Recognition
● Offline Automatic Speech Recognition (ASR) must wait for the speech to end before recognizing it.
● Online ASR recognizes the speech simultaneously, while it is being spoken.
[Figure: time axis comparing offline and online ASR on the utterance "Play songs in the playlist" — online ASR emits its output faster]
4. Motivation
● Monotonic Multihead Attention (MMA) shows performance comparable to the SOTA online ASR methods, but there is room for reducing its latency.
● HeadDrop and Head-Synchronous Beam Search Decoding reduce the latency of MMA, but they leave a gap between the training and testing phases.
In this work,
● We propose Mutually-Constrained MMA (MCMMA) to fill this gap.
● We improve performance with only a small increase in latency.
6. Monotonic Attention (Inference)
● Monotonic Attention (MA) is an online attention mechanism that learns monotonic alignments in an end-to-end manner.
● Notation: the input sequence (x_1, …, x_T′), the encoder states (h_1, …, h_T), the output sequence (y_1, …, y_L), and the (i−1)-th decoder state s_{i−1}.
[1] Colin Raffel et al., “Online and linear-time attention by enforcing monotonic alignments,” in Proc. of ICML (PMLR), 2017.
To compute the next decoder state s_i:
e_{i,j} = Energy(s_{i−1}, h_j)
p_{i,j} = σ(e_{i,j})
If p_{i,j} ≥ 0.5 at frame j (scanning from the left), then c_i = h_j.
With c_i, produce s_i.
[Figure: testing phase — the decoder scans memory h (axis j) to produce output y (axis i); p_{1,1} < 0.5 is skipped, p_{1,2} ≥ 0.5 selects frame 2]
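The test-time selection rule above can be sketched in a few lines of Python. This is an illustrative sketch with precomputed energies, not the authors' implementation; the function name is ours:

```python
import numpy as np

def monotonic_attention_inference(energy, start=0):
    """Hard monotonic selection at test time (illustrative sketch).

    energy: 1-D array of energies e_{i,j} for one decoder step i,
            already computed from s_{i-1} and the encoder states h_j.
    Scanning rightwards from the previously selected frame `start`,
    stop at the first frame j with p_{i,j} = sigmoid(e_{i,j}) >= 0.5
    and attend to h_j (i.e., c_i = h_j). Returns the selected index,
    or None if no frame is selected.
    """
    for j in range(start, len(energy)):
        p = 1.0 / (1.0 + np.exp(-energy[j]))  # p_{i,j} = sigma(e_{i,j})
        if p >= 0.5:  # equivalent to e_{i,j} >= 0
            return j
    return None
```

Since σ(e) ≥ 0.5 exactly when e ≥ 0, the comparison could equally be done on the raw energy.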
7. Monotonic Attention (Training)
To compute the next decoder state s_i:
e_{i,j} = Energy(s_{i−1}, h_j)
p_{i,j} = σ(e_{i,j})
Calculate the attention distribution α_{i,j}:
α_{i,j} = p_{i,j} Σ_{k=1}^{j} α_{i−1,k} Π_{l=k}^{j−1} (1 − p_{i,l})
c_i = Σ_{j=1}^{T} α_{i,j} h_j
[Figure: training phase — instead of a hard selection, a soft expected alignment over memory h (axis j) is used to produce output y (axis i)]
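The training-time expected alignment can be computed directly from the recurrence above. A minimal NumPy sketch with our own naming, unoptimized (O(T²) per step; real implementations use a cumulative-product reformulation):

```python
import numpy as np

def expected_alignment(p, alpha_prev):
    """Expected monotonic alignment for one decoder step (sketch).

    p:          (T,) selection probabilities p_{i,j}
    alpha_prev: (T,) previous-step alignment alpha_{i-1,:}
    Implements alpha_{i,j} = p_{i,j} * sum_{k<=j} alpha_{i-1,k}
                             * prod_{l=k}^{j-1} (1 - p_{i,l}).
    """
    T = len(p)
    alpha = np.zeros(T)
    for j in range(T):
        mass = sum(alpha_prev[k] * np.prod(1.0 - p[k:j]) for k in range(j + 1))
        alpha[j] = p[j] * mass
    return alpha

def context(alpha, h):
    """Soft context vector c_i = sum_j alpha_{i,j} h_j."""
    return alpha @ h
```

Note that when every p_{i,j} is 1, the alignment simply stays where it was, matching the hard test-time behavior.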
8. Monotonic Multihead Attention
● Monotonic Multihead Attention (MMA) extends MA to multihead attention by letting each head learn its own alignment.
● MMA incurs unnecessary delay because it cannot produce an output token until every head has made its selection.
[2] Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu,
“Monotonic multihead attention,” in Proc. of ICLR, 2020.
α^m_{i,j} = p^m_{i,j} Σ_{k=1}^{j} α^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l})
c^m_i = Σ_{j=1}^{T} α^m_{i,j} h_j
c_i = Concat(c^1_i, …, c^H_i)
[Figure: four heads scanning memory h — p^1_{1,1} < 0.5, p^2_{1,1} ≥ 0.5, p^3_{1,1} < 0.5, p^4_{1,1} < 0.5]
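Stacking the single-head recurrence over heads gives the MMA context. A sketch with our own naming, reusing the training-time recurrence per head:

```python
import numpy as np

def mma_context(p_heads, alpha_prev_heads, h):
    """Training-time MMA context (illustrative sketch).

    p_heads:          (H, T) per-head selection probabilities p^m_{i,j}
    alpha_prev_heads: (H, T) per-head previous alignments alpha^m_{i-1,:}
    h:                (T, d) encoder states
    Each head runs the monotonic-attention recurrence independently;
    the per-head contexts c^m_i are then concatenated into c_i.
    """
    H, T = p_heads.shape
    contexts = []
    for m in range(H):
        alpha = np.zeros(T)
        for j in range(T):
            mass = sum(alpha_prev_heads[m, k] * np.prod(1.0 - p_heads[m, k:j])
                       for k in range(j + 1))
            alpha[j] = p_heads[m, j] * mass
        contexts.append(alpha @ h)  # c^m_i = sum_j alpha^m_{i,j} h_j
    return np.concatenate(contexts)  # c_i = Concat(c^1_i, ..., c^H_i)
```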
9. HeadDrop & Head-Synchronous Beam Search Decoding
● To decrease the latency of MMA, HeadDrop and Head-Synchronous Beam Search Decoding (HSD) are introduced.
ꟷ HeadDrop drops heads stochastically, as in Dropout.
ꟷ HSD forces late heads to select the rightmost selected frame.
● However, there is still a gap between the training and testing phases.
[3] Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, “Enhancing monotonic multihead attention for streaming ASR,” in Proc. of Interspeech, 2020, pp. 2137–2141.
[Figure: (a) MMA vs. (b) Head-Synchronous Decoding — head m, right bound, and activation over memory h, with waiting threshold ε = 3]
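The head-synchronous rule can be illustrated with a much-simplified test-time sketch. This is our own simplification of HSD; beam search and per-step bookkeeping are omitted:

```python
import numpy as np

def head_synchronous_select(p_heads, start=0, eps=3):
    """Head-synchronous frame selection for one output token (sketch).

    p_heads: (H, T) per-head selection probabilities at this decoder step.
    start:   frame index to scan from.
    eps:     waiting threshold.
    Each head scans rightwards for the first frame with p >= 0.5. Once the
    fastest head fires at frame j_fast, the others may scan only up to
    j_fast + eps; any head still unfired there is forced to the rightmost
    frame selected by the other heads, so decoding never stalls on a
    slow head.
    """
    H, T = p_heads.shape
    picks = [next((j for j in range(start, T) if p_heads[m, j] >= 0.5), None)
             for m in range(H)]
    fired = [j for j in picks if j is not None]
    if not fired:
        return picks
    bound = min(fired) + eps                        # fastest head + threshold
    rightmost = max(j for j in fired if j <= bound)
    return [rightmost if (j is None or j > bound) else j for j in picks]
```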
11. The Overview of Mutually-Constrained MMA
[Figure: model overview — encoder states feed a decoder built from token embedding, 1D-convolution, SAN, MMA, and FFN blocks (×2 and ×4 repeats), followed by Linear & Softmax for prediction; MMA uses head-synchronous selection with waiting threshold ε = 3]
12. Mutually-Constrained Monotonic Multihead Attention
● Mutually-Constrained MMA (MCMMA) brings HSD into the training phase to fill the gap between the training and testing processes.
● In detail, we modify the attention distribution α of MMA to reflect HSD.
● Two cases need to be considered to bring HSD into the training process.
13. Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution, when none of the other heads has selected any frame up to the (j−ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD, when the fastest head selects the (j−ε)-th frame.
For Case 1:
The m-th head: p^m_{i,j} Σ_{k=1}^{j} δ^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l})
The other heads: Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
Probability: p^m_{i,j} Σ_{k=1}^{j} δ^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l}) · Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
14. Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution, when none of the other heads has selected any frame up to the (j−ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD, when the fastest head selects the (j−ε)-th frame.
For Case 2:
The m-th head: (1 − Σ_{k=1}^{j−1} δ^m_{i,k})
The other heads: Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε−1} δ^{m′}_{i,k}) − Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
Probability: (1 − Σ_{k=1}^{j−1} δ^m_{i,k}) · [Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε−1} δ^{m′}_{i,k}) − Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})]
15. Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution, when none of the other heads has selected any frame up to the (j−ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD, when the fastest head selects the (j−ε)-th frame.
Combining Case 1 and Case 2:
δ^m_{i,j} = p^m_{i,j} Σ_{k=1}^{j} δ^m_{i−1,k} Π_{l=k}^{j−1} (1 − p^m_{i,l}) · Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})
          + (1 − Σ_{k=1}^{j−1} δ^m_{i,k}) · [Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε−1} δ^{m′}_{i,k}) − Π_{m′≠m}^{H} (1 − Σ_{k=1}^{j−ε} δ^{m′}_{i,k})]
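A direct, unoptimized reading of the combined equation can be written as follows. The code and names are our own; it assumes ε ≥ 1 so the right-hand side only touches already-computed frames, and real training code would vectorize this:

```python
import numpy as np

def mcmma_alignment(p, delta_prev, eps=3):
    """Mutually-constrained alignment delta^m_{i,j} (illustrative sketch).

    p:          (H, T) per-head selection probabilities p^m_{i,j}
    delta_prev: (H, T) previous-step alignments delta^m_{i-1,:}
    Frames are processed left to right because delta_{i,:} of the other
    heads (up to frame j - eps) and of the head itself (up to j - 1)
    appear on the right-hand side.
    """
    H, T = p.shape
    delta = np.zeros((H, T))
    for j in range(T):                      # 0-based frame index
        for m in range(H):
            others = [n for n in range(H) if n != m]
            # Case 1: head m fires at frame j by its own recurrence, gated
            # by no other head having fired up to frame j - eps.
            own = p[m, j] * sum(delta_prev[m, k] * np.prod(1.0 - p[m, k:j])
                                for k in range(j + 1))
            gate = np.prod([1.0 - delta[n, :max(j - eps + 1, 0)].sum()
                            for n in others])
            case1 = own * gate
            # Case 2: head m has not fired yet, and the fastest other head
            # reaches frame j - eps exactly now, forcing head m to fire.
            unfired = 1.0 - delta[m, :j].sum()
            forced = (np.prod([1.0 - delta[n, :max(j - eps, 0)].sum()
                               for n in others])
                      - np.prod([1.0 - delta[n, :max(j - eps + 1, 0)].sum()
                                 for n in others]))
            delta[m, j] = case1 + unfired * forced
    return delta
```

With a single head, or with ε larger than T, the Case 2 bracket vanishes and δ reduces to the plain per-head MA alignment.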
19. Results on Librispeech 100-hour and AISHELL-1
● Our approach shows better performance than MMA and HeadDrop.
20. Trade-offs between Performance and Latency
● Our approach shows better performance than MMA and HeadDrop with a slight increase in latency, where
latency = (1 / L_min) Σ_{i=1}^{L_min} (b^hyp_i − b^ref_i)
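The latency metric is just an average boundary difference; a sketch under the natural reading of the formula (names ours):

```python
def boundary_latency(b_hyp, b_ref):
    """Average token-emission latency (sketch of the slide's formula).

    b_hyp, b_ref: per-token boundary frame indices of the hypothesis and
    the reference. Averages b^hyp_i - b^ref_i over the first L_min tokens,
    where L_min is the length of the shorter sequence.
    """
    l_min = min(len(b_hyp), len(b_ref))
    return sum(b_hyp[i] - b_ref[i] for i in range(l_min)) / l_min
```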
22. Conclusion
● We proposed the Mutually-Constrained MMA (MCMMA) algorithm to fill the gap between the training and testing phases.
● We brought HSD into the training phase by modifying the attention distribution.
● We improved performance with a small increase in latency on Librispeech 100-hour and AISHELL-1.
23. References
1) Colin Raffel et al., “Online and linear-time attention by enforcing monotonic alignments,” in Proc. of ICML (PMLR), 2017.
2) Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu, “Monotonic multihead attention,” in Proc. of ICLR, 2020.
3) Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, “Enhancing monotonic multihead attention for streaming ASR,” in Proc. of Interspeech, 2020, pp. 2137–2141.