Mutually-Constrained Monotonic
Multihead Attention for Online ASR
ICASSP 2021
Jaeyun Song, Hajin Shim, Eunho Yang
{mercery, shimazing, eunhoy}@kaist.ac.kr
Automatic Speech Recognition is widely used in industry
● Teleconferencing, simultaneous translation, AI speakers, AI assistants, etc.
(Figure: (a) a teleconference and (b) an AI speaker responding to "Alexa!")
Online Automatic Speech Recognition
● Offline Automatic Speech Recognition (ASR) requires waiting for the speech to end.
● Online ASR recognizes speech simultaneously as it is being spoken.
(Figure: timeline comparing offline and online ASR on the utterance "Play songs in the playlist"; online ASR emits the transcription earlier along the time axis.)
Motivation
● Monotonic Multihead Attention (MMA) shows performance comparable to the
SOTA online methods in ASR, but there is room for reducing its latency.
● HeadDrop and Head-Synchronous Beam Search Decoding reduce the latency of
MMA, but they leave a gap between the training and testing phases.
In this work,
● We propose Mutually-Constrained MMA (MCMMA) to fill the gap.
● We improve performance with only a small increase in latency.
Outline
● Monotonic Multihead Attention (prior work)
ꟷ Monotonic Attention
ꟷ Monotonic Multihead Attention
ꟷ HeadDrop
ꟷ Head-Synchronous Beam Search Decoding
● Mutually-Constrained Monotonic Multihead Attention
● Librispeech 100-hour and AISHELL-1 Results
● Conclusion
Monotonic Attention (Inference)
● Monotonic Attention (MA) is an online attention mechanism that learns
monotonic alignments in an end-to-end manner.
● Notation: the input sequence $(x_1, \ldots, x_{T'})$, the encoder states $(h_1, \ldots, h_T)$,
the output sequence $(y_1, \ldots, y_L)$, and the $(i-1)$-th decoder state $s_{i-1}$.
[1] Raffel, Colin, et al. "Online and linear-time attention by enforcing monotonic
alignments." International Conference on Machine Learning. PMLR, 2017.
To compute the next decoder state $s_i$:
$e_{i,j} = \mathrm{Energy}(s_{i-1}, h_j)$
$p_{i,j} = \sigma(e_{i,j})$
If $p_{i,j} \ge 0.5$ for some $j$ (the first such $j$, scanning forward from the previously selected frame), then $c_i = h_j$.
With $c_i$, produce $s_i$.
(Figure: testing phase; memory frames $h_j$ on one axis and output tokens $y_i$ on the other. Example: $p_{1,1} < 0.5$, $p_{1,2} \ge 0.5$, so the first output attends to frame 2.)
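To make the test-time rule concrete, here is a minimal NumPy sketch of one hard monotonic decoding step. The function name `ma_decode_step`, the `energy` callable standing in for the learned Energy network, and the bookkeeping of the start frame are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ma_decode_step(h, s_prev, energy, t_start):
    """One test-time step of hard monotonic attention.

    h       : (T, d) encoder states
    s_prev  : previous decoder state s_{i-1}
    energy  : callable (s_prev, h_j) -> scalar, stand-in for the learned Energy function
    t_start : frame where the previous output step stopped (enforces monotonicity)

    Returns (c_i, j): the context vector and the selected frame index,
    or (None, t_start) if no frame fires before the input ends.
    """
    T = h.shape[0]
    for j in range(t_start, T):          # scan left to right, never moving back
        p_ij = sigmoid(energy(s_prev, h[j]))
        if p_ij >= 0.5:                  # select the first frame with p_{i,j} >= 0.5
            return h[j], j               # c_i = h_j; the decoder can now produce s_i
    return None, t_start
```

Because the scan never revisits earlier frames, each output token only depends on the audio seen so far, which is what makes the attention usable online.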
Monotonic Attention (Training)
● Monotonic Attention (MA) is an online attention mechanism that learns
monotonic alignments in an end-to-end manner.
● Notation: the input sequence $(x_1, \ldots, x_{T'})$, the encoder states $(h_1, \ldots, h_T)$,
the output sequence $(y_1, \ldots, y_L)$, and the $(i-1)$-th decoder state $s_{i-1}$.
[1] Raffel, Colin, et al. "Online and linear-time attention by enforcing monotonic
alignments." International Conference on Machine Learning. PMLR, 2017.
To compute the next decoder state $s_i$:
$e_{i,j} = \mathrm{Energy}(s_{i-1}, h_j)$
$p_{i,j} = \sigma(e_{i,j})$
Calculate the attention distribution $\alpha_{i,j}$:
$\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l})$
$c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j$
(Figure: training phase; memory frames $h_j$ vs. output tokens $y_i$, with the same example $p_{1,1} < 0.5$, $p_{1,2} \ge 0.5$, now used to form a soft expectation over frames.)
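The recursion above can be sketched directly in NumPy. This is a didactic O(T²) version (the original paper computes the same quantity with cumulative sums and products for efficiency); the function name and the one-hot initialization mentioned in the docstring are my choices.

```python
import numpy as np

def expected_alignment(p_i, alpha_prev):
    """Training-time expected alignment of monotonic attention.

    p_i        : (T,) selection probabilities p_{i,j} for output step i
    alpha_prev : (T,) previous alignment alpha_{i-1, .}
                 (for the first output step, a one-hot on the first frame is a common choice)

    Implements alpha_{i,j} = p_{i,j} * sum_{k<=j} alpha_{i-1,k} * prod_{l=k}^{j-1} (1 - p_{i,l})
    with a plain O(T^2) loop for readability.
    """
    p_i = np.asarray(p_i, dtype=float)
    alpha_prev = np.asarray(alpha_prev, dtype=float)
    T = len(p_i)
    alpha = np.zeros(T)
    for j in range(T):
        mass = 0.0
        for k in range(j + 1):
            mass += alpha_prev[k] * np.prod(1.0 - p_i[k:j])  # prob. of reaching frame j from k
        alpha[j] = p_i[j] * mass
    return alpha

# Soft context used for training: c_i = sum_j alpha_{i,j} * h_j, i.e. expected_alignment(p_i, alpha_prev) @ h
```

Training on this soft expectation is what lets gradients flow through the otherwise discrete frame selection.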
Monotonic Multihead Attention
● Monotonic Multihead Attention (MMA) extends MA to multihead attention by
letting each head learn its own alignment.
● There is unnecessary delay, since MMA cannot produce an output token until
the selection of all heads is done.
[2] Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu,
“Monotonic multihead attention,” in Proc. of ICLR, 2020.
$\alpha^m_{i,j} = p^m_{i,j} \sum_{k=1}^{j} \alpha^m_{i-1,k} \prod_{l=k}^{j-1} (1 - p^m_{i,l})$
$c^m_i = \sum_{j=1}^{T} \alpha^m_{i,j} h^m_j$
$c_i = \mathrm{Concat}(c^1_i, \ldots, c^H_i)$
(Figure: four heads over the same memory $h$; e.g. $p^1_{1,1} < 0.5$, $p^2_{1,1} \ge 0.5$, $p^3_{1,1} < 0.5$, $p^4_{1,1} < 0.5$, so head 2 fires on frame 1 while the other heads keep scanning.)
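At test time each MMA head runs the same hard scan as single-head MA, but the decoder can only emit the next token once every head has selected a frame. A small sketch of that step follows; the per-head `energies` callables and the bookkeeping are illustrative assumptions, and in practice the selected frames index per-head value projections rather than the raw encoder states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mma_decode_step(h, s_prev, energies, starts):
    """One test-time step of plain MMA: every head scans independently, and the
    token can only be emitted after ALL heads have selected, so the slowest
    head determines the delay.

    h        : (T, d) encoder states (per-head value projections omitted for brevity)
    s_prev   : previous decoder state
    energies : list of H per-head Energy callables (stand-ins for the learned nets)
    starts   : list of H frame indices where each head stopped at the previous step

    Returns per-head selected frame indices (None if a head never fires).
    """
    T = h.shape[0]
    selected = []
    for m, energy_m in enumerate(energies):
        sel = None
        for j in range(starts[m], T):
            if sigmoid(energy_m(s_prev, h[j])) >= 0.5:
                sel = j
                break
        selected.append(sel)
        if sel is not None:
            starts[m] = sel
    # The decoder must wait until the stream reaches max(selected) before emitting y_i.
    return selected
```

This wait-for-the-slowest-head behaviour is exactly the unnecessary delay that HeadDrop and head-synchronous decoding aim to remove.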
HeadDrop & Head-Synchronous Beam Search Decoding
● To decrease the latency of MMA, HeadDrop and Head-Synchronous Beam Search
Decoding (HSD) are introduced.
ꟷ HeadDrop drops heads stochastically, analogously to Dropout.
ꟷ HSD forces late heads to select the rightmost selected frame.
● However, there is still a gap between the training and testing phases, since HSD is applied only at test time.
[3] Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, "Enhancing monotonic multihead
attention for streaming ASR," in Proc. of Interspeech, 2020, pp. 2137–2141.
(Figure: memory frames $h$ on the horizontal axis and heads $m$ on the vertical axis, with activations and the right bound marked; (a) plain MMA, (b) Head-Synchronous Decoding with waiting threshold $\epsilon = 3$.)
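As a rough illustration of the decoding rule described above (not the reference implementation), the sketch below applies head-synchronous forcing for one output step: once the fastest head has fired, decoding waits at most ε further frames and then snaps any remaining head to the rightmost frame selected so far. The function name and the offline, per-step formulation are my simplifications.

```python
def head_synchronous_step(head_selections, eps=3):
    """Head-Synchronous Decoding (HSD) at test time, sketched per output step.

    head_selections : per-head frame indices each head would pick on its own
                      (None if the head never fires)
    eps             : waiting threshold (epsilon = 3 in the slide's figure)
    """
    fired = [j for j in head_selections if j is not None]
    if not fired:
        return head_selections                      # nothing to synchronise yet
    deadline = min(fired) + eps                     # fastest head's frame + threshold
    in_time = [j for j in fired if j <= deadline]   # heads that fire within the window
    rightmost = max(in_time)                        # boundary the late heads are forced to
    return [j if (j is not None and j <= deadline) else rightmost
            for j in head_selections]

# Example with eps = 3: selections [4, None, 5, 12] become [4, 5, 5, 5].
```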
Outline
● Monotonic Multihead Attention (prior work)
ꟷ Monotonic Attention
ꟷ Monotonic Multihead Attention
ꟷ HeadDrop
ꟷ Head-Synchronous Beam Search Decoding
● Mutually-Constrained Monotonic Multihead Attention
● Librispeech 100-hour and AISHELL-1 Results
● Conclusion
Overview of Mutually-Constrained MMA
(Figure: left, head-synchronous frame selection over the memory $h$ with waiting threshold $\epsilon = 3$ (heads $m$, activations, right bound); right, the model architecture built from encoder states, previous output tokens, a token embedding, 1D-convolution, self-attention (SAN), MMA, and FFN blocks in ×4 and ×2 stacks, followed by a linear & softmax layer that produces the prediction.)
Mutually-Constrained Monotonic Multihead Attention
● Mutually-Constrained MMA (MCMMA) brings HSD into the training phase to close
the gap between the training and testing processes.
● Specifically, we modify the attention distribution $\alpha$ of MMA to reflect HSD.
● Two cases need to be considered to bring HSD into the training process.
Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution,
when none of the other heads has selected a frame up to the (j-ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD,
when the fastest head selects the (j-ε)-th frame.

For Case 1:
The m-th head: $p^m_{i,j} \sum_{k=1}^{j} \delta^m_{i-1,k} \prod_{l=k}^{j-1} (1 - p^m_{i,l})$
The other heads: $\prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon} \delta^{m'}_{i,k} \right)$
Probability: $p^m_{i,j} \sum_{k=1}^{j} \delta^m_{i-1,k} \prod_{l=k}^{j-1} (1 - p^m_{i,l}) \cdot \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon} \delta^{m'}_{i,k} \right)$
Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution,
when none of the other heads has selected a frame up to the (j-ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD,
when the fastest head selects the (j-ε)-th frame.

For Case 2:
The m-th head: $1 - \sum_{k=1}^{j-1} \delta^m_{i,k}$
The other heads: $\prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon-1} \delta^{m'}_{i,k} \right) - \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon} \delta^{m'}_{i,k} \right)$
Probability: $\left( 1 - \sum_{k=1}^{j-1} \delta^m_{i,k} \right) \left[ \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon-1} \delta^{m'}_{i,k} \right) - \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon} \delta^{m'}_{i,k} \right) \right]$
Mutually-Constrained Monotonic Multihead Attention
(Case 1) The m-th head selects the j-th frame by its own attention distribution,
when none of the other heads has selected a frame up to the (j-ε)-th frame.
(Case 2) The m-th head selects the j-th frame by HSD,
when the fastest head selects the (j-ε)-th frame.

Combining Case 1 and Case 2:
$\delta^m_{i,j} = p^m_{i,j} \sum_{k=1}^{j} \delta^m_{i-1,k} \prod_{l=k}^{j-1} (1 - p^m_{i,l}) \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon} \delta^{m'}_{i,k} \right)$
$\quad + \left( 1 - \sum_{k=1}^{j-1} \delta^m_{i,k} \right) \left[ \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon-1} \delta^{m'}_{i,k} \right) - \prod_{m' \ne m}^{H} \left( 1 - \sum_{k=1}^{j-\epsilon} \delta^{m'}_{i,k} \right) \right]$
Mutually-Constrained Monotonic Multihead Attention
We also cover the marginal case ($j \le \epsilon$). (We compute $\delta^m_{i,j}$ autoregressively over output tokens.)

$A^m_{i,j}(\delta) = p^m_{i,j} \sum_{k=1}^{j} \delta^m_{i-1,k} \prod_{o=k}^{j-1} (1 - p^m_{i,o}), \quad B^m_{i,j}(\delta) = 1 - \sum_{k=1}^{j} \delta^m_{i,k}$

$\delta^m_{i,j} = \begin{cases} A^m_{i,j}(\delta), & j \le \epsilon \\ A^m_{i,j}(\delta) \prod_{m' \ne m} B^{m'}_{i,j-\epsilon}(\delta) + B^m_{i,j-1}(\delta) \left[ \prod_{m' \ne m} B^{m'}_{i,j-\epsilon-1}(\delta) - \prod_{m' \ne m} B^{m'}_{i,j-\epsilon}(\delta) \right], & \text{otherwise} \end{cases}$

$c^m_i = \sum_{j=1}^{T} \delta^m_{i,j} h^m_j, \quad c_i = \mathrm{Concat}(c^1_i, \ldots, c^H_i)$
Mutually-Constrained Monotonic Multihead Attention
To compute the attention distribution in parallel,
we compute $\alpha$ by MMA, and then modify $\alpha$ to produce $\hat{\delta}$.

$A^m_{i,j}(\alpha) = p^m_{i,j} \sum_{k=1}^{j} \alpha^m_{i-1,k} \prod_{o=k}^{j-1} (1 - p^m_{i,o}), \quad B^m_{i,j}(\alpha) = 1 - \sum_{k=1}^{j} \alpha^m_{i,k}$

$\hat{\delta}^m_{i,j} = \begin{cases} A^m_{i,j}(\alpha), & j \le \epsilon \\ A^m_{i,j}(\alpha) \prod_{m' \ne m} B^{m'}_{i,j-\epsilon}(\alpha) + B^m_{i,j-1}(\alpha) \left[ \prod_{m' \ne m} B^{m'}_{i,j-\epsilon-1}(\alpha) - \prod_{m' \ne m} B^{m'}_{i,j-\epsilon}(\alpha) \right], & \text{otherwise} \end{cases}$

$c^m_i = \sum_{j=1}^{T} \hat{\delta}^m_{i,j} h^m_j, \quad c_i = \mathrm{Concat}(c^1_i, \ldots, c^H_i)$
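A NumPy sketch of this parallel modification is given below: $\alpha$ is first computed head-wise with the usual MMA recursion, and $\hat{\delta}$ is then assembled from $A(\alpha) = \alpha$ and $B(\alpha)$ as in the piecewise formula above. The tensor layout, the leave-one-out product over other heads, and the padding of $B$ at index 0 are my implementation choices, not the authors' code.

```python
import numpy as np

def mcmma_alignments(alpha, eps=3):
    """Parallel MCMMA step: modify the MMA alignments alpha into delta-hat.

    alpha : (H, T) MMA alignments alpha^m_{i, .} for one output step i
    eps   : waiting threshold epsilon

    A^m_{i,j}(alpha) is the MMA alignment itself; B^m_{i,j}(alpha) = 1 - sum_{k<=j} alpha^m_{i,k}.
    """
    H, T = alpha.shape
    A = alpha
    # Pad B so that its column index equals the math index j (with B^m_{i,0} := 1).
    B = np.concatenate([np.ones((H, 1)), 1.0 - np.cumsum(alpha, axis=1)], axis=1)

    def prod_other_heads(col):
        """prod_{m' != m} B^{m'}_{i,col}, computed leave-one-out for each head m."""
        vals = B[:, col]
        return np.array([np.prod(np.delete(vals, m)) for m in range(H)])

    delta_hat = np.zeros_like(alpha)
    for j in range(1, T + 1):                            # math index j = 1..T
        if j <= eps:
            delta_hat[:, j - 1] = A[:, j - 1]
        else:
            pb      = prod_other_heads(j - eps)          # prod_{m'} B^{m'}_{i, j-eps}
            pb_prev = prod_other_heads(j - eps - 1)      # prod_{m'} B^{m'}_{i, j-eps-1}
            delta_hat[:, j - 1] = A[:, j - 1] * pb + B[:, j - 1] * (pb_prev - pb)
    return delta_hat

# Per-head context: c^m_i = delta_hat[m] @ h_m, then c_i = Concat(c^1_i, ..., c^H_i).
```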
Outline
● Monotonic Multihead Attention (prior work)
ꟷ Monotonic Attention
ꟷ Monotonic Multihead Attention
ꟷ HeadDrop
ꟷ Head-Synchronous Beam Search Decoding
● Mutually-Constrained Monotonic Multihead Attention
● Librispeech 100-hour and AISHELL-1 Results
● Conclusion
Results on Librispeech 100-hour and AISHELL-1
● Our approach shows better performance than MMA and HeadDrop.
Trade-offs between Performance and Latency
● Our approach shows better performance than MMA and HeadDrop
with only a slight increase in latency, measured as
$\text{latency} = \frac{1}{L_{\min}} \sum_{i=1}^{L_{\min}} \left( b^{hyp}_i - b^{ref}_i \right)$
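For reference, the latency figure above can be computed as sketched below; the reading that $b_i$ is the frame at which the $i$-th token boundary is decided, and that $L_{\min}$ is the shorter of the hypothesis and reference token sequences, is my interpretation of the slide rather than a definition taken from it.

```python
def token_emission_latency(b_hyp, b_ref):
    """Average boundary latency as on the slide:
        (1 / L_min) * sum_{i=1}^{L_min} (b_hyp[i] - b_ref[i])

    b_hyp : frame indices at which the online model decides each token boundary
    b_ref : reference boundary frames for the same tokens
    """
    L_min = min(len(b_hyp), len(b_ref))      # shorter of the two sequences (assumed)
    return sum(b_hyp[i] - b_ref[i] for i in range(L_min)) / L_min

# Example: token_emission_latency([12, 30, 55], [10, 26, 50]) -> ~3.67 frames on average
```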
Outline
● Monotonic Multihead Attention (prior work)
ꟷ Monotonic Attention
ꟷ Monotonic Multihead Attention
ꟷ HeadDrop
ꟷ Head-Synchronous Beam Search Decoding
● Mutually-Constrained Monotonic Multihead Attention
● Librispeech 100-hour and AISHELL-1 Results
● Conclusion
Conclusion
● We proposed the Mutually-Constrained MMA (MCMMA) algorithm to fill the gap
between the training and testing phases.
● We brought HSD into the training phase by modifying the attention distribution.
● We improve performance with only a small increase in latency
on Librispeech 100-hour and AISHELL-1.
References
1) Raffel, Colin, et al. "Online and linear-time attention by enforcing monotonic
alignments." International Conference on Machine Learning. PMLR, 2017.
2) Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu, “Monotonic
multihead attention,” in Proc. of ICLR, 2020.
3) Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, "Enhancing monotonic
multihead attention for streaming ASR," in Proc. of Interspeech, 2020, pp. 2137–2141.