J. Song et al., ASRU 2021, MLILAB, KAIST AI
1. Learning How Long to Wait: Adaptively-Constrained Monotonic Multihead Attention for Streaming ASR
Jaeyun Song, Hajin Shim, Eunho Yang
ASRU 2021
Machine Learning & Intelligence Laboratory
2. Motivation
● Monotonic Multihead Attention (MMA) performs comparably to the SOTA online methods in ASR, but there is room to reduce its latency.
● HeadDrop and Head-Synchronous Beam Search Decoding reduce the latency of MMA, but they introduce a gap between the training and testing phases.
● Mutually-Constrained MMA (MCMMA) reduces the latency of MMA with a fixed waiting time threshold, but the optimal threshold may differ depending on the input sequence.
In this work,
● We propose Adaptively-Constrained MMA (ACMMA), which assigns an adequate waiting time threshold per input to decrease latency without a performance drop.
● We reduce latency while even improving over MCMMA's performance on LibriSpeech 100-hour and AISHELL-1.
3. The Overview of Adaptively-Constrained MMA
[Architecture diagram: encoder states are produced by a 1D-convolution followed by ×4 SAN/FFN blocks; previous output tokens pass through a token embedding into ×2 decoder blocks (SAN, MMA, FFN), then Linear & Softmax gives the prediction. Each MMA head m attends to the encoder memory h up to a right bound on head activation set by the waiting threshold 𝜖 = 3; a Threshold Predictor drives the memory/context update.]
4. Threshold Predictor in ACMMA
● The threshold predictor (TP) predicts an appropriate waiting time threshold from a partially observed input sequence → non-differentiable.
$\epsilon_i = \mathrm{TP}(\bar{h}_{i-1}, \hat{s}_i)$, where $\bar{h}_{i-1} = \mathrm{Concat}(\bar{h}_{i-1}^{1}, \dots, \bar{h}_{i-1}^{K})$ and $\bar{h}_{i-1}^{k} = \sum_{1 \le j \le T} \bar{\delta}_{i-1,j}^{k}(\epsilon_{i-1})\, h_j$
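As a small illustration of how the TP input is formed, each head's context vector is an attention-weighted sum of the encoder states, and the per-head vectors are concatenated before being fed to the predictor. This pure-Python sketch uses illustrative names (`head_context`, `tp_input`) that are not from the paper:

```python
def head_context(attn, enc_states):
    """h_bar^k: attention-weighted sum of the encoder states for one head."""
    dim = len(enc_states[0])
    out = [0.0] * dim
    for w, h in zip(attn, enc_states):
        for d in range(dim):
            out[d] += w * h[d]
    return out

def tp_input(attn_per_head, enc_states):
    """Concatenate the per-head context vectors h_bar^1 ... h_bar^K."""
    vec = []
    for attn in attn_per_head:
        vec.extend(head_context(attn, enc_states))
    return vec

enc = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # T = 3 encoder states, dim = 2
attn = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]    # K = 2 heads, one row each
x = tp_input(attn, enc)                      # concatenated vector of length K * dim
```

In ACMMA this concatenated vector (together with the decoder state) is what the TP network maps to a threshold; here we only sketch the input construction.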
● We compute the attention distribution for a fractional threshold via linear interpolation.
● In the testing phase, we round to the nearest integer to obtain the predicted threshold.
$\bar{\delta}_{i,j}^{k}(\epsilon_i) = (\lfloor \epsilon_i \rfloor + 1 - \epsilon_i)\,\bar{\delta}_{i,j}^{k}(\lfloor \epsilon_i \rfloor) + (\epsilon_i - \lfloor \epsilon_i \rfloor)\,\bar{\delta}_{i,j}^{k}(\lfloor \epsilon_i \rfloor + 1)$,
where $\bar{\delta}_{i,j}^{k}(\lfloor \epsilon_i \rfloor)$ and $\bar{\delta}_{i,j}^{k}(\lfloor \epsilon_i \rfloor + 1)$ are the attention distributions calculated by MCMMA.
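The linear interpolation can be sketched in plain Python; `attn_floor` and `attn_ceil` stand for the MCMMA attention distributions at the two neighboring integer thresholds (names are illustrative, not from the paper):

```python
import math

def interpolate_attention(eps, attn_floor, attn_ceil):
    """Linearly interpolate two attention distributions for a fractional
    waiting threshold eps, with floor(eps) <= eps <= floor(eps) + 1."""
    lo = math.floor(eps)
    w_hi = eps - lo           # weight on the distribution at floor(eps) + 1
    w_lo = (lo + 1) - eps     # weight on the distribution at floor(eps)
    return [w_lo * a + w_hi * b for a, b in zip(attn_floor, attn_ceil)]

# At eps = 2.25 the result mixes 75% of the floor (eps = 2) distribution
# with 25% of the ceiling (eps = 3) distribution; both inputs sum to 1,
# so the interpolated result remains a valid distribution.
mixed = interpolate_attention(2.25, [0.6, 0.4, 0.0], [0.2, 0.5, 0.3])
```

Because interpolation is differentiable in eps, gradients can flow into the threshold predictor during training, which is the point of this construction.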
5. Threshold Regularization in ACMMA
● To induce TP to predict a low waiting time threshold, we introduce the threshold regularization (TR).
● TR is computed by averaging the predicted thresholds and is weighted by $\lambda_{\mathrm{TR}}$.
● We train TP with two approaches: in an end-to-end manner, or by fine-tuning from a pretrained MCMMA.
$\mathcal{L}_{\mathrm{TR}} = \frac{1}{LS} \sum_{1 \le l \le L} \sum_{1 \le i \le S} \epsilon_i^{(l)}$

$\mathcal{L} = (1 - \lambda_{\mathrm{ctc}})\,\mathcal{L}_{\mathrm{s2s}} + \lambda_{\mathrm{ctc}}\,\mathcal{L}_{\mathrm{ctc}} + \lambda_{\mathrm{TR}}\,\mathcal{L}_{\mathrm{TR}}$
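A minimal sketch of this combined objective, assuming the reconstruction above: `l_s2s` and `l_ctc` stand in for the sequence-to-sequence and CTC loss values, and the weights and threshold numbers below are placeholders, not values from the paper:

```python
def tr_loss(thresholds):
    """L_TR: average of the predicted thresholds eps_i^(l)
    over L decoder layers and S output steps."""
    L = len(thresholds)
    S = len(thresholds[0])
    return sum(sum(per_layer) for per_layer in thresholds) / (L * S)

def total_loss(l_s2s, l_ctc, thresholds, lam_ctc=0.3, lam_tr=0.1):
    """L = (1 - lam_ctc) * L_s2s + lam_ctc * L_ctc + lam_TR * L_TR."""
    return (1 - lam_ctc) * l_s2s + lam_ctc * l_ctc + lam_tr * tr_loss(thresholds)

# Two decoder layers (L = 2), three output steps (S = 3).
loss = total_loss(1.0, 2.0, [[2.0, 3.0, 4.0], [1.0, 2.0, 3.0]])
```

Since TR is simply the mean predicted threshold, increasing `lam_tr` pushes TP toward smaller thresholds, trading recognition accuracy for lower latency.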