1/27
SP Study
Paper Reading
NAIST, AHC-Lab, SP-GROUP
Mori Takuma (M1)
10/18/2016, © 2016 Takuma Mori, AHC-Lab, IS, NAIST
2/27
1. Towards End-to-End Speech
Recognition with Deep
Convolutional Neural
Networks
2. Deep Speech 2: End-to-End
Speech Recognition in
English and Mandarin
Outline
3/27
1. Towards End-to-End Speech
Recognition with Deep
Convolutional Neural
Networks
2. Deep Speech 2: End-to-End
Speech Recognition in
English and Mandarin
Outline
4/27
Towards End-to-End Speech
Recognition with Deep
Convolutional Neural
Networks
Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang,
César Laurent, Yoshua Bengio, Aaron Courville
Université de Montréal, INTERSPEECH 2016
5/27
Background
• Convolutional Neural Networks (CNNs) are effective
models for reducing spectral variations and
modeling spectral correlations in acoustic
features for automatic speech recognition.
• Connectionist Temporal Classification (CTC) with
Recurrent Neural Networks (RNNs), which is
proposed for labeling unsegmented sequences, makes
it feasible to train an ‘end-to-end’ speech
recognition system instead of hybrid settings.
Problem
• RNNs are computationally expensive and sometimes
difficult to train.
Solution
• We propose an end-to-end speech framework for
sequence labeling, by combining hierarchical CNNs
with CTC directly without recurrent connections.
Overview
6/27
• Log-mel-filter-bank to phoneme
• Input: 40-dimensional log-mel-filter-bank coefficients (plus an energy term) with deltas and delta-deltas
• Our CNN acoustic model (whose architecture differs from the one shown above): the complete CNN consists of stacked convolutional and pooling layers, at the top of which are multiple fully-connected layers.
Proposed Method
CTC
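As a concrete illustration of this input pipeline, here is a minimal sketch of assembling 40-dimensional log-mel filterbank features with an energy term, deltas, and delta-deltas. The use of librosa, the 25 ms/10 ms window and hop sizes, and the file name are my assumptions for illustration; the paper does not prescribe a specific toolkit.

```python
# Minimal sketch of the input features described above (not the authors' code):
# 40 log-mel coefficients + energy, with deltas and delta-deltas (123 dims total).
# "utt.wav" is a placeholder path.
import librosa
import numpy as np

y, sr = librosa.load("utt.wav", sr=16000)

# 40 log-mel filterbank coefficients per frame (25 ms window, 10 ms shift).
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)                        # (40, T)

# Energy term: log frame energy appended as an extra coefficient.
energy = np.log(librosa.feature.rms(
    y=y, frame_length=400, hop_length=160) + 1e-10)       # (1, T)

static = np.vstack([log_mel, energy])                     # (41, T)
feats = np.vstack([static,
                   librosa.feature.delta(static, order=1),
                   librosa.feature.delta(static, order=2)])  # (123, T)
print(feats.shape)
```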
7/27
Convolution
8/27
• Sequence of acoustic feature values: $X \in \mathbb{R}^{c \times b \times f}$, with channels $c$, frequency bandwidth $b$, and time length $f$.
• A convolutional layer convolves $X$ with $k$ filters $\{W_i\}_{i=1}^{k}$, where $W_i \in \mathbb{R}^{c \times m \times n}$ ($m$ along the frequency axis, $n$ along the frame axis).
• The resulting $k$ pre-activation feature maps form a 3D tensor $H \in \mathbb{R}^{k \times b_H \times f_H}$.
• Each $H_i$ is computed as $H_i = W_i \ast X + b_i$, $i = 1, \ldots, k$.
Convolution
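A minimal PyTorch sketch of the convolution above. The tensor sizes (c = 3 channels for static/delta/delta-delta, b = 41 frequency bins, and the filter shape) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of H_i = W_i * X + b_i with torch (shapes follow the slide's notation).
import torch
import torch.nn.functional as F

c, b, f = 3, 41, 100       # channels, frequency bandwidth, time length
k, m, n = 64, 8, 5         # number of filters, filter height (freq), filter width (time)

X = torch.randn(1, c, b, f)           # one utterance, with a batch dimension added
W = torch.randn(k, c, m, n)           # k filters W_i
bias = torch.randn(k)                 # b_i

# conv2d actually cross-correlates, which only flips the filters
# and does not change the model class.
H = F.conv2d(X, W, bias)              # (1, k, b_H, f_H) = (1, 64, 34, 96)
print(H.shape)
```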
9/27
• We take the number of piece-wise linear functions (maxout pieces) to be 2 in this example.
Maxout
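A small sketch of a maxout activation with 2 pieces, matching the example above: each maxout unit outputs the elementwise maximum of its p linear feature maps. The channel layout and sizes are assumptions for illustration.

```python
# Maxout over p pieces: (batch, pieces * units, ...) -> (batch, units, ...).
import torch

def maxout(pre_activations: torch.Tensor, pieces: int = 2) -> torch.Tensor:
    b, ch = pre_activations.shape[:2]
    rest = pre_activations.shape[2:]
    grouped = pre_activations.view(b, ch // pieces, pieces, *rest)
    return grouped.max(dim=2).values

h = torch.randn(1, 128, 34, 96)    # e.g. 64 maxout units x 2 pieces
print(maxout(h, pieces=2).shape)   # torch.Size([1, 64, 34, 96])
```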
10/27
• With $\tilde{H}_i$ before pooling and $\hat{H}_i$ after pooling:
  $[\hat{H}_i]_{r,t} = \max_{j=1,\ldots,p} \, [\tilde{H}_i]_{r \times s + j,\, t}$
• Here $s$ is the step size and $p$ the pooling size; all the $[\tilde{H}_i]_{r \times s + j,\, t}$ values inside the max share the same time index $t$.
Pooling
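The pooling formula above corresponds to max pooling along the frequency axis only. A minimal PyTorch sketch follows; the pool size p and step s are illustrative choices, not the paper's values.

```python
# Max over p consecutive frequency bins with step s, independently at each time index t.
import torch
import torch.nn.functional as F

p, s = 3, 3                           # pooling size and step (illustrative)
H_tilde = torch.randn(1, 64, 34, 96)  # (batch, k, frequency, time)

# Pool along frequency only; the time axis keeps kernel/stride 1,
# so every value inside one max shares the same time index t.
H_hat = F.max_pool2d(H_tilde, kernel_size=(p, 1), stride=(s, 1))
print(H_hat.shape)                    # torch.Size([1, 64, 11, 96])
```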
11/27
Connectionist Temporal Classification
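For reference, a hedged sketch of how a CTC objective is typically wired up using PyTorch's built-in CTCLoss (not the authors' implementation); the 61-phoneme-plus-blank output size is an assumption based on TIMIT.

```python
# Per-frame class posteriors over phonemes plus a blank are aligned to an
# unsegmented label sequence by summing over all valid alignments.
import torch
import torch.nn as nn

T, N, C = 96, 1, 62          # frames, batch, classes (61 phonemes + blank), illustrative
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(low=1, high=C, size=(N, 20))     # 20 phoneme labels
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)    # class 0 reserved for the blank symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients flow back into the acoustic model
```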
12/27
• Data: TIMIT corpus (462-speaker training set, 50-speaker development set)
• Our model achieves an 18.2% phoneme error rate on the core test set, slightly better than both the LSTM baseline model and the transducer model with an explicit RNN language model.
Experimental Result
13/27
1. Towards End-to-End Speech
Recognition with Deep
Convolutional Neural
Networks
2. Deep Speech 2: End-to-End
Speech Recognition in
English and Mandarin
Outline
14/27
Deep Speech 2: End-to-End
Speech Recognition in
English and Mandarin
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared
Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam
Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan,
Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick
LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair,
Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun,
Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani
Yogatama, Jun Zhan, Zhenyao Zhu
Baidu Research – Silicon Valley AI Lab, ICML 2016
15/27
Background
• This "end-to-end" vision of training simplifies the training process.
Contributions
• We show that an end-to-end deep learning approach
can be used to recognize either English or
Mandarin Chinese speech, two vastly different
languages.
• In several cases, our system is competitive with
the transcription of human workers when
benchmarked on standard datasets.
• Finally, we show that our system can be
inexpensively deployed in an online setting,
delivering low latency when serving users at
scale.
Overview
16/27
• Spectrogram to Text
• Input: sequence of log-spectrograms of power-normalized audio clips (20 ms windows)
• At each output time-step $t$, the RNN makes a prediction $p(l_t \mid x)$, where $l_t$ is either a character in the alphabet or the blank symbol.
• At inference time, CTC models are paired with a language model trained on a bigger corpus of text, and decoding maximizes
  $Q(y) = \log(p_{\mathrm{RNN}}(y \mid x)) + \alpha \log(p_{\mathrm{LM}}(y)) + \beta\,\mathrm{wc}(y)$  (1)
  where $\mathrm{wc}(y)$ is the word (or character) count of the transcription $y$.
Proposed Method
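A minimal sketch of how Eq. (1) can be used to rescore beam-search candidates. The language-model interface, the dummy LM, and the alpha/beta values are placeholders, not the paper's tuned settings.

```python
# Combine the CTC model's score with an external LM score and a word-count bonus.
from typing import Callable

def q_score(log_p_rnn: float,
            y: str,
            lm_logprob: Callable[[str], float],
            alpha: float = 1.0,
            beta: float = 1.0) -> float:
    """Q(y) = log p_RNN(y|x) + alpha * log p_LM(y) + beta * wc(y)."""
    word_count = len(y.split())
    return log_p_rnn + alpha * lm_logprob(y) + beta * word_count

# Toy usage: rank two beam-search candidates with a dummy "LM" that prefers "sat".
dummy_lm = lambda y: 0.0 if y.endswith("sat") else -5.0
cands = {"the cat sat": -12.3, "the cat sad": -12.1}
best = max(cands, key=lambda y: q_score(cands[y], y, dummy_lm, alpha=2.0, beta=0.4))
print(best)   # the language model overrides the small acoustic-score difference
```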
17/27
• Recent research has shown that BatchNorm (Batch Normalization) can speed up the convergence of RNN training, though it does not always improve generalization error.
• A recurrent layer is implemented as
  $h_t^l = f(W^l h_t^{l-1} + U^l h_{t-1}^l)$
• Two ways of applying BatchNorm ($B$: the BatchNorm transformation):
  1. $h_t^l = f(B(W^l h_t^{l-1} + U^l h_{t-1}^l))$
  2. $h_t^l = f(B(W^l h_t^{l-1}) + U^l h_{t-1}^l)$
• 1 (normalizing the whole pre-activation) is not effective.
• 2 (normalizing only the input-to-hidden term) works well.
Batch Normalization for Deep RNNs
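A toy sketch of option 2: BatchNorm is applied only to the input-to-hidden term, sequence-wise (statistics over all time steps and items in the minibatch). This is an illustrative re-implementation, not Baidu's code; the sizes and the plain ReLU nonlinearity are assumptions.

```python
# h_t = f(B(W x_t) + U h_{t-1}); the recurrent term is left unnormalized.
import torch
import torch.nn as nn

class BNRecurrentLayer(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)
        self.bn = nn.BatchNorm1d(hidden_size)   # statistics over batch and time
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (time, batch, input_size)
        T, B, _ = x.shape
        # Sequence-wise normalization: normalize W x over all (time, batch) items at once.
        wx = self.bn(self.W(x).reshape(T * B, -1)).reshape(T, B, -1)
        h = torch.zeros(B, wx.size(-1))
        outputs = []
        for t in range(T):
            h = self.act(wx[t] + self.U(h))
            outputs.append(h)
        return torch.stack(outputs)              # (time, batch, hidden)

layer = BNRecurrentLayer(input_size=161, hidden_size=256)
print(layer(torch.randn(50, 4, 161)).shape)      # torch.Size([50, 4, 256])
```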
18/27
Batch Normalization for Deep RNNs
19/27
• Even with Batch Normalization, we find training with CTC to be occasionally unstable, particularly in the early stages.
• We use the length of the utterance as a heuristic for difficulty.
• In the first training epoch, we iterate through minibatches in the training set in increasing order of the length of the longest utterance in the minibatch. After the first epoch, training reverts to a random order over minibatches.
SortaGrad
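A small sketch of this curriculum (SortaGrad) as a batch-ordering policy. The batching scheme, lengths, and batch size are placeholders for a real data pipeline.

```python
# Epoch 0: minibatches in increasing order of their longest utterance; later epochs: random.
import random

def minibatch_order(utterance_lengths, batch_size, epoch, seed=0):
    idx = list(range(len(utterance_lengths)))
    # Group indices into minibatches (here: consecutive indices, for simplicity).
    batches = [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]
    if epoch == 0:
        # First epoch: sort minibatches by the longest utterance they contain.
        batches.sort(key=lambda b: max(utterance_lengths[i] for i in b))
    else:
        random.Random(seed + epoch).shuffle(batches)
    return batches

lengths = [320, 1500, 90, 700, 410, 1210, 260, 980]   # frames per utterance (made up)
print(minibatch_order(lengths, batch_size=2, epoch=0))
print(minibatch_order(lengths, batch_size=2, epoch=1))
```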
20/27
• GRU and LSTM reach similar accuracy
• GRUs are faster to train and less likely to
diverge.
• GRU architecture achieves better WER for all
network depths.
Comparison of vanilla RNNs and GRUs
21/27
• Convolution in frequency attempts to model spectral variance due to speaker variability more concisely than is possible with large fully connected networks.
• Convolutions are evaluated both in the time-and-frequency domain (2D) and in the time-only domain (1D).
Frequency Convolutions
22/27
• Bidirectional RNN models cannot stream the
transcription process as the utterance arrives
from the user.
• The lookahead convolution layer learns weights to linearly combine each neuron's activations T timesteps into the future, which allows us to control the amount of future context needed.
Lookahead Convolution and Unidirectional Models
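A minimal sketch of a lookahead convolution implemented as a per-channel (depthwise) 1-D convolution over future frames; the lookahead T, layer width, and exact placement in the network are illustrative assumptions.

```python
# Each channel linearly combines its own activations over the next T timesteps,
# giving a unidirectional model a small, fixed amount of future context.
import torch
import torch.nn.functional as F

def lookahead(h: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """h: (batch, channels, time); weights: (channels, T) learned per channel."""
    C, T = weights.shape
    # Pad only the future side so output[t] depends on h[t .. t+T-1].
    h_padded = F.pad(h, (0, T - 1))
    # Depthwise 1-D convolution: one filter per channel, no mixing across channels.
    return F.conv1d(h_padded, weights.unsqueeze(1), groups=C)

h = torch.randn(2, 256, 100)     # activations of a unidirectional RNN layer
w = torch.randn(256, 5)          # T = 5 future steps per neuron
print(lookahead(h, w).shape)     # torch.Size([2, 256, 100])
```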
23/27
• The only architectural changes we make to our
networks are due to the characteristics of the
Chinese character set.
• The network outputs probabilities for about 6000
characters, which includes the Roman alphabet,
since hybrid Chinese-English transcripts are
common.
• We use a character level language model in
Mandarin as words are not usually segmented in
text.
Adaptation to Mandarin
24/27
• Optimizer: Synchronous SGD.
• Our training distributes work over multiple GPUs in a data-parallel fashion with synchronous SGD.
• Each GPU uses a local copy of the model to work on
a portion of the current minibatch and then
exchanges computed gradients with all other GPUs.
• It is reproducible, which facilitates discovering
and fixing regressions.
System Optimizations
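A minimal sketch of one synchronous data-parallel update using torch.distributed: each worker computes local gradients, all-reduces them, and applies the same averaged update. The process-group setup shown is a single-process gloo demo, not the paper's multi-GPU configuration.

```python
# Synchronous SGD step: average gradients across workers, then step identically.
import os
import torch
import torch.distributed as dist

def synchronous_sgd_step(model, loss, optimizer):
    optimizer.zero_grad()
    loss.backward()                                  # local gradients on this worker
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # All-reduce sums gradients across workers; divide to get the average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()                                 # identical update on every worker

if __name__ == "__main__":
    # Single-process demo (world size 1) with the gloo backend, to show the call sequence.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = model(torch.randn(4, 10)).pow(2).mean()
    synchronous_sgd_step(model, loss, opt)
    dist.destroy_process_group()
```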
25/27
• Training data (English): 11,940 hours of labeled speech containing 8 million utterances.
Experimental Result
26/27
• Training data (Mandarin): 9,400 hours of labeled speech containing 11 million utterances.
• Development data: 2,000 utterances, plus a test set of 1,882 examples of noisy speech.
Experimental Result
27/27
End Slide.