Wang Chun-Hao (王俊豪), substitute-service draftee
chuchuhao831@gmail.com
Sequence Learning
from a character-based acoustic model to
an End-to-End ASR system
Sequence Learning
(figure from A. Graves, Supervised Sequence Labelling, ch. 2 p. 9)
A stream of input data
= a sequence of acoustic features
= a list of MFCCs (the features used here)
Extract Feature → Learning Algorithm → output "Thee ound oof" → Error against the target?
Outline
1. Feedforward Neural Network
2. Recurrent Neural Network
3. Bidirectional Recurrent Neural Network
4. Long Short-Term Memory
5. Connectionist Temporal Classification
6. Build an End-to-End System
7. Experiment on Low-resource Language
Feedforward Neural Network
(diagram: weighted connections feeding activation functions, layer by layer)
Recurrent Neural Network
(diagram: weighted connections and activation functions, with the hidden state fed back at every timestep)
Bidirectional RNN
(figures from A. Graves, Supervised Sequence Labelling, ch. 3 pp. 23-24)
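To make the three diagrams above concrete, here is a minimal NumPy sketch of their forward passes; the tanh activations and all variable names are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def feedforward_layer(x, W, b):
    # Feedforward: weighted sum of the current input, then an activation function.
    return np.tanh(W @ x + b)

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # Recurrent: the weighted sum also includes the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

def birnn(xs, fwd_params, bwd_params, hidden_size):
    # Bidirectional: run one RNN forward in time and another backward,
    # then concatenate the two hidden states at every timestep.
    h_f, h_b = np.zeros(hidden_size), np.zeros(hidden_size)
    forward, backward = [], []
    for x_t in xs:
        h_f = rnn_step(x_t, h_f, *fwd_params)
        forward.append(h_f)
    for x_t in reversed(xs):
        h_b = rnn_step(x_t, h_b, *bwd_params)
        backward.append(h_b)
    backward.reverse()
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

# toy usage: 4 frames of 3-dim features, hidden size 2 (random weights)
rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 3)), rng.normal(size=(2, 2)), np.zeros(2))
outputs = birnn([rng.normal(size=3) for _ in range(4)], params, params, 2)
```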
RNNs → LSTM
(diagram: an LSTM memory cell with Input, Output, and Forget gates)
Legend: G = gate, W = weighted connection, M = multiplier; each unit applies an activation function
LSTM Forward Pass, Previous Layer Input
(diagram: the cell and all three gates receive weighted input from the previous layer)
LSTM Forward Pass, Input Gate
(diagram: the input gate activation scales how much of the new input enters the cell)
LSTM Forward Pass, Forget Gate
(diagram: the forget gate activation scales how much of the previous cell state is kept)
LSTM Forward Pass, Cell
(diagram: the cell state is updated from the gated input and the gated previous cell state)
LSTM Forward Pass, Output Gate
(diagram: the output gate activation scales the cell state before it leaves the block)
LSTM Forward Pass, Output
(diagram: the block output is the gated, squashed cell state, passed to the next layer and the next timestep)
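The forward pass traced over the last few slides can be written as one step of computation. Below is a minimal NumPy sketch of a single LSTM step in the common formulation (sigmoid gates, tanh squashing); the peephole connections that Graves' book includes are omitted, and the parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # p is a dict of (hypothetical) weight matrices and biases.
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate
    g = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])   # candidate cell input
    c = f * c_prev + i * g                                          # new cell state
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate
    h = o * np.tanh(c)                                              # block output
    return h, c
```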
LSTM Backward Pass, Output Gate
(diagram: the error signal propagating back through the output gate during backpropagation through time)
RNN vs. LSTM
(figures from A. Graves, Supervised Sequence Labelling, pp. 33 and 35)
Connectionist Temporal Classification
Additional CTC Output Layer
(1) Predict a label at any timestep
(2) Probability of a sentence
Predict a label at any timestep
Framewise Acoustic Feature → Label
Labels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
{space}
{blank}
others like ‘ “ .
Probability of Sentence
path: Thh..e S..outtd ooof
Step 1, remove repeated labels: Th.e S.outd of
Step 2, remove all blanks: The Soutd of
Many paths collapse to the same labelling; compare the predicted labelling against the target.
Now we can do maximum-likelihood training.
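A minimal sketch of this collapsing map (often written B in the CTC papers), assuming '.' prints the blank symbol as on the slides:

```python
from itertools import groupby

def ctc_collapse(path, blank="."):
    # Step 1: merge consecutive repeated labels; Step 2: drop all blanks.
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != blank)

print(ctc_collapse("Thh..e S..outtd ooof"))  # -> "The Soutd of"
```

Because many distinct paths collapse to the same labelling, the probability of a labelling is the sum of the probabilities of all paths that collapse to it.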
Output Decoding
Choose the labelling:
Best Path decoding assumes the most probable path corresponds to the most probable labelling.
Something can go wrong here (the best path need not give the best labelling), so Prefix Search decoding would be better.
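Under that assumption, best-path decoding is just a per-frame argmax followed by the same collapse; the toy label set and probabilities below are made up for illustration:

```python
import numpy as np
from itertools import groupby

def best_path_decode(probs, labels, blank_id=0):
    # probs: (T, num_labels) softmax outputs, one row per frame.
    best = np.argmax(probs, axis=1)                              # most probable label per frame
    merged = [k for k, _ in groupby(best)]                       # remove repeated labels
    return "".join(labels[k] for k in merged if k != blank_id)   # remove blanks

# toy usage: 3 frames, labels ".ab" with '.' as blank
labels = ".ab"
probs = np.array([[0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])
print(best_path_decode(probs, labels))  # -> "ab"
```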
How to connect the target labelling
to the RNN's output?
Sum over path probabilities, forward
l_s = .  (the current symbol is a blank)
s-0 : .h.e.l.
s-1 : .h.e.l
s-2 : .h.e.    ( the s-2 jump would be blank → blank, so it is not allowed )
l_s = l  (the current symbol repeats the previous non-blank label)
s-0 : .h.e.l.l
s-1 : .h.e.l.
s-2 : .h.e.l   ( the s-2 jump would be between the same non-blank label, so it is not allowed )
Only allow transitions
(1) between a blank and a non-blank label
(2) between distinct non-blank labels
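These transition rules give the standard CTC forward recursion over the blank-extended label sequence l' (following Graves 2006; the exact notation here is mine, with y_t^k the softmax output for label k at time t):

```latex
% Forward variable: \alpha_t(s) sums the probabilities of all path prefixes
% ending in state s of the extended sequence l' at time t.
\alpha_1(1) = y_1^{\text{blank}}, \qquad \alpha_1(2) = y_1^{l_1}, \qquad \alpha_1(s) = 0 \ \text{for } s > 2
\alpha_t(s) =
\begin{cases}
y_t^{l'_s}\bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr) & \text{if } l'_s = \text{blank or } l'_s = l'_{s-2}\\
y_t^{l'_s}\bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr) & \text{otherwise}
\end{cases}
\qquad p(l \mid x) = \alpha_T(|l'|) + \alpha_T(|l'|-1)
```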
Sum over path probabilities, backward
l_s = .  (the current symbol is a blank)
s+0 : .l.l.o.
s+1 : l.l.o.
s+2 : .l.o.    ( the s+2 jump would be blank → blank, so it is not allowed )
l_s = l  (the current symbol repeats the next non-blank label)
s+0 : l.l.o.
s+1 : .l.o.
s+2 : l.o.     ( the s+2 jump would be between the same non-blank label, so it is not allowed )
Only allow transitions
(1) between a blank and a non-blank label
(2) between distinct non-blank labels
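Symmetrically, a backward variable sums over path suffixes. In the convention sketched here (which, like the forward variable above, includes y_t in β_t):

```latex
% Backward variable over the extended sequence l' of length |l'|.
\beta_T(|l'|) = y_T^{\text{blank}}, \qquad \beta_T(|l'|-1) = y_T^{l_{|l|}}, \qquad \beta_T(s) = 0 \ \text{for } s < |l'|-1
\beta_t(s) =
\begin{cases}
y_t^{l'_s}\bigl(\beta_{t+1}(s) + \beta_{t+1}(s+1)\bigr) & \text{if } l'_s = \text{blank or } l'_s = l'_{s+2}\\
y_t^{l'_s}\bigl(\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)\bigr) & \text{otherwise}
\end{cases}
```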
Sum over path probabilities
Target: dog → extended label sequence .d.o.g. (states s = 1 … 7: ".", "d", ".", "o", ".", "g", ".")
(lattice diagram: rows are the 7 states, columns are the timesteps t = 0 … T; each node accumulates the prefixes ".", ".d", ".d.", ".d.o", ".d.o.", ".d.o.g", ".d.o.g.")
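A compact sketch of that forward (alpha) computation over the lattice for the target "dog"; the uniform frame probabilities are dummies so the example runs:

```python
import numpy as np

def ctc_forward(probs, target_ids, blank_id=0):
    """probs: (T, num_labels) softmax outputs; returns p(target | x)."""
    ext = [blank_id]                      # blank-extended sequence: . d . o . g .
    for k in target_ids:
        ext += [k, blank_id]
    S, T = len(ext), probs.shape[0]
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the s-2 skip is allowed only between distinct non-blank labels
            if s > 1 and ext[s] != blank_id and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

# toy usage: labels ".dog" (blank, d, o, g), 6 frames of uniform probabilities
target_ids = [1, 2, 3]                    # d, o, g
probs = np.full((6, 4), 0.25)
print(ctc_forward(probs, target_ids))     # total probability of all paths collapsing to "dog"
```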
Sum over path probabilities
forward-backward: the product of the forward and backward variables gives the probability of all the paths corresponding to l that go through a given symbol at time t.
(lattice diagram: the same .d.o.g. lattice, with one column t = t' highlighted between t = 0 and t = T)
Finally, we differentiate with respect to the unnormalised outputs to get the error signal.
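Written out under the same conventions as the α/β sketches above (both variables include y_t, so their product counts it twice), these two statements become:

```latex
% Probability of all paths for l that pass through state s at time t:
\frac{\alpha_t(s)\,\beta_t(s)}{y_t^{l'_s}}
\qquad\Longrightarrow\qquad
p(l \mid x) = \sum_{s=1}^{|l'|} \frac{\alpha_t(s)\,\beta_t(s)}{y_t^{l'_s}} \quad \text{for any } t.

% Error signal w.r.t. the unnormalised (pre-softmax) outputs u_t^k,
% where lab(l, k) is the set of positions s with l'_s = k:
\frac{\partial\,(-\ln p(l \mid x))}{\partial u_t^k}
  = y_t^k \;-\; \frac{1}{p(l \mid x)\, y_t^k} \sum_{s \in \mathrm{lab}(l, k)} \alpha_t(s)\,\beta_t(s)
```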
A. Graves. Sequence Transduction with Recurrent Neural Networks
Transcription network + Prediction network
Transcription network: acoustic feature at time t → label prediction at time t
Prediction network: previous label → next-label probability
A. Graves. Towards End-to-end Speech Recognition with
Recurrent Neural Networks ( Google Deepmind )
Additional CTC Output Layer
Additional WER Output Layer
Minimise Objective Function -> Minimise Loss Function
Spectrogram feature
Andrew Y. Ng. Deep Speech: Scaling up end-to-end speech
recognition ( Baidu Research)
Additional CTC Output Layer
Additional WER Output Layer
Spectrogram feature
Language Model Transcription
RNN Output: arther n tickets for the game
Decoded Target: are there any tickets for the game
Q(L) = log P(L | x) + a * log P_LM(L) + b * word_count(L)
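A minimal sketch of that rescoring objective as a Python scoring function; the placeholder language model and the default weights a and b are assumptions, not values from the paper:

```python
import math

class UniformLM:
    """Placeholder language model: every word gets the same probability."""
    def __init__(self, vocab_size=10000):
        self.vocab_size = vocab_size
    def log_prob(self, sentence):
        return -len(sentence.split()) * math.log(self.vocab_size)

def decode_score(log_p_acoustic, candidate, lm, a=1.0, b=1.0):
    # Q(L) = log P(L | x) + a * log P_LM(L) + b * word_count(L)
    return log_p_acoustic + a * lm.log_prob(candidate) + b * len(candidate.split())

# In a beam-search decoder, each candidate labelling L would be ranked by
# decode_score(...) instead of by log P(L | x) alone.
lm = UniformLM()
print(decode_score(-12.3, "are there any tickets for the game", lm))
```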
Low-resource language experiment
Target: ? ka didto sigi ra og tindog
Output: h bai to giy ngtndog
test 0.44,
test 0.71
Reference
1. Bourlard and Morgan (1994). Connectionist Speech Recognition: A Hybrid Approach.
2. A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks.
3. A. Graves. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.
4. A. Graves. Sequence Transduction with Recurrent Neural Networks.
5. A. Graves. Towards End-to-End Speech Recognition with Recurrent Neural Networks (Google DeepMind).
6. Andrew Y. Ng. Deep Speech: Scaling up End-to-End Speech Recognition (Baidu Research).
7. Stanford CTC: https://github.com/amaas/stanford-ctc
8. R. Pascanu. On the Difficulty of Training Recurrent Neural Networks.
9. L. Besacier. Automatic Speech Recognition for Under-Resourced Languages: A Survey.
10. A. Gibiansky. http://andrew.gibiansky.com/blog/machine-learning/speech-recognition-neural-networks/
11. LSTM mathematical formalism (Mandarin): http://blog.csdn.net/u010754290/article/details/47167979
12. Assumptions of the Markov model: http://jedlik.phy.bme.hu/~gerjanos/HMM/node5.html
13. Story about the ANN-HMM hybrid (Mandarin), HMM: http://www.taodocs.com/p-5260781.html
14. Generative model vs. discriminative model: http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm
Appendix B. Future work
Memory Networks: A. Graves. Neural Turing Machines
(figure from http://www.slideshare.net/yuzurukato/neural-turing-machines-43179669)
Appendix A. Generative vs. discriminative
A generative model learns the joint probability p(x, y).
A discriminative model learns the conditional probability p(y | x).
data input: (1,0), (1,0), (2,0), (2,1)

Generative, p(x, y):
        y=0   y=1
x=1     1/2   0
x=2     1/4   1/4

Discriminative, p(y | x):
        y=0   y=1
x=1     1     0
x=2     1/2   1/2

According to Andrew Y. Ng, On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes, the overall gist is that discriminative models generally outperform generative models in classification tasks.
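A tiny sketch that reproduces the two tables above by counting the four data points listed on the slide:

```python
from collections import Counter

data = [(1, 0), (1, 0), (2, 0), (2, 1)]
n = len(data)

# Generative: joint probability p(x, y)
joint = Counter(data)
for x in (1, 2):
    for y in (0, 1):
        print(f"p(x={x}, y={y}) = {joint[(x, y)] / n}")

# Discriminative: conditional probability p(y | x)
count_x = Counter(x for x, _ in data)
for x in (1, 2):
    for y in (0, 1):
        print(f"p(y={y} | x={x}) = {joint[(x, y)] / count_x[x]}")
```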
