The document discusses DeepSpeech, an open-source speech recognition system that achieves a word error rate (WER) of less than 10%. It describes the core architecture, including feedforward and bidirectional recurrent neural network layers, the connectionist temporal classification (CTC) algorithm, and the use of a language model. It also covers open speech corpora such as Librivox, and future directions such as alternative network architectures, CTC variants, and expanding to other languages.
3. Outline
Part I Core Architecture
I Deep Speech Architecture
II CTC Algorithm
III Language Model
IV Performance
Part II Future Architectural Variants
I Network Variants
II CTC Variants
Part III Open Speech Corpora
I Open Speech Corpora
II Project Common Voice
Part IV Future Directions
6. Deep Speech Architecture: Overview
Input Features
Feedforward Layers
Bidirectional RNN Layer
Feedforward Layer
Softmax Layer
7. Deep Speech Architecture: Input Features
Mel-Frequency Cepstrum Coefficients
• 16-bit audio input at 16 kHz
• 25 ms audio window every 10 ms
• 26 cepstral coefficients
• Stride of 2
• Context window width 9
• Data “whitened” before use
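The framing and context-window parameters above can be made concrete with a small NumPy sketch. The helper name `stack_context`, the zero padding at the edges, and the reading of “width 9” as nine frames of context on each side are all assumptions, not details taken from the slide:

```python
import numpy as np

SAMPLE_RATE = 16000                   # 16-bit audio at 16 kHz
WIN = int(0.025 * SAMPLE_RATE)        # 25 ms window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)        # 10 ms step  -> 160 samples
N_CEPS = 26                           # cepstral coefficients per frame

def whiten(x):
    """Zero-mean, unit-variance normalization per coefficient."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def stack_context(frames, context=9, stride=2):
    """Stack `context` frames on each side of every kept frame.

    frames: (T, n_ceps) array of MFCC vectors.
    Returns an array of shape (ceil(T / stride), n_ceps * (2*context + 1)).
    """
    T, _ = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)))  # zero-pad edges
    rows = [padded[t:t + 2 * context + 1].reshape(-1)
            for t in range(0, T, stride)]
    return np.stack(rows)
```

With 26 coefficients, a 9-frame context on each side yields input vectors of width 26 × 19 = 494, and the stride of 2 halves the number of time steps.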
8. Deep Speech Architecture: Feedforward Layers
Feedforward Layers
• 3 layers
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
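The clipped ReLU used in these layers is an ordinary ReLU whose output is capped at 20; a one-line NumPy sketch:

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    """ReLU with its output clipped at `clip` (20 in the feedforward layers)."""
    return np.minimum(np.maximum(x, 0.0), clip)
```

The clip bounds activations so that gradients stay well-scaled during training on long utterances.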
9. Deep Speech Architecture: Bidirectional RNN Layer
Bidirectional RNN Layer
• 1 layer
• Layer width 2048
• LSTM cells
• No clipping
• Dropout 0.20 to 0.30
10. Deep Speech Architecture: Feedforward Layer
Feedforward Layer
• 1 layer
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
11. Deep Speech Architecture: Softmax Layer
Softmax Layer
• L ≡ the alphabet
• Output width k ≡ |L| + 1
• The extra output is for the CTC “blank” label
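The output width k follows directly from the alphabet size. For English, assuming a 28-symbol alphabet of a–z, the apostrophe, and space (an assumption, not stated on the slide), k works out to 29:

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")  # assumed 28 symbols
K = len(ALPHABET) + 1                            # +1 for the CTC blank label

def softmax(z):
    """Standard softmax over the last axis, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Each time step thus emits a probability distribution over the 28 characters plus the blank.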
27. Performance: WER
Training Data
• TED (Approx 200 hours)
• Fisher (Approx 2000 hours)
• Librivox (Approx 1000 hours)
On the Librivox clean test set: 6.48% WER
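The WER quoted above is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words; a minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, one substitution in a three-word reference gives a WER of 1/3.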
34. CTC Variants: RNN Transducer
[Figure: final-layer hidden states h1^(5), h2^(5), …, hT^(5); the output combines a path probability with a character probability]
35. CTC Variants: RNN Transducer
[Figure: final-layer hidden states h1^(5), …, hT^(5); the output now combines a path probability, a character probability, and an RNN probability]
36. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: an encoder (BiRNN) produces annotation vectors h1, h2, …, hT; an attention module combines them with the decoder hidden state s_{i−1} into a context vector c_i, which feeds the decoder (RNN)]
37. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: annotation vectors h1 and h4 for the label “a”, each built from the forward states (h^f) and backward states (h^b) of the BiRNN, shown over the CTC-style path a — — a b —]
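A CTC path such as a — — a b — is collapsed to a label sequence by first merging repeated symbols and then dropping blanks; a minimal sketch, using “-” for the blank:

```python
BLANK = "-"

def ctc_collapse(path):
    """Collapse a CTC path: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != BLANK:   # new symbol, and not a blank
            out.append(c)
        prev = c
    return "".join(out)
```

For instance, `ctc_collapse("a--ab-")` yields `"aab"`: the blanks separate the two occurrences of “a”, so they are not merged.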
38. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: successive decoding steps, pairing the 1st context vector with the 0th hidden state and the 2nd context vector with the 1st hidden state, over the output sequence a a b c c c]
39. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: a table of alignment scores e_ij relating decoder hidden states to annotation vectors]
e_ij = a(s_{i−1}, h_j)
● The alignment model a is a feedforward neural network
● Input:
○ s_{i−1}, the decoder hidden state before the ith prediction
○ h_j, the annotation for the jth input character
● Output:
○ e_ij, the logit of the jth annotation for the ith prediction
● α_ij, the normalized weight of the jth annotation for the ith prediction
● c_i, the context vector: the weighted sum of the annotations
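The alignment scores e_ij, normalized weights α_ij, and context vector c_i can be sketched in NumPy. The additive tanh form of the alignment network and all weight shapes are assumptions in the style of Bahdanau-style attention, not details taken from the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(s_prev, H, W_s, W_h, v):
    """One attention step (a sketch; weight shapes are assumed).

    s_prev: (d,)   previous decoder hidden state s_{i-1}
    H:      (T, d) annotation vectors h_1 .. h_T
    W_s, W_h: (k, d) alignment-network weights; v: (k,) scoring vector
    Returns (alpha, c): attention weights alpha_ij and context vector c_i.
    """
    # e_ij = v . tanh(W_s s_{i-1} + W_h h_j) -- the feedforward alignment model
    e = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v
    alpha = softmax(e)          # normalized weights alpha_ij, sum to 1
    c = alpha @ H               # c_i = sum_j alpha_ij * h_j
    return alpha, c
```

The decoder then consumes c_i together with s_{i−1} to produce its next prediction.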