The document discusses DeepSpeech, an open-source speech recognition system that achieves a word error rate (WER) of less than 10%. It describes the core architecture, including feedforward and bidirectional recurrent neural network layers, the connectionist temporal classification (CTC) algorithm, and the use of a language model. It also covers open speech corpora such as Librivox, and future directions such as alternative network architectures, CTC variants, and expanding to other languages.
3. Outline
Part I Core Architecture
I Deep Speech Architecture
II CTC Algorithm
III Language Model
IV Performance
Part II Future Architectural Variants
I Network Variants
II CTC Variants
Part III Open Speech Corpora
I Open Speech Corpora
II Project Common Voice
Part IV Future Directions
6. Deep Speech Architecture: Overview
Input Features
Feedforward Layers
Bidirectional RNN Layer
Feedforward Layer
Softmax Layer
7. Deep Speech Architecture: Input Features
Mel-Frequency Cepstrum Coefficients
• 16-bit audio input at 16 kHz
• 25 ms audio window every 10 ms
• 26 cepstral coefficients
• Stride of 2
• Context window width 9
• Data “whitened” before use
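The framing and context-window parameters above can be made concrete with a small NumPy sketch. The helper name `stack_context`, the zero padding at the edges, and the reading of “width 9” as nine frames of context on each side are all assumptions, not details taken from the slide:

```python
import numpy as np

SAMPLE_RATE = 16000                   # 16-bit audio at 16 kHz
WIN = int(0.025 * SAMPLE_RATE)        # 25 ms window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)        # 10 ms step  -> 160 samples
N_CEPS = 26                           # cepstral coefficients per frame

def whiten(x):
    """Zero-mean, unit-variance normalization per coefficient."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def stack_context(frames, context=9, stride=2):
    """Stack `context` frames on each side of every kept frame.

    frames: (T, n_ceps) array of MFCC vectors.
    Returns an array of shape (ceil(T / stride), n_ceps * (2*context + 1)).
    """
    T, _ = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)))  # zero-pad edges
    rows = [padded[t:t + 2 * context + 1].reshape(-1)
            for t in range(0, T, stride)]
    return np.stack(rows)
```

With 26 coefficients, a 9-frame context on each side yields input vectors of width 26 × 19 = 494, and the stride of 2 halves the number of time steps.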
8. Deep Speech Architecture: Feedforward Layers
Feedforward Layers
• 3 layers
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
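The clipped ReLU used in these layers is an ordinary ReLU whose output is capped at 20; a one-line NumPy sketch:

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    """ReLU with its output clipped at `clip` (20 in the feedforward layers)."""
    return np.minimum(np.maximum(x, 0.0), clip)
```

The clip bounds activations so that gradients stay well-scaled during training on long utterances.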
9. Deep Speech Architecture: Bidirectional RNN Layer
Bidirectional RNN Layer
• 1 layer
• Layer width 2048
• LSTM cells
• No clipping
• Dropout 0.20 to 0.30
10. Deep Speech Architecture: Feedforward Layer
Feedforward Layer
• 1 layer
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
11. Deep Speech Architecture: Softmax Layer
Softmax Layer
• L ≡ the alphabet
• Output width k ≡ |L| + 1
• The extra output is for the CTC “blank” label
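The output width k follows directly from the alphabet size. For English, assuming a 28-symbol alphabet of a–z, the apostrophe, and space (an assumption, not stated on the slide), k works out to 29:

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")  # assumed 28 symbols
K = len(ALPHABET) + 1                            # +1 for the CTC blank label

def softmax(z):
    """Standard softmax over the last axis, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Each time step thus emits a probability distribution over the 28 characters plus the blank.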
27. Performance: WER
Training Data
• TED (Approx 200 hours)
• Fisher (Approx 2000 hours)
• Librivox (Approx 1000 hours)
On the Librivox clean test set: 6.48% WER
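The WER quoted above is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words; a minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, one substitution in a three-word reference gives a WER of 1/3.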
34. CTC Variants: RNN Transducer
[Figure: final-layer hidden states h1^(5), h2^(5), …, hT^(5); the output combines a path probability with a character probability]
35. CTC Variants: RNN Transducer
[Figure: final-layer hidden states h1^(5), …, hT^(5); the output now combines a path probability, a character probability, and an RNN probability]
36. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: an encoder (BiRNN) produces annotation vectors h1, h2, …, hT; an attention module combines them with the decoder hidden state s_{i−1} into a context vector c_i, which feeds the decoder (RNN)]
37. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: annotation vectors h1 and h4 for the label “a”, each built from the forward states (h^f) and backward states (h^b) of the BiRNN, shown over the CTC-style path a — — a b —]
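A CTC path such as a — — a b — is collapsed to a label sequence by first merging repeated symbols and then dropping blanks; a minimal sketch, using “-” for the blank:

```python
BLANK = "-"

def ctc_collapse(path):
    """Collapse a CTC path: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != BLANK:   # new symbol, and not a blank
            out.append(c)
        prev = c
    return "".join(out)
```

For instance, `ctc_collapse("a--ab-")` yields `"aab"`: the blanks separate the two occurrences of “a”, so they are not merged.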
38. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: successive decoding steps, pairing the 1st context vector with the 0th hidden state and the 2nd context vector with the 1st hidden state, over the output sequence a a b c c c]
39. CTC Variants: Sequence-to-Sequence Model with Attention
[Figure: a table of alignment scores e_ij relating decoder hidden states to annotation vectors]
e_ij = a(s_{i−1}, h_j)
● The alignment model a is a feedforward neural network
● Input:
○ s_{i−1}, the decoder hidden state before the ith prediction
○ h_j, the annotation for the jth input character
● Output:
○ e_ij, the logit of the jth annotation for the ith prediction
● α_ij, the normalized weight of the jth annotation for the ith prediction
● c_i, the context vector: the weighted sum of the annotations
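The alignment scores e_ij, normalized weights α_ij, and context vector c_i can be sketched in NumPy. The additive tanh form of the alignment network and all weight shapes are assumptions in the style of Bahdanau-style attention, not details taken from the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(s_prev, H, W_s, W_h, v):
    """One attention step (a sketch; weight shapes are assumed).

    s_prev: (d,)   previous decoder hidden state s_{i-1}
    H:      (T, d) annotation vectors h_1 .. h_T
    W_s, W_h: (k, d) alignment-network weights; v: (k,) scoring vector
    Returns (alpha, c): attention weights alpha_ij and context vector c_i.
    """
    # e_ij = v . tanh(W_s s_{i-1} + W_h h_j) -- the feedforward alignment model
    e = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v
    alpha = softmax(e)          # normalized weights alpha_ij, sum to 1
    c = alpha @ H               # c_i = sum_j alpha_ij * h_j
    return alpha, c
```

The decoder then consumes c_i together with s_{i−1} to produce its next prediction.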