Lookahead Convolution
Bidirectional GRUs are good in terms of accuracy, but they cannot be run online with low latency.
Here W_k^(6) and b_k^(6) denote the k-th column of the weight matrix and the k-th bias, respectively. Once we have computed a prediction for P(c_t | x), we compute the CTC loss [13] L(ŷ, y) to measure the error in prediction. During training, we can evaluate the gradient ∇_ŷ L(ŷ, y) with respect to the network outputs given the ground-truth character sequence y. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use Nesterov's Accelerated gradient method for training [41].
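As a rough illustration of the training step described in the quoted passage, the following PyTorch sketch computes per-frame character scores, the CTC loss, and a parameter update with Nesterov's accelerated gradient (plain SGD with nesterov=True). The stand-in network, feature dimension, sequence lengths, and hyperparameters are placeholder assumptions, not the paper's configuration.

```python
# Hedged sketch of the described training step: P(c_t | x) -> CTC loss -> backprop -> Nesterov SGD.
import torch
import torch.nn as nn

T, B, C = 200, 8, 29                              # frames, batch size, characters (blank = index 0)
model = nn.GRU(input_size=161, hidden_size=C)     # stand-in for the full network (assumption)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.99, nesterov=True)

x = torch.randn(T, B, 161)                        # spectrogram features (assumed shape)
y = torch.randint(1, C, (B, 30))                  # ground-truth character ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 30, dtype=torch.long)

logits, _ = model(x)                              # (T, B, C): per-frame character scores
loss = ctc(logits.log_softmax(-1), y, input_lengths, target_lengths)  # CTC loss L(y_hat, y)
opt.zero_grad()
loss.backward()                                   # back-propagation through the network
opt.step()                                        # Nesterov-accelerated gradient update
```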
(Figure: a bidirectional RNN unrolled over time steps t1, t2, …, tn)
Computing the output at any time step requires all of the inputs from t1 through tn.
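The lookahead (row) convolution avoids this by replacing the backward recurrence with a learned, per-channel linear combination of the current frame and a small window of future frames, so only a fixed amount of future context is needed. Below is a minimal NumPy sketch; the shapes and the window size tau are illustrative assumptions.

```python
# Hedged sketch of a lookahead (row) convolution:
# r[t, i] = sum_j W[j, i] * h[t + j, i], j = 0 .. tau (no mixing across channels).
import numpy as np

def lookahead_convolution(h, W):
    """h: (T, d) forward-RNN activations; W: (tau + 1, d) per-channel weights."""
    T, d = h.shape
    context = W.shape[0]                                 # current frame + tau future frames
    h_pad = np.vstack([h, np.zeros((context - 1, d))])   # zero-pad the end of the sequence
    r = np.zeros_like(h)
    for t in range(T):
        r[t] = (W * h_pad[t:t + context]).sum(axis=0)    # channel-wise weighted sum
    return r

h = np.random.randn(100, 256)                            # 100 frames, 256 channels (assumed)
W = np.random.randn(3, 256)                              # tau = 2 future frames (assumed)
print(lookahead_convolution(h, W).shape)                 # (100, 256)
```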
Decoder Scoring
At inference time, CTC models are paired with a language model trained on a bigger corpus of text. We use a specialized beam search (Hannun et al., 2014b) to find the transcription y that maximizes

Q(y) = log(p_RNN(y | x)) + α log(p_LM(y)) + β wc(y)

where wc(y) is the number of words (English) or characters (Chinese) in the transcription y. The weight α controls the relative contributions of the language model and the CTC network. The weight β encourages more words in the transcription. These parameters are tuned on a held-out development set.
Reading Q(y) term by term:
- Q(y): score of the transcription (character string)
- log(p_RNN(y | x)): probability of the character string output by the neural network
- log(p_LM(y)): score from the language model
- wc(y): word count
- α and β are tuned according to the training data
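As a small illustration of how the three terms of Q(y) combine when re-ranking beam-search hypotheses, here is a hedged Python sketch; the helper name, candidate tuples, and the α/β values are made up for the example, not the paper's tuned settings.

```python
# Hedged sketch of the decoder score Q(y):
# acoustic log-probability + alpha * LM log-probability + beta * word count.
def transcription_score(log_p_rnn, log_p_lm, y, alpha=0.8, beta=1.2):
    wc = len(y.split())                      # words for English (characters for Chinese)
    return log_p_rnn + alpha * log_p_lm + beta * wc

# Re-rank a few hypothetical beam-search hypotheses: (y, log p_RNN(y|x), log p_LM(y)).
candidates = [("hello world", -12.3, -4.1), ("hello word", -11.9, -6.0)]
best = max(candidates, key=lambda c: transcription_score(c[1], c[2], c[0]))
print(best[0])
```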
(Sequence-wise) Batch Normalization
Sequence-wise Batch Normalization is used for regularization.
Batch Normalization is applied only to the input coming from the layer below; it is not applied to the horizontal (recurrent) input.
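A minimal NumPy sketch of this idea, assuming a simple tanh RNN cell and illustrative shapes: normalization statistics are taken over both the minibatch and the time dimension (sequence-wise), and only the bottom-up term W·h_below is normalized, never the recurrent term U·h_prev.

```python
# Hedged sketch of sequence-wise BatchNorm in an RNN layer (shapes are assumptions).
import numpy as np

def sequence_batch_norm(x, gamma, beta, eps=1e-5):
    """x: (T, B, d). Mean/variance are computed over all T*B items of the minibatch."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_layer_with_seq_bn(h_below, W, U, gamma, beta):
    """h_below: (T, B, d_in) activations of the layer below; W: (d, d_in); U: (d, d)."""
    T, B, _ = h_below.shape
    d = U.shape[0]
    bottom = sequence_batch_norm(h_below @ W.T, gamma, beta)  # BN only on the bottom-up input
    h = np.zeros((T, B, d))
    h_prev = np.zeros((B, d))
    for t in range(T):
        h_prev = np.tanh(bottom[t] + h_prev @ U.T)            # recurrent term left un-normalized
        h[t] = h_prev
    return h

T, B, d_in, d = 50, 4, 128, 64
out = rnn_layer_with_seq_bn(np.random.randn(T, B, d_in), np.random.randn(d, d_in),
                            np.random.randn(d, d) * 0.01, np.ones(d), np.zeros(d))
print(out.shape)  # (50, 4, 64)
```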
SortaGrad
Curriculum learning
- Early in CTC training, the model tends to output blank strings, so the loss easily becomes very large.
- Sort the training data by sequence length and train on the shorter audio clips first (see the sketch below).
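A hedged sketch of the SortaGrad curriculum, assuming a dataset of examples that expose a frame count (the key name and the first-epoch-only detail follow the paper's description, everything else is illustrative):

```python
# Hedged sketch of SortaGrad: shortest utterances first in the first epoch,
# random order afterwards.
import random

def sortagrad_order(dataset, epoch, length_fn=lambda ex: ex["num_frames"]):
    if epoch == 0:
        return sorted(dataset, key=length_fn)   # short clips first: smaller early CTC loss
    shuffled = list(dataset)
    random.shuffle(shuffled)
    return shuffled

# for epoch in range(num_epochs):
#     for example in sortagrad_order(train_set, epoch):
#         train_step(example)
```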
Architecture                   Baseline  BatchNorm  GRU
5-layer, 1 RNN                    13.55      14.40  10.53
5-layer, 3 RNN                    11.61      10.56   8.00
7-layer, 5 RNN                    10.77       9.78   7.79
9-layer, 7 RNN                    10.83       9.52   8.19
9-layer, 7 RNN, no SortaGrad      11.96       9.78      -

Table 1: Comparison of WER (%) on a development set as we vary depth of RNN, application of BatchNorm and SortaGrad, and type of recurrent hidden unit. All networks have …
(Figure: training cost (y-axis: Cost) vs. iteration (x-axis: Iteration, ×10…), with and without SortaGrad)
Without SortaGrad, accuracy is worse.
Supplement: arXiv version
Topics covered in the arXiv version (https://arxiv.org/abs/1512.02595) but not in the ICML version:
- Striding (in convolution)
- Language Modeling
- Scalability and Data parallelism
- Memory allocation
- Node and cluster architecture
- GPU Implementation of CTC Loss Function
- Batch Dispatch
- Data Augmentation
- Beam Search