
Speech Recognition and Deep Learning

Slides from the PFI Seminar "Deep Learning and Speech Recognition" held on 1/16.

Published in: Technology

Speech Recognition and Deep Learning

  1. 1. Deep Learning and Speech Recognition. PFI Seminar, 2015/7/16. Preferred Infrastructure, Inc. Jiro Nishitoba
  2. 2. Self-introduction. Jiro Nishitoba (西鳥羽 二郎). Affiliation: Preferred Infrastructure, Product Division. Work: product deployment support, pre-sales, support, research and development.
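  3. 3. Last time: "Deep Learning and Natural Language Processing" (http://research.preferred.jp/2014/12/deep-learning-nlp/), which introduced what is being done with Deep Learning in natural language processing. It drew positive reactions on Twitter. Many acquaintances told me, "The only thing I learned is that servers break whenever you go on a trip."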
  4. 4. Speech recognition methods based on Deep Learning. This talk introduces what is being done with Deep Learning in speech recognition: improved accuracy, and a simpler structure.
  5. 5. Improved accuracy
      Dataset                | Before DNN | DNN   | CNN/RNN
      TIMIT (PER)            | 24.8%      | 23.0% | 17.7%
      SwitchBoard Bank (WER) | 23.6%      | 15.8% | 8.0%
      Voice Search (SER)     | 36.2%      | 30.1% |
      PER (Phoneme Error Rate): error rate when comparing phoneme sequences.
      WER (Word Error Rate): error rate when comparing words after conversion to text.
      SER (Sentence Error Rate): error rate when comparing whole sentences after conversion to text.
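The three error rates on this slide can be illustrated with a short sketch. It assumes the usual definition of PER and WER as edit distance (substitutions, insertions, deletions) divided by the reference length, and SER as the fraction of sentences that are not an exact match. The function names are hypothetical and do not come from the slides.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length (PER if tokens are phonemes)."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def word_error_rate(ref, hyp):
    """WER for whitespace-tokenized text."""
    return error_rate(ref.split(), hyp.split())


def sentence_error_rate(refs, hyps):
    """Fraction of sentences whose transcript is not an exact match."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words.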
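  6. 6. Conventional speech recognition. [Diagram] Pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. Component labels: MFCC / FMLLR; Gaussian Mixture Model; Hidden Markov Model; context-free grammar; language model.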
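  7. 7. Speech recognition with DNNs. [Diagram] Pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. Component labels: MFCC / FMLLR; Gaussian Mixture Model; Hidden Markov Model; context-free grammar; language model. Added labels: Deep Neural Network; Convolutional Neural Network.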
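  8. 8. Speech recognition with DNNs. [Diagram] Pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. Component labels: MFCC / FMLLR; Gaussian Mixture Model; Hidden Markov Model; context-free grammar; language model. Added labels: Deep Neural Network; Convolutional Neural Network; Recurrent Neural Network.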
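  9. 9. The IBM 2015 English Conversational Telephone Speech Recognition System [Saon, et al., 2015]. [Diagram] Pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. Component labels: FMLLR (feature-space Maximum Likelihood Linear Regression); Gaussian Mixture Model; Hidden Markov Model; context-free grammar; language model. Added labels: a network combining DNN and CNN; RNN language model.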
  10. 10. Network configuration. Two neural network branches: one that contains a CNN, and one with linear layers only. The output is the HMM state. [Excerpt from the paper:] "... resulting matrix by the number of models (assuming uniform weights). An example of a joint CNN/DNN model initialized in such a way is illustrated in Figure 1. For convenience, we have indicated the sizes of the weight matrices in the oval boxes and the dimensionality of the layers is attached to the arrows." [Figure 1 of the paper: the DNN, the CNN, and the joint CNN/DNN model.]
  11. 11. Language model. Text is produced from the phoneme sequence using a language model: n(4)-gram, model M, NNLM. Only transcripts of speech data are used as training data for the language model: Switchboard, Fisher corpus, Callhome.
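As a rough illustration of the n-gram entry on this slide, the sketch below estimates n-gram probabilities from transcripts by plain relative frequency. It is a simplification under stated assumptions: there is no smoothing or back-off (which a practical model would need), model M and the NNLM are not covered, and the function names and the `<s>` / `</s>` padding tokens are hypothetical.

```python
from collections import Counter


def train_ngram(sentences, n=4):
    """Count n-grams and their (n-1)-word contexts over tokenized sentences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def ngram_prob(model, context, word):
    """Unsmoothed maximum-likelihood estimate of P(word | context)."""
    ngram_counts, context_counts = model
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]
```

With `n=4` the context is the previous three words, which matches the "n(4)-gram" setting named on the slide.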
  12. 12. Accuracy comparison. [Excerpt from the paper:] "... SWB (8.8% to 8.0%) and 1.2% on CallHome (15.3% to 14.1%). Lastly, in Table 6 we compare our results with those obtained by various other systems from the literature. For clarity, we also specify the type of training data that was used for acoustic modeling in each case."
      System             | AM training data | SWB  | CH
      Vesely et al. [8]  | SWB              | 12.6 | 24.1
      Seide et al. [9]   | SWB+Fisher+other | 13.1 | –
      Hannun et al. [10] | SWB+Fisher       | 12.6 | 19.3    (slide note: Deep Speech, described later)
      Zhou et al. [11]   | SWB              | 14.2 | –
      Maas et al. [12]   | SWB              | 14.3 | 26.0
      Maas et al. [12]   | SWB+Fisher       | 15.0 | 23.0
      Soltau et al. [13] | SWB              | 10.4 | 19.1*
      This system        | SWB+Fisher+CH    | 8.0  | 14.1    (slide note: proposed method)
      Table 6: Comparison of word error rates on Hub5'00 (SWB ...
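  13. 13. Discriminative Method for Recurrent Neural Network Language Models [Tachioka, et al., 2015]. [Diagram] Pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. Component labels: MFCC; Gaussian Mixture Model; Hidden Markov Model; context-free grammar; language model. Added labels: DNN; RNN language model.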
  14. 14. Discriminative Method for Recurrent Neural Network Language Models [Tachioka, et al., 2015]. WER on the test sets of the Corpus of Spontaneous Japanese (CSJ): baseline 11.31% -> 10.49%. On the dataset called E2, 9.84% was achieved.
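  15. 15. Simpler architecture. Looking at other fields, Deep Learning shows its strength where it handles intermediate states automatically: multimodal models of images and captions, bilingual language processing, translation. The same thing is happening in speech recognition: generating phoneme sequences from audio data without going through an HMM, and generating text directly from audio data.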
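  16. 16. End-to-end. [Diagram] Conventional pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. End-to-end pipeline: audio data → feature vector sequence → phoneme sequence → text, with an RNN.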
  17. 17. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results [Chorowski, et al., 2014]. Generates a phoneme sequence from audio data; no intermediate states such as HMMs or automata are generated to produce the phoneme sequence. Uses an attention mechanism. Encoder: reads the audio data. Decoder: generates the phoneme sequence.
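  18. 18. Attention Mechanism. [Figure 1: graphical illustration with inputs $x_1, x_2, x_3, \ldots, x_T$, annotations $h_1, h_2, h_3, \ldots, h_T$, weights $\alpha_{t,1}, \alpha_{t,2}, \alpha_{t,3}, \ldots, \alpha_{t,T}$, decoder states $s_{t-1}, s_t$ and outputs $y_{t-1}, y_t$.] [Cropped excerpt:] "... we define each conditional probability ... $p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$ (4) ... hidden state for time $i$, computed by $s_i = f(s_{i-1}, y_{i-1}, c_i)$. ... unlike the existing encoder–decoder approach ... the probability is conditioned on a distinct ... target word $y_i$. ... depends on a sequence of annotations ... an encoder maps the input sentence. Each ... information about the whole input sequence ... the parts surrounding the $i$-th word of the ... in detail how the annotations are computed ..." Slide annotations: Encoder: an RNN that reads the input sequence. Context: the information read by the encoder is embedded here. Decoder: an RNN that generates the output using the information in the context.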
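How the context is formed from the annotations $h_1, \ldots, h_T$ and the weights $\alpha_{t,1}, \ldots, \alpha_{t,T}$ can be sketched as follows. This is only an illustration: the scoring function is assumed to be a plain dot product between the previous decoder state and each annotation, whereas published models typically learn the scoring function, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(prev_state, annotations):
    """Return (weights, context) for one decoder step.

    prev_state:  decoder state s_{t-1}, a list of floats
    annotations: encoder outputs h_1..h_T, each a list of the same length
    """
    # Assumed scoring function: dot product of s_{t-1} with each h_j.
    scores = [sum(s * a for s, a in zip(prev_state, h_j)) for h_j in annotations]
    weights = softmax(scores)  # alpha_{t,1..T}, non-negative and summing to 1
    dim = len(annotations[0])
    # The context is the weighted sum of the annotations.
    context = [sum(w * h_j[d] for w, h_j in zip(weights, annotations))
               for d in range(dim)]
    return weights, context
```

The decoder would then consume `context` together with its previous output when computing the next state.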
  19. 19. Network structure. [Text boxes from the paper's figure:] Input sequence: frames of 40 fMLLR features. Deep Maxout network: reads 11 frames (440 features) and uses 3 hidden layers of 1024 maxout units, each using 5 filters. BiRNN: the input is 1024 features per frame; each recurrent layer has 512 hidden units, thus the annotation is 1024-dimensional. Encoder RNN: computes an annotation for each input frame. Context: a score is computed to match the previous hidden state to all input annotations; the context is a weighted combination of the most closely matching annotations. The BiRNN is used to initialize the first state of the decoder. Decoder RNN: recurrently predicts the next phoneme; input annotations are accessed through a context computed separately for each output. The output predictions are computed with a Maxout network using two filters per unit. "Figure 1: Proposed model architecture. The system contains three parts: an encoder that computes annotations of input frames (learned features that may depend on the whole sequence), an attention ..." Slide annotations: audio data; MLP; Bidirectional RNN; Context: a vector embedding the encoder-side information; RNN; MLP.
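  20. 20. End-to-end. [Diagram] Conventional pipeline: audio data → feature vector sequence → phoneme sequence (with repetitions) → phoneme sequence → text. End-to-end pipeline: audio data → feature vector sequence → text, with an RNN.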
  21. 21. Deep Speech: Scaling up end-to-end speech recognition [Hannun, et al., 2014]. Presented by a group at Baidu Research. Generates text directly from audio data without going through a phoneme sequence, using the Connectionist Temporal Classification function. Achieved a Word Error Rate of 12.6% on Switchboard.
  22. 22. Network configuration. [Excerpt from the paper:] "Once we have computed a prediction for $P(c_t|x)$, we compute the CTC loss [13] $L(\hat{y}, y)$ to measure the error in prediction. During training, we can evaluate the gradient $\nabla_{\hat{y}} L(\hat{y}, y)$ with respect to the network outputs given the ground-truth character sequence $y$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use Nesterov's Accelerated gradient method for training [41]." "Figure 1: Structure of our RNN model and notation." Slide annotations, from input to output: Log filterbank; MLP with clipped ReLU; Bidirectional RNN with clipped ReLU; Softmax; CTC loss function.
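The "clipped ReLU" named on this slide is a ReLU whose output is capped at an upper bound. A minimal sketch follows; the cap of 20 is an assumed default for illustration (the slide does not state a value), and the layer helper is hypothetical.

```python
def clipped_relu(z, cap=20.0):
    """ReLU with its output limited to the range [0, cap]. The cap is an assumed value."""
    return min(max(0.0, z), cap)


def dense_clipped_relu(inputs, weights, biases, cap=20.0):
    """One fully connected layer followed by the clipped ReLU.

    weights holds one row of input weights per output unit.
    """
    return [clipped_relu(sum(w * x for w, x in zip(row, inputs)) + b, cap)
            for row, b in zip(weights, biases)]
```

Capping the activation keeps values bounded as they are passed along the recurrent connections.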
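  23. 23. Connectionist Temporal Classification (CTC) [Graves, et al., 2006]. A loss function used when the input and output sequences differ in length. It can be applied to the output of any RNN, LSTM, and so on. It introduces a blank symbol and computes the probability of generating the correct string in order. Example paths for "CAT": _C_A_T_ and ____CCCCA___TT. Example paths for "aab": a_ab_ and _aa__abb.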
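The mapping from a frame-level path to its output string in the examples above can be sketched as: merge consecutive repeated symbols, then drop blanks. The function name is hypothetical, and `_` is used as the blank symbol as on the slide.

```python
from itertools import groupby

BLANK = "_"


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to its label string: merge repeats, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)
```

Because repeats are merged before blanks are removed, a doubled letter such as the "aa" in "aab" can only be produced when a blank separates the two occurrences, as in `a_ab_`.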
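  24. 24. Connectionist Temporal Classification. [Figure from "Chapter 7. Connectionist Temporal Classification".] Slide annotations: occurrence probability at each time step; time t; black circles are blank symbols; transitions.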
  25. 25. Connectionist Temporal Classification. [Figure with one path highlighted:] _C__AT
  26. 26. Cost function of Connectionist Temporal Classification. Cost function: the negative log-likelihood of the total occurrence probability over all paths.
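The cost on this slide sums over every path that collapses to the target, which is usually computed with a forward recursion instead of enumerating paths. The sketch below is one way to write that recursion under stated assumptions: `probs[t][k]` is the softmax output for label `k` at time `t`, label 0 is the blank, and no log-space scaling is applied, so it is only suitable for short sequences. The function name is hypothetical.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Negative log of the summed probability of all paths that collapse to target."""
    # Extended label sequence with blanks around every label: _ z1 _ z2 _ ... zU _
    ext = [blank]
    for label in target:
        ext += [label, blank]
    n_steps, n_states = len(probs), len(ext)

    # alpha[s]: total probability of path prefixes that end in state s at time t.
    alpha = [0.0] * n_states
    alpha[0] = probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, n_steps):
        new = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                  # stay in the same state
            if s >= 1:
                total += alpha[s - 1]         # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or in the trailing blank.
    p = alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)
    return -math.log(p) if p > 0.0 else math.inf
```

For a handful of time steps the result can be checked against a brute-force sum over all label paths.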
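  27. 27. Gradient in Connectionist Temporal Classification. [Cropped excerpt:] "... $\frac{\partial y^t_{k'}}{\partial a^t_k} = y^t_{k'}\delta_{kk'} - y^t_{k'} y^t_k$ ... substitute (7.33) and (7.31) into (7.32) to obtain $\frac{\partial L(x, z)}{\partial a^t_k} = y^t_k - \frac{1}{p(z|x)} \sum_{u \in B(z,k)} \alpha(t, u)\,\beta(t, u)$ ... the 'error signal' backpropagated through the network during training ... illustrated in Figure 7.4. Decoding: ... network is trained, we would ideally label some unknown input ... by choosing the most probable labelling $l^*$: $l^* = \arg\max_l p(l|x)$" Slide annotations: the gradient at time t, output A; the value at time t, output k; the total occurrence probability of all paths; the total occurrence probability of all paths passing through the point for time t, output k.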
  28. 28. Training with Connectionist Temporal Classification. After the gradient (delta) at each time step and output has been computed, training proceeds with Back Propagation Through Time (BPTT).
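  29. 29. At application time. The Bidirectional RNN part is applied as is. After the outputs for all time steps have been computed, the string with the highest occurrence probability is found by dynamic programming while taking the language model into account. [Cropped excerpt from the paper:] "... examples of transcriptions directly from the RNN (left) with errors that ... model (right). ... $P(c|x)$ of our RNN we perform a search to find the sequence of characters ... probable according to both the RNN output and the language model (... the string of characters as words). Specifically, we aim to find ... combined objective: $Q(c) = \log(P(c|x)) + \alpha \log(P_{lm}(c)) + \beta\,\mathrm{word\_count}(c)$, where $\alpha$ and $\beta$ are tunable parameters (set by cross-validation) that control the ... language model constraint and the length of the sentence. The term ... the sequence $c$ according to the N-gram model. We maximize this ... beam search algorithm, with a typical beam size in the range 1... described by Hannun et al. [16]." Slide annotations on the three terms: occurrence probability; probability under the language model; word count.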
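The combined objective $Q(c)$ can be illustrated by scoring a fixed list of candidate transcriptions. This is a sketch under stated assumptions, not the search described on the slide: it only rescores candidates that are already given, the log-probabilities in the usage example are made up for illustration, and the function names are hypothetical.

```python
def combined_score(log_p_rnn, log_p_lm, word_count, alpha, beta):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c)."""
    return log_p_rnn + alpha * log_p_lm + beta * word_count


def best_candidate(candidates, alpha=1.0, beta=1.0):
    """Pick the best transcription from (text, log_p_rnn, log_p_lm) tuples."""
    def score(candidate):
        text, log_p_rnn, log_p_lm = candidate
        return combined_score(log_p_rnn, log_p_lm, len(text.split()), alpha, beta)
    return max(candidates, key=score)[0]
```

For instance, with the illustrative candidates `[("i sea", -1.0, -8.0), ("i see", -1.2, -2.0)]`, setting `alpha=0.0` selects "i sea" on acoustic evidence alone, while `alpha=1.0` lets the language model term favor "i see".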
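  30. 30. Summary. In speech recognition, accuracy has improved through the use of Deep Learning: Switchboard WER 23.6% -> 8.0%. Recently, end-to-end methods have been proposed: the DNN takes care of intermediate states such as phoneme sequences, and getting started with speech processing has become easier.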
  31. 31. Copyright © 2006-2015 Preferred Infrastructure All Rights Reserved.
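  32. 32. Feature extraction becomes simpler. MFCC (Mel Frequency Cepstral Coefficient): (1) Split the audio data into frames (typically 20 ms to 40 ms). (2) Apply the discrete Fourier transform to the data of each frame. (3) Apply a Mel filterbank, that is, filters based on the mel scale (a scale reflecting human auditory characteristics). (4) Take the logarithm. (5) Apply the discrete cosine transform. (6) Keep the 12 lowest-order coefficients.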
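  33. 33. Feature extraction becomes simpler. Log Filterbank: (1) Split the audio data into frames (typically 20 ms to 40 ms). (2) Apply the discrete Fourier transform to the data of each frame. (3) Apply a Mel filterbank, that is, filters based on the mel scale (a scale reflecting human auditory characteristics). (4) Take the logarithm. (5) Apply the discrete cosine transform. (6) Keep the 12 lowest-order coefficients.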
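  34. 34. Feature extraction becomes simpler. Mel Filterbank: (1) Split the audio data into frames (typically 20 ms to 40 ms). (2) Apply the discrete Fourier transform to the data of each frame. (3) Apply a Mel filterbank, that is, filters based on the mel scale (a scale reflecting human auditory characteristics). (4) Take the logarithm. (5) Apply the discrete cosine transform. (6) Keep the 12 lowest-order coefficients.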
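The six steps listed on the last three slides can be sketched in plain Python. In this sketch the Mel filterbank features are the output of step 3, the log filterbank features are the output of step 4, and the MFCCs are the output of step 6. Several details are assumptions and not from the slides: the mel formula `2595 * log10(1 + f / 700)`, the 25 ms frame and 10 ms hop, the triangular filter shape, the absence of a window function, and keeping coefficients 0 to 11. The naive DFT is slow and only meant for short test signals.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def split_frames(signal, frame_len, hop):
    """Step 1: cut the signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Step 2: naive discrete Fourier transform, returned as a power spectrum."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + (high - low) * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):        # rising slope
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):       # falling slope
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def dct2(values, n_out):
    """Step 5: unnormalized DCT-II, returning the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_out)]


def extract_features(signal, sample_rate, frame_ms=25, hop_ms=10,
                     n_filters=26, n_ceps=12):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for frame in split_frames(signal, frame_len, hop):                    # step 1
        spec = power_spectrum(frame)                                      # step 2
        mel = [sum(f * p for f, p in zip(filt, spec)) for filt in bank]   # step 3
        log_mel = [math.log(e + 1e-10) for e in mel]                      # step 4
        mfcc = dct2(log_mel, n_ceps)                                      # steps 5 and 6
        features.append({"mel": mel, "log_mel": log_mel, "mfcc": mfcc})
    return features
```

Each returned dictionary holds the three feature types for one frame, so the difference between them is simply how far down the list of steps the computation goes.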
