
WaveNet

An introduction to WaveNet, presented at TIS-Albert seminar.


  1. WaveNet: A Generative Model for Raw Audio. TIS + Albert study group, 2017/01/24. 最上嗣生 (Tsuguo Mogami), tsuguo_mogami@albert2005.co.jp
  2. Why? • Autoregressive models (e.g. PixelCNN) have been very successful. • → So what about audio? • And we would like to do it with CNNs, which are more efficient than RNNs.
  3. Contributions • Speech synthesis of unprecedented quality. • An architecture that uses dilated convolutions to stay efficient despite a large receptive field. • (Speech recognition, too.)
  4. What is a dilated convolution? https://github.com/vdumoulin/conv_arithmetic Roughly speaking, when you really want a filter with a large kernel size, a dilated convolution approximates the result of that large kernel without increasing the amount of computation.
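A minimal NumPy sketch of a kernel-size-2 dilated causal convolution (kernel size 2 is the guess made on slide 12, not stated explicitly in the paper); the function name and the left zero-padding are my own choices:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zeros where t - dilation < 0,
    so no output ever depends on future samples."""
    x_pad = np.concatenate([np.zeros(dilation), x])               # left-pad by `dilation`
    return w[0] * x_pad[:-dilation] + w[1] * x_pad[dilation:]     # same length as x

x = np.random.randn(16)
y = dilated_causal_conv1d(x, w=np.array([0.5, 0.5]), dilation=4)  # y[t] mixes x[t] and x[t-4]
```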
  5. Stack of dilated causal convolutional layers. This is a conceptual picture of how the receptive field grows; the actual model is a repetition of ResNet-style blocks.
  6. Repetition structure: 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512. The 1…512 block is suspected to be repeated 16 times.
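As a sanity check on the receptive field, with kernel size 2 each layer with dilation d adds d samples of context, so one 1, 2, 4, …, 512 block covers 1024 samples; a small calculation under that assumption:

```python
def receptive_field(num_blocks, dilations=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Receptive field (in samples) of num_blocks stacked dilation blocks,
    assuming kernel size 2: each layer with dilation d adds d samples of context."""
    return 1 + num_blocks * sum(dilations)

for blocks in (1, 3, 4, 6, 16):
    samples = receptive_field(blocks)
    print(blocks, samples, round(1000 * samples / 16000), "ms at 16 kHz")
```

Under this assumption four repeated blocks give roughly 256 ms at 16 kHz, which is close to the 240 ms receptive field quoted for TTS on slide 14.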
  7. 7. Autoregression https://deepmind.com/blog/wavenet-generative-model-raw-audio/
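Autoregression here means the model factorizes p(x) = ∏_t p(x_t | x_1, …, x_{t-1}) and generates one sample at a time, feeding each new sample back in. A toy sketch of that loop; `predict_distribution` is a stand-in for the whole network, not a real API:

```python
import numpy as np

def sample_autoregressively(predict_distribution, num_samples, num_classes=256):
    """Draw one quantized audio sample at a time, feeding it back as input."""
    generated = []
    for _ in range(num_samples):
        probs = predict_distribution(generated)        # softmax over 256 classes, given the past
        nxt = np.random.choice(num_classes, p=probs)   # sample the next value
        generated.append(nxt)
    return np.array(generated)

# e.g. a dummy model that ignores the past and returns a uniform distribution
uniform = lambda past: np.ones(256) / 256
x = sample_autoregressively(uniform, num_samples=100)
```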
  8. The residual block and the entire architecture. The figure is a bit hard to follow, so I redraw it in more conventional notation.
  9. (Figure: the residual block and overall architecture, redrawn in conventional notation.)
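A rough single-channel sketch of one such residual block, combining the dilated causal convolution above with the gate described on the next slide, plus residual and skip connections; the real model uses many channels and 1×1 convolutions, so treat this only as the control flow:

```python
import numpy as np

def causal_conv(x, w, d):
    """Kernel-size-2 dilated causal convolution (same as the earlier sketch)."""
    x_pad = np.concatenate([np.zeros(d), x])
    return w[0] * x_pad[:-d] + w[1] * x_pad[d:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_block(x, w_f, w_g, dilation):
    """One WaveNet-style block on a single-channel signal."""
    z = np.tanh(causal_conv(x, w_f, dilation)) * sigmoid(causal_conv(x, w_g, dilation))
    return x + z, z   # residual output (to the next block), skip output (to the head)

# Stack a few blocks and sum the skip outputs, as in the overall architecture.
x = np.random.randn(32)
skips = []
for d in (1, 2, 4, 8):
    x, skip = residual_block(x, np.random.randn(2), np.random.randn(2), d)
    skips.append(skip)
features = sum(skips)   # fed to the final ReLU / 1x1 / softmax stages in the real model
```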
  10. Gated activation units • z = tanh(W_{f,k} * x + V_{f,k}^T h) ⊙ σ(W_{g,k} * x + V_{g,k}^T h) • k: layer index; f: filter; g: gate • ⊙: element-wise multiplication • h: condition (speaker, text, etc.) • Why? Introduced in PixelCNN (arXiv:1606.05328): the authors reasoned that earlier CNN generative models lost to PixelRNN because of the LSTM's gating structure, so they added an LSTM-like gate.
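A sketch of this gated unit with an optional global condition h (e.g. a speaker embedding); the shapes and names are illustrative, not from the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_activation(x_conv_f, x_conv_g, h=None, v_f=None, v_g=None):
    """z = tanh(W_f*x + V_f^T h) * sigmoid(W_g*x + V_g^T h).

    x_conv_f, x_conv_g: outputs of the filter/gate dilated convolutions, shape (channels, time).
    h: optional conditioning vector; its projection is broadcast over every time step.
    """
    f, g = x_conv_f, x_conv_g
    if h is not None:
        f = f + (v_f @ h)[:, None]   # add the condition at every time step
        g = g + (v_g @ h)[:, None]
    return np.tanh(f) * sigmoid(g)

channels, time, h_dim = 32, 100, 16
z = gated_activation(np.random.randn(channels, time),
                     np.random.randn(channels, time),
                     h=np.random.randn(h_dim),
                     v_f=np.random.randn(channels, h_dim),
                     v_g=np.random.randn(channels, h_dim))
```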
  11. Input/output http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/ Roughly speaking, the waveform is quantized on a log scale into 256 levels.
  12. Things not described in the paper, and my guesses • Kernel size of the dilated filters: 2 • Number of layers (ResNet blocks): 4×10 to 6×10 • Number of channels in the hidden layers: hundreds? 256? • The other activation function in a res-block? Maybe none • Batch normalization: no reason not to use it • Sampling frequency: "at least 16 kHz" • Where to let the skip connections out? Every 10 layers? • Do the skip connections have weights? Yes?
  13. 13. Experiments
  14. Text-to-Speech (TTS) • Single-speaker speech datasets • North American English: 24.6 hours • Mandarin Chinese: 34.8 hours • Receptive field: 240 ms • An ad hoc pipeline: one model maps text to linguistic features h_i (possibly phonemes), another model predicts the fundamental frequency F0(t) and duration(t), and WaveNet, conditioned on the linguistic features h(t), generates audio(t). (The notation here differs from the paper's.)
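Since the TTS condition h(t) is time-varying (linguistic features, F0), it has to be brought up to the audio sample rate before it enters the gates. The paper describes a learned (transposed-convolution) upsampling; the sketch below just repeats each frame, and the 5 ms / 80-sample hop and feature dimension are assumptions for illustration:

```python
import numpy as np

def upsample_condition(h_frames, hop=80):
    """Repeat frame-level conditioning vectors so there is one per audio sample,
    e.g. one linguistic-feature frame every 5 ms at 16 kHz -> hop = 80 samples."""
    return np.repeat(h_frames, hop, axis=0)   # (frames, dim) -> (frames * hop, dim)

h = np.random.randn(50, 60)            # 50 frames of made-up linguistic features + F0
h_per_sample = upsample_condition(h)   # 4000 conditioning vectors, one per audio sample
```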
  15. 15. TTS: Mean Opinion Score https://deepmind.com/blog/wavenet-generative-model-raw-audio/
  16. Speech Recognition • TIMIT dataset (probably ~4 hours) • A pooling layer is added after the dilated convolutions, • downsampling by 160× (does that mean after the 7th layer?), • followed by a few non-causal convolutions. • One loss predicts the next sample (as in the ordinary WaveNet) • and another loss classifies the frame. • 18.8 PER, the best score among models trained on raw audio.
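A sketch of the 160× mean pooling mentioned here: averaging the per-sample activations over non-overlapping windows of 160 samples yields one frame per 10 ms at 16 kHz; the shapes are illustrative:

```python
import numpy as np

def mean_pool_frames(activations, factor=160):
    """Mean-pool per-sample activations (channels, time) into frames:
    160 samples at 16 kHz -> one 10 ms frame."""
    c, t = activations.shape
    t_trim = (t // factor) * factor                      # drop the ragged tail
    frames = activations[:, :t_trim].reshape(c, -1, factor)
    return frames.mean(axis=2)                           # shape (channels, t // factor)

acts = np.random.randn(128, 16000)     # one second of per-sample features
frames = mean_pool_frames(acts)        # 100 frames, one per 10 ms
```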
  17. 17. End
  18. 18. (Multi-speaker) Speech Generation • Conditioned on the speaker • 44 hours of data (from 109 speakers)
  19. 19. TTS: Mean opinion score
  20. μ-law transformation (ITU-T, 1988) • f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ), with μ = 255 • This partitions the interval [-1, 1] into 256 levels. • Roughly speaking, it just encodes the signal on a log scale.
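A sketch of the μ-law companding and 256-level quantization described above (μ = 255); the exact rounding and the decoder are my own choices for illustration:

```python
import numpy as np

MU = 255

def mu_law_encode(x):
    """Map x in [-1, 1] to an integer level in 0..255 via mu-law companding."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)   # companded, still in [-1, 1]
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)           # quantize to 256 levels

def mu_law_decode(q):
    """Approximate inverse: integer level back to a waveform value in [-1, 1]."""
    y = 2 * (q.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

print(mu_law_encode(np.array([-1.0, -0.5, 0.0, 0.5, 1.0])))    # [  0  16 128 239 255]
```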
