ICLR2020 reviews on Speech domain
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
21, May. 2020.
ICLR 2020 (2020.04.26 ~ 2020.04.30)
Content
• DDSP: Differentiable Digital Signal Processing (Spotlight)
• Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
• High Fidelity Speech Synthesis with Adversarial Networks (Talk)
• Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
DDSP: Differentiable Digital Signal Processing (Spotlight)
Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
Google Research, Brain Team
Overview
• Digital Signal Processing (DSP) is one of the backbones of modern society, integral to
• Telecommunications, Transportation, Audio, Many medical technologies
• Key idea
• Use simple interpretable DSP elements to create complex realistic signals by precisely controlling their many parameters
• E.g., a collection of linear filters and sinusoidal oscillators (DSP elements) can create the sound of a realistic violin
• In this paper,
• Use a neural network to convert a user’s input into complex DSP controls that can produce more realistic signals
Challenges of “pure” neural audio synthesis
Audio is highly periodic
Ears are sensitive to discontinuities
DSP Components
• Oscillators (Harmonic Sinusoids)
Differentiable Additive Synthesizer
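As a rough illustration of the harmonic oscillator idea, the sketch below sums sinusoids at integer multiples of a time-varying fundamental. It is a minimal NumPy sketch, not the authors' implementation: the function name `harmonic_synth`, the 16 kHz sample rate, and the Nyquist masking step are assumptions made for the example.

```python
import numpy as np

def harmonic_synth(f0_hz, amplitudes, sample_rate=16000):
    """Sum of sinusoids at integer multiples of a time-varying fundamental.

    f0_hz:      [n_samples] fundamental frequency per sample, in Hz
    amplitudes: [n_samples, n_harmonics] per-sample amplitude of each harmonic
    (names and the default sample rate are illustrative assumptions)
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    amplitudes = np.asarray(amplitudes, dtype=np.float64)
    n_harmonics = amplitudes.shape[1]
    harmonic_numbers = np.arange(1, n_harmonics + 1)        # 1, 2, ..., K
    freqs = f0_hz[:, None] * harmonic_numbers[None, :]      # [n_samples, K]
    # Silence harmonics at or above Nyquist to avoid aliasing.
    amplitudes = np.where(freqs < sample_rate / 2, amplitudes, 0.0)
    # Instantaneous phase is the running sum of angular frequency.
    phases = 2 * np.pi * np.cumsum(freqs / sample_rate, axis=0)
    return np.sum(amplitudes * np.sin(phases), axis=1)

# One second of a 440 Hz tone with three decaying harmonics.
sr = 16000
f0 = np.full(sr, 440.0)
amps = np.tile(np.array([0.5, 0.25, 0.125]), (sr, 1))
audio = harmonic_synth(f0, amps, sr)
```

Every step is a smooth array operation, which is what would let a neural network learn `f0_hz` and `amplitudes` by gradient descent in an autodiff framework.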
Filters (LTV-FIR)
• Linear Time Variant Filter
• Magnitude (freq, t) → IDFT + window → impulse response (t)
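The magnitude → IDFT → window → impulse response pipeline can be sketched for a frame-wise filter as follows. This is a simplified NumPy illustration under stated assumptions: the helper names, the Hann window, the non-overlapping frames, and the overlap-add of convolution tails are choices made for this example and are not taken from the slides.

```python
import numpy as np

def magnitudes_to_impulse_response(magnitudes, ir_size):
    """Turn one frame of filter magnitudes [n_bins] into a windowed FIR response."""
    magnitudes = np.asarray(magnitudes, dtype=np.float64)
    # A real, zero-phase spectrum gives a symmetric impulse response.
    ir = np.fft.irfft(magnitudes)            # length 2 * (n_bins - 1)
    ir = np.fft.fftshift(ir)                 # move the peak to the centre
    start = len(ir) // 2 - ir_size // 2
    ir = ir[start:start + ir_size]           # crop around the centre
    return ir * np.hanning(ir_size)          # window to reduce ripple

def ltv_fir_filter(audio, frame_magnitudes, ir_size=65):
    """Filter each frame of `audio` with its own impulse response, then overlap-add."""
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = frame_magnitudes.shape[0]
    frame_size = int(np.ceil(len(audio) / n_frames))
    padded = np.zeros(n_frames * frame_size)
    padded[:len(audio)] = audio
    out = np.zeros(n_frames * frame_size + ir_size - 1)
    for i in range(n_frames):
        begin = i * frame_size
        frame = padded[begin:begin + frame_size]
        ir = magnitudes_to_impulse_response(frame_magnitudes[i], ir_size)
        # The tail of each filtered frame spills into the next one.
        out[begin:begin + frame_size + ir_size - 1] += np.convolve(frame, ir)
    return out[:len(audio) + ir_size - 1]

# A flat magnitude response acts as a pure delay of ir_size // 2 samples.
x = np.random.default_rng(0).standard_normal(1000)
flat = np.ones((4, 65))
y = ltv_fir_filter(x, flat, ir_size=65)
```

Because the filter changes from frame to frame, a network that outputs `frame_magnitudes` effectively controls a time-varying equalizer.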
Room Reverberation (Reverb)
• Very long 1-D convolution (filter size = 64k)
• Learned for a given dataset
• Can also be generated by other DDSP components
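A convolution with a filter this long is usually computed in the frequency domain. The sketch below shows that idea with NumPy; the function name `fft_convolve`, the synthetic decaying-noise impulse response, and the reading of "64k" as 64000 samples are assumptions for illustration only.

```python
import numpy as np

def fft_convolve(audio, impulse_response):
    """Linear convolution via FFT; equivalent to np.convolve but faster for long filters."""
    n = len(audio) + len(impulse_response) - 1
    n_fft = 1 << (n - 1).bit_length()        # next power of two >= n
    spectrum = np.fft.rfft(audio, n_fft) * np.fft.rfft(impulse_response, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n]

# Hypothetical impulse response: exponentially decaying noise, 64000 taps.
rng = np.random.default_rng(0)
ir = rng.standard_normal(64000) * np.exp(-np.linspace(0.0, 8.0, 64000))
dry = rng.standard_normal(16000)
wet = fft_convolve(dry, ir)
```

In a learned setting the entries of `ir` would be trainable parameters rather than synthetic noise.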
Room Reverberation (Reverb)
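$$RT_{60} = \frac{24 \ln 10}{c_{20}} \cdot \frac{V}{S\alpha} \approx \frac{0.161\,V}{S\alpha}$$
Where,
RT_60 = reverberation time, in seconds
V = volume of the room, in m^3
S = surface area, in m^2
α = average absorption coefficient
c_20 = speed of sound at 20 °C (about 343 m/s)
The constant 0.161 assumes metric units; with cubic and square feet the constant is about 0.049.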
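To make the formula concrete, here is a small worked example in Python. The room dimensions and absorption coefficient are made-up values, and 343 m/s is an assumed figure for the speed of sound at 20 °C.

```python
import math

SPEED_OF_SOUND_20C = 343.0  # m/s, assumed value for c_20

def sabine_rt60(volume_m3, surface_m2, absorption):
    """Sabine reverberation time in seconds (metric units)."""
    return (24 * math.log(10) / SPEED_OF_SOUND_20C) * volume_m3 / (surface_m2 * absorption)

# Hypothetical 5 m x 4 m x 3 m room with average absorption 0.2.
length, width, height = 5.0, 4.0, 3.0
volume = length * width * height                                      # 60 m^3
surface = 2 * (length * width + length * height + width * height)     # 94 m^2
rt60 = sabine_rt60(volume, surface, 0.2)                              # roughly half a second
```

Doubling the absorption coefficient halves the reverberation time, which matches the intuition that a more absorbent room sounds "drier".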
Overview up to this point
Additive Synthesizer Parameters
Noise Synthesizer and Reverb Parameters
Proposed model
Proposed model
Proposed Model (Encoder)
Proposed Model (Decoder)
Result
Timbre Transfer (singing voice to violin)
Extrapolation
Dereverberation and Acoustic Transfer
High Fidelity Speech Synthesis with Adversarial Networks (Talk)
Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
DeepMind
Overview
• Neural Text-to-Speech (TTS)
• Acoustic model: receives text and predicts an intermediate representation such as a mel-spectrogram (e.g., Tacotron, Transformer-TTS, MelNet)
• Vocoder: converts the predicted mel-spectrogram into audible raw audio (e.g., WaveNet, WaveRNN, WaveGlow, MelGAN)
• Evaluation method of TTS
• Mean Opinion Score (MOS), which is evaluated by humans
• Contribution
• The authors use a linguistic feature extractor to condition a GAN generator that produces raw audio from text
• They also present four metrics that can be used instead of MOS
GAN-TTS
Generator
• 567 input features per 5 ms window
• Generator gradually upsamples the representation
• Residual GBlocks use dilated convolutions and batch normalization conditioned on the noise
• 30 layers in total
Generator block
• GBlocks are 4-layer residual blocks with 2 skip connections, upsampling, and dilated convolutions
• Tensor shapes in the figure: waveform [batch, 1, time]; conditioning features [batch, 567, time]; noise [batch, latent dim]
Dilated convolution
Compare (Dilated Conv. vs Upsampled Conv.)
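Dilation spaces the kernel taps apart so that the receptive field grows without adding parameters. The plain-Python sketch below illustrates this; the kernel size of 3 and the dilation schedule 1, 2, 4, 8 are illustrative assumptions, not values taken from the slides.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D dilated convolution (cross-correlation, as used in deep learning)."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * x[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(x) - span)
    ]

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output after stacking the given layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

signal = list(range(10))
out = dilated_conv1d(signal, [1, 1, 1], dilation=2)   # each output sums x[i], x[i+2], x[i+4]

plain = receptive_field(3, [1, 1, 1, 1])      # four ordinary layers
dilated = receptive_field(3, [1, 2, 4, 8])    # four dilated layers, same parameter count
```

With the same four layers and the same number of weights, the dilated stack covers 31 input samples instead of 9.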
Discriminator
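• Inputs: waveform [batch, 1, time] and conditioning features [batch, 567, time]
• Base window size ω = 240
• Each window is reshaped from [ωk, 1] to [ω, k], where k is the downsample factor (e.g., k = 8 for an input window size of 1920)
• Output: [batch, 1]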
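The reshape from [ωk, 1] to [ω, k] can be sketched with NumPy as below. The function name `reshape_window` and the channels-last layout are assumptions for this example; a channels-first framework would additionally transpose the last two axes.

```python
import numpy as np

def reshape_window(window, omega=240):
    """Fold a [batch, omega * k] waveform window into [batch, omega, k].

    Each of the omega time steps then carries k consecutive samples as channels,
    so windows of different lengths share the same temporal length omega.
    """
    window = np.asarray(window)
    k, remainder = divmod(window.shape[-1], omega)
    if remainder:
        raise ValueError("window length must be a multiple of omega")
    return window.reshape(window.shape[0], omega, k)

batch = np.arange(1920, dtype=np.float32).reshape(1, 1920)   # one window of size 1920
folded = reshape_window(batch)                               # k = 1920 / 240 = 8
```

This keeps every sample while shrinking the time axis, which is cheaper than processing the full-length window directly.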
Discriminator
[batch, 567, time]
[batch, l]
Discriminator
• Discriminator Block
Pseudocode of GAN-TTS
Algorithm 1
• N: waveform length
• λ: waveform-conditioning frequency ratio
• ω: base window size
• n_steps: number of training steps
• n_batch: batch size
• η_D, η_G: discriminator and generator learning rates
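One step that these symbols suggest is sampling a random waveform window together with the conditioning frames that cover it. The sketch below is a guess at how such aligned sampling could look, not a transcription of Algorithm 1: the function name, the alignment of window starts to multiples of λ, and the value λ = 120 are assumptions for illustration.

```python
import random

def sample_aligned_window(waveform, features, window_size, ratio, rng=random):
    """Pick a random waveform window and the conditioning frames that span it.

    ratio plays the role of lambda: waveform samples per conditioning frame.
    Assumes len(waveform) >= len(features) * ratio.
    """
    if window_size % ratio:
        raise ValueError("window_size must be a multiple of ratio")
    n_frames = window_size // ratio
    start_frame = rng.randint(0, len(features) - n_frames)   # inclusive bounds
    start = start_frame * ratio
    return (
        waveform[start:start + window_size],
        features[start_frame:start_frame + n_frames],
    )

# Toy data: 40 conditioning frames, 120 waveform samples per frame (assumed ratio).
toy_waveform = list(range(4800))
toy_features = list(range(40))
wav_win, feat_win = sample_aligned_window(toy_waveform, toy_features, window_size=480, ratio=120)
```

Drawing a fresh window at each training step means a discriminator rarely sees exactly the same crop twice.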
Experiments and results
• Using the same window scale for all discriminators → performance degradation
• Random windows → act as data augmentation and speed up training
• If the input size is fixed, training can be accelerated with torch.backends.cudnn.benchmark = True (in PyTorch)
• Three times faster than Parallel WaveNet, with almost the same MOS
• Despite being a GAN, training was very stable
Note
• All the figures are from authors, papers, blogs
Reference
• Engel, Jesse, et al. "DDSP: Differentiable Digital Signal Processing." ICLR (2020).
• https://www.dsprelated.com/freebooks/filters/View_Linear_Time_Varying.html (A view of linear time-varying digital filters)
• https://www.bobgolds.com/RT60/rt60.htm (How to do a RT60 calculation)
• https://ccrma.stanford.edu/~adnanm/SCI220/Music318ir.pdf (Room impulse response measurement and analysis)
• Bińkowski, Mikołaj, et al. "High fidelity speech synthesis with adversarial networks." ICLR (2020)
• Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." ICLR (2016).
Thank you!

Editor's Notes

  1. Hello everyone, I am June-Woo Kim from ABR Lab. I will be presenting papers from ICLR 2020.
  2. Here is a summary.
  3. RWD (random window discriminator) ensemble in the discriminator