Conditional Generative Model for Audio
발표자: 최형석 & 이주헌
2019/11/30 (Sat.)
최형석 (Hyeong-Seok Choi), kekepa15@snu.ac.kr
• Affiliation: Seoul National University, Music & Audio Research Group
• Research interests: Audio Source Separation, Speech Enhancement, Self-supervised representation learning & generation, Singing Voice Synthesis

이주헌 (Juheon Lee), juheon2@snu.ac.kr
• Affiliation: Seoul National University, Music & Audio Research Group
• Research interests: Singing Voice Synthesis, Lyric-to-audio Alignment, Cover Song Identification, Abnormal Sound Detection, Choreography Generation
3
Generative models
Dataset: Examples drawn from 𝑝(𝑿)
𝒙~𝑝(𝑿)
4
Generative models
Dataset: Examples drawn from 𝑝(𝑿)
𝒙~𝑝(𝑿)
[Figure: dataset samples X and their underlying distribution 𝑝(𝑿)]
5
Generative models
Explicit models: infer the parameters of 𝑝(𝑿; 𝜽) (i.e., how likely is this cat?)
[Figure: data X and the modeled density 𝑝(𝑿; 𝜽)]
VAE, Autoregressive models, …
6
Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I
roll the dice! (sampling)
[Figure: samples X drawn from the implicit model 𝑝(𝑿; 𝜽)]
GANs…
7
Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I
roll the dice! (sampling)
[Figure: samples X drawn from the implicit model 𝑝(𝑿; 𝜽)]
GANs…
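To make the explicit/implicit split concrete, here is a minimal Python sketch (not from the slides; the class names and the Gaussian/generator choices are purely illustrative): an explicit model exposes a likelihood we can evaluate, while an implicit model only lets us draw samples.

```python
import numpy as np

class ExplicitModel:
    """Explicit density: we can evaluate p(x; theta) for any x ("how likely is this cat?")."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std                     # the parameters theta

    def log_prob(self, x):
        # Gaussian log-likelihood, as a stand-in for VAEs / autoregressive models.
        return -0.5 * (((x - self.mean) / self.std) ** 2
                       + np.log(2 * np.pi * self.std ** 2))

    def sample(self, n):
        return np.random.normal(self.mean, self.std, size=n)

class ImplicitModel:
    """Implicit density: we can only sample; p(x; theta) is never evaluated (GAN-style)."""
    def __init__(self, generator):
        self.generator = generator                          # any callable mapping noise -> data

    def sample(self, n):
        z = np.random.normal(size=(n, 16))                  # roll the dice
        return self.generator(z)                            # nice cats, hopefully
```

VAEs and autoregressive models sit on the explicit side (they expose a likelihood or a bound on it); GANs sit on the implicit side.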
8
Conditional generative models
Application-dependent modeling
1. Given a piano roll, I want to generate an expressive piano performance
2. Given a mel-spectrogram, I want to generate a raw audio signal
3. Given a linguistic feature, I want to generate a speech signal
…
[Diagram: Condition (1. controllability) → Generative Model → Output (1. signal)]
9
Conditional generative models
What does a conditional generative model do?
• Reconstruct a signal from given information (filling in the missing information)
Level of “missing information”? (from a music & audio point of view)
[Diagram: condition abstraction level, from abstract (sparse) to realistic (dense): instrument class, sound class → non-expressive score, linguistic feature → MIDI score w/ velocity etc., linguistic feature w/ pitch → audio features (mel-spectrogram)]
10
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Representative applications:
• TTS
• Next generation codec
• Speech enhancement
• Some representative models
• Autoregressive generation
• WaveNet
• WaveRNN
• Parallel generation
• Parallel WaveNet
• WaveGlow/FloWaveNet
• MelGAN
11
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: training)
[Diagram: WaveRNN training. Input 1: mel-spectrogram → upsample net; Input 2: previous samples wave[0:dim-1] → GRUs; ground truth: wave[1:dim]; number of output classes: 2^bits]
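A hedged sketch of the training setup in the diagram, as a toy WaveRNN-style model in PyTorch: the mel-spectrogram is upsampled to the sample rate, the waveform shifted by one sample is the input, the unshifted waveform is the target, and the loss is a cross-entropy over 2^bits amplitude classes. The layer sizes, single-GRU layout, and scaling here are illustrative assumptions, not the exact WaveRNN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyWaveRNN(nn.Module):
    """Toy WaveRNN-style vocoder (sizes and layout are illustrative, not the paper's)."""
    def __init__(self, n_mels=80, bits=8, hidden=256, hop=256):
        super().__init__()
        self.n_classes = 2 ** bits                      # "number of classes: 2^bits"
        self.upsample = nn.Upsample(scale_factor=hop)   # mel frame rate -> sample rate
        self.gru = nn.GRU(n_mels + 1, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, self.n_classes)

    def forward(self, mel, wav_in, state=None):
        # mel: (B, n_mels, T_frames); wav_in: (B, T_frames * hop) previous samples in [-1, 1]
        cond = self.upsample(mel).transpose(1, 2)       # (B, T_samples, n_mels)
        x = torch.cat([cond, wav_in.unsqueeze(-1)], dim=-1)
        h, state = self.gru(x, state)
        return self.proj(h), state                      # logits: (B, T_samples, n_classes)

def training_step(model, mel, wav_mulaw):
    """Teacher forcing: input is wave[0:T-1], target is wave[1:T] (one-sample shift).
    Assumes wav_mulaw holds integer class indices and len == mel frames * hop + 1."""
    wav_in = wav_mulaw[:, :-1].float() / (model.n_classes - 1) * 2 - 1   # scale to [-1, 1]
    target = wav_mulaw[:, 1:]                                            # class labels (long)
    logits, _ = model(mel, wav_in)
    return F.cross_entropy(logits.reshape(-1, model.n_classes), target.reshape(-1))
```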
12
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: inference)
Inference
[Diagram: WaveRNN inference. Input mel-spectrogram → upsample net; starting from a zero state and a zero input sample, x[1], x[2], …, x[N] are sampled one step at a time, each sample fed back as the next input to form the output waveform]
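A matching sketch of the inference loop in the diagram: start from a zero recurrent state and a zero input sample, then at each step sample an amplitude class from the predicted distribution and feed it back as the next input. It assumes the ToyWaveRNN sketch above; a real implementation adds mu-law decoding and much faster sampling.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, mel):
    """Autoregressive sampling for the ToyWaveRNN sketch above (mel: (1, n_mels, T_frames))."""
    cond = model.upsample(mel).transpose(1, 2)        # (1, T_samples, n_mels)
    state = None                                      # zero GRU state
    x = torch.zeros(1, 1)                             # zero initial sample
    out = []
    for t in range(cond.size(1)):
        inp = torch.cat([cond[:, t:t + 1], x.unsqueeze(-1)], dim=-1)
        h, state = model.gru(inp, state)
        probs = F.softmax(model.proj(h)[:, -1], dim=-1)
        idx = torch.multinomial(probs, 1)             # sample a class, don't argmax
        out.append(idx)
        x = idx.float() / (model.n_classes - 1) * 2 - 1   # feed the sample back in
    return torch.cat(out, dim=-1)                     # (1, T_samples) amplitude classes
```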
13
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Representative applications:
• TTS
• Next generation codec
• Speech enhancement
• Some representative models
• Autoregressive generation
• WaveNet
• WaveRNN
• Parallel generation
• Parallel WaveNet
• WaveGlow/FloWaveNet
• MelGAN
14
Conditional generative models: applications
15
Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
1. Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
2. Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
3. Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement (arXiv, 2019)
4. A Speech Synthesis Approach for High Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: combine the strengths of the discriminative & generative approaches!
Pros: almost no artifacts
Cons: inaccurate pronunciation in low-SNR conditions
[Diagram: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer / vocoder (generative) → synthesized clean raw waveform]
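The pipeline in the figure reduces to two stages, sketched below with placeholder modules (the separator and vocoder stand in for whatever networks the cited papers actually use): a discriminative network maps the noisy mel-spectrogram to a clean estimate, and a neural vocoder resynthesizes the waveform from that estimate.

```python
import torch.nn as nn

class GenerativeSpeechEnhancer(nn.Module):
    """Separator -> vocoder sketch; both sub-modules are placeholders, not any paper's exact networks."""
    def __init__(self, separator: nn.Module, vocoder: nn.Module):
        super().__init__()
        self.separator = separator   # discriminative: noisy mel -> estimated clean mel
        self.vocoder = vocoder       # generative: estimated clean mel -> raw waveform

    def forward(self, noisy_mel):
        clean_mel_est = self.separator(noisy_mel)   # typically trained with an L1/L2 mel loss
        return self.vocoder(clean_mel_est)          # e.g. a WaveRNN-style model as sketched above
```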
16
Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
• Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
• Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
• Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement (arXiv, 2019)
• A Speech Synthesis Approach for High Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: combine the strengths of the discriminative & generative approaches!
Pros: almost no artifacts
Cons: inaccurate pronunciation in low-SNR conditions
[Diagram: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer / vocoder (generative) → synthesized clean raw waveform]
Some of my preliminary results… (audio samples: noisy input vs. generated output)
17
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Some other practical/interesting application: Next generation codec
1. WaveNet based low rate speech coding (ICASSP 2018)
2. Low bit-rate speech coding with VQ-VAE and a WaveNet decoder (ICASSP 2019)
3. Improving Opus low bit rate quality with neural speech synthesis (arXiv, 2019)
Key idea:
1. Deep learning is good at learning a compressed representation (Encoder).
2. Deep learning is good at synthesizing (Decoder).
Pros: low bit rate (bps)
Cons: ???
[Diagram: Encoder (server 1) → compressed representation → Decoder (server 2) → reconstructed signal (speech)]
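The key idea can be sketched as encoder → quantizer → decoder, loosely in the spirit of the VQ-VAE codec cited above; the codebook size, latent dimension, and module interfaces are illustrative assumptions, not any paper's actual design. The transmitted bitstream is simply the sequence of codebook indices.

```python
import torch
import torch.nn as nn

class ToyNeuralCodec(nn.Module):
    """Encoder -> vector quantizer -> decoder sketch (illustrative only)."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module, codebook_size=256, dim=64):
        super().__init__()
        self.encoder = encoder                         # speech -> latent frames (B, T, dim)
        self.decoder = decoder                         # quantized latents -> speech
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, speech):
        z = self.encoder(speech)                                               # (B, T, dim)
        d = (z.unsqueeze(2) - self.codebook.weight[None, None]).pow(2).sum(-1)  # distances to codes
        return d.argmin(dim=-1)                        # (B, T) integer codes = the bitstream

    def decode(self, codes):
        return self.decoder(self.codebook(codes))      # resynthesized speech

# Bit rate of the code stream: latent_frames_per_second * log2(codebook_size) bits per second.
```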
18
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Training stage
19
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation stage
[Diagram: inputs TEXT and MIDI → generated (conditioned) waveform]
20
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Main idea: disentangling the formant mask & pitch skeleton
• We wanted pitch and text information to be modelled as independent
acoustic features, and we designed the network to reflect that
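One way to read the disentangling idea as code (purely illustrative; this is not the paper's actual network): a text-driven branch predicts a formant mask, a pitch-driven branch predicts a harmonic pitch skeleton, and the generated mel-spectrogram combines the two, so text and pitch can be controlled independently at generation time.

```python
import torch
import torch.nn as nn

class TwoBranchSingingSketch(nn.Module):
    """Illustrative sketch of the formant-mask / pitch-skeleton split (not the paper's exact model)."""
    def __init__(self, text_encoder: nn.Module, pitch_encoder: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder     # lyrics/phonemes -> (B, T, n_mels) mask logits
        self.pitch_encoder = pitch_encoder   # MIDI pitch      -> (B, T, n_mels) harmonic skeleton

    def forward(self, text, pitch):
        formant_mask = torch.sigmoid(self.text_encoder(text))   # mask in [0, 1]
        pitch_skeleton = self.pitch_encoder(pitch)
        return formant_mask * pitch_skeleton                    # element-wise combination -> mel
```

Changing only the pitch input (as in the following result slides) then alters the skeleton while the formant mask, and hence the pronunciation, stays put.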
21
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text : “do re mi fa sol ra ti do”
Input pitch : [C D E F G A B C]
Generated audio :
[Figure: formant mask, pitch skeleton, and generated mel-spectrogram]
22
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text : “do do do do do do do do”
Input pitch : [C D E F G A B C]
Generated audio :
[Figure: formant mask, pitch skeleton, and generated mel-spectrogram]
23
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text : “do re mi fa sol ra ti do”
Input pitch : [C C C C C C C C]
Generated audio :
[Figure: formant mask, pitch skeleton, and generated mel-spectrogram]
24
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text “아리랑 아리랑 아라리오 아리랑 고개로 넘어간다 나를 버리고 가시는 님은 십리도 못 가서 발병 난다”
“arirang arirang arario arirang go gae ro neom eo gan da na reul beo ri go ga shi neun nim eun sib ri do mot ga seo bal byung nan da”
[Figure/audio: input pitch, generated result, and generated singing audio samples]
25
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, a Singer Identity Encoder is added.
• The singer identity is disentangled into Timbre and Singing Style.
26
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, a Singer Identity Encoder is added.
• The singer identity is disentangled into Timbre and Singing Style.
Generation Result
Singer A Singer B
27
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, a Singer Identity Encoder is added.
• The singer identity is disentangled into Timbre and Singing Style.
Generation Result
Singer A Singer B
Timbre A + Style B Timbre B + Style A
28
Conditional generative models: applications
1. [Diagram: Condition (1. controllability) → Generative Model → Output (1. signal)]: deterministic. What is lacking?
2. [Diagram: Condition (a. controllability, b. signal (image/audio)) + Randomness (a. uncertainty, b. creativity) → Generative Model → Output (a. signal (audio/image))]: a multi-modal transform with some stochasticity; can be seen as a supervised way of disentangling representation.
29
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
30
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
31
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
[Diagram: pose sequence (frames 1-8) concatenated with the music sequence (frames 2-9) → estimated pose sequence (frames 2-9)]
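A small sketch of how the windows in the diagram can be paired for training (the window length, feature layout, and function name are illustrative): the previous pose window is concatenated with the current music window, and the target is the pose window shifted by one frame.

```python
import numpy as np

def make_training_pairs(pose, music, window=8):
    """Build (input, target) windows as in the diagram: pose[t:t+w] + music[t+1:t+1+w]
    predict pose[t+1:t+1+w]. pose: (T, pose_dim), music: (T, music_dim)."""
    inputs, targets = [], []
    for t in range(len(pose) - window - 1):
        x_pose = pose[t:t + window]                  # e.g. frames 1..8
        x_music = music[t + 1:t + 1 + window]        # e.g. frames 2..9
        inputs.append(np.concatenate([x_pose, x_music], axis=-1))   # concatenated features
        targets.append(pose[t + 1:t + 1 + window])   # estimated pose frames 2..9
    return np.stack(inputs), np.stack(targets)
```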
32
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
BLACKPINK - 불장난 (Playing with Fire)
Red Velvet - Rookie
33
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
34
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
• An autoencoder is used to obtain reduced acoustic features.
• A temporal-index mask transforms the frame-indexed acoustic features into beat-indexed acoustic features (see the sketch below).
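A minimal sketch of the beat-indexing step, assuming beat positions are given as frame indices (e.g. from a beat tracker); simple averaging stands in here for the paper's temporal-index mask.

```python
import numpy as np

def frames_to_beats(features, beat_frames):
    """Pool frame-indexed acoustic features (T, D) into beat-indexed features
    by averaging the frames between consecutive beat positions."""
    beat_feats = [features[start:end].mean(axis=0)
                  for start, end in zip(beat_frames[:-1], beat_frames[1:])]
    return np.stack(beat_feats)      # (n_beats - 1, D)
```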
35
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
36
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
Learning How to Compose
Generation
37
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
• Decompose the dance sequence using kinematic beats
• With a VAE, disentangle dance into an initial pose + movement
38
Conditional generative models (multi-modal)
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Compose
• Learns how to meaningfully compose a sequence of basic movements into a dance conditioned on the input music.
• Conditional adversarial training enforces the correspondence between music and dance.
39
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
[Figure legend: conditioning applied]
40
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
41
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Stochastic part
(Uncertainty)
42
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
[Figure: generated results for four speakers (Spk1-Spk4)]
43
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Fix z & Change c (speech embedding)
44
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Fix c & Change z (random sampling)
Thank You!
Questions?
Editor's Notes
1. The application changes depending on what information is being filled in.
2. Generator architecture: a stack of transposed convolutional layers upsamples the input sequence, and each transposed convolutional layer is followed by a stack of residual blocks. Induced receptive field: the residual blocks use dilations so that temporally far-apart output activations of each layer have significantly overlapping inputs; the receptive field of a stack of dilated convolution layers increases exponentially with the number of layers. Discriminator: a multi-scale architecture with 3 discriminators (identical structure) operating on different audio scales (the original scale, and 2x and 4x downsampled); each discriminator is biased to learn features for a different frequency range of the audio. Window-based objective: each individual discriminator is a Markovian window-based discriminator (analogous to image patches, Isola et al. (2017)); the discriminator learns to classify between distributions of small audio chunks, and overlapping large windows maintain coherence across patches.
3. 1. Dance 2. Audio signal generation 3. Aumon (reflecting stochasticity) 4. Future work with the example of image generation with stochasticity
4. A face cannot be 100% inferred from a voice alone.