Conditional Generative Model for Audio
Presenters: Hyeong-Seok Choi & Juheon Lee
2019/11/30 (Sat.)
최형석
Hyeong-Seok Choi
kekepa15@snu.ac.kr
이주헌
Juheon Lee
juheon2@snu.ac.kr
• Affiliation
  • Seoul National University
  • Music & Audio Research Group
• Research interests
  • Audio Source Separation
  • Speech Enhancement
  • Self-supervised representation learning & generation
  • Singing Voice Synthesis
• Affiliation
  • Seoul National University
  • Music & Audio Research Group
• Research interests
  • Singing Voice Synthesis
  • Lyric-to-audio Alignment
  • Cover Song Identification
  • Abnormal Sound Detection
  • Choreography Generation
Generative models
Dataset: Examples drawn from 𝑝(𝑿)
𝒙~𝑝(𝑿)
[Figure: examples 𝒙 drawn from the dataset X and their underlying distribution 𝑝(𝑿)]
Generative models
Explicit models: infer the parameters of 𝑝(𝑿; 𝜽). (i.e., how likely is this cat?)
[Figure: the dataset X and the fitted model density 𝑝(𝑿; 𝜽)]
VAE, Autoregressive models, …
Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I
roll the dice! (sampling)
[Figure: the dataset X and an implicit model of 𝑝(𝑿; 𝜽), used only for sampling]
GANs…
Conditional generative models
Application-dependent modeling
1. Given a piano roll, I want to generate an expressive piano performance
2. Given a mel-spectrogram, I want to generate a raw audio signal
3. Given a linguistic feature, I want to generate a speech signal
…
[Diagram: Condition (controllability) → Generative Model → Output (signal)]
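Formally, a conditional generative model fits the distribution of a signal 𝒙 given a condition 𝒄; autoregressive models such as WaveNet/WaveRNN factorize it over time steps (standard notation, written out here as a quick reference rather than taken from the slide):

```latex
% Conditional generative model: signal x, condition c, parameters \theta
p_{\theta}(\mathbf{x} \mid \mathbf{c})

% Standard autoregressive factorization (WaveNet/WaveRNN-style models)
p_{\theta}(\mathbf{x} \mid \mathbf{c}) = \prod_{t=1}^{T} p_{\theta}\!\left(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{c}\right)
```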
Conditional generative models
What does a conditional generative model do?
• Reconstruct a signal from given information (i.e., fill in the missing information).
Level of “missing information”? (from a music & audio point of view)
Condition abstraction level, from abstract (sparse) to realistic (dense):
• Instrument class / sound class
• Non-expressive score / linguistic features
• MIDI score w/ velocity, etc. / linguistic features w/ pitch
• Audio features (mel-spectrogram)
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Representative applications
  • TTS
  • Next-generation codec
  • Speech enhancement
• Some representative models
  • Autoregressive generation
    • WaveNet
    • WaveRNN
  • Parallel generation
    • Parallel WaveNet
    • WaveGlow / FloWaveNet
    • MelGAN
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: training)
[Diagram: upsample net + GRUs]
Input 1: mel-spectrogram
Input 2: wave[0:dim-1]
Ground truth: wave[1:dim]
Number of output classes: 2^bits
Training
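A minimal sketch of this training setup (illustrative only, not the actual WaveRNN implementation; names and sizes such as TinyWaveRNN, rnn_dims and the linear conditioning layer are assumptions): the mel-spectrogram is upsampled to the sample rate, the GRU sees the waveform shifted by one sample, and the loss is cross-entropy over 2^bits classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveRNN(nn.Module):
    """Illustrative WaveRNN-style vocoder: mel conditioning + GRU + softmax over 2**bits classes."""
    def __init__(self, n_mels=80, hop=256, rnn_dims=512, bits=9):
        super().__init__()
        self.n_classes = 2 ** bits                       # categorical output, e.g. 512 classes
        self.hop = hop
        self.cond = nn.Linear(n_mels, 128)               # stand-in for the "upsample net"
        self.rnn = nn.GRU(128 + 1, rnn_dims, batch_first=True)
        self.out = nn.Linear(rnn_dims, self.n_classes)

    def forward(self, mel, wav_prev):
        # mel: (B, frames, n_mels); wav_prev: (B, T) with T = frames * hop, values in [-1, 1]
        c = self.cond(mel)                                # (B, frames, 128)
        c = c.repeat_interleave(self.hop, dim=1)          # crude upsampling to the sample rate
        x = torch.cat([c, wav_prev.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                                # (B, T, n_classes) logits

def training_step(model, mel, wav_quantized):
    # wav_quantized: (B, T + 1) integer class indices in [0, n_classes)
    wav_float = wav_quantized.float() / (model.n_classes - 1) * 2 - 1
    inp = wav_float[:, :-1]                               # "Input 2":      wave[0 : T]
    target = wav_quantized[:, 1:]                         # "Ground truth": wave[1 : T + 1]
    logits = model(mel, inp)
    return F.cross_entropy(logits.reshape(-1, model.n_classes), target.reshape(-1))

model = TinyWaveRNN()
loss = training_step(model, torch.randn(2, 4, 80), torch.randint(0, 512, (2, 4 * 256 + 1)))
```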
Conditional generative models: applications
Example of densely conditioned models: Vocoders (WaveRNN: inference)
Inference
[Diagram: mel-spectrogram → upsample net → GRU, unrolled in time from a zero state and a zero initial sample; at each step the model samples x[1], x[2], …, x[N-1], x[N] and feeds the sample back as the next input; the sampled sequence is the output waveform]
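A matching sketch of the inference loop, assuming the TinyWaveRNN sketch from the training slide: start from a zero state and a zero sample, sample one step at a time, and feed each sample back as the next input.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, mel):
    """Sample-by-sample autoregressive inference for the TinyWaveRNN sketch above."""
    c = model.cond(mel).repeat_interleave(model.hop, dim=1)    # (1, T, 128) conditioning
    T = c.shape[1]
    h = torch.zeros(1, 1, model.rnn.hidden_size)               # "zero state"
    x = torch.zeros(1, 1, 1)                                   # previous sample, starts at 0
    samples = []
    for t in range(T):
        step_in = torch.cat([c[:, t:t + 1], x], dim=-1)        # condition for step t + last sample
        out, h = model.rnn(step_in, h)                         # one GRU step, state carried over
        probs = F.softmax(model.out(out), dim=-1)              # distribution over 2**bits classes
        idx = torch.multinomial(probs.view(1, -1), 1)          # sample x[t]
        x = (idx.float() / (model.n_classes - 1) * 2 - 1).view(1, 1, 1)  # feed the sample back
        samples.append(x.item())
    return samples                                             # floats in [-1, 1]

wav = generate(TinyWaveRNN(), torch.randn(1, 4, 80))           # assumes TinyWaveRNN from above
```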
Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
• Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
• Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
• Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement (arXiv, 2019)
• A Speech Synthesis Approach for High Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: combine the strengths of the discriminative & generative approaches!
Pros: almost no artifacts
Cons: inaccurate pronunciation in low-SNR conditions
[Pipeline: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer / vocoder (generative) → synthesized clean raw waveform]
Some of my preliminary results… (audio samples: noisy vs. generated)
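A minimal sketch of this two-stage idea (the separator and vocoder below are hypothetical stand-ins, not the models from the cited papers):

```python
import numpy as np

def generative_enhancement(noisy_mel, separator, vocoder):
    """Two-stage enhancement: discriminative mel denoiser followed by a generative vocoder.

    noisy_mel : (frames, n_mels) mel-spectrogram of the noisy recording
    separator : any model mapping a noisy mel to an estimate of the clean mel (discriminative stage)
    vocoder   : any neural vocoder mapping a mel to a raw waveform (generative stage)
    """
    clean_mel_hat = separator(noisy_mel)      # estimated clean mel-spectrogram
    wave_hat = vocoder(clean_mel_hat)         # resynthesized clean waveform
    return wave_hat

# Toy stand-ins so the sketch runs end to end:
identity_separator = lambda mel: mel
toy_vocoder = lambda mel: np.zeros(mel.shape[0] * 256)   # pretend hop size of 256
wave = generative_enhancement(np.random.rand(100, 80), identity_separator, toy_vocoder)
```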
Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Some other practical/interesting application: Next generation codec
1. WaveNet based low rate speech coding (ICASSP 2018)
2. Low bit-rate speech coding with VQ-VAE and a WaveNet decoder (ICASSP 2019)
3. Improving Opus low bit rate quality with neural speech synthesis (arXiv, 2019)
Key idea:
1. Deep learning is good at learning a compressed representation (Encoder).
2. Deep learning is good at synthesizing (Decoder).
Pros: good bit rate (bps)
Cons: ???
[Pipeline: speech → Encoder (Server 1) → compressed representation → Decoder (Server 2) → reconstructed signal (speech)]
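A toy sketch of the encoder/decoder split (VQ-style quantization is only one possible choice; all dimensions and bit counts below are invented for illustration, not taken from the cited papers):

```python
import numpy as np

def encode(speech_frames, encoder, codebook):
    """Server 1: compress each frame to the index of its nearest codebook vector (VQ-style)."""
    z = encoder(speech_frames)                                     # (frames, latent_dim)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (frames, codebook_size)
    return dists.argmin(axis=1)                                    # integer codes to transmit

def decode(codes, codebook, vocoder):
    """Server 2: look up code vectors and let a generative vocoder resynthesize speech."""
    return vocoder(codebook[codes])

# Illustrative numbers: a 256-entry codebook is 8 bits per frame; at 100 frames/s that is 800 bps.
codebook = np.random.randn(256, 16)
codes = encode(np.random.randn(100, 16), lambda x: x, codebook)    # identity "encoder" stand-in
speech_hat = decode(codes, codebook, lambda z: np.zeros(len(z) * 160))
```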
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Training stage
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
[Diagram: TEXT + MIDI → model → conditioned wave]
Generation stage
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Main idea: disentangling the formant mask & pitch skeleton
• We wanted pitch and text information to be modelled as independent acoustic features, and we designed the network to reflect that.
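One way to picture the disentanglement as a rough sketch (not the paper's architecture): a text-driven branch predicts a formant mask and a pitch-driven branch predicts a pitch skeleton, and the two are combined, here multiplicatively as an assumption for illustration, into the generated mel-spectrogram.

```python
import numpy as np

def combine_branches(formant_mask, pitch_skeleton):
    """Illustrative combination of the two disentangled factors into a mel-spectrogram.

    formant_mask   : (frames, n_mels), text-dependent spectral envelope in [0, 1]
    pitch_skeleton : (frames, n_mels), pitch-dependent harmonic structure
    """
    return formant_mask * pitch_skeleton   # element-wise product (an assumption for illustration)

frames, n_mels = 200, 80
mel_hat = combine_branches(np.random.rand(frames, n_mels), np.random.rand(frames, n_mels))
```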
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do re mi fa sol ra ti do”
Input pitch: [C D E F G A B C]
Generated audio:
[Figure: formant mask, pitch skeleton, generated mel-spectrogram]
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do do do do do do do do”
Input pitch: [C D E F G A B C]
Generated audio:
[Figure: formant mask, pitch skeleton, generated mel-spectrogram]
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do re mi fa sol ra ti do”
Input pitch: [C C C C C C C C]
Generated audio:
[Figure: formant mask, pitch skeleton, generated mel-spectrogram]
Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text “아리랑 아리랑 아라리오 아리랑 고개로 넘어간다 나를 버리고 가시는 님은 십리도 못 가서 발병 난다”
“arirang arirang arario arirang go gae ro neom eo gan da na reul beo ri go ga shi neun nim eun sib ri do mot ga seo bal byung nan da”
Input pitch
Generated result
Generated singing (audio samples)
Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with an added Singer Identity Encoder.
• Disentangles the singer identity into timbre and singing style.
Generation Result
Singer A Singer B
Timbre A + Style B Timbre B + Style A
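A sketch of how the swap in the last row could be expressed, assuming a hypothetical identity encoder and synthesizer interface (these function signatures are placeholders, not the submitted system's API):

```python
def cross_synthesis(ref_a, ref_b, identity_encoder, synthesizer, text, midi):
    """Swap disentangled singer factors: timbre from one singer, singing style from the other.

    identity_encoder : maps a reference recording -> (timbre_embedding, style_embedding)
    synthesizer      : maps (text, midi, timbre, style) -> singing voice audio
    """
    timbre_a, style_a = identity_encoder(ref_a)
    timbre_b, style_b = identity_encoder(ref_b)
    return {
        "timbre_A_style_B": synthesizer(text, midi, timbre_a, style_b),
        "timbre_B_style_A": synthesizer(text, midi, timbre_b, style_a),
    }

# Toy stand-ins so the sketch runs:
enc = lambda ref: (sum(ref[:4]), sum(ref[4:]))                       # pretend (timbre, style)
synth = lambda text, midi, timbre, style: f"{text}/{midi}/t={timbre}/s={style}"
out = cross_synthesis([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1], enc, synth, "la", [60, 62])
```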
Conditional generative models: applications
1. Condition (controllability) → Generative Model → Output (signal): a deterministic mapping, which can be seen as a supervised way of disentangling representation.
What is lacking?…
2. Condition (a. controllability, b. signal (image/audio)) + Randomness (a. uncertainty, b. creativity) → Generative Model → Output (signal (audio/image)): a multi-modal transform with some stochasticity.
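In symbols: the first setting is a deterministic mapping from condition to output, while the second injects a noise variable 𝒛 so that one condition can map to many plausible outputs (the notation below is a generic formulation, not taken from the slide):

```latex
% 1. Deterministic conditional generation
\hat{\mathbf{x}} = G_{\theta}(\mathbf{c})

% 2. Stochastic conditional generation (multi-modal transform)
\hat{\mathbf{x}} = G_{\theta}(\mathbf{c}, \mathbf{z}), \qquad \mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})
```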
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
[Diagram: pose sequence (frames 1–8) concatenated with the music sequence (frames 2–9) → model → estimated pose sequence (frames 2–9)]
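A minimal sketch of this input/output arrangement (layer sizes, feature dimensions and the exact architecture are invented; the ISMIR 2019 model differs): music features and the current pose window are concatenated, and a 1-D convolutional encoder-decoder predicts the pose sequence shifted by one step.

```python
import torch
import torch.nn as nn

class PosePredictor(nn.Module):
    """Toy convolutional encoder-decoder: (music + pose) sequence -> next pose sequence."""
    def __init__(self, music_dim=40, pose_dim=34, hidden=128):
        super().__init__()
        self.net = nn.Sequential(                     # 1-D convs over the time axis
            nn.Conv1d(music_dim + pose_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, pose_dim, kernel_size=3, padding=1),
        )

    def forward(self, music, pose):
        # music: (B, T, music_dim) for frames 2..9, pose: (B, T, pose_dim) for frames 1..8
        x = torch.cat([music, pose], dim=-1).transpose(1, 2)   # concat, then (B, C, T) for Conv1d
        return self.net(x).transpose(1, 2)                     # estimated pose for frames 2..9

model = PosePredictor()
est = model(torch.randn(2, 8, 40), torch.randn(2, 8, 34))      # -> (2, 8, 34)
```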
Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
Blackpink - 불장난 (Playing with Fire)
Red Velvet - Rookie
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
• An autoencoder is used to obtain reduced acoustic features.
• A temporal-index mask transforms the frame-indexed acoustic features into beat-indexed acoustic features.
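A sketch of the beat-indexing step, assuming that “beat-indexed” means pooling the frame-level features between consecutive beat positions (the paper's temporal-index mask may differ):

```python
import numpy as np

def frame_to_beat_features(frame_feats, beat_frames):
    """Pool frame-indexed acoustic features into beat-indexed features.

    frame_feats : (n_frames, feat_dim) reduced acoustic features (e.g. an autoencoder bottleneck)
    beat_frames : increasing frame indices of detected beats, e.g. from a beat tracker
    """
    segments = zip(beat_frames[:-1], beat_frames[1:])
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in segments])  # (n_beats - 1, feat_dim)

feats = np.random.rand(1000, 16)
beats = np.arange(0, 1001, 50)                       # a beat every 50 frames, purely illustrative
beat_feats = frame_to_beat_features(feats, beats)    # -> (20, 16)
```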
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
Learning How to Compose
Generation
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
• Decompose the dance sequence into units using kinematic beats.
• With a VAE, disentangle each dance unit into an initial pose + movement.
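A rough sketch of that decomposition with a plain VAE (module shapes are illustrative; this is not the paper's architecture): the initial pose is taken directly from the unit, and the remaining movement is captured by the latent z.

```python
import torch
import torch.nn as nn

class MovementVAE(nn.Module):
    """Toy VAE: encode a dance unit into a movement latent z, decode from (initial pose, z)."""
    def __init__(self, pose_dim=34, seq_len=32, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(pose_dim * seq_len, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + pose_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, pose_dim * seq_len))
        self.pose_dim, self.seq_len = pose_dim, seq_len

    def forward(self, dance):                      # dance: (B, seq_len, pose_dim)
        h = self.enc(dance)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
        init_pose = dance[:, 0]                                   # disentangled factor 1: initial pose
        recon = self.dec(torch.cat([z, init_pose], dim=-1))       # factor 2: movement latent z
        return recon.view(-1, self.seq_len, self.pose_dim), mu, logvar

recon, mu, logvar = MovementVAE()(torch.randn(8, 32, 34))
```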
Conditional generative models (multi-modal)
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Compose
• Learns how to meaningfully compose a sequence of basic movements into a dance conditioned on the input music.
• Conditional adversarial training enforces the correspondence between music and dance.
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
[Diagram legend: marks where conditioning is applied]
Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Stochastic part (uncertainty)
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
[Figure: results for four different speakers (Spk1–Spk4)]
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Fix z & Change c (speech embedding)
Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020, OpenReview)
Fix c & Change z (random sampling)
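Both experiments amount to holding one input of a conditional generator fixed while varying the other; a small sketch with a hypothetical generator G(c, z) (the model in the paper is more involved):

```python
import torch

def fix_z_vary_c(G, speech_embeddings, z_dim=128):
    """Same random vector z, different speech (condition) embeddings c -> faces vary with speaker."""
    z = torch.randn(1, z_dim)
    return [G(c.unsqueeze(0), z) for c in speech_embeddings]

def fix_c_vary_z(G, speech_embedding, n_samples=4, z_dim=128):
    """Same condition c, different z -> several plausible faces for one voice (the stochastic part)."""
    c = speech_embedding.unsqueeze(0)
    return [G(c, torch.randn(1, z_dim)) for _ in range(n_samples)]

# Stand-in generator so the sketch runs: maps (c, z) to a fake 64x64 "face".
G = lambda c, z: torch.tanh(torch.randn(1, 3, 64, 64) * 0 + (c.mean() + z.mean()))
faces_per_speaker = fix_z_vary_c(G, torch.randn(4, 256))
faces_for_one_voice = fix_c_vary_z(G, torch.randn(256))
```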
Thank You!
Questions?
Editor's Notes
1. The application changes depending on what information is being filled in.
2. Generator architecture: a stack of transposed convolutional layers upsamples the input sequence, and each transposed convolutional layer is followed by a stack of residual blocks. Induced receptive field: the residual blocks use dilations so that temporally far output activations of each layer have significantly overlapping inputs; the receptive field of a stack of dilated convolution layers increases exponentially with the number of layers. Discriminator: a multiscale architecture with 3 discriminators of identical structure operating on different audio scales (the original scale, and 2x and 4x downsampled); each discriminator is biased to learn features for a different frequency range of the audio. Window-based objective: each individual discriminator is a Markovian window-based discriminator (analogous to image patches, Isola et al. (2017)); the discriminator learns to classify between distributions of small audio chunks, and overlapping large windows maintain coherence across patches. (A rough generator sketch follows these notes.)
3. 1. Dance 2. Audio signal generation 3. Aumon (reflecting stochasticity) 4. Future work, with the example of image generation with stochasticity
4. A face cannot be 100% inferred from a voice alone.
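The generator/discriminator description in note 2 appears to correspond to a MelGAN-style design; a loose sketch of just the generator side (channel counts, upsampling factors and kernel sizes are placeholders, not the published configuration):

```python
import torch
import torch.nn as nn

class ResStack(nn.Module):
    """Stack of dilated residual blocks after each upsampling layer (grows the receptive field)."""
    def __init__(self, ch):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LeakyReLU(0.2),
                          nn.Conv1d(ch, ch, kernel_size=3, dilation=3 ** i, padding=3 ** i),
                          nn.LeakyReLU(0.2),
                          nn.Conv1d(ch, ch, kernel_size=1))
            for i in range(3)                      # dilations 1, 3, 9 -> exponentially growing RF
        ])

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)                       # residual connection
        return x

class Generator(nn.Module):
    """Mel-spectrogram -> waveform via transposed convs, each followed by a residual stack."""
    def __init__(self, n_mels=80):
        super().__init__()
        layers, ch = [nn.Conv1d(n_mels, 256, kernel_size=7, padding=3)], 256
        for factor in (8, 8, 4):                   # overall upsampling 8 * 8 * 4 = 256 (hop size)
            layers += [nn.LeakyReLU(0.2),
                       nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * factor,
                                          stride=factor, padding=factor // 2),
                       ResStack(ch // 2)]
            ch //= 2
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):                        # mel: (B, n_mels, frames)
        return self.net(mel)                       # (B, 1, frames * 256)

wav = Generator()(torch.randn(1, 80, 50))          # -> (1, 1, 12800)
```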