2. Hyeong-Seok Choi (kekepa15@snu.ac.kr)
Affiliation: Seoul National University, Music & Audio Research Group
Research interests: Audio Source Separation, Speech Enhancement, Self-supervised Representation Learning & Generation, Singing Voice Synthesis

Juheon Lee (juheon2@snu.ac.kr)
Affiliation: Seoul National University, Music & Audio Research Group
Research interests: Singing Voice Synthesis, Lyric-to-audio Alignment, Cover Song Identification, Abnormal Sound Detection, Choreography Generation
5. Generative models
Explicit models: infer the parameters of p(X; θ). (i.e., how likely is this cat?)
[Figure: data samples X and the modeled density p(X; θ)]
VAE, Autoregressive models, …
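As a hedged illustration (not from the slides): a minimal explicit model fits the parameters θ of a simple diagonal-Gaussian density and then evaluates log p(x; θ) for a new sample, answering "how likely is this cat?". All names and numbers below are toy assumptions.

```python
# Minimal sketch of an *explicit* generative model: fit the parameters of
# p(X; theta) (here a diagonal Gaussian) and evaluate how likely a new sample is.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=(1000, 16))   # toy "training data"

# Maximum-likelihood estimate of theta = (mu, sigma)
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-6

def log_likelihood(x):
    """log p(x; theta) under the fitted diagonal Gaussian."""
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2 * np.pi))

x_new = rng.normal(loc=2.0, scale=0.5, size=16)
print("log p(x_new; theta) =", log_likelihood(x_new))   # "how likely is this cat?"
```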
6. Generative models
Implicit models: I don’t care about the parameters, just give me some nice cats when I roll the dice! (sampling)
[Figure: samples X drawn without explicitly evaluating p(X; θ)]
GANs…
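For contrast, a minimal sketch of an implicit model in PyTorch: a toy GAN-style generator that only maps noise to samples and never exposes p(X; θ). The architecture is purely illustrative.

```python
# Minimal sketch of an *implicit* generative model (GAN-style generator):
# no density p(X; theta) is ever evaluated, we only "roll the dice" and sample.
import torch
import torch.nn as nn

generator = nn.Sequential(          # untrained toy generator, for illustration only
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 16 * 16), nn.Tanh(),
)

z = torch.randn(8, 64)                         # roll the dice: sample latent noise
fake_cats = generator(z).view(8, 1, 16, 16)    # 8 "cat" images, 16x16
print(fake_cats.shape)   # sampling works, but p(fake_cats; theta) is unavailable
```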
8. Conditional generative models
Application-dependent modeling
1. Given a piano roll, I want to generate an expressive piano performance
2. Given a mel-spectrogram, I want to generate a raw audio signal
3. Given a linguistic feature, I want to generate a speech signal
…
[Diagram: Condition (1. controllability) → Generative Model → Output (1. signal)]
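As a hypothetical sketch of the block diagram above (not any specific system): a conditional generator takes the condition as an extra input, e.g. by concatenating a condition embedding with a latent/noise vector, which is what gives the model its controllability.

```python
# Hypothetical sketch of a conditional generative model: the condition
# (e.g., a mel-spectrogram frame, a score, or a linguistic feature) is
# embedded and concatenated with the model input, giving controllability.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, cond_dim=80, z_dim=64, out_dim=256):
        super().__init__()
        self.cond_embed = nn.Linear(cond_dim, 128)
        self.net = nn.Sequential(
            nn.Linear(128 + z_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, condition, z):
        h = torch.cat([self.cond_embed(condition), z], dim=-1)
        return self.net(h)            # output signal frame, shaped by the condition

G = ConditionalGenerator()
mel = torch.randn(4, 80)              # condition: e.g., one mel-spectrogram frame
z = torch.randn(4, 64)                # randomness (optional in purely deterministic models)
out = G(mel, z)
print(out.shape)                      # torch.Size([4, 256])
```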
9. Conditional generative models
What does a conditional generative model do?
Reconstruct a signal from given information (filling in the missing information)
Level of “missing information”? (from a music & audio point of view)
Condition abstraction level, roughly from abstract (sparse) to realistic (dense): instrument class, sound class, non-expressive score, linguistic feature, MIDI score w/ velocity etc., linguistic feature w/ pitch, audio features (mel-spectrogram)
16. Conditional generative models: applications
Example of densely conditioned models: Vocoders
Practical/interesting application of vocoders: Generative speech enhancement
• Parametric Resynthesis with Neural Vocoders (WASPAA 2019)
• Generative Speech Enhancement Based on Cloned Networks (WASPAA 2019)
• Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement (arXiv, 2019)
• A Speech Synthesis Approach for High Quality Speech Separation and Generation (IEEE Signal Processing Letters, 2019)
Key idea: ensemble the power of the discriminative & generative approaches!
Pros: Almost no artifacts
Cons: Inaccurate pronunciation in low-SNR conditions
[Pipeline: noisy mel-spectrogram → Separator (discriminative) → estimated clean mel-spectrogram → Synthesizer / vocoder (generative) → synthesized clean raw wave]
Some of my preliminary results… [audio samples: noisy input vs. generated output]
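A minimal sketch of the two-stage idea, assuming a mel-to-mel denoising separator followed by a toy neural vocoder; module names, shapes, and layers are illustrative stand-ins, not the cited papers' implementations.

```python
# Sketch of generative speech enhancement: a discriminative separator estimates
# a clean mel-spectrogram from a noisy one, then a (normally pretrained) neural
# vocoder re-synthesizes the waveform from that estimate.
import torch
import torch.nn as nn

class MelDenoiser(nn.Module):                 # discriminative stage
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, n_mels, 3, padding=1),
        )
    def forward(self, noisy_mel):             # (B, n_mels, T) -> (B, n_mels, T)
        return self.net(noisy_mel)

class ToyVocoder(nn.Module):                  # generative stage (stand-in for WaveNet/WaveGlow/...)
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop * 2,
                                           stride=hop, padding=hop // 2)
    def forward(self, mel):                   # (B, n_mels, T) -> (B, 1, T * hop)
        return torch.tanh(self.upsample(mel))

separator, vocoder = MelDenoiser(), ToyVocoder()
noisy_mel = torch.randn(1, 80, 100)
clean_mel_est = separator(noisy_mel)          # estimated clean mel-spectrogram
wave = vocoder(clean_mel_est)                 # synthesized clean raw waveform
print(wave.shape)
```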
17. Conditional generative models: applications
Example of densely conditioned models: Vocoders
• Some other practical/interesting application: next-generation codecs
1. WaveNet Based Low Rate Speech Coding (ICASSP 2018)
2. Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder (ICASSP 2019)
3. Improving Opus Low Bit Rate Quality with Neural Speech Synthesis (arXiv, 2019)
Key idea:
1. Deep learning is good at learning a compressed representation (Encoder).
2. Deep learning is good at synthesizing (Decoder).
Pros: Low bit rate (bps)
Cons: ???
[Diagram: speech → Encoder (Server1) → compressed representation → Decoder (Server2) → reconstructed signal (speech)]
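A rough sketch of the codec idea under the assumption of a VQ-style bottleneck: the encoder emits small integer codes to transmit, and the decoder reconstructs the frame. Frame sizes, codebook size, and the resulting bit rate below are illustrative only, not any of the cited systems.

```python
# Toy VQ-style codec: an encoder compresses speech frames into discrete codes
# (cheap to transmit), and a decoder/synthesizer reconstructs the signal.
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    def __init__(self, frame=160, latent=8, codebook=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame, 64), nn.ReLU(), nn.Linear(64, latent))
        self.codebook = nn.Embedding(codebook, latent)       # 256 entries -> 8 bits per code
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, frame))

    def encode(self, x):                                     # x: (B, frame) waveform samples
        z = self.encoder(x)
        dist = torch.cdist(z, self.codebook.weight)          # nearest-codeword quantization
        return dist.argmin(dim=-1)                           # integer codes to transmit

    def decode(self, codes):
        return self.decoder(self.codebook(codes))            # reconstructed frame

codec = ToyCodec()
frames = torch.randn(4, 160)             # 4 frames of 160 samples (10 ms @ 16 kHz)
codes = codec.encode(frames)             # 8 bits per 10 ms frame = 0.8 kbps in this toy setup
recon = codec.decode(codes)
print(codes.shape, recon.shape)
```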
18. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Training stage
19. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation stage
[Diagram: TEXT and MIDI conditions → conditioned wave]
20. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Main Idea: disentangling the formant mask & pitch skeleton
• We wanted pitch and text information to be modelled as independent acoustic features, and we designed the network to reflect that (a minimal sketch of this idea follows below).
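A minimal sketch of the disentangling idea, assuming, purely for illustration, that a text-driven formant mask multiplicatively gates a pitch-driven harmonic skeleton to form the mel-spectrogram; the paper's exact formulation and network may differ.

```python
# Illustrative sketch (not the paper's exact network): text conditions produce a
# formant "mask" and pitch conditions produce a harmonic "skeleton"; here they are
# combined by element-wise multiplication to form the generated mel-spectrogram,
# so pitch and text are modelled as independent factors.
import torch
import torch.nn as nn

class FormantPitchDecoder(nn.Module):
    def __init__(self, text_dim=64, pitch_dim=64, n_mels=80):
        super().__init__()
        self.to_mask = nn.Sequential(nn.Linear(text_dim, n_mels), nn.Sigmoid())        # formant mask in [0, 1]
        self.to_skeleton = nn.Sequential(nn.Linear(pitch_dim, n_mels), nn.Softplus())  # non-negative harmonics

    def forward(self, text_emb, pitch_emb):        # both: (B, T, dim)
        mask = self.to_mask(text_emb)
        skeleton = self.to_skeleton(pitch_emb)
        return mask * skeleton                     # (B, T, n_mels) generated mel-spectrogram

dec = FormantPitchDecoder()
mel = dec(torch.randn(1, 200, 64), torch.randn(1, 200, 64))
print(mel.shape)   # changing only pitch_emb moves the skeleton; only text_emb moves the mask
```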
21. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do re mi fa sol ra ti do”
Input pitch: [C D E F G A B C]
Generated audio: [audio sample]
[Figures: formant mask, pitch skeleton, generated mel-spectrogram]
22. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do do do do do do do do”
Input pitch: [C D E F G A B C]
Generated audio: [audio sample]
[Figures: formant mask, pitch skeleton, generated mel-spectrogram]
23. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “do re mi fa sol ra ti do”
Input pitch: [C C C C C C C C]
Generated audio: [audio sample]
[Figures: formant mask, pitch skeleton, generated mel-spectrogram]
24. Conditional generative models: applications
Singing Voice Generation – single singer
J Lee et al, Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Interspeech 2019)
Generation Result
Input text: “아리랑 아리랑 아라리오 아리랑 고개로 넘어간다 나를 버리고 가시는 님은 십리도 못 가서 발병 난다”
(“arirang arirang arario arirang go gae ro neom eo gan da na reul beo ri go ga shi neun nim eun sib ri do mot ga seo bal byung nan da”)
Input pitch: [figure]
Generated result: [figure]
Generated singing: [audio samples]
25. Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with an added Singer Identity Encoder.
• Disentangles the singer identity into timbre and singing style (see the sketch below).
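A minimal sketch of the singer-identity idea, assuming the identity encoder produces separate timbre and singing-style embeddings that can be recombined across singers; module names and dimensions are illustrative, not the submitted system.

```python
# Illustrative sketch: a singer-identity encoder maps reference singing features to
# two separate embeddings, timbre and singing style, which the synthesizer consumes
# independently; swapping embeddings across singers gives "Timbre A + Style B".
import torch
import torch.nn as nn

class SingerIdentityEncoder(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=64):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, 128, batch_first=True)
        self.timbre_head = nn.Linear(128, emb_dim)
        self.style_head = nn.Linear(128, emb_dim)

    def forward(self, ref_mel):                    # (B, T, feat_dim) reference singing
        _, h = self.backbone(ref_mel)              # summarize the reference clip
        h = h.squeeze(0)
        return self.timbre_head(h), self.style_head(h)

enc = SingerIdentityEncoder()
timbre_a, style_a = enc(torch.randn(1, 300, 80))   # reference of singer A
timbre_b, style_b = enc(torch.randn(1, 300, 80))   # reference of singer B
mixed_identity = torch.cat([timbre_a, style_b], dim=-1)   # condition the synthesizer on A's timbre + B's style
print(mixed_identity.shape)
```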
26. Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with an added Singer Identity Encoder.
• Disentangles the singer identity into timbre and singing style.
Generation Result
Singer A Singer B
27. Conditional generative models: applications
Singing Voice Generation – multi singer
J Lee et al, Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System (submitted to ICASSP 2020)
• Based on the single-singer model, with an added Singer Identity Encoder.
• Disentangles the singer identity into timbre and singing style.
Generation Result
Singer A Singer B
Timbre A + Style B Timbre B + Style A
28. Conditional generative models: applications
What is lacking?…
1. [Diagram: Condition (1. controllability) → Generative Model → Output (1. signal)]: deterministic
2. [Diagram: Condition (a. controllability, b. signal (image/audio)) + Randomness (a. uncertainty, b. creativity) → Generative Model → Output (a. signal (audio/image))]: some stochasticity, a multi-modal transform; can be seen as a supervised way of disentangling representations
29. Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
32. Conditional generative models: applications
Audio Driven Dance Generation – Listen to dance
J Lee et al, Automatic choreography generation with convolutional encoder-decoder network (ISMIR 2019)
Black Pink - 불장난 (Playing with Fire)
Red Velvet - Rookie
33. Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
34. Conditional generative models: applications
Audio Driven Dance Generation – Dance with melody
T Tang et al, Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis (ACMMM 2018)
• An autoencoder is used to obtain reduced acoustic features.
• A temporal-index mask transforms the frame-indexed acoustic features into beat-indexed acoustic features (a sketch of this step follows below).
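A minimal sketch of the beat-indexing step, assuming frame-level features are simply averaged within each beat interval; the paper's temporal-index mask may be implemented differently.

```python
# Illustrative sketch: convert frame-indexed acoustic features into beat-indexed
# features by pooling the frames that fall between consecutive beat positions.
import numpy as np

def beat_pool(frame_feats, beat_frames):
    """frame_feats: (T, D) frame-level features; beat_frames: sorted frame indices of beats."""
    segments = np.split(frame_feats, beat_frames)          # split at every beat boundary
    return np.stack([seg.mean(axis=0) for seg in segments if len(seg) > 0])

frame_feats = np.random.randn(400, 32)               # e.g., reduced acoustic features from the autoencoder
beat_frames = [50, 100, 150, 200, 250, 300, 350]     # beat positions in frames
beat_feats = beat_pool(frame_feats, beat_frames)
print(beat_feats.shape)                               # (8, 32): one feature vector per beat interval
```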
35. Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
36. Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
Learning How to Compose
Generation
37. Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Move
• Decompose the dance sequence using kinematic beats
• With a VAE, disentangle each dance unit into an initial pose + movement (see the sketch below)
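An illustrative VAE sketch of this decomposition (not the authors' exact model): the latent code captures the movement while the initial pose is kept as a separate input to the decoder.

```python
# Illustrative sketch: a VAE encodes a dance unit into a latent "movement" code,
# and the decoder reconstructs the pose sequence from that code plus the initial
# pose, so initial pose and movement are separated.
import torch
import torch.nn as nn

class MovementVAE(nn.Module):
    def __init__(self, pose_dim=34, latent_dim=32, T=32):
        super().__init__()
        self.T = T
        self.encoder = nn.GRU(pose_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.GRU(latent_dim + pose_dim, 128, batch_first=True)
        self.to_pose = nn.Linear(128, pose_dim)

    def forward(self, poses):                          # poses: (B, T, pose_dim)
        _, h = self.encoder(poses)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterization trick
        init_pose = poses[:, :1].expand(-1, self.T, -1)          # initial pose kept as a separate factor
        dec_in = torch.cat([z.unsqueeze(1).expand(-1, self.T, -1), init_pose], dim=-1)
        out, _ = self.decoder(dec_in)
        return self.to_pose(out), mu, logvar           # reconstructed movement, plus terms for the KL loss

vae = MovementVAE()
recon, mu, logvar = vae(torch.randn(2, 32, 34))
print(recon.shape)
```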
38. Conditional generative models (multi-modal)
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
Learning How to Compose
• Learns how to meaningfully compose a sequence of basic movements into a dance conditioned on the input music.
• Conditional adversarial training enforces the correspondence between music and dance (see the sketch below).
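A minimal sketch of the conditional-adversarial signal, assuming a discriminator that scores whether a dance sequence matches the conditioning music features; this illustrates the training objective, not the authors' architecture.

```python
# Illustrative sketch: a conditional discriminator judges (music, dance) pairs, so
# the composer/generator is pushed to produce dances that correspond to the music.
import torch
import torch.nn as nn

class MusicDanceDiscriminator(nn.Module):
    def __init__(self, music_dim=32, pose_dim=34):
        super().__init__()
        self.net = nn.GRU(music_dim + pose_dim, 128, batch_first=True)
        self.score = nn.Linear(128, 1)

    def forward(self, music, dance):                   # both: (B, T, dim)
        _, h = self.net(torch.cat([music, dance], dim=-1))
        return self.score(h[-1])                       # matching/not-matching logit

D = MusicDanceDiscriminator()
music = torch.randn(2, 64, 32)
real_dance, fake_dance = torch.randn(2, 64, 34), torch.randn(2, 64, 34)
d_loss = (nn.functional.binary_cross_entropy_with_logits(D(music, real_dance), torch.ones(2, 1))
          + nn.functional.binary_cross_entropy_with_logits(D(music, fake_dance), torch.zeros(2, 1)))
print(d_loss.item())
```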
39. Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
[Figure: markers indicate where conditioning is applied]
40. Conditional generative models: applications
Audio Driven Dance Generation – Dancing to music
Hsin-Ying Lee et al, Dancing to Music (NeurIPS 2019)
41. Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020 OpenReview)
Stochastic part (uncertainty)
42. Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020 OpenReview)
[Figure: generated faces for Spk1, Spk2, Spk3, Spk4]
43. Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020 OpenReview)
Fix z & Change c (speech embedding)
44. Conditional generative models: applications
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face
from Speech
Anonymous authors (ICLR 2020 OpenReview)
Fix c & Change z (random sampling)
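A minimal sketch of these two probes with a hypothetical conditional generator G(c, z): fixing z while sweeping the speech embedding c, versus fixing c while resampling z. The toy generator below is illustrative, not the paper's model.

```python
# Illustrative sketch of the two experiments: with a conditional generator G(c, z),
# (1) fix z and sweep the speech embedding c, (2) fix c and resample the noise z.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(128 + 64, 512), nn.ReLU(), nn.Linear(512, 3 * 32 * 32))  # toy face generator

def generate(c, z):
    return G(torch.cat([c, z], dim=-1)).view(-1, 3, 32, 32)

speech_embeddings = torch.randn(4, 128)        # c: embeddings of 4 different utterances/speakers
z_fixed = torch.randn(1, 64)

faces_vary_c = generate(speech_embeddings, z_fixed.expand(4, -1))                  # fix z, change c
faces_vary_z = generate(speech_embeddings[:1].expand(4, -1), torch.randn(4, 64))   # fix c, change z
print(faces_vary_c.shape, faces_vary_z.shape)
```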
Generator
Architecture
• A stack of transposed convolutional layers upsamples the input sequence.
• Each transposed convolutional layer is followed by a stack of residual blocks.
Induced Receptive Field
• Residual blocks use dilations, so temporally distant output activations of each layer have significantly overlapping inputs.
• The receptive field of a stack of dilated convolution layers increases exponentially with the number of layers.
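A minimal sketch of this kind of generator, with illustrative upsampling factors, channel counts, and dilations (the real model's hyperparameters may differ):

```python
# Sketch of the described generator: transposed convolutions upsample the input
# sequence in time, and each upsampling stage is followed by dilated residual
# blocks whose receptive field grows exponentially with depth.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=3, dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=1),
        )
    def forward(self, x):
        return x + self.conv(x)                       # residual connection

class UpsamplingGenerator(nn.Module):
    def __init__(self, n_mels=80, ch=256):
        super().__init__()
        layers = [nn.Conv1d(n_mels, ch, kernel_size=7, padding=3)]
        for factor in (8, 8, 2, 2):                   # total upsampling: 256x (one waveform sample per hop)
            layers += [nn.ConvTranspose1d(ch, ch // 2, kernel_size=factor * 2,
                                          stride=factor, padding=factor // 2)]
            layers += [ResBlock(ch // 2, d) for d in (1, 3, 9)]   # stacked dilations
            ch //= 2
        layers += [nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)
    def forward(self, mel):                           # (B, n_mels, T) -> (B, 1, T * 256)
        return self.net(mel)

wave = UpsamplingGenerator()(torch.randn(1, 80, 50))
print(wave.shape)
```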
Discriminator
Multiscale Architecture
• 3 discriminators (with identical structure) operate on different audio scales: the original scale, and 2x and 4x downsampled.
• Each discriminator is biased to learn features for a different frequency range of the audio.
Window-based Objective
• Each individual discriminator is a Markovian window-based discriminator (analogous to image patches, Isola et al. (2017)).
• The discriminator learns to classify between distributions of small audio chunks.
• Overlapping large windows maintain coherence across patches.
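A minimal sketch of such a multiscale, window-based discriminator; kernel sizes, strides, and channel counts are illustrative:

```python
# Sketch of the described multiscale, window-based discriminator: three identical
# sub-discriminators operate on the original, 2x-, and 4x-downsampled audio, and
# each outputs a grid of per-window (patch) real/fake scores rather than one score.
import torch
import torch.nn as nn

def sub_discriminator():
    return nn.Sequential(   # strided convs: each output unit scores a window of input audio
        nn.Conv1d(1, 16, kernel_size=15, stride=1, padding=7), nn.LeakyReLU(0.2),
        nn.Conv1d(16, 64, kernel_size=41, stride=4, padding=20, groups=4), nn.LeakyReLU(0.2),
        nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=16), nn.LeakyReLU(0.2),
        nn.Conv1d(256, 1, kernel_size=3, stride=1, padding=1),   # per-window logits (Markovian / patch-style)
    )

class MultiScaleDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.discs = nn.ModuleList([sub_discriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)   # 2x downsampling between scales

    def forward(self, wave):                    # wave: (B, 1, T)
        scores = []
        for d in self.discs:
            scores.append(d(wave))              # (B, 1, T') grid of window scores
            wave = self.pool(wave)              # next discriminator sees 2x-downsampled audio
        return scores

outs = MultiScaleDiscriminator()(torch.randn(1, 1, 8192))
print([o.shape for o in outs])                  # one score map per scale
```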
1. Dance
2. Audio signal generation
3. Aumon (with stochasticity incorporated)
4. Future work, with the example of image generation with stochasticity