Generating Music
with GANs
An Overview and Case Studies
Hao-Wen Dong
UC San Diego & Academia Sinica
Yi-Hsuan Yang
Taiwan AI Labs & Academia Sinica
November 4th, 2019
salu133445.github.io/ismir2019tutorial/
Outline
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Coding session 1: GAN for images ◁ break
Section 3
Case studies of GAN-based systems (I)
Coding session 2: GAN for piano rolls ◁ break
Case studies of GAN-based systems (II)
Section 4
Current limitations
Future research directions
salu133445.github.io/ismir2019tutorial/
About us
Hao-Wen Dong
• Ph.D. student, UC San Diego (2019-)
• Research intern, Yamaha Corporation (2019)
• Research assistant, Academia Sinica (2017-2019)
• First author of MuseGAN & BMuseGAN
Yi-Hsuan Yang
• Chief Music Scientist, Taiwan AI Labs (2019-)
• Research professor, Academia Sinica (2011-)
• Ph.D., National Taiwan University (2006-2010)
• Associate Editor of IEEE TMM and TAFFC (2017-2019)
• Program Chair of ISMIR @ Taipei, Taiwan (2014)
• Tutorial speaker at ISMIR (most recently in 2012)
About the Music and AI Lab @ Sinica
• About Academia Sinica
• National academy of Taiwan, founded in 1928 (not a university)
• About 1,000 full, associate, and assistant research professors
• About Music and AI Lab
• https://musicai.citi.sinica.edu.tw/
• Since Sep 2011
• Members
• PI [me]
• research assistants
• PhD/master students
• 3 AAAI full papers + 3 IJCAI full papers in 2018 and 2019
About the Music Team @ Taiwan AI Labs
• About Taiwan AI Labs
• https://ailabs.tw/
• Privately-funded research organization (like OpenAI), founded in 2017
• Three main research areas: 1) HCI, 2) medicine, 3) smart city
• 100+ employees (late 2019)
• About the “Yating” Music AI team
• Members
• scientist [me]
• ML engineers (for models)
• musicians
• program manager
• software engineers (for frontend/backend)
Outline
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Section 3
Case studies of GAN-based systems
Section 4
Current limitations
Future research directions
From a music production view
• Composing / songwriting
• Melody
• Chords
• Lyrics
• Arranging
• Instrumentation
• Structure
• Mixing
• Timbres/tones
• Balancing
Use cases of music AI
• Make musicians' lives easier
• Inspire ideas
• Suggest continuations, accompaniments, lyrics, or drum loops
• Suggest mixing presets
• Empower everyone to make music
• Democratization of music creation
• Create copyright free music for videos or games
• Music education (e.g., auto-accompaniment)
Companies involved in automatic music generation
[Slide: logos of companies working on automatic music generation] … and more!
Demo: Magenta Studio
• https://magenta.tensorflow.org/studio/
Demo: Jamming with Yating
• https://www.youtube.com/watch?v=9ZIJrr6lmHg
• Jamming with Yating [1,2]
• Input (by human): piano
• Output: piano + bass + drum
[1] Hsiao et al., "Jamming with Yating: Interactive demonstration of a music composition AI," ISMIR-LBD 2019
[2] Yeh et al., "Learning to generate Jazz and Pop piano music from audio via MIR techniques," ISMIR-LBD 2019
From a deep learning view
• Input representation
• Model
• Output representation
Models
• Rule-based methods
• Concatenation-based methods
• Machine-learning-based methods
• VAE: variational autoencoder
• GAN: generative adversarial network
• See Section 2: Introduction to GANs
Why GAN?
• State-of-the-art model in:
• Image generation: BigGAN [1]
• Text-to-speech audio synthesis: GAN-TTS [2]
• Note-level instrument audio synthesis: GANSynth [3]
• Also see ICASSP 2018 tutorial: “GAN and its applications to signal processing and NLP” [4]
• Its potential for music generation has not been fully realized
• Adversarial training has many other applications
• For example, source separation [5], domain adaptation [6], music transcription [7]
[1] “Large scale GAN training for high fidelity natural image synthesis,” ICLR 2019
[2] “High fidelity speech synthesis with adversarial networks,” ICLR 2020 submission
[3] “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
[4] https://tinyurl.com/y23ywv4s (on slideshare)
[5] “SVSGAN: Singing voice separation via generative adversarial network,” ICASSP 2018
[6] “Cross-cultural music emotion recognition by adversarial discriminative domain adaptation,” ICMLA 2018
[7] “Adversarial learning for improved onsets and frames music transcription,” ISMIR 2019
Input/output representations
• Symbolic output
• Piano rolls (image-like): easier for GANs to work with
• MIDI events (text-like)
• Score (hybrid)
• Audio output
• Spectrogram (image-like)
• Waveform
I/O representations
• Symbolic output
• Piano rolls (image-like):
MidiNet [1], MuseGAN [2]
• MIDI events (text-like):
Music Transformer [3], MuseNet [4]
• Scores (hybrid):
Thickstun [5], measure-by-measure [6]
[1] https://arxiv.org/abs/1703.10847, ISMIR 2017
[2] https://arxiv.org/abs/1709.06298, AAAI 2018
[3] https://openreview.net/pdf?id=rJe4ShAcF7, ICLR 2019
[4] https://openai.com/blog/musenet/
[5] https://arxiv.org/abs/1811.08045, ISMIR 2019
[6] https://openreview.net/forum?id=Hklk6xrYPB, ICLR 2020 submission
Scope of music generation
• Generation from scratch
• X → melody
• X → piano roll
• X → audio
• Conditional generation
• melody → piano roll (accompaniment)
• piano roll → audio (synthesis)
• piano roll → piano roll′ (rearrangement)
• audio → audio′
• See Section 3: Case studies
• and https://github.com/affige/genmusic_demo_list
We will talk about
1. Symbolic melody generation: MidiNet [1], SSMGAN [2]
2. Arrangement generation: MuseGAN [3], BinaryMuseGAN [4], LeadSheetGAN [5]
3. Style transfer: CycleGAN [6], TimbreTron [7], Play-as-you-like [8], CycleBEGAN [9]
4. Audio generation: WaveGAN [10], GANSynth [11]
[1] “MidiNet: A convolutional GAN for symbolic-domain music generation,” ISMIR 2017
[2] “Modeling self-repetition in music generation using structured adversaries,” ML4MD 2019
[3] “MuseGAN: Multi-track sequential GANs for symbolic music generation and accompaniment,” AAAI 2018
[4] “Convolutional GANs with binary neurons for polyphonic music generation,” ISMIR 2018
[5] “Lead sheet generation and arrangement by conditional GAN,” ISMIR-LBD 2018
[6] “Symbolic music genre transfer with CycleGAN,” ICTAI 2018
[7] “TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer,” ICLR 2019
[8] “Play as You Like: Timbre-enhanced multi-modal music style transfer,” AAAI 2019
[9] “Singing style transfer using cycle-consistent boundary equilibrium GANs,” ICML workshop 2018
[10] “Adversarial audio synthesis,” ICLR 2019
[11] “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
Outline
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Section 3
Case studies of GAN-based systems
Section 4
Current limitations
Future research directions
What is a GAN?
Generative Adversarial Network
• Generative: a generative model
• Adversarial: an adversarial game between two competitors
• Network: a deep neural network
A loss function for training generative models
[Diagram: the generator G, a latent variable model to be learned, maps random noise z ~ p_Z to fake samples G(z); the discriminator D receives real samples x ~ p_X and fake samples G(z) and outputs real/fake (1/0)]
Goodfellow et al., "Generative adversarial networks," NeurIPS 2014
A loss function for training generative models
• Discriminator: maximize log(D(x)) + log(1 − D(G(z))), i.e., tell the fake G(z) apart from the real samples x
• Generator: minimize log(1 − D(G(z))), i.e., make G(z) indistinguishable from real data for D
Goodfellow et al., "Generative adversarial networks," NeurIPS 2014
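To make the two objectives concrete, here is a minimal sketch of one training step in PyTorch, assuming toy fully-connected networks and a stand-in data batch (the tutorial's Colab notebooks define their own models); the generator uses the common non-saturating variant, maximizing log(D(G(z))) instead of minimizing log(1 − D(G(z))).

```python
# A minimal vanilla-GAN training step (sketch; all shapes and networks are
# illustrative assumptions, not the tutorial's actual notebook code).
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 100, 784, 64
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(batch, data_dim)    # stand-in for a batch of real samples x ~ p_X
z = torch.randn(batch, latent_dim)  # random noise z ~ p_Z

# Discriminator step: push D(x) -> 1 (real) and D(G(z)) -> 0 (fake)
d_loss = bce(D(x), torch.ones(batch, 1)) + bce(D(G(z).detach()), torch.zeros(batch, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating): push D(G(z)) -> 1
g_loss = bce(D(G(z)), torch.ones(batch, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```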
Problems of unregularized GANs
• Key—the discriminator provides the generator with gradients as guidance for improvement
• Discrimination is easier than generation
• The discriminator tends to provide large gradients
• This results in unstable training of the generator
• Common failure cases
• Mode collapse
• Missing modes
(Figure: colors show the outputs of the discriminator; sharp color changes → large gradients)
Regularizing GANs
[Figure: discriminator landscapes of an unregularized, a locally regularized, and a globally regularized GAN]
• Methods: gradient clipping [1], gradient penalties [2,3] (see the sketch after the references), spectral normalization [4]
• Advantages of gradient regularization
• provide a smoother guidance to the generator
• alleviate the mode collapse and missing modes issues
[1] Arjovsky et al., “Wasserstein generative adversarial networks,” ICML 2017
[2] Gulrajani et al., “Improved training of Wasserstein GANs,” NeurIPS 2017
[3] Kodali et al., “On convergence and stability of GANs,” arXiv 2017
[4] Miyato et al., “Spectral normalization for generative adversarial networks,” ICLR 2018
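As an illustration of gradient penalties [2,3], here is a minimal sketch of the WGAN-GP penalty, which pushes the discriminator's gradient norm toward 1 on points interpolated between real and fake samples (the discriminator D and the tensor shapes are assumptions):

```python
# WGAN-GP gradient penalty (sketch following Gulrajani et al. [2]).
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # Random per-sample interpolation between the real and fake batches
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                                create_graph=True)[0]
    # Penalize deviation of the gradient norm from 1
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```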
Coding session I—GAN for images
Google Colab notebook link
https://colab.research.google.com/drive/1Cnq9z3QvxIsVntlXKjPjbwttxeDH47Xl
You can also find the link on the tutorial website
https://salu133445.github.io/ismir2019tutorial/
Deep convolutional GAN (DCGAN)
Key—Use CNNs for both G and D
[Diagram: the same GAN setup as before, with the generator built from transposed convolutional layers that upsample the random noise z into an image G(z)]
Radford et al., “Unsupervised representation learning with deep convolutional GANs,” ICLR 2016
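A minimal sketch of a DCGAN-style generator, assuming a 64×64 single-channel output and illustrative channel sizes (not the exact architecture of Radford et al.):

```python
import torch.nn as nn

# Each transposed convolution doubles the spatial resolution of the feature map.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 1x1 -> 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # -> 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # -> 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # -> 32x32
    nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh(),                          # -> 64x64
)
```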
GANs vs VAEs

| GAN | VAE
Objective | (generator) fool the discriminator; (discriminator) tell real data from fake ones | reconstruct real data using a pixel-wise loss
Results | tend to be sharper | tend to be more blurred
Diversity | higher | lower
Stability | lower | higher
Larsen et al., “Autoencoding beyond pixels using a learned similarity metric,” ICML 2016
State of the art—BigGAN
[Colab notebook demo]
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb (hard classes)
Brock et al., "Large scale GAN training for high fidelity natural image synthesis," ICLR 2019
Interpolation on the latent space
[Figure: linear interpolation in the latent space yields smooth transitions in the data space]
Brock et al., "Large scale GAN training for high fidelity natural image synthesis," ICLR 2019
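A sketch of latent-space interpolation: decode evenly spaced points on the line between two latent vectors (G here is a toy stand-in for any trained generator):

```python
import torch

G = torch.nn.Linear(100, 784)  # toy stand-in for a trained generator
z0, z1 = torch.randn(100), torch.randn(100)
samples = [G((1 - a) * z0 + a * z1) for a in torch.linspace(0, 1, 9)]
# A well-trained G turns this straight path in latent space into a
# smooth transition in data space.
```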
Conditional GAN (CGAN)
Key—Feed conditions to both G and D
• The generator now generates samples based on some conditions: G(z, y), with z ~ p_Z and y ~ p_Y (see the sketch below)
• The discriminator now examines whether a pair (x, y) ~ p_X,Y or (G(z, y), y) is real or not
[Diagram: the GAN setup as before, with the condition y fed to both the generator and the discriminator]
Mirza et al., "Conditional generative adversarial nets," arXiv 2014
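A minimal sketch of CGAN conditioning: concatenate a condition vector y to the inputs of both G and D (network sizes and the one-hot conditions are illustrative assumptions):

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, data_dim = 100, 10, 784
G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim + cond_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

z = torch.randn(64, latent_dim)
y = torch.eye(cond_dim)[torch.randint(0, cond_dim, (64,))]  # one-hot conditions y ~ p_Y
fake = G(torch.cat([z, y], dim=1))        # G(z, y)
score = D(torch.cat([fake, y], dim=1))    # D judges the (sample, condition) pair
```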
Conditional GAN—Samples
[Figure: samples conditioned on the classes dog, cat, and tiger]
pix2pix
Key—Use a pixel-wise loss for supervision
[Diagram: a generator G_X→Y translates an input x into a fake sample G(x); the discriminator D_Y judges it against the real target y, and a pixel-wise loss between G(x) and y provides additional supervision; training requires paired real samples (x, y) ~ p_X,Y]
Isola et al., "Image-to-image translation with conditional adversarial nets," CVPR 2017
pix2pix—Samples
Isola et al., "Image-to-image translation with conditional adversarial nets," CVPR 2017
Cycle-consistent GAN (CycleGAN)
[Diagram: two GANs trained on unpaired real samples x ~ p_X and y ~ p_Y. GAN 1: generator G_X→Y maps x to fake samples G_X→Y(x), which discriminator D_Y judges against real y. GAN 2: generator G_Y→X maps y to fake samples G_Y→X(y), which discriminator D_X judges against real x.]
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017
Cycle-consistent loss
Key—Use cycle-consistency loss for supervision
• pix2pix requires paired (x, y) samples (each x ∈ X is mapped to a certain y ∈ Y) and supervises with a pixel-wise loss
• CycleGAN needs only unpaired samples in the two domains, and instead penalizes the cycle-consistency losses between x and G_Y→X(G_X→Y(x)), and between y and G_X→Y(G_Y→X(y)) (see the sketch below)
Isola et al., "Image-to-image translation with conditional adversarial nets," CVPR 2017
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017
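A sketch of the CycleGAN cycle-consistency loss (L1, following Zhu et al.); the two generators here are toy stand-ins:

```python
import torch
import torch.nn as nn

G_xy = nn.Linear(64, 64)   # stand-in for G: X -> Y
G_yx = nn.Linear(64, 64)   # stand-in for G: Y -> X
l1 = nn.L1Loss()

x, y = torch.randn(8, 64), torch.randn(8, 64)
# x -> y' -> x should return to x; y -> x' -> y should return to y
cycle_loss = l1(G_yx(G_xy(x)), x) + l1(G_xy(G_yx(y)), y)
```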
CycleGAN—Samples
• (pix2pix) We would need a Monet version of each photo → hard to acquire
• (CycleGAN) We only need a collection of Monet's paintings and a collection of photos → easier to acquire
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017
The many other GANs … and more!
• Training criteria: LSGAN, WGAN, EBGAN, BEGAN, GeometricGAN, RaGAN
• Optimization constraints: WGAN, WGANGP, McGAN, MMDGAN, FisherGAN, DRAGAN, SNGAN
• Training strategies: UnrolledGAN, LAPGAN, StackedGAN, StackGAN, PGGAN, StyleGAN, BigGAN
• Applications: DCGAN, InfoGAN, CGAN, ACGAN, pix2pix, CycleGAN, CoGAN, DAN
See more at https://github.com/hindupuravinash/the-gan-zoo
Open questions about GANs
• https://distill.pub/2019/gan-open-problems/
1. What are the trade-offs between GANs and other generative models?
2. What sorts of distributions can GANs model?
3. How can we scale GANs beyond image synthesis?
4. What can we say about the global convergence of the training dynamics?
5. How should we evaluate GANs and when should we use them?
6. How does GAN training scale with batch size?
7. What is the relationship between GANs and adversarial examples?
Comparative studies on GANs
[1] Kunfeng Wang, Chao Gou, Yanjie Duan, Yilun Lin, Xinhu Zheng, and Fei-Yue Wang. Generative
adversarial networks: Introduction and outlook. IEEE/CAA Journal of Automatica Sinica,
4(4):588-598, 2017.
[2] Mario Lučić, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs
created equal? A large-scale study. Proc. Advances in Neural Information Processing Systems 31,
pp. 700–709, 2018.
[3] Karol Kurach, Mario Lučić, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study
on regularization and normalization in GANs. Proc. International Conference on Machine
Learning (PMLR), 97:3581-3590, 2019.
[4] Hao-Wen Dong and Yi-Hsuan Yang, “Towards a deeper understanding of adversarial losses,”
arXiv preprint arXiv:1901.08753, 2019.
Tea Time!
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Section 3
Case studies of GAN-based systems
Section 4
Current limitations &
Future research directions
Or try the next Google Colab notebook
https://colab.research.google.com/drive/1WrFtqo5LW8QfhiuhHmge9QLexWwS2BcM
Generating Music with GANs
Outline
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Section 3
Case studies of GAN-based systems
Section 4
Current limitations
Future research directions
Scope of music generation
1. Symbolic melody generation
• X → melody
2. Arrangement generation
• X → piano roll
• melody → piano roll
3. Style transfer
• piano roll → piano roll′
• audio → audio′
4. Audio generation
• X → audio
• piano roll → audio
Some symbolic datasets to start with

Dataset | Type | Format | Link
Wikifonia | lead sheet | XML | www.wikifonia.org
HookTheory | lead sheet | XML | www.hooktheory.com/theorytab
Lakh MIDI Dataset | multitrack | MIDI | https://colinraffel.com/projects/lmd
Lakh Pianoroll Dataset | multitrack | npz | salu133445.github.io/lakh-pianoroll-dataset
Groove MIDI Dataset | drum | MIDI | magenta.tensorflow.org/datasets/groove
Midi Man | drum | MIDI | www.reddit.com/r/WeAreTheMusicMakers/comments/3anwu8/the_drum_percussion_midi_archive_800k/

See more at https://github.com/wayne391/symbolic-musical-datasets
(Only datasets with miscellaneous genres are presented here)
GANs for Symbolic Melody Generation
X → melody
MidiNet
Key—Design CNN kernel sizes to match our understanding of music
[Diagram: a U-Net-like generator conditioned on the chords and on the previous measure; the temporal resolution is doubled at each layer, and the pitch axis is grown at the last layer (1×128)]
Yang et al., "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," ISMIR 2017
MidiNet—Samples
• 1D + 2D conditions (previous bar and chords)
• 1D condition only (chords only)
More samples can be found at https://richardyang40148.github.io/TheBlog/midinet_arxiv_demo.html
Yang et al., "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," ISMIR 2017
SSMGAN
Key—Use the SSM discriminator to improve global musical structure
[Diagram: an LSTM discriminator D_L judges a window of K measures, while an SSM discriminator judges the self-similarity matrix over all measures]
Jhamtani and Berg-Kirkpatrick, "Modeling self-repetition in music generation using structured adversaries," ML4MD 2019
SSMGAN—Samples (Sample 1, Sample 2)
Jhamtani and Berg-Kirkpatrick, "Modeling self-repetition in music generation using structured adversaries," ML4MD 2019
Other GAN models for melody generation
• C-RNN-GAN [1]: use RNNs for the generator and discriminator
• JazzGAN [2]: given chords, compose melody; compare 3 representations
• Conditional LSTM-GAN [3]: given lyrics, compose melody
• SSMGAN [4]: use GAN to generate a self-similarity matrix (SSM) to
represent self-repetition in music, and then use LSTM to generate
melody given the SSM
[1] Mogren, “C-RNN-GAN: Continuous recurrent neural networks with adversarial training,” CML 2016
[2] Trieu and Keller, “JazzGAN: Improvising with generative adversarial networks,” MUME 2018
[3] Yu and Canales, “Conditional LSTM-GAN for melody generation from lyrics,” arXiv 2019
[4] Jhamtani and Berg-Kirkpatrick, “Modeling self-repetition in music generation using structured adversaries,”
ML4MD 2019
GANs for Symbolic Arrangement Generation
X → piano roll
melody → piano roll
Challenges of arrangement generation
• Temporal evolution
• Dynamics, emotions, tensions, etc.
• Structure (temporal)
• Short-term structure → can somehow be generated with special models
• Long-term structure → super hard
• Instrumentation
• Multiple tracks
• Functions of instruments
These couple with one another in complex ways in real-world music
Why pianorolls?
• Deep learning friendly format → basically matrices
• Easier for GANs to work with
Pros and cons of pianorolls
• Pros
• Can be purely symbolic → quantized by beats (or factors of beats)
• Repetition and structure can be observed easily
• No need to serialize polyphonic compositions
• Cons
• Memory inefficient → mostly zero entries (i.e., sparse matrices)
• Missing the concept of "notes" (see the rasterization sketch below)
• Hard to handle performance-level information unless using a high resolution
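A sketch of what the representation looks like: a piano roll is just a (time steps × 128 pitches) binary matrix; here a few (onset, duration, pitch) notes are rasterized, assuming 4 time steps per beat (the resolution and the toy notes are illustrative):

```python
import numpy as np

notes = [(0, 4, 60), (4, 4, 64), (8, 8, 67)]   # (onset step, length, MIDI pitch)
pianoroll = np.zeros((16, 128), dtype=bool)
for onset, length, pitch in notes:
    pianoroll[onset:onset + length, pitch] = True
print(pianoroll.sum())  # 16 active time-step/pitch cells
```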
Multitrack pianoroll
[Figure: an example five-track pianoroll with separate Drums, Piano, Guitar, Bass, and Strings tracks]
MuseGAN—Generator
Key—Use different types of latent variables to enhance controllability
[Diagram: each track's bar generator G receives a mix of latent vectors z; a temporal generator expands the time-dependent codes across bars. Without coordination, every track is generated from its own independent codes; with coordination, the tracks also share track-independent codes.]

| Time-dependent | Time-independent
Track-dependent | Melody | Groove
Track-independent | Chords | Style

Dong et al., "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," AAAI 2018
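A sketch of how such latent codes could be combined for track i and bar t (the shapes, the concatenation scheme, and the stand-in bar generator are simplified assumptions; see the paper for the exact architecture):

```python
import torch
import torch.nn as nn

n_tracks, n_bars, z_dim = 5, 4, 32
z_style = torch.randn(z_dim)                      # shared across tracks and bars
z_chords = torch.randn(n_bars, z_dim)             # shared across tracks, varies per bar
z_groove = torch.randn(n_tracks, z_dim)           # per track, fixed over bars
z_melody = torch.randn(n_tracks, n_bars, z_dim)   # per track, per bar

bar_generator = nn.Linear(4 * z_dim, 96 * 84)     # stand-in for the CNN bar generator
bars = [[bar_generator(torch.cat([z_style, z_chords[t], z_groove[i], z_melody[i, t]]))
         for t in range(n_bars)] for i in range(n_tracks)]
```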
MuseGAN—Samples (Sample 1, Sample 2)
More samples can be found at https://salu133445.github.io/musegan/results
Dong et al., "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," AAAI 2018
Before the coding session…
LPD (Lakh Pianoroll Dataset)
• 174,154 multi-track piano-rolls
• Derived from Lakh MIDI Dataset*
• Mainly pop songs
• Derived labels available
Pypianoroll (Python package)
• Manipulation & Visualization
• Efficient I/O
• Parse/Write MIDI files
• On PYPI (pip install pypianoroll)
[Lakh MIDI Dataset] https://colinraffel.com/projects/lmd/
[Pypianoroll] https://salu133445.github.io/pypianoroll
[Lakh Pianoroll Dataset] https://salu133445.github.io/lakh-pianoroll-dataset
Dong et al., “Pypianoroll: Open source Python package for handling multitrack pianorolls,” ISMIR-LBD 2018.
We will use them in the next coding session!
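A sketch of basic Pypianoroll usage, assuming the 1.x API and a hypothetical local MIDI file path:

```python
import pypianoroll

multitrack = pypianoroll.read("example.mid")   # parse a MIDI file (hypothetical path)
for track in multitrack.tracks:
    print(track.name, track.pianoroll.shape)   # (time steps, 128 pitches)
multitrack.binarize()                          # velocities -> on/off
pypianoroll.write("roundtrip.mid", multitrack) # write back to MIDI
```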
Coding session II—GAN for pianorolls
Google Colab notebook link
https://colab.research.google.com/drive/1WrFtqo5LW8QfhiuhHmge9QLexWwS2BcM
You can also find the link on the tutorial website
https://salu133445.github.io/ismir2019tutorial/
BinaryMuseGAN
Key—Naïve binarization methods can easily lead to overly-fragmented notes
[Figure: binarizing MuseGAN's real-valued output by hard thresholding or by Bernoulli sampling yields many overly-fragmented notes; BinaryMuseGAN (+DBNs) yields far fewer]
Dong and Yang, "Convolutional generative adversarial networks with binary neurons for polyphonic music generation," ISMIR 2018
BinaryMuseGAN
• Use binary neurons at the output layer of the generator
• Use a straight-through estimator to estimate the gradients for the binary neurons (whose forward pass involves nondifferentiable operations); see the sketch below

Model | Generator's outputs | Real data
MuseGAN | real-valued | binary-valued
BinaryMuseGAN | binary-valued | binary-valued

Dong and Yang, "Convolutional generative adversarial networks with binary neurons for polyphonic music generation," ISMIR 2018
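A sketch of a straight-through binary neuron in PyTorch: the forward pass hard-binarizes, while gradients flow through the sigmoid as if no thresholding had happened (a common straight-through trick; details differ from the paper's exact formulation):

```python
import torch

def binary_neuron(logits):
    probs = torch.sigmoid(logits)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach()   # forward: hard values; backward: d(probs)

x = torch.randn(4, requires_grad=True)
binary_neuron(x).sum().backward()
print(x.grad)   # nonzero gradients despite the hard thresholding
```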
LeadSheetGAN
Key—First generate the lead sheet, then the arrangement
[Diagram: an (unconditional) GAN generates the lead sheet; a conditional GAN then generates the multitrack arrangement from it]
Liu and Yang, "Lead sheet generation and arrangement by conditional generative adversarial network," ICMLA 2018
Liu and Yang, "Lead sheet generation and arrangement via a hybrid generative model," ISMIR-LBD 2018
LeadSheetGAN—Samples
• Latent space interpolation between Maroon 5's "Payphone" and The Beatles' "Hey Jude"
Liu and Yang, "Lead sheet generation and arrangement by conditional generative adversarial network," ICMLA 2018
Liu and Yang, "Lead sheet generation and arrangement via a hybrid generative model," ISMIR-LBD 2018
LeadSheetGAN—Samples
• Arrangement generation given an "Amazing Grace" lead sheet
More samples can be found at https://liuhaumin.github.io/LeadsheetArrangement/results
Liu and Yang, "Lead sheet generation and arrangement by conditional generative adversarial network," ICMLA 2018
Tea Time!
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Section 3
Case studies of GAN-based systems
Section 4
Current limitations &
Future research directions
Generating Music with GANs
Scope of music generation
1. Symbolic melody generation
• X → melody
2. Arrangement generation
• X → piano roll
• melody → piano roll
3. Style transfer
• piano roll → piano roll′
• audio → audio′
4. Audio generation
• X → audio
• piano roll → audio
CycleGANs for Music Style Transfer
piano roll → piano roll′
audio → audio′
Music style transfer
• Alter the “style,” but keep the “content” fixed
• Three types of music style transfer [1]
1. composition style transfer for score
2. performance style transfer for performance control
3. timbre style transfer for sound
• Little of the existing work on performance style transfer (e.g., [2]) uses deep learning
[1] Dai et al., “Music style transfer: A position paper,” MUME 2018
[2] Shih et al., “Analysis and synthesis of the violin playing styles of Heifetz and Oistrakh,” DAFx 2017
Composition style transfer (musical genre)
• Example: https://www.youtube.com/watch?v=buXqNqBFd6E
• Re-orchestrations of Beethoven's Ode to Joy by a collaboration between human and AI (Sony CSL Flow Machines)
Pachet, "A Joyful Ode to automatic orchestration," ACM TIST 2016
Composition style transfer (musical genre)
• Transfer among Classic, Jazz, and Pop [1,2]
• Model: standard convolutional CycleGAN
• I/O representation: single-track piano roll (64 × 84); see the sketch after the references
• merge all notes of all tracks (except for drums) into a single track
• discard drums
• 7 octaves (C1-C8; hence 84 notes)
• 4/4 time signature, 16 time steps per bar, 4 bars as a unit (hence 64 steps)
• 10k+ four-bar phrases for each genre (no paired data)
[1] Brunner et al., “Symbolic music genre transfer with CycleGAN,” ICTAI 2018
[2] Brunner et al., “Neural symbolic music genre transfer insights,” MML 2019
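A sketch of building the 64 × 84 phrase representation described above: merge all non-drum tracks into one boolean roll, keep the 7 octaves C1-B7, and slice 4-bar phrases of 16 steps each (toy random input; the exact preprocessing is in [1]):

```python
import numpy as np

tracks = [np.random.rand(64, 128) > 0.95 for _ in range(4)]  # toy non-drum tracks
merged = np.logical_or.reduce(tracks)       # merge all tracks into a single roll
phrase = merged[:, 24:108]                  # MIDI 24 (C1) up to 107 -> 84 pitches
assert phrase.shape == (64, 84)             # 4 bars x 16 steps, 84 pitches
```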
Recap—CycleGAN
• Adversarial loss + identity-mapping ("cycle consistency") loss
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017
Composition style transfer (musical genre)
• https://www.youtube.com/channel/UCs-bI_NP7PrQaMV1AJ4A3HQ
• Possible extensions:
• Consider different voices (including drums) separately (instead of
merging them)
• Add recurrent layers to better model sequential information
• Identify the melody line [1] and take better care of it
• Related:
• Supervised genre style transfer (not using GANs) using synthesized data [2]
[1] Simonetta et al., “A Convolutional approach to melody line identification in symbolic scores,” ISMIR 2019
[2] Cífka et al., “Supervised symbolic music style translation using synthetic data,” ISMIR 2019
Timbre style transfer (instrumental sounds): TimbreTron
• Transfer among piano, flute, violin, and harpsichord solos
https://www.cs.toronto.edu/~huang/TimbreTron/samples_page.html
• Model: modified version of CycleGAN
• I/O representation: 4-second CQT (257 × 251)
Huang et al., "TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer," ICLR 2019
Drawback of CycleGAN
• Can work for only two domains at a time
• For example, piano↔flute, piano↔violin, piano↔harpsichord, etc.
MUNIT instead of CycleGAN
• MUNIT [1]: an advanced version of CycleGAN that incorporates encoders/decoders to get disentangled "content" and "style" codes
• Can work for multiple domains at the same time
[1] Huang et al., "Multimodal unsupervised image-to-image translation," ECCV 2018
Timbre style transfer (instrumental sounds): Play-as-you-like
• Transfer among piano solos, guitar solos, and string quartets (https://tinyurl.com/y23tvhjx)
• Uses MUNIT
• I/O representation: mel-spectrogram + spectral difference + MFCC + spectral envelope
• Cycle consistency among the channel-wise features (as regularizers)
• Style interpolation (e.g., piano ↔ guitar): https://soundcloud.com/affige/sets/ismir2019-gan-tutorial-supp-material
Lu et al., "Play as You Like: Timbre-enhanced multi-modal music style transfer," AAAI 2019
Timbre style transfer (singing): CycleBEGAN
Wu et al., "Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks," ICML workshop 2018
Timbre style transfer (singing): CycleBEGAN
• Transfer between female and male singing voices [1]
http://mirlab.org/users/haley.wu/cybegan/ (check the outside test results)
• Model: BEGAN [2] instead of GAN
• train the generator G such that the discriminator D (an encoder/decoder net) would reconstruct fake data as nicely as real data
• have a mechanism to balance the power of G and D
• plus skip connections (also used in [3]) and a recurrent layer
[1] Wu et al., "Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks," ICML workshop 2018
[2] Berthelot et al., "BEGAN: Boundary equilibrium generative adversarial networks," arXiv 2017
[3] Hung et al., "Musical composition style transfer via disentangled timbre representations," IJCAI 2019
Timbre style transfer (singing): CycleBEGAN
• Skip connections contribute to sharpness, lyrics intelligibility, and naturalness
• Recurrent layers further improve everything, especially pitch accuracy
Wu et al., "Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks," ICML workshop 2018
Timbre style transfer (singing): CycleBEGAN
• Also uses an encoder/decoder architecture for the generator
[Diagram: a generator with skip connections between the encoder and the decoder, and an RNN at the bottleneck]
Wu et al., "Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks," ICML workshop 2018
GANs for Music Audio Generation
X → audio
piano roll → audio
Generating instrument sounds using GANs
• Generate spectrograms
• SpecGAN [1], TiFGAN [2], GANSynth [3]
• Generate waveforms
• WaveGAN [1]
• There are also approaches that do not use GANs [4, 5, 6, 7]
[1] Donahue et al., “Adversarial audio synthesis,” ICLR 2019
[2] Marafioti et al., “Adversarial generation of time-frequency features,” ICML 2019
[3] Engel et al., “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
[4] Oord et al., “WaveNet: A generative model for raw audio,” SSW 2016
[5] Défossez et al., “SING: Symbol-to-instrument neural generator,” NeurIPS 2018
[6] Schimbinschi et al., “SynthNet: Learning to synthesize music end-to-end,” IJCAI 2019
[7] "PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network," AAAI 2019
SpecGAN and WaveGAN
• Generating 1-second audio at 16 kHz (16,000 points)
https://chrisdonahue.com/wavegan/
• Model: based on DCGAN
• Flatten 2D convolutions into 1D (e.g., a 5×5 2D convolution becomes a length-25 1D one); see the sketch below
• Increase the stride factor for all convolutions (e.g., stride 2×2 becomes stride 4)
• DCGAN outputs 64×64 images; add one more layer so that the output has 16,384 points
Donahue et al., "Adversarial audio synthesis," ICLR 2019
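A sketch of the flattening idea: a 5×5, stride-2×2 2D transposed convolution becomes a length-25, stride-4 1D one, so each layer upsamples the waveform 4× (the channel sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

up2d = nn.ConvTranspose2d(64, 32, kernel_size=5, stride=2, padding=2, output_padding=1)
up1d = nn.ConvTranspose1d(64, 32, kernel_size=25, stride=4, padding=11, output_padding=1)

print(up2d(torch.randn(1, 64, 32, 32)).shape)  # (1, 32, 64, 64): 2x per spatial axis
print(up1d(torch.randn(1, 64, 1024)).shape)    # (1, 32, 4096): 4x in time
```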
GANSynth
• Generating 4-second audio at 16 kHz [1]: large output dimension
https://magenta.tensorflow.org/gansynth
• Model: based on PGGAN [2]
• Progressively grow the GAN, from low to high resolution
• During a resolution transition, interpolate between the outputs of the two resolutions, with the weight α linearly increasing from 0 to 1
[1] Engel et al., "GANSynth: Adversarial neural audio synthesis," ICLR 2019
[2] Karras et al., "Progressive growing of GANs for improved quality, stability, and variation," ICLR 2018
PGGAN: progressive growing of GANs
Karras et al., "Progressive growing of GANs for improved quality, stability, and variation," ICLR 2018
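A sketch of the PGGAN resolution transition: blend the upsampled low-resolution output with the new high-resolution branch while α fades from 0 to 1 (the branches here are toy stand-ins, not the paper's exact layers):

```python
import torch
import torch.nn as nn

to_rgb_low = nn.Conv2d(64, 1, 1)    # output head at the old resolution
new_block = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(64, 64, 3, padding=1))
to_rgb_high = nn.Conv2d(64, 1, 1)   # output head at the new resolution

def transition(h, alpha):
    low = nn.functional.interpolate(to_rgb_low(h), scale_factor=2)
    high = to_rgb_high(new_block(h))
    return (1 - alpha) * low + alpha * high   # alpha goes 0 -> 1 during training

out = transition(torch.randn(2, 64, 16, 16), alpha=0.5)  # (2, 1, 32, 32)
```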
GANSynth
• Generating 4-second audio at 16 kHz
https://magenta.tensorflow.org/gansynth
• Model: based on PGGAN (conv2d)
• Output: mel-spectrogram + instantaneous frequency (IF)
• Use the IF to derive the phase, then apply the inverse STFT to get the waveform
• STFT window size 2048, stride 512 (so, about 128 frames)
• 1024-bin mel-frequency scale
• Target output tensor size: 128 × 1024 × 2, grown progressively (2×16 → 4×32 → … → 128×1024)
Engel et al., "GANSynth: Adversarial neural audio synthesis," ICLR 2019
GANSynth: Why instantaneous frequency?
[Figure: for a sustained note, the phase advances by a nearly constant amount per frame, so the instantaneous frequency is close to constant over time and much easier to model than the wrapped phase]
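A sketch of the IF-to-waveform step: integrate the instantaneous frequency over frames to recover a phase, combine it with the magnitude into a complex spectrogram, and invert by overlap-add (a simplified linear-frequency version with toy inputs; GANSynth itself works on a mel scale):

```python
import numpy as np

n_frames, n_bins, hop, n_fft = 128, 1025, 512, 2048
log_mag = np.random.randn(n_frames, n_bins) * 0.1 - 4     # toy network output
inst_freq = np.random.uniform(-1, 1, (n_frames, n_bins))  # toy IF in [-1, 1]

phase = np.cumsum(inst_freq * np.pi, axis=0)              # integrate IF over frames
spec = np.exp(log_mag) * np.exp(1j * phase)               # complex spectrogram

# Inverse STFT via overlap-add of windowed IFFT frames (normalization omitted)
window = np.hanning(n_fft)
audio = np.zeros(n_frames * hop + n_fft)
for t in range(n_frames):
    audio[t * hop:t * hop + n_fft] += window * np.fft.irfft(spec[t], n=n_fft)
```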
GANSynth: Pitch-conditioned generation
• Input to the generator
• 256-dim random vector z
• 61-dim one-hot vector (MIDI 24-84) for pitch conditioning
• Auxiliary pitch classification loss for the discriminator (ACGAN [1]); see the sketch below
• In addition to the real/fake loss
• Try to predict the pitch label
[1] Odena et al., "Conditional image synthesis with auxiliary classifier GANs," ICML 2017
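A sketch of an ACGAN-style discriminator head: one branch scores real/fake, another predicts the pitch class (the 61 pitches follow the slide; the backbone and feature sizes are toy assumptions):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(1024, 256), nn.LeakyReLU(0.2))  # toy feature extractor
real_fake_head = nn.Linear(256, 1)
pitch_head = nn.Linear(256, 61)          # 61 pitch classes, MIDI 24-84

feats = backbone(torch.randn(8, 1024))
adv_loss = nn.functional.binary_cross_entropy_with_logits(
    real_fake_head(feats), torch.ones(8, 1))
pitch_loss = nn.functional.cross_entropy(
    pitch_head(feats), torch.randint(0, 61, (8,)))
loss = adv_loss + pitch_loss             # auxiliary classification added to the GAN loss
```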
GANSynth: Possible future extensions
• Modify the network to generate variable-length audio
• From note-level synthesis (i.e., conditioned on a one-hot pitch vector) to phrase-level synthesis (i.e., conditioned on a piano roll) [1,2]
• The IF might be noisy during note transitions
• May need to deal with the use of playing techniques [3]
[1] Wang and Yang, “PerformanceNet: Score-to-audio music generation with multi-band convolutional residual
network,” AAAI 2019
[2] Chen et al., “Demonstration of PerformanceNet: A convolutional neural network model for score-to-audio
music generation,” IJCAI demo paper, 2019
[3] Su et al., “TENT: Technique-embedded note tracking for real-world guitar solo recordings,” TISMIR 2019
PerformanceNet: Possible future extensions
• Building an "AI Performer" [1,2,3]
[1] Wang and Yang, “PerformanceNet: Score-to-audio music generation with multi-band convolutional residual
network,” AAAI 2019
[2] Chen et al., “Demonstration of PerformanceNet: A convolutional neural network model for score-to-audio
music generation,” IJCAI demo paper, 2019
[3] Oore et al., “This time with feeling: Learning expressive musical performance,” arXiv 2018
Outline
Section 1
Overview of music generation research
Section 2
Introduction to GANs
Section 3
Case studies of GAN-based systems
Section 4
Current limitations
Future research directions
Limitations of GANs
• A bit difficult to train (it's helpful to know some tips and tricks [1])
• Only learn z→X mapping, not the inverse (X→z)
• Less explainable [2]
• Less clear how GANs can model text-like data or musical scores
• Unclear how GANs (and all other music composition models in
general) can generate “new” music genres
[1] “How to Train a GAN? Tips and tricks to make GANs work,” https://github.com/soumith/ganhacks
[2] Kelz and Widmer, “Towards interpretable polyphonic transcription with invertible neural
networks,” ISMIR 2019
The many other generative models
• Variational autoencoders (VAEs)
• Flow-based models
• Autoregressive models
• Attention mechanisms (transformers)
• Restricted Boltzmann machines (RBMs)
• Hidden Markov models (HMMs)
• …and more!
Future research directions
• Better network architectures for musical data, including piano rolls, MIDI events,
musical scores, and audio
• Learning to generate music with better structure and diversity
• Better interpretability and human control (e.g., [1])
• Standardized test dataset and evaluation metrics [2]
• Cross-modal generation, e.g., “music + lyrics” or “music + video”
• Interactive music generation (e.g., [3])
[1] Lattner and Grachten, “High-level control of drum track generation using learned patterns of rhythmic
interaction,” WASPAA 2019
[2] Katayose et al., “On evaluating systems for generating expressive music performance: the Rencon
experience,” Journal of New Music Research 2012
[3] Hsiao et al., “Jamming with Yating: Interactive demonstration of a music composition AI,” ISMIR-LBD 2019
Future research directions
• Composition style transfer (musical genre) using the LPD dataset
• More work on performance style transfer
• Phrase-level audio generation instead of note-level synthesis [1]
• Multi-singer audio generation and style transfer
• Lyrics-free singing generation [2]
• EDM generation [3,4]
[1] “Demonstration of PerformanceNet: A convolutional neural network model for score-to-audio music
generation,” IJCAI demo paper, 2019
[2] “Score and lyrics-free singing voice generation,” ICLR 2020 submission
[3] “DJnet: A dream for making an automatic DJ,” ISMIR-LBD 2017
[4] “Unmixer: An interface for extracting and remixing loops,” ISMIR 2019
Future research directions
• The “MIR4generation” pipeline
• Learning to generate music (that is expressive) from machine-transcribed
data (i.e., learning to compose and perform at the same time)
Yeh et al., "Learning to generate Jazz and Pop piano music from audio via MIR techniques," ISMIR-LBD 2019
Thank you!
Any questions?
Contact
Hao-Wen Dong (salu.hwdong@gmail.com)
Yi-Hsuan Yang (yang@citi.sinica.edu.tw)
Tutorial website
salu133445.github.io/ismir2019tutorial/
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

ISMIR 2019 tutorial: Generating music with generative adversarial networks (GANs)

• 14. Why GAN?
  • State-of-the-art model in:
    • Image generation: BigGAN [1]
    • Text-to-speech audio synthesis: GAN-TTS [2]
    • Note-level instrument audio synthesis: GANSynth [3]
  • Also see the ICASSP 2018 tutorial: “GAN and its applications to signal processing and NLP” [4]
  • Its potential for music generation has not been fully realized
  • Adversarial training has many other applications
    • For example, source separation [5], domain adaptation [6], music transcription [7]
  [1] “Large scale GAN training for high fidelity natural image synthesis,” ICLR 2019
  [2] “High fidelity speech synthesis with adversarial networks,” ICLR 2020 submission
  [3] “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
  [4] https://tinyurl.com/y23ywv4s (on SlideShare)
  [5] “SVSGAN: Singing voice separation via generative adversarial network,” ICASSP 2018
  [6] “Cross-cultural music emotion recognition by adversarial discriminative domain adaptation,” ICMLA 2018
  [7] “Adversarial learning for improved onsets and frames music transcription,” ISMIR 2019
• 15. Input/output representations
  • Symbolic output
    • Piano rolls
    • MIDI events
    • Score
  • Audio output
    • Spectrogram
    • Waveform
• 16. I/O representations
  • Symbolic output
    • Piano rolls (image-like): easier for GANs to work with
    • MIDI events (text-like)
    • Score (hybrid)
  • Audio output
    • Spectrogram (image-like)
    • Waveform
• 17. I/O representations
  • Symbolic output
    • Piano rolls (image-like): MidiNet [1], MuseGAN [2]
    • MIDI events (text-like): Music Transformer [3], MuseNet [4]
    • Scores (hybrid): Thickstun [5], measure-by-measure [6]
  [1] https://arxiv.org/abs/1703.10847, ISMIR 2017
  [2] https://arxiv.org/abs/1709.06298, AAAI 2018
  [3] https://openreview.net/pdf?id=rJe4ShAcF7, ICLR 2019
  [4] https://openai.com/blog/musenet/
  [5] https://arxiv.org/abs/1811.08045, ISMIR 2019
  [6] https://openreview.net/forum?id=Hklk6xrYPB, ICLR 2020 submission
• 18. Scope of music generation
  • Generation from scratch
    • X → melody
    • X → piano roll
    • X → audio
  • Conditional generation
    • melody → piano roll (accompaniment)
    • piano roll → audio (synthesis)
    • piano roll → piano roll’ (rearrangement)
    • audio → audio’
  • See Section 3: Case studies
  • and, https://github.com/affige/genmusic_demo_list
• 19. We will talk about
  1. Symbolic melody generation: MidiNet [1], SSMGAN [2]
  2. Arrangement generation: MuseGAN [3], BinaryMuseGAN [4], LeadSheetGAN [5]
  3. Style transfer: CycleGAN [6], TimbreTron [7], Play-as-you-like [8], CycleBEGAN [9]
  4. Audio generation: WaveGAN [10], GANSynth [11]
  [1] “MidiNet: A convolutional GAN for symbolic-domain music generation,” ISMIR 2017
  [2] “Modeling self-repetition in music generation using structured adversaries,” ML4MD 2019
  [3] “MuseGAN: Multi-track sequential GANs for symbolic music generation and accompaniment,” AAAI 2018
  [4] “Convolutional GANs with binary neurons for polyphonic music generation,” ISMIR 2018
  [5] “Lead sheet generation and arrangement by conditional GAN,” ISMIR-LBD 2018
  [6] “Symbolic music genre transfer with CycleGAN,” ICTAI 2018
  [7] “TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer,” ICLR 2019
  [8] “Play as You Like: Timbre-enhanced multi-modal music style transfer,” AAAI 2019
  [9] “Singing style transfer using cycle-consistent boundary equilibrium GANs,” ICML workshop 2018
  [10] “Adversarial audio synthesis,” ICLR 2019
  [11] “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
• 20. Outline
  Section 1: Overview of music generation research
  Section 2: Introduction to GANs
  Section 3: Case studies of GAN-based systems
  Section 4: Current limitations; future research directions
• 21. What is a GAN?
  Generative Adversarial Network
  • Generative: a generative model
  • Adversarial: an adversarial game between two competitors
  • Network: a deep neural network
• 22. A loss function for training generative models
  [Diagram: random noise z ~ pZ → generator G (the latent variable model to be learned) → fake samples G(z); real samples x ~ pX; both feed the discriminator D → 1/0.]
  Goodfellow et al., “Generative adversarial networks,” NeurIPS 2014
• 23. A loss function for training generative models
  [Same diagram, annotated with the two objectives.]
  • Discriminator: tell G(z) as fake data from x being real ones
  • Generator: make G(z) indistinguishable from real data for D
  min_G max_D  E_{x~pX}[log D(x)] + E_{z~pZ}[log(1 − D(G(z)))]
  Goodfellow et al., “Generative adversarial networks,” NeurIPS 2014
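  To make the objective concrete, here is a minimal sketch of one training step with the standard GAN losses in PyTorch; the networks G and D, their optimizers, and the 128-dim latent are illustrative assumptions, not code from the tutorial notebooks.

    import torch
    import torch.nn.functional as F

    def gan_step(G, D, opt_g, opt_d, real, latent_dim=128):
        # Assumes D returns one logit per sample, shape (batch, 1).
        n = real.size(0)
        ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

        # Discriminator: tell G(z) as fake data from x being real ones.
        fake = G(torch.randn(n, latent_dim)).detach()
        d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
                  + F.binary_cross_entropy_with_logits(D(fake), zeros))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: make G(z) indistinguishable from real data for D
        # (the common non-saturating variant of min log(1 - D(G(z)))).
        fake = G(torch.randn(n, latent_dim))
        g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()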
• 24. Problems of unregularized GANs
  • Key—the discriminator provides the generator with gradients as guidance for improvement
  • Discrimination is easier than generation
    • The discriminator tends to provide large gradients
    • This results in unstable training of the generator
  • Common failure cases: mode collapse; missing modes
  (Colors show the outputs of the discriminator; sharp color changes → large gradients)
• 25. Regularizing GANs
  [Panels: Unregularized / Locally regularized / Globally regularized]
  • Regularization approaches: gradient clipping [1], gradient penalties [2,3], spectral normalization [4]
  • Advantages of gradient regularization
    • provides a smoother guidance to the generator
    • alleviates the mode collapse and missing modes issues
  [1] Arjovsky et al., “Wasserstein generative adversarial networks,” ICML 2017
  [2] Gulrajani et al., “Improved training of Wasserstein GANs,” NeurIPS 2017
  [3] Kodali et al., “On convergence and stability of GANs,” arXiv 2017
  [4] Miyato et al., “Spectral normalization for generative adversarial networks,” ICLR 2018
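  As a concrete example of one of these regularizers, below is a sketch of the WGAN-GP gradient penalty [2]; the penalty weight of 10 and the scalar-score critic follow the paper, but the code itself is an illustration.

    import torch

    def gradient_penalty(D, real, fake, weight=10.0):
        # Sample points on straight lines between real and fake samples.
        alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
        interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
        scores = D(interp)
        # Gradient of the critic's score w.r.t. the interpolated inputs.
        grads = torch.autograd.grad(scores, interp,
                                    grad_outputs=torch.ones_like(scores),
                                    create_graph=True)[0]
        grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
        # Penalize deviations of the gradient norm from 1.
        return weight * ((grad_norm - 1) ** 2).mean()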
• 26. Coding session I—GAN for images
  Google Colab notebook link: https://colab.research.google.com/drive/1Cnq9z3QvxIsVntlXKjPjbwttxeDH47Xl
  You can also find the link on the tutorial website: https://salu133445.github.io/ismir2019tutorial/
• 27. Deep convolutional GAN (DCGAN)
  Key—Use CNNs for both G and D
  [Diagram: the same GAN setup, with the generator built from transposed convolutional layers.]
  Radford et al., “Unsupervised representation learning with deep convolutional GANs,” ICLR 2016
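  A minimal sketch of a DCGAN-style generator: a stack of transposed convolutions that upsamples a latent vector to a 64x64 RGB image. The channel widths and the 128-dim latent are illustrative choices, not the exact architecture used in the notebook.

    import torch.nn as nn

    class Generator(nn.Module):
        def __init__(self, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),  # 1x1 -> 4x4
                nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),         # -> 8x8
                nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),         # -> 16x16
                nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),           # -> 32x32
                nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),                                 # -> 64x64 RGB
            )

        def forward(self, z):
            # Reshape the latent vector to a 1x1 "image" before upsampling.
            return self.net(z.view(z.size(0), -1, 1, 1))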
• 28. GANs vs VAEs
               GAN                                               VAE
  Objective    (generator) fool the discriminator;               reconstruct real data using a pixel-wise loss
               (discriminator) tell real data from fake ones
  Results      tend to be sharper                                tend to be more blurred
  Diversity    higher                                            lower
  Stability    lower                                             higher
  Larsen et al., “Autoencoding beyond pixels using a learned similarity metric,” ICML 2016
• 29. State of the art—BigGAN
  [Colab notebook demo] https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb
  (hard classes)
  Brock et al., “Large scale GAN training for high fidelity natural image synthesis,” ICLR 2019
• 30. Interpolation on the latent space
  [Figure: interpolating in the latent space maps to a smooth interpolation in the data space.]
  Brock et al., “Large scale GAN training for high fidelity natural image synthesis,” ICLR 2019
• 31–33. Conditional GAN (CGAN)
  [Diagram, built up over three slides: random noise z ~ pZ and a condition y → generator G → fake samples G(z, y); real pairs (x, y) ~ pX,Y, with conditions also drawn as y ~ pY; both branches feed the discriminator D → 1/0.]
  Key—Feed conditions to both G and D
  • The discriminator now examines whether a pair (x, y) or (G(z), y) is real or not
  • The generator now generates samples based on some conditions
  Mirza et al., “Conditional generative adversarial nets,” arXiv 2014
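  The conditioning can be as simple as concatenation; a sketch, assuming vector-shaped inputs and a one-hot condition y:

    import torch

    def cgan_inputs(z, x, y):
        # Generator sees (z, y); discriminator sees the pair (x, y).
        g_in = torch.cat([z, y], dim=1)
        d_in = torch.cat([x.flatten(start_dim=1), y], dim=1)
        return g_in, d_in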
• 35. pix2pix
  Key—Use a pixel-wise loss for supervision
  [Diagram: generator G_X→Y maps x to fake samples G(x); real pairs (x, y) ~ pX,Y feed the discriminator D_Y → 1/0; a pixel-wise loss compares G(x) with y.]
  Isola et al., “Image-to-image translation with conditional adversarial nets,” CVPR 2017
• 36. pix2pix—Samples
  Isola et al., “Image-to-image translation with conditional adversarial nets,” CVPR 2017
• 37–39. Cycle-consistent GAN (CycleGAN)
  [Diagram, built up over three slides: GAN 1 maps real samples x ~ pX to fake samples G_X→Y(x), judged by discriminator D_Y against real samples y ~ pY; GAN 2 maps y back with G_Y→X(y), judged by D_X.]
  Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” ICCV 2017
• 40. Cycle-consistency loss
  Key—Use the cycle-consistency loss for supervision
  • pix2pix: requires paired (x, y) samples for its pixel-wise loss (each x ∈ X is mapped to a certain y ∈ Y)
  • CycleGAN: needs only unpaired samples in the two domains, supervised by the two cycle-consistency losses
  Isola et al., “Image-to-image translation with conditional adversarial nets,” CVPR 2017
  Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” ICCV 2017
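  A sketch of the cycle-consistency term; the L1 reconstruction loss and the weight of 10 are common choices and assumptions here, not prescribed by the slides.

    import torch.nn.functional as F

    def cycle_loss(G_xy, G_yx, x, y, weight=10.0):
        loss_x = F.l1_loss(G_yx(G_xy(x)), x)  # x -> Y -> X should recover x
        loss_y = F.l1_loss(G_xy(G_yx(y)), y)  # y -> X -> Y should recover y
        return weight * (loss_x + loss_y)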
• 41. CycleGAN—Samples
  • (pix2pix) We would need a Monet’s version of each photo → hard to acquire
  • (CycleGAN) We only need a collection of Monet’s paintings and a collection of photos → easier to acquire
  Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” ICCV 2017
• 42. The many other GANs
  • Training criteria: LSGAN, WGAN, EBGAN, BEGAN, GeometricGAN, RaGAN
  • Optimization constraints: WGAN, WGAN-GP, McGAN, MMDGAN, FisherGAN, DRAGAN, SNGAN
  • Training strategies: UnrolledGAN, LAPGAN, StackedGAN, StackGAN, PGGAN, StyleGAN, BigGAN
  • Applications: DCGAN, InfoGAN, CGAN, ACGAN, pix2pix, CycleGAN, CoGAN, DAN
  … and more! See https://github.com/hindupuravinash/the-gan-zoo
• 43. Open questions about GANs
  https://distill.pub/2019/gan-open-problems/
  1. What are the trade-offs between GANs and other generative models?
  2. What sorts of distributions can GANs model?
  3. How can we scale GANs beyond image synthesis?
  4. What can we say about the global convergence of the training dynamics?
  5. How should we evaluate GANs, and when should we use them?
  6. How does GAN training scale with batch size?
  7. What is the relationship between GANs and adversarial examples?
• 44. Comparative studies on GANs
  [1] Kunfeng Wang, Chao Gou, Yanjie Duan, Yilun Lin, Xinhu Zheng, and Fei-Yue Wang, “Generative adversarial networks: Introduction and outlook,” IEEE/CAA Journal of Automatica Sinica, 4(4):588–598, 2017.
  [2] Mario Lučić, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet, “Are GANs created equal? A large-scale study,” NeurIPS 2018, pp. 700–709.
  [3] Karol Kurach, Mario Lučić, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly, “A large-scale study on regularization and normalization in GANs,” ICML 2019, PMLR 97:3581–3590.
  [4] Hao-Wen Dong and Yi-Hsuan Yang, “Towards a deeper understanding of adversarial losses,” arXiv preprint arXiv:1901.08753, 2019.
• 45. Tea Time!
  (Outline recap: Section 1 Overview of music generation research; Section 2 Introduction to GANs; Section 3 Case studies of GAN-based systems; Section 4 Current limitations & future research directions)
  Or try the next Google Colab notebook: https://colab.research.google.com/drive/1WrFtqo5LW8QfhiuhHmge9QLexWwS2BcM
• 46. Outline
  Section 1: Overview of music generation research
  Section 2: Introduction to GANs
  Section 3: Case studies of GAN-based systems
  Section 4: Current limitations; future research directions
• 47. Scope of music generation
  1. Symbolic melody generation: X → melody
  2. Arrangement generation: X → piano roll; melody → piano roll
  3. Style transfer: piano roll → piano roll’; audio → audio’
  4. Audio generation: X → audio; piano roll → audio
• 48. Some symbolic datasets to start with
  Dataset                  Type        Format  Link
  Wikifonia                lead sheet  XML     www.wikifonia.org
  HookTheory               lead sheet  XML     www.hooktheory.com/theorytab
  Lakh MIDI Dataset        multitrack  MIDI    https://colinraffel.com/projects/lmd
  Lakh Pianoroll Dataset   multitrack  npz     salu133445.github.io/lakh-pianoroll-dataset
  Groove MIDI Dataset      drum        MIDI    magenta.tensorflow.org/datasets/groove
  Midi Man                 drum        MIDI    www.reddit.com/r/WeAreTheMusicMakers/comments/3anwu8/the_drum_percussion_midi_archive_800k/
  (Only datasets with miscellaneous genres are presented here)
  See more at https://github.com/wayne391/symbolic-musical-datasets
• 49. GANs for Symbolic Melody Generation (X → melody)
• 50–54. MidiNet
  [Architecture diagram, built up over five slides: a U-Net-like conditional GAN whose conditions are the chords (1D) and the previous measure (2D).]
  Key—Design the CNN kernel sizes to match our understanding of music
  • double the temporal resolution at each layer
  • grow the pitch axis at last (1 → 128)
  Yang et al., “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation,” ISMIR 2017
• 55. MidiNet—Samples
  • 1D condition only (chords only)
  • 1D + 2D conditions (previous bar and chords)
  More samples can be found at https://richardyang40148.github.io/TheBlog/midinet_arxiv_demo.html
• 56–58. SSMGAN
  [Architecture diagram, built up over three slides: an LSTM discriminator D_L over a window of K measures, plus an SSM discriminator over all measures.]
  Key—Use the SSM discriminator to improve global musical structure
  Jhamtani and Berg-Kirkpatrick, “Modeling self-repetition in music generation using structured adversaries,” ML4MD 2019
• 59. SSMGAN—Samples
  Sample 1; Sample 2
  Jhamtani and Berg-Kirkpatrick, “Modeling self-repetition in music generation using structured adversaries,” ML4MD 2019
• 60. Other GAN models for melody generation
  • C-RNN-GAN [1]: use RNNs for the generator and discriminator
  • JazzGAN [2]: given chords, compose melody; compares three representations
  • Conditional LSTM-GAN [3]: given lyrics, compose melody
  • SSMGAN [4]: use a GAN to generate a self-similarity matrix (SSM) representing self-repetition in music, then use an LSTM to generate melody given the SSM
  [1] Mogren, “C-RNN-GAN: Continuous recurrent neural networks with adversarial training,” CML 2016
  [2] Trieu and Keller, “JazzGAN: Improvising with generative adversarial networks,” MUME 2018
  [3] Yu and Canales, “Conditional LSTM-GAN for melody generation from lyrics,” arXiv 2019
  [4] Jhamtani and Berg-Kirkpatrick, “Modeling self-repetition in music generation using structured adversaries,” ML4MD 2019
• 61. GANs for Symbolic Arrangement Generation (X → piano roll; melody → piano roll)
• 62. Challenges of arrangement generation
  • Temporal evolution: dynamics, emotions, tensions, etc.
  • Structure (temporal)
    • Short-term structure → can somehow be generated with special models
    • Long-term structure → super hard
  • Instrumentation: multiple tracks; functions of instruments
  These couple with one another in a complex way in real-world music
• 63. Why pianorolls?
  • Deep learning friendly format → basically matrices
  • Easier for GANs to work with
• 64. Pros and cons of pianorolls
  • Pros
    • Can be purely symbolic → quantized by beats (or factors of beats)
    • Repetition and structure can be observed easily
    • No need to serialize polyphonic compositions
  • Cons
    • Memory inefficient → mostly zero entries (i.e., sparse matrices); see the toy example below
    • Missing the concept of “notes”
    • Hard to handle performance-level information unless using a high resolution
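  A toy example of the representation (the shapes and pitches here are illustrative):

    import numpy as np

    # A piano roll as a (time steps x 128 pitches) binary matrix,
    # here with 16 time steps per bar.
    roll = np.zeros((16, 128), dtype=bool)
    roll[0:4, [60, 64, 67]] = True  # a C major triad (C4, E4, G4) held 4 steps
    print(roll.sum(), "active cells out of", roll.size)  # mostly zeros -> sparse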
• 66–69. MuseGAN—Generator
  [Architecture diagram, built up over four slides: a temporal generator feeds a per-bar generator; each track receives its own latent vectors, combining track-dependent and track-independent parts, with and without coordination across tracks.]
  Key—Use different types of latent variables to enhance controllability
                      Time-dependent   Time-independent
  Track-dependent     melody           groove
  Track-independent   chords           style
  Dong et al., “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” AAAI 2018
• 70. MuseGAN—Samples
  Sample 1; Sample 2
  More samples can be found at https://salu133445.github.io/musegan/results
• 71. Before the coding session…
  LPD (Lakh Pianoroll Dataset)
  • 174,154 multitrack piano rolls
  • Derived from the Lakh MIDI Dataset*
  • Mainly pop songs
  • Derived labels available
  Pypianoroll (Python package)
  • Manipulation & visualization
  • Efficient I/O
  • Parse/write MIDI files
  • On PyPI (pip install pypianoroll)
  We will use them in the next coding session!
  [Lakh MIDI Dataset] https://colinraffel.com/projects/lmd/
  [Lakh Pianoroll Dataset] https://salu133445.github.io/lakh-pianoroll-dataset
  [Pypianoroll] https://salu133445.github.io/pypianoroll
  Dong et al., “Pypianoroll: Open source Python package for handling multitrack pianorolls,” ISMIR-LBD 2018
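  A sketch of typical Pypianoroll usage; the function names follow recent releases of the package and may differ in older versions, and "song.mid" is a placeholder path.

    import pypianoroll

    multitrack = pypianoroll.read("song.mid")  # parse a MIDI file into piano rolls
    multitrack.binarize()                      # velocities -> on/off
    roll = multitrack.tracks[0].pianoroll      # (time steps x 128) array
    multitrack.plot()                          # visualize all tracks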
• 72. Coding session II—GAN for pianorolls
  Google Colab notebook link: https://colab.research.google.com/drive/1WrFtqo5LW8QfhiuhHmge9QLexWwS2BcM
  You can also find the link on the tutorial website: https://salu133445.github.io/ismir2019tutorial/
• 73. BinaryMuseGAN
  Key—Naïve binarization methods can easily lead to overly-fragmented notes
  • MuseGAN’s output (real-valued) → hard thresholding or Bernoulli sampling → many overly-fragmented notes
  • BinaryMuseGAN (+DBNs) → fewer overly-fragmented notes
  Dong and Yang, “Convolutional generative adversarial networks with binary neurons for polyphonic music generation,” ISMIR 2018
• 74. BinaryMuseGAN
  • use binary neurons at the output layer of the generator
  • use the straight-through estimator to estimate the gradients for the binary neurons (which involve nondifferentiable operations)
                   Generator’s outputs   Real data
  MuseGAN          real-valued           binary-valued
  BinaryMuseGAN    binary-valued         binary-valued
  Dong and Yang, “Convolutional generative adversarial networks with binary neurons for polyphonic music generation,” ISMIR 2018
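  A minimal sketch of a deterministic binary neuron with a straight-through estimator in PyTorch: hard-threshold in the forward pass, but treat the thresholding as the identity when back-propagating. The 0.5 threshold assumes sigmoid outputs; this is an illustration rather than the paper's exact implementation.

    import torch

    class BinaryNeuron(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return (x > 0.5).float()  # hard thresholding (nondifferentiable)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output        # straight-through: pass gradients unchanged

    binarize = BinaryNeuron.apply     # e.g., notes = binarize(torch.sigmoid(logits))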
• 75–78. LeadSheetGAN
  [Architecture diagrams, built up over four slides: an (unconditional) GAN first generates the lead sheet; a conditional GAN then generates the arrangement.]
  Key—First generate the lead sheet, then the arrangement
  Liu and Yang, “Lead sheet generation and arrangement by conditional generative adversarial network,” ICMLA 2018
  Liu and Yang, “Lead sheet generation and arrangement via a hybrid generative model,” ISMIR-LBD 2018
• 79. LeadSheetGAN—Samples
  • Latent space interpolation: Maroon 5, “Payphone” ↔ The Beatles, “Hey Jude”
• 80. LeadSheetGAN—Samples
  • Arrangement generation given an “Amazing Grace” lead sheet
  More samples can be found at https://liuhaumin.github.io/LeadsheetArrangement/results
  Liu and Yang, “Lead sheet generation and arrangement by conditional generative adversarial network,” ICMLA 2018
• 81. Tea Time!
  (Outline recap; up next: the remaining case studies, then Section 4 Current limitations & future research directions)
• 82. Scope of music generation
  1. Symbolic melody generation: X → melody
  2. Arrangement generation: X → piano roll; melody → piano roll
  3. Style transfer: piano roll → piano roll’; audio → audio’
  4. Audio generation: X → audio; piano roll → audio
• 83. CycleGANs for Music Style Transfer (piano roll → piano roll’; audio → audio’)
• 84. Music style transfer
  • Alter the “style,” but keep the “content” fixed
  • Three types of music style transfer [1]:
    1. composition style transfer, for scores
    2. performance style transfer, for performance control
    3. timbre style transfer, for sound
  • Little existing work on performance style transfer (e.g., [2]) uses deep learning
  [1] Dai et al., “Music style transfer: A position paper,” MUME 2018
  [2] Shih et al., “Analysis and synthesis of the violin playing styles of Heifetz and Oistrakh,” DAFx 2017
• 85. Composition style transfer (musical genre)
  • Example: https://www.youtube.com/watch?v=buXqNqBFd6E
  • Re-orchestrations of Beethoven’s Ode to Joy by a collaboration between human and AI (Sony CSL Flow Machines)
  Pachet, “A Joyful Ode to automatic orchestration,” ACM TIST 2016
• 86. Composition style transfer (musical genre)
  • Transfer among Classic, Jazz, and Pop [1,2]
  • Model: standard convolutional CycleGAN
  • I/O representation: single-track piano roll (64 x 84)
    • merge all notes of all tracks (except for drums) into a single track; discard drums
    • 7 octaves (C1–C8; hence 84 notes)
    • 4/4 time signature, 16 time steps per bar, 4 bars as a unit (hence 64 steps)
  • 10k+ four-bar phrases for each genre (no paired data)
  [1] Brunner et al., “Symbolic music genre transfer with CycleGAN,” ICTAI 2018
  [2] Brunner et al., “Neural symbolic music genre transfer insights,” MML 2019
• 87. Recap—CycleGAN
  • Adversarial loss + identity mapping loss (“cycle consistency”)
  Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” ICCV 2017
• 88. Composition style transfer (musical genre)
  • Samples: https://www.youtube.com/channel/UCs-bI_NP7PrQaMV1AJ4A3HQ
  • Possible extensions:
    • Consider different voices (including drums) separately (instead of merging them)
    • Add recurrent layers to better model sequential information
    • Identify the melody line [1] and take better care of it
  • Related: supervised genre style transfer (not using GANs) using synthesized data [2]
  [1] Simonetta et al., “A convolutional approach to melody line identification in symbolic scores,” ISMIR 2019
  [2] Cífka et al., “Supervised symbolic music style translation using synthetic data,” ISMIR 2019
• 89. Timbre style transfer (instrumental sounds): TimbreTron
  • Transfer among piano, flute, violin, and harpsichord solos: https://www.cs.toronto.edu/~huang/TimbreTron/samples_page.html
  • Model: modified version of CycleGAN
  • I/O representation: 4-second CQT (257 x 251)
  Huang et al., “TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer,” ICLR 2019
• 90. Drawback of CycleGAN
  • Can work on only two domains at a time
  • For example, piano↔flute, piano↔violin, piano↔harpsichord, etc.
• 91. MUNIT instead of CycleGAN
  • MUNIT [1]: an advanced version of CycleGAN that incorporates encoders/decoders to get disentangled “content” and “style” codes
  • Can work on multiple domains at the same time
  [1] Huang et al., “Multimodal unsupervised image-to-image translation,” ECCV 2018
• 92. Timbre style transfer (instrumental sounds): Play-as-you-like
  • Transfer among piano solos, guitar solos, and string quartets (https://tinyurl.com/y23tvhjx)
  • Uses MUNIT
  • I/O representation: mel-spectrogram + spectral difference + MFCC + spectral envelope
  • Cycle consistency among the channel-wise features (as regularizers)
  • Style interpolation (piano ↔ guitar): https://soundcloud.com/affige/sets/ismir2019-gan-tutorial-supp-material
  Lu et al., “Play as You Like: Timbre-enhanced multi-modal music style transfer,” AAAI 2019
• 93. Timbre style transfer (singing): CycleBEGAN
  Wu et al., “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” ICML workshop 2018
• 94. Timbre style transfer (singing): CycleBEGAN
  • Transfer between female and male singing voices [1]: http://mirlab.org/users/haley.wu/cybegan/ (check the outside test result)
  • Model: BEGAN instead of GAN [2]
    • train the generator G such that the discriminator D (an encoder/decoder net) would reconstruct fake data as nicely as real data
    • has a mechanism to balance the power of G and D
  • Plus skip connections (also used in [3]) and a recurrent layer
  [1] Wu et al., “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” ICML workshop 2018
  [2] Berthelot et al., “BEGAN: Boundary equilibrium generative adversarial networks,” arXiv 2017
  [3] Hung et al., “Musical composition style transfer via disentangled timbre representations,” IJCAI 2019
• 95. Timbre style transfer (singing): CycleBEGAN
  • Skip connections contribute to sharpness, lyrics intelligibility, and naturalness
  • Recurrent layers further improve everything, especially pitch accuracy
  Wu et al., “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” ICML workshop 2018
• 96. Timbre style transfer (singing): CycleBEGAN
  [Architecture diagram: the generator also uses an encoder/decoder architecture, with skip connections and an RNN at the bottleneck.]
  Wu et al., “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” ICML workshop 2018
• 97. GANs for Music Audio Generation (X → audio; piano roll → audio)
• 98. Generating instrument sounds using GANs
  • Generate spectrograms: SpecGAN [1], TiFGAN [2], GANSynth [3]
  • Generate waveforms: WaveGAN [1]
  • There are also approaches that do not use GANs [4,5,6,7]
  [1] Donahue et al., “Adversarial audio synthesis,” ICLR 2019
  [2] Marafioti et al., “Adversarial generation of time-frequency features,” ICML 2019
  [3] Engel et al., “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
  [4] Oord et al., “WaveNet: A generative model for raw audio,” SSW 2016
  [5] Défossez et al., “SING: Symbol-to-instrument neural generator,” NeurIPS 2018
  [6] Schimbinschi et al., “SynthNet: Learning to synthesize music end-to-end,” IJCAI 2019
  [7] “PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network,” AAAI 2019
• 99. SpecGAN and WaveGAN
  • Generating 1-second audio at 16 kHz (16,000 samples): https://chrisdonahue.com/wavegan/
  • Model: based on DCGAN
    • Flatten the 2D convolutions into 1D (e.g., a 5x5 2D convolution becomes a length-25 1D one)
    • Increase the stride factor of all convolutions (e.g., stride 2x2 becomes stride 4)
    • DCGAN outputs 64x64 images; add one more layer so that the output has 16,384 samples
  Donahue et al., “Adversarial audio synthesis,” ICLR 2019
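  A sketch of the flattening idea, with illustrative channel sizes: a transposed 1D convolution with a length-25 kernel and stride 4 upsamples the waveform by exactly 4x per layer.

    import torch.nn as nn

    # Output length = (L - 1) * 4 - 2 * 11 + 25 + 1 = 4L, i.e., 4x upsampling.
    up = nn.ConvTranspose1d(in_channels=512, out_channels=256,
                            kernel_size=25, stride=4,
                            padding=11, output_padding=1)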
• 100. GANSynth
  • Generating 4-second audio at 16 kHz [1]: a large output dimension; https://magenta.tensorflow.org/gansynth
  • Model: based on PGGAN [2]
    • Progressively grow the GAN, from low to high resolution
    • During a resolution transition, interpolate between the outputs of the two resolutions, with weight α increasing linearly from 0 to 1
  [1] Engel et al., “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
  [2] Karras et al., “Progressive growing of GANs for improved quality, stability, and variation,” ICLR 2018
• 101. PGGAN: progressive growing of GANs
  Karras et al., “Progressive growing of GANs for improved quality, stability, and variation,” ICLR 2018
• 102. GANSynth
  • Generating 4-second audio at 16 kHz: https://magenta.tensorflow.org/gansynth
  • Model: based on PGGAN (conv2d)
  • Output: mel-spectrogram + instantaneous frequency (IF)
    • Use the IF to derive the phase, then use the inverse STFT to get the waveform
    • STFT window size 2048, stride 512 (so, about 128 frames)
    • 1024-bin mel-frequency scale
    • Target output tensor size: 128 x 1024 x 2 (2x16 → 4x32 → … → 128x1024)
  Engel et al., “GANSynth: Adversarial neural audio synthesis,” ICLR 2019
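  A sketch of the IF-to-phase step, assuming (as in GANSynth) that the generated IF is scaled to [-1, 1]: the instantaneous frequency is the unwrapped frame-to-frame phase difference, so a cumulative sum along time recovers the phase, which can then be combined with the magnitudes for the inverse STFT.

    import numpy as np

    def if_to_phase(inst_freq):
        # inst_freq: (frames, freq bins), scaled to [-1, 1]
        return np.cumsum(inst_freq * np.pi, axis=0)  # phase in radians, per bin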
• 103. GANSynth: Why instantaneous frequency?
  [Figure-only slide comparing phase and instantaneous-frequency representations.]
• 104. GANSynth: Pitch-conditioned generation
  • Input to the generator
    • 256-dim random vector z
    • 61-dim one-hot vector (MIDI 24–84) for pitch conditioning
  • Auxiliary pitch-classification loss for the discriminator (ACGAN [1])
    • In addition to the real/fake loss, try to predict the pitch label
  [1] Odena et al., “Conditional image synthesis with auxiliary classifier GANs,” ICML 2017
• 105. GANSynth: Possible future extensions
  • Modify the network to generate variable-length audio
  • From note-level synthesis (i.e., conditioning on a one-hot pitch vector) to phrase-level synthesis (i.e., conditioning on a piano roll) [1,2]
  • The IF might be noisy during note transitions
  • May need to deal with the use of playing techniques [3]
  [1] Wang and Yang, “PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network,” AAAI 2019
  [2] Chen et al., “Demonstration of PerformanceNet: A convolutional neural network model for score-to-audio music generation,” IJCAI demo paper, 2019
  [3] Su et al., “TENT: Technique-embedded note tracking for real-world guitar solo recordings,” TISMIR 2019
• 106. PerformanceNet: Possible future extensions
  • Building an “AI Performer” [1,2,3]
  [1] Wang and Yang, “PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network,” AAAI 2019
  [2] Chen et al., “Demonstration of PerformanceNet: A convolutional neural network model for score-to-audio music generation,” IJCAI demo paper, 2019
  [3] Oore et al., “This time with feeling: Learning expressive musical performance,” arXiv 2018
• 107. Outline
  Section 1: Overview of music generation research
  Section 2: Introduction to GANs
  Section 3: Case studies of GAN-based systems
  Section 4: Current limitations; future research directions
• 108. Limitations of GANs
  • A bit difficult to train (it’s helpful to know some tips and tricks [1])
  • Only learn the z → X mapping, not the inverse (X → z)
  • Less explainable [2]
  • Less clear how GANs can model text-like data or musical scores
  • Unclear how GANs (and all other music composition models in general) can generate “new” music genres
  [1] “How to Train a GAN? Tips and tricks to make GANs work,” https://github.com/soumith/ganhacks
  [2] Kelz and Widmer, “Towards interpretable polyphonic transcription with invertible neural networks,” ISMIR 2019
• 109. The many other generative models
  • Variational autoencoders (VAEs)
  • Flow-based models
  • Autoregressive models
  • Attention mechanisms (transformers)
  • Restricted Boltzmann machines (RBMs)
  • Hidden Markov models (HMMs)
  • …and more!
• 110. Future research directions
  • Better network architectures for musical data, including piano rolls, MIDI events, musical scores, and audio
  • Learning to generate music with better structure and diversity
  • Better interpretability and human control (e.g., [1])
  • Standardized test datasets and evaluation metrics [2]
  • Cross-modal generation, e.g., “music + lyrics” or “music + video”
  • Interactive music generation (e.g., [3])
  [1] Lattner and Grachten, “High-level control of drum track generation using learned patterns of rhythmic interaction,” WASPAA 2019
  [2] Katayose et al., “On evaluating systems for generating expressive music performance: the Rencon experience,” Journal of New Music Research 2012
  [3] Hsiao et al., “Jamming with Yating: Interactive demonstration of a music composition AI,” ISMIR-LBD 2019
• 111. Future research directions
  • Composition style transfer (musical genre) using the LPD dataset
  • More work on performance style transfer
  • Phrase-level audio generation instead of note-level synthesis [1]
  • Multi-singer audio generation and style transfer
  • Lyrics-free singing generation [2]
  • EDM generation [3,4]
  [1] “Demonstration of PerformanceNet: A convolutional neural network model for score-to-audio music generation,” IJCAI demo paper, 2019
  [2] “Score and lyrics-free singing voice generation,” ICLR 2020 submission
  [3] “DJnet: A dream for making an automatic DJ,” ISMIR-LBD 2017
  [4] “Unmixer: An interface for extracting and remixing loops,” ISMIR 2019
• 112. Future research directions
  • The “MIR4generation” pipeline: learning to generate music (that is expressive) from machine-transcribed data, i.e., learning to compose and perform at the same time
  Yeh et al., “Learning to generate Jazz and Pop piano music from audio via MIR techniques,” ISMIR-LBD 2019
• 113. Thank you! Any questions?
  Contact: Hao-Wen Dong (salu.hwdong@gmail.com); Yi-Hsuan Yang (yang@citi.sinica.edu.tw)
  Tutorial website: salu133445.github.io/ismir2019tutorial/