MuseGAN: Multi-track Sequential
Generative Adversarial Networks for
Symbolic Music Generation and
Accompaniment
Hongkyu Lim
2021. 01. 29
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)
Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, Yi-Hsuan Yang
Contents
•Overview of the Paper
•Proposed Model
•Implementation
•Experiment and Results
2
Overview of the Paper
• Musical Terminology
 Phrase – a short passage of music (roughly, a line of a song)
 Note – a single musical sound event
 Pitch – how high or low a note sounds
 Tracks – instruments
 Bars – segments of time containing a fixed number of beats; here a bar is treated like an image of "pixels"
 Fragments – overly short, broken-up notes
 Monophonic music – music with a single melodic line
(one note at a time), e.g., a single instrument
 Polyphonic music – music with multiple simultaneous
notes, e.g., multiple instruments
3
Overview of the Paper
• Basic Ideas and Motivation
 Music is an art of time
• Music has a hierarchical structure, with higher-level building blocks made up of smaller recurrent
patterns.
• People pay attention to structural patterns related to coherence, rhythm, tension and the
emotion flow while listening to music. Thus, a mechanism to account for the temporal structure is
critical.
 Music is usually composed of multiple instruments/tracks.
• A modern orchestra usually contains four different sections: brass, strings, woodwinds and
percussion; a rock band includes a bass, a drum set, guitars and possibly a vocal.
• These tracks interact with one another closely and unfold over time interdependently.
 Musical notes are often grouped into chords, arpeggios or melodies.
• Introducing a chronological ordering of notes is not natural for polyphonic music, where multiple notes sound simultaneously.
• Therefore, success in natural language generation and monophonic music generation may not be
readily generalizable to polyphonic music generation.
4
Overview of the Paper
• Basic Ideas and Motivation (Continued..)
 Most prior work chose to simplify symbolic music generation in certain ways to render
the problem manageable.
• Goal
 To generate multi-track polyphonic music with the following properties:
• Harmonic and rhythmic structure
• Multi-track interdependency
• Temporal structure
• Simplifications made in prior work (which MuseGAN avoids)
• Existing RNN-based models generate only single-track monophonic music
• Introducing a chronological ordering of notes for polyphonic music
• Generating polyphonic music as a combination of several monophonic melodies, etc.
5
Overview of the Paper
• Approach
 To incorporate a temporal model, two approaches are proposed for different scenarios:
1) One generates music from scratch (without human input).
2) The other learns to follow the underlying temporal structure of a track given a priori by a human.
 To handle the interactions among tracks, three methods are suggested, based on an
understanding of how pop music is composed:
• One generates tracks independently with private generators (one per track); another generates all tracks
jointly with only one generator; the other generates each track with its private generator plus additional
shared inputs among tracks, which are expected to guide the tracks to be collectively harmonious and
coordinated.
• To cope with the grouping of notes, bars (not individual notes) are treated as the basic compositional unit, and
music is generated one bar after another using transposed convolutional neural networks, which are
known to be good at finding local, translation-invariant patterns.
• Intra-track and inter-track objective measures are proposed and used to monitor the learning process and to
evaluate the generated results of the different proposed models quantitatively.
6
Proposed Model
• Generative Adversarial Networks (GANs)
 The core concept of GANs is to achieve adversarial learning by constructing two
networks: the generator and the discriminator
 The generator maps a random noise ‘z’ sampled from a prior distribution to the data
space.
 The discriminator is trained to distinguish real data from those generated by the
generator, whereas the generator is trained to fool the discriminator.
 The training procedure can be formally modeled as a two-player minimax game
between the generator G and the discriminator D :
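The slide shows the objective as an image; for reference, the standard GAN minimax objective it refers to, reconstructed from the term definitions below, is:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]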
*pd and pz denote the distribution of real data and the prior distribution of z
*E – expectation
*D(x) : discriminator output for real data x
*D(G(z)) : discriminator output for generated (fake) data G(z)
7
Proposed Model
*pd and pz denote the distribution of real data and the prior distribution of z
*E – expectation
*D(x) : discriminator output for real data x (close to 1 → real)
*D(G(z)) : discriminator output for generated (fake) data G(z) (close to 0 → fake)
8
Proposed Model
• Jamming Model
 Multiple generators work independently, each generating music for its own track from a
private random vector zi, i = 1, 2, ..., M, where M denotes the number of generators
(or tracks).
 These generators receive critics (i.e., back-propagated supervisory signals) from different
discriminators. To generate music of M tracks, M generators and M discriminators are needed.
9
Proposed Model
• Composer Model
 One single generator creates a multi-channel piano-roll, with each channel
representing a specific track.
 The composer model requires only one shared random vector z and one discriminator, which
examines the M tracks collectively to tell whether the input music is real or fake.
 Regardless of the value of M, only one generator and one discriminator are needed.
10
Proposed Model
• Hybrid Model
 M generators and only one discriminator
 Combining the ideas of jamming and composing, the authors expect the inter-track
random vector to coordinate the generation of the different "musicians" (the per-track
generators Gi), just as a composer does. Moreover, only one discriminator is used to
evaluate the M tracks collectively.
 A major difference between the composer model and the hybrid model lies in the flexibility
– in the hybrid model we can use different network architectures and different inputs for
the M generators. Therefore, we can, for example, vary the generation of one specific track
without losing the inter-track interdependency.
11
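To make the difference between the three multi-track schemes concrete, here is a minimal Python sketch (not the authors' code; the generators are stand-in stubs and all names and dimensions are hypothetical) of how the random inputs are wired in each case:

import numpy as np

M, z_dim = 5, 128                                              # 5 tracks, latent size (assumed)
G = [lambda z, i=i: f"bar for track {i}" for i in range(M)]    # per-track generator stubs
G_shared = lambda z: [f"bar for track {i}" for i in range(M)]  # single joint generator stub

# Jamming model: M private latents, M generators, M discriminators
z_private = [np.random.randn(z_dim) for _ in range(M)]
jam_bars = [G[i](z_private[i]) for i in range(M)]

# Composer model: one shared latent, one generator emitting all M channels, one discriminator
z_shared = np.random.randn(z_dim)
comp_bars = G_shared(z_shared)

# Hybrid model: each track's generator gets [shared ; private], one discriminator sees all tracks
hyb_bars = [G[i](np.concatenate([z_shared, z_private[i]])) for i in range(M)]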
Proposed Model
• Modeling the Temporal Structure
 The models presented so far can only generate multi-track music bar by bar, with
possibly no coherence among the bars. → A temporal model is needed to generate
music a few bars long, such as a musical phrase. Two methods are proposed.
12
Proposed Model
• Modeling the Temporal Structure
• Generation from Scratch
• Generation from scratch aims to generate fixed-length musical phrases by viewing bar progression as another
dimension to grow the generator.
*Here a phrase is not a strict musical unit, but roughly a line of a song.
• The generator consists of two sub-networks, the temporal structure generator Gtemp and the bar
generator Gbar. Gtemp maps a noise vector z to a sequence of latent vectors; this sequence, which
carries the temporal information, is then used by Gbar to generate piano-rolls sequentially (bar by bar).
13
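A minimal sketch of this two-stage generation (stubs only; the 4-bar, 96-step, 84-pitch shapes follow the setting described later, everything else is an assumption): Gtemp expands one noise vector into a sequence of per-bar latents, and Gbar turns each latent into one bar of piano-roll.

import numpy as np

T, z_dim, n_step, n_pitch = 4, 128, 96, 84           # 4 bars per phrase, 96 steps x 84 pitches

def G_temp(z):                                        # stub: z -> sequence of T latent vectors
    return [np.tanh(z + t) for t in range(T)]         # placeholder transform

def G_bar(z_t):                                       # stub: latent -> one bar of piano-roll
    return np.random.rand(n_step, n_pitch)            # placeholder output

z = np.random.randn(z_dim)
phrase = np.stack([G_bar(z_t) for z_t in G_temp(z)])  # shape (4, 96, 84), generated bar by bar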
Proposed Model
• Modeling the Temporal Structure
• Track-conditional Generation
• Track-conditional generation assumes that the bar sequence y of one specific track is given by a human, and
tries to learn the temporal structure underlying that track and to generate the remaining tracks. The
track-conditional generator G generates bars one after another with the conditional bar generator Gbar:
the multi-track piano-rolls of the remaining tracks of bar t are generated by Gbar, conditioned on bar t of y.
• An encoder E is trained to map y to the space of z. E is expected to extract inter-track features instead of
intra-track features from the given track, since intra-track features are not supposed to be useful for
generating the other tracks (e.g., AIVA-style accompaniment).
14
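Likewise, a stub-level sketch (my own simplification, not the paper's architecture) of the track-conditional setting: the encoder E maps each bar of the human-given track to the latent space, and a conditional bar generator produces the remaining tracks for that bar.

import numpy as np

T, M, z_dim, n_step, n_pitch = 4, 5, 128, 96, 84

def E(y_t):                                           # stub encoder: given bar -> latent code
    return np.random.randn(z_dim)

def G_bar_cond(z, cond):                              # stub conditional bar generator
    return np.random.rand(M - 1, n_step, n_pitch)     # the remaining M-1 tracks for this bar

y = np.random.rand(T, n_step, n_pitch)                # human-given track (e.g., piano), T bars
z = np.random.randn(z_dim)
accomp = np.stack([G_bar_cond(z, E(y[t])) for t in range(T)])   # shape (4, 4, 96, 84)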
Proposed Model
• Modeling the Temporal Structure
• MuseGAN
• An integration and extension of the proposed multi-track and temporal models
*M : number of tracks
T : number of bars
i : track index
t : bar index
15
• Modeling the Temporal Structure
• Inputs
 zt : inter-track, time-dependent random vectors
 z : an inter-track, time-independent random vector
 zi,t : intra-track, time-dependent random vectors
 zi : intra-track, time-independent random vectors
*M : number of tracks
T : number of bars
i : track index
t : bar index
16
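A small sketch of how the four kinds of random vectors are combined per track and per bar (the four-way split is the conceptual picture from the paper; the dimensions and the plain concatenation here are assumptions):

import numpy as np

M, T, z_dim = 5, 4, 32

z    = np.random.randn(z_dim)            # inter-track, time-independent
z_t  = np.random.randn(T, z_dim)         # inter-track, time-dependent (output of the shared temporal generator)
z_i  = np.random.randn(M, z_dim)         # intra-track, time-independent
z_it = np.random.randn(M, T, z_dim)      # intra-track, time-dependent (per-track temporal generator)

def bar_input(i, t):
    """Input handed to the bar generator for track i at bar t."""
    return np.concatenate([z, z_t[t], z_i[i], z_it[i, t]])

print(bar_input(0, 0).shape)             # (128,)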
Implementation
 Dataset
• Lakh MIDI Dataset (LMD): a large collection of 176,581 unique MIDI files.
• The MIDI files are converted into multi-track piano-rolls.
• For each bar, the height is set to 128 (the number of MIDI pitches) and the width (the temporal resolution
of a bar) to 96 time steps, to model common temporal patterns such as triplets and 16th notes.
• The Python library 'pretty_midi' is used to parse and process the MIDI files.
• The resulting dataset is named the Lakh Pianoroll Dataset (LPD).
 Data Preprocessing
21
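A rough sketch of the MIDI-to-piano-roll step with pretty_midi; the fs computation below assumes a single, constant tempo and 4/4 time, and the paper's actual conversion pipeline may differ in detail:

import numpy as np
import pretty_midi

pm = pretty_midi.PrettyMIDI("song.mid")          # placeholder path
tempo = pm.get_tempo_changes()[1][0]             # first tempo in BPM (assumed constant)
fs = (tempo / 60.0) * 24                         # 24 time steps per beat -> 96 per 4/4 bar

rolls = []
for inst in pm.instruments:
    roll = inst.get_piano_roll(fs=fs)            # (128 pitches, n_time_steps), velocity values
    rolls.append((roll > 0).astype(np.uint8))    # keep only note on/off (binary piano-roll)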
Implementation
 Data Preprocessing
• As these MIDI files are scraped from the web and mostly user-generated, the dataset is quite
noisy. Hence, LPD-matched is used in this work, and three further cleansing steps are performed:
1) Some tracks tend to play only a few notes in an entire song. This increases data sparsity and
impedes the learning process. Such data imbalance is handled by merging tracks of similar
instruments (by summing their piano-rolls). Each multi-track piano-roll is compressed into five tracks:
bass, drums, guitar, piano and strings. After this step, we get the LPD-5-matched, which has 30,887
multi-track piano-rolls.
22
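A hedged sketch of step 1; the General MIDI program ranges below are my own rough grouping into the five families, not necessarily the exact mapping used to build LPD-5:

import numpy as np

def family_of(program, is_drum):
    if is_drum:                 return "drums"
    if 32 <= program <= 39:     return "bass"     # GM bass programs
    if 24 <= program <= 31:     return "guitar"   # GM guitar programs
    if 0  <= program <= 7:      return "piano"    # GM piano programs
    return "strings"                              # everything else lumped together

def merge_tracks(instrument_rolls):
    """instrument_rolls: list of (program, is_drum, (128, T) binary roll) tuples."""
    T = instrument_rolls[0][2].shape[1]
    merged = {k: np.zeros((128, T), dtype=np.uint8)
              for k in ["bass", "drums", "guitar", "piano", "strings"]}
    for program, is_drum, roll in instrument_rolls:
        key = family_of(program, is_drum)
        merged[key] = np.clip(merged[key] + roll, 0, 1)   # sum piano-rolls, then re-binarize
    return merged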
Implementation
 Data Preprocessing
• As these MIDI files are scraped from the web and mostly user-generated (Raffel and Ellis 2016),
the dataset is quite noisy. Hence, LPD-matched is used in this work, with three further cleansing steps:
2) The metadata provided in the LMD and the MSD (Million Song Dataset) is utilized to pick only the
piano-rolls that have a higher confidence score in matching, that are Rock songs, and that are in 4/4 time.
After this step, we get the LPD-5-cleansed.
3) In order to acquire musically meaningful phrases to train the temporal model, the piano-rolls are
segmented with a state-of-the-art algorithm, structural features (Serrà et al. 2012), and phrases are obtained
accordingly. Four bars are considered a phrase, and longer segments are pruned to the proper size.
This yields 50,266 phrases in total for the training data. Notably, although the models generate
fixed-length segments only, the track-conditional model is able to generate music of any length
according to the input. The size of the target output tensor is hence
4 (bars) × 96 (time steps) × 84 (notes) × 5 (tracks).
23
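A small sketch of how the cropped, segmented tensor could be assembled; keeping MIDI pitches 24–107 is an assumption consistent with the 84-note range (the slide does not state the exact range):

import numpy as np

def to_phrases(roll_5tracks):
    """roll_5tracks: (n_bars, 96, 128, 5) binary array -> (n_phrases, 4, 96, 84, 5)."""
    cropped = roll_5tracks[:, :, 24:108, :]               # keep 84 pitches
    n_phrases = cropped.shape[0] // 4                     # four bars per phrase
    return cropped[: n_phrases * 4].reshape(n_phrases, 4, 96, 84, 5)

phrases = to_phrases(np.zeros((16, 96, 128, 5), dtype=np.uint8))
print(phrases.shape)                                      # (4, 4, 96, 84, 5)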
Experiment and Results
• Objective Metrics for Evaluation
 Both the real and the generated data are evaluated with four intra-track metrics and one
inter-track metric (the last one):
• EB : ratio of empty bars (in %)
• UPC : number of used pitch classes per bar (from 0 to 12)
• QN : ratio of qualified notes (in %). A note no shorter than three time steps (i.e., a 32nd note)
is considered a qualified note.
• DP, or drum pattern : ratio of drum notes in 8- or 16-beat patterns, which are common for Rock songs in 4/4
time (in %)
• TD, or tonal distance : measures the harmonicity between a pair of tracks. A larger TD implies
weaker inter-track harmonic relations.
28
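A sketch of the intra-track metrics implemented directly from the definitions above (my own code, not the authors' evaluation scripts), on a single binary track of shape (n_bars, 96, 84); DP and TD are omitted because they need a drum-pattern template and a chroma-based tonal model:

import numpy as np

def empty_bars(track):                            # EB: ratio of empty bars, in %
    return 100.0 * np.mean(track.sum(axis=(1, 2)) == 0)

def used_pitch_classes(track):                    # UPC: avg. number of used pitch classes per bar
    per_bar = track.sum(axis=1)                   # (n_bars, 84): activity per pitch
    pc_used = per_bar.reshape(-1, 7, 12).sum(axis=1) > 0   # fold 84 pitches into 12 classes
    return pc_used.sum(axis=1).mean()

def qualified_notes(track, min_len=3):            # QN: ratio of notes >= 3 time steps, in %
    flat = track.reshape(-1, track.shape[2])      # concatenate bars along the time axis
    total, qualified = 0, 0
    for p in range(flat.shape[1]):
        col = np.concatenate([[0], flat[:, p].astype(int), [0]])
        onsets = np.where(np.diff(col) == 1)[0]   # 0 -> 1 transitions
        offsets = np.where(np.diff(col) == -1)[0] # 1 -> 0 transitions
        total += len(onsets)
        qualified += int(np.sum((offsets - onsets) >= min_len))
    return 100.0 * qualified / max(total, 1)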
Experiment and Results
• Analysis of Training Data
 The four intra-track metrics and the inter-track metric (the last one), computed on the real
and the generated data:
*ablated : an ablated version of the composer model, without batch normalization layers
• EB : shows that categorizing the tracks into five families is appropriate.
• UPC : the bass tends to play the melody, which results in a UPC below 2, while the guitar,
piano and strings tend to play chords, which results in a UPC above 3.
29
Experiment and Results
• Analysis of Training Data
 The four intra-track metrics and the inter-track metric (the last one), computed on the real
and the generated data:
• QN : high values of QN indicate that the converted piano-rolls are not overly fragmented.
• DP : over 88 percent of the drum notes are in either 8- or 16-beat patterns.
• TD : values are around 1.50 when measuring the distance between a melody-like track (mostly the bass) and
a chord-like track (mostly one of the piano, guitar or strings), and around 1.00 for two chord-like
tracks.
30
Experiment and Results
• Example Results
 The tracks usually play in the same musical scale.
 Chord-like intervals can be observed in some samples.
 The bass often plays the lowest pitches and is monophonic most of the time.
 The drums usually have 8- or 16-beat rhythmic patterns.
 The guitar, piano and strings tend to play the chords, and their pitches sometimes overlap,
which indicates nice harmonic relations.
31
Experiment and Results
• Objective Evaluation
 20,000 bars are generated with each model and evaluated in terms of the proposed
objective metrics.
 For the track-conditional setting, the piano track is used as the condition to generate the other four tracks.
 For comparison, the result of an ablated version of the composer model, one without batch
normalization layers, is also included. This ablated model barely learns anything, so its
values can be taken as a reference.
 For the intra-track metrics, the jamming model tends to perform the best.
 All models tend to use more pitch classes and produce considerably fewer qualified
notes than the training data does.
 This indicates that some noise might have been produced and that the generated music
contains a great amount of overly fragmented notes, which may result from the way the
continuous-valued output of G is binarized (to create binary-valued piano-rolls).
 For the inter-track metric TD, the values of the composer model and the hybrid model are lower
than those of the jamming model. → Music generated by the jamming model has weaker harmonic
relations among tracks, and the composer model and the hybrid model may be more
appropriate for multi-track generation in terms of cross-track harmonic relations.
32
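The binarization step mentioned above, as a one-liner sketch (the 0.5 threshold is an assumption; the slide only says the continuous-valued output of G is turned into a binary piano-roll):

import numpy as np

def binarize(piano_roll, threshold=0.5):
    """Turn G's continuous-valued output into a binary (on/off) piano-roll."""
    return (piano_roll > threshold).astype(np.uint8)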
Experiment and Results
• Example Results
 (a) Training loss of the discriminator
 (b) UPC – number of used pitch
classes per bar
 (c) QN – ratio of qualified notes – of the
strings track, for the composer model
generating from scratch
33
Experiment and Results
• User Study
 Subjects rate the clips in terms of
whether they
• 1) have pleasant harmony
• 2) have unified rhythm
• 3) have clear musical structure
• 4) are coherent,
and • 5) give an overall rating
* on a 5-point Likert scale
* 144 subjects were recruited from the
Internet via the authors' social circles;
44 of them are deemed 'pro users'
according to a simple questionnaire
probing their musical background.
34
Experiment and Results
35
Conclusion and Future Work
• Conclusion
 MuseGAN is a novel generative model for multi-track sequence generation under the
framework of GANs.
 A model such as MuseGAN, which uses deep CNNs to generate multi-track music, could lead to
new ideas beyond the work presented in this paper.
 The objective metrics and the subjective user study show that the proposed models
can start to learn something about music.