PopMAG: Pop Music Accompaniment Generation
1. PopMAG: Pop Music Accompaniment Generation
Presenter: Hyeshin Chu
2021. 08. 20
Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu (MM 2020)
2. Contents
• Overview of the Paper
• Introduction
• Multi-Track Midi Representation
• Multi-Track Modeling
• Experimental Setup
• Results and Analyses
• Conclusion & Future Work
3. Overview of the Paper
https://music-popmag.github.io/popmag/
• Goal
To improve harmony of accompaniment (usually with multiple instruments)
• Previous Works
Generate multiple tracks separately
Music notes from different tracks do NOT explicitly depend on each other
• MuMIDI (MUlti-track MIDI representation)
Simultaneous multi-track generation in a single sequence
Explicitly models the dependency of the notes from different tracks
A new challenge arises!
• Enlarges the sequence length ⇒ Difficult to model long-term structure
How to solve?
• 1) Model multiple note attributes (e.g., pitch, duration, velocity) of a note in one step (NOT in multiple steps)
⇒ Shortens the length of the MuMIDI sequence
• 2) Introduce extra long-range context as memory to capture long-term dependency in music
4. Introduction
• Music sequence modeling using deep learning techniques
CNN [30], RNN [26], Transformer [5, 14, 15], VAE [25], GAN [11]
• Pop music generation consists of two parts
Chord and melody generation
Accompaniment generation
• Accompaniment generation ⇒ Multi-track generation
MuseGAN [10-12]
MIDI-Sandwich2 [18]
XiaoIce Band [34]
LakhNES [9]
• MuMIDI
Encodes multi-track MIDI events into one sequence of tokens
⇒ Better captures dependency among musical notes in different tracks
Models multiple attributes of a note in one sequence step instead of multiple steps
⇒ Shortens the sequence length
⇒ Eases long-term music modeling challenges
5. Multi-Track Representation
• 2-track musical piece (Piano track, Bass track)
Piano track: 10 notes
Bass track: 5 notes
[Figure: MuMIDI symbol categories: Bar and Position, Track, Note, Chord, Meta Symbol]
6. Multi-Track Representation
• Marks the beginning of a bar and the different positions within a bar
A <Bar> contains <Pos>, <Track>, <Note>, and <Chord> tokens
Each <Bar> has a total of 32 timesteps, called <Pos> (see the sketch below)
7. Multi-Track Representation
• Six <Track> symbols:
<Track_Melody>, <Track_Drum>, <Track_Piano>, <Track_String>, <Track_Guitar>, <Track_Bass>
8. Multi-Track Representation
• <Note> includes three attributes (quantization sketched below):
Pitch: Pitch_1 (C-1) to Pitch_128 (G9) for all tracks except drum
Velocity: Quantized into 32 levels (how hard the key was struck)
Duration: The duration of a note, from 1 to 32 timesteps
9. Multi-Track Representation
• <Chord> guides the pitch range of notes and the emotion
• 84 possible chord symbols in total (enumerated in the sketch below)
12 chord roots (C, C#, D, D#, E, F, …) × 7 chord qualities (major, minor, diminished, …)
10. Multi-Track Representation
• <Meta> encodes the metadata of the whole musical piece
Including tempo, tonality, style, and emotion
Usually unchanged throughout the whole musical piece
(a full toy sequence is sketched below)
11. Multi-Track Modeling
• MuMIDI
Encodes multi-track MIDI events into a single sequence
⇒ Long sequence
⇒ Difficult to model long-term structure
• Two aspects to better model long-term sequences
(1) Shorten the sequence length:
• Modeling multiple note attributes (e.g., pitch, duration, velocity) of a note
in one sequence step (NOT in multiple steps)
(2) Adopt extra long context to capture long-term dependencies:
• In the encoder and decoder of our seq-to-seq model
12. Multi-Track Modeling: Modeling One Note in One Step
• Why?
To let the model learn from longer music structure
• How?
Apply note-level modeling
: Model multiple attributes of one note in one sequence step
Regard each attribute of a note (pitch, velocity, duration) as an embedding
⇒ The sum of all attribute embeddings represents one note (see the sketch below)
⇒ Input to the encoder and decoder of the seq-to-seq model at each time step
• Result
Shorter input and output sequences
Better captures the long-term dependency
Faster training and inference
13. Multi-Track Modeling: Modeling Long-Term Structure
• Why?
To capture and exploit long-range context
• How?
Recurrence Transformer Encoder
• Encodes each token x_i of the conditional tracks (one token per sequence step i)
• Outputs of the encoder: fed into the decoder as condition context
Recurrence Transformer Decoder
• Generates token y_j,
• conditioned on 1) the previously generated tokens y_t (t < j)
and 2) the context from the encoder
• Each token in the decoder:
• Only sees the condition context of the same bar
(a sketch of the memory mechanism follows below)
14. Multi-Track Modeling: Modeling Implementation
• Input Module
The input embedding at each timestep
: Sum of the token, meta, position, and bar embeddings at that timestep
Token Embeddings
• Contain <Note>, <Bar>, <Pos>, <Track>, <Chord>, etc.
• <Note>:
• (1) All attributes (pitch, duration, velocity) of one note ⇒ One token
• (2) Sum the embeddings of all attributes into one sequence step
Bar Embeddings
• Which bar the current input token is located in
• B_1, …, B_m (m: max # of bars in a musical piece)
Position Embeddings
• The timestep the current input token is located in
• Within a <Bar>: O (empty), P_1, …, P_32
Meta Embeddings
• Meta symbols: e.g., Tempo_low, Tempo_mid, Tempo_high
• Output Module
Predict either a note symbol or a non-note symbol (see the sketch below)
<Figure 3> Input module of MuMIDI
<Figure 4> Output module of MuMIDI (predicting a note symbol vs. a non-note symbol)
15. Experimental Setup
• Three music datasets:
LMD [23]:
• Get meta info ⇒ Filter MIDIs with the 'pop' style tag
FreeMidi:
• Crawl all MIDIs in the pop genre from the FreeMidi website
CPMD:
• A Chinese pop MIDI dataset collected by the authors
• Data processing
(1) Melody Extraction:
• MIDI Miner [13] to recognize the melody track, or use flute as the melody
(2) Track Compression
• Other tracks ⇒ Compressed into five tracks: bass, drum, guitar, piano, and string [11]
(3) Data Filtration (see the sketch below)
• Filter out tracks which contain fewer than 20 notes
• ⇒ Keep (1) MIDIs which contain at least 3 tracks and (2) which contain a melody track and at least one other track
(4) Data Segmentation
• Only consider the 4/4 time signature
(5) Chord Recognition
• Infer two chords for each bar
Val: 100 samples / Test: 100 samples / Train: the remaining samples
16. Experimental Setup
• Model Configurations
Model:
• Recurrence Transformer Encoder + Recurrence Transformer Decoder
More details:
• Encoder layers (4), decoder layers (8), encoder heads (8), decoder heads (8)
• Hidden size of all layers and dimension of token, bar, and position embeddings: 512
• Training and Evaluation Setup
Default task:
• Generate five tracks (bass, piano, guitar, string, and drum) conditioned on melody and chord
Max # of generated bars:
• Set to 32
For inference:
• Stochastic sampling, as in most music generation systems [14, 15] (see the sketch below)
17. Experimental Setup
• Subjective Evaluation
What:
• Choose the musical piece you prefer in terms of overall harmony
Who:
• 15 participants in total (5 with an understanding of basic music theory)
How:
• Each participant listens to a total of 100 listening sets (one per test musical piece)
• Each set contains musical pieces from several settings (e.g., generated, ground truth)
18. Experimental Setup
• Objective Evaluation
Chord Accuracy (CA)
• Measures harmony (Higher score ⇒ Better harmony)
• Whether the chords of the generated tracks match the conditional chord sequence
Perplexity (PPL)
• How well a model fits the sequence (Lower perplexity ⇒ The model fits the sequence better)
Pitch (P), Velocity (V), Duration (D), and Inter-Onset Interval (IOI)
• Measure the difference between the generated musical piece and the ground-truth piece
by computing the average OA (Overlapped Area) of the attribute distributions (P, V, D, or IOI);
High OA = High similarity (see the sketch below)
• Pitch (P):
• Compute the distribution of pitch classes (Higher score ⇒ More similar to GT)
• Velocity (V):
• Quantize the note velocity into 32 classes (Higher score ⇒ More similar to GT)
• Duration (D):
• Quantize the duration into 32 classes (Higher score ⇒ More similar to GT)
• Inter-Onset Interval (IOI): The time between the beginning of one note and that of the next
• Quantize the intervals into 32 classes ⇒ Compute the distribution of interval classes
(Higher score ⇒ More similar to GT)
19. Results and Analyses
Overall Quality
• Goal
To evaluate the overall harmony and quality of the musical pieces generated by PopMAG
• How
GT vs. PopMAG, on all three datasets
• Results
42%, 38%, and 40% of PopMAG-generated musical pieces reach the quality of GT
20. Results and Analyses
• MuseGAN [11] vs. PopMAG
• What & How
Generate four tracks (guitar, drum, string, and bass)
conditioned on the piano track
4 bars of notes in target tracks / No chord condition / Velocity fixed to 100
• Results
PopMAG wins on all subjective and objective metrics
PopMAG can generate long musical pieces
<Figure 6> Subjective evaluations of several settings
21. Results and Analyses
• Comparison with Other MIDI Representations
• Goal
To analyze the effectiveness of MuMIDI representation
• How
PopMAG vs. REMI [15], MIDI-Like [14]
• Result
PopMAG: Better scores
(more harmonious musical pieces)
<Table 5> Result comparison among different settings of PopMAG on the LMD dataset
<Figure 6> Subjective evaluations of several settings
22. Results and Analyses
• Analyses on Note-Level Modeling
• Goal
To verify the effectiveness of the note-level modeling method
(modeling one note in one step)
• How
PopMAG vs. MIDI-Like [14], REMI [15]
• Result
PopMAG: Faster
• Shorter target token length
• Shorter training time & lower latency
23. Results and Analyses
• Analyses on Memory in the Encoder and Decoder
• Goal
To investigate the effectiveness of the context memory in the encoder and decoder
• How
PopMAG vs.
• PopMAG – DM – EM (#4):
• Removes memory in the encoder and decoder
• PopMAG – DM (#5):
• Removes memory in the decoder
• PopMAG – EM (#6):
• Removes memory in the encoder
• Results
PopMAG (#1) outperforms the others in all metrics
⇒ Context memory in the encoder & decoder improves performance
PopMAG – EM (#6) is better than PopMAG – DM (#5)
⇒ Memory in the decoder is more important
<Figure 6> Subjective evaluations of several settings
<Table 5> Result comparison among different settings of PopMAG on the LMD dataset
in the melody-to-others task
24. Results and Analyses
• Analyses on Bar and Position Embeddings
• Goal
To prove the effectiveness of bar and position embeddings
• How
PopMAG (#1) vs. PopMAG – POS – BAR (#7), + Sinusoidal (#8), + Relative Position Encoding (#9)
• Results
PopMAG outperforms Sinusoidal (#8) and Relative (#9)
⇒ Bar and position embeddings help model better capture the music structure
<Figure 6> Subjective evaluations of several settings
<Table 5> Result comparison among different settings of PopMAG on the LMD dataset
25. Results and Analyses
• Possible Future Extensions
(1) Generate multi-track accompaniments conditioned only on melody and chord
(2) Generate more tracks conditioned on other tracks (e.g., melody, chord, etc.)
(3) Recompose a song (remove or generate some tracks)
26. Conclusion & Future Work
• Main Contributions
Propose a novel Multi-track MIDI representation (MuMIDI)
• Enables simultaneous multi-track generation in a single sequence
• Explicitly models the dependency of the notes from different tracks
Conduct experiments
• On three datasets
• Compare with previous works, plus ablation studies
Show superior performance
• Opinions
Provides detailed explanations of the terms and concepts needed to understand the musical representation
Remaining questions about the objective evaluation metrics: how valid are they?
Effort on subjective evaluation
• Only one question asks about the quality (harmony) of the generated songs
Suggestion: present usage scenarios (how MuMIDI can help end-users) to make the motivation more persuasive