PopMAG: Pop Music Accompaniment Generation
1. PopMAG: Pop Music Accompaniment Generation
Presenter: Hyeshin Chu
2021. 08. 20
Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu (MM 2020)
2. Contents
• Overview of the Paper
• Introduction
• Multi-Track Midi Representation
• Multi-Track Modeling
• Experimental Setup
• Results and Analyses
• Conclusion & Future Work
3. Overview of the Paper
https://music-popmag.github.io/popmag/
• Goal
To improve harmony of accompaniment (usually with multiple instruments)
• Previous Works
Generate multiple tracks separately
Music notes from different tracks do NOT explicitly depend on each other
• MuMIDI (MUlti-track MIDI representation)
Simultaneous multi-track generation in a single sequence
Explicitly models the dependency of the notes from different tracks
A new challenge arises!
• Enlarges the sequence length ⇒ Difficult to model long-term structure
How to solve?
• 1) Model multiple note attributes (e.g., pitch, duration, velocity) of a note in one step (NOT in multiple steps)
⇒ Shortens the length of the MuMIDI sequence
• 2) Introduce extra long-range context as memory to capture long-term dependency in music
4. Introduction
• Music sequence modeling using deep learning techniques
CNN [30], RNN [26], Transformer [5, 14, 15], VAE [25], GAN [11]
• Pop music generation consists of two parts
Chord and melody generation
Accompaniment generation
• Accompaniment generation ⇒ Multi-track generation
MuseGAN [10-12]
MIDI-Sandwich2 [18]
XiaoIce Band [34]
LakhNES [9]
• MuMIDI
Encodes multi-track MIDI events into one sequence of tokens
⇒ Better captures dependency among musical notes in different tracks
Models multiple attributes of a note in one sequence step instead of multiple steps
⇒ Shortens the sequence length
⇒ Eases long-term music modeling challenges
5. Multi-Track Representation
• 2-track musical piece (Piano track, Bass track)
Piano track: 10 notes
Bass track: 5 notes
[Figure: MuMIDI symbol categories: Bar and Position, Track, Note, Chord, Meta Symbol]
6. Multi-Track Representation
• Marks the beginning of a bar and the different positions within a bar
A <Bar> contains <Pos>, <Track>, <Note>, and <Chord> tokens
Each <Bar> has a total of 32 timesteps, called <Pos> (see the sketch below)
7. Multi-Track Representation
• Six <Track> symbols:
<Track_Melody>, <Track_Drum>, <Track_Piano>, <Track_String>, <Track_Guitar>, <Track_Bass>
8. Multi-Track Representation
• <Note> includes three attributes (quantization sketched below):
Pitch: Pitch_1 (C-1) to Pitch_128 (G9) for all tracks except drum
Velocity: Quantized into 32 levels (how hard the key was struck)
Duration: The duration of a note, from 1 to 32 timesteps
9. Multi-Track Representation
• <Chord> guides the pitch range of notes and the emotion
• 84 possible chord symbols in total (enumerated in the sketch below)
12 chord roots (C, C#, D, D#, E, F, …) × 7 chord qualities (major, minor, diminished, …)
10. Multi-Track Representation
• <Meta> encodes the metadata of the whole musical piece
Including tempo, tonality, style, and emotion
Usually unchanged throughout the whole musical piece
(a full toy sequence is sketched below)
11. Multi-Track Modeling
• MuMIDI
Encodes multi-track MIDI events into a single sequence
⇒ Long sequence
⇒ Difficult to model long-term structure
• Two aspects to better model long-term sequences
(1) Shorten the sequence length:
• Modeling multiple note attributes (e.g., pitch, duration, velocity) of a note
in one sequence step (NOT in multiple steps)
(2) Adopt extra long context to capture long-term dependencies:
• In the encoder and decoder of our seq-to-seq model
12. Multi-Track Modeling: Modeling One Note in One Step
• Why?
To let the model learn from longer music structure
• How?
Apply note-level modeling
: Model multiple attributes of one note in one sequence step
Regard each attribute of a note (pitch, velocity, duration) as an embedding
⇒ The sum of all attribute embeddings represents one note (see the sketch below)
⇒ Input to the encoder and decoder of the seq-to-seq model at each time step
• Result
Shorter input and output sequences
Better captures the long-term dependency
Faster training and inference
13. Multi-Track Modeling: Modeling Long-Term Structure
• Why?
To capture and exploit long-range context
• How?
Recurrence Transformer Encoder
• Encodes each token x_i of the conditional tracks (one token per sequence step i)
• Outputs of the encoder: fed into the decoder as condition context
Recurrence Transformer Decoder
• Generates token y_j,
• conditioned on 1) the previously generated tokens y_t (t < j)
and 2) the context from the encoder
• Each token in the decoder:
• Only sees the condition context of the same bar
(a sketch of the memory mechanism follows below)
14. Multi-Track Modeling: Modeling Implementation
• Input Module
The input embedding at each timestep
: Sum of the token, meta, position, and bar embeddings at that timestep
Token Embeddings
• Contain <Note>, <Bar>, <Pos>, <Track>, <Chord>, etc.
• <Note>:
• (1) All attributes (pitch, duration, velocity) of one note ⇒ One token
• (2) Sum the embeddings of all attributes into one sequence step
Bar Embeddings
• Which bar the current input token is located in
• B_1, …, B_m (m: max # of bars in a musical piece)
Position Embeddings
• The timestep the current input token is located in
• Within a <Bar>: O (empty), P_1, …, P_32
Meta Embeddings
• Meta symbols: e.g., Tempo_low, Tempo_mid, Tempo_high
• Output Module
Predict either a note symbol or a non-note symbol (see the sketch below)
<Figure 3> Input module of MuMIDI
<Figure 4> Output module of MuMIDI (predicting a note symbol vs. a non-note symbol)
15. Experimental Setup
• Three music datasets:
LMD [23]:
• Get meta info ⇒ Filter MIDIs with the 'pop' style tag
FreeMidi:
• Crawl all MIDIs in the pop genre from the FreeMidi website
CPMD:
• A Chinese pop MIDI dataset collected by the authors
• Data processing
(1) Melody Extraction:
• MIDI Miner [13] to recognize the melody track, or use flute as the melody
(2) Track Compression
• Other tracks ⇒ Compressed into five tracks: bass, drum, guitar, piano, and string [11]
(3) Data Filtration (see the sketch below)
• Filter out tracks which contain fewer than 20 notes
• ⇒ Keep (1) MIDIs which contain at least 3 tracks and (2) which contain a melody track and at least one other track
(4) Data Segmentation
• Only consider the 4/4 time signature
(5) Chord Recognition
• Infer two chords for each bar
Val: 100 samples / Test: 100 samples / Train: the remaining samples
16. Experimental Setup
• Model Configurations
Model:
• Recurrence Transformer Encoder + Recurrence Transformer Decoder
More details:
• Encoder layers (4), decoder layers (8), encoder heads (8), decoder heads (8)
• Hidden size of all layers and dimension of token, bar, and position embeddings: 512
• Training and Evaluation Setup
Default task:
• Generate five tracks (bass, piano, guitar, string, and drum) conditioned on melody and chord
Max # of generated bars:
• Set to 32
For inference:
• Stochastic sampling, as in most music generation systems [14, 15] (see the sketch below)
17. Experimental Setup
• Subjective Evaluation
What:
• Choose the musical piece you prefer in terms of overall harmony
Who:
• 15 participants in total (5 with an understanding of basic music theory)
How:
• Each participant listens to a total of 100 listening sets (one per test musical piece)
• Each set contains musical pieces from several settings (e.g., generated, ground truth)
18. Experimental Setup
• Objective Evaluation
Chord Accuracy (CA)
• Measures harmony (Higher score ⇒ Better harmony)
• Whether the chords of the generated tracks match the conditional chord sequence
Perplexity (PPL)
• How well a model fits the sequence (Lower perplexity ⇒ The model fits the sequence better)
Pitch (P), Velocity (V), Duration (D), and Inter-Onset Interval (IOI)
• Measure the difference between the generated musical piece and the ground-truth piece
by computing the average OA (Overlapped Area) of the attribute distributions (P, V, D, or IOI);
High OA = High similarity (see the sketch below)
• Pitch (P):
• Compute the distribution of pitch classes (Higher score ⇒ More similar to GT)
• Velocity (V):
• Quantize the note velocity into 32 classes (Higher score ⇒ More similar to GT)
• Duration (D):
• Quantize the duration into 32 classes (Higher score ⇒ More similar to GT)
• Inter-Onset Interval (IOI): The time between the beginning of one note and that of the next
• Quantize the intervals into 32 classes ⇒ Compute the distribution of interval classes
(Higher score ⇒ More similar to GT)
19. Results and Analyses
Overall Quality
• Goal
To evaluate the overall harmony and quality of the musical pieces generated by PopMAG
• How
GT vs. PopMAG, on all three datasets
• Results
42%, 38%, and 40% of PopMAG-generated musical pieces reach the quality of GT
20. Results and Analyses
• MuseGAN [11] vs. PopMAG
• What & How
Generate four tracks (guitar, drum, string, and bass)
conditioned on the piano track
4 bars of notes in target tracks / No chord condition / Velocity fixed to 100
• Results
PopMAG wins on all subjective and objective metrics
PopMAG can generate long musical pieces
<Figure 6> Subjective evaluations of several settings
21. Results and Analyses
• Comparison with Other MIDI Representations
• Goal
To analyze the effectiveness of MuMIDI representation
• How
PopMAG vs. REMI [15], MIDI-Like [14]
• Result
PopMAG: Better scores
(more harmonious musical pieces)
<Table 5> Result comparison among different settings of PopMAG on the LMD dataset
<Figure 6> Subjective evaluations of several settings
22. Results and Analyses
• Analyses on Note-Level Modeling
• Goal
To verify the effectiveness of the note-level modeling method
(modeling one note in one step)
• How
PopMAG vs. MIDI-Like [14], REMI [15]
• Result
PopMAG: Faster
• Shorter target token length
• Shorter training time & lower latency
23. Results and Analyses
• Analyses on Memory in the Encoder and Decoder
• Goal
To investigate the effectiveness of the context memory in the encoder and decoder
• How
PopMAG vs.
• PopMAG – DM – EM (#4):
• Removes memory in the encoder and decoder
• PopMAG – DM (#5):
• Removes memory in the decoder
• PopMAG – EM (#6):
• Removes memory in the encoder
• Results
PopMAG (#1) outperforms the others in all metrics
⇒ Context memory in the encoder & decoder improves performance
PopMAG – EM (#6) is better than PopMAG – DM (#5)
⇒ Memory in the decoder is more important
<Figure 6> Subjective evaluations of several settings
<Table 5> Result comparison among different settings of PopMAG on the LMD dataset
in the melody-to-others task
24. Results and Analyses
• Analyses on Bar and Position Embeddings
• Goal
To prove the effectiveness of bar and position embeddings
• How
PopMAG (#1) vs. PopMAG – POS – BAR (#7), + Sinusoidal (#8), + Relative Position Encoding (#9)
• Results
PopMAG outperforms Sinusoidal (#8) and Relative (#9)
⇒ Bar and position embeddings help model better capture the music structure
<Figure 6> Subjective evaluations of several settings
<Table 5> Result comparison among different settings of PopMAG on the LMD dataset
25. Results and Analyses
• Possible Future Extensions
(1) Generate multi-track accompaniments conditioned only on melody and chord
(2) Generate more tracks conditioned on other tracks (e.g., melody, chord, etc.)
(3) Recompose a song (remove or generate some tracks)
26. Conclusion & Future Work
• Main Contributions
Propose a novel Multi-track MIDI representation (MuMIDI)
• Enables simultaneous multi-track generation in a single sequence
• Explicitly models the dependency of the notes from different tracks
Conduct experiments
• On three datasets
• Compare with previous works, plus ablation studies
Show superior performance
• Opinions
Provides detailed explanations of the terms and concepts needed to understand the musical representation
Remaining questions about the objective evaluation metrics: how valid are they?
Effort on subjective evaluation
• Only one question asks about the quality (harmony) of the generated songs
Suggestion: present usage scenarios (how MuMIDI can help end-users) to make the motivation more persuasive