Video Background Music Generation with Controllable Music Transformer
1. Video Background Music Generation
with Controllable Music Transformer
ACM MM 2021 – Best Paper Award
Hyeshin Chu
2. • Overview
• Introduction
• Related Work
• Establishing Video-Music Rhythmic Relationships
• Controllable Music Transformer
• Experiments
• Conclusion
3. Overview
• Paper / Code / Project Page
• Contributions
Propose CMT, a Controllable Music Transformer
Generate melodious music for a given video
Consider the video-music rhythmic consistency
4. Introduction
• Motivation
Growing interest in editing short videos and sharing them on social platforms
Add background music ⇒ More attractive video
Find the best fit between music & video ⇒ Difficult and time-consuming
• Approach
Explore the rhythmic relationships between video & BGM
Free of reliance upon annotated training data
5. Introduction
• Establish three rhythmic relationships between video & BGM
Motion Speed and Simu-Note Density
• Motion speed: Magnitude of motion in a video
• Simu-note density: # of simu-notes per bar
Local-Maximum Motion Saliency with Simu-Note Strength
• Local-maximum motion saliency: Labels some rhythmic keyframes
• Simu-note strength: # of notes in a simu-note
Beat-timing Encoding
• Extract timing information from a video ⇒ used as positional encoding
• Music appears and disappears smoothly with the start and end of a video
*simu-note: a group of notes that start simultaneously
6. Related Work
• Representations of Music
MIDI-like event sequences [10], [16]
REMI [11]:
• A metrical structure in the input data
• Explicitly notes bars, beats, chords, and tempo
• Music Generation Models
Utilize autoencoders to generate symbolic music [25], [23], [19]
Transformer-based music generation models [10], [11], [3], [9]
• Composing Music from Silent Videos
From video clips containing people playing musical instruments [6], [21], [22]
No dataset (video-BGM pair) for general video
7. Establishing Video-Music Rhythmic Relationships
• Video Timing & Music Beat
• Motion Speed & Simu-note Density
• Motion Saliency & Simu-note Strength
[Figure: a video clip with fast motion maps to high simu-note density; a salient visual beat maps to large simu-note strength in the generated BGM]
8. Establishing Video-Music Rhythmic Relationships
• To build rhythmic relationship between the video and music
• A video contains T frames
• Convert the tth frame to its beat number
• Convert the ith beat to video frame number (inverse function)
*Tempo: the speed at which BGM is played
*FPS: short for frames per second (an intrinsic attribute of the video)
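The timing conversion above can be sketched in a few lines; `frame_to_beat` and `beat_to_frame` are illustrative names, assuming a fixed tempo (in BPM) and a fixed FPS:

```python
def frame_to_beat(t, fps, tempo):
    """Map the t-th video frame to its beat number: elapsed time is
    t / fps seconds, and the music advances tempo / 60 beats per second."""
    return int(t / fps * tempo / 60)


def beat_to_frame(i, fps, tempo):
    """Inverse function: the frame number at which the i-th beat falls."""
    return int(i * 60 / tempo * fps)


# At 30 FPS and 120 BPM, one beat lasts 0.5 s = 15 frames.
print(frame_to_beat(45, fps=30, tempo=120))  # → 3
print(beat_to_frame(3, fps=30, tempo=120))   # → 45
```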
9. Establishing Video-Music Rhythmic Relationships
• Optical flow magnitude
• Motion Speed
• Simu-note
• Bar
• Simu-note density
• The entire video is split into M clips
*T: total # of frames in a video
*S: set to 4 ⇒ Each clip corresponds to 4 beats (one bar) in the music
10. Establishing Video-Music Rhythmic Relationships
*Optical Flow
• Measure the displacement of individual pixels
between two consecutive video frames
• Analyze video motion
The average absolute optical flow is used to measure the motion magnitude in the tth frame
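A minimal sketch of this per-frame measure, assuming the dense flow field for each frame is already available as an H × W grid of (dx, dy) displacements from some optical-flow estimator (the estimator itself is not reproduced here):

```python
def flow_magnitude(flow):
    """Average absolute optical flow over all pixels of one frame.
    flow: H x W grid of per-pixel (dx, dy) displacements between
    frame t-1 and frame t."""
    vals = [abs(c) for row in flow for px in row for c in px]
    return sum(vals) / len(vals)


# Toy flow field: every pixel moves by (3, -1) pixels.
flow = [[(3.0, -1.0)] * 4 for _ in range(4)]
print(flow_magnitude(flow))  # → 2.0
```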
11. Establishing Video-Music Rhythmic Relationships
Motion speed of the mth video clip: the average optical-flow magnitude over its frames
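Averaging the per-frame magnitudes over each clip gives that clip's motion speed; a sketch with illustrative names:

```python
def motion_speed(frame_magnitudes, m, frames_per_clip):
    """Motion speed of the m-th clip: the average per-frame optical-flow
    magnitude over the frames belonging to that clip."""
    start = m * frames_per_clip
    clip = frame_magnitudes[start:start + frames_per_clip]
    return sum(clip) / len(clip)


mags = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(motion_speed(mags, m=1, frames_per_clip=3))  # → 5.0
```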
12. Establishing Video-Music Rhythmic Relationships
*i: ith bar
*j: jth tick (4 ticks == 1 beat)
*k: the instrument
*n: a single note
Simu-note: a group of notes; used to connect with the motion speed
13. Establishing Video-Music Rhythmic Relationships
We divide a bar into 16 ticks ⇒ j = 1, 2, … 16
Bar: a group of non-empty simu-notes; used to connect with the motion speed
14. Establishing Video-Music Rhythmic Relationships
Simu-note density: # of simu-notes per bar; connected with the magnitude of motion in the video
*simu-note: a group of notes that start simultaneously
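The simu-note and density definitions above can be sketched as a grouping of note onsets; the `(bar, tick, pitch)` tuple format is an assumption for illustration, not the paper's token format:

```python
from collections import defaultdict


def simu_note_density(notes):
    """notes: iterable of (bar, tick, pitch) onsets. A simu-note is the
    group of notes sharing one onset tick inside a bar, and the density
    of a bar is its number of non-empty simu-notes."""
    onset_ticks = defaultdict(set)
    for bar, tick, _pitch in notes:
        onset_ticks[bar].add(tick)
    return {bar: len(ticks) for bar, ticks in onset_ticks.items()}


notes = [(0, 1, 60), (0, 1, 64), (0, 5, 67),  # bar 0: two simu-notes
         (1, 1, 62)]                          # bar 1: one simu-note
print(simu_note_density(notes))  # → {0: 2, 1: 1}
```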
15. Establishing Video-Music Rhythmic Relationships
• Visual Beats
• Simu-note Strength
Motion Saliency at the tth frame
: Average positive change of optical flow
16. Establishing Video-Music Rhythmic Relationships
• Visual Beats
• Simu-note Strength
Frames with both
• local-maximum motion saliency; and
• a near-constant tempo
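A sketch of the saliency and visual-beat steps above, assuming per-pixel flow magnitudes are precomputed; the near-constant-tempo filtering is omitted, and the function names are illustrative:

```python
def motion_saliency(prev_mag, curr_mag):
    """Positive change of per-pixel flow magnitude between two consecutive
    frames, averaged over all pixels; negative changes are clipped to zero."""
    diffs = [max(c - p, 0.0) for p, c in zip(prev_mag, curr_mag)]
    return sum(diffs) / len(diffs)


def visual_beat_candidates(saliency):
    """Frames whose saliency exceeds both neighbours: candidate visual
    beats (before the near-constant-tempo filtering)."""
    return [t for t in range(1, len(saliency) - 1)
            if saliency[t] > saliency[t - 1] and saliency[t] > saliency[t + 1]]


sal = [0.1, 0.5, 0.2, 0.3, 0.9, 0.4]
print(visual_beat_candidates(sal))  # → [1, 4]
print(motion_saliency([1.0, 2.0], [2.0, 1.0]))  # → 0.5
```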
17. Establishing Video-Music Rhythmic Relationships
• Visual Beats
• Simu-note Strength
The number of notes in a simu-note
• Denote the richness of an extended chord or harmony
• Give a rhythmic feeling
• Higher simu-note strength ⇒ More auditory impact
18. Controllable Music Transformer
• Extract rhythmic features from both video and MIDI
• Training stage:
Rhythmic features extracted from the MIDI music
• Inference stage:
Rhythmic features extracted from the video
19. • Following PopMAG [18] & Compound Word Transformer (CWT) [9]
• Consider seven kinds of attributes:
Type, beat/bar, density, strength, instrument, pitch, and duration
Attributes fall into a rhythm-related group and a note-related group
Type: distinguishes the two groups
Controllable Music Transformer
Music Representation Control Sequence Modeling
20. Controllable Music Transformer
• Density Replacement
Make the note density of the music match the motion speed of the video
• Strength Replacement
Use beat token of the given visual beat
• Hyper-parameter C for Control Degree
Trades off compatibility of the music with the video vs. melodiousness of the music
• Beat-timing Encoding
Smoother beginning and ending of music
• Genre and Instrument Type
6 genres (country, dance, electronic, metal, pop, and rock)
5 instruments (drums, piano, guitar, bass, and strings)
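One way to picture the density-replacement step above (an illustrative sketch, not the authors' exact mapping): rank each clip's motion speed and spread the ranks over the density vocabulary, then substitute the resulting token during decoding:

```python
def speed_to_density_level(motion_speeds, n_levels=4):
    """Rank each clip's motion speed and spread the ranks over n_levels
    density levels, so faster clips receive denser-music tokens."""
    order = sorted(range(len(motion_speeds)), key=lambda i: motion_speeds[i])
    rank = [0] * len(motion_speeds)
    for r, i in enumerate(order):
        rank[i] = r          # 0 = slowest clip
    return [r * n_levels // len(motion_speeds) for r in rank]


print(speed_to_density_level([0.2, 1.5, 0.9, 2.4], n_levels=4))  # → [0, 2, 1, 3]
```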
21. Controllable Music Transformer
• Music Transformer [24], linear transformer [12]
• Predict seven kinds of attributes in order:
Type, beat/bar, density, strength,
instrument, pitch, and duration
22. Experiments
• Lakh Pianoroll Dataset (LPD) [4]
Five instruments (drums, piano, guitar, bass and strings)
Six genres (country, dance, electronic, metal, pop, and rock)
• Implementation Details
Choose the embedding size for each attribute based on its vocabulary size
e.g. embedding sizes (32, 64, 64, 512, 128, 32, 64) for (type, beat, density, strength, pitch, duration, instrument)
Dataset Implementation Details Objective Evaluation
Subjective Evaluation Controlling Accuracy Visualization
23. Experiments
• Pitch Class Histogram Entropy
Music’s quality of tonality
• Grooving Pattern Similarity
Music’s rhythmicity
• Structureness Indicator
Music’s repetitive structure
• Overall Rank
Mean of rankings
*Better music: Closer to the real data
24. Experiments
• Participants
13 with basic understanding of music
• Questionnaire
Takes about 10 minutes
• Procedure
Listen to several pieces of music (in random order)
corresponding to the input video
Rate based on subjective metrics
• Richness
• Correctness
• Structuredness
• Rhythmicity
• Correspondence
• Structure Consistency
• Overall Rank
25. Experiments
• Richness
Music diversity and interestingness
• Correctness
Perceived absence of missing notes or playing mistakes
• Structuredness
Whether there are structural patterns
• Rhythmicity
How well the rhythm of the generated music matches the motion of the video
• Correspondence
How well the major stresses and boundaries of the music match the boundaries of the video
• Structure Consistency
Whether the start and end of the music match up with those of the video
• Overall Rank
The mean of the rankings
26. Experiments
• Different levels of hyper-parameter C
Compatibility of music with the video
vs. melodiousness of music
27. Experiments
• Ablation Study
Procedure
• Choose three video clips from different categories
(edited, unedited, and animation video)
• Provide the generated music of ours, the baselines, and the matched music in the questionnaire
Result
• Overall, Ours >> Matched
28. • Controlling Accuracy
Calculate the accuracy of three controlling attributes
L2 distance between the rhythmic features from the video and from the generated music
• Visualization
Loss curves for rhythmic-related attributes and note-related attributes
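The controlling-accuracy metric above can be sketched as an L2 distance between per-bar rhythmic feature sequences from the video and from the generated music (illustrative feature values):

```python
import math


def control_l2(video_feats, music_feats):
    """L2 distance between per-bar rhythmic features extracted from the
    video and from the generated music; lower means tighter control."""
    return math.sqrt(sum((v - m) ** 2 for v, m in zip(video_feats, music_feats)))


# Per-bar density targets from the video vs. densities of the generated music.
print(control_l2([3, 5, 2, 4], [3, 4, 2, 4]))  # → 1.0
```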
Experiments
29. Conclusion
• Contribution
Address an unexplored task: video background music generation
Establish rhythmic relationships between video and BGM
Propose controllable music transformer
No need to acquire video-music paired dataset
• Future Work
Explore more abstract connections between visuals and music (e.g., emotion)
Utilize music in the waveform domain
30. For My Research
• Study how CMT builds on
Compound Word Transformer [9], Music Transformer [24], and PopMAG [18]
• Which visual feature represents emotion?
• Consider both mood and sentiment of video
When emotion of a person does not align with the overall mood
• Objective & Subjective evaluation metric
Created by the authors (statistical analysis, no reference)
⇒ Draw up my own metrics and conduct the statistical analysis
36 participants using questionnaire ⇒ In-depth interview?
• Check the performance of this model in various video categories
Demos and explanations in the paper mostly focus on travel and sports videos