Foley Music: Learning to Generate Music from Videos
Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba
ECCV 2020
Presented by Hyeshin Chu, 2021. 04. 16
Overview of the Paper
• Foley Music
A system that learns to generate music by watching and listening to large-scale music performance videos
http://foley-music.csail.mit.edu/
• Task
To learn a mapping between audio and visual signals from unlabeled videos
• The challenges
To build a visual perception module that recognizes the physical interactions between the musical instrument and the player's body from videos
To design an audio representation that
• respects the major musical rules about structure and dynamics
• is easy to predict from visual signals
To build a model that associates these two modalities and accurately predicts music from videos
Overview of the Paper
• Identify two key elements of a successful video-to-music generator that address these challenges
For the visual perception part
• Extract keypoints of the human body and hand fingers from video frames as intermediate visual representations
• Explicitly model the body parts and hand movements
For the music
• Use Musical Instrument Digital Interface (MIDI)
• Encodes timing and loudness information for each note event
• Advantages of using MIDI
• MIDI events capture the expressive timing and dynamics information contained in music
• Easy to fit into machine learning models
• Fully interpretable and flexible
• Easily converted to realistic music with a standard audio synthesizer
Overview of the Paper
• Main Contributions
Present a model that generates synchronized and expressive music from videos
Propose body keypoints and MIDI events as intermediate representations for transferring knowledge across the two modalities
Empirically demonstrate that such representations are key to success
Foley Music outperforms previous SOTA systems on music generation from videos by a large margin
Demonstrate that the MIDI representation facilitates new applications such as generating music in different styles
Related Work
• Synchronization between vision and sound
Used sound clusters as supervision to learn visual feature representations from unlabeled training videos [44]
Jointly learn the visual and audio representation using a visual-audio correspondence task [2, 33]
Recent works on
• Localizing sound source in images or videos [29, 26, 3, 48, 64]
• Biometric matching [39]
• Visual-guided sound source separation [64, 15, 19, 60]
• Auditory inpainting [66]
• Emotion recognition [1], etc.
(Related-work areas: Audio-Visual Learning · Motion and Sound · Music Generation · Sound Generation from Videos)
Related Work
• Correlations between sound and motion
The associations between speech and facial movements [31, 55]
Generating high-quality talking faces from audio [54, 30]
Separating mixed speech signals of multiple speakers [14, 42]
• Correlations between body motions and sound
Predicting gestures from speech [22]
Predicting body dynamics from music [50]
Identifying a melody through body language [15]
• ⇒ This paper mainly focuses on generating music from videos according to body motions
Related Work
• Deep neural network models
MelodyRNN [59] & DeepBach [24]
• Generate realistic melodies and Bach chorales
WaveNet [40]
• Shows promising results in generating realistic speech and music
Song from PI [11]
• Uses a hierarchical RNN model to simultaneously generate melody, drums, and chords for a pop song
Music Transformer [28]
• Generates expressive piano music from MIDI events
MAESTRO dataset [25]
• Enables factorized piano music modeling and generation
⇒ Little work on exploring the problem of generating expressive music from videos
Related Work
• Foley, a technique invented by Jack Foley in the 1920s
Follow-up works investigate the task of predicting the sound emitted by interacting objects [43]
• Methods using neural networks
Conditional generative adversarial networks for lab-collected music performance videos [10]
A SampleRNN-based method that directly generates a waveform from an unconstrained in-the-wild video dataset (10 types of sound) [68]
A perceptual loss to improve audio-visual semantic alignment [9]
⇒ In contrast, this method uses MIDI for music transcription and generation, instead of spectrograms or waveforms as the audio representation
Approach
• The model architecture consists of three components
A visual encoder
• Takes video frames to extract keypoint coordinates
• Uses a GCN to capture the body dynamics and produce a latent representation over time
A MIDI decoder
• Takes the video sequence representation and generates a sequence of MIDI events
An audio synthesizer
• Converts the MIDI events to a waveform
(Approach: Visual and Audio Representations · Body Motions to MIDI Predictions · Training and Inference)
Approach
• Visual Representations
The limitations of existing works
• Limited ability in applications that require capturing fine-grained correlations between motion and sound
The method uses human pose features to capture body motion cues
• ① Detecting the human body and hand keypoints from each video frame
• ② Stacking their 2D coordinates over time as structured visual representations
In practice
• ① The open-source OpenPose toolbox [6]
• ⇒ Extract the 2D coordinates of human body joints
• ② OpenPose [6] and hand API [51]
• ⇒ Predict the coordinates of hand keypoints
⇒ This yields 25 keypoints for the human body and 21 keypoints for each hand (see the stacking sketch below)
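The slides contain no code; the following is a minimal sketch of how the per-frame OpenPose detections could be stacked into a (T, 67, 2) keypoint array as described above, assuming the standard OpenPose JSON output layout and a single detected performer (the file naming is illustrative).

```python
import json
import numpy as np

def load_keypoints(json_paths):
    """Stack per-frame OpenPose detections into a (T, 67, 2) array.

    Assumes one OpenPose JSON file per frame containing 25 body keypoints and
    21 keypoints per hand, each stored as flattened [x, y, confidence] triplets.
    """
    frames = []
    for path in json_paths:
        with open(path) as f:
            person = json.load(f)["people"][0]  # first (only) detected performer
        parts = np.concatenate([
            np.array(person["pose_keypoints_2d"]).reshape(-1, 3),        # 25 body joints
            np.array(person["hand_left_keypoints_2d"]).reshape(-1, 3),   # 21 left-hand joints
            np.array(person["hand_right_keypoints_2d"]).reshape(-1, 3),  # 21 right-hand joints
        ])
        frames.append(parts[:, :2])  # keep (x, y), drop the confidence column
    return np.stack(frames)          # shape (T, 67, 2)
```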
Approach
• Audio Representations
The limitations of existing works
• They do not work well for generating realistic music from videos
• This is because ...
• music is highly compositional and contains many structured events
• machine learning models can hardly discover these rules directly from raw music
The paper uses the Musical Instrument Digital Interface (MIDI) as the audio representation
• MIDI is composed of timed note-on and note-off events
• Each event defines the note pitch
• Note-on events contain additional velocity information (indicating how strongly the note was played)
Approach
• Audio Representations
① Detect MIDI events from the audio track of the videos using music transcription software
• A 6-second video clip contains around 500 MIDI events (the length varies across pieces)
② Generate expressive timing information for music modeling
• Using a music performance encoding [41]
• A vocabulary of 88 note-on events, 88 note-off events, 32 velocity bins, and 32 time-shift events (see the encoding sketch below)
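As an illustration of the performance encoding above, the sketch below maps MIDI notes to the 240-token vocabulary (88 note-on, 88 note-off, 32 velocity bins, 32 time-shift bins) using pretty_midi. The piano pitch range (21-108), the time-shift granularity, and the single-instrument assumption are illustrative choices, not values taken from the paper.

```python
import pretty_midi

# Token layout (240 tokens), in the spirit of the performance encoding in [41]:
#   0..87    note-on  (assumed piano pitch range 21..108)
#   88..175  note-off
#   176..207 velocity bins (32 bins over MIDI velocity 0..127)
#   208..239 time-shift bins (assumed 31.25 ms per bin, i.e. up to 1 s per token)
NOTE_ON, NOTE_OFF, VELOCITY, TIME_SHIFT = 0, 88, 176, 208

def midi_to_events(path, shift_sec=0.03125, max_shift_bins=32):
    pm = pretty_midi.PrettyMIDI(path)
    # Collect note boundaries (time, kind, pitch, velocity) from the first instrument
    changes = []
    for note in pm.instruments[0].notes:
        changes.append((note.start, "on", note.pitch, note.velocity))
        changes.append((note.end, "off", note.pitch, 0))
    changes.sort(key=lambda c: c[0])

    events, clock = [], 0.0
    for time, kind, pitch, velocity in changes:
        # Emit time-shift tokens until the clock reaches this note boundary
        steps = int(round((time - clock) / shift_sec))
        while steps > 0:
            s = min(steps, max_shift_bins)
            events.append(TIME_SHIFT + s - 1)
            steps -= s
        clock = time
        if kind == "on":
            events.append(VELOCITY + velocity * 32 // 128)  # quantize into 32 velocity bins
            events.append(NOTE_ON + pitch - 21)
        else:
            events.append(NOTE_OFF + pitch - 21)
    return events
```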
Approach
• Overall Architecture
The authors build a Graph-Transformer module
• To model the correlations between the human body and hand movements and the MIDI events
① Adopt a spatial-temporal graph convolutional network over body keypoint coordinates over time to capture body motions (a minimal sketch follows below)
② Feed the encoded pose features to a Music Transformer decoder to generate a sequence of MIDI events
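A minimal sketch of one spatial-temporal graph convolution block over the pose keypoints, in the spirit of the graph encoder named above; the layer sizes, the single fixed adjacency matrix, and the missing normalization are simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class STGraphConvBlock(nn.Module):
    """Spatial-temporal graph convolution over a skeleton sequence.

    x:   (batch, channels, time, joints) keypoint features
    adj: (joints, joints) adjacency tensor of the skeleton graph
    """
    def __init__(self, in_channels, out_channels, adj, temporal_kernel=9):
        super().__init__()
        self.register_buffer("adj", adj)                   # fixed skeleton graph
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.spatial(x)                                # mix feature channels per joint
        x = torch.einsum("nctv,vw->nctw", x, self.adj)     # aggregate over neighbouring joints
        x = self.temporal(x)                               # mix information along the time axis
        return self.relu(x)
```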
Approach
• Visual Encoder
① Extract 2D keypoint coordinates from the raw videos
② Adopt a graph CNN to model the spatial-temporal relationships among the different keypoints on the body and hands (the human skeleton sequence)
• MIDI Decoder
The music signal is a sequence of MIDI events
Consider music generation from body motions as a sequence prediction problem
Utilize the Transformer model [28]
• An encoder-decoder based autoregressive generative model (see the decoding sketch below)
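A hedged sketch of a pose-conditioned MIDI decoder and a greedy autoregressive generation loop. It uses PyTorch's vanilla Transformer decoder as a stand-in for the Music Transformer (which additionally uses relative attention); the vocabulary size, the extra start token, and all hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn

VOCAB = 240 + 1   # 240 performance-event tokens plus an assumed <sos> token
SOS = 240

class MidiDecoder(nn.Module):
    """Maps encoded pose features to a sequence of MIDI-event logits."""
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, pose_feats, events):
        # pose_feats: (B, T_video, d_model) from the graph encoder
        # events:     (B, T_midi) previously generated / ground-truth tokens
        mask = nn.Transformer.generate_square_subsequent_mask(events.size(1)).to(events.device)
        h = self.decoder(self.embed(events), pose_feats, tgt_mask=mask)
        return self.out(h)            # (B, T_midi, VOCAB) logits for the next event

@torch.no_grad()
def generate(model, pose_feats, max_len=500):
    """Greedy autoregressive generation of MIDI events from pose features."""
    tokens = torch.full((pose_feats.size(0), 1), SOS,
                        dtype=torch.long, device=pose_feats.device)
    for _ in range(max_len):
        logits = model(pose_feats, tokens)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]              # drop the start token
```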
Approach
• Training
Objective:
• Minimize the cross-entropy loss between the predicted and the target sequence of MIDI events
Input:
• 2D coordinates of the human skeleton
At each generation step
• Predict the next MIDI event conditioned on the current MIDI events and the pose features
• ⇒ Repeating this step, the model generates the full sequence of MIDI events (a training-step sketch follows below)
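A one-function sketch of the teacher-forced training step described above, reusing the MidiDecoder sketch from the previous slide; the shift-by-one input/target split is the usual convention and an assumption here.

```python
import torch.nn.functional as F

def training_step(model, pose_feats, midi_events):
    """One teacher-forced step: predict event t+1 from events up to t and the pose features."""
    inputs, targets = midi_events[:, :-1], midi_events[:, 1:]
    logits = model(pose_feats, inputs)                     # (B, T-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```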
Experiments
• Experimental Setup
Datasets
• Three video datasets of music performances
• URMP [34] for the ablation study
• A high-quality multi-instrument video dataset recorded in a studio
• Provides a MIDI file for each recorded video
• AtinPiano [64] for comparison with SOTA models
• A YouTube channel of piano video recordings, with the camera looking down on the keyboard and hands
• MUSIC [64] for comparison with SOTA models
• An untrimmed video dataset collected by querying keywords on YouTube
• 1000 music performance videos, 11 categories
Implementation Details
• OpenPose [6]
• To extract the coordinates of body and hand keypoints for each frame
• Pre-processing
• Extract MIDI events from audio recordings
• Training
• Randomly sample a 6-second video clip from the dataset
Experiments
• Comparison with State-of-the-arts
(Comparison: Baselines · Qualitative Evaluation with Human Study · Visualization Evaluation)
• Baseline
SampleRNN, WaveNet, GAN-based
• Qualitative Evaluation with Human Study
Note that:
• The quality of the generated sound can be very subjective
• ⇒ Conduct a qualitative study on Amazon Mechanical Turk (AMT)
Correctness
• Which music recording is more relevant to the video content
Least noise
• Which music recording has the least noise
Synchronization
• Which music recording aligns best temporally with the video content
Overall
• Which sound they prefer to listen to overall
9 instruments from the MUSIC and AtinPiano datasets are used to compare against previous systems (accordion, bass, bassoon, cello, guitar, piano, tuba, ukulele, violin)
Experiments
• Comparison with State-of-the-arts
• Compare the MIDI prediction results with the ground truth
The predicted MIDI events are reasonably similar to the ground truth
Experiments
• Comparison with State-of-the-arts
• Sound spectrograms generated by different approaches
The model generates more structured harmonic components than the baselines
Experiments
• Comparison with State-of-the-arts
• Qualitative evaluation with a real-or-fake study
To assess whether the generated audio can fool people into thinking it is real
① AMT turkers are shown two versions of a video with
• Real audio (originally belonging to the video)
• Fake audio (generated by the model)
② Turkers choose the video that they think is real
Evaluation criteria
• Synchronization, artifacts, and noise
Experiments
• Comparison with State-of-the-arts
• Quantitative Evaluation with Automatic Metrics
Goal
• To evaluate the diversity of generated sound
Metric:
• The Number of Statistically-Different Bins (NDB); the lower, the better
• NDB counts the bins (clusters) in which the number of training samples differs significantly from the number of testing samples
Process (see the sketch below)
• ① Transform each sound into a log-spectrogram
• ② Cluster the training-set spectrograms using the k-means algorithm (k = 50)
• ③ Assign each generated sound in the testing set to the nearest cluster
Result
• The model achieves a significantly lower NDB
• ⇒ It generates more diverse sound
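A rough sketch of how the NDB metric could be computed from log-spectrograms, using scikit-learn's k-means and a per-bin two-proportion z-test; the significance test and threshold follow the common NDB formulation and may differ in detail from the exact setup used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def ndb_score(train_specs, test_specs, k=50):
    """Number of Statistically-Different Bins (lower = closer distributions).

    train_specs / test_specs: 2D arrays with one flattened log-spectrogram per row.
    """
    km = KMeans(n_clusters=k, n_init=10).fit(train_specs)
    n_train, n_test = len(train_specs), len(test_specs)
    p_train = np.bincount(km.labels_, minlength=k) / n_train
    p_test = np.bincount(km.predict(test_specs), minlength=k) / n_test

    # Two-proportion z-test per bin
    p = (p_train * n_train + p_test * n_test) / (n_train + n_test)
    se = np.sqrt(p * (1 - p) * (1 / n_train + 1 / n_test))
    z = np.abs(p_train - p_test) / np.maximum(se, 1e-8)
    return int(np.sum(z > 1.96))      # bins that differ at the two-sided 5% level
```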
Experiments
• Ablation Study
(Ablation: the effectiveness of body motions · the effectiveness of the Music Transformer)
• Overall
Goal:
• To assess the impact of each component of the model
Dataset:
• 5 instruments from URMP (violin, viola, cello, trumpet, and flute) for quantitative evaluations
Experiments
• Ablation Study
The effectiveness of Body Motions
• Goal
To understand the contribution of the model's visual representation (skeleton)
• Explicit body motions, represented through keypoint-based structural representations, guide the music generation
• Metric
Negative log-likelihood (NLL) loss (the lower, the better)
• Process
Replace the keypoint-based structural representation (skeleton) with
• RGB images and
• optical flow representations
• Result
The keypoint-based representation achieves better MIDI prediction accuracy than the other options
Experiments
• Ablation Study
The effectiveness of the Music Transformer
• Goal
To verify the efficacy of the Music Transformer framework for sequence prediction
• Metric
Negative log-likelihood (NLL) loss (the lower, the better)
• Process
Replace the Music Transformer module with a GRU, keeping the other parts of the pipeline the same
• Result
The Music Transformer captures long-term dependencies in music better than the GRU
Experiments
• Music Editing with MIDI
The system performs music editing by manipulating the predicted MIDI file
Fig. 6 demonstrates the flexibility of MIDI representations
• Manipulate the key of the predicted MIDI
Result
• The model is capable of generating music in different styles
• It enables new applications in controllable music generation, which are not available when using a waveform or
spectrogram as the audio representation (a transposition sketch follows below)
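As a concrete illustration of this kind of MIDI-level editing, the sketch below transposes every note of a predicted MIDI file by a fixed number of semitones using pretty_midi; the file names are placeholders.

```python
import pretty_midi

def transpose_midi(in_path, out_path, semitones):
    """Shift every note of a predicted MIDI file by `semitones` to change its key."""
    pm = pretty_midi.PrettyMIDI(in_path)
    for instrument in pm.instruments:
        if instrument.is_drum:                       # drum tracks are unpitched; skip them
            continue
        for note in instrument.notes:
            note.pitch = min(127, max(0, note.pitch + semitones))
    pm.write(out_path)

# Example: transpose the predicted MIDI up a perfect fourth (5 semitones)
# transpose_midi("predicted.mid", "predicted_up4.mid", 5)
```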
Conclusion and Future Work
• The Foley Music system
Generates expressive music from videos
Takes video as input
Detects human skeletons
Recognizes interactions with musical instruments over time
Predicts the corresponding MIDI files
Generates music with different styles through the MIDI representations
• Future Work
Further study of the connections between video and music