Foley Music: Learning to Generate Music from Videos
Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba
ECCV 2020
Presented by Hyeshin Chu, 2021. 04. 16
Overview of the Paper
• Foley Music
A system that learns to generate music by watching and listening to large-scale music performance videos
http://foley-music.csail.mit.edu/
• Task
To learn a mapping between audio and visual signals from unlabeled videos
• The challenges
To build a visual perception module that recognizes the physical interactions between the musical instrument and the player's body from videos
To design an audio representation that
• respects the major musical rules about structure and dynamics
• is easy to predict from visual signals
To build a model that associates these two modalities and accurately predicts music from videos
Overview of the Paper
• Identify two key elements of a successful video-to-music generator that address these challenges
For the visual perception part
• Extract keypoints of the human body and hand fingers from video frames as intermediate visual representations
• Explicitly model the body parts and hand movements
For the music
• Use Musical Instrument Digital Interface (MIDI)
• Encodes timing and loudness information for each note event
• Advantages of using MIDI
• MIDI events capture the expressive timing and dynamics information contained in music
• Easy to fit into machine learning models
• Fully interpretable and flexible
• Easily converted to realistic music with a standard audio synthesizer
Overview of the Paper
• Main Contributions
Present a model that generates synchronized and expressive music from videos
Propose body keypoints and MIDI events as intermediate representations for transferring knowledge across the two modalities
Empirically demonstrate that such representations are key to success
Foley Music outperforms previous SOTA systems on music generation from videos by a large margin
Demonstrate that the MIDI representation facilitates new applications such as generating music in different styles
Related Work
• Synchronization between vision and sound
Used sound clusters as supervision to learn visual feature representations from unlabeled training videos [44]
Jointly learn the visual and audio representation using a visual-audio correspondence task [2, 33]
Recent works on
• Localizing sound source in images or videos [29, 26, 3, 48, 64]
• Biometric matching [39]
• Visual-guided sound source separation [64, 15, 19, 60]
• Auditory inpainting [66]
• Emotion recognition [1], etc.
(Related-work areas: Audio-Visual Learning · Motion and Sound · Music Generation · Sound Generation from Videos)
Related Work
• Correlations between sound and motion
The associations between speech and facial movements [31, 55]
Generating high-quality talking faces from audio [54, 30]
Separating mixed speech signals of multiple speakers [14, 42]
• Correlations between body motions and sound
Predicting gestures from speech [22]
Predicting body dynamics from music [50]
Identifying a melody through body language [15]
• ⇒ This paper mainly focuses on generating music from videos according to body motions
Related Work
• Deep neural network models
MelodyRNN [59] & DeepBach [24]
• Generate realistic melodies and Bach chorales
WaveNet [40]
• Shows promising results in generating realistic speech and music
Song from PI [11]
• Uses a hierarchical RNN model to simultaneously generate melody, drums, and chords for a pop song
Music Transformer [28]
• Generates expressive piano music from MIDI events
MAESTRO dataset [25]
• Enables factorized piano music modeling and generation
⇒ Little work on exploring the problem of generating expressive music from videos
Related Work
• Foley, a technique invented by Jack Foley in the 1920s
Follow-up works investigate the task of predicting the sound emitted by interacting objects [43]
• Methods using neural networks
Conditional generative adversarial networks for lab-collected music performance videos [10]
A SampleRNN-based method that directly generates a waveform from an unconstrained in-the-wild video dataset (10 types of sound) [68]
A perceptual loss to improve audio-visual semantic alignment [9]
⇒ In contrast, this method uses MIDI for music transcription and generation, instead of spectrograms or waveforms as the audio representation
Approach
• The model architecture consists of three components
A visual encoder
• Takes video frames to extract keypoint coordinates
• Uses a GCN to capture the body dynamics and produce a latent representation over time
A MIDI decoder
• Takes the video sequence representation and generates a sequence of MIDI events
An audio synthesizer
• Converts the MIDI events to a waveform
(Approach: Visual and Audio Representations · Body Motions to MIDI Predictions · Training and Inference)
Approach
• Visual Representations
The limitations of existing works
• Limited ability in applications that require capturing fine-grained correlations between motion and sound
The method uses human pose features to capture body motion cues
• ① Detecting the human body and hand keypoints from each video frame
• ② Stacking their 2D coordinates over time as structured visual representations
In practice
• ① The open-source OpenPose toolbox [6]
• ⇒ Extract the 2D coordinates of human body joints
• ② OpenPose [6] and hand API [51]
• ⇒ Predict the coordinates of hand keypoints
⇒ This yields 25 keypoints for the human body and 21 keypoints for each hand (see the stacking sketch below)
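The slides contain no code; the following is a minimal sketch of how the per-frame OpenPose detections could be stacked into a (T, 67, 2) keypoint array as described above, assuming the standard OpenPose JSON output layout and a single detected performer (the file naming is illustrative).

```python
import json
import numpy as np

def load_keypoints(json_paths):
    """Stack per-frame OpenPose detections into a (T, 67, 2) array.

    Assumes one OpenPose JSON file per frame containing 25 body keypoints and
    21 keypoints per hand, each stored as flattened [x, y, confidence] triplets.
    """
    frames = []
    for path in json_paths:
        with open(path) as f:
            person = json.load(f)["people"][0]  # first (only) detected performer
        parts = np.concatenate([
            np.array(person["pose_keypoints_2d"]).reshape(-1, 3),        # 25 body joints
            np.array(person["hand_left_keypoints_2d"]).reshape(-1, 3),   # 21 left-hand joints
            np.array(person["hand_right_keypoints_2d"]).reshape(-1, 3),  # 21 right-hand joints
        ])
        frames.append(parts[:, :2])  # keep (x, y), drop the confidence column
    return np.stack(frames)          # shape (T, 67, 2)
```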
Approach
• Audio Representations
The limitations of existing works
• They do not work well for generating realistic music from videos
• This is because ...
• music is highly compositional and contains many structured events
• machine learning models can hardly discover these rules directly from raw music
The paper uses the Musical Instrument Digital Interface (MIDI) as the audio representation
• MIDI is composed of timed note-on and note-off events
• Each event defines the note pitch
• Note-on events contain additional velocity information (indicating how strongly the note was played)
Approach
• Audio Representations
① Detect MIDI events from the audio track of the videos using music transcription software
• A 6-second video clip contains around 500 MIDI events (the length varies across pieces)
② Generate expressive timing information for music modeling
• Using a music performance encoding [41]
• A vocabulary of 88 note-on events, 88 note-off events, 32 velocity bins, and 32 time-shift events (see the encoding sketch below)
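As an illustration of the performance encoding above, the sketch below maps MIDI notes to the 240-token vocabulary (88 note-on, 88 note-off, 32 velocity bins, 32 time-shift bins) using pretty_midi. The piano pitch range (21-108), the time-shift granularity, and the single-instrument assumption are illustrative choices, not values taken from the paper.

```python
import pretty_midi

# Token layout (240 tokens), in the spirit of the performance encoding in [41]:
#   0..87    note-on  (assumed piano pitch range 21..108)
#   88..175  note-off
#   176..207 velocity bins (32 bins over MIDI velocity 0..127)
#   208..239 time-shift bins (assumed 31.25 ms per bin, i.e. up to 1 s per token)
NOTE_ON, NOTE_OFF, VELOCITY, TIME_SHIFT = 0, 88, 176, 208

def midi_to_events(path, shift_sec=0.03125, max_shift_bins=32):
    pm = pretty_midi.PrettyMIDI(path)
    # Collect note boundaries (time, kind, pitch, velocity) from the first instrument
    changes = []
    for note in pm.instruments[0].notes:
        changes.append((note.start, "on", note.pitch, note.velocity))
        changes.append((note.end, "off", note.pitch, 0))
    changes.sort(key=lambda c: c[0])

    events, clock = [], 0.0
    for time, kind, pitch, velocity in changes:
        # Emit time-shift tokens until the clock reaches this note boundary
        steps = int(round((time - clock) / shift_sec))
        while steps > 0:
            s = min(steps, max_shift_bins)
            events.append(TIME_SHIFT + s - 1)
            steps -= s
        clock = time
        if kind == "on":
            events.append(VELOCITY + velocity * 32 // 128)  # quantize into 32 velocity bins
            events.append(NOTE_ON + pitch - 21)
        else:
            events.append(NOTE_OFF + pitch - 21)
    return events
```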
Approach
• Overall Architecture
The authors build a Graph-Transformer module
• To model the correlations between the human body and hand movements and the MIDI events
① Adopt a spatial-temporal graph convolutional network over body keypoint coordinates over time to capture body motions (a minimal sketch follows below)
② Feed the encoded pose features to a Music Transformer decoder to generate a sequence of MIDI events
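A minimal sketch of one spatial-temporal graph convolution block over the pose keypoints, in the spirit of the graph encoder named above; the layer sizes, the single fixed adjacency matrix, and the missing normalization are simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class STGraphConvBlock(nn.Module):
    """Spatial-temporal graph convolution over a skeleton sequence.

    x:   (batch, channels, time, joints) keypoint features
    adj: (joints, joints) adjacency tensor of the skeleton graph
    """
    def __init__(self, in_channels, out_channels, adj, temporal_kernel=9):
        super().__init__()
        self.register_buffer("adj", adj)                   # fixed skeleton graph
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.spatial(x)                                # mix feature channels per joint
        x = torch.einsum("nctv,vw->nctw", x, self.adj)     # aggregate over neighbouring joints
        x = self.temporal(x)                               # mix information along the time axis
        return self.relu(x)
```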
Approach
• Visual Encoder
① Extract 2D keypoint coordinates from the raw videos
② Adopt a graph CNN to model the spatial-temporal relationships among the different keypoints on the body and hands (the human skeleton sequence)
• MIDI Decoder
The music signal is a sequence of MIDI events
Consider music generation from body motions as a sequence prediction problem
Utilize the Transformer model [28]
• An encoder-decoder based autoregressive generative model (see the decoding sketch below)
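A hedged sketch of a pose-conditioned MIDI decoder and a greedy autoregressive generation loop. It uses PyTorch's vanilla Transformer decoder as a stand-in for the Music Transformer (which additionally uses relative attention); the vocabulary size, the extra start token, and all hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn

VOCAB = 240 + 1   # 240 performance-event tokens plus an assumed <sos> token
SOS = 240

class MidiDecoder(nn.Module):
    """Maps encoded pose features to a sequence of MIDI-event logits."""
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, pose_feats, events):
        # pose_feats: (B, T_video, d_model) from the graph encoder
        # events:     (B, T_midi) previously generated / ground-truth tokens
        mask = nn.Transformer.generate_square_subsequent_mask(events.size(1)).to(events.device)
        h = self.decoder(self.embed(events), pose_feats, tgt_mask=mask)
        return self.out(h)            # (B, T_midi, VOCAB) logits for the next event

@torch.no_grad()
def generate(model, pose_feats, max_len=500):
    """Greedy autoregressive generation of MIDI events from pose features."""
    tokens = torch.full((pose_feats.size(0), 1), SOS,
                        dtype=torch.long, device=pose_feats.device)
    for _ in range(max_len):
        logits = model(pose_feats, tokens)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]              # drop the start token
```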
Approach
• Training
Objective:
• Minimize the cross-entropy loss between the predicted and the target sequence of MIDI events
Input:
• 2D coordinates of the human skeleton
At each generation step
• Predict the next MIDI event conditioned on the current MIDI events and the pose features
• ⇒ Repeating this step, the model generates the full sequence of MIDI events (a training-step sketch follows below)
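A one-function sketch of the teacher-forced training step described above, reusing the MidiDecoder sketch from the previous slide; the shift-by-one input/target split is the usual convention and an assumption here.

```python
import torch.nn.functional as F

def training_step(model, pose_feats, midi_events):
    """One teacher-forced step: predict event t+1 from events up to t and the pose features."""
    inputs, targets = midi_events[:, :-1], midi_events[:, 1:]
    logits = model(pose_feats, inputs)                     # (B, T-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```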
Experiments
• Experimental Setup
Datasets
• Three video datasets of music performances
• URMP [34] for the ablation study
• A high-quality multi-instrument video dataset recorded in a studio
• Provides a MIDI file for each recorded video
• AtinPiano [64] for comparison with SOTA models
• A YouTube channel of piano video recordings, with the camera looking down on the keyboard and hands
• MUSIC [64] for comparison with SOTA models
• An untrimmed video dataset collected by querying keywords on YouTube
• 1000 music performance videos, 11 categories
Implementation Details
• OpenPose [6]
• To extract the coordinates of body and hand keypoints for each frame
• Pre-processing
• Extract MIDI events from audio recordings
• Training
• Randomly sample a 6-second video clip from the dataset
Experiments
• Comparison with State-of-the-arts
(Comparison: Baselines · Qualitative Evaluation with Human Study · Visualization Evaluation)
• Baseline
SampleRNN, WaveNet, GAN-based
• Qualitative Evaluation with Human Study
Note that:
• The quality of the generated sound can be very subjective
• ⇒ Conduct a qualitative study on Amazon Mechanical Turk (AMT)
Correctness
• Which music recording is more relevant to the video content
Least noise
• Which music recording has the least noise
Synchronization
• Which music recording aligns best temporally with the video content
Overall
• Which sound they prefer to listen to overall
9 instruments from the MUSIC and AtinPiano datasets are used to compare against previous systems (accordion, bass, bassoon, cello, guitar, piano, tuba, ukulele, violin)
Experiments
• Comparison with State-of-the-arts
• Compare the MIDI prediction results with the ground truth
The predicted MIDI events are reasonably similar to the ground truth
Experiments
• Comparison with State-of-the-arts
• Sound spectrograms generated by different approaches
The model generates more structured harmonic components than the baselines
Experiments
• Comparison with State-of-the-arts
• Qualitative evaluation with a real-or-fake study
To assess whether the generated audio can fool people into thinking it is real
① AMT turkers are shown two versions of a video with
• Real audio (originally belonging to the video)
• Fake audio (generated by the model)
② Turkers choose the video that they think is real
Evaluation criteria
• Synchronization, artifacts, and noise
Experiments
• Comparison with State-of-the-arts
• Quantitative Evaluation with Automatic Metrics
Goal
• To evaluate the diversity of generated sound
Metric:
• The Number of Statistically-Different Bins (NDB); the lower, the better
• NDB counts the bins (clusters) in which the number of training samples differs significantly from the number of testing samples
Process (see the sketch below)
• ① Transform each sound into a log-spectrogram
• ② Cluster the training-set spectrograms using the k-means algorithm (k = 50)
• ③ Assign each generated sound in the testing set to the nearest cluster
Result
• The model achieves a significantly lower NDB
• ⇒ It generates more diverse sound
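A rough sketch of how the NDB metric could be computed from log-spectrograms, using scikit-learn's k-means and a per-bin two-proportion z-test; the significance test and threshold follow the common NDB formulation and may differ in detail from the exact setup used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def ndb_score(train_specs, test_specs, k=50):
    """Number of Statistically-Different Bins (lower = closer distributions).

    train_specs / test_specs: 2D arrays with one flattened log-spectrogram per row.
    """
    km = KMeans(n_clusters=k, n_init=10).fit(train_specs)
    n_train, n_test = len(train_specs), len(test_specs)
    p_train = np.bincount(km.labels_, minlength=k) / n_train
    p_test = np.bincount(km.predict(test_specs), minlength=k) / n_test

    # Two-proportion z-test per bin
    p = (p_train * n_train + p_test * n_test) / (n_train + n_test)
    se = np.sqrt(p * (1 - p) * (1 / n_train + 1 / n_test))
    z = np.abs(p_train - p_test) / np.maximum(se, 1e-8)
    return int(np.sum(z > 1.96))      # bins that differ at the two-sided 5% level
```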
Experiments
• Ablation Study
(Ablation: the effectiveness of body motions · the effectiveness of the Music Transformer)
• Overall
Goal:
• To assess the impact of each component of the model
Dataset:
• 5 instruments from URMP (violin, viola, cello, trumpet, and flute) for quantitative evaluations
Experiments
• Ablation Study
The effectiveness of Body Motions
• Goal
To understand the contribution of the model's visual representation (skeleton)
• Explicit body motions, represented through keypoint-based structural representations, guide the music generation
• Metric
Negative log-likelihood (NLL) loss (the lower, the better)
• Process
Replace the keypoint-based structural representation (skeleton) with
• RGB images and
• optical flow representations
• Result
The keypoint-based representation achieves better MIDI prediction accuracy than the other options
Experiments
• Ablation Study
The effectiveness of the Music Transformer
• Goal
To verify the efficacy of the Music Transformer framework for sequence prediction
• Metric
Negative log-likelihood (NLL) loss (the lower, the better)
• Process
Replace the Music Transformer module with a GRU, keeping the other parts of the pipeline the same
• Result
The Music Transformer captures long-term dependencies in music better than the GRU
Experiments
• Music Editing with MIDI
The system performs music editing by manipulating the predicted MIDI file
Fig. 6 demonstrates the flexibility of MIDI representations
• Manipulate the key of the predicted MIDI
Result
• The model is capable of generating music in different styles
• It enables new applications in controllable music generation, which are not available when using a waveform or
spectrogram as the audio representation (a transposition sketch follows below)
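As a concrete illustration of this kind of MIDI-level editing, the sketch below transposes every note of a predicted MIDI file by a fixed number of semitones using pretty_midi; the file names are placeholders.

```python
import pretty_midi

def transpose_midi(in_path, out_path, semitones):
    """Shift every note of a predicted MIDI file by `semitones` to change its key."""
    pm = pretty_midi.PrettyMIDI(in_path)
    for instrument in pm.instruments:
        if instrument.is_drum:                       # drum tracks are unpitched; skip them
            continue
        for note in instrument.notes:
            note.pitch = min(127, max(0, note.pitch + semitones))
    pm.write(out_path)

# Example: transpose the predicted MIDI up a perfect fourth (5 semitones)
# transpose_midi("predicted.mid", "predicted_up4.mid", 5)
```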
Conclusion and Future Work
• The Foley Music system
Generates expressive music from videos
Takes video as input
Detects human skeletons
Recognizes interactions with musical instruments over time
Predicts the corresponding MIDI files
Generates music with different styles through the MIDI representations
• Future Work
Further study of the connections between video and music