Foley Music:
Learning to Generate Music from Videos
Hyeshin Chu
2021. 04. 16
ECCV 2020
Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba
Contents
•Overview of the Paper
•Related Work
•Approach
•Experiments
•Conclusion and Future Work
Overview of the Paper
• Foley Music
 A system that learns to generate music by seeing and listening to large-scale music performance videos
http://foley-music.csail.mit.edu/
Overview of the Paper
• Foley Music
 A system that learns to generate music by seeing and listening to large-scale music performance videos
• Task
 To learn a mapping between audio and visual signals from unlabeled videos
• The challenges
 To build a visual perception module that recognizes the physical interactions between the musical instrument and the
player's body from videos
 To design an audio representation that
• respects the major musical rules about structure and dynamics
• is easy to predict from visual signals
 To build a model that associates these two modalities and accurately predicts music from videos
Overview of the Paper
• The authors identify two key elements of a successful video-to-music generator to address these challenges
 For the visual perception part
• Extract keypoints of the human body and hand fingers from video frames as intermediate visual representations
• Explicitly model the body parts and hand movements
 For the music
• Use the Musical Instrument Digital Interface (MIDI), which encodes timing and loudness information for each note event
• Advantages of using MIDI
• MIDI events capture the expressive timing and dynamics information contained in music
• Easy to fit into machine learning models
• Fully interpretable and flexible
• Easily converted to realistic music with a standard audio synthesizer
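To make the last point concrete, here is a minimal sketch of assembling a few note events and rendering them with a standard synthesizer. It assumes the pretty_midi library, which the slides do not mention; the notes and parameters are purely illustrative.

```python
import pretty_midi

# Build a tiny MIDI sequence: three notes with explicit timing and velocity.
pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)      # program 0 = Acoustic Grand Piano
for i, pitch in enumerate([60, 64, 67]):       # C4, E4, G4
    piano.notes.append(pretty_midi.Note(velocity=90, pitch=pitch,
                                        start=0.5 * i, end=0.5 * i + 0.4))
pm.instruments.append(piano)

pm.write("example.mid")              # fully interpretable and editable as a file
audio = pm.synthesize(fs=16000)      # render the events to a waveform (NumPy array)
```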
Overview of the Paper
• Main Contributions
 Present a model to generate synchronized and expressive music from videos
 Propose body keypoints and MIDI events as intermediate representations for transferring knowledge
across two modalities
 Empirically demonstrate that such representations are key to success
 Foley Music outperforms previous SOTA systems on music generation from videos by a large margin
 Demonstrate that MIDI musical representations facilitate new applications on generating different
styles of music
Related Work
• Synchronization between vision and sound
 Used sound clusters as supervision to learn visual feature representation, given unlabeled training
videos [44]
 Jointly learn the visual and audio representation using a visual-audio correspondence task [2, 33]
 Recent works on
• Localizing sound source in images or videos [29, 26, 3, 48, 64]
• Biometric matching [39]
• Visual-guided sound source separation [64, 15, 19, 60]
• Auditory inpainting [66]
• Emotion recognition [1], etc.
Related-work areas: Audio-Visual Learning, Motion and Sound, Music Generation, and Sound Generation from Videos
Related Work
• Correlations between sound and motion
 The associations between speech and facial movements [31, 55]
 Generating high-quality talking faces from audio [54, 30]
 Separating mixed speech signals of multiple speakers [14, 42]
• Correlations between body motions and sound
 Predicting gestures from speech [22]
 Body dynamics from music [50]
 Identifying a melody through body language [15]
• ⇒ The paper mainly focuses on generating music from videos according to body motions
Related Work
• Deep neural network models
 MelodyRNN [59] & DeepBach [24]
• Generate realistic melodies and Bach chorales
 WaveNet [40]
• Show promising results in generating realistic speech and music
 Song from PI [11]
• Use a hierarchical RNN model to simultaneously generate melody, drums, and chords for a pop song
 Music Transformer Model [28]
• Generate expressive piano music from MIDI events
 MAESTRO Dataset [25]
• To factorize piano music modeling and generation
 ⇒ Little work on exploring the problem of generating expressive music from videos
Related Work
• Foley, invented by Jack Foley in the 1920s
 Follow-up works investigate the task of predicting the sound emitted by interacting objects [43]
• Methods using a neural network
 Conditional generative adversarial networks for lab-collected music performance videos [10]
 A SampleRNN-based method that directly generates a waveform from an unconstrained in-the-wild video
dataset (10 types of sound) [68]
 A perceptual loss to improve the audio-visual semantic alignment [9]
 ⇒ The method uses MIDI for music transcription and generation, instead of spectrograms or
waveforms as the audio representation
Approach
• The model architecture consists of three components
 A visual encoder
• Takes video frames and extracts keypoint coordinates
• Uses a GCN to capture the body dynamics and produce a latent representation over time
 A MIDI decoder
• Takes the video sequence representation and generates a sequence of MIDI events
 An audio synthesizer
• Converts the MIDI events to a waveform
Approach subsections: Visual and Audio Representations, Body Motions to MIDI Predictions, and Training and Inference
Approach
• Visual Representations
 The limitations of existing works
• Limited applicability to tasks that require capturing fine-grained correlations between motion and sound
 The method uses the human pose features to capture the body motion cues
• ① Detecting the human body and hand keypoints from each video frame
• ② Stacking their 2D coordinates over time as structured visual representations
 In practice
• ① The open-source OpenPose toolbox [6]
• ⇒ Extract the 2D coordinates of human body joints
• ② OpenPose [6] and hand API [51]
• ⇒ Predict the coordinates of hand keypoints
 ⇒ Obtained 25 keypoints for the human body parts and 21 keypoints for each hand
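A minimal sketch of stacking the per-frame keypoints into the structured visual representation described above. It assumes OpenPose's per-frame JSON output (each keypoint stored as an x, y, confidence triple); the directory layout and the single-performer assumption are illustrative.

```python
import glob
import json
import numpy as np

def load_keypoints(json_dir):
    """Stack 2D keypoints over time into a (T, 67, 2) array: 25 body + 21 + 21 hand points."""
    frames = []
    for path in sorted(glob.glob(f"{json_dir}/*_keypoints.json")):
        with open(path) as f:
            person = json.load(f)["people"][0]              # assume one performer per video
        flat = (person["pose_keypoints_2d"]
                + person["hand_left_keypoints_2d"]
                + person["hand_right_keypoints_2d"])
        xyc = np.asarray(flat, dtype=np.float32).reshape(-1, 3)   # (67, 3): x, y, confidence
        frames.append(xyc[:, :2])                           # keep the 2D coordinates only
    return np.stack(frames)                                 # (T, 67, 2)
```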
Approach
• Audio Representations
 The limitations of existing works
• They do not work well at generating realistic music from videos, because music is highly compositional and
contains many structured events that machine learning models can hardly discover on their own
 The paper uses the Musical Instrument Digital Interface (MIDI) as the audio
representation
• MIDI is composed of timed note-on and note-off events
• Each event defines the note pitch
• Note-on events contain additional velocity information (indicating how strongly the note was played)
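For concreteness, a short sketch of what such an event stream looks like when read from a MIDI file; it assumes the mido library and an existing example.mid file.

```python
import mido

# Each message carries a pitch, a velocity, and the time elapsed (in seconds)
# since the previous message; a note_on with velocity 0 is conventionally a note_off.
for msg in mido.MidiFile("example.mid"):
    if msg.type in ("note_on", "note_off"):
        print(msg.type, "pitch:", msg.note, "velocity:", msg.velocity, "dt:", msg.time)
```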
Approach
• Audio Representations
 ① Detect MIDI events from the audio track of the videos using music transcription
software
• A 6-second video clip contains around 500 MIDI events (the length varies across pieces)
 ② Generate expressive timing information for music modeling
• Using a music performance encoding [41]
• A vocabulary of 88 note-on events, 88 note-off events, 32 velocity bins, 32 time-shift events
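A toy version of this performance encoding: build the token vocabulary (88 note-on, 88 note-off, 32 velocity bins, and 32 time-shift tokens, as on the slide) and serialize (pitch, velocity, start, end) notes into tokens. The 10 ms time-shift step and the binning details are illustrative assumptions, not the paper's exact choices.

```python
NOTE_ONS    = [("note_on", p)    for p in range(21, 109)]    # 88 piano pitches (A0..C8)
NOTE_OFFS   = [("note_off", p)   for p in range(21, 109)]    # 88
VELOCITIES  = [("velocity", v)   for v in range(32)]         # 32 velocity bins
TIME_SHIFTS = [("time_shift", t) for t in range(1, 33)]      # 32 shift sizes
VOCAB = {tok: i for i, tok in enumerate(NOTE_ONS + NOTE_OFFS + VELOCITIES + TIME_SHIFTS)}

def encode(notes, shift_ms=10, max_shift=32):
    """notes: list of (pitch, velocity, start_sec, end_sec), piano-range pitches 21..108.
    Returns a list of token ids."""
    events = []
    for pitch, vel, start, end in notes:
        events.append((start, ("velocity", vel * 32 // 128)))   # quantize 0..127 -> 0..31
        events.append((start, ("note_on", pitch)))
        events.append((end,   ("note_off", pitch)))
    events.sort(key=lambda e: e[0])                             # stable sort keeps order at equal times

    tokens, now = [], 0.0
    for t, tok in events:
        gap = int(round((t - now) * 1000 / shift_ms))           # elapsed time in shift units
        while gap > 0:                                          # advance the clock with time-shift tokens
            step = min(gap, max_shift)
            tokens.append(VOCAB[("time_shift", step)])
            now += step * shift_ms / 1000.0
            gap -= step
        tokens.append(VOCAB[tok])
    return tokens
```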
Approach
• Overall Architecture
 The authors build a Graph-Transformer module
• To model the correlations of human body parts and hand movements with the MIDI events
 ① Adopt a spatial-temporal graph convolutional network on body keypoint coordinates over time to
capture body motions
 ② Feed the encoded pose feature to a music transformer decoder to generate a sequence of the
MIDI events
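A minimal PyTorch sketch of the spatial-temporal graph convolution idea, not the authors' implementation: a 1x1 convolution mixes channels per joint, a normalized skeleton adjacency aggregates over neighboring joints, and a temporal convolution mixes information across frames.

```python
import torch
import torch.nn as nn

class STGraphConvBlock(nn.Module):
    """One spatial-temporal graph-convolution block over a keypoint sequence."""
    def __init__(self, in_ch, out_ch, adjacency, t_kernel=9):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))            # add self-loops
        self.register_buffer("A_norm", A / A.sum(dim=1, keepdim=True))  # row-normalize
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                                       # x: (B, C, T, V)
        x = self.spatial(x)                                     # mix channels per joint
        x = torch.einsum("bctv,vw->bctw", x, self.A_norm)       # aggregate over connected joints
        return self.relu(self.temporal(x))                      # mix information over time

# Usage sketch: 67 joints (25 body + 2 x 21 hands), 2 input channels (x, y coordinates).
# adjacency = torch.zeros(67, 67)   # fill in the skeleton edges
# block = STGraphConvBlock(2, 64, adjacency)
# features = block(coords)          # coords: (batch, 2, frames, 67)
```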
Approach
• Visual Encoder
 ① Extract 2D keypoint coordinates from the raw videos
 ② Adopt a Graph CNN to model the spatial-temporal relationships among different keypoints on the
body and hands (human skeleton sequence)
• MIDI Decoder
 A music signal is represented as a sequence of MIDI events
 Consider music generation from body motions as a sequence prediction problem
 Utilize Transformer model [28]
• An encoder-decoder based autoregressive generative model
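A schematic sketch of the decoder side. It uses PyTorch's built-in TransformerDecoder for brevity (the paper uses a Music Transformer with relative attention), conditions on the encoded pose features via cross-attention, and omits positional encodings.

```python
import torch
import torch.nn as nn

class MidiDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, midi_tokens, pose_memory):
        # midi_tokens: (B, L) ids of the MIDI events generated so far
        # pose_memory: (B, T, d_model) pose features from the visual (graph CNN) encoder
        L = midi_tokens.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=midi_tokens.device), diagonal=1)
        h = self.decoder(self.embed(midi_tokens), pose_memory, tgt_mask=causal)
        return self.out(h)                       # (B, L, vocab_size): logits for the next event
```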
Approach
• Training
 Objective:
• To predict the target sequence of MIDI events by minimizing the cross-entropy loss between predicted and
ground-truth events
 Input:
• 2D coordinates of human skeleton
 At each generation step
• Predict the next MIDI event from the events generated so far
• ⇒ The model generates MIDI events autoregressively
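A sketch of this training step with teacher forcing; the encoder and decoder modules and the tensor shapes are placeholders rather than the paper's exact configuration.

```python
import torch.nn.functional as F

def training_step(visual_encoder, midi_decoder, keypoints, midi_tokens):
    """keypoints: (B, 2, T, V) pose sequences; midi_tokens: (B, L) ground-truth event ids."""
    pose_memory = visual_encoder(keypoints)        # (B, T', d_model) latent pose features
    inputs  = midi_tokens[:, :-1]                  # events 0 .. L-2 are fed to the decoder
    targets = midi_tokens[:, 1:]                   # events 1 .. L-1 must be predicted
    logits = midi_decoder(inputs, pose_memory)     # (B, L-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```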
Experiments
• Experimental Setup
 Datasets
• Three video datasets of music performances
• URMP [34] for the ablation study
• A high-quality multi-instrument video dataset recorded in a studio
• Provides a MIDI file for each recorded video
• AtinPiano [64] for comparison with SOTA models
• A YouTube channel including piano video recordings
• Camera looking down on the keyboard and hands
• MUSIC [64] for comparison with SOTA models
• Untrimmed video dataset downloaded by querying keywords from YouTube
• 1000 music performance videos, 11 categories
 Implementation Details
• OpenPose [6]
• To extract the coordinates of body and hand keypoints for each frame
• Pre-processing
• Extract MIDI events from audio recordings
• Training
• Randomly take a 6-sec video clip from the dataset
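A sketch of the clip-sampling step: pick a random 6-second window, take the corresponding video frames, and keep the MIDI notes whose onsets fall inside the window. All names are illustrative; the slide does not specify the exact pipeline.

```python
import random

def sample_clip(num_frames, fps, midi_notes, clip_sec=6.0):
    """midi_notes: list of (pitch, velocity, start_sec, end_sec) for the full recording."""
    start = random.uniform(0.0, max(0.0, num_frames / fps - clip_sec))
    end = start + clip_sec
    frame_range = (int(start * fps), int(end * fps))          # frames fed to the pose encoder
    clip_notes = [(p, v, s - start, e - start)                # re-base note times to the clip
                  for (p, v, s, e) in midi_notes if start <= s < end]
    return frame_range, clip_notes
```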
Experiments
• Comparison with State-of-the-arts
Comparison subsections: Baseline, Qualitative Evaluation with Human Study, Visualizations, and Evaluation
• Baseline
 SampleRNN, WaveNet, GAN-based
• Qualitative Evaluation with Human Study
 Note that:
• The quality of the generated sound can be very subjective
• ⇒ Conduct a qualitative study on Amazon Mechanical Turk (AMT)
 Correctness
• Which music recording is more relevant to video content
 Least noise
• Which music recording has least noise
 Synchronization
• Which music recording temporally aligns with the video content best
 Overall
• Which music recording they prefer to listen to overall
9 instruments from the MUSIC and AtinPiano datasets are used to compare against previous systems
(accordion, bass, bassoon, cello, guitar, piano, tuba, ukulele, and violin)
Experiments
• Comparison with State-of-the-arts
• Comparing the MIDI prediction results with the ground truth
 The predicted MIDI events are reasonably similar to the ground truth
Experiments
• Comparison with State-of-the-arts
• Sound spectrograms generated by different approaches
 The model generates more structured harmonic components than the baselines
Experiments
• Comparison with State-of-the-arts
• Qualitative evaluation with a real-or-fake study
 To assess whether the generated audio can fool people into thinking it is real
 ① AMT workers are shown two versions of a video:
• one with the real audio (originally belonging to the video)
• one with fake audio (generated by the model)
 ② Workers choose the video that they think is real
 Evaluation criteria
• Synchronization, artifacts, and noise
Experiments
• Comparison with State-of-the-arts
• Quantitative Evaluation with Automatic Metrics
 Goal
• To evaluate the diversity of generated sound
 Metric:
• The Number of Statistically-Different Bins (NDB)  the lower, the better
• NDB counts the bins (cells) in which the proportion of generated (testing) samples differs significantly from the
proportion of training samples
 Process (see the sketch below)
• ① Transform each sound into a log-spectrogram
• ② Cluster the training-set spectrograms with the k-means algorithm (k = 50)
• ③ Assign each generated sound in the testing set to the nearest cell
 Result
• The model achieves significantly lower NDB
• Generates more diverse sound
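A minimal sketch of the NDB computation outlined above, assuming scikit-learn for k-means (k = 50 as on the slide) and a two-proportion z-test per bin; the significance level is an illustrative choice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ndb(train_specs, gen_specs, k=50, alpha=0.05):
    """train_specs, gen_specs: (N, D) flattened log-spectrograms. Returns the NDB count."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_specs)
    p_train = np.bincount(km.labels_, minlength=k) / len(train_specs)
    p_gen = np.bincount(km.predict(gen_specs), minlength=k) / len(gen_specs)

    n1, n2 = len(train_specs), len(gen_specs)
    different = 0
    for i in range(k):
        pooled = (p_train[i] * n1 + p_gen[i] * n2) / (n1 + n2)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) + 1e-12
        if abs(p_train[i] - p_gen[i]) / se > norm.ppf(1 - alpha / 2):
            different += 1                       # this bin is statistically different
    return different                             # lower = generated samples match the real data better
```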
Experiments
• Ablation Study
 Two ablations are performed: the effectiveness of body motions, and the effectiveness of the Music Transformer
• Overall
 Goal:
• To assess the impact of each component of the model
 Dataset:
• 5 instruments from URMP (violin, viola, cello, trumpet, and flute) for quantitative evaluations
Experiments
• Ablation Study: The Effectiveness of Body Motions
• Goal
 To assess the contribution of the model's visual representation (skeleton), which explicitly encodes body
motions through keypoint-based structured representations to guide music generation
• Metric
 NLL (negative log-likelihood) loss (the lower, the better)
• Process
 Replace the keypoint-based structured representation (skeleton) with
• RGB images, and
• an optical flow representation
• Result
 The keypoint-based representation achieves better MIDI prediction accuracy than the other options
Experiments
• Ablation Study: The Effectiveness of Music Transformers
• Goal
 To verify the efficacy of the Music Transformer framework for sequence prediction
• Metric
 NLL (negative log-likelihood) loss (the lower, the better)
• Process
 Replace the Music Transformer module with a GRU, keeping the other parts of the pipeline the same
• Result
 The Transformer-based model better captures the long-term dependencies in music than the GRU
Experiments
• Music Editing with MIDI
 Performs music editing by manipulating the MIDI file
 Fig. 6 demonstrates the flexibility of MIDI representations
• Manipulate the key of the predicted MIDI
 Result
• The model is capable of generating music with different styles
• The MIDI representation enables new applications in controllable music generation (which were not possible
when using a waveform or spectrogram as the audio representation)
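As a concrete example of the key manipulation mentioned above, a small sketch (assuming pretty_midi; file names are placeholders) that transposes every note of a predicted MIDI file by a fixed number of semitones before synthesis:

```python
import pretty_midi

def transpose(midi_path, semitones, out_path):
    """Shift every note's pitch; e.g. semitones=+3 moves a piece from C major to E-flat major."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    for inst in pm.instruments:
        if inst.is_drum:                         # drum tracks encode instruments, not pitches
            continue
        for note in inst.notes:
            note.pitch = min(127, max(0, note.pitch + semitones))
    pm.write(out_path)

# transpose("predicted.mid", +3, "predicted_transposed.mid")
```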
Conclusion and Future Work
• The Foley Music system
 Generates expressive music from videos
 Takes video as input
 Detects human skeletons
 Recognizes interactions with musical instruments over time
 Predicts the corresponding MIDI files
 Generates music with different styles through the MIDI representations
• Future Work
 Further study of the connections between video and music