This document provides an overview of a research paper on using music gestures to separate the sound sources in videos of musical performances. The paper describes a pipeline that models body and finger movements from video with a context-aware graph network and then associates those movements with the corresponding audio signals through an audio-visual fusion module. Experiments on several music performance datasets show that the approach outperforms previous methods at separating the sounds both of different instruments and of duets of the same instrument. The work presents a new direction of exploiting structured body dynamics to guide sound separation, with the audio-visual fusion module proposed as the mechanism for combining the two modalities.
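To make the two-stage structure concrete, below is a minimal PyTorch sketch of the idea: a graph network encodes per-frame body keypoints into a motion embedding, which then conditions an audio network that predicts a spectrogram mask for the target instrument. All module names, dimensions, and the FiLM-style conditioning are illustrative assumptions for this overview, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PoseGraphEncoder(nn.Module):
    """Toy graph network over body/finger keypoints.

    Each frame provides K keypoints with 2-D coordinates; a fixed
    adjacency matrix would encode the skeleton (identity is used here
    as a placeholder). This is a plain GCN stack, not the paper's
    context-aware variant.
    """
    def __init__(self, num_keypoints=25, in_dim=2, hidden=64, out_dim=128):
        super().__init__()
        self.register_buffer("adj", torch.eye(num_keypoints))
        self.gc1 = nn.Linear(in_dim, hidden)
        self.gc2 = nn.Linear(hidden, out_dim)

    def forward(self, keypoints):
        # keypoints: (batch, frames, K, 2)
        x = torch.relu(self.adj @ self.gc1(keypoints))
        x = torch.relu(self.adj @ self.gc2(x))
        # Pool over keypoints and frames -> one motion embedding per clip.
        return x.mean(dim=(1, 2))  # (batch, out_dim)

class AudioVisualFusion(nn.Module):
    """Fuse the motion embedding with mixture-spectrogram features and
    predict a separation mask for the target source. FiLM-style
    per-channel conditioning stands in for the paper's fusion module."""
    def __init__(self, vis_dim=128):
        super().__init__()
        self.audio_net = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.film = nn.Linear(vis_dim, 64)  # per-channel scale and shift
        self.mask_head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, mix_spec, motion_emb):
        # mix_spec: (batch, 1, freq, time); motion_emb: (batch, vis_dim)
        a = torch.relu(self.audio_net(mix_spec))
        scale, shift = self.film(motion_emb).chunk(2, dim=-1)
        a = a * scale[:, :, None, None] + shift[:, :, None, None]
        mask = torch.sigmoid(self.mask_head(a))
        return mask * mix_spec  # estimated spectrogram of the target source

# Usage with random data, shapes only for illustration.
pose = torch.randn(2, 16, 25, 2)   # 2 clips, 16 frames, 25 keypoints
mix = torch.rand(2, 1, 256, 100)   # mixture magnitude spectrograms
est = AudioVisualFusion()(mix, PoseGraphEncoder()(pose))  # (2, 1, 256, 100)
```

Because the conditioning signal comes from pose rather than pixel appearance, a sketch like this can in principle distinguish two players of the same instrument, which is the setting where the paper reports its largest gains.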