Music Gesture for Visual Sound Separation
Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
CVPR, 2020
October 6th, 2021
Presenter: Yonghyun Kim
Contents
• Overview
• Related Work
• Approach
• Experiments
• Conclusion and Future Works
Overview
• “Music Gesture”: a key point-based structured representation that explicitly models the body and finger movements of musicians as they perform music
• Pipeline
 Adopt a context-aware graph network to integrate visual semantic context with body dynamics.
 Apply an audio-visual (A/V) fusion model to associate body movements with the corresponding audio signals.
• Datasets (1 of 3)
 URMP (University of Rochester Multi-Modal Music Performance) [31] Dataset
• A high-quality multi-instrument video dataset recorded in a studio, with ground-truth labels for each sound source.
• Each piece comes with the musical score in MIDI format, individual instrument audio recordings, and the videos.
• 44 Pieces of music: 11 Duets, 12 Trios, 14 Quartets, and 7 Quintets
• 14 instruments that are common in orchestras: 4 strings, 6 woodwinds, 4 brass
• Datasets (2 of 3)
 MUSIC-21 [55] Dataset
• An untrimmed video dataset crawled by keyword query from YouTube.
• It contains music performances belonging to 21 categories.
• The dataset is relatively clean and was collected for training and evaluating visual sound source separation models.
• Datasets (3 of 3)
 AtinPiano [35] Dataset
• A dataset of piano video recordings filmed with the camera looking down on the keyboard and hands.
• AtinPiano is a YouTube channel of piano video recordings.
Related Work
• Sound Separation
 Sound Separation is a central problem in the audio signal processing area [33, 22].
 Classic solutions were based on Non-negative Matrix Factorization (NMF) [51, 11, 47].
• But these are not very effective as they rely on low-level correlations in the signals.
 Deep learning-based methods have taken over in recent years.
• CNN models (to predict time-frequency masks for music source separation and enhancement) [46, 9]
 A spectrogram classification model cannot handle an arbitrary number of simultaneous speakers. To address this, the following approaches have been proposed:
• Deep Clustering (Hershey et al.) [24]
• Speaker-independent training framework (Yu et al.) [54]
Figures: sound separation; NMF, in which V is approximated by the product of two smaller matrices W and H
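As a rough illustration of the NMF idea mentioned above, the sketch below factors a small non-negative matrix V into W and H using the classic multiplicative update rules. It is a minimal plain-Python example, not the setup of any cited baseline: the function names, the rank, the iteration count, and the toy matrix are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (F x T) into W (F x rank) and H (rank x T)."""
    rng = random.Random(seed)
    w = [[rng.random() + eps for _ in range(rank)] for _ in v]
    h = [[rng.random() + eps for _ in v[0]] for _ in range(rank)]
    for _ in range(iterations):
        # Multiplicative update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[hij * n / (d + eps) for hij, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(h, num, den)]
        # Multiplicative update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[wij * n / (d + eps) for wij, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(w, num, den)]
    return w, h

# Toy "magnitude spectrogram" built from two non-negative components.
toy_v = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 3.0]]
toy_w, toy_h = nmf(toy_v, rank=2)
```

In a separation setting, the columns of W would play the role of spectral templates and the rows of H their activations over time; how components are assigned to sources is outside this sketch.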
• Visual Sound Separation
 Research question: how can semantic appearance help with sound separation?
 Earlier work has limited ability to capture motion cues, which restricts its applicability to harder sound source separation problems.
 No prior work explicitly modeled body movement cues with structured key point-based representations for audio-visual learning; this work does.
• Audio-Visual Learning
 Thanks to deep neural networks (DNNs), a series of works on audio-visual learning has been published in the past few years.
 Good audio-visual representations can be learned by training the audio and image models jointly, or separately via knowledge distillation.
 Examples
• Sounding Object Localization, Sound Generation for videos, 360/Stereo sound from videos, and so forth.
• Audio and Body Dynamics
 Many studies have examined the relationship between faces and audio.
• Examples: generating a talking face from audio [48, 27], separating mixed speech signals of multiple speakers [12], lip reading from raw videos [10], etc.
 In contrast, correlations between body pose and audio have been less explored.
• Examples: predicting body dynamics from music [44], body rhythms from speech [20]
Figures: lip reading from raw videos; generating a talking face from audio
Approach
• Model Architecture – Overview
 Goal: associate body dynamics with the audio signals for sound source separation.
 Adopt the commonly used “mix-and-separate” self-supervised training procedure [56].
 Create synthetic training data by mixing arbitrary sound sources from different video clips, then separate each sound from the mixture conditioned on its associated visual context.
• Model Architecture – Overview
 Two major components: the Video Analysis Network and the A/V Separation Network.
 During training, randomly select N videos with paired video frames and audio signals {Vk, Sk}.
 Mix their audio by a linear combination of the audio inputs to form a synthetic mixture Smix.
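The mixing step described above can be sketched in a few lines. This is only an illustration in plain Python: the function names, the equal default weights, and the representation of a dataset as a list of (frames, waveform) pairs are assumptions, not details taken from the paper.

```python
import random

def mix_sources(sources, weights=None):
    """Linearly combine equally long waveforms into one synthetic mixture Smix."""
    if weights is None:
        # Assumption: equal weights; the slides only say "linear combination".
        weights = [1.0 / len(sources)] * len(sources)
    length = len(sources[0])
    if any(len(s) != length for s in sources):
        raise ValueError("all sources must have the same number of samples")
    return [sum(w * s[t] for w, s in zip(weights, sources)) for t in range(length)]

def sample_training_example(dataset, n, rng):
    """Pick n (frames, audio) pairs {Vk, Sk} and build their mixture.

    The individual audios Sk serve as separation targets, so no manual
    labels are needed (the self-supervised "mix-and-separate" setup).
    """
    picked = rng.sample(dataset, n)
    frames = [v for v, _ in picked]
    audios = [s for _, s in picked]
    return frames, audios, mix_sources(audios)

# Hypothetical toy dataset: frame identifiers paired with 4-sample waveforms.
toy_dataset = [
    ("video_a", [0.1, 0.2, 0.3, 0.4]),
    ("video_b", [0.5, 0.0, -0.5, 0.0]),
    ("video_c", [1.0, -1.0, 1.0, -1.0]),
]
frames, audios, mixture = sample_training_example(toy_dataset, 2, random.Random(0))
```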
• Model Architecture – Overview
 Given a video clip Vk, the Video Analysis Network extracts global context and body dynamic features from the video.
 The A/V Separation Network separates the audio signal Sk from Smix, conditioned on its visual context Vk.
 The network is trained with a supervised objective, but the targets come from unlabeled video data (self-supervised).
• Model Architecture – Video Analysis Network
 ResNet-50 [23]: To extract global semantic features from video frames
• Extract the features after the last spatial average pooling layer from the first frame of each Vk
• Obtain 2048-Dimensional context feature vector for each Vk (Da=2048)
 AlphaPose toolbox [13]: To estimate the 2D locations of human body joints
• Obtain 18 key points for human body
 OpenPose hand API [8, 45]: To estimate the coordinates of hand key points
• Obtain 21 key points for each hand
• N=18+21=39: Total key points
• Model Architecture – Video Analysis Network
 Context-Aware Graph CNN: To fuse the semantic context of instruments and human body dynamics
• Undirected spatial-temporal graph G={V, E} on a human skeleton sequence.
• Each node vi in {V} corresponds to a key point of the human body.
• Edges reflect the natural connectivity of body key points.
• fv: The encoded pose feature
• N: The number of key points
• Dn: The feature dimension for each input node
• X: input features; Â: Row-normalized adjacency matrix
• Ws, Wt: weight matrices of the spatial graph convolution and the 2D convolution, respectively
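To make the symbols above concrete, here is a small plain-Python sketch of one spatial graph convolution followed by a convolution over the time axis. It is an assumption-laden illustration rather than the paper's layer: adding self-loops before row normalization, the order Â X Ws, the zero padding in time, and all function names are choices made for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_normalized_adjacency(num_nodes, edges):
    """Adjacency of an undirected graph (self-loops assumed), rows scaled to sum to 1."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return [[a / sum(row) for a in row] for row in adj]

def spatial_graph_conv(x, a_hat, w_s):
    """One frame: x is N x Dn, a_hat is N x N, w_s is Dn x Dout."""
    return matmul(matmul(a_hat, x), w_s)

def temporal_conv(frames, w_t):
    """Convolve over time: frames is a list of T (N x D) matrices,
    w_t is a list of K (D x Dout) matrices with K odd; zero padding at the ends."""
    half = len(w_t) // 2
    out = []
    for t in range(len(frames)):
        acc = None
        for k, w in enumerate(w_t):
            s = t + k - half
            if 0 <= s < len(frames):
                term = matmul(frames[s], w)
                acc = term if acc is None else [
                    [a + b for a, b in zip(ra, rb)] for ra, rb in zip(acc, term)]
        out.append(acc)
    return out

# Toy skeleton: 3 key points in a chain, 2D coordinates as node features.
a_hat = row_normalized_adjacency(3, [(0, 1), (1, 2)])
identity = [[1.0, 0.0], [0.0, 1.0]]
frame = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
spatial = spatial_graph_conv(frame, a_hat, identity)
```

The spatial step averages each key point's features with those of its neighbors in the skeleton; the temporal step then mixes the same key point across adjacent frames.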
• Model Architecture – Audio-Visual Separation Network
 Takes the spectrogram of the mixture audio, together with the visual representation produced by the video analysis network, as input; then predicts a spectrogram mask and generates the audio signal for the selected video.
 Audio Network
• Adopt U-Net style architecture [41]
• Consists of 4 dilated convolution layers and 4 dilated up-convolution layers
• Use 3×3 spatial filters with stride 2 and dilation 1
• BatchNorm layer, Leaky ReLU
• Input: 2D time-frequency spectrogram of Smix
• Output: A same-size binary spectrogram mask
Figure: Dilated convolution (atrous convolution)
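The effect of the filter settings above on feature-map size can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the padding of 1 and the 256-bin input size are assumptions that do not come from the slides.

```python
def conv_output_size(size, kernel=3, stride=2, dilation=1, padding=1):
    """Standard output-size formula for a strided, dilated convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def encoder_sizes(size, layers=4):
    """Sizes along one spectrogram axis after each of the down-sampling layers."""
    sizes = []
    for _ in range(layers):
        size = conv_output_size(size)
        sizes.append(size)
    return sizes

print(encoder_sizes(256))  # [128, 64, 32, 16]
```

Under these assumptions each of the 4 convolution layers halves the spectrogram, and the 4 up-convolution layers would restore the original size so that the output mask matches the input.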
• Model Architecture – Audio-Visual Separation Network
 Audio-Visual Fusion
• Adopt a self-attention [50] based cross-modal early fusion module to capture the correlations between body movements and the sound signals.
Figure: Audio-Visual Fusion Model
* MLP: Multi-Layer Perceptron (2 fully-connected layers with ReLU)
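For readers unfamiliar with attention, the sketch below shows generic scaled dot-product attention in which each audio feature attends over pose features. It is not the fusion module from the paper: treating audio features as queries and pose features as keys and values, and omitting the MLP and any learned projections, are simplifying assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(audio_queries, pose_keys, pose_values):
    """Scaled dot-product attention: one fused vector per audio query."""
    scale = math.sqrt(len(pose_keys[0]))
    fused = []
    for q in audio_queries:
        # Similarity between this audio feature and every pose feature.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in pose_keys]
        weights = softmax(scores)
        # Weighted average of the pose value vectors.
        fused.append([sum(w * v[d] for w, v in zip(weights, pose_values))
                      for d in range(len(pose_values[0]))])
    return fused

# Hypothetical toy features: 1 audio query, 2 pose time steps.
fused = cross_modal_attention(
    audio_queries=[[1.0, 0.0]],
    pose_keys=[[1.0, 1.0], [1.0, 1.0]],
    pose_values=[[0.0, 2.0], [4.0, 6.0]],
)
```

With identical keys, as in the toy call, the attention weights are uniform and the fused vector is simply the mean of the pose values.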
• Model Architecture – Training and Inference
 The learning objective is to estimate a binary mask Mk.
• The GT mask of the k-th video is determined by whether the target sound is the dominant component of the input mixed sound on the magnitude spectrogram S.
• The network is trained by minimizing the per-pixel sigmoid cross-entropy loss between the estimated masks and the GT masks.
• The predicted mask is then thresholded and multiplied with the input complex STFT (Short-Time Fourier Transform) coefficients to obtain a predicted sound spectrogram.
• Finally, an iSTFT (inverse STFT) with the same transformation parameters is applied to the predicted spectrogram to reconstruct the waveform of the separated sound.
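The training target, the loss, and the masking step above can be sketched as follows. This is a plain-Python illustration under stated assumptions: "dominant" is read as "at least as large as every other source in that time-frequency bin", the threshold of 0.5 is a guess, and the STFT/iSTFT steps themselves are left out.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def ground_truth_masks(source_mags):
    """source_mags[k][f][t] is the magnitude spectrogram of source k.
    Mask k is 1 where source k is at least as loud as every other source."""
    return [[[1.0 if all(spec[f][t] >= other[f][t] for other in source_mags) else 0.0
              for t in range(len(spec[0]))] for f in range(len(spec))]
            for spec in source_mags]

def sigmoid_cross_entropy(logits, target):
    """Mean per-pixel binary cross entropy between sigmoid(logits) and a 0/1 mask."""
    total, count = 0.0, 0
    for row_l, row_t in zip(logits, target):
        for z, y in zip(row_l, row_t):
            # Stable form of -[y log s(z) + (1 - y) log(1 - s(z))].
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

def apply_mask(logits, mixture_stft, threshold=0.5):
    """Threshold the predicted mask and multiply it with the complex mixture STFT."""
    return [[c if sigmoid(z) > threshold else 0j for z, c in zip(row_l, row_c)]
            for row_l, row_c in zip(logits, mixture_stft)]

# Two toy sources with one frequency bin and two time frames.
masks = ground_truth_masks([[[3.0, 1.0]], [[2.0, 5.0]]])
separated = apply_mask([[2.0, -2.0]], [[1 + 1j, 2 + 0j]])
```

The masked complex coefficients in `separated` are what an inverse STFT would turn back into a waveform.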
Experiments
• Hetero-musical Separation
 Evaluate the model on separating sounds from different kinds of instruments in the MUSIC-21 dataset.
 Metrics: SDR (Source-to-Distortion Ratio) and SIR (Source-to-Interference Ratio)
 Compare against 6 state-of-the-art systems:
• NMF [51], Deep Separation [9], MIML [17], Sound of Pixels [56], Co-separation [19], Sound of Motions [55]
 Train and evaluate the model using mix-2 and mix-3 samples.
 Because the MUSIC dataset does not have GT labels for quantitative evaluation, the researchers construct a synthetic test set by mixing solo videos.
 Results are reported on a validation set with 256 pairs of sound mixtures.
 Implemented in the PyTorch framework.
 The GCN model has 11 layers with residual connections.
 A subjective human study is also conducted on Amazon Mechanical Turk (AMT) using real mixture videos from the MUSIC-21 and URMP datasets.
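As a rough intuition for what the SDR metric measures, the sketch below computes a simplified energy ratio between a reference source and the error of its estimate, in decibels. This is deliberately not the full BSS Eval definition used in separation benchmarks, which decomposes the error into interference and artifact terms; the function name and the toy signals are assumptions.

```python
import math

def simplified_sdr_db(reference, estimate):
    """10 * log10(||reference||^2 / ||reference - estimate||^2), in dB.

    Simplified form for illustration; higher values mean a closer estimate.
    """
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    reference_energy = sum(r * r for r in reference)
    if error_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(reference_energy / error_energy)

# An estimate that is the reference scaled by 0.9 leaves 1% of the energy as error.
toy_reference = [1.0, -1.0, 0.5, 0.25]
toy_estimate = [0.9 * r for r in toy_reference]
toy_score = simplified_sdr_db(toy_reference, toy_estimate)
```

In this toy case the ratio of reference energy to error energy is 100, which corresponds to about 20 dB.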
• Homo-musical Separation
 Evaluate the model on separating sounds from the same kind of instrument in the MUSIC-21 dataset.
 Select 5 kinds of musical instruments whose sounds are closely related to body dynamics:
• Trumpet, Flute, Piano, Violin, and Cello
 Pre-train the model on multiple-instrument separation, then learn to separate the same instrument.
 Compare the model with Sound of Motions (SoM) [55].
 Results are measured by both automatic SDR scores and human evaluations on AMT.
Conclusions and Future Works
• Conclusions
 Performs better on audio-visual source separation of different instruments.
 Achieves remarkable results on separating sounds of the same instrument, for 5 kinds of instruments.
 Paves a new research direction on exploiting body dynamic motions with structured key point-based video representations to guide sound source separation.
 Proposes a novel audio-visual fusion module to associate human body motion cues with sound signals.
• Future Works
 Try different modules or architectures to enhance the separation results.
 Utilize these datasets or modules in our own research.
Thank you
Any questions?

More Related Content

What's hot

Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
Sungjoon Choi
 
Unit v transfer learning
Unit v transfer  learningUnit v transfer  learning
Unit v transfer learning
DineshBabuAnbalagan
 
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Jia-Bin Huang
 
Efficient de cvpr_2020_paper
Efficient de cvpr_2020_paperEfficient de cvpr_2020_paper
Efficient de cvpr_2020_paper
shanullah3
 
TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...
TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...
TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...
sipij
 
PR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision LearnersPR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision Learners
Jinwon Lee
 
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
MediaEval2012
 
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
Best Jobs
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
Ashray Bhandare
 
Performance Analysis of Digital Watermarking Of Video in the Spatial Domain
Performance Analysis of Digital Watermarking Of Video in the Spatial DomainPerformance Analysis of Digital Watermarking Of Video in the Spatial Domain
Performance Analysis of Digital Watermarking Of Video in the Spatial Domain
paperpublications3
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Sangmin Woo
 
ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network
신동 강
 
Wavelet Based Image Watermarking
Wavelet Based Image WatermarkingWavelet Based Image Watermarking
Wavelet Based Image Watermarking
IJERA Editor
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
CSCJournals
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
Sangmin Woo
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
Tianxiang Xiong
 
The Importance of Terminology and sRGB Uncertainty - Notes - 0.5
The Importance of Terminology and sRGB Uncertainty - Notes - 0.5The Importance of Terminology and sRGB Uncertainty - Notes - 0.5
The Importance of Terminology and sRGB Uncertainty - Notes - 0.5
Thomas Mansencal
 
Underwater sparse image classification using deep convolutional neural networks
Underwater sparse image classification using deep convolutional neural networksUnderwater sparse image classification using deep convolutional neural networks
Underwater sparse image classification using deep convolutional neural networks
Mohamed Elawady
 
Review on cs231 part-2
Review on cs231 part-2Review on cs231 part-2
Review on cs231 part-2
Jeong Choi
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
csandit
 

What's hot (20)

Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
Unit v transfer learning
Unit v transfer  learningUnit v transfer  learning
Unit v transfer learning
 
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015Lecture 29 Convolutional Neural Networks -  Computer Vision Spring2015
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
 
Efficient de cvpr_2020_paper
Efficient de cvpr_2020_paperEfficient de cvpr_2020_paper
Efficient de cvpr_2020_paper
 
TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...
TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...
TARGET DETECTION AND CLASSIFICATION PERFORMANCE ENHANCEMENT USING SUPERRESOLU...
 
PR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision LearnersPR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision Learners
 
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
 
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
 
Performance Analysis of Digital Watermarking Of Video in the Spatial Domain
Performance Analysis of Digital Watermarking Of Video in the Spatial DomainPerformance Analysis of Digital Watermarking Of Video in the Spatial Domain
Performance Analysis of Digital Watermarking Of Video in the Spatial Domain
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
 
ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network
 
Wavelet Based Image Watermarking
Wavelet Based Image WatermarkingWavelet Based Image Watermarking
Wavelet Based Image Watermarking
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
The Importance of Terminology and sRGB Uncertainty - Notes - 0.5
The Importance of Terminology and sRGB Uncertainty - Notes - 0.5The Importance of Terminology and sRGB Uncertainty - Notes - 0.5
The Importance of Terminology and sRGB Uncertainty - Notes - 0.5
 
Underwater sparse image classification using deep convolutional neural networks
Underwater sparse image classification using deep convolutional neural networksUnderwater sparse image classification using deep convolutional neural networks
Underwater sparse image classification using deep convolutional neural networks
 
Review on cs231 part-2
Review on cs231 part-2Review on cs231 part-2
Review on cs231 part-2
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
 

Similar to Music Gesture for Visual Sound Separation

Foley Music: Learning to Generate Music from Videos
Foley Music: Learning to Generate Music from VideosFoley Music: Learning to Generate Music from Videos
Foley Music: Learning to Generate Music from Videos
ivaderivader
 
Mini Project- Audio Enhancement
Mini Project-  Audio EnhancementMini Project-  Audio Enhancement
1801 1805
1801 18051801 1805
1801 1805
Editor IJARCET
 
1801 1805
1801 18051801 1805
1801 1805
Editor IJARCET
 
Investigating Multi-Feature Selection and Ensembling for Audio Classification
Investigating Multi-Feature Selection and Ensembling for Audio ClassificationInvestigating Multi-Feature Selection and Ensembling for Audio Classification
Investigating Multi-Feature Selection and Ensembling for Audio Classification
ijaia
 
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Saimunur Rahman
 
MLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to MusicMLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to Music
Eric Battenberg
 
Ml conf2013 teaching_computers_share
Ml conf2013 teaching_computers_shareMl conf2013 teaching_computers_share
Ml conf2013 teaching_computers_share
MLconf
 
Automatic sound synthesis using the fly algorithm
Automatic sound synthesis using the fly algorithmAutomatic sound synthesis using the fly algorithm
Automatic sound synthesis using the fly algorithm
TELKOMNIKA JOURNAL
 
Visual and audio scene classification for detecting discrepancies (MAD'24 wo...
Visual and audio scene classification  for detecting discrepancies (MAD'24 wo...Visual and audio scene classification  for detecting discrepancies (MAD'24 wo...
Visual and audio scene classification for detecting discrepancies (MAD'24 wo...
kapost1
 
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Esra Açar
 
Emotion based music player
Emotion based music playerEmotion based music player
Emotion based music player
Nizam Muhammed
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
Visual Search for Musical Performances and Endoscopic Videos
Visual Search for Musical Performances and Endoscopic VideosVisual Search for Musical Performances and Endoscopic Videos
Visual Search for Musical Performances and Endoscopic Videos
Universitat Politècnica de Catalunya
 
From Unsupervised to Semi-Supervised Event Detection
From Unsupervised to Semi-Supervised Event DetectionFrom Unsupervised to Semi-Supervised Event Detection
From Unsupervised to Semi-Supervised Event Detection
Vincent Chu
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound Recognition
IRJET Journal
 
IRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural NetworkIRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural Network
IRJET Journal
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesis
NAVER Engineering
 
A Computational Framework for Sound Segregation in Music Signals using Marsyas
A Computational Framework for Sound Segregation in Music Signals using MarsyasA Computational Framework for Sound Segregation in Music Signals using Marsyas
A Computational Framework for Sound Segregation in Music Signals using Marsyas
Luís Gustavo Martins
 
AC overview
AC overviewAC overview
AC overview
WarNik Chow
 

Similar to Music Gesture for Visual Sound Separation (20)

Foley Music: Learning to Generate Music from Videos
Foley Music: Learning to Generate Music from VideosFoley Music: Learning to Generate Music from Videos
Foley Music: Learning to Generate Music from Videos
 
Mini Project- Audio Enhancement
Mini Project-  Audio EnhancementMini Project-  Audio Enhancement
Mini Project- Audio Enhancement
 
1801 1805
1801 18051801 1805
1801 1805
 
1801 1805
1801 18051801 1805
1801 1805
 
Investigating Multi-Feature Selection and Ensembling for Audio Classification
Investigating Multi-Feature Selection and Ensembling for Audio ClassificationInvestigating Multi-Feature Selection and Ensembling for Audio Classification
Investigating Multi-Feature Selection and Ensembling for Audio Classification
 
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
 
MLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to MusicMLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to Music
 
Ml conf2013 teaching_computers_share
Ml conf2013 teaching_computers_shareMl conf2013 teaching_computers_share
Ml conf2013 teaching_computers_share
 
Automatic sound synthesis using the fly algorithm
Automatic sound synthesis using the fly algorithmAutomatic sound synthesis using the fly algorithm
Automatic sound synthesis using the fly algorithm
 
Visual and audio scene classification for detecting discrepancies (MAD'24 wo...
Visual and audio scene classification  for detecting discrepancies (MAD'24 wo...Visual and audio scene classification  for detecting discrepancies (MAD'24 wo...
Visual and audio scene classification for detecting discrepancies (MAD'24 wo...
 
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emot...
 
Emotion based music player
Emotion based music playerEmotion based music player
Emotion based music player
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Visual Search for Musical Performances and Endoscopic Videos
Visual Search for Musical Performances and Endoscopic VideosVisual Search for Musical Performances and Endoscopic Videos
Visual Search for Musical Performances and Endoscopic Videos
 
From Unsupervised to Semi-Supervised Event Detection
From Unsupervised to Semi-Supervised Event DetectionFrom Unsupervised to Semi-Supervised Event Detection
From Unsupervised to Semi-Supervised Event Detection
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound Recognition
 
IRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural NetworkIRJET- Music Genre Recognition using Convolution Neural Network
IRJET- Music Genre Recognition using Convolution Neural Network
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesis
 
A Computational Framework for Sound Segregation in Music Signals using Marsyas
A Computational Framework for Sound Segregation in Music Signals using MarsyasA Computational Framework for Sound Segregation in Music Signals using Marsyas
A Computational Framework for Sound Segregation in Music Signals using Marsyas
 
AC overview
AC overviewAC overview
AC overview
 

More from ivaderivader

Argument Mining
Argument MiningArgument Mining
Argument Mining
ivaderivader
 
Papers at CHI23
Papers at CHI23Papers at CHI23
Papers at CHI23
ivaderivader
 
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph KernelsDDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
ivaderivader
 
So Predictable! Continuous 3D Hand Trajectory Prediction in Virtual Reality
So Predictable! Continuous 3D Hand Trajectory Prediction in Virtual Reality So Predictable! Continuous 3D Hand Trajectory Prediction in Virtual Reality
So Predictable! Continuous 3D Hand Trajectory Prediction in Virtual Reality
ivaderivader
 
Reinforcement Learning-based Placement of Charging Stations in Urban Road Net...
Reinforcement Learning-based Placement of Charging Stations in Urban Road Net...Reinforcement Learning-based Placement of Charging Stations in Urban Road Net...
Reinforcement Learning-based Placement of Charging Stations in Urban Road Net...
ivaderivader
 
Prediction for Retrospection: Integrating Algorithmic Stress Prediction into ...
Prediction for Retrospection: Integrating Algorithmic Stress Prediction into ...Prediction for Retrospection: Integrating Algorithmic Stress Prediction into ...
Prediction for Retrospection: Integrating Algorithmic Stress Prediction into ...
ivaderivader
 
Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Orien...
Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Orien...Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Orien...
Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Orien...
ivaderivader
 
A Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial NetworksA Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial Networks
ivaderivader
 
CatchLIve: Real-time Summarization of Live Streams with Stream Content and In...
CatchLIve: Real-time Summarization of Live Streams with Stream Content and In...CatchLIve: Real-time Summarization of Live Streams with Stream Content and In...
Music Gesture for Visual Sound Separation

  • 1. Music Gesture for Visual Sound Separation. Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba. CVPR, 2020. October 6th, 2021. Presenter: Yonghyun Kim
  • 2. Contents
      • Overview
      • Related Work
      • Approach
      • Experiments
      • Conclusion and Future Work
  • 3. Overview
      • “Music Gesture”: a key point-based structured representation that explicitly models the body and finger movements of musicians as they perform music.
  • 4. Overview
      • Pipeline
          ◦ Adopt a context-aware graph network to integrate visual semantic context with body dynamics.
          ◦ Apply an audio-visual (A/V) fusion model to associate body movements with the corresponding audio signals.
  • 5. Overview
      • Datasets (1 of 3)
          ◦ URMP (University of Rochester Multi-Modal Music Performance) dataset [31]
              - A high-quality multi-instrument video dataset recorded in a studio, with ground-truth labels for each sound source.
              - Each piece comes with the music score in MIDI format, individual instrument audio recordings, and the videos.
              - 44 pieces of music: 11 duets, 12 trios, 14 quartets, and 7 quintets.
              - 14 instruments that are common in orchestras: 4 strings, 6 woodwinds, 4 brass.
  • 6. Overview
      • Datasets (2 of 3)
          ◦ MUSIC-21 dataset [55]
              - An untrimmed video dataset crawled from YouTube by keyword query.
              - It contains music performances belonging to 21 categories.
              - The dataset is relatively clean and was collected for training and evaluating visual sound source separation models.
      • Datasets (3 of 3)
          ◦ AtinPiano dataset [35]
              - AtinPiano is a YouTube channel of piano video recordings.
              - The videos are filmed with the camera looking down on the keyboard and the hands.
  • 7. Related Work
      • Sound Separation
          ◦ Sound separation is a central problem in audio signal processing [33, 22].
          ◦ Classic solutions were based on Non-negative Matrix Factorization (NMF) [51, 11, 47].
              - These are not very effective because they rely on low-level correlations in the signals.
              - NMF: a matrix V is represented by two smaller matrices W and H.
          ◦ Deep learning-based methods have been taking over in recent years.
              - CNN models predict time-frequency masks for music source separation and enhancement [46, 9].
          ◦ A spectrogram classification model cannot handle an arbitrary number of simultaneous speakers. The following work addresses this:
              - Deep clustering (Hershey et al.) [24]
              - Speaker-independent training framework (Yu et al.) [54]
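To make the NMF bullet concrete, here is a minimal sketch of factoring a non-negative matrix V into W and H. It assumes the standard multiplicative update rule for squared error; the slides do not say which NMF variant the cited works use, and the function name `nmf` and the toy matrix sizes are illustrative only.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (F x T) as W (F x rank) @ H (rank x T)."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "magnitude spectrogram" built from two spectral templates (made-up data).
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 30))

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In a separation setting, each column of W would act as a spectral template and each row of H as its activation over time; grouping templates by source is the step that relies on the low-level correlations mentioned above.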
  • 8. Related Work
      • Visual Sound Separation
          ◦ Research question: how can semantic appearance help with sound separation?
          ◦ Prior work has limited capability to capture motion cues, which restricts its applicability to harder sound source separation problems.
          ◦ No prior work had explicitly modeled body movement cues with key point-based structured representations for audio-visual learning (this paper does).
      • Audio-Visual Learning
          ◦ Thanks to deep neural networks, a series of works on audio-visual learning has been published in the past few years.
          ◦ Good audio-visual representations can be obtained by learning the audio model and the image model jointly, or separately via knowledge distillation.
          ◦ Examples: sounding object localization, sound generation for videos, 360/stereo sound from videos, and so forth.
  • 9. Related Work
      • Audio and Body Dynamics
          ◦ Much research has been conducted on the relation between faces and audio.
              - Examples: generating a talking face from audio [48, 27], separating mixed speech signals of multiple speakers [12], lip reading from raw videos [10], etc.
          ◦ In contrast, the correlations between body pose and audio have been less explored.
              - Examples: predicting dynamics from music [44], body rhythms from speech [20].
          ◦ Figures: lip reading from raw videos; generating a talking face from audio.
  • 10. Approach
      • Model Architecture – Overview
          ◦ Goal: associate body dynamics with audio signals for sound source separation.
          ◦ Adopt the commonly used “mix-and-separate” self-supervised training procedure [56].
          ◦ Create synthetic training data by mixing arbitrary sound sources from different video clips, then separate each sound from the mixture conditioned on its associated visual context.
  • 11. Approach
      • Model Architecture – Overview
          ◦ Two major components: the Video Analysis Network and the A/V Separation Network.
          ◦ During training, N videos are randomly selected, each with paired video frames and audio signal {Vk, Sk}.
          ◦ Their audio signals are then mixed by a linear combination of the audio inputs to form a synthetic mixture Smix.
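The mixing step of mix-and-separate can be sketched as follows. This is an illustration only: the sample rate, the sine-wave "sources", and the helper name `make_mixture` are assumptions, and the slides do not state which mixing weights the authors use (unit weights are assumed here).

```python
import numpy as np

def make_mixture(sources, weights=None):
    """Linearly combine N equal-length waveforms S_k into one mixture S_mix."""
    stacked = np.stack(sources)                      # shape (N, num_samples)
    if weights is None:
        weights = np.ones(len(sources))              # assumption: plain sum
    return np.tensordot(np.asarray(weights), stacked, axes=1)

# Two stand-in "solo recordings" (hypothetical sample rate and pitches).
sr = 11025
t = np.arange(sr) / sr
s1 = 0.5 * np.sin(2 * np.pi * 440.0 * t)
s2 = 0.5 * np.sin(2 * np.pi * 660.0 * t)

s_mix = make_mixture([s1, s2])
# Because the mixture is synthetic, s1 and s2 remain available as training targets.
```

The point of the design is that the supervision signal comes for free: the network sees only the mixture and the video, while the original solo tracks serve as targets.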
  • 12. Approach
      • Model Architecture – Overview
          ◦ Given a video clip Vk, the Video Analysis Network extracts global context and body dynamics features from the video.
          ◦ The A/V Separation Network separates the audio signal Sk from Smix according to its visual context Vk.
          ◦ The network is trained in a supervised fashion, but it learns from unlabeled video data (self-supervised).
  • 13. Approach
      • Model Architecture – Video Analysis Network
          ◦ ResNet-50 [23]: extracts global semantic features from video frames.
              - Features are taken after the last spatial average pooling layer, from the first frame of each Vk.
              - This yields a 2048-dimensional context feature vector for each Vk (Da = 2048).
          ◦ AlphaPose toolbox [13]: estimates the 2D locations of human body joints.
              - Obtains 18 key points for the human body.
          ◦ OpenPose hand API [8, 45]: estimates the coordinates of hand key points.
              - Obtains 21 key points for each hand.
              - N = 18 + 21 = 39 key points in total.
  • 14. Approach
      • Model Architecture – Video Analysis Network
          ◦ Context-Aware Graph CNN: fuses the semantic context of instruments with human body dynamics.
              - An undirected spatial-temporal graph G = {V, E} is built on a human skeleton sequence.
              - Each node vi in V corresponds to a key point of the human body.
              - Edges reflect the natural connectivity of body key points.
          ◦ Notation
              - fv: the encoded pose feature
              - N: the number of key points
              - Dn: the feature dimension of each input node
              - X: the input features
              - Â: the row-normalized adjacency matrix
              - Ws, Wt: the weight matrices of the spatial graph convolution and the 2D convolution, respectively
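As a rough illustration of the notation above, the sketch below applies one spatial graph convolution step, Â X Ws, to a toy skeleton. It is a sketch under stated assumptions, not the authors' layer: the self-loops added before normalization, the 5-node chain graph, and the function names are all hypothetical, and the subsequent convolution with Wt is omitted.

```python
import numpy as np

def row_normalize(A):
    """Return a row-normalized adjacency matrix (self-loops added as an assumption)."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def spatial_graph_conv(X, A_norm, W_s):
    """X: (N, T, D_n) node features over time; returns (N, T, D_out)."""
    # Each key point averages the features of itself and its skeleton neighbours.
    aggregated = np.einsum("ij,jtd->itd", A_norm, X)
    # A shared linear map W_s is then applied to every node at every time step.
    return aggregated @ W_s

# Toy skeleton: 5 key points connected in a chain, 4 frames, 2-D coordinates.
N, T, D_n, D_out = 5, 4, 2, 3
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0                      # undirected edges

rng = np.random.default_rng(0)
X = rng.random((N, T, D_n))
W_s = rng.random((D_n, D_out))

A_norm = row_normalize(A)
out = spatial_graph_conv(X, A_norm, W_s)
```

Row normalization keeps the aggregated features on the same scale regardless of how many neighbours a key point has.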
  • 15. Approach
      • Model Architecture – Audio-Visual Separation Network
          ◦ Takes as input the spectrogram of the mixture audio together with the visual representation produced by the Video Analysis Network, then predicts a spectrogram mask and generates the audio signal for the selected video.
          ◦ Audio Network
              - Adopts a U-Net style architecture [41].
              - Consists of 4 dilated convolution layers and 4 dilated up-convolution layers.
              - Uses 3×3 spatial filters with stride 2 and dilation 1.
              - Each layer is followed by BatchNorm and Leaky ReLU.
              - Input: the 2D time-frequency spectrogram of Smix.
              - Output: a binary spectrogram mask of the same size.
          ◦ Figure: dilated convolution (atrous convolution).
  • 16. Approach
      • Model Architecture – Audio-Visual Separation Network
          ◦ Audio-Visual Fusion
              - Adopts a self-attention [50] based cross-modal early fusion module to capture the correlations between body movements and sound signals.
              - MLP in the fusion model figure: Multi-Layer Perceptron (2 fully connected layers with ReLU).
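The slide does not spell out the fusion module, so the following is only a generic scaled dot-product attention sketch of how audio features could attend to pose features before being combined. It should not be read as the paper's exact module: the feature shapes, the concatenation at the end, and the name `cross_modal_attention` are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feat, visual_feat):
    """audio_feat: (T_a, D) queries; visual_feat: (T_v, D) keys and values."""
    d = audio_feat.shape[-1]
    scores = audio_feat @ visual_feat.T / np.sqrt(d)      # (T_a, T_v) similarities
    weights = softmax(scores, axis=-1)                    # each audio step attends over video steps
    attended = weights @ visual_feat                      # (T_a, D) visual summary per audio step
    fused = np.concatenate([audio_feat, attended], axis=-1)
    return fused, weights

rng = np.random.default_rng(0)
audio_feat = rng.standard_normal((6, 4))    # hypothetical: 6 audio time steps
visual_feat = rng.standard_normal((3, 4))   # hypothetical: 3 pose time steps

fused, weights = cross_modal_attention(audio_feat, visual_feat)
```

In a trained model the queries, keys, and values would pass through learned projections first; those are left out here to keep the sketch short.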
  • 17. Approach
      • Model Architecture – Training and Inference
          ◦ The learning objective is to estimate a binary mask Mk.
              - The ground-truth (GT) mask of the k-th video is computed from whether the target sound is the dominant component of the input mixture on the magnitude spectrogram S.
              - The network is trained by minimizing the per-pixel sigmoid cross-entropy loss between the estimated masks and the GT masks.
              - The predicted mask is then thresholded and multiplied with the complex STFT (Short-Time Fourier Transform) coefficients of the input to obtain the predicted sound spectrogram.
              - Finally, an iSTFT (inverse STFT) with the same transformation parameters is applied to the predicted spectrogram to reconstruct the waveform of the separated sound.
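The mask-related steps above can be sketched in NumPy as follows. The dominance rule (a bin belongs to source k if its magnitude is the largest among all sources), the 0.5 threshold, and all function names are assumptions made for illustration; the STFT and iSTFT themselves are left out, and random arrays stand in for real spectrograms.

```python
import numpy as np

def ground_truth_mask(source_mags, k):
    """source_mags: (N, F, T) magnitudes; 1 where source k dominates the bin."""
    return (source_mags[k] >= source_mags.max(axis=0)).astype(np.float64)

def sigmoid_cross_entropy(logits, targets):
    """Mean per-pixel sigmoid cross-entropy, written in a numerically stable form."""
    return np.mean(np.maximum(logits, 0.0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

def apply_mask(pred_prob, mix_stft, threshold=0.5):
    """Threshold the predicted mask and apply it to the complex mixture STFT."""
    binary = (pred_prob > threshold).astype(np.float64)
    return binary * mix_stft

rng = np.random.default_rng(0)
source_mags = rng.random((2, 8, 10))                  # stand-in magnitudes for 2 sources
mix_stft = rng.standard_normal((8, 10)) + 1j * rng.standard_normal((8, 10))

gt0 = ground_truth_mask(source_mags, 0)
gt1 = ground_truth_mask(source_mags, 1)

# A confident, correct prediction gives a loss close to zero.
good_logits = (2.0 * gt0 - 1.0) * 20.0
loss = sigmoid_cross_entropy(good_logits, gt0)

pred_prob = 1.0 / (1.0 + np.exp(-good_logits))
separated_stft = apply_mask(pred_prob, mix_stft)      # would be fed to an iSTFT next
```

Multiplying the mask with the complex coefficients reuses the phase of the mixture, which is why only a magnitude-domain mask needs to be predicted.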
  • 18. Experiments
      • Hetero-musical Separation
          ◦ Evaluates how well the model separates sounds from different kinds of instruments on the MUSIC-21 dataset.
          ◦ Metrics: SDR (Source-to-Distortion Ratio) and SIR (Source-to-Interference Ratio).
          ◦ Six state-of-the-art systems are used for comparison:
              - NMF [51], Deep Separation [9], MIML [17], Sound of Pixels [56], Co-separation [19], Sound of Motions [55]
          ◦ The model is trained and evaluated on mix-2 and mix-3 samples.
          ◦ Because the MUSIC dataset does not have GT labels for quantitative evaluation, the researchers construct a synthetic test set by mixing solo videos.
          ◦ Model performance is reported on a validation set with 256 pairs of sound mixtures.
          ◦ Implementation: PyTorch framework; GCN model with 11 layers and residual connections.
          ◦ A subjective human study is additionally conducted on Amazon Mechanical Turk (AMT) using real mixture videos from the MUSIC-21 and URMP datasets.
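For intuition about the SDR metric, the sketch below computes a simplified energy ratio between a reference source and the estimation error. This is an assumption-laden simplification: the SDR and SIR reported in separation papers are usually computed with a BSS Eval style decomposition of the error, which this helper (`simple_sdr`, a made-up name) does not perform.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: energy of the reference over energy of the error."""
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
reference = np.sin(2 * np.pi * 5.0 * t)

rough_estimate = 1.1 * reference      # error has 10% of the reference amplitude
close_estimate = 1.01 * reference     # error has 1% of the reference amplitude

sdr_rough = simple_sdr(reference, rough_estimate)
sdr_close = simple_sdr(reference, close_estimate)
```

Higher values mean a cleaner estimate: an amplitude error ten times smaller raises this ratio by 20 dB.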
  • 20. Experiments
      • Homo-musical Separation
          ◦ Evaluates how well the model separates sounds from the same kind of instrument on the MUSIC-21 dataset.
          ◦ Five kinds of musical instruments whose sounds are closely related to body dynamics are selected:
              - Trumpet, flute, piano, violin, and cello
          ◦ The model is pre-trained on multiple-instrument separation, then learns to separate the same instrument.
          ◦ The model is compared with Sound of Motions (SoM).
          ◦ Results are measured by both automatic SDR scores and human evaluations on AMT.
  • 21. Conclusions and Future Work
      ◦ Performs better on audio-visual source separation of different instruments.
      ◦ Achieves remarkable results on separating sounds of the same instrument, for the 5 selected instruments.
      ◦ Paves a new research direction: exploiting body dynamic motions with structured key point-based video representations to guide sound source separation.
      ◦ Proposes a novel audio-visual fusion module to associate human body motion cues with sound signals.
      ◦ Future work: try different modules or architectures to improve the separation results.
      ◦ Future work: make use of these datasets or modules in our own research.