SlideShare a Scribd company logo
Institute of
Computational
Perception
Emotion and Theme Recognition in Music with
Frequency-Aware Receptive Field Regularized CNNs
Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid,
Hamid Eghbal-Zadeh, Gerhard Widmer
Agenda
■ Overview
■ Approaches
■ What worked
■ What didn’t work
■ Results
■ Conclusion
2
Takeaways
■ Insights
Make the network see less of the input to avoid overfitting.
Frequency context helps.
Ensembling helps.
■ Results
Validation PR-AUC Testing PR-AUC
0.1189 0.1546
- 0.1077 (baseline)
3
Overview
■ Emotion and theme recognition task – instance of auto-tagging.
■ Current state-of-the-art in many audio tasks, including audio-tagging uses
CNNs in a VGG-like architecture.
■ Input: Time-Frequency representation of audio (Spectrograms)
4
Overview
■ The success of CNNs started in computer vision tasks.
■ Later improvements on CNNs architectures made the networks deeper.
■ Deeper architectures such as ResNet and DenseNet don’t perform as well
in audio processing tasks.
5
Approach: Baseline Models
■ VGG-like
■ ResNet18, ResNet34, ResNet50
■ CRNN – Convolutional Recurrent Neural Network
Model Validation PR-AUC Testing PR-AUC
VGG-like - 0.1077
ResNet34 0.0924 0.1021
CRNN 0.0924 0.1172
6
Deeper = Better?
■ ResNet [1] and DenseNet [2] variants outperform earlier (and shallower)
VGG-based [3] variants by a significant margin (Vision tasks).
■ They address shortcomings of VGG such as the vanishing gradient.
Sources: Figures from https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803
7
ConvNet
ResNet
DenseNet
Deeper = Better? Can lead to overfitting
Hershey et al. [8] compared various vision CNNs on a large-scale dataset
of 70M audio clips from YouTube.
■ ResNet-50 can perform very well.
■ However, training such deep architectures on smaller datasets results in heavy
overfitting on the training samples.
8
The Receptive Field in CNNs
■ In fully-connected layers, each neuron is affected by the whole input. In
contrast, in convolutional layers each neuron has a strictly limited ‘field of
view’ (RF).
■ Input values outside of this RF cannot influence the neuron’s activation.
■ The maximum RF can be calculated:
9
sn, kn are stride and kernel size of layer n, respectively, and Sn, RFn are cumulative stride and
RF of a unit from layer n to the network input.
Mel-Frequency
DenseNet ResNet VGG
el-Frequency
DN1 RN1 RN2 RN3
The Effective Receptive Field in CNNs
■ A neuron may not actually use all the available receptive field.
■ The set of input pixels or units that effectively influence a neuron is called
its Effective Receptive Field (ERF) by Luo et al. [14]
10
Adapting the RF of Vision Architectures
■ Changing filter sizes to change the maximum receptive field.
■ We changed some filter sizes from 3 × 3 to 1 × 1.
■ Filter size is a hyperparameter.
11
Results with Receptive Field Adaptation
12
Frequency Aware Networks
■ Drawback of CNN for audio domain: lack of spatial ordering in
convolutional layers.
■ Solution: Add a channel that encodes frequency as a value in [-1, 1].
13
BN - Conv - ReLU
Shake-Shake Regularization
Image source: Gastaldi, X., 2017. Shake-shake regularization. arXiv preprint arXiv:1705.07485.
14
Ensembling
■ Stochastic Weight Averaging
– Maintain a paired network with running average of weights
■ Snapshot Averaging
– Average the predictions of the 5 snapshots of the model during
training
■ Multi-model averaging
– Average predictions from models with different initializations, RFs, etc.
15
Results
16
Conclusions
■ Receptive field regularization is useful to avoid overfitting.
■ Frequency-aware networks improve performance (spectrogram as input).
■ Ensembling improved results.
■ Future work:
– Temporal context?
– Perceptual features?
17
JOHANNES KEPLER
UNIVERSITY LINZ
Altenberger Str. 69
4040 Linz, Austria
www.jku.at
Thank you!
Questions?
shreyan.chowdhury@jku.at

More Related Content

What's hot

Digital signal processing part1
Digital signal processing part1Digital signal processing part1
Digital signal processing part1
Vaagdevi College of Engineering
 
09 spread spectrum
09 spread spectrum09 spread spectrum
09 spread spectrum
Kashif Amjad
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural network
eSAT Journals
 
Digital communication (DSSS)
Digital communication  (DSSS)Digital communication  (DSSS)
Digital communication (DSSS)
SSGMCE SHEGAON
 
Dnc and bluetooth
Dnc and bluetoothDnc and bluetooth
Dnc and bluetooth
ahmad almaleh
 
DSP_2018_FOEHU - Lec 0 - Course Outlines
DSP_2018_FOEHU - Lec 0 - Course OutlinesDSP_2018_FOEHU - Lec 0 - Course Outlines
DSP_2018_FOEHU - Lec 0 - Course Outlines
Amr E. Mohamed
 
Pulse code modulation
Pulse code modulationPulse code modulation
Pulse code modulation
Abhijay Sisodia
 
Text independent speaker recognition system
Text independent speaker recognition systemText independent speaker recognition system
Text independent speaker recognition system
Deepesh Lekhak
 
Spread spectrum
Spread spectrumSpread spectrum
Spread spectrum
mpsrekha83
 
DSP_FOEHU - Lec 13 - Digital Signal Processing Applications I
DSP_FOEHU - Lec 13 - Digital Signal Processing Applications IDSP_FOEHU - Lec 13 - Digital Signal Processing Applications I
DSP_FOEHU - Lec 13 - Digital Signal Processing Applications I
Amr E. Mohamed
 
Environmentally robust ASR front end for DNN-based acoustic models
Environmentally robust ASR front end for DNN-based acoustic modelsEnvironmentally robust ASR front end for DNN-based acoustic models
Environmentally robust ASR front end for DNN-based acoustic models
Takuya Yoshioka
 
Speech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognitionSpeech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognition
Takuya Yoshioka
 
Pulse code modulation
Pulse code modulationPulse code modulation
Pulse code modulation
Gec bharuch
 
Final ppt
Final pptFinal ppt
Final ppt
Sharu Sparky
 
Speech based password authentication system on FPGA
Speech based password authentication system on FPGASpeech based password authentication system on FPGA
Speech based password authentication system on FPGA
Rajesh Roshan
 
UMKC Dynamics of BER smaller
UMKC Dynamics of BER smallerUMKC Dynamics of BER smaller
UMKC Dynamics of BER smaller
AnirudhSrinivas Prativadi Bhayankaram
 
Pulse Code Modulation
Pulse Code ModulationPulse Code Modulation
Pulse Code Modulation
Ridwanul Hoque
 
Speech compression-using-gsm
Speech compression-using-gsmSpeech compression-using-gsm
Speech compression-using-gsm
Nghĩa Trần Trung
 
Field Acqusition Paramater Design
Field Acqusition Paramater DesignField Acqusition Paramater Design
Field Acqusition Paramater Design
Ali Osman Öncel
 
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
ijsrd.com
 

What's hot (20)

Digital signal processing part1
Digital signal processing part1Digital signal processing part1
Digital signal processing part1
 
09 spread spectrum
09 spread spectrum09 spread spectrum
09 spread spectrum
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural network
 
Digital communication (DSSS)
Digital communication  (DSSS)Digital communication  (DSSS)
Digital communication (DSSS)
 
Dnc and bluetooth
Dnc and bluetoothDnc and bluetooth
Dnc and bluetooth
 
DSP_2018_FOEHU - Lec 0 - Course Outlines
DSP_2018_FOEHU - Lec 0 - Course OutlinesDSP_2018_FOEHU - Lec 0 - Course Outlines
DSP_2018_FOEHU - Lec 0 - Course Outlines
 
Pulse code modulation
Pulse code modulationPulse code modulation
Pulse code modulation
 
Text independent speaker recognition system
Text independent speaker recognition systemText independent speaker recognition system
Text independent speaker recognition system
 
Spread spectrum
Spread spectrumSpread spectrum
Spread spectrum
 
DSP_FOEHU - Lec 13 - Digital Signal Processing Applications I
DSP_FOEHU - Lec 13 - Digital Signal Processing Applications IDSP_FOEHU - Lec 13 - Digital Signal Processing Applications I
DSP_FOEHU - Lec 13 - Digital Signal Processing Applications I
 
Environmentally robust ASR front end for DNN-based acoustic models
Environmentally robust ASR front end for DNN-based acoustic modelsEnvironmentally robust ASR front end for DNN-based acoustic models
Environmentally robust ASR front end for DNN-based acoustic models
 
Speech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognitionSpeech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognition
 
Pulse code modulation
Pulse code modulationPulse code modulation
Pulse code modulation
 
Final ppt
Final pptFinal ppt
Final ppt
 
Speech based password authentication system on FPGA
Speech based password authentication system on FPGASpeech based password authentication system on FPGA
Speech based password authentication system on FPGA
 
UMKC Dynamics of BER smaller
UMKC Dynamics of BER smallerUMKC Dynamics of BER smaller
UMKC Dynamics of BER smaller
 
Pulse Code Modulation
Pulse Code ModulationPulse Code Modulation
Pulse Code Modulation
 
Speech compression-using-gsm
Speech compression-using-gsmSpeech compression-using-gsm
Speech compression-using-gsm
 
Field Acqusition Paramater Design
Field Acqusition Paramater DesignField Acqusition Paramater Design
Field Acqusition Paramater Design
 
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
 

Similar to Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs

WaveNet
WaveNetWaveNet
WaveNet
AbeyHurtis
 
Frame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdfFrame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdf
ssuser849b73
 
USRP Project Final Report
USRP Project Final ReportUSRP Project Final Report
USRP Project Final Report
Arjan Gupta
 
D1103023339
D1103023339D1103023339
D1103023339
IOSR Journals
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
NAVER Engineering
 
Direct digital frequency synthesizer
Direct digital frequency synthesizerDirect digital frequency synthesizer
Direct digital frequency synthesizer
Venkat Malai Avichi
 
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
changedaeoh
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
Junaid Bhat
 
VOICE CONTROLLED WHEELCHAIR using Amharic.pdf
VOICE CONTROLLED WHEELCHAIR using Amharic.pdfVOICE CONTROLLED WHEELCHAIR using Amharic.pdf
VOICE CONTROLLED WHEELCHAIR using Amharic.pdf
Mubarek kebede
 
Max_Poster_FINAL
Max_Poster_FINALMax_Poster_FINAL
Max_Poster_FINAL
Max Robertson
 
Portable Ground Receving Station
Portable Ground Receving StationPortable Ground Receving Station
Portable Ground Receving Station
Jude Ogbonna
 
Implementation of Wide Band Frequency Synthesizer Base on DFS (Digital Frequ...
Implementation of Wide Band Frequency Synthesizer Base on  DFS (Digital Frequ...Implementation of Wide Band Frequency Synthesizer Base on  DFS (Digital Frequ...
Implementation of Wide Band Frequency Synthesizer Base on DFS (Digital Frequ...
IJMER
 
Conv-TasNet.pdf
Conv-TasNet.pdfConv-TasNet.pdf
Conv-TasNet.pdf
ssuser849b73
 
NASA Fundamental of FSO.pdf
NASA Fundamental of FSO.pdfNASA Fundamental of FSO.pdf
NASA Fundamental of FSO.pdf
zoohir
 
444070326 antenna-lab-57-200
444070326 antenna-lab-57-200444070326 antenna-lab-57-200
444070326 antenna-lab-57-200
Huynh MVT
 
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
Naoki Shibata
 
Multiband Transceivers - [Chapter 4] Design Parameters of Wireless Radios
Multiband Transceivers - [Chapter 4] Design Parameters of Wireless RadiosMultiband Transceivers - [Chapter 4] Design Parameters of Wireless Radios
Multiband Transceivers - [Chapter 4] Design Parameters of Wireless Radios
Simen Li
 
Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...
Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...
Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...
IRJET Journal
 
Spread spectrum
Spread spectrumSpread spectrum
Spread spectrum
Patruni Chidananda Sastry
 
4 g lte_drive_test_parameters
4 g lte_drive_test_parameters4 g lte_drive_test_parameters
4 g lte_drive_test_parameters
Aryan Chaturanan
 

Similar to Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs (20)

WaveNet
WaveNetWaveNet
WaveNet
 
Frame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdfFrame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdf
 
USRP Project Final Report
USRP Project Final ReportUSRP Project Final Report
USRP Project Final Report
 
D1103023339
D1103023339D1103023339
D1103023339
 
Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
 
Direct digital frequency synthesizer
Direct digital frequency synthesizerDirect digital frequency synthesizer
Direct digital frequency synthesizer
 
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
VOICE CONTROLLED WHEELCHAIR using Amharic.pdf
VOICE CONTROLLED WHEELCHAIR using Amharic.pdfVOICE CONTROLLED WHEELCHAIR using Amharic.pdf
VOICE CONTROLLED WHEELCHAIR using Amharic.pdf
 
Max_Poster_FINAL
Max_Poster_FINALMax_Poster_FINAL
Max_Poster_FINAL
 
Portable Ground Receving Station
Portable Ground Receving StationPortable Ground Receving Station
Portable Ground Receving Station
 
Implementation of Wide Band Frequency Synthesizer Base on DFS (Digital Frequ...
Implementation of Wide Band Frequency Synthesizer Base on  DFS (Digital Frequ...Implementation of Wide Band Frequency Synthesizer Base on  DFS (Digital Frequ...
Implementation of Wide Band Frequency Synthesizer Base on DFS (Digital Frequ...
 
Conv-TasNet.pdf
Conv-TasNet.pdfConv-TasNet.pdf
Conv-TasNet.pdf
 
NASA Fundamental of FSO.pdf
NASA Fundamental of FSO.pdfNASA Fundamental of FSO.pdf
NASA Fundamental of FSO.pdf
 
444070326 antenna-lab-57-200
444070326 antenna-lab-57-200444070326 antenna-lab-57-200
444070326 antenna-lab-57-200
 
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
 
Multiband Transceivers - [Chapter 4] Design Parameters of Wireless Radios
Multiband Transceivers - [Chapter 4] Design Parameters of Wireless RadiosMultiband Transceivers - [Chapter 4] Design Parameters of Wireless Radios
Multiband Transceivers - [Chapter 4] Design Parameters of Wireless Radios
 
Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...
Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...
Design Band Pass FIR Digital Filter for Cut off Frequency Calculation Using A...
 
Spread spectrum
Spread spectrumSpread spectrum
Spread spectrum
 
4 g lte_drive_test_parameters
4 g lte_drive_test_parameters4 g lte_drive_test_parameters
4 g lte_drive_test_parameters
 

More from multimediaeval

Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...
Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...
Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...
multimediaeval
 
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...
multimediaeval
 
Sports Video Classification: Classification of Strokes in Table Tennis for Me...
Sports Video Classification: Classification of Strokes in Table Tennis for Me...Sports Video Classification: Classification of Strokes in Table Tennis for Me...
Sports Video Classification: Classification of Strokes in Table Tennis for Me...
multimediaeval
 
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...
multimediaeval
 
Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task
Essex-NLIP at MediaEval Predicting Media Memorability 2020 TaskEssex-NLIP at MediaEval Predicting Media Memorability 2020 Task
Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task
multimediaeval
 
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...
multimediaeval
 
Fooling an Automatic Image Quality Estimator
Fooling an Automatic Image Quality EstimatorFooling an Automatic Image Quality Estimator
Fooling an Automatic Image Quality Estimator
multimediaeval
 
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...
multimediaeval
 
Pixel Privacy: Quality Camouflage for Social Images
Pixel Privacy: Quality Camouflage for Social ImagesPixel Privacy: Quality Camouflage for Social Images
Pixel Privacy: Quality Camouflage for Social Images
multimediaeval
 
HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching
HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-MatchingHCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching
HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching
multimediaeval
 
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
multimediaeval
 
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...
multimediaeval
 
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...
multimediaeval
 
Deep Conditional Adversarial learning for polyp Segmentation
Deep Conditional Adversarial learning for polyp SegmentationDeep Conditional Adversarial learning for polyp Segmentation
Deep Conditional Adversarial learning for polyp Segmentation
multimediaeval
 
A Temporal-Spatial Attention Model for Medical Image Detection
A Temporal-Spatial Attention Model for Medical Image DetectionA Temporal-Spatial Attention Model for Medical Image Detection
A Temporal-Spatial Attention Model for Medical Image Detection
multimediaeval
 
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
multimediaeval
 
Fine-tuning for Polyp Segmentation with Attention
Fine-tuning for Polyp Segmentation with AttentionFine-tuning for Polyp Segmentation with Attention
Fine-tuning for Polyp Segmentation with Attention
multimediaeval
 
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
multimediaeval
 
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
multimediaeval
 
Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
 Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ... Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
multimediaeval
 

More from multimediaeval (20)

Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...
Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...
Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal...
 
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table...
 
Sports Video Classification: Classification of Strokes in Table Tennis for Me...
Sports Video Classification: Classification of Strokes in Table Tennis for Me...Sports Video Classification: Classification of Strokes in Table Tennis for Me...
Sports Video Classification: Classification of Strokes in Table Tennis for Me...
 
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention...
 
Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task
Essex-NLIP at MediaEval Predicting Media Memorability 2020 TaskEssex-NLIP at MediaEval Predicting Media Memorability 2020 Task
Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task
 
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V...
 
Fooling an Automatic Image Quality Estimator
Fooling an Automatic Image Quality EstimatorFooling an Automatic Image Quality Estimator
Fooling an Automatic Image Quality Estimator
 
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C...
 
Pixel Privacy: Quality Camouflage for Social Images
Pixel Privacy: Quality Camouflage for Social ImagesPixel Privacy: Quality Camouflage for Social Images
Pixel Privacy: Quality Camouflage for Social Images
 
HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching
HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-MatchingHCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching
HCMUS at MediaEval 2020:Image-Text Fusion for Automatic News-Images Re-Matching
 
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
 
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ...
 
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int...
 
Deep Conditional Adversarial learning for polyp Segmentation
Deep Conditional Adversarial learning for polyp SegmentationDeep Conditional Adversarial learning for polyp Segmentation
Deep Conditional Adversarial learning for polyp Segmentation
 
A Temporal-Spatial Attention Model for Medical Image Detection
A Temporal-Spatial Attention Model for Medical Image DetectionA Temporal-Spatial Attention Model for Medical Image Detection
A Temporal-Spatial Attention Model for Medical Image Detection
 
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
 
Fine-tuning for Polyp Segmentation with Attention
Fine-tuning for Polyp Segmentation with AttentionFine-tuning for Polyp Segmentation with Attention
Fine-tuning for Polyp Segmentation with Attention
 
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
 
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
 
Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
 Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ... Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
 

Recently uploaded

JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 

Recently uploaded (20)

JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 

Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs

  • 1. Institute of Computational Perception Emotion and Theme Recognition in Music with Frequency-Aware Receptive Field Regularized CNNs Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-Zadeh, Gerhard Widmer
  • 2. Agenda ■ Overview ■ Approaches ■ What worked ■ What didn’t work ■ Results ■ Conclusion 2
  • 3. Takeaways ■ Insights Make the network see less of the input to avoid overfitting. Frequency context helps. Ensembling helps. ■ Results Validation PR-AUC Testing PR-AUC 0.1189 0.1546 - 0.1077 (baseline) 3
  • 4. Overview ■ Emotion and theme recognition task – instance of auto-tagging. ■ Current state-of-the-art in many audio tasks, including audio-tagging uses CNNs in a VGG-like architecture. ■ Input: Time-Frequency representation of audio (Spectrograms) 4
  • 5. Overview ■ The success of CNNs started in computer vision tasks. ■ Later improvements on CNNs architectures made the networks deeper. ■ Deeper architectures such as ResNet and DenseNet don’t perform as well in audio processing tasks. 5
  • 6. Approach: Baseline Models ■ VGG-like ■ ResNet18, ResNet34, ResNet50 ■ CRNN – Convolutional Recurrent Neural Network Model Validation PR-AUC Testing PR-AUC VGG-like - 0.1077 ResNet34 0.0924 0.1021 CRNN 0.0924 0.1172 6
  • 7. Deeper = Better? ■ ResNet [1] and DenseNet [2] variants outperform earlier (and shallower) VGG-based [3] variants by a significant margin (Vision tasks). ■ They address shortcomings of VGG such as the vanishing gradient. Sources: Figures from https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803 7 ConvNet ResNet DenseNet
  • 8. Deeper = Better? Can lead to overfitting Hershey et al. [8] compared various vision CNNs on a large-scale dataset of 70M audio clips from YouTube. ■ ResNet-50 can perform very well. ■ However, training such deep architectures on smaller datasets results in heavy overfitting on the training samples. 8
  • 9. The Receptive Field in CNNs ■ In fully-connected layers, each neuron is affected by the whole input. In contrast, in convolutional layers each neuron has a strictly limited ‘field of view’ (RF). ■ Input values outside of this RF cannot influence the neuron’s activation. ■ The maximum RF can be calculated: 9 sn, kn are stride and kernel size of layer n, respectively, and Sn, RFn are cumulative stride and RF of a unit from layer n to the network input.
  • 10. Mel-Frequency DenseNet ResNet VGG el-Frequency DN1 RN1 RN2 RN3 The Effective Receptive Field in CNNs ■ A neuron may not actually use all the available receptive field. ■ The set of input pixels or units that effectively influence a neuron is called its Effective Receptive Field (ERF) by Luo et al. [14] 10
  • 11. Adapting the RF of Vision Architectures ■ Changing filter sizes to change the maximum receptive field. ■ We changed some filter sizes from 3 × 3 to 1 × 1. ■ Filter size is a hyperparameter. 11
  • 12. Results with Receptive Field Adaptation 12
  • 13. Frequency Aware Networks ■ Drawback of CNN for audio domain: lack of spatial ordering in convolutional layers. ■ Solution: Add a channel that encodes frequency as a value in [-1, 1]. 13 BN - Conv - ReLU
  • 14. Shake-Shake Regularization Image source: Gastaldi, X., 2017. Shake-shake regularization. arXiv preprint arXiv:1705.07485. 14
  • 15. Ensembling ■ Stochastic Weight Averaging – Maintain a paired network with running average of weights ■ Snapshot Averaging – Average the predictions of the 5 snapshots of the model during training ■ Multi-model averaging – Average predictions from models with different initializations, RFs, etc. 15
  • 17. Conclusions ■ Receptive field regularization is useful to avoid overfitting. ■ Frequency-aware networks improve performance (spectrogram as input). ■ Ensembling improved results. ■ Future work: – Temporal context? – Perceptual features? 17
  • 18. JOHANNES KEPLER UNIVERSITY LINZ Altenberger Str. 69 4040 Linz, Austria www.jku.at Thank you! Questions? shreyan.chowdhury@jku.at