A Hybrid Approach with
Multi-channel I-Vectors and
Convolutional Neural Networks for
Acoustic Scene Classification
Hamid Eghbal-zadeh, Bernhard Lehner, Matthias Dorfer and Gerhard Widmer
A closer look into our winning submission
at the IEEE DCASE-2016 challenge¹
for Acoustic Scene Classification
¹ www.cs.tut.fi/sgn/arg/dcase2016/
Introduction
 
What is Acoustic Scene Classification (ASC)?
photo credit: www.cs.tut.fi/sgn/arg/dcase2016/
1/23
 
Challenges in ASC: Overlapping events
Park City Center
photo credit: www.cs.tut.fi/sgn/arg/dcase2016/
2/23
 
Challenges in ASC: Session Variability
Vienna! Athens!
Υγειά μας! Prost! ("Cheers!" in Greek and German)
photo credit: www.google.com
3/23
 
ASC is not that easy ...
photo credit: www.memegen.com
Methods for ASC
 
Deeeeeep learning
⬛ Pros:
⬜ A powerful method for supervised learning
⬜ Convolutional Neural Networks (CNNs)
⬜ Spectrograms as images
⬜ Feature learning
⬜ Successfully applied to images, speech and music
⬛ Cons:
⬜ Confusion of classes when dealing with noisy scenes and blurry spectrograms
⬜ Poor generalization and overfitting if the training data does not contain various recording sessions
Piczak, K. J., et al. "Environmental sound classification with convolutional neural networks.", 2015.
photo credit: Yann LeCun's slides at the NIPS 2016 keynote
4/23
 
Factor Analysis
⬛ Pros:
⬜ Session variability reduction
⬜ Use of a Universal Background Model (UBM)
⬜ Better generalization due to the unsupervised methodology
⬜ Successfully applied to sequential data such as speech and music
⬛ Cons:
⬜ Relies on engineered features
⬜ Limited ability to use specialized features for acoustic scene analysis, because of the independence and Gaussian assumptions in FA
[1] Dehak, Najim, et al. "Front-end factor analysis for speaker verification.", 2011.
[2] Elizalde, B., et al. "An i-vector based approach for audio scene detection.", 2013.
5/23
 
A Hybrid system to overcome the complexities ...
photo credit: www.imgflip.com
 
A hybrid approach to ASC
⬛ We combine a CNN with an I-Vector based ASC system:
⬜ A CNN is trained on spectrograms
⬜ I-Vector features (based on FA) are extracted from MFCCs
⬛ Late fusion
⬜ A score fusion technique is used to combine the two methods
⬛ Model averaging for better generalization
⬜ Multiple models are trained and their decisions are averaged
Brummer, N., et al. "On calibration of language recognition scores.", 2006.
6/23
CNNs for ASC
 
A hybrid approach to ASC
⬛ A VGG-style fully convolutional architecture
⬜ A well-known model for object recognition
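[Figure: the network is built from conv layers, pooling layers and an average pooling layer, split into a feature-learning part and a feed-forward part; it is slid over the 30-s spectrogram and the class probabilities are summed.]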
Simonyan, K., et al. "Very deep convolutional networks for large-scale image recognition.", 2014.
7/23
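The slide-and-sum inference in the figure can be sketched as follows. This is a minimal sketch, not the authors' code: `predict_window` is a hypothetical stand-in for a trained CNN that maps one spectrogram excerpt to class probabilities, and the window length and hop (in frames) are assumed values.

```python
import numpy as np

def slide_and_sum(spectrogram, predict_window, window=128, hop=64):
    """Classify a whole segment by summing window-level class probabilities.

    spectrogram: (n_bins, n_frames) array for one 30-s segment.
    predict_window: hypothetical callable mapping a (n_bins, window) excerpt
    to a vector of class probabilities (e.g. a trained CNN).
    """
    n_frames = spectrogram.shape[1]
    total = None
    for start in range(0, max(n_frames - window, 0) + 1, hop):
        probs = np.asarray(predict_window(spectrogram[:, start:start + window]))
        total = probs if total is None else total + probs
    # The predicted scene is the class with the largest summed probability.
    return int(np.argmax(total)), total
```

Summing (rather than, say, majority voting) lets confident windows outweigh ambiguous ones; frames after the last full window are simply not covered in this sketch.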
I-Vectors for ASC
 
I-Vector Features
[Figure: Training: MFCCs → GMM with many components (learned via EM) → sparse, high-dimensional statistics → I-Vector model (trained via EM). Extraction: MFCCs of a segment → sparse, high-dimensional statistics → I-vector, a low-dimensional point estimate.]
Adapted GMM params = GMM params + unknown matrix · hidden factor
[1] Dehak, Najim, et al. "Front-end factor analysis for speaker verification.", 2011.
[2] Elizalde, B., et al. "An i-vector based approach for audio scene detection.", 2013.
[3] Kenny, Patrick, et al. "Uncertainty Modeling Without Subspace Methods For Text-Dependent Speaker Recognition.", 2016.
9/23
 
I-Vector Features
⬛ Requires a Universal Background Model (UBM):
⬜ A GMM with 256 Gaussian components
⬜ Trained on MFCC features
⬛ MAP estimation of a hidden factor:
⬜ m: mean supervector of the UBM
⬜ M: GMM mean supervector adapted to the MFCCs of an audio segment
⬜ T: low-rank total variability matrix
⬜ Solving the following factor analysis equation:
M = m + T·y
⬜ y is the hidden factor, and its MAP estimate is the I-vector
Dehak, Najim, et al. "Front-end factor analysis for speaker verification.", 2011.
10/23
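For a diagonal-covariance UBM, the MAP estimate of y has the closed form given by Dehak et al.: y = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F, where N and F are the zero-order and centred first-order Baum-Welch statistics of the segment and Σ is the UBM covariance. The NumPy sketch below computes it for one segment, assuming a trained UBM (weights, means, variances) and a trained T are already given. Training T with EM is not shown, and the function names are illustrative.

```python
import numpy as np

def responsibilities(X, weights, means, variances):
    """Posterior probability of each UBM component for each MFCC frame.

    X: (frames, dim); weights: (C,); means, variances: (C, dim) (diagonal)."""
    diff = X[:, None, :] - means[None, :, :]                  # (frames, C, dim)
    mahal = np.sum(diff ** 2 / variances[None], axis=2)       # (frames, C)
    log_det = np.sum(np.log(variances), axis=1)               # (C,)
    dim = X.shape[1]
    log_post = np.log(weights) - 0.5 * (dim * np.log(2 * np.pi) + log_det + mahal)
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def extract_ivector(X, weights, means, variances, T):
    """MAP estimate of y in M = m + T y for one segment.

    T: (C * dim, R), rows ordered component by component."""
    C, dim = means.shape
    R = T.shape[1]
    gamma = responsibilities(X, weights, means, variances)
    N = gamma.sum(axis=0)                                     # zero-order stats (C,)
    F = gamma.T @ X - N[:, None] * means                      # centred first-order stats (C, dim)
    Tc = T.reshape(C, dim, R)                                 # one (dim, R) block per component
    prec = 1.0 / variances                                    # diagonal of Sigma^-1
    # Posterior precision: I + sum_c N_c Tc^T Sigma_c^-1 Tc
    L = np.eye(R) + np.einsum("c,cdr,cd,cds->rs", N, Tc, prec, Tc)
    # Linear term: sum_c Tc^T Sigma_c^-1 F_c
    b = np.einsum("cdr,cd,cd->r", Tc, prec, F)
    return np.linalg.solve(L, b)
```

Working per component with `einsum` avoids ever forming the (C·dim) × (C·dim) supervector covariance, which matters with 256 components.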
 
Improving I-Vector Features for ASC
⬛ Tuning MFCC parameters
⬛ I-vectors from MFCCs of different channels
[Figure: four parallel pipelines, one per channel (left, right, average, difference), each with its own GMM and I-vector extractor.]
Elizalde, B, et al. "An i-vector based approach for audio scene detection." (2013).
11/23
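A sketch of deriving the four channel signals from a stereo recording, each of which would then feed its own MFCC, GMM and I-vector pipeline. The exact definitions of "average" and "difference" are assumptions here (the mean of the two channels, and left minus right), since the slide does not give them.

```python
import numpy as np

def derive_channels(stereo):
    """Return left, right, average and difference signals.

    stereo: array of shape (n_samples, 2).
    Assumption: average = (L + R) / 2 and difference = L - R.
    """
    stereo = np.asarray(stereo, dtype=float)
    if stereo.ndim != 2 or stereo.shape[1] != 2:
        raise ValueError("expected an array of shape (n_samples, 2)")
    left, right = stereo[:, 0], stereo[:, 1]
    return {
        "left": left,
        "right": right,
        "average": 0.5 * (left + right),
        "difference": left - right,
    }
```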
 
Post-processing and Scoring I-Vector Features
⬛ Length-Normalization
⬛ Within-class Covariance Normalization (WCCN)
⬛ Linear Discriminant Analysis (LDA)
⬛ Cosine Similarity:
⬜ Average the I-vectors of each class in the training set (model I-vector)
⬜ Compute the cosine similarity between each test I-vector and the model I-vector of each class
⬜ Pick the class with maximum similarity
[1] Garcia-Romero, D., et al. "Analysis of i-vector Length Normalization in Speaker Recognition.", 2011.
[2] Hatch, A. O., et al. "Within-class covariance normalization for SVM-based speaker recognition.", 2006.
[3] Dehak, Najim, et al. "Cosine similarity scoring without score normalization techniques.", 2010.
12/23
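The scoring chain can be sketched in NumPy as follows. This is an illustrative sketch, not the submission's code: LDA is omitted for brevity, WCCN uses the common definition from Hatch et al. (the Cholesky factor B of the inverse within-class covariance, applied as x ↦ Bᵀx), and all function names are hypothetical.

```python
import numpy as np

def length_normalize(X):
    """Scale each I-vector (row) to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def within_class_cov(X, y):
    """Average over classes of each class's covariance around its own mean."""
    classes = np.unique(y)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[y == c] - X[y == c].mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    return W / len(classes)

def wccn_projection(X, y):
    """Matrix B with B @ B.T = inv(W); rows are projected as X @ B."""
    return np.linalg.cholesky(np.linalg.inv(within_class_cov(X, y)))

def cosine_scores(train_X, train_y, test_X):
    """Cosine similarity of each test I-vector to each class's model I-vector."""
    classes = np.unique(train_y)
    models = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    return classes, length_normalize(test_X) @ length_normalize(models).T
```

In the order listed on the slide (minus LDA), one would compute `Xn = length_normalize(X)`, project with `Xn @ B`, call `cosine_scores`, and take `classes[scores.argmax(axis=1)]` as the predicted scene.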
Hybrid system
 
Why hybrid?
photo credit: www.mechanicalengineeringblog.com, www.google.com
 
Linear Logistic Regression for Score Fusion
⬛ Combining cosine scores of I-vectors with CNN probabilities
⬛ A Linear Logistic Regression (LLR) model is trained on the validation set
⬛ A coefficient is learned for each model and a bias term for each class
⬛ The final score is computed by applying the learned coefficients and bias terms to the test set scores
13/23
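The fusion rule above, one weight per system and one bias per class, can be sketched as multiclass logistic regression fitted by plain gradient descent. This is a sketch under assumptions: the original work follows the calibration approach of Brummer et al., whose exact training procedure is not shown on the slide, and the learning rate, step count and function names here are hypothetical.

```python
import numpy as np

def fuse(scores, alpha, beta):
    """Fused class scores: sum_m alpha[m] * scores[m] + beta.

    scores has shape (M systems, N segments, K classes)."""
    return np.tensordot(alpha, scores, axes=1) + beta  # (N, K)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy(scores, labels, alpha, beta):
    logp = log_softmax(fuse(scores, alpha, beta))
    return -logp[np.arange(len(labels)), labels].mean()

def train_fusion(scores, labels, lr=0.1, steps=2000):
    """Fit one weight per system and one bias per class on validation scores."""
    M, N, K = scores.shape
    alpha = np.full(M, 1.0 / M)
    beta = np.zeros(K)
    onehot = np.eye(K)[labels]
    for _ in range(steps):
        p = np.exp(log_softmax(fuse(scores, alpha, beta)))
        grad_z = (p - onehot) / N  # gradient of mean cross-entropy w.r.t. fused scores
        alpha -= lr * np.einsum("nk,mnk->m", grad_z, scores)
        beta -= lr * grad_z.sum(axis=0)
    return alpha, beta
```

On the test set, `fuse(test_scores, alpha, beta).argmax(axis=1)` would then give the predicted classes. Because only M + K parameters are learned, a small validation set is enough to fit them.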
 
Model averaging
⬛ 4 separate models, one trained on each fold
⬛ The final scores of the models from all folds are averaged
14/23
Dataset
 
TUT Acoustic Scenes 2016 dataset
⬛ 30-second audio segments from 15 acoustic scenes:
⬜ Bus - traveling by bus in the city (vehicle)
⬜ Cafe / Restaurant - small cafe/restaurant (indoor)
⬜ Car - driving or traveling as a passenger, in the city (vehicle)
⬜ City center (outdoor)
⬜ Forest path (outdoor)
⬜ Grocery store - medium size grocery store (indoor)
⬜ Home (indoor)
⬜ Lakeside beach (outdoor)
⬜ Library (indoor)
⬜ Metro station (indoor)
⬜ Office - multiple persons, typical work day (indoor)
⬜ Residential area (outdoor)
⬜ Train (traveling, vehicle)
⬜ Tram (traveling, vehicle)
⬜ Urban park (outdoor)
⬛ Development set:
⬜ Each acoustic scene has 78 segments totaling 39 minutes of audio.
⬜ 4-fold cross-validation
⬛ Evaluation set:
⬜ Each acoustic scene has 26 segments totaling 13 minutes of audio.
Mesaros, A., et al. "TUT database for acoustic scene classification and sound event detection.", 2016.
15/23
Results
 
CNN vs I-Vectors: Without Calibration
CNN: accuracy 83.33%
I-Vectors: accuracy 86.41%
[Figures: class-wise accuracy, confusion matrix and scores]
16/23
 
CNN vs I-Vectors: With Calibration
CNN: accuracy 84.10%
I-Vectors: accuracy 88.70%
[Figures: class-wise accuracy, confusion matrix and scores]
17/23
 
Hybrid
Hybrid: accuracy 89.74%
[Figures: class-wise accuracy, confusion matrix and scores]
18/23
Analysis: Score Calibration
 
CNN vs I-Vectors: Without and With Calibration
CNN: accuracy 83.33% without, 84.10% with calibration
I-Vectors: accuracy 86.41% without, 88.70% with calibration
[Figures: class-wise accuracy, confusion matrix and scores]
19/23
Analysis: Score Fusion
 
CNN vs I-Vectors (Without Calibration) vs Hybrid
CNN: accuracy 83.33%
I-Vectors: accuracy 86.41%
Hybrid: accuracy 89.7%
[Figures: class-wise accuracy, confusion matrix and scores, annotated "metro" and "train"]
20/23
 
DCASE 2016 challenge: Results
[Figure: challenge results chart with labeled entries: Hybrid; Calibrated multi-channel I-vectors; Multi-channel I-vectors; Our single CNN; MFCC+GMM baseline]
21/23
 
DCASE 2016 challenge: Observations
22/23
 
Challenges in ASC: Session Variability
Kippis! Kippis! ("Cheers!" in Finnish)
photo credit: www.google.com
22/23
Conclusion
 
Conclusion
⬛ The performance of I-Vectors can be noticeably improved by tuning the MFCCs
⬛ Different channels contain different information about a scene, which benefits the I-vector system
⬛ I-Vectors and CNNs are complementary
⬛ Score calibration improved both the I-Vector and the CNN systems
⬛ Late fusion efficiently combines the two systems’ predictions
⬛ This method is easily adaptable to new conditions
23/23
 
JOHANNES KEPLER
UNIVERSITY LINZ
Altenberger Str. 69
4040 Linz, Austria
www.jku.at
Thank you!
For more information about this presentation, don’t hesitate to contact me via:
hamid.eghbal-zadeh@jku.at
Slides available at:
https://www.slideshare.net/heghbalz
Code will soon be available at:
https://github.com/cpjku

Slides of my presentation at EUSIPCO 2017
