A Hybrid Approach with
Multi-channel I-Vectors and
Convolutional Neural Networks for
Acoustic Scene Classification
Hamid Eghbal-zadeh, Bernhard Lehner, Matthias Dorfer and Gerhard Widmer
A closer look into our winning submission
at IEEE DCASE-2016 challenge1
for
Acoustic Scene Classification
1) www.cs.tut.fi/sgn/arg/dcase2016/
Introduction
 
What is Acoustic Scene Classification
(ASC)?
photo credit: www.cs.tut.fi/sgn/arg/dcase2016/
1/23
 
Challenges in ASC: Overlapping events
Park City Center
photo credit: www.cs.tut.fi/sgn/arg/dcase2016/
2/23
 
Challenges in ASC: Session Variability
Vienna! Athens!
photo credit: www.google.com
3/23
 
Challenges in ASC: Session Variability
Vienna! Athens!
Prost! ("Cheers!" in German)   υγειά μας! ("Cheers!" in Greek)
photo credit: www.google.com
3/23
 
 
ASC is not that easy ...
photo credit: www.memegen.com
Methods for ASC
 
Deeeeeep learning
⬛ Pros:
⬜ A powerful method for supervised learning
⬜ Convolutional Neural Networks (CNNs)
⬜ Spectrograms as images
⬜ Feature Learning
⬜ Successfully applied to images, speech, and music
⬛ Cons:
⬜ Classes are easily confused in noisy scenes with blurry spectrograms
⬜ Poor generalization and overfitting if the training data does not cover a variety of recording sessions
Piczak, K. J., et al. "Environmental sound classification with convolutional neural networks.", 2015.
photo credit: Yann LeCun's slides at NIPS 2016 keynote 4/23
 
Factor Analysis
⬛ Pros:
⬜ Session Variability reduction
⬜ Use of a Universal Background Model (UBM)
⬜ Better generalization due to the unsupervised methodology
⬜ Successfully applied to sequential data such as speech and music
⬛ Cons:
⬜ Relies on engineered features
⬜ Limited in its use of specialized features for audio scene analysis because of the independence and Gaussian assumptions in FA
[1] Dehak, Najim, et al. "Front-end factor analysis for speaker verification.",2011.
[2] Elizalde, B., et al. "An i-vector based approach for audio scene detection.", 2013.
5/23
 
A Hybrid system to overcome the
complexities ...
photo credit: www.imgflip.com
 
A hybrid approach to ASC
⬛ We combine a CNN with an I-Vector based ASC system:
⬜ A CNN is trained on spectrograms
⬜ I-Vector features (based on FA) are extracted from MFCCs
⬛ Late fusion
⬜ A score fusion technique is used to combine the two methods
⬛ Model averaging for better generalization
⬜ Multiple models are trained and the decisions of the different models are averaged
Brummer, N., et al. "On calibration of language recognition scores." , 2006.
6/23
CNNs for ASC
 
A hybrid approach to ASC
⬛ A VGG-style fully convolutional architecture
⬜ A well-known model for object recognition
[Architecture diagram: stacked conv and pooling layers (feature learning part) followed by an average pooling layer (feed-forward part); the network slides over the 30-second spectrogram and the per-window probabilities are summed.]
Simonyan, K., et al. "Very deep convolutional networks for large-scale image recognition.", 2014.
7/23
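To make the architecture concrete, here is a minimal PyTorch sketch of a VGG-style fully convolutional classifier for spectrogram excerpts; the layer count, filter sizes, and excerpt length are illustrative assumptions, not the exact configuration used in the submission.

```python
import torch
import torch.nn as nn

class VGGStyleFCN(nn.Module):
    def __init__(self, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 1x1 convolution instead of fully connected layers (fully convolutional)
        self.classifier = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):              # x: (batch, 1, freq_bins, frames)
        h = self.classifier(self.features(x))
        return h.mean(dim=(2, 3))      # global average pooling -> class scores

# slide over the 30-second spectrogram, then sum the per-excerpt probabilities
model = VGGStyleFCN()
excerpts = torch.randn(10, 1, 128, 128)            # 10 excerpts of one segment
segment_score = torch.softmax(model(excerpts), dim=1).sum(dim=0)
print(segment_score.argmax().item())               # predicted scene index
```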
I-Vectors for ASC
 
I-Vector Features
[Diagram: I-vector training pipeline. Training MFCCs → GMM (many components, high-dimensional supervector) → sparse statistics → I-Vector model trained via EM → i-vector, a low-dimensional point estimate of the hidden factor.]
Adapted GMM params = GMM params + unknown matrix · hidden factor (learned via EM)
[1] Dehak, Najim, et al. "Front-end factor analysis for speaker verification.", 2011.
[2] Elizalde, B., et al. "An i-vector based approach for audio scene detection.", 2013.
[3] Kenny, Patrick, et al. "Uncertainty Modeling Without Subspace Methods For Text-Dependent Speaker Recognition.", 2016.
8/23
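As a concrete starting point, a minimal sketch of training the background GMM on pooled training MFCCs, assuming scikit-learn; the paper uses 256 Gaussian components, while the diagonal-covariance choice and toy data below are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(mfcc_frames, n_components=256):
    """mfcc_frames: (n_frames, n_mfcc) MFCCs pooled over all training segments."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100)
    ubm.fit(mfcc_frames)
    # ubm.weights_, ubm.means_, ubm.covariances_ feed the i-vector model
    return ubm

# toy usage with random frames standing in for real MFCCs
ubm = train_ubm(np.random.randn(5000, 20), n_components=8)
```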
 
I-Vector Features
[Diagram: Training and extraction pipelines. Training: MFCCs → GMM (many components, high dimension) → sparse statistics → I-Vector model learned via EM. Extraction: MFCCs of a segment → sparse statistics against the same GMM → low-dimensional i-vector (point estimate of the hidden factor).]
Adapted GMM params = GMM params + unknown matrix · hidden factor (learned via EM)
[1] Dehak, Najim, et al. "Front-end factor analysis for speaker verification.", 2011.
[2] Elizalde, B., et al. "An i-vector based approach for audio scene detection.", 2013.
[3] Kenny, Patrick, et al. "Uncertainty Modeling Without Subspace Methods For Text-Dependent Speaker Recognition.", 2016.
9/23
 
I-Vector Features
⬛ Requires a Universal Background Model (UBM):
⬜ A GMM with 256 Gaussian components
⬜ MFCC features
⬛ MAP estimation of a hidden factor:
⬜ m: mean from the GMM
⬜ M: the GMM mean adapted to the MFCCs of an audio segment
⬜ Solving the following factor analysis equation:
M = m + T·y
⬜ y is the hidden factor; its MAP estimate is the i-vector (sketched below)
Dehak, Najim, et al. "Front-end factor analysis for speaker verification.",2011.
10/23
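A minimal numpy sketch of i-vector extraction for one segment, assuming a trained diagonal-covariance UBM (weights w, means m, covariances S) and a trained total-variability matrix T; in the real system T itself is learned via EM, which is omitted here, and all toy shapes are illustrative.

```python
import numpy as np

def extract_ivector(X, w, m, S, T):
    """X: (frames, dim) MFCCs; w: (C,) weights; m, S: (C, dim); T: (C*dim, R)."""
    C, dim = m.shape
    R = T.shape[1]
    # component responsibilities for each frame under the diagonal-covariance UBM
    log_p = np.stack([np.log(w[c])
                      - 0.5 * np.sum((X - m[c]) ** 2 / S[c]
                                     + np.log(2 * np.pi * S[c]), axis=1)
                      for c in range(C)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # sparse Baum-Welch statistics: zeroth order N, centered first order F
    N = gamma.sum(axis=0)                       # (C,)
    F = gamma.T @ X - N[:, None] * m            # (C, dim)
    # MAP point estimate of the hidden factor y (the i-vector):
    #   y = (I + T' S^-1 N T)^-1  T' S^-1 F
    Sinv = 1.0 / S.reshape(-1)                  # (C*dim,)
    TS = T * Sinv[:, None]                      # S^-1 T
    NT = np.repeat(N, dim)[:, None] * T         # N T
    precision = np.eye(R) + TS.T @ NT
    return np.linalg.solve(precision, TS.T @ F.reshape(-1))

# toy usage: random UBM and T matrix (256 components in the paper, 8 here)
rng = np.random.default_rng(0)
C, dim, R = 8, 20, 40
iv = extract_ivector(rng.standard_normal((300, dim)), np.full(C, 1.0 / C),
                     rng.standard_normal((C, dim)), np.ones((C, dim)),
                     0.1 * rng.standard_normal((C * dim, R)))
print(iv.shape)  # (40,)
```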
 
Improving I-Vector Features for ASC
[Diagram: the left, right, average, and difference channels of the stereo recording each pass through the GMM and yield their own i-vector.]
⬛ Tuning MFCC parameters:
⬛ I-vectors from MFCCs of different channels
Elizalde, B., et al. "An i-vector based approach for audio scene detection.", 2013.
11/23
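The channel set above can be derived from the stereo recording as in this hypothetical sketch; mfcc() and extract_ivector() stand in for the actual feature extractor and the extraction routine sketched earlier, and the simple concatenation of the per-channel i-vectors is an assumption for illustration.

```python
import numpy as np

def multichannel_ivectors(stereo, mfcc, extract_ivector):
    """stereo: (n_samples, 2) audio; mfcc/extract_ivector: stand-in callables."""
    left, right = stereo[:, 0], stereo[:, 1]
    channels = [left, right, 0.5 * (left + right), left - right]  # L, R, avg, diff
    # one i-vector per derived channel; here simply concatenated for later scoring
    return np.concatenate([extract_ivector(mfcc(sig)) for sig in channels])
```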
 
Post-processing and Scoring I-Vector
Features
⬛ Length-Normalization
⬛ Within-class Covariance Normalization (WCCN)
⬛ Linear Discriminant Analysis (LDA)
⬛ Cosine Similarity:
⬜ Average I-vectors of each class in training set (Model I-vector)
⬜ Compute the cosine similarity of each test i-vector to the model i-vector of each class
⬜ Pick the class with the maximum similarity (see the sketch after this slide)
[1] Garcia-Romero, D., et al. "Analysis of i-vector Length Normalization in Speaker Recognition.", 2011.
[2] Hatch, A. O.,et al. "Within-class covariance normalization for SVM-based speaker recognition.", 2006.
[3] Dehak, Najim, et al. "Cosine similarity scoring without score normalization techniques." 2010.
12/23
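A minimal sketch of this post-processing and scoring chain, assuming scikit-learn for LDA and scipy for the WCCN whitening; the order of the steps follows the slide, and the exact hyperparameters are not taken from the paper.

```python
import numpy as np
from scipy.linalg import cholesky, inv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def length_normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def wccn(X, y):
    # average within-class covariance W, then B with B B' = W^-1; project x -> B' x
    classes = np.unique(y)
    W = sum(np.cov(X[y == c], rowvar=False) for c in classes) / len(classes)
    return cholesky(inv(W), lower=True)

def cosine_scores(X_train, y_train, X_test):
    X_train, X_test = length_normalize(X_train), length_normalize(X_test)
    B = wccn(X_train, y_train)
    X_train, X_test = X_train @ B, X_test @ B
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    X_train, X_test = lda.transform(X_train), lda.transform(X_test)
    classes = np.unique(y_train)
    # model i-vector of each class = mean of its training i-vectors
    models = length_normalize(np.stack([X_train[y_train == c].mean(axis=0)
                                        for c in classes]))
    scores = length_normalize(X_test) @ models.T   # cosine similarity per class
    return classes[scores.argmax(axis=1)], scores
```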
Hybrid system
 
Why hybrid?
photo credit: www.mechanicalengineeringblog.com
www.google.com
 
Linear Logistic Regression for Score
Fusion
⬛ Combining cosine scores of I-vectors with CNN probabilities
⬛ A Linear Logistic Regression (LLR) model is trained on the validation set
⬛ A coefficient is learned for each model and a bias term for each
class.
⬛ The final score is computed by applying the learned coefficients and bias terms to the test-set scores (see the sketch after this slide).
13/23
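A minimal sketch of the LLR fusion, assuming scipy; one coefficient per system and one bias per class are learned on the validation scores by minimizing the multiclass cross-entropy, with a generic optimizer standing in for a dedicated calibration toolkit.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

def fit_fusion(score_sets, labels):
    """score_sets: list of (n_samples, n_classes) validation score matrices;
    labels: (n_samples,) integer class labels."""
    n_sys, n_cls = len(score_sets), score_sets[0].shape[1]

    def nll(params):  # multiclass cross-entropy of the fused scores
        alphas, bias = params[:n_sys], params[n_sys:]
        fused = sum(a * s for a, s in zip(alphas, score_sets)) + bias
        return -log_softmax(fused, axis=1)[np.arange(len(labels)), labels].mean()

    x0 = np.concatenate([np.ones(n_sys), np.zeros(n_cls)])
    res = minimize(nll, x0, method="L-BFGS-B")
    return res.x[:n_sys], res.x[n_sys:]

def apply_fusion(score_sets, alphas, bias):
    return sum(a * s for a, s in zip(alphas, score_sets)) + bias

# usage (hypothetical variables): learn on validation scores, apply to test scores
# alphas, bias = fit_fusion([cnn_val, ivec_val], val_labels)
# fused_test = apply_fusion([cnn_test, ivec_test], alphas, bias)
```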
 
Model averaging
⬛ 4 separate models are trained, one per cross-validation fold
⬛ The final scores of the per-fold models are averaged (sketched below)
14/23
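The averaging step itself, as a trivial sketch assuming one fused score matrix per fold model:

```python
import numpy as np

def average_models(per_model_scores):
    """per_model_scores: list of (n_test, n_classes) score matrices, one per fold model."""
    return np.mean(per_model_scores, axis=0)

# predictions = average_models([fold1, fold2, fold3, fold4]).argmax(axis=1)
```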
Dataset
 
TUT Acoustic Scenes 2016 dataset
⬛ 30-second audio segments from 15 acoustic scenes:
⬜ Bus - traveling by bus in the city (vehicle)
⬜ Cafe / Restaurant - small cafe/restaurant (indoor)
⬜ Car - driving or traveling as a passenger, in the city (vehicle)
⬜ City center (outdoor)
⬜ Forest path (outdoor)
⬜ Grocery store - medium size grocery store (indoor)
⬜ Home (indoor)
⬜ Lakeside beach (outdoor)
⬜ Library (indoor)
⬜ Metro station (indoor)
⬜ Office - multiple persons, typical work day (indoor)
⬜ Residential area (outdoor)
⬜ Train (traveling, vehicle)
⬜ Tram (traveling, vehicle)
⬜ Urban park (outdoor)
⬛ Development set:
⬜ Each acoustic scene has 78 segments, totaling 39 minutes of audio.
⬜ 4-fold cross-validation
⬛ Evaluation set:
⬜ 26 segments totaling 13 minutes of audio.
Mesaros, A.,et al "TUT database for acoustic scene classification and sound event detection." , 2016.
15/23
Results
 
CNN vs I-Vectors: Without Calibration
CNN accuracy: 83.33%
I-Vectors accuracy: 86.41%
[Figure panels: class-wise accuracy, confusion matrix, scores]
16/23
 
CNN vs I-Vectors: With Calibration
CNN accuracy: 84.10%
I-Vectors accuracy: 88.70%
[Figure panels: class-wise accuracy, confusion matrix, scores]
17/23
 
Hybrid
Hybrid accuracy: 89.74%
[Figure panels: class-wise accuracy, confusion matrix, scores]
18/23
Analysis: Score Calibration
 
CNN vs I-Vectors: Without Calibration
CNN accuracy: 83.33%
I-Vectors accuracy: 86.41%
[Figure panels: class-wise accuracy, confusion matrix, scores]
19/23
 
CNN vs I-Vectors: With Calibration
CNN accuracy: 84.10%
I-Vectors accuracy: 88.70%
[Figure panels: class-wise accuracy, confusion matrix, scores]
19/23
Analysis: Score Fusion
 
CNN vs I-Vectors: Without Calibration
CNN accuracy: 83.33%
I-Vectors accuracy: 86.41%
[Figure panels: class-wise accuracy, confusion matrix, scores; metro station and train classes highlighted]
20/23
 
Hybrid
Hybrid accuracy: 89.7%
[Figure panels: class-wise accuracy, confusion matrix, scores]
20/23
 
DCASE 2016 challenge: Results
[Figure: challenge ranking comparing our calibrated hybrid system, the multi-channel I-vectors, our single CNN, and the MFCC+GMM baseline]
21/23
 
DCASE 2016 challenge: Observations
22/23
 
Challenges in ASC: Session Variability
Kippis! Kippis! ("Cheers!" in Finnish)
photo credit: www.google.com
22/23
Conclusion
 
Conclusion
⬛ Performance of I-Vectors can be noticeably improved by tuning
MFCCs
⬛ Different channels carry complementary information about a scene that benefits the I-vector system
⬛ I-Vectors and CNNs are complementary
⬛ Score Calibration improved both I-Vectors and CNN
⬛ A late fusion can efficiently combine the two systems' predictions
⬛ This method is easily adaptable to new conditions
23/23
 
JOHANNES KEPLER
UNIVERSITY LINZ
Altenberger Str. 69
4040 Linz, Austria
www.jku.at
Thank you!
For more information about this presentation, don’t hesitate to
contact me via:
hamid.eghbal-zadeh@jku.at
Slides available at:
https://www.slideshare.net/heghbalz
Code will soon be available at:
https://github.com/cpjku
