TIMBRAL MODELING FOR MUSIC
ARTIST RECOGNITION USING
I-VECTORS
Hamid Eghbal-zadeh, Markus Schedl, Gerhard Widmer
Johannes Kepler University
Linz, Austria
1
EUSIPCO 2015
Overview
• Introduction
o Artist recognition
o I-vector based systems
• I-vector Frontend
o Calculate statistics [GMM supervectors]
o Factor analysis [estimate hidden factors to extract I-vectors]
• Proposed method:
o Normalization and compensation techniques
o Backends
• Experiments
o Setup
o Evaluation
o Baselines
o Results
• Conclusion
2
Introduction – Artist recognition
• Artist recognition:
Recognizing the artist from a part of a song.
“Artist” refers to the singer or the band of a song.
• Difficulties:
– Musical instruments
– Effects of genre and instrumentation
– Singer’s voice + instruments
3
[Audio example: Major Lazer & DJ Snake – “Lean On”: singing voice vs. music]
Introduction – I-vector based systems
• I-vectors:
– Introduced in speaker verification in 2010
– Provide a compact and low dimensional representation
• Also used for:
– Emotion recognition, language recognition, audio scene detection
• Use Factor Analysis:
– Estimate hidden factors that can help us recognize an artist from a song
• Introducing Artist and Session factors in a song:
– Artist variability: the variability that appears between songs of different artists.
– Session variability: the variability that appears among songs of the same artist.
4
[Diagram: Song → frame-level features → song-level features, by estimating hidden factors]
I-vector Factor Analysis – Terminology
5
Spaces, features, and hidden factors:
Step 1: Feature extraction → frame-level features, in the frame-level feature space [~20 dim]
Step 2: Statistics calculation → GMM supervector, in the GMM space [~20,000 dim]
Step 3: Factor analysis → i-vector, the hidden total factors, in the Total Variability Space (TVS) [~400 dim]
Artist variability: the variability that appears between different artists.
Session variability: the variability that appears among songs of the same artist.
Total variability : Artist + Session variability
0th-order BW statistic: N(c) = Σₜ γₜ(c)
1st-order BW statistic: F(c) = Σₜ γₜ(c) · Xₜ
γₜ(c): posterior probability of frame Xₜ under component c (BW: Baum-Welch)
GMM-supervector*
I-vectors – Statistics calculation
6
[Diagram: the UBM is trained unsupervised on the development db; each song
(Song 1, Song 2, …) is then aligned against the UBM to compute its statistics]
Step 2: extract GMM supervectors
* Similar to: Charbuillet et al. , GMM-Supervector for Content based Music Similarity, DAFx 2011.
[Inputs: {Songs} → {MFCCs}, drawn from the train/test db]
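The statistics step above can be sketched in Python with scikit-learn. All sizes here (8-component UBM, 20-dim features, random stand-in frames) are toy assumptions for illustration, not the paper's setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train a toy UBM unsupervised on stand-in "development db" MFCC frames.
rng = np.random.default_rng(0)
dev_frames = rng.standard_normal((2000, 20))
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(dev_frames)

def bw_stats(X, ubm):
    """Zero- and first-order Baum-Welch statistics of frames X under the UBM."""
    gamma = ubm.predict_proba(X)   # (T, C): posterior of each frame per component
    N = gamma.sum(axis=0)          # 0th-order: N(c) = sum_t gamma_t(c)
    F = gamma.T @ X                # 1st-order: F(c) = sum_t gamma_t(c) * x_t
    return N, F

song_frames = rng.standard_normal((300, 20))   # stand-in for one song's MFCCs
N, F = bw_stats(song_frames, ubm)
print(N.shape, F.shape)            # (8,) and (8, 20)
```

Since the posteriors of each frame sum to one, N sums to the number of frames, which is a handy sanity check.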
I-vectors - Factor analysis
7
Step 3: estimate hidden factors
Goal:
• Reduce the dimensionality
• Separate desired factors from undesired
factors in feature space
• Estimate hidden variables related to
desired factors
Assumption, for song s:
M(s) = m + Oₛ
M(s): GMM supervector of song s; m: UBM mean supervector; Oₛ: offset vector
I-vectors - Factor analysis
8
Step 3: estimate hidden factors - previous methods
Joint Factor Analysis (JFA):
M(s) = m + V·y + U·x + D·z
M(s): GMM supervector of song s; m: mean vector of the UBM;
V: artist subspace matrix; U: session subspace matrix;
D: residual matrix; D·z: residual term
• JFA assumes Oₛ consists of separated artist and session factors.
• JFA showed better performance than previous FA methods.
I-vectors - Factor analysis
9
Step 3: estimate hidden factors - current method
I-vector extraction, for song s:
M(s) = m + T·y,  y ~ N(0, I)
M(s): GMM supervector of song s; m: mean vector of the UBM;
T: low-rank TVS matrix; y: i-vector
• TVS: contains both artist and session factors
• T is initialized randomly and learned with the EM algorithm from training data
I-vectors – Learning T
10
Step 3: estimate hidden factors - expectation maximization
• E step:
For each song s, use the current estimate of T to find the i-vector that
maximizes the likelihood of its GMM supervector M(s):
y(s) = argmaxᵧ P(M(s) | m + T·y, Σ)
• M step:
Update T by maximizing P(M(s) | m + T·y, Σ)
m: UBM mean vector; Σ: covariance matrix
I-vectors – Proposed system
1. I-vectors are centered by removing the mean
2. I-vectors are length normalized
3. LDA is used for compensation and dimensionality reduction
11
yₙ = y / ‖y‖
y: i-vector; yₙ: length-normalized i-vector
Pipeline: Song → Extract features {MFCC} → Extract GMM supervectors →
Front end {i-vector extraction} → Compensation/Normalization {LDA / length norm} →
Backend {DA, 3NN, NB, PLDA}
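The three post-processing steps above can be sketched with numpy and scikit-learn; the i-vectors, artist labels, and dimensions below are toy stand-ins, not the paper's data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in i-vectors and artist labels (toy sizes).
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((120, 40))
labels = np.repeat(np.arange(6), 20)          # 6 hypothetical artists, 20 songs each

centered = ivecs - ivecs.mean(axis=0)         # 1. center by removing the mean
normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)  # 2. length-normalize
lda = LinearDiscriminantAnalysis(n_components=5)   # 3. LDA keeps at most (#artists - 1) dims
reduced = lda.fit_transform(normed, labels)
print(reduced.shape)                          # (120, 5)
```

Note that LDA's output dimensionality is capped at one less than the number of classes, which is why six artists yield at most five dimensions here.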
Backends
• Discriminant Analysis classifier
• Nearest neighbor classifier with cosine distance (k=3)
• Naïve Bayes classifier
• Probabilistic Linear Discriminant Analysis
12
PLDA model: y = m + Φ·l + e
y: i-vector; m: mean of training i-vectors; Φ: latent matrix; l: latent factor; e: residual term
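Three of the four backends map directly onto scikit-learn classifiers, sketched below on toy, well-separated stand-in "i-vectors"; PLDA has no scikit-learn implementation and is omitted:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

backends = {
    "DA": LinearDiscriminantAnalysis(),
    "3NN-cosine": KNeighborsClassifier(n_neighbors=3, metric="cosine"),
    "NB": GaussianNB(),
}

# Toy data for two hypothetical artists, clustered along orthogonal directions
# so that the cosine metric also separates them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 0, 0, 0, 0], 0.05, (30, 5)),
               rng.normal([0, 1, 0, 0, 0], 0.05, (30, 5))])
y = np.array([0] * 30 + [1] * 30)

scores = {name: clf.fit(X, y).score(X, y) for name, clf in backends.items()}
print(scores)
```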
Experiments – Setup
• A 30-second excerpt is randomly selected from the middle of each song
• 13- and 20-dimensional MFCCs are used as frame-level features
• A 1024-component GMM is trained as the UBM
• The TVS matrix is trained with 400 factors
• LDA is applied for compensation and dimensionality reduction
• Development db = train set
13
Experiments – Evaluation
• “Artist20” dataset: 1413 tracks, mostly rock and pop,
composed of six albums from each of 20 artists
• 6-fold cross-validation splits provided with the Artist20 dataset
• In each fold, one of each artist’s six albums is held out for testing.
14
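The album-filtered split above can be sketched as follows; the metadata layout (artist ids and album indices 0–5, with a toy 2 tracks per album) is a hypothetical stand-in for the real dataset's structure:

```python
import numpy as np

# Hypothetical metadata mimicking Artist20's layout: 20 artists x 6 albums.
n_artists, n_albums, tracks_per_album = 20, 6, 2
artist = np.repeat(np.arange(n_artists), n_albums * tracks_per_album)
album = np.tile(np.repeat(np.arange(n_albums), tracks_per_album), n_artists)

def album_folds(album, n_albums=6):
    """Fold k holds out album k of every artist for testing (album filtering)."""
    for k in range(n_albums):
        test = album == k
        yield np.where(~test)[0], np.where(test)[0]

folds = list(album_folds(album))
print(len(folds))   # 6
```

Holding out whole albums avoids the "album effect", where production characteristics shared within an album leak between train and test.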
Experiments – Baselines
Best artist recognition performance found on Artist20 db:
1. Single GMM: [D. P. W. Ellis, 2007]
– Provided with the dataset
2. Signature-based approach: [S. Shirali, 2009]
– Generates compact signatures and compares them using graph matching
3. Sparse modelling: [L. Su, 2013]
– Sparse feature learning method with a ‘bag of features’ using the
magnitude and phase parts of the spectrum
4. Multivariate kernels: [P. Kuksa, 2014]
– Uses multivariate kernels with direct uniform quantization
5. Alternative:
– Same structure as the proposed method, with the i-vector extraction block
replaced by PCA
15
Pipeline: Song → Extract features {MFCC} → GMM supervectors → Front end {PCA} →
Compensation/Normalization {LDA / length norm} → Backend {DA}
I-vectors – Results
16
[Table: accuracies of the proposed system vs. the best baseline (“Best”) and the
PCA alternative (“Alt.”), for 13- and 20-dim MFCCs]
I-vectors – Results
• Results for different numbers of Gaussian components with
the proposed method and the DA classifier
17
[Plot legend: Best 13, Best 20]
Conclusion
18
• Total factors can model an artist
• Compact representation, low dimensionality
• Song-level features
• Robust to multiple backends
Acknowledgement
19
• We would like to acknowledge the tremendous help of Dan Ellis of
Columbia University, who provided tools and resources for feature
extraction and shared the details of his work, which enabled us to
reproduce his experimental results.
• Thanks also to Pavel Kuksa from University of Pennsylvania for sharing the
details of his work with us.
• We appreciate the helpful suggestions of Marko Tkalcic from Johannes
Kepler University Linz.
• This work was supported by the EU-FP7 project no.601166 “Performances
as Highly Enriched aNd Interactive Concert eXperiences (PHENICX)”.
Questions
20
Thank you for your time!
0th-order BW statistic: N(c) = Σₜ γₜ(c)
1st-order BW statistic: F(c) = Σₜ γₜ(c) · Xₜ
γₜ(c): posterior probability of frame Xₜ under component c (BW: Baum-Welch)
GMM-supervector
I-vectors - GMM supervector
21
Example:
UBM: 1024 components
Feature: 20 dim
0th BW=1024 x 1
1st BW=20 x 1024
Step 1
I-vectors - Factor analysis
22
Step 2: Closed form
I-vector of song s:
y = (I + Tᵗ Σ⁻¹ N(s) T)⁻¹ · Tᵗ Σ⁻¹ F(s)
I: identity matrix; T: TVS matrix; Σ: covariance matrix of the UBM; y: i-vector;
N(s): 0th-order BW statistics; F(s): 1st-order BW statistics (GMM supervectors);
γₜ(c): posterior probability of frame Xₜ under component c; BW: Baum-Welch
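The closed form above can be sketched numerically; the toy dimensions, the random TVS matrix, and the identity covariance below are illustrative assumptions (a real system would EM-train T and center F against the UBM means):

```python
import numpy as np

# Toy dimensions, far smaller than the paper's ~20/1024/400 setup.
C, D, R = 8, 20, 4                          # UBM components, feature dim, i-vector dim
rng = np.random.default_rng(0)
T = 0.1 * rng.standard_normal((C * D, R))   # TVS matrix (in practice EM-trained)
Sigma_inv = np.eye(C * D)                   # inverse UBM covariance (identity for the sketch)

def extract_ivector(N, F, T, Sigma_inv):
    """Closed-form i-vector: y = (I + T' S^-1 N(s) T)^-1 T' S^-1 F(s).

    N: (C,) 0th-order stats; F: (C, D) 1st-order stats (centering against
    the UBM means, done in practice, is omitted here)."""
    Ns = np.kron(np.diag(N), np.eye(D))     # expand N(s) to (C*D, C*D) block-diagonal form
    L = np.eye(T.shape[1]) + T.T @ Sigma_inv @ Ns @ T
    return np.linalg.solve(L, T.T @ Sigma_inv @ F.reshape(-1))

y = extract_ivector(np.full(C, 10.0), rng.standard_normal((C, D)), T, Sigma_inv)
print(y.shape)   # (4,)
```

As a sanity check, zero first-order statistics yield a zero i-vector, since the posterior mean collapses onto the standard-normal prior's mean.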
I-vector Extraction Routine
– Step 1: Feature extraction
– Step 2: Statistics calculation
• Extract GMM-supervectors from frame-level features (MFCCs)
– Step 3: Factor analysis
• Apply factor analysis to estimate hidden variables in GMM space
23
Pipeline: Frames → Extract features {MFCC} → Extract GMM supervectors →
Front end {i-vector extraction} → Compensation/Normalization {LDA / length norm} →
Backend {PLDA, …}