Machine Learning for Music
Faculty of Mathematics and Informatics, SU

Petko Nikolov
April 8, 2015
About Me
Machine Learning
Music Information Retrieval
Machine Learning / Automated Data Science
What’s Music Information Retrieval?
MIR: the intersection of Musicology, Computer Science, Signal Processing and Machine Learning
Music Recommendations
Recommending tags
Spotify’s Shuffle Mode
● Not really random

● Certainly some processing

● Probably some MIR behind
Pandora’s Music Genome Project
● started in 2000

● 800 000 tracks manually annotated by music experts

● 450 attributes to describe music

● 25 minutes per track to label
MIREX
Music Information Retrieval Evaluation eXchange
annual competition featuring more than 20 tasks
state-of-the-art algorithms compete against each other
Structured
Information
Retrieval
Synthesis
fingerprinting
cover song detection
genre recognition
instrument recognition
mood detection
transcription
playlist generation
beat tracking
key detection
pitch tracking
vocal detection
recommendation
audio similarity
source separation
MIR Architecture
Audio -> Segmentation and Preprocessing -> Feature Extraction -> Machine Learning
Output: classical, piano, romantic, Beethoven by Daniel Barenboim, 2/4
MIR Architecture
Audio -> Segmentation and Preprocessing -> Deep Learning
Output: classical, piano, romantic, Beethoven by Daniel Barenboim, 2/4
MIR Architecture - Audio
Audio signal
human hearing: 20 Hz to 20 kHz
Segmentation
Frame: 52 ms
f1 f2 f3 f4 … fn
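The segmentation step can be sketched as follows. This is a minimal illustration, not the speaker's code: the 44100 Hz sampling rate, the non-overlapping layout, and the helper name `split_into_frames` are all assumptions; only the 52 ms frame length comes from the slide.

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=52.0):
    """Cut a 1-D signal into consecutive, non-overlapping frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len  # the incomplete tail is dropped
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

sample_rate = 44100  # assumed sampling rate
# One second of noise as stand-in audio.
signal = np.random.default_rng(0).normal(size=sample_rate)
frames = split_into_frames(signal, sample_rate)  # rows are f1, f2, ..., fn
```

Real systems often use overlapping frames; that would only change how the start index of each frame is computed.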
Spectrum - on frame level
Discrete Fourier Transform (DFT)
time -> frequency
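A frame-level spectrum could be computed roughly like this with NumPy's FFT routines. The Hann window, the sampling rate, and the function name `magnitude_spectrum` are illustrative assumptions, not details from the talk.

```python
import numpy as np

def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies in Hz, magnitudes) for one real-valued frame."""
    window = np.hanning(len(frame))          # assumed window; reduces leakage
    spectrum = np.fft.rfft(frame * window)   # DFT of a real signal
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, np.abs(spectrum)

sample_rate = 44100                          # assumed sampling rate
t = np.arange(2293) / sample_rate            # one 52 ms frame
frame = np.sin(2 * np.pi * 440.0 * t)        # pure 440 Hz tone
freqs, mags = magnitude_spectrum(frame, sample_rate)
peak_hz = freqs[np.argmax(mags)]             # lands on the bin nearest 440 Hz
```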
Feature extraction
f -> x
Spectral Centroid
the ‘center of mass’ of the spectrum: the magnitude-weighted mean frequency
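As a sketch, the centroid is the sum of each bin frequency times its magnitude, divided by the sum of magnitudes. The function name and the zero-energy convention below are assumptions.

```python
import numpy as np

def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency of one frame's spectrum."""
    total = np.sum(mags)
    if total == 0:
        return 0.0  # assumed convention for a silent frame
    return float(np.sum(freqs * mags) / total)

# Energy split equally between 200 Hz and 300 Hz gives a centroid of 250 Hz.
freqs = np.array([100.0, 200.0, 300.0, 400.0])
mags = np.array([0.0, 1.0, 1.0, 0.0])
centroid = spectral_centroid(freqs, mags)
```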
Spectral Slope
fit a linear regression to the spectrum and take the slope coefficient
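One way to sketch the slope is a degree-1 least-squares fit of magnitude against frequency with `np.polyfit`. The function name is hypothetical, and whether the talk fits raw or log magnitudes is not stated; raw magnitudes are assumed here.

```python
import numpy as np

def spectral_slope(freqs, mags):
    """Slope of the least-squares line fitted to (frequency, magnitude)."""
    slope, _intercept = np.polyfit(freqs, mags, deg=1)
    return float(slope)

# A spectrum that falls off linearly recovers the slope used to build it.
freqs = np.linspace(0.0, 1000.0, 50)
mags = -0.002 * freqs + 5.0
slope = spectral_slope(freqs, mags)
```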
Spectral Correlation / Variation
Spectral Correlation is the cosine similarity between the frequency vectors of two consecutive frames.
Variation is, correspondingly, 1.0 - correlation.
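A minimal sketch of both quantities, treating correlation as the normalized dot product of two consecutive magnitude spectra, so that variation is its complement. Function names and the handling of silent frames are assumptions.

```python
import numpy as np

def spectral_correlation(prev_mags, curr_mags):
    """Normalized dot product of two consecutive magnitude spectra."""
    denom = np.linalg.norm(prev_mags) * np.linalg.norm(curr_mags)
    if denom == 0:
        return 0.0  # assumed convention when a frame is silent
    return float(np.dot(prev_mags, curr_mags) / denom)

def spectral_variation(prev_mags, curr_mags):
    """1.0 - correlation: near 0 for a steady spectrum, larger when it changes."""
    return 1.0 - spectral_correlation(prev_mags, curr_mags)

steady = spectral_variation(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0]))
changed = spectral_variation(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```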
Feature extraction - Result
One row per frame, one column per feature (centroid, correlation, …):
f11 f12 f13 f14 f15 ……… f1m
f21 f22 f23 f24 f25 ……… f2m
The number of frames varies across audio recordings.
Universal Background Model
Gaussian Mixture Model over frame feature vectors
Multivariate Gaussian Distribution
Gaussian Mixture Model - per track
[𝛍1, 𝛍2, 𝛍3, 𝛍4]
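The idea of turning a variable number of frames into a fixed-length vector of component means might be sketched with scikit-learn's `GaussianMixture`. This is a simplification under stated assumptions: the per-track model is simply re-fitted starting from the background model's parameters, whereas UBM systems usually use MAP adaptation. The helper names, diagonal covariances, and random stand-in data are all illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ubm(all_frames, n_components=4, seed=0):
    """Fit one background GMM on frames pooled from many tracks."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    return ubm.fit(all_frames)

def track_vector(track_frames, ubm):
    """Fit a per-track GMM initialized from the UBM and stack its means."""
    gmm = GaussianMixture(n_components=ubm.n_components,
                          covariance_type="diag",
                          means_init=ubm.means_,
                          weights_init=ubm.weights_,
                          random_state=0)
    gmm.fit(track_frames)
    return gmm.means_.ravel()  # [mu1, mu2, mu3, mu4] concatenated

rng = np.random.default_rng(0)
track_a = rng.normal(size=(120, 5))           # 120 frames, 5 features
track_b = rng.normal(loc=1.0, size=(300, 5))  # 300 frames, 5 features
ubm = fit_ubm(np.vstack([track_a, track_b]))
vec_a = track_vector(track_a, ubm)
vec_b = track_vector(track_b, ubm)
```

Both tracks end up with vectors of the same length (components times features), regardless of how many frames each one has.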
Classification - Example Neural Net
Layers: Input -> Hidden -> Output
Input: feature vector
Diagram labels: aik, wk
Output: Likelihood of Rock?
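A forward pass through such a network, with one hidden layer and a single sigmoid output read as the likelihood of "Rock", could look like the sketch below. The layer sizes, the random untrained weights, and all variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Input -> hidden activations -> one output probability."""
    hidden = sigmoid(W_hidden @ x + b_hidden)       # hidden-layer activations
    return float(sigmoid(w_out @ hidden + b_out))   # likelihood of "Rock"

rng = np.random.default_rng(0)
x = rng.normal(size=20)                         # track feature vector (assumed size)
W_hidden = rng.normal(scale=0.1, size=(8, 20))  # untrained, random weights
b_hidden = np.zeros(8)
w_out = rng.normal(scale=0.1, size=8)
b_out = 0.0
p_rock = forward(x, W_hidden, b_hidden, w_out, b_out)
```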
What’s Deep Learning?
(defn deep-learning? [neural-net]
  (hidden-layer? neural-net))

we are trying to learn new high-level representations by using many more hidden layers

input is as raw as possible
Mel-spectrum
Deep Neural Network
Backpropagation: the gradient fades quickly as it is propagated back through the layers
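The fading gradient can be observed in a small experiment: push a unit error signal backward through a stack of sigmoid layers and record its norm at each layer. The depth, width, weight scale, and function name below are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(n_layers=10, width=32, seed=0):
    """Norm of the error signal at each layer; index 0 is the first layer."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
               for _ in range(n_layers)]
    # Forward pass, keeping every layer's activations.
    a = rng.normal(size=width)
    activations = []
    for W in weights:
        a = sigmoid(W @ a)
        activations.append(a)
    # Backward pass, starting from a unit error at the output activations.
    delta = np.ones(width)
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        delta = delta * a * (1.0 - a)   # multiply by the sigmoid derivative (<= 0.25)
        norms.append(np.linalg.norm(delta))
        delta = W.T @ delta             # pass the error to the layer below
    return norms[::-1]

norms = layer_gradient_norms()
```

Because the sigmoid derivative never exceeds 0.25, each layer shrinks the signal, so `norms[0]` comes out far smaller than `norms[-1]`.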
Deep Belief Network
Output: Rock, Jazz, Punk, Electronic
Hidden Layer 3
Hidden Layer 2
Hidden Layer 1
Input (Mel spectrum)
Each pair of adjacent layers up to Hidden Layer 3 forms a Restricted Boltzmann Machine (RBM); the RBMs are trained one at a time, starting from the input.
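Greedy layer-by-layer pretraining of the RBM stack might be sketched with scikit-learn's `BernoulliRBM`: each RBM is trained on the hidden activations of the one below it. The layer sizes, hyperparameters, helper name, and random stand-in data are assumptions; the supervised output layer (Rock, Jazz, Punk, Electronic) and fine-tuning are not shown.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

def pretrain_rbm_stack(X, layer_sizes=(64, 32, 16), seed=0):
    """Train one RBM per hidden layer, bottom-up; return the RBMs and top features."""
    rbms, H = [], X
    for i, n_hidden in enumerate(layer_sizes):
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=10, random_state=seed + i)
        H = rbm.fit_transform(H)   # hidden probabilities feed the next RBM
        rbms.append(rbm)
    return rbms, H

# Stand-in for mel-spectrum frames scaled to [0, 1], as BernoulliRBM expects.
X = np.random.default_rng(0).random((200, 40))
rbms, top_features = pretrain_rbm_stack(X)
```

The top-level features would then go to a classifier over the genre labels.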
Deep Auto Encoders
Input: Mel spectrum
Output: Mel spectrum
Used for denoising
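A denoising autoencoder can be approximated with scikit-learn's `MLPRegressor` by training it to map noisy inputs back to their clean versions through a narrow middle layer. This is a rough sketch under assumptions: the random data stands in for mel-spectrum frames, the layer sizes and noise level are arbitrary, and the talk's own models were presumably built with other tooling.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
clean = rng.random((300, 40))                            # stand-in mel-spectrum frames
noisy = clean + rng.normal(scale=0.1, size=clean.shape)  # corrupted input

# Encoder 40 -> 32 -> 8, decoder 8 -> 32 -> 40 (sizes are assumptions).
autoencoder = MLPRegressor(hidden_layer_sizes=(32, 8, 32), activation="relu",
                           max_iter=200, random_state=0)
autoencoder.fit(noisy, clean)       # input: noisy frame, target: clean frame
denoised = autoencoder.predict(noisy)
```

With real spectra, which have far more structure than random numbers, the bottleneck forces the network to keep the musical content and discard the noise.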
Tools
essentia - audio retrieval algorithms
theano - CPU/GPU symbolic optimization
scikit-learn - machine learning in Python
