Deep learning for music classification, 2016-05-24

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Reference
Deep Learning for Music Classification
GCT634 Spring 2016, KAIST
Keunwoo.Choi
@qmul.ac.uk
Centre for Digital Music, Queen Mary University of London, UK
24 May 2016
1/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Reference
1 Music classification
2 Data-driven approaches
Conventional ML
Deep Learning
3 Reference
2/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Reference
Music Classification
Definition
Classify music items into certain categories
(using audio content)
Genre classification [3]
Rock/Jazz/Hiphop/Classical/...
Instrument identification
Music/Speech segmentation
Emotion recognition
Automatic tagging
3/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Reference
Music Classification
Feasibility
Are these info really extractable from audio signal?
Genre: sound, play style, chord, instrument, melody, ..
Instrument: spectral and/or temporal patterns
Music/Speech: spectral and/or temporal patterns
Emotion: sound/melody/lyrics/..
Tags (instrument/era/emotion/activity/): ...
4/15

Deep Learning
for Music
Classiﬁcation
Keunwoo.Choi
@qmul.ac.uk
Music
classiﬁcation
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Data-driven approaches
Conventional ML
Data-driven + domain knowledge
acoustic/musical features
[3]
”We provide candidates and let machine choose”
5/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Conventional ML
1/2 Feature selection
Any features that might be relevant to the classification
Spectral features
Spectral rolloff, centroid, MFCC, ZCR,
Rhythmic features
Tempo, beat histogram
Tonal features
Key, pitch-class distribution, tonality
6/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Conventional ML
2/2 Classifiers
Classifiers select relevant features
To map (aggregated N-dim feature) to (decision)
Classifiers are trained with data
After training, it is usually possible to score how relevant
each feature is
7/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Conventional ML - Genre classification example
audio signal → → , → →
length=N → 256-by-100 → 30-by-100, 1-by-100 → 31-by-100 → 62-by-1
for x, y in training data, # x: audio signal
1 X = stft(x)
2 x mfccs = mfcc(X)
2 x centroids = spectral centroid(X)
3 x feats = concatenate(x mfccs, x centroids)
# size(x feats) = (31,100), feature vectors for every frame in the track
4 x feat = concatenate(mean(x feats), var(x feats))
# size(x feat) = (62,1), feature vector of the whole track x
Training the classifier with (x feat, y)
* Now, we have a system that maps audio signal → genre
8/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Conventional ML - Genre classification example
audio signal → [feature extraction] → → [trained classifier] → yprediction
for new audio signal s,
1 get s feat
2 predict genre of the signal!
9/15

Deep Learning
for Music
Classiﬁcation
Keunwoo.Choi
@qmul.ac.uk
Music
classiﬁcation
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Even more data-driven approaches
Deep Learning
”Why don’t we optimise/automate more?”
Because designing features is NOT optimised (and boring)
”We provide candidates and let machine choose”
”Let machine design and choose features”
10/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Deep Learning
Deep Learning
Deep == More layers (of Neural Networks) == Some
layers serve as feature extractors, the others as classifiers
11/15

Deep Learning
for Music
Classification
Keunwoo.Choi
@qmul.ac.uk
Music
classification
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Deep Learning
Deep Learning
Machines might do better than humans
they don’t get bored, compute faster, are not biased,..
Machines are more flexible than before
learned classifier AND feature extractor
Machines need more examples to learn from than before
because the number of parameters to learn increases
Human: decides the structure and input types
12/15

Deep Learning
for Music
Classiﬁcation
Keunwoo.Choi
@qmul.ac.uk
Music
classiﬁcation
Data-driven
approaches
Conventional
ML
Deep Learning
Reference
Deep Learning
End-to-end learning for music audio, Sander Dieleman et
al., ICASSP, 2014 [2]
Auto-tagging using deep convolutional neural networks,
Keunwoo Choi et al., ISMIR, 2016 [1]
13/15

Choi, K., Fazekas, G., Sandler, M.: Automatic tagging
using deep convolutional neural networks. In: Proceedings
of the 17th International Society for Music Information
Retrieval Conference (ISMIR 2016), New York, USA (2016)
Dieleman, S., Schrauwen, B.: End-to-end learning for
music audio. In: Acoustics, Speech and Signal Processing
(ICASSP), 2014 IEEE International Conference on. pp.
6964–6968. IEEE (2014)
Tzanetakis, G., Cook, P.: Musical genre classiﬁcation of
audio signals. Speech and Audio Processing, IEEE
transactions on 10(5), 293–302 (2002)

Deep Learning
for Music
Classiﬁcation
Keunwoo.Choi
@qmul.ac.uk
Music
classiﬁcation
Data-driven
approaches
Reference
Bonus 1. 6 selected pages of this slide on deep CNNs
Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNN use-cases
References
Convolutional Neural Networks
A brief explanation
Keunwoo.Choi
@qmul.ac.uk
1/43
14/15

Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNNs vs DNNs
CNN structures
Inside CNNs
CNN use-cases
References
CNNs: Convolutional Neural Networks
(Deep) Convolutional Neural Networks
deep = cascaded
convolutional = ﬁlters
neural networks = things are learned
1
2
1
cns.org
2
AlexNet
3/43

Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNNs vs DNNs
CNN structures
Inside CNNs
CNN use-cases
References
Hierarchical features
Hierarchical feature learning
Each layer learns features in diﬀerent levels of hierarchy
High-level features are built on low-level features
E.g.
Layer 1: Edges (low-level, concrete)
Layer 2: Simple shapes
Layer 3: Complex shapes
Layer 4: More complex shapes
Layer 5: Shapes of target objects (high-level, abstract)
26/43

Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNNs vs DNNs
CNN structures
Inside CNNs
CNN use-cases
References
What is learned in CNNs?
in image recognition task
[11]
27/43

Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNNs vs DNNs
CNN structures
Inside CNNs
CNN use-cases
References
[11]
28/43

Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNNs vs DNNs
CNN structures
Inside CNNs
CNN use-cases
References
[11]
29/43

Convolutional
Neural
Networks
Keunwoo.Choi
@qmul.ac.uk
Overview
CNN use-cases
Image
Music
References
CNN use-cases
Music information retrieval
Anything people can do by seeing spectrograms
E.g. Auto tagging [1], chord recognition [5], instrument
recognition [7], music-noise segmentation [8], onset
detection [9], boundary detection [10]
+ style change? source separation? eﬀects/de-eﬀects?
39/43

Bonus 2. 11 Selected pages of this slide on Auto-Tagging
with CNNs
Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Conclusion
Reference
Automatic Tagging using
Deep Convolutional Neural Networks [1]
Keunwoo.Choi
@qmul.ac.uk
1/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Conclusion
Reference
Introduction
Tagging
Tags
Descriptive keywords that people put on music
Multi-label nature
E.g. {rock, guitar, drive, 90’s}
Music tags include Genres (rock, pop, alternative, indie),
Instruments (vocalists, guitar, violin), Emotions (mellow,
chill), Activities (party, drive), Eras (00’s, 90’s, 80’s).
Collaboratively created (Last.fm ) → noisy
false negative
synonyms (vocal/vocals/vocalist/vocalists/voice/voices.
guitar/guitars)
popularity bias
typo (harpsicord)
irrelevant tags (abcd, ilikeit, fav)
3/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
TF-
representations
Convolution
Kernels and
Axes
Pooling
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Conclusion
CNNs and Music
TF-representations
Options
STFT / Mel-spectrogram / CQT / raw-audio
STFT: Okay, but why not melgram?
Melgram: Eﬃcient
CQT: only if you’re interested in fundamentals/pitchs
Raw-audio: end-to-end setup (learn the transformation),
have not outperformed melgram (yet) in speech/music
perhaps the way to go in the future?
we lose frequency axis though
7/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
definition
The proposed
architecture
Experiments
and
discussions
Conclusion
Reference
Problem definition
Automatic tagging
Automatic tagging is a multi-label classification task
K-dim vector: up to 2K cases
Majority of tags is False (no matter it’s correct or not)
Measured by AUC-ROC
Area Under Curve of Receiver Operating Characteristics
1
1
Image from Kaggle
10/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Conclusion
Reference
The proposed architecture
4-layer fully convolutional network, FCN-4
11/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Conclusion
Reference
The proposed architecture
FCN-5 FCN-6 FCN-7
Mel-spectrogram (input: 96×1366×1)
Conv 3×3×128
MP (2, 4) (output: 48×341×128)
Conv 3×3×256
MP (2, 4) (output: 24×85×256)
Conv 3×3×512
MP (2, 4) (output: 12×21×512)
Conv 3×3×1024
MP (3, 5) (output: 4×4×1024)
Conv 3×3×2048
MP (4, 4) (output: 1×1×2048)
·
Conv 1×1×1024 Conv 1×1×1024
· Conv 1×1×1024
Output 50×1 (sigmoid) 12/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Overview
MagnaTagATune
Million Song
Dataset
Conclusion
Reference
Experiments and discussions
Overview
MTT MSD
# tracks 25k 1M
# songs 5-6k 1M
Length 29.1s 30-60s
Benchmarks 10+ 0
Labels Tags, genres
Tags, genres,
EchoNest features,
bag-of-word lyrics,...
13/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Overview
MagnaTagATune
Million Song
Dataset
Conclusion
Reference
MagnaTagATune
Same depth (l=4), melgram>MFCC>STFT
melgram: 96 mel-frequency bins
STFT: 128 frequency bins
MFCC: 90 (30 MFCC, 30 MFCCd, 30 MFCCdd)
Methods AUC
FCN-3, mel-spectrogram .852
FCN-4, STFT .846
FCN-4, MFCC .862
Still, ConvNet may outperform frequency aggregation than
mel-frequency with more data. But not here.
ConvNet outperformed MFCC
15/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Overview
MagnaTagATune
Million Song
Dataset
Conclusion
Reference
MagnaTagATune
Methods AUC
FCN-4, STFT .846
FCN-4, MFCC .862
FCN-4>FCN-3: Depth worked!
FCN-4>FCN-5 by .004
Deeper model might make it equal after ages of training
Deeper models requires more data
Deeper models take more time (deep residual network[6])
4 layers are enough vs. matter of size(data)?
16/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Overview
MagnaTagATune
Million Song
Dataset
Conclusion
Reference
Million Song Dataset
Methods AUC
FCN-4, — .808
FCN-5, — .848
FCN-6, — .851
FCN-7, — .845
FCN-3<4<5<6 !
Deeper layers pay oﬀ
utill 6-layers in this case.
19/22

Automatic
Tagging using
Deep
Convolutional
Neural
Networks [1]
Keunwoo.Choi
@qmul.ac.uk
Introduction
CNNs and
Music
Problem
deﬁnition
The proposed
architecture
Experiments
and
discussions
Conclusion
Reference
Conclusion
2D fully convolutional networks work well
Mel-spectrogram can be preferred to STFT until
until we have a HUGE dataset so that mel-frequency
aggregation can be replaced
Bye bye, MFCC? In the near future, I guess
MIR can go deeper than now
if we have bigger, better, stronger datasets
Q. How do ConvNets actually deal with spectrograms?
A. Stay tuned to this year’s MLSP paper!
21/22

Deep learning for music classification, 2016-05-24

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Deep learning for music classification, 2016-05-24

Similar to Deep learning for music classification, 2016-05-24 (20)

More from Keunwoo Choi

More from Keunwoo Choi (12)

Recently uploaded

Recently uploaded (20)

Deep learning for music classification, 2016-05-24