Convolutional Recurrent Neural Networks for
Music Classification
Keunwoo Choi, György Fazekas, Mark Sandler
Centre for Digital Music, Queen Mary University of London, UK
Kyunghyun Cho
Center for Data Science, New York University, USA
@keunwoochoi
09 March 2017
1 Background
  Music Tagging
  Motivation
2 Experiment specifications
3 Experiment results and discussions
Background
Background
Task: Music Tagging
Tags: whatever keywords people think describe the music
Multi-label by nature, e.g. {rock, guitar, drive, 90's}
Music tags include genres (rock, pop, alternative, indie),
instruments (vocalists, guitar, violin), emotions (mellow,
chill), activities (party, drive), and eras (00's, 90's, 80's)
Collaboratively created (e.g. on Last.fm), hence noisy and
ill-defined (of course)
Ill-defined but useful: it's reality!
Evaluation: AUC-ROC, nominally in [0.0, 1.0] but effectively
[0.5, 1.0], since a random classifier already scores 0.5
(see the sketch below)
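As a minimal sketch of this metric (assuming scikit-learn; the labels and scores below are random placeholders, not real data), note that a random scorer lands right at the 0.5 floor:

    # Per-tag AUC-ROC averaged over tags (macro), as used for tagging.
    # y_true / y_score are random placeholders, not real data.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=(1000, 50))   # 50 binary tags
    y_score = rng.random((1000, 50))               # model outputs in [0, 1]

    print(roc_auc_score(y_true, y_score, average="macro"))  # ~0.5: random baseline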
Background
Task: Music Tagging
Background
Motivation
Why don't we try a Convolutional Recurrent NN?
We deserve a better benchmark
Experiment specifications
Four structures
(Figure: the four architectures compared: (a) k1c2, (b) k2c1, (c) k2c2, (d) CRNN. Details in the paper.)
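To make the CRNN idea concrete, here is a minimal Keras sketch of a CRNN-style tagger. The input shape (a 96-bin mel spectrogram of 1366 frames) and the 50-tag sigmoid output match the paper's setting, but the widths and the number of blocks here are illustrative, not the paper's exact configuration:

    # CRNN sketch: 2D conv blocks extract local time-frequency features,
    # then GRUs summarize the resulting sequence over time.
    from tensorflow.keras import layers, models

    def build_crnn(n_mels=96, n_frames=1366, n_tags=50):
        inp = layers.Input(shape=(n_mels, n_frames, 1))
        x = inp
        for n_filters in (64, 128, 128, 128):     # illustrative widths
            x = layers.Conv2D(n_filters, (3, 3), padding="same",
                              activation="relu")(x)
            x = layers.MaxPooling2D((2, 2))(x)    # halve freq and time
        x = layers.Permute((2, 1, 3))(x)          # -> (time, freq, channels)
        x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
        x = layers.GRU(128, return_sequences=True)(x)
        x = layers.GRU(128)(x)                    # final state summarizes the clip
        out = layers.Dense(n_tags, activation="sigmoid")(x)  # multi-label
        return models.Model(inp, out)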
What to compare
Performance
The first thing that matters
Number of parameters
(GPU) memory: in practice it's fine as long as the model fits
into your hardware and mini-batch SGD is possible
The model's theoretical capacity (the dimensionality of its
parameter space)
Training/inference time per epoch
(GPU) computational complexity
Related to the depth of the network
Faster training matters
Faster inference may matter even more
How to compare?
Parameters: 0.1M, 0.25M, 0.5M, 1.0M, 3.0M
...by controlling only the width of the layers (keeping depth,
sub-sampling strategy, and kernel sizes fixed)
Width?
The number of feature maps in a convolutional layer
The number of nodes in a dense/RNN layer
(a sketch of this scaling follows)
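A sketch of this scaling scheme, assuming a plain Keras CNN; the base widths and budgets are illustrative. Only the width multiplier changes between runs, while depth, pooling, and kernel sizes stay fixed:

    # Hit a parameter budget by scaling every layer's width with one
    # multiplier. Base widths (32, 64, 64, 64) are illustrative.
    from tensorflow.keras import layers, models

    def build_cnn(width=1.0, n_tags=50):
        inp = layers.Input(shape=(96, 1366, 1))
        x = inp
        for base in (32, 64, 64, 64):
            x = layers.Conv2D(int(base * width), (3, 3),
                              padding="same", activation="relu")(x)
            x = layers.MaxPooling2D((2, 2))(x)
        x = layers.GlobalAveragePooling2D()(x)
        out = layers.Dense(n_tags, activation="sigmoid")(x)
        return models.Model(inp, out)

    # Sweep the multiplier until count_params() lands near each budget.
    for w in (0.5, 1.0, 2.0, 4.0):
        print(f"width x{w}: {build_cnn(w).count_params():,} parameters")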
Experiment results and discussions
Performance vs number of parameters
(Figure: AUC-ROC (0.80 to 0.87) vs. number of parameters (0.1M to 3.0M) for k1c2, k2c1, k2c2, and CRNN, with the state-of-the-art (SOTA) score as a reference.)
k2c2 and CRNN work well
1D conv (k2c1, k1c2): not so good
Why? (I don't know exactly)
Difference: flexibility. k2c2 allows small invariances in every
conv layer, along both the time and frequency axes, while k2c1
sees the whole frequency range at once and therefore allows no
distortion invariance along frequency (see the sketch below)
Performance vs training/inference time
(Figure: AUC-ROC (0.80 to 0.87) vs. training time per epoch (9 to 400 s) for k1c2, k2c1, k2c2, and CRNN.)
1D convolution is very fast: the big kernels in the first layer
make life easier
Time consumption ∝ feature map size
Time consumption ∝ 1/depth
k2c2 and CRNN behave similarly across different time budgets
CRNN for the best performance; k2c2 for time-efficient
performance (or make it wider and shallower)
(a rough timing sketch follows)
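A rough way to reproduce the per-epoch timing comparison, reusing the build_cnn and build_crnn sketches above on dummy data (absolute numbers depend entirely on your hardware):

    # Rough per-epoch timing on a dummy batch; reuses the sketches above.
    import time
    import numpy as np

    x = np.random.rand(64, 96, 1366, 1).astype("float32")
    y = np.random.randint(0, 2, (64, 50)).astype("float32")

    for name, model in (("cnn", build_cnn()), ("crnn", build_crnn())):
        model.compile(optimizer="adam", loss="binary_crossentropy")
        t0 = time.perf_counter()
        model.fit(x, y, epochs=1, batch_size=16, verbose=0)
        print(name, f"{time.perf_counter() - t0:.1f} s per (dummy) epoch")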
Performance per tag, per structure
(Figure: per-tag AUC-ROC (0.6 to 1.0) for k1c2, k2c1, and CRNN across the 50 tags, sorted by score and labelled by category, e.g. Hip-Hop, metal, heavy metal, punk, ..., catchy, happy.)
Per-tag performance ⊥ network structure
Per-tag performance ⊥ number of training samples
Per-tag performance ∝ ??
Tag difficulty: hard to find the logic
Each tag has a different difficulty
Each tag has a different glass ceiling (ground-truth noise)
Training samples cooperate to let the network learn useful,
shared representations
Performance per tag, per structure
(Figure: the same per-tag AUC-ROC plot (0.6 to 1.0), with the 50 tags grouped into Genre, Mood, Instrument, and Era.)
Conclusions
Choose a structure by your time/memory constraints
Scaling by controlling width makes sense
Let the first-layer kernels see the frequency range gradually,
not all at once
Per-tag performance ⊥ number of samples, probably because the
network learns representations that are shared across tags
@keunwoochoi on GitHub, WordPress, Twitter
GitHub repositories
compact_cnn: CNN model and weights
music-auto_tagging-keras: CNN/CRNN models and weights
kapre: Keras Audio Preprocessors (audio preprocessing layers)