Automatic Tagging using Deep Convolutional Neural Networks [1]
Keunwoo Choi (Keunwoo.Choi@qmul.ac.uk)

1 Introduction
2 CNNs and Music
    TF-representations
    Convolution Kernels and Axes
    Pooling
3 Problem definition
4 The proposed architecture
5 Experiments and discussions
    Overview
    MagnaTagATune
    Million Song Dataset
6 Conclusion
7 Reference

Introduction
Tagging
Tags
Descriptive keywords that people put on music
Multi-label nature (see the encoding sketch below)
E.g. {rock, guitar, drive, 90's}
Music tags include genres (rock, pop, alternative, indie),
instruments (vocalists, guitar, violin), emotions (mellow,
chill), activities (party, drive), and eras (00's, 90's, 80's).
Collaboratively created (Last.fm) → noisy:
false negatives
synonyms (vocal/vocals/vocalist/vocalists/voice/voices,
guitar/guitars)
popularity bias
typos (harpsicord)
irrelevant tags (abcd, ilikeit, fav)
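
Since tagging is multi-label, each track's tag set is typically encoded
as a K-dimensional multi-hot vector. A minimal sketch with scikit-learn;
the tag sets below are made-up examples, not taken from any dataset:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag sets for three tracks (illustration only)
track_tags = [
    {"rock", "guitar", "drive", "90's"},
    {"pop", "vocalists", "party"},
    {"indie", "mellow", "guitar"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(track_tags)  # shape: (3 tracks, K tags)
print(mlb.classes_)  # the K-tag vocabulary (sorted)
print(Y)             # one multi-hot row per track
```
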
Somewhat multi-task: genre/instrument/emotion/era could
each be a separate task
Genres (rock, pop, alternative, indie), instruments
(vocalists, guitar, violin), emotions (mellow, chill),
activities (party, drive), eras (00's, 90's, 80's)
Although many labels are missing
Are they really extractable from audio?

CNNs and Music
TF-representations
Options (see the sketch below)
STFT / mel-spectrogram / CQT / raw audio
STFT: okay, but why not a mel-spectrogram?
Mel-spectrogram: efficient
CQT: only if you're interested in fundamentals/pitches
Raw audio: end-to-end setup (learn the transformation),
has not outperformed mel-spectrograms (yet) in speech/music
perhaps the way to go in the future?
we lose the explicit frequency axis though
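
For concreteness, a minimal sketch of computing these TF-representations
with librosa. The file name and the analysis parameters (sample rate, FFT
size, hop length) are assumptions for illustration; only the 96 mel bins
matches the figure quoted in the experiments:

```python
import librosa

# Load a clip; the 12 kHz sample rate is an assumption for illustration
y, sr = librosa.load("some_track.mp3", sr=12000)

# STFT: linear-frequency magnitude spectrogram
stft = abs(librosa.stft(y, n_fft=512, hop_length=256))

# Mel-spectrogram: mel-frequency aggregation of the STFT, in dB
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=256, n_mels=96)
mel_db = librosa.power_to_db(mel)

# CQT: log-frequency bins, useful for fundamentals/pitches
cqt = abs(librosa.cqt(y, sr=sr, hop_length=256))
```
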
CNNs and Music
Convolution Kernels and Axes
Kernels
Rule of thumb: deeper with smaller kernels > shallower with
bigger kernels, as in VGGNet [8] and residual networks [6]
Axes
For tagging, time-axis convolution seems essential
Dieleman's approach [4] does not apply freq-axis convolution
The proposed method uses 2-D convolution, i.e. over both the
time and frequency axes (a sketch follows below)
Pros: can see local frequency structure
Cons: we do not know whether it is really used (it seems
to be) [2]
More details in the paper [1]
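
As a rough illustration of 2-D (time-and-frequency) convolution in this
spirit, a minimal Keras sketch: small square kernels, pooling down the
TF map, and a sigmoid output for multi-label tags. The filter counts,
pooling sizes, and input shape are my assumptions, not the paper's
exact FCN specification:

```python
from tensorflow.keras import layers, models

# Input: (mel bins, time frames, 1 channel); the shape is an assumption
inp = layers.Input(shape=(96, 1366, 1))
x = inp
# Four conv blocks: small 3x3 kernels spanning both freq and time axes;
# max-pooling progressively shrinks the time-frequency map
for n_filters in (64, 128, 128, 192):
    x = layers.Conv2D(n_filters, (3, 3), padding="same",
                      activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 4))(x)
x = layers.GlobalMaxPooling2D()(x)
# 50 tags; sigmoid (not softmax) because tags are not mutually exclusive
out = layers.Dense(50, activation="sigmoid")(x)
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```
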
Problem definition
Automatic tagging
Automatic tagging is a multi-label classification task
K-dimensional binary vector: up to 2^K cases
The majority of tags are False (whether correctly or not)
Measured by AUC-ROC (sketch below)
Area Under the Receiver Operating Characteristic curve
[Figure: ROC curve illustration, image from Kaggle]
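
A minimal sketch of the metric with scikit-learn: ROC-AUC is computed
per tag on binary ground truth vs. predicted scores, then averaged over
the K tags (the arrays below are toy values):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth (tracks x tags) and predicted tag scores
y_true = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.3, 0.8],
                    [0.8, 0.6, 0.4],
                    [0.3, 0.7, 0.2]])

# Macro average: per-tag AUC, then the mean over tags
print(roc_auc_score(y_true, y_score, average="macro"))
```
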
Experiments and discussions
Overview
              MTT        MSD
# tracks      25k        1M
# songs       5-6k       1M
Length        29.1s      30-60s
Benchmarks    10+        0
Labels        Tags,      Tags, genres, EchoNest features,
              genres     bag-of-words lyrics, ...

Experiments and discussions
MagnaTagATune
The Hopes
To validate the proposed algorithm against the state of the art
To find the best setup among FCN-3/4/5/6/7
The Reality
To verify that the proposed algorithm is comparable to the
state of the art
To compare mel-spectrogram vs. STFT vs. MFCC

At the same depth (l=4): mel-spectrogram > MFCC > STFT
mel-spectrogram: 96 mel-frequency bins
STFT: 128 frequency bins
MFCC: 90 bins (30 MFCCs, 30 deltas, 30 delta-deltas)
Methods                        AUC
FCN-3, mel-spectrogram         .852
FCN-4, mel-spectrogram         .894
FCN-5, mel-spectrogram         .890
FCN-4, STFT                    .846
FCN-4, MFCC                    .862
Still, with more data a ConvNet might learn a frequency
aggregation that outperforms the fixed mel-frequency one.
But not here.
The ConvNet features outperformed MFCCs.
FCN-4 > FCN-3: depth worked!
FCN-4 > FCN-5, by .004
A deeper model might catch up after ages of training
Deeper models require more data
Deeper models take more time (cf. deep residual networks [6])
Are 4 layers enough, or is it a matter of (data) size?

Methods                                   AUC
The proposed system, FCN-4                .894
2015, Bag of features and RBM [7]         .888
2014, 1-D convolutions [4]                .882
2014, Transfer learning [10]              .88
2013, Multi-scale approach [3]            .898
2011, Pooling MFCC [5]                    .861
All deep/NN approaches are around .88-.89
Are we hitting a glass ceiling?
Perhaps due to the label noise of MTT, but that is tricky to prove
26k tracks are not enough for millions of parameters

Summary
Mel-spectrogram over STFT and MFCC
The 2-D ConvNet is at least not worse than previous approaches
Keunwoo: (Hesitating) MTT, it has been a great journey
with you. I think we should move on.
MTT: (With tears) No! Are you... are you gonna be with
MSD?
Keunwoo: ...
MTT: (Falls down)
Keunwoo: (Almost taps MTT, then gets a message from
MSD that it has the crawled audio files.)

Experiments and discussions
Million Song Dataset
Methods                        AUC
FCN-3, mel-spectrogram         .786
FCN-4, mel-spectrogram         .808
FCN-5, mel-spectrogram         .848
FCN-6, mel-spectrogram         .851
FCN-7, mel-spectrogram         .845
FCN-3 < 4 < 5 < 6!
Deeper models pay off,
up to 6 layers in this case.

Conclusion
2-D fully convolutional networks work well
Mel-spectrograms can be preferred to STFT
until we have a HUGE dataset, so that the mel-frequency
aggregation can be replaced by a learned one
Bye bye, MFCC? In the near future, I guess
MIR can go deeper than now
if we have bigger, better, stronger datasets
Q. How do ConvNets actually deal with spectrograms?
A. Stay tuned for this year's MLSP paper! [2]

Reference
[1] Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep
convolutional neural networks. In: Proceedings of the 17th International
Society for Music Information Retrieval Conference (ISMIR 2016),
New York, USA (2016)
[2] Choi, K., Fazekas, G., Sandler, M.: Explaining convolutional neural
networks on music classification (submitted). In: IEEE International
Workshop on Machine Learning for Signal Processing (MLSP), Salerno,
Italy. IEEE (2016)
[3] Dieleman, S., Schrauwen, B.: Multiscale approaches to music audio
feature learning. In: ISMIR. pp. 3–8 (2013)
[4] Dieleman, S., Schrauwen, B.: End-to-end learning for music audio.
In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE
International Conference on. pp. 6964–6968. IEEE (2014)
[5] Hamel, P., Lemieux, S., Bengio, Y., Eck, D.: Temporal pooling and
multiscale learning for automatic annotation and ranking of music audio.
In: ISMIR. pp. 729–734 (2011)
[6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385 (2015)
[7] Nam, J., Herrera, J., Lee, K.: A deep bag-of-features model for music
auto-tagging. arXiv preprint arXiv:1508.04999 (2015)
[8] Simonyan, K., Zisserman, A.: Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[9] Tzanetakis, G., Cook, P.: Musical genre classification of audio
signals. IEEE Transactions on Speech and Audio Processing 10(5),
293–302 (2002)
[10] Van Den Oord, A., Dieleman, S., Schrauwen, B.: Transfer learning by
supervised pre-training for audio-based music classification. In:
Conference of the International Society for Music Information Retrieval
(ISMIR 2014) (2014)