Automatic Tagging using Deep Convolutional Neural Networks [1]
Keunwoo.Choi@qmul.ac.uk
Centre for Digital Music, Queen Mary University of London, UK
1 Introduction
2 CNNs and Music
    TF-representations
    Convolution Kernels and Axes
    Pooling
3 Problem definition
4 The proposed architecture
5 Experiments and discussions
    Overview
    MagnaTagATune
    Million Song Dataset
6 Conclusion
7 Reference
Introduction
Tagging

Tags: descriptive keywords that people put on music
Multi-label nature, e.g. {rock, guitar, drive, 90’s}
Music tags include genres (rock, pop, alternative, indie), instruments (vocalists, guitar, violin), emotions (mellow, chill), activities (party, drive), and eras (00’s, 90’s, 80’s)
Collaboratively created (e.g. on Last.fm) → noisy:
    false negatives
    synonyms (vocal/vocals/vocalist/vocalists/voice/voices, guitar/guitars)
    popularity bias
    typos (harpsicord)
    irrelevant tags (abcd, ilikeit, fav)
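These noise sources are often handled with a normalization pass before training. A toy sketch; the synonym map and stop-list below are illustrative, not the paper's actual preprocessing:

```python
# Toy tag normalization: collapse synonym spellings, fix a known typo,
# and drop irrelevant tags before training.
SYNONYMS = {
    "vocal": "vocals", "vocalist": "vocals", "vocalists": "vocals",
    "voice": "vocals", "voices": "vocals",
    "guitars": "guitar",
    "harpsicord": "harpsichord",  # the common typo mentioned above
}
IRRELEVANT = {"abcd", "ilikeit", "fav"}

def clean_tags(tags):
    # map each tag to its canonical form, then drop the junk
    cleaned = {SYNONYMS.get(t, t) for t in tags}
    return sorted(cleaned - IRRELEVANT)

print(clean_tags(["vocalist", "voices", "guitars", "fav", "rock"]))
# ['guitar', 'rock', 'vocals']
```

Note this does nothing about false negatives or popularity bias, which cannot be fixed by string matching.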
Introduction
Tagging

Somewhat multi-task: genre/instrument/emotion/era could be treated as separate tasks
    Genres (rock, pop, alternative, indie), instruments (vocalists, guitar, violin), emotions (mellow, chill), activities (party, drive), eras (00’s, 90’s, 80’s)
Although many labels are missing
Are they really extractable from audio?
Introduction
Previous approaches
Conventional ML: Feature extraction + Classifier [9]
Introduction
Previous deep approaches

Going deep (and automatic)!
Dieleman and Schrauwen, 2014, IEEE [4], and Dieleman’s work at Spotify
Nam et al., 2015 [7]
CNNs and Music
TF-representations

Options: STFT / mel-spectrogram / CQT / raw audio
STFT: okay, but why not melgram?
Melgram (mel-spectrogram): efficient
CQT: only if you’re interested in fundamentals/pitches
Raw audio: end-to-end setup (learn the transformation); has not outperformed melgram (yet) in speech/music
    perhaps the way to go in the future?
    though we lose the frequency axis
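For concreteness, a melgram is just an STFT whose frequency bins are aggregated by triangular mel-scale filters. A minimal NumPy sketch; the parameters (12 kHz sampling, n_fft=512, hop 256, 96 mel bins) are assumptions chosen to match the 96-bin input described later, not specified on the slides:

```python
import numpy as np

def hz_to_mel(f):
    # standard HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters, evenly spaced on the mel scale
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - left) / (center - left)
        down = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def melspectrogram(y, sr=12000, n_fft=512, hop=256, n_mels=96):
    # magnitude STFT with a Hann window, then mel-weighted aggregation
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = y[t * hop: t * hop + n_fft] * window
        spec[:, t] = np.abs(np.fft.rfft(frame))
    return mel_filterbank(n_mels, n_fft, sr) @ (spec ** 2)

y = np.random.randn(12000 * 4)  # 4 seconds of noise at 12 kHz
S = melspectrogram(y)
print(S.shape)  # (96, 186)
```

The "efficiency" argument is visible here: the 257 linear-frequency bins collapse to 96 mel bins, shrinking the input without (for tagging) losing much information.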
CNNs and Music
Convolution Kernels and Axes

Kernels
    Rule of thumb: deeper > bigger, as in VGGNet [8] and residual nets [6]
Axes
    For tagging, time-axis convolution seems essential
    Dieleman’s approach does not apply freq-axis convolution
    The proposed method uses 2-D convolutions, i.e. over both the time and freq axes
        Pros: can see local frequency structure
        Cons: we do not know whether it is really exploited (it seems to be) [2]
More details in the paper [1]
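The difference between the two choices is only the kernel shape: a 3×3 kernel spans both axes, while a time-only kernel would be 1×N. A naive sketch of "valid" 2-D convolution over a toy spectrogram patch (as usual in convnets, this is really cross-correlation):

```python
import numpy as np

def conv2d_valid(x, k):
    # naive 2-D 'valid' convolution (cross-correlation, as in convnets)
    h, w = k.shape
    out = np.empty((x.shape[0] - h + 1, x.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

spec = np.random.randn(96, 100)   # toy mel-spectrogram patch (freq × time)
kernel = np.random.randn(3, 3)    # 3×3 kernel sees local time AND frequency structure
print(conv2d_valid(spec, kernel).shape)  # (94, 98)
```

A freq-agnostic kernel in this setup would simply have shape (1, 3): it slides along time but never combines adjacent frequency bins.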
CNNs and Music
Pooling

Pooling is the source of invariances
Max-pooling discards where the maximum came from = ignores small differences = invariance to small distortions
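The invariance claim is easy to see in a tiny example: after non-overlapping max-pooling, a signal and a slightly shifted copy of it produce identical outputs:

```python
import numpy as np

def max_pool_1d(x, size):
    # non-overlapping max-pooling along a 1-D signal
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.array([0., 0., 5., 0., 0., 0., 3., 0.])
shifted = np.roll(x, 1)  # shift every sample one step to the right

print(max_pool_1d(x, 4))        # [5. 3.]
print(max_pool_1d(shifted, 4))  # [5. 3.]
```

The pooled output only records that a peak occurred somewhere inside each window, not where, which is exactly the invariance to small distortions described above.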
Problem definition
Automatic tagging

Automatic tagging is a multi-label classification task
K-dim binary label vector: up to 2^K possible label combinations
The majority of tag values are False (whether correctly so or not)
Measured by AUC-ROC: Area Under the Curve of the Receiver Operating Characteristic (ROC figure omitted; image from Kaggle)
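A sketch of how this metric can be computed per tag and then averaged, using the rank-based definition of AUC (the probability that a randomly chosen positive scores above a randomly chosen negative); the labels and scores below are made up for illustration:

```python
import numpy as np

def auc_roc(y_true, y_score):
    # rank-based AUC: P(random positive scores above random negative), ties count half
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# toy multi-label setup: 6 clips, 2 tags; each tag is scored independently
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.8, 0.7], [0.3, 0.1],
                    [0.2, 0.6], [0.7, 0.65], [0.1, 0.4]])

per_tag = [auc_roc(y_true[:, k], y_score[:, k]) for k in range(2)]
print(np.mean(per_tag))  # 0.9375
```

Because most tag values are False, plain accuracy would reward predicting all-zeros; AUC-ROC is insensitive to that imbalance, which is why it is the standard metric here.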
The proposed architecture
4-layer fully convolutional network, FCN-4
The proposed architecture
FCN-5, FCN-6, FCN-7

Mel-spectrogram (input: 96×1366×1)
Conv 3×3×128, MP (2, 4) (output: 48×341×128)
Conv 3×3×256, MP (2, 4) (output: 24×85×256)
Conv 3×3×512, MP (2, 4) (output: 12×21×512)
Conv 3×3×1024, MP (3, 5) (output: 4×4×1024)
Conv 3×3×2048, MP (4, 4) (output: 1×1×2048)
Conv 1×1×1024 (FCN-6 and FCN-7 only)
Conv 1×1×1024 (FCN-7 only)
Output 50×1 (sigmoid)
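The output sizes in the table follow from the pooling factors alone. A quick trace (using floor division, which reproduces the stated sizes) confirms that the stack reduces the whole 96×1366 mel-spectrogram to a single 2048-dim vector before the sigmoid output:

```python
# Trace the feature-map sizes through the pooling chain of the table above.
shape = (96, 1366)                      # (mel bins, time frames)
pools = [(2, 4), (2, 4), (2, 4), (3, 5), (4, 4)]
trace = []
for pf, pt in pools:
    shape = (shape[0] // pf, shape[1] // pt)
    trace.append(shape)
print(trace)  # [(48, 341), (24, 85), (12, 21), (4, 4), (1, 1)]
```

This is what makes the network "fully convolutional": no flatten/dense layers are needed, because pooling alone collapses both axes to 1×1.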
Experiments and discussions
Overview

            MTT       MSD
# tracks    25k       1M
# songs     5-6k      1M
Length      29.1s     30-60s
Benchmarks  10+       0
Labels      Tags, genres    Tags, genres, EchoNest features, bag-of-word lyrics, ...
Experiments and discussions
MagnaTagATune

The Hopes
    To validate the proposed algorithm against the state of the art
    To find the best setup among FCN-3/4/5/6/7
The Reality
    To verify the proposed algorithm is comparable to the state of the art
    To compare melgram vs. STFT vs. MFCC
Experiments and discussions
MagnaTagATune

At the same depth (l=4): melgram > MFCC > STFT
    melgram: 96 mel-frequency bins
    STFT: 128 frequency bins
    MFCC: 90 (30 MFCC, 30 MFCC-d, 30 MFCC-dd)

Methods                     AUC
FCN-3, mel-spectrogram      .852
FCN-4, mel-spectrogram      .894
FCN-5, mel-spectrogram      .890
FCN-4, STFT                 .846
FCN-4, MFCC                 .862

With more data, a ConvNet might still learn a frequency aggregation that outperforms fixed mel-frequency aggregation, but not here.
The ConvNet outperformed MFCC.
Experiments and discussions
MagnaTagATune

Methods                     AUC
FCN-3, mel-spectrogram      .852
FCN-4, mel-spectrogram      .894
FCN-5, mel-spectrogram      .890
FCN-4, STFT                 .846
FCN-4, MFCC                 .862

FCN-4 > FCN-3: depth worked!
FCN-4 > FCN-5, but only by .004
    A deeper model might catch up after ages of training
    Deeper models require more data
    Deeper models take more time to train (cf. deep residual networks [6])
Are 4 layers enough, or is it a matter of data size?
Experiments and discussions
MagnaTagATune

Methods                                 AUC
The proposed system, FCN-4              .894
2015, Bag of features and RBM [7]       .888
2014, 1-D convolutions [4]              .882
2014, Transfer learning [10]            .88
2013, Multi-scale approach [3]          .898
2011, Pooling MFCC [5]                  .861

All deep/NN approaches land around .88-.89
Are we touching a glass ceiling?
    Perhaps due to the noise in MTT, but that is tricky to prove
    26K tracks are not enough for millions of parameters
Experiments and discussions
MagnaTagATune

Summary
    Melgram over STFT and MFCC
    A 2-D convnet is at least no worse than the previous approaches

Keunwoo: (Hesitating) MTT, it has been a great journey with you. I think we should move on.
MTT: (With tears) No! Are you... are you gonna be with MSD?
Keunwoo: ...
MTT: (Falls down)
Keunwoo: (Almost taps MTT, then gets a message from MSD that it has the crawled audio files.)
Experiments and discussions
Million Song Dataset

Methods                     AUC
FCN-3, mel-spectrogram      .786
FCN-4, —                    .808
FCN-5, —                    .848
FCN-6, —                    .851
FCN-7, —                    .845

FCN-3 < 4 < 5 < 6!
Deeper layers pay off, until 6 layers in this case.
Experiments and discussions
Million Song Dataset

Complex models take more time to train, but in the end they may outperform simpler ones.
Conclusion

2-D fully convolutional networks work well
Mel-spectrogram can be preferred to STFT
    until we have a HUGE dataset so that mel-frequency aggregation can be replaced
Bye bye, MFCC? In the near future, I guess
MIR can go deeper than now
    if we have bigger, better, stronger datasets
Q. How do ConvNets actually deal with spectrograms?
A. Stay tuned for this year’s MLSP paper!
Q&A
Reference

[1] Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, USA (2016)
[2] Choi, K., Fazekas, G., Sandler, M.: Explaining convolutional neural networks on music classification (submitted). In: IEEE International Workshop on Machine Learning for Signal Processing, Salerno, Italy. IEEE (2016)
[3] Dieleman, S., Schrauwen, B.: Multiscale approaches to music audio feature learning. In: ISMIR. pp. 3-8 (2013)
[4] Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. pp. 6964-6968. IEEE (2014)
[5] Hamel, P., Lemieux, S., Bengio, Y., Eck, D.: Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In: ISMIR. pp. 729-734 (2011)
[6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
[7] Nam, J., Herrera, J., Lee, K.: A deep bag-of-features model for music auto-tagging. arXiv preprint arXiv:1508.04999 (2015)
[8] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[9] Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293-302 (2002)
[10] Van den Oord, A., Dieleman, S., Schrauwen, B.: Transfer learning by supervised pre-training for audio-based music classification. In: Conference of the International Society for Music Information Retrieval (ISMIR 2014) (2014)

More Related Content

Recently uploaded

ALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
Madan Karki
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Lovely Professional University
 

Recently uploaded (20)

Circuit Breaker arc phenomenon.pdf engineering
Circuit Breaker arc phenomenon.pdf engineeringCircuit Breaker arc phenomenon.pdf engineering
Circuit Breaker arc phenomenon.pdf engineering
 
Operating System chapter 9 (Virtual Memory)
Operating System chapter 9 (Virtual Memory)Operating System chapter 9 (Virtual Memory)
Operating System chapter 9 (Virtual Memory)
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docx
 
ChatGPT Prompt Engineering for project managers.pdf
ChatGPT Prompt Engineering for project managers.pdfChatGPT Prompt Engineering for project managers.pdf
ChatGPT Prompt Engineering for project managers.pdf
 
Multivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptxMultivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptx
 
Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission line
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2
 
Interfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfInterfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdf
 
Piping and instrumentation diagram p.pdf
Piping and instrumentation diagram p.pdfPiping and instrumentation diagram p.pdf
Piping and instrumentation diagram p.pdf
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
How to Design and spec harmonic filter.pdf
How to Design and spec harmonic filter.pdfHow to Design and spec harmonic filter.pdf
How to Design and spec harmonic filter.pdf
 
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AIIntroduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
 

Featured

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Automatic Tagging using Deep Convolutional Neural Networks

  • 1. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Centre for Digital Music, Queen Mary University of London, UK 1/22
  • 2. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference 1 Introduction 2 CNNs and Music TF-representations Convolution Kernels and Axes Pooling 3 Problem definition 4 The proposed architecture 5 Experiments and discussions Overview MagnaTagATune Million Song Dataset 6 Conclusion 7 Reference 2/22
  • 3. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference Introduction Tagging Tags Descriptive keywords that people put on music Multi-label nature E.g. {rock, guitar, drive, 90’s} Music tags include Genres (rock, pop, alternative, indie), Instruments (vocalists, guitar, violin), Emotions (mellow, chill), Activities (party, drive), Eras (00’s, 90’s, 80’s). Collaboratively created (Last.fm ) → noisy false negative synonyms (vocal/vocals/vocalist/vocalists/voice/voices. guitar/guitars) popularity bias typo (harpsicord) irrelevant tags (abcd, ilikeit, fav) 3/22
  • 4. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference Introduction Tagging Somehow multi-task: Genre/instrument/emotion/era can be in separate tasks Genres (rock, pop, alternative, indie), Instruments (vocalists, guitar, violin), Emotions (mellow, chill), Activities (party, drive), Eras (00’s, 90’s, 80’s). Although there are many missings Are they really extractable from audio? 4/22
  • 5. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference Introduction Previous approaches Conventional ML: Feature extraction + Classifier [9] 5/22
  • 6. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference Introduction Previous deep approaches Going deep (and automatic)! Dieleman and Schrauwen, 2014 IEEE [4] , and his work at Spotify Nam et al., 2015 [7] 6/22
  • 7. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music TF- representations Convolution Kernels and Axes Pooling Problem definition The proposed architecture Experiments and discussions Conclusion CNNs and Music TF-representations Options STFT / Mel-spectrogram / CQT / raw-audio STFT: Okay, but why not melgram? Melgram: Efficient CQT: only if you’re interested in fundamentals/pitchs Raw-audio: end-to-end setup (learn the transformation), have not outperformed melgram (yet) in speech/music perhaps the way to go in the future? we lose frequency axis though 7/22
  • 8. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music TF- representations Convolution Kernels and Axes Pooling Problem definition The proposed architecture Experiments and discussions Conclusion CNNs and Music Convolution Kernels and Axes Kernels Rule of thumb: deeper > bigger, like vggnet [8] and residual net [6] Axes For tagging, time-axis convolution seems essential Dieleman’s approach do not apply freq-axis convolution The proposed method use 2-d conv., i.e. both time and freq axes Pros: can see local frequency structure Cons: we do not know it is really used. (They seem to be used)[2] More details in the paper [1] 8/22
  • 9. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music TF- representations Convolution Kernels and Axes Pooling Problem definition The proposed architecture Experiments and discussions Conclusion CNNs and Music Pooling Source of invariances Max-pooling ignores where it comes from = ignore small differences = invariance to small distortions 9/22
  • 10. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference Problem definition Automatic tagging Automatic tagging is a multi-label classification task K-dim vector: up to 2K cases Majority of tags is False (no matter it’s correct or not) Measured by AUC-ROC Area Under Curve of Receiver Operating Characteristics 1 1 Image from Kaggle 10/22
  • 11. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference The proposed architecture 4-layer fully convolutional network, FCN-4 11/22
  • 12. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Conclusion Reference The proposed architecture FCN-5 FCN-6 FCN-7 Mel-spectrogram (input: 96×1366×1) Conv 3×3×128 MP (2, 4) (output: 48×341×128) Conv 3×3×256 MP (2, 4) (output: 24×85×256) Conv 3×3×512 MP (2, 4) (output: 12×21×512) Conv 3×3×1024 MP (3, 5) (output: 4×4×1024) Conv 3×3×2048 MP (4, 4) (output: 1×1×2048) · Conv 1×1×1024 Conv 1×1×1024 · Conv 1×1×1024 Output 50×1 (sigmoid) 12/22
  • 13. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Overview MagnaTagATune Million Song Dataset Conclusion Reference Experiments and discussions Overview MTT MSD # tracks 25k 1M # songs 5-6k 1M Length 29.1s 30-60s Benchmarks 10+ 0 Labels Tags, genres Tags, genres, EchoNest features, bag-of-word lyrics,... 13/22
  • 14. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Overview MagnaTagATune Million Song Dataset Conclusion Reference Experiments and discussions MagnaTagATune The Hopes To validate the proposed algorithm onto state-of-the-art’s To find the best setup among FCN-3,4,5,6,7 The Reality To verify the proposed algorithm is comparable to state-of-the-art’s To know Melgram vs STFT vs MFCC 14/22
  • 15. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Overview MagnaTagATune Million Song Dataset Conclusion Reference Experiments and discussions MagnaTagATune Same depth (l=4), melgram>MFCC>STFT melgram: 96 mel-frequency bins STFT: 128 frequency bins MFCC: 90 (30 MFCC, 30 MFCCd, 30 MFCCdd) Methods AUC FCN-3, mel-spectrogram .852 FCN-4, mel-spectrogram .894 FCN-5, mel-spectrogram .890 FCN-4, STFT .846 FCN-4, MFCC .862 Still, ConvNet may outperform frequency aggregation than mel-frequency with more data. But not here. ConvNet outperformed MFCC 15/22
  • 16. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Overview MagnaTagATune Million Song Dataset Conclusion Reference Experiments and discussions MagnaTagATune Methods AUC FCN-3, mel-spectrogram .852 FCN-4, mel-spectrogram .894 FCN-5, mel-spectrogram .890 FCN-4, STFT .846 FCN-4, MFCC .862 FCN-4>FCN-3: Depth worked! FCN-4>FCN-5 by .004 Deeper model might make it equal after ages of training Deeper models requires more data Deeper models take more time (deep residual network[6]) 4 layers are enough vs. matter of size(data)? 16/22
  • 17. Automatic Tagging using Deep Convolutional Neural Networks [1] Keunwoo.Choi @qmul.ac.uk Introduction CNNs and Music Problem definition The proposed architecture Experiments and discussions Overview MagnaTagATune Million Song Dataset Conclusion Reference Experiments and discussions MagnaTagATune Methods AUC The proposed system, FCN-4 .894 2015, Bag of features and RBM [7] .888 2014, 1-D convolutions[4] .882 2014, Transferred learning [10] .88 2012, Multi-scale approach [3] .898 2011, Pooling MFCC [5] .861 All deep and NN approaches are around .88-.89 Are we touching the glass ceiling? Perhaps due to the noise of MTT, but tricky to prove it 26K tracks are not enough for millions of parameters 17/22
  • 18. Experiments and discussions / MagnaTagATune: Summary

    Mel-spectrogram beats STFT and MFCC.
    A 2D ConvNet is at least no worse than the previous approaches.

    Keunwoo: (hesitating) MTT, it has been a great journey with you. I think we should move on.
    MTT: (with tears) No! Are you.. are you gonna be with MSD?
    Keunwoo: ...
    MTT: (falls down)
    Keunwoo: (almost taps MTT, then gets a message from MSD that the crawled audio files are ready)
  • 19. Experiments and discussions / Million Song Dataset

    Methods                      AUC
    FCN-3, mel-spectrogram       .786
    FCN-4, mel-spectrogram       .808
    FCN-5, mel-spectrogram       .848
    FCN-6, mel-spectrogram       .851
    FCN-7, mel-spectrogram       .845

    FCN-3 < FCN-4 < FCN-5 < FCN-6! Deeper layers pay off, up to 6 layers in this case.
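The FCN-l family simply stacks l convolution + max-pooling stages until the frequency and time axes collapse to 1x1, then reads tag probabilities off the final feature map. A forward-pass sketch in plain numpy (channel widths, pool sizes, and the 50-tag output are illustrative choices, not the paper's exact hyperparameters):

```python
import numpy as np

def conv3x3_relu(x, w):
    """'Same' 3x3 convolution + ReLU: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], x.shape[1], x.shape[2]))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i:i + x.shape[1], j:j + x.shape[2]])
    return np.maximum(out, 0.0)

def maxpool(x, ph, pw):
    c, h, w = x.shape
    x = x[:, :h - h % ph, :w - w % pw]                    # drop remainder, then pool
    return x.reshape(c, h // ph, ph, w // pw, pw).max(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 96, 1366))                    # 1 channel, 96 mel bins, 1366 frames
channels = [1, 32, 32, 32, 32]
pools = [(2, 4), (4, 5), (3, 8), (4, 8)]                  # shrink 96 x 1366 down to 1 x 1
for c_in, c_out, (ph, pw) in zip(channels, channels[1:], pools):
    x = maxpool(conv3x3_relu(x, 0.1 * rng.standard_normal((c_out, c_in, 3, 3))), ph, pw)

w_out = 0.1 * rng.standard_normal((50, 32))               # 1x1 conv == dense layer on a 1x1 map
tags = 1.0 / (1.0 + np.exp(-(w_out @ x.reshape(32))))     # 50 sigmoid tag probabilities
print(x.shape, tags.shape)                                # (32, 1, 1) (50,)
```

Because every layer is convolutional or pooling, going from FCN-3 to FCN-7 only means adding more (conv, pool) pairs with smaller pool sizes; there is no fully connected bottleneck whose size depends on the input length.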
  • 20. Experiments and discussions / Million Song Dataset

    Complex models take more time to train, but in the end they may outperform simpler ones.
  • 21. Conclusion

    2D fully convolutional networks work well for tagging.
    Mel-spectrograms can be preferred to STFT, at least until we have a dataset so HUGE that the mel-frequency aggregation can be learned instead.
    Bye bye, MFCC?
    In the near future, MIR can probably go deeper than it is now, given bigger, better, stronger datasets.
    Q. How do ConvNets actually deal with spectrograms?
    A. Stay tuned for this year's MLSP paper! [2]
  • 22. Q&A
  • 23. References

    [1] Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, USA (2016)
    [2] Choi, K., Fazekas, G., Sandler, M.: Explaining convolutional neural networks on music classification (submitted). In: IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Salerno, Italy. IEEE (2016)
    [3] Dieleman, S., Schrauwen, B.: Multiscale approaches to music audio feature learning. In: ISMIR. pp. 3-8 (2013)
    [4] Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. pp. 6964-6968. IEEE (2014)
    [5] Hamel, P., Lemieux, S., Bengio, Y., Eck, D.: Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In: ISMIR. pp. 729-734 (2011)
    [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
    [7] Nam, J., Herrera, J., Lee, K.: A deep bag-of-features model for music auto-tagging. arXiv preprint arXiv:1508.04999 (2015)
    [8] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
    [9] Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. Speech and Audio Processing, IEEE Transactions on 10(5), 293-302 (2002)
    [10] Van Den Oord, A., Dieleman, S., Schrauwen, B.: Transfer learning by supervised pre-training for audio-based music classification. In: Conference of the International Society for Music Information Retrieval (ISMIR 2014) (2014)