The effects of noisy labels on deep convolutional neural networks for music tagging

Keunwoo.Choi@qmul.ac.uk
arXiv:1706.02361

@KeunwooChoi
2014--present: PhD, Queen Mary University of London

2016--present: Buzzmusiq Inc.

2016/06--12: Visiting PhD, NYU

2015/06--09: Intern, Naver Labs

2011--2014: Audio research team, ETRI

2009--2011: Applied Acoustic Lab, EECS, SNU

2005--2009: EECS, SNU

Papers on ISMIR/ICASSP/IEEE Trans./Etc.

Python/Keras/PyTorch

Keunwoo.Choi@qmul.ac.uk
György Fazekas, Kyunghyun Cho, Mark Sandler
arXiv:1706.02361
1. INTRODUCTION

Tagging
• Anyone can tag any words (or non-words) to any song

• The quality is ****.

• Poor, innocent, (financially) poor researchers need to use it

Tagging
(Tag, count)
rock 101071
pop 69159
alternative 55777
indie 48175
electronic 46270
female vocalists 42565
favorites 39921
00s 31432
Awesome 26248
american 22694
seen live 20705
cool 19581
Favorite 18864
Favourites 17722
female vocalist 17328
guitar 17302
loved 12483
favorite songs 12392
heard on Pandora 10470
USA 8725
2000s 8671
Favourite Songs 8661
drjazzmrfunkmusic 8364
77davez-all-tracks 7278
fav 6155
bass 3364
songs I absolutely love 3293
vocals 2369
drums 2281

🤔
[Figure: proportion of True vs. False labels, 0% to 100%, for the tags female vocalists, male vocalist, guitar, bass, vocals, drums]

Questions
• How noisy?
• Is training alright?
• How about evaluation?
• What are they learning?

2. HOW NOISY? IS TRAINING OK?

Measuring the noise
• We need strongly-labelled re-annotations
• Instrumentation labels are (sort of) objective (instrumental, female vocal, male vocal, guitar)
• 242K songs are still a lot → select a subset (or two)!
I can do it! ...but not all of them.

Strongly labelling: Subset100
• Subset100: random 50 from ‘True’ + random 50 from ‘False’ (for each label)
Label              True       False
Instrumental       50 songs   50 songs
Female vocalists   50 songs   50 songs
Male vocalist      50 songs   50 songs
Guitar             50 songs   50 songs

Strongly labelling: Subset400
• Subset400: Just random 400 items
242K songs × 50 tags → Subset400: 400 songs × 4 tags
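The two subsets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `labels` mapping (song id → set of tags), the helper names `make_subset100` and `make_subset400`, and the toy data are all hypothetical, not taken from the talk or the paper.

```python
import random

# The four instrumentation labels that were re-annotated.
TAGS = ["instrumental", "female vocalists", "male vocalists", "guitar"]

def make_subset100(labels, tag, n_per_class=50, seed=0):
    """Stratified sample for one tag: n songs labelled True + n songs labelled False."""
    rng = random.Random(seed)
    positives = [song for song, tags in labels.items() if tag in tags]
    negatives = [song for song, tags in labels.items() if tag not in tags]
    return rng.sample(positives, n_per_class), rng.sample(negatives, n_per_class)

def make_subset400(labels, n=400, seed=0):
    """Plain random sample of n songs, ignoring the labels."""
    rng = random.Random(seed)
    return rng.sample(sorted(labels), n)

# Toy stand-in for the weakly-labelled dataset (hypothetical data).
toy_rng = random.Random(1)
toy_labels = {
    f"song{i}": {t for t in TAGS if toy_rng.random() < 0.2}
    for i in range(2000)
}

true_items, false_items = make_subset100(toy_labels, "guitar")
subset400 = make_subset400(toy_labels)
print(len(true_items), len(false_items), len(subset400))
```

Subset100 is balanced by construction, so it is convenient for measuring error rates within the True and False labels separately. Subset400 keeps the natural tag distribution.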

BEFORE 🎵🖊 ....................... 😭 AFTER

Evaluating groundtruth on Subset100
[Figure: two bar charts, axis 0 to 100, over the tags instrumental, female vocal, male vocal, guitar. Panel "+": error rate and precision. Panel "-": error rate and recall.]
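As a rough sketch of the quantities named in this figure, the following compares the weak groundtruth labels with the strong re-annotations for one tag. The function name `groundtruth_quality` and the example numbers are made up for illustration. The talk does not spell out the exact formulas, so the definitions here (error rate within the positive labels, error rate within the negative labels, precision, recall) are assumptions.

```python
def groundtruth_quality(gt, truth):
    """Compare weak groundtruth labels (gt) with strong re-annotations (truth)."""
    pairs = list(zip(gt, truth))
    tp = sum(1 for g, t in pairs if g == 1 and t == 1)
    fp = sum(1 for g, t in pairs if g == 1 and t == 0)
    fn = sum(1 for g, t in pairs if g == 0 and t == 1)
    tn = sum(1 for g, t in pairs if g == 0 and t == 0)
    return {
        "pos_error_rate": fp / (tp + fp),  # wrong among labels marked True
        "neg_error_rate": fn / (fn + tn),  # wrong among labels marked False
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical Subset100-style example: 50 items tagged True, 50 tagged False.
gt = [1] * 50 + [0] * 50
# Made-up re-annotation: 2 of the True labels and 10 of the False labels are wrong.
truth = [1] * 48 + [0] * 2 + [1] * 10 + [0] * 40

quality = groundtruth_quality(gt, truth)
print(quality)
```

Note that recall computed directly on a balanced sample is not the recall over the whole dataset, because the True and False labels were sampled at different rates. It would need to be reweighted by how often the tag occurs.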

#Occurrences estimation
[Figure: bar chart, axis 0 to 80, of tag occurrences for instrumental, female voc, male vocal, guitar, comparing three values: in all, by GT; my estimation using S100; my re-annotation on S400]
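The slide does not say how the Subset100 estimate was computed. One natural estimator, shown here purely as an assumption, combines the tag frequency in the groundtruth with the two error rates measured on Subset100 via the law of total probability. The function name and the numbers are hypothetical.

```python
def estimate_true_occurrence(p_gt, pos_error_rate, neg_error_rate):
    """Estimate how often a tag truly applies.

    p_gt:            fraction of songs carrying the tag in the groundtruth
    pos_error_rate:  fraction of True labels that are wrong (from Subset100)
    neg_error_rate:  fraction of False labels that are wrong (from Subset100)
    """
    correctly_tagged = p_gt * (1.0 - pos_error_rate)
    missed_by_taggers = (1.0 - p_gt) * neg_error_rate
    return correctly_tagged + missed_by_taggers

# Made-up numbers: 10% tagged, 4% of True labels wrong, 20% of False labels wrong.
estimate = estimate_true_occurrence(0.10, 0.04, 0.20)
print(round(estimate, 3))
```

With a rare tag, even a modest error rate among the False labels dominates the estimate, which is why the estimated occurrence can be far above the groundtruth count.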

Again, with box plots
{Instrumental, female vocalists} vs. {male vocalists, guitar}

Group A vs B, but why?
• Tagging ‘vocals’, ‘drums’, ‘bass’ is like..

→ They’re not tag-worthy

→ Let’s call it ‘taggability’
[Figure: proportion of True vs. False labels, 0% to 100%, for the tags female vocalists, male vocalist, guitar, bass, vocals, drums]
***?
What’s on the desk?

The hypothesis
If unusual → high taggability.
Instrumental, female vocal: high taggability
Male vocal, guitar: low taggability

The hypothesis
If high taggability
→ fewer false negatives = higher recall (of GT)
Instrumental, female vocal: high taggability, fewer false negatives, higher recall
Male vocal, guitar: low taggability, more false negatives, lower recall
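A toy simulation can make the link between taggability and groundtruth recall concrete. It assumes a deliberately simple model that is not from the talk: a song that truly has a property gets the tag with probability `taggability`, and there are no false positives. The function name and all parameters are hypothetical.

```python
import random

def simulate_gt_recall(taggability, n_songs=10000, true_rate=0.3, seed=0):
    """Recall of the weak groundtruth when true instances are tagged at random."""
    rng = random.Random(seed)
    tagged = missed = 0
    for _ in range(n_songs):
        has_property = rng.random() < true_rate
        if not has_property:
            continue  # assumed model: songs without the property are never tagged
        if rng.random() < taggability:
            tagged += 1   # true positive in the groundtruth
        else:
            missed += 1   # false negative: the property is there, nobody tagged it
    return tagged / (tagged + missed)

high = simulate_gt_recall(taggability=0.9)  # e.g. an unusual, tag-worthy property
low = simulate_gt_recall(taggability=0.2)   # e.g. a common, unremarkable property
print(high, low)
```

Under this model the recall of the groundtruth is simply the taggability, so tags that people rarely bother to apply end up with many false negatives.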

The hypothesis
If high taggability, i.e. higher recall (= more reliable GT),
→ ?

[33] Choi et al. 2017, Convolutional recu...
Hypothesis
If high taggability
→ ?
Performance (AUC)
!!!

The hypothesis
If high taggability
→ better classification
Instrumental, female vocal: high taggability, fewer false negatives, higher recall, better classification
Male vocal, guitar: low taggability, more false negatives, lower recall, worse classification

3. IS EVALUATION OK?

Really?
So, we evaluate the classifier based on..
🤔
I need a noise-free groundtruth...

Evaluate the evaluation
242K songs × 50 tags → Subset400: 400 songs × 4 tags
HAHAHAH! Subset400!

Evaluate the evaluation
Interesting! With such noise, the results are still okay.
It’s not perfect though.
HAHAHA!
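"Evaluating the evaluation" amounts to scoring the same predictions twice: once against the noisy groundtruth and once against the re-annotated labels on Subset400. The sketch below uses a plain pairwise definition of AUC so that it needs no external library. The scores and labels are made up, and the function name `roc_auc` is only illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive item outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count a win as 1 and a tie as 0.5 over all positive/negative pairs.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for six songs and one tag.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
clean = [1, 1, 1, 0, 0, 0]  # strong re-annotation
noisy = [1, 0, 1, 0, 0, 0]  # weak groundtruth with one false negative

auc_clean = roc_auc(clean, scores)
auc_noisy = roc_auc(noisy, scores)
print(auc_clean, auc_noisy)  # 1.0 0.875
```

In this toy case a single false negative in the groundtruth makes a perfect ranking look imperfect, which is the kind of distortion the comparison on Subset400 is meant to measure.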

4. LABEL VECTOR ANALYSIS

Label vector similarity
• Similarity between labels according to the trained convnet.

Label vector vs co-occurrence (GT)

• Mostly, LV reproduces the groundtruth.

• Except: pairs that are similar only by label vector: (sad, beautiful), (happy, catchy), (rnb, sexy)
‘Sad songs are beautiful.’
‘Catchy songs are often happy songs.’
‘R&B claims to be sexy.’
🤔 Makes sense..
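The comparison on these slides can be sketched as two pairwise measures: a similarity between label vectors and a co-occurrence rate in the groundtruth. The details below are assumptions rather than the talk's exact definitions: a label vector is taken to be the weight vector feeding one output unit of the trained network, similarity is taken to be cosine similarity, and co-occurrence is the fraction of songs tagged `a` that are also tagged `b`. All vectors and tag sets are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two label vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cooccurrence(tag_sets, a, b):
    """Fraction of songs tagged `a` that are also tagged `b` in the groundtruth."""
    with_a = [tags for tags in tag_sets if a in tags]
    return sum(1 for tags in with_a if b in tags) / len(with_a)

# Hypothetical label vectors (e.g. rows of the output layer's weight matrix).
label_vectors = {
    "sad": [0.9, 0.1, 0.3],
    "beautiful": [0.8, 0.2, 0.4],
    "happy": [-0.5, 0.9, 0.0],
}
# Hypothetical groundtruth tag sets for four songs.
songs = [{"sad", "beautiful"}, {"sad"}, {"happy"}, {"sad"}]

lv_sim = cosine(label_vectors["sad"], label_vectors["beautiful"])
gt_cooc = cooccurrence(songs, "sad", "beautiful")
print(lv_sim, gt_cooc)
```

A pair with high label vector similarity but low co-occurrence, as in this toy example, is the kind of case listed on the slide: the network treats the tags as related even though listeners rarely apply both.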

5. CONCLUSIONS

Conclusions
• We quantified how noisy weakly-labelled groundtruth is.

• We conjectured why some labels are noisier.

• We showed what happens to the noisier labels in training and evaluation.

• We investigated what a convnet learns.


Links
My blog | blog post 1, blog post 2 | Paper!
