Audio Chord Recognition Using Deep Neural Networks
Bohumír Zámečník @bzamecnik
(A Farewell) Data Science Seminar – 2016-05-25

Experience using CNN/LSTM networks on reassigned chromagrams to classify chords in Beatles songs.


1. Audio Chord Recognition Using Deep Neural Networks
Bohumír Zámečník @bzamecnik
(A Farewell) Data Science Seminar – 2016-05-25

2. Agenda
● what are chords & why recognize them?
● task formulation
● data set
● pre-processing
● model
● evaluation
● future work

3. The dream – Beatles: Penny Lane

4. What are chords?
"multiple tones being played at the same time"
~ pitch class sets over the group Z12 → 2^12 = 4096 possibilities

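As a quick illustration (not from the slides), a chord viewed as a pitch class set is just a subset of the 12 pitch classes, i.e. one of the 2^12 = 4096 binary vectors:

    # Illustrative sketch: a chord as a pitch class set = 12-dim binary vector.
    import numpy as np

    def pitch_class_set(pitch_classes):
        """E.g. a C major triad {C, E, G} = {0, 4, 7}."""
        vector = np.zeros(12, dtype=np.int32)
        vector[list(pitch_classes)] = 1
        return vector

    print(pitch_class_set({0, 4, 7}))  # [1 0 0 0 1 0 0 1 0 0 0 0]
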
5. Motivation – why recognize chords?
● provide rich high-level musical structure
  ○ → visualization
● difficult to pick out by ear
  ○ lyrics & melody – easy
  ○ chords – harder

6. Representation
● symbolic names
● pitch class sets (unique tones)
[1, 3, 5] [1, 4, 6] [2, 5, 7]

7. Task formulation – end-to-end task
● segmentation & classification
  ○ input data: sampled audio recording
  ○ output: time segments with symbolic chord labels

  start      end        chord
  0.440395   1.689818   B
  1.689818   2.209188   B/7
  2.209188   2.746326   B/6
  2.746326   3.280385   B/5
  3.280385   3.849274   E:maj6
  3.849274   4.406553   C#:min7
  4.406553   4.940612   F#:sus4

8. Task formulation – intermediate task
● multi-label classification of frames
  ○ input: chromagram
  ○ output: pitch class labels for each frame

  0 0 0 1 0 0 1 0 0 0 1 1
  0 0 0 1 0 0 1 0 1 0 0 1
  0 0 0 1 0 0 1 0 0 0 0 1
  0 1 0 0 1 0 0 0 1 0 0 1
  0 1 0 0 1 0 0 0 1 0 0 1
  0 1 0 0 0 0 1 0 0 0 0 1

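To connect the two formulations: per-frame pitch class vectors can be collapsed back into (start, end, chord) segments. A minimal sketch, assuming a fixed hop duration between frames (hypothetical value; the talk itself lists better frame → segment post-processing under future work):

    # Minimal sketch: merge consecutive frames with identical pitch class
    # vectors into time segments. hop_duration is a hypothetical value
    # (e.g. 2048 samples at 44.1 kHz ≈ 46 ms).
    import numpy as np

    def frames_to_segments(frame_labels, hop_duration=0.046):
        """frame_labels: (frames, 12) binary array -> list of (start, end, labels)."""
        segments = []
        start = 0
        for i in range(1, len(frame_labels) + 1):
            if i == len(frame_labels) or not np.array_equal(frame_labels[i], frame_labels[start]):
                segments.append((start * hop_duration, i * hop_duration,
                                 tuple(frame_labels[start])))
                start = i
        return segments
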
9. Data set – The Beatles: Reference Annotations (Isophonics)
● 180 songs
● ~ 8 hours
● human-annotated chord labels
● raw audio possible but hard to obtain – due to copyrights :(
  ○ torrent to help

10. Pre-processing
● hard part – cleaning the input data :)
● need to synchronize audio & features

11. Pre-processing – audio
● chromagram features
  ○ like a log-spectrogram
  ○ bins aligned to musical tones
  ○ linear translation
  ○ time-frequency reassignment
    ■ using phase to "focus" the content position

12. Pre-processing – audio
● stereo to mono (mean)
● cut to (overlapping) frames
● apply window (Hann)
● FFT – time domain to frequency domain → spectrogram
● reassignment – derivative of phase w.r.t. time & frequency
  ○ better position
● log scaling of frequency
● requantization
● dynamic range compression of values (log)

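A minimal sketch of the framing/FFT part of this pipeline (assumed frame and hop sizes; the reassignment, tone-aligned requantization, and chroma steps are omitted, so it yields a plain log-magnitude spectrogram rather than the reassigned chromagram used in the talk – see bzamecnik/music-processing-experiments for the real thing):

    import numpy as np
    from scipy.io import wavfile

    def log_spectrogram(path, frame_size=4096, hop_size=2048):
        sample_rate, samples = wavfile.read(path)
        if samples.ndim == 2:
            samples = samples.mean(axis=1)             # stereo to mono (mean)
        samples = samples / np.abs(samples).max()      # normalize
        window = np.hanning(frame_size)                # Hann window
        starts = range(0, len(samples) - frame_size, hop_size)
        frames = np.stack([samples[s:s + frame_size] * window for s in starts])
        magnitudes = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrogram
        return np.log1p(magnitudes)                        # dynamic range compression

    X = log_spectrogram('song.wav')   # hypothetical input file; (frames, frequency bins)
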
13. [Figure: linear spectrogram vs. log spectrogram vs. reassigned log spectrogram]

14. Preprocessing – labels
● symbolic labels to binary pitch class vectors
  ○ chord-labels parser
● sample to frames (to match the audio features)

  B        0 0 0 1 0 0 1 0 0 0 1 1
  B/7      0 0 0 1 0 0 1 0 1 0 0 1
  B/6      0 0 0 1 0 0 1 0 1 0 0 1
  B/5      0 0 0 1 0 0 1 0 0 0 0 1
  E:maj6   0 1 0 0 1 0 0 0 1 0 0 1
  C#:min7  0 1 0 0 1 0 0 0 1 0 0 1
  F#:sus4  0 1 0 0 0 0 1 0 0 0 0 1

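A minimal sketch of the label → pitch class vector conversion, using a hand-picked subset of chord qualities (the bzamecnik/chord-labels parser referenced above handles the full syntax, including slash bass notes and interval modifiers, which this sketch does not):

    import numpy as np

    ROOTS = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'Eb': 3, 'E': 4, 'F': 5,
             'F#': 6, 'G': 7, 'Ab': 8, 'A': 9, 'Bb': 10, 'B': 11}
    QUALITIES = {'': [0, 4, 7], 'maj': [0, 4, 7], 'min': [0, 3, 7],
                 'maj6': [0, 4, 7, 9], 'min7': [0, 3, 7, 10], 'sus4': [0, 5, 7]}

    def pitch_class_vector(label):
        """'C#:min7' -> binary vector of active pitch classes (C = index 0)."""
        if label == 'N':                         # "no chord" annotation
            return np.zeros(12, dtype=np.int32)
        root_name, _, quality = label.partition(':')
        vector = np.zeros(12, dtype=np.int32)
        for interval in QUALITIES[quality]:
            vector[(ROOTS[root_name] + interval) % 12] = 1
        return vector

    print(pitch_class_vector('C#:min7'))  # [0 1 0 0 1 0 0 0 1 0 0 1] – matches the table above
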
15. Preprocessing – tensor reshaping for the model
● (data points, features)
● cut the sequences to fixed length
  ○ e.g. 100 frames
  ○ → (sequence count, sequence length, features)
● reshape for convolution
  ○ → (sequence count, sequence length, features, channels)
● final shape: (3756, 100, 115, 1)

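The reshaping itself is plain numpy; a sketch with hypothetical array names (the frame and feature counts are the ones from the slides):

    import numpy as np

    chromagram = np.random.rand(375600, 115)   # stand-in for (data points, features)
    max_seq_size = 100                          # fixed sequence length in frames

    # drop the tail that does not fill a whole sequence, then split into sequences
    seq_count = len(chromagram) // max_seq_size
    X = chromagram[:seq_count * max_seq_size].reshape(seq_count, max_seq_size, 115)
    X = X[..., np.newaxis]                      # add a channel axis for the convolution
    print(X.shape)                              # (3756, 100, 115, 1)
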
16. Dataset size
● ~ 630k frames
● 115 features
● ~ 4 GB raw audio
● ~ 300 MB of features as a compressed numpy array
● splits
  ○ training 60 %, validation 20 %, test 20 %
  ○ split over whole songs to prevent leakage!

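A minimal sketch of the song-level split (hypothetical song identifiers); splitting whole songs rather than individual frames keeps frames from one song out of both training and evaluation sets:

    import random

    def split_songs(song_ids, train=0.6, valid=0.2, seed=42):
        ids = list(song_ids)
        random.Random(seed).shuffle(ids)
        n_train, n_valid = int(train * len(ids)), int(valid * len(ids))
        return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

    train_ids, valid_ids, test_ids = split_songs(range(180))   # 108 / 36 / 36 songs
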
17. Model – using deep neural networks
● the current architecture is inspired by what's used in the wild
● convolutions (+ pooling) at the beginning to extract local features
● recurrent layers to propagate context in time
● sigmoids at the end for multi-label classification
● dropout & batch normalization for regularization
● ADAM optimizer

18. Implemented in Python using Keras on top of Theano/TensorFlow
(6x convolutions, 2x recurrent, 1x classifier)

    from keras.models import Sequential
    from keras.layers import (Convolution1D, Dense, Dropout, Flatten, LSTM,
                              MaxPooling1D, TimeDistributed)
    from keras.layers.normalization import BatchNormalization

    # max_seq_size = 100, feature_count = 115 (see slide 15)
    model = Sequential()

    # 6x convolutions (+ pooling) – local feature extraction within each frame
    model.add(TimeDistributed(Convolution1D(32, 3, activation='relu'),
                              input_shape=(max_seq_size, feature_count, 1)))
    model.add(TimeDistributed(Convolution1D(32, 3, activation='relu')))
    model.add(TimeDistributed(MaxPooling1D(2, 2)))
    model.add(Dropout(0.25))
    model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
    model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
    model.add(TimeDistributed(MaxPooling1D(2, 2)))
    model.add(Dropout(0.25))
    model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
    model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
    model.add(TimeDistributed(MaxPooling1D(2, 2)))
    model.add(Dropout(0.25))
    model.add(TimeDistributed(Flatten()))
    model.add(BatchNormalization())

    # 2x recurrent layers – propagate context in time
    model.add(LSTM(64, return_sequences=True))
    model.add(LSTM(64, return_sequences=True))
    model.add(Dropout(0.25))

    # 1x classifier – per-frame multi-label sigmoid over the 12 pitch classes
    model.add(TimeDistributed(Dense(12, activation='sigmoid')))

    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid),
              nb_epoch=10, batch_size=32)

19. Training
● trained on an NVIDIA GTX 980 Ti GPU
● model: ~260k parameters
● batch size: 32
● 6 GB GPU RAM
● ~ 60 s per epoch
● a few epochs to overfit
● 46 °C :)

20. Evaluation
● classification metrics
  ○ accuracy
  ○ Hamming distance – for binary vectors
  ○ AUC
● segmentation metrics
  ○ WAOR (weighted average overlap ratio)

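A sketch of how the frame-level classification metrics can be computed with scikit-learn (hypothetical array names; it assumes accuracy means exact match of the whole 12-dim vector and hamming score means 1 − Hamming loss, which may differ from the exact definitions used in the talk):

    import numpy as np
    from sklearn.metrics import accuracy_score, hamming_loss, roc_auc_score

    Y_prob = model.predict(X_valid).reshape(-1, 12)     # per-frame probabilities
    Y_pred = (Y_prob >= 0.5).astype(np.int32)           # thresholded labels
    Y_true = Y_valid.reshape(-1, 12)

    accuracy = accuracy_score(Y_true, Y_pred)             # exact match of the 12-dim vector
    hamming_score = 1 - hamming_loss(Y_true, Y_pred)      # per-pitch-class agreement
    auc = roc_auc_score(Y_true.ravel(), Y_prob.ravel())   # pooled over all pitch classes
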
21. Evaluation (validation set)

               accuracy   hamming score   AUC
  CNN + dense  0.402      0.873           0.910
  CNN + LSTM   0.512      0.899           0.935

22. [Figure: predicted probability, predicted labels, true labels, probability error, label error]

23. [Figure: "And I Love Her" – predicted vs. ground-truth]

24. Future work
● prepare for MIREX 2016
● clean up the project
● write it all up on the blog
● make interactive demos / a production app
● examine new approaches
  ○ better frame → segment post-processing
  ○ 2D/nD convolutions – using locality in time/octaves
  ○ bi-directional RNNs
  ○ beat-aligned features
  ○ language models
  ○ unsupervised pre-training
  ○ segmental RNNs for direct segmentation

25. Open-source @ GitHub
● bzamecnik/audio-ml – latest ML models & experiments
● bzamecnik/music-processing-experiments – chromagram features
● bzamecnik/chord-labels – labels → pitch class vectors
● bzamecnik/harmoneye
  ○ real-time chromagram features visualization
  ○ chord timeline visualization (from the Penny Lane video)
● bzamecnik/harmoneye-android
● visualmusictheory.com – blog
● bzamecnik/ideas – more ideas :)

26. Thank you!
Audio Chord Recognition
