Audio Chord Recognition
Using Deep Neural Networks
Bohumír Zámečník @bzamecnik
(A Farewell) Data Science Seminar – 2016-05-25
Agenda
● what are chords & why recognize them?
● task formulation
● data set
● pre-processing
● model
● evaluation
● future work
The dream – Beatles: Penny Lane
"multiple tones
being played
at the same time"
~ pitch class sets
group Z12
212
= 4096 possibilities
What are chords?
Motivation – why recognize chords?
● provide rich high-level musical structure
○ → visualization
● difficult to pick out by ear
○ lyrics & melody – easy
○ chords – harder
Representation
● symbolic names
● pitch class sets (unique tones)
[1, 3, 5] [1, 4, 6] [2, 5, 7]
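A tiny Python sketch of the pitch class set idea (the concrete voicing and names are illustrative): any voicing of a chord reduces to its set of unique pitch classes in Z12 (here with C = 0), and there are 2¹² = 4096 such subsets.

# chords as pitch class sets – subsets of Z12 (C = 0, C# = 1, ..., B = 11)
c_major_voicing = [60, 64, 67, 72, 76]           # MIDI notes C4 E4 G4 C5 E5
pitch_class_set = sorted({note % 12 for note in c_major_voicing})
print(pitch_class_set)                           # [0, 4, 7] – unique tones C, E, G
print(2 ** 12)                                   # 4096 possible pitch class sets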
Task formulation – end-to-end task
● segmentation & classification
○ input data: sampled audio recording
○ output: time segments with symbolic chord labels
start end chord
0.440395 1.689818 B
1.689818 2.209188 B/7
2.209188 2.746326 B/6
2.746326 3.280385 B/5
3.280385 3.849274 E:maj6
3.849274 4.406553 C#:min7
4.406553 4.940612 F#:sus4
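The talk leaves better frame → segment post-processing for future work; as a baseline, segments like the ones above can be recovered by merging runs of consecutive frames that share a predicted label. A naive sketch (function name and frame duration are illustrative, not the talk's method):

def frames_to_segments(frame_labels, frame_duration):
    """Merge runs of equal framewise labels into (start, end, label) segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[i - 1]:
            segments.append((start * frame_duration, i * frame_duration,
                             frame_labels[i - 1]))
            start = i
    return segments

print(frames_to_segments(['B', 'B', 'B/7', 'B/7', 'B/7'], 0.5))
# [(0.0, 1.0, 'B'), (1.0, 2.5, 'B/7')]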
Task formulation – intermediate task
● multi-label classification of frames
○ input: chromagram
○ output: pitch class labels for each frame
0 0 0 1 0 0 1 0 0 0 1 1
0 0 0 1 0 0 1 0 1 0 0 1
0 0 0 1 0 0 1 0 0 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 0 0 1 0 0 0 0 1
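The per-frame targets can be obtained by sampling the time-segment annotations at the frame times. A minimal sketch, assuming the segments already carry their 12-dim pitch class vectors (names are illustrative):

import numpy as np

def segments_to_frames(segments, frame_times):
    """segments: (start, end, pitch_class_vector) tuples;
    frame_times: centers of the chromagram frames in seconds."""
    targets = np.zeros((len(frame_times), 12), dtype=int)
    for start, end, vector in segments:
        mask = (frame_times >= start) & (frame_times < end)
        targets[mask] = vector
    return targets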
Data set – The Beatles: Reference Annotations (Isophonics)
● 180 songs
● ~8 hours
● human-annotated chord labels
● raw audio possible but hard to obtain – due to copyrights :(
○ torrent to help
Pre-processing
● hard part – cleaning the input data :)
● need to synchronize audio & features
● chromagram features
○ like log-spectrogram
○ bins aligned to musical tones
○ linear translation
○ time-frequency reassignment
■ using phase to "focus" the content position
Pre-processing – audio
● stereo to mono (mean)
● cut to (overlapping) frames
● apply window (Hann)
● FFT – time-domain to frequency-domain → spectrogram
● reassignment – derivative of phase w.r.t. time & frequency
○ better localization in time & frequency
● log scaling of frequency
● requantization
● dynamic range compression of values (log)
(figures: linear spectrogram, log-frequency spectrogram, reassigned log-frequency spectrogram)
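A minimal numpy sketch of the framing, windowing, FFT, log-frequency requantization and compression steps above; time-frequency reassignment is omitted and all parameter values are illustrative, not the ones used in the talk.

import numpy as np

def frame_signal(x, frame_size=4096, hop_size=2048):
    """Cut a mono signal into overlapping frames and apply a Hann window."""
    window = np.hanning(frame_size)
    starts = range(0, len(x) - frame_size + 1, hop_size)
    return np.stack([x[s:s + frame_size] * window for s in starts])

def log_freq_spectrogram(x, fs, frame_size=4096, hop_size=2048, fmin=27.5, n_bins=115):
    frames = frame_signal(x, frame_size, hop_size)
    mag = np.abs(np.fft.rfft(frames, axis=1))              # magnitude spectrogram
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / fs)
    valid = freqs >= fmin
    # requantize linear FFT bins to semitone-spaced bins (12 per octave)
    semitone = np.round(12 * np.log2(freqs[valid] / fmin)).astype(int)
    out = np.zeros((frames.shape[0], n_bins))
    for b in range(n_bins):
        mask = semitone == b
        if mask.any():
            out[:, b] = mag[:, valid][:, mask].mean(axis=1)
    return np.log1p(out)                                   # dynamic range compression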
Pre-processing – labels
● symbolic labels to binary pitch class vectors
○ chord-labels parser
● sample to frames (to match the audio features)
B 0 0 0 1 0 0 1 0 0 0 1 1
B/7 0 0 0 1 0 0 1 0 1 0 0 1
B/6 0 0 0 1 0 0 1 0 1 0 0 1
B/5 0 0 0 1 0 0 1 0 0 0 0 1
E:maj6 0 1 0 0 1 0 0 0 1 0 0 1
C#:min7 0 1 0 0 1 0 0 0 1 0 0 1
F#:sus4 0 1 0 0 0 0 1 0 0 0 0 1
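A toy sketch of the label conversion for a few chord qualities; the actual parsing (including bass/slash notes and many more qualities) is done by the bzamecnik/chord-labels project, so the quality table below is only illustrative.

import numpy as np

PITCH_CLASSES = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4, 'F': 5,
                 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9, 'A#': 10, 'Bb': 10, 'B': 11}
QUALITIES = {'maj': [0, 4, 7], 'min': [0, 3, 7], 'maj6': [0, 4, 7, 9],
             'min7': [0, 3, 7, 10], 'sus4': [0, 5, 7]}    # semitones above the root

def chord_to_vector(label):
    """'C#:min7' -> binary pitch class vector (C = index 0)."""
    root_name, _, quality = label.partition(':')
    vector = np.zeros(12, dtype=int)
    for interval in QUALITIES[quality or 'maj']:
        vector[(PITCH_CLASSES[root_name] + interval) % 12] = 1
    return vector

print(chord_to_vector('C#:min7'))   # [0 1 0 0 1 0 0 0 1 0 0 1]
print(chord_to_vector('F#:sus4'))   # [0 1 0 0 0 0 1 0 0 0 0 1]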
Pre-processing – tensor reshaping for the model
● (data points, features)
● cut the sequences to fixed length
○ e.g. 100 frames
○ → (sequence count, sequence length, features)
● reshape for convolution
○ → (sequence count, sequence length, features, channels)
● final shape: (3756, 100, 115, 1)
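A minimal numpy sketch of the reshaping; the feature matrix is assumed, the sequence length of 100 is the one from the slide.

import numpy as np

def to_model_tensor(features, seq_length=100):
    """(frame count, features) -> (sequence count, sequence length, features, 1)."""
    frame_count, feature_count = features.shape
    seq_count = frame_count // seq_length
    x = features[:seq_count * seq_length]                  # drop the incomplete tail
    x = x.reshape(seq_count, seq_length, feature_count)    # cut into fixed-length sequences
    return x[..., np.newaxis]                              # add a channel axis for the convolutions

print(to_model_tensor(np.random.rand(1000, 115)).shape)    # (10, 100, 115, 1)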
Dataset size
● ~630k frames
● 115 features
● ~ 4 GB raw audio
● ~ 300 MB features compressed numpy array
● splits
○ training 60 %, validation 20 %, test 20 %
○ split over whole songs to prevent leakage!
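A toy sketch of a song-level split (the 60/20/20 proportions are from the slide, everything else is illustrative) so that no song contributes frames to more than one split:

import random

def split_songs(song_ids, seed=42):
    """Shuffle song identifiers and split them 60/20/20 into train/valid/test."""
    songs = list(song_ids)
    random.Random(seed).shuffle(songs)
    n = len(songs)
    return songs[:int(0.6 * n)], songs[int(0.6 * n):int(0.8 * n)], songs[int(0.8 * n):]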
Model – using deep neural networks
● the current architecture is inspired by what's used in the wild
● convolutions (+ pooling) at the beginning to extract local features
● recurrent layers to propagate context in time
● sigmoids at the end for multi-label classification
● dropout & batch normalization for regularization
● ADAM optimizer
# Keras 1.x model: convolutional front end over each frame's feature vector,
# recurrent layers over time, per-frame sigmoid classifier for 12 pitch classes
from keras.layers import (BatchNormalization, Convolution1D, Dense, Dropout,
                          Flatten, LSTM, MaxPooling1D, TimeDistributed)
from keras.models import Sequential

model = Sequential()
# 6x convolution (+ pooling) – local feature extraction within each frame
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu'),
                          input_shape=(max_seq_size, feature_count, 1)))
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Flatten()))
model.add(BatchNormalization())
# 2x recurrent – propagate context along the time axis
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.25))
# 1x classifier – independent sigmoid per pitch class (multi-label)
model.add(TimeDistributed(Dense(12, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid),
          nb_epoch=10, batch_size=32)
implemented in Python using Keras on top of Theano/TensorFlow
6x convolutional layers, 2x recurrent layers, 1x classifier layer
Training
● trained on GPU NVIDIA GTX 980Ti
● model ~260k parameters
● batch size: 32
● 6 GB GPU RAM
● ~ 60 s per epoch
● a few epochs to overfit
● 46 °C :)
Evaluation
● classification metrics
○ accuracy
○ hamming distance – for binary vectors
○ AUC
● segmentation metrics
○ WAOR (weighted average overlap ratio)
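A minimal sketch of the framewise classification metrics with scikit-learn; it assumes `y_true` and `y_pred_proba` are (frame count, 12) arrays and that "hamming score" means 1 - Hamming loss (an assumption, not stated in the talk). The WAOR segmentation metric is not covered here.

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, roc_auc_score

def framewise_metrics(y_true, y_pred_proba, threshold=0.5):
    y_pred = (y_pred_proba >= threshold).astype(int)
    return {
        'accuracy': accuracy_score(y_true, y_pred),             # exact match of all 12 bits per frame
        'hamming score': 1 - hamming_loss(y_true, y_pred),      # fraction of correct pitch-class bits
        'AUC': roc_auc_score(y_true, y_pred_proba, average='macro'),  # averaged over pitch classes
    }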
Evaluation (validation set)
model         accuracy   hamming score   AUC
CNN + dense   0.402      0.873           0.910
CNN + LSTM    0.512      0.899           0.935
(figure: predicted probabilities, predicted labels, true labels, probability error, label error)
"And I Love Her"
predicted
ground-truth
Future work
● prepare for MIREX 2016
● clean up the project
● write everything up on the blog
● make interactive demos / production app
● examine new approaches
○ better frame -> segment post-processing
○ 2D/nD convolutions – using locality in time/octaves
○ bi-directional RNN
○ beat-aligned features
○ language models
○ unsupervised pre-training
○ segmental RNN for direct segmentation
Open-source @ GitHub
● bzamecnik/audio-ml – latest ML models & experiments
● bzamecnik/music-processing-experiments – chromagram features
● bzamecnik/chord-labels – labels -> pitch class vectors
● bzamecnik/harmoneye
○ real-time chromagram features visualization
○ chord timeline visualization (from Penny Lane video)
● bzamecnik/harmoneye-android
● visualmusictheory.com - blog
● bzamecnik/ideas – more ideas :)
Thank you!