Audio Chord Recognition
Using Deep Neural Networks
Bohumír Zámečník @bzamecnik
(A Farewell) Data Science Seminar – 2016-05-25
Agenda
● what are chords & why recognize them?
● task formulation
● data set
● pre-processing
● model
● evaluation
● future work
The dream – Beatles: Penny Lane
"multiple tones
being played
at the same time"
~ pitch class sets
group Z12
212
= 4096 possibilities
What are chords?
Motivation – why recognize chords?
● provide rich high-level musical structure
○ → visualization
● difficult to pick out by ear
○ lyrics & melody – easy
○ chords – harder
Representation
● symbolic names
● pitch class sets (unique tones)
[1, 3, 5] [1, 4, 6] [2, 5, 7]
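A tiny Python sketch of the pitch class set idea (the concrete voicing and names are illustrative): any voicing of a chord reduces to its set of unique pitch classes in Z12 (here with C = 0), and there are 2¹² = 4096 such subsets.

# chords as pitch class sets – subsets of Z12 (C = 0, C# = 1, ..., B = 11)
c_major_voicing = [60, 64, 67, 72, 76]           # MIDI notes C4 E4 G4 C5 E5
pitch_class_set = sorted({note % 12 for note in c_major_voicing})
print(pitch_class_set)                           # [0, 4, 7] – unique tones C, E, G
print(2 ** 12)                                   # 4096 possible pitch class sets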
Task formulation – end-to-end task
● segmentation & classification
○ input data: sampled audio recording
○ output: time segments with symbolic chord labels
start end chord
0.440395 1.689818 B
1.689818 2.209188 B/7
2.209188 2.746326 B/6
2.746326 3.280385 B/5
3.280385 3.849274 E:maj6
3.849274 4.406553 C#:min7
4.406553 4.940612 F#:sus4
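The talk leaves better frame → segment post-processing for future work; as a baseline, segments like the ones above can be recovered by merging runs of consecutive frames that share a predicted label. A naive sketch (function name and frame duration are illustrative, not the talk's method):

def frames_to_segments(frame_labels, frame_duration):
    """Merge runs of equal framewise labels into (start, end, label) segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[i - 1]:
            segments.append((start * frame_duration, i * frame_duration,
                             frame_labels[i - 1]))
            start = i
    return segments

print(frames_to_segments(['B', 'B', 'B/7', 'B/7', 'B/7'], 0.5))
# [(0.0, 1.0, 'B'), (1.0, 2.5, 'B/7')]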
Task formulation – intermediate task
● multi-label classification of frames
○ input: chromagram
○ output: pitch class labels for each frame
0 0 0 1 0 0 1 0 0 0 1 1
0 0 0 1 0 0 1 0 1 0 0 1
0 0 0 1 0 0 1 0 0 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 0 0 1 0 0 0 0 1
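The per-frame targets can be obtained by sampling the time-segment annotations at the frame times. A minimal sketch, assuming the segments already carry their 12-dim pitch class vectors (names are illustrative):

import numpy as np

def segments_to_frames(segments, frame_times):
    """segments: (start, end, pitch_class_vector) tuples;
    frame_times: centers of the chromagram frames in seconds."""
    targets = np.zeros((len(frame_times), 12), dtype=int)
    for start, end, vector in segments:
        mask = (frame_times >= start) & (frame_times < end)
        targets[mask] = vector
    return targets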
Data set – The Beatles: Reference Annotations (Isophonics)
● 180 songs
● ~8 hours
● human-annotated chord labels
● raw audio possible but hard to obtain – due to copyrights :(
○ torrent to help
Pre-processing
● hard part – cleaning the input data :)
● need to synchronize audio & features
● chromagram features
○ like log-spectrogram
○ bins aligned to musical tones
○ linear translation
○ time-frequency reassignment
■ using phase to "focus" the content position
Pre-processing – audio
● stereo to mono (mean)
● cut to (overlapping) frames
● apply window (Hann)
● FFT – time-domain to frequency-domain → spectrogram
● reassignment – derivative of phase w.r.t. time & frequency
○ better localization in time & frequency
● log scaling of frequency
● requantization
● dynamic range compression of values (log)
(figures: linear spectrogram, log-frequency spectrogram, reassigned log-frequency spectrogram)
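A minimal numpy sketch of the framing, windowing, FFT, log-frequency requantization and compression steps above; time-frequency reassignment is omitted and all parameter values are illustrative, not the ones used in the talk.

import numpy as np

def frame_signal(x, frame_size=4096, hop_size=2048):
    """Cut a mono signal into overlapping frames and apply a Hann window."""
    window = np.hanning(frame_size)
    starts = range(0, len(x) - frame_size + 1, hop_size)
    return np.stack([x[s:s + frame_size] * window for s in starts])

def log_freq_spectrogram(x, fs, frame_size=4096, hop_size=2048, fmin=27.5, n_bins=115):
    frames = frame_signal(x, frame_size, hop_size)
    mag = np.abs(np.fft.rfft(frames, axis=1))              # magnitude spectrogram
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / fs)
    valid = freqs >= fmin
    # requantize linear FFT bins to semitone-spaced bins (12 per octave)
    semitone = np.round(12 * np.log2(freqs[valid] / fmin)).astype(int)
    out = np.zeros((frames.shape[0], n_bins))
    for b in range(n_bins):
        mask = semitone == b
        if mask.any():
            out[:, b] = mag[:, valid][:, mask].mean(axis=1)
    return np.log1p(out)                                   # dynamic range compression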
Pre-processing – labels
● symbolic labels to binary pitch class vectors
○ chord-labels parser
● sample to frames (to match the audio features)
B 0 0 0 1 0 0 1 0 0 0 1 1
B/7 0 0 0 1 0 0 1 0 1 0 0 1
B/6 0 0 0 1 0 0 1 0 1 0 0 1
B/5 0 0 0 1 0 0 1 0 0 0 0 1
E:maj6 0 1 0 0 1 0 0 0 1 0 0 1
C#:min7 0 1 0 0 1 0 0 0 1 0 0 1
F#:sus4 0 1 0 0 0 0 1 0 0 0 0 1
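A toy sketch of the label conversion for a few chord qualities; the actual parsing (including bass/slash notes and many more qualities) is done by the bzamecnik/chord-labels project, so the quality table below is only illustrative.

import numpy as np

PITCH_CLASSES = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4, 'F': 5,
                 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9, 'A#': 10, 'Bb': 10, 'B': 11}
QUALITIES = {'maj': [0, 4, 7], 'min': [0, 3, 7], 'maj6': [0, 4, 7, 9],
             'min7': [0, 3, 7, 10], 'sus4': [0, 5, 7]}    # semitones above the root

def chord_to_vector(label):
    """'C#:min7' -> binary pitch class vector (C = index 0)."""
    root_name, _, quality = label.partition(':')
    vector = np.zeros(12, dtype=int)
    for interval in QUALITIES[quality or 'maj']:
        vector[(PITCH_CLASSES[root_name] + interval) % 12] = 1
    return vector

print(chord_to_vector('C#:min7'))   # [0 1 0 0 1 0 0 0 1 0 0 1]
print(chord_to_vector('F#:sus4'))   # [0 1 0 0 0 0 1 0 0 0 0 1]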
Pre-processing – tensor reshaping for the model
● (data points, features)
● cut the sequences to fixed length
○ e.g. 100 frames
○ → (sequence count, sequence length, features)
● reshape for convolution
○ → (sequence count, sequence length, features, channels)
● final shape: (3756, 100, 115, 1)
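A minimal numpy sketch of the reshaping; the feature matrix is assumed, the sequence length of 100 is the one from the slide.

import numpy as np

def to_model_tensor(features, seq_length=100):
    """(frame count, features) -> (sequence count, sequence length, features, 1)."""
    frame_count, feature_count = features.shape
    seq_count = frame_count // seq_length
    x = features[:seq_count * seq_length]                  # drop the incomplete tail
    x = x.reshape(seq_count, seq_length, feature_count)    # cut into fixed-length sequences
    return x[..., np.newaxis]                              # add a channel axis for the convolutions

print(to_model_tensor(np.random.rand(1000, 115)).shape)    # (10, 100, 115, 1)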
Dataset size
● ~630k frames
● 115 features
● ~ 4 GB raw audio
● ~ 300 MB features compressed numpy array
● splits
○ training 60 %, validation 20 %, test 20 %
○ split over whole songs to prevent leakage!
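A toy sketch of a song-level split (the 60/20/20 proportions are from the slide, everything else is illustrative) so that no song contributes frames to more than one split:

import random

def split_songs(song_ids, seed=42):
    """Shuffle song identifiers and split them 60/20/20 into train/valid/test."""
    songs = list(song_ids)
    random.Random(seed).shuffle(songs)
    n = len(songs)
    return songs[:int(0.6 * n)], songs[int(0.6 * n):int(0.8 * n)], songs[int(0.8 * n):]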
Model – using deep neural networks
● the current architecture is inspired by what's used in the wild
● convolutions (+ pooling) at the beginning to extract local features
● recurrent layers to propagate context in time
● sigmoids at the end for multi-label classification
● dropout & batch normalization for regularization
● ADAM optimizer
# Keras 1.x model: convolutional front end over each frame's feature vector,
# recurrent layers over time, per-frame sigmoid classifier for 12 pitch classes
from keras.layers import (BatchNormalization, Convolution1D, Dense, Dropout,
                          Flatten, LSTM, MaxPooling1D, TimeDistributed)
from keras.models import Sequential

model = Sequential()
# 6x convolution (+ pooling) – local feature extraction within each frame
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu'),
                          input_shape=(max_seq_size, feature_count, 1)))
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(Convolution1D(64, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
model.add(TimeDistributed(Flatten()))
model.add(BatchNormalization())
# 2x recurrent – propagate context along the time axis
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.25))
# 1x classifier – independent sigmoid per pitch class (multi-label)
model.add(TimeDistributed(Dense(12, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid),
          nb_epoch=10, batch_size=32)
implemented in Python using Keras on top of Theano/TensorFlow
6x convolutional layers, 2x recurrent layers, 1x classifier layer
Training
● trained on GPU NVIDIA GTX 980Ti
● model ~260k parameters
● batch size: 32
● 6 GB GPU RAM
● ~ 60 s per epoch
● a few epochs to overfit
● 46 °C :)
Evaluation
● classification metrics
○ accuracy
○ hamming distance – for binary vectors
○ AUC
● segmentation metrics
○ WAOR (weighted average overlap ratio)
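A minimal sketch of the framewise classification metrics with scikit-learn; it assumes `y_true` and `y_pred_proba` are (frame count, 12) arrays and that "hamming score" means 1 - Hamming loss (an assumption, not stated in the talk). The WAOR segmentation metric is not covered here.

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, roc_auc_score

def framewise_metrics(y_true, y_pred_proba, threshold=0.5):
    y_pred = (y_pred_proba >= threshold).astype(int)
    return {
        'accuracy': accuracy_score(y_true, y_pred),             # exact match of all 12 bits per frame
        'hamming score': 1 - hamming_loss(y_true, y_pred),      # fraction of correct pitch-class bits
        'AUC': roc_auc_score(y_true, y_pred_proba, average='macro'),  # averaged over pitch classes
    }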
Evaluation (validation set)
model         accuracy   hamming score   AUC
CNN + dense   0.402      0.873           0.910
CNN + LSTM    0.512      0.899           0.935
(figure: predicted probabilities, predicted labels, true labels, probability error, label error)
"And I Love Her"
predicted
ground-truth
Future work
● prepare for MIREX 2016
● clean up the project
● write everything up on the blog
● make interactive demos / production app
● examine new approaches
○ better frame -> segment post-processing
○ 2D/nD convolutions – using locality in time/octaves
○ bi-directional RNN
○ beat-aligned features
○ language models
○ unsupervised pre-training
○ segmental RNN for direct segmentation
Open-source @ GitHub
● bzamecnik/audio-ml – latest ML models & experiments
● bzamecnik/music-processing-experiments – chromagram features
● bzamecnik/chord-labels – labels -> pitch class vectors
● bzamecnik/harmoneye
○ real-time chromagram features visualization
○ chord timeline visualization (from Penny Lane video)
● bzamecnik/harmoneye-android
● visualmusictheory.com - blog
● bzamecnik/ideas – more ideas :)
Thank you!