TEACHING COMPUTERS TO LISTEN TO MUSIC
Eric Battenberg http://ericbattenberg.com
Gracenote, Inc.
Previously:
UC Berkeley, Dept. of EECS
Parallel Computing Laboratory
CNMAT (Center for New Music and Audio Technologies)
MACHINE LISTENING
 Speech processing
 Speech processing makes up the vast majority of funded
machine listening research.
 Just as there's more to Computer Vision than OCR, there's more
to machine listening than speech recognition!
 Audio content analysis
 Audio fingerprinting (e.g. Shazam, Gracenote)
 Audio classification (music, speech, noise, laughter, cheering)
 Audio event detection (new song, channel change, hotword)
 Content-based Music Information Retrieval (MIR)
 Today's topic
GETTING COMPUTERS TO
“LISTEN” TO MUSIC
 Not trying to get computers to “listen” for enjoyment.
 More accurate: Analyzing music with computers.
 What kind of information to get out of the analysis?
 What instruments are playing?
 What is the mood?
 How fast/slow is it (tempo)?
 What does the singer sound like?
 How can I play this song on the guitar/drums?
(Images: Ben Harper, James Brown)
CONTENT-BASED MUSIC INFORMATION
RETRIEVAL
 Many tasks:
 Genre, mood classification, auto-tagging
 Beat tracking, tempo detection
 Music similarity, playlisting, recommendation
 Heavily aided by collaborative filtering
 Automatic music transcription
 Source separation (voice, drums, bass…)
 Music segmentation (verse, chorus)
TALK SUMMARY
 Introduction to Music Information Retrieval
 Some common techniques
 Exciting new research directions
 Live Drum Understanding
 Drum detection/transcription
 Drum pattern analysis
QUICK LESSON: THE SPECTROGRAM
 The spectrogram: Very common feature used in audio analysis.
 Time-frequency representation of audio.
 Take FFT of adjacent frames of audio samples, put them in a
matrix.
 Each column shows frequency content at a particular time.
(Figure: spectrogram, with time on the horizontal axis and frequency on the vertical axis)
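A minimal sketch of this computation (the window and hop sizes here are assumptions, not the talk's settings):

```python
import numpy as np

def spectrogram(x, frame_len=1024, hop=512):
    """Magnitude spectrogram: FFT of adjacent windowed frames, one per column."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    S = np.empty((frame_len // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        S[:, t] = np.abs(np.fft.rfft(frame))  # column t = frequency content at time t
    return S
```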
GENRE/MOOD CLASSIFICATION:
TYPICAL APPROACHES
 Typical approach:
 Extract a bunch of hand-designed features describing small
windows of the signal (e.g., spectral centroid, kurtosis, harmonicity,
percussiveness, MFCCs, 100s more…).
 Train a GMM or SVM to predict genre/mood/tags by either:
 Summarizing a song using mean/variance of each feature
 Log-likelihood sum across frames (GMM) or frame-wise voting (SVM)
 Pros:
 Works fairly well, was state of the art for a while.
 Well understood models, implementations widely available.
 Cons:
 Bag-of-frames style approach lacks ability to describe rhythm and
temporal dynamics.
 Getting further improvements requires hand-designing more features.
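As an illustration, a hedged sketch of the log-likelihood-sum variant using scikit-learn (the feature extraction and data layout are assumed, not specified in the talk):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_genre_gmms(features_by_genre, n_components=8):
    """Fit one GMM per genre; each X is an array of per-frame feature vectors."""
    return {genre: GaussianMixture(n_components=n_components).fit(X)
            for genre, X in features_by_genre.items()}

def classify_song(frames, gmms):
    """Sum frame-wise log-likelihoods under each genre's GMM; pick the best."""
    scores = {genre: gmm.score_samples(frames).sum() for genre, gmm in gmms.items()}
    return max(scores, key=scores.get)
```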
LEARNING FEATURES:
NEURAL NETWORKS
 Each layer computes a non-linear transformation of
the previous layer.
 Linear transformation (weight matrices)
 + non-linearity (e.g. sigmoid [σ])
 h = σ(W1v), o = σ(W2h)
 Train to minimize output error.
 Each hidden layer can be thought of as a set of
features.
 Train using backpropagation.
 Iterative steps:
 Compute activations
 Compute output error
 Backprop. error signal
 Compute gradients
 Update all weights.
 Resurgence of neural networks:
 More compute
 More data
 A few new tricks…
(Figure: feedforward pass and backpropagated error signal through the network; sigmoid non-linearity ranging from 0.0 to 1.0)
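A toy sketch of the forward pass and one backpropagation step for the two-layer network above (a squared-error loss is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(v, W1, W2):
    h = sigmoid(W1 @ v)  # hidden layer: a set of learned features
    o = sigmoid(W2 @ h)  # output layer
    return h, o

def backprop_step(v, target, W1, W2, lr=0.1):
    """Compute activations, output error, backprop, gradients, weight update."""
    h, o = forward(v, W1, W2)
    d_o = (o - target) * o * (1 - o)   # output error through the sigmoid
    d_h = (W2.T @ d_o) * h * (1 - h)   # backpropagated error signal
    W2 -= lr * np.outer(d_o, h)        # gradient updates
    W1 -= lr * np.outer(d_h, v)
    return W1, W2
```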
DEEP NEURAL NETWORKS
 Deep Neural Networks
 Millions to billions of parameters
 Many layers of “features”
 Achieving state of the art
performance in vision and speech
tasks.
 Problem: Vanishing error signal.
 Weights of lower layers do not
change much.
 Solutions:
 Train for a really long time. 
 Pre-train each hidden layer as an
autoencoder. [Hinton, 2006]
 Rectified Linear Units
[Krizhevsky, 2012]
(Figure: feedforward and backpropagation with sigmoid vs. rectifier non-linearities)
AUTOENCODERS AND
UNSUPERVISED FEATURE LEARNING
 Many ways to learn features in an
unsupervised way:
 Autoencoders – train a network to
reconstruct the input
 Restricted Boltzmann Machine (RBM)
[Hinton, 2006]
 Denoising Autoencoders [Vincent, 2008]
 Sparse Autoencoders
 Clustering – K-Means, mixture
models, etc.
 Sparse Coding – learn overcomplete
dictionary of features with sparsity
constraint
(Figure: autoencoder trained to reconstruct its input — data → hiddens (features) → reconstructed data)
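A minimal sketch of the idea: the same backprop machinery, with the input serving as its own target (the architecture and loss here are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_step(x, W_enc, W_dec, lr=0.01):
    """One gradient step minimizing squared reconstruction error ||x_hat - x||^2."""
    h = sigmoid(W_enc @ x)      # hiddens (features)
    x_hat = sigmoid(W_dec @ h)  # reconstructed data
    d_out = (x_hat - x) * x_hat * (1 - x_hat)
    d_hid = (W_dec.T @ d_out) * h * (1 - h)
    W_dec -= lr * np.outer(d_out, h)
    W_enc -= lr * np.outer(d_hid, x)
    return W_enc, W_dec
```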
GENRE/MOOD CLASSIFICATION:
NEWER APPROACHES
 Newer approaches to feature extraction:
 Learn spectral features using Restricted Boltzmann Machines
(RBMs) and Deep Neural Networks (DNN) [Hamel, 2010] – good
genre performance.
 Learn sparse features using Predictive Sparse Decomposition (PSD)
[Henaff, 2011] – good genre performance
 Learn beat-synchronous rhythm and timbre features with RBMs and
DNNs [Schmidt, 2013] – improved mood performance
 Tune multi-layer wavelet features called Deep Scattering Spectrum
[Anden, 2013]* - state-of-the-art genre performance
 Pros:
 Hand-designing individual features is not required.
 Computers can learn complex high-order features that humans
cannot hand code.
 Further work:
 More work on incorporating context, rhythm, and temporal dynamics
into feature learning
* Current state-of-the-art on GTZAN genre dataset
(Figure: some learned frequency features, from [Henaff, 2011])
ONSET DETECTION
 Onset detection is important for music
transcription, beat tracking, tempo detection, and
rhythm summarization.
 Describing an onset:
 Transient
 Short interval when music signal evolves unpredictably.
 Attack
 Amplitude envelope increasing.
 Decay
 Amplitude envelope decreasing.
 Onset
 Point in time chosen to represent beginning of transient.
 Onset detection can be hard for certain instruments
with ambiguous attacks or when a note changes
without a new attack (legato).
(Figure: audio signal and onset envelope, from [Bello, 2005])
ONSET DETECTION: TYPICAL APPROACHES
 Computing an Onset Detection Function (ODF):
 Derivative of energy envelope
 Derivative of band-wise log-energies [Klapuri, 2006]
 Complex spectral domain (difference with predictions of phase and
magnitude) [Bello, 2004]
 Choosing onsets using the ODF:
 Peak pick local maxima above a dynamic threshold [Bello, 2005].
 Pros:
 Simple to implement.
 Works fairly well (60-80% accurate)
 Cons
 Lots of hand tuning of thresholds
 No machine learning
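A generic sketch of this recipe — a rectified derivative-of-log-energy ODF plus a sliding-median threshold (the parameters are illustrative, not the cited papers' values):

```python
import numpy as np

def odf(S):
    """Half-wave rectified frame-to-frame increase in band-wise log-energy."""
    log_e = np.log(S + 1e-10)
    diff = np.diff(log_e, axis=1)
    return np.maximum(diff, 0.0).sum(axis=0)

def pick_onsets(d, w=8, bias=1.5):
    """Local maxima of the ODF above a dynamic (sliding-median) threshold."""
    onsets = []
    for t in range(w, len(d) - w):
        local = d[t - w : t + w + 1]
        if d[t] == local.max() and d[t] > bias * np.median(local):
            onsets.append(t)
    return onsets
```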
RECURRENT NEURAL NETWORKS
 Non-linear sequence model
 Hidden units have connections to previous time step
 Unlike HMMs, can model long-term dependencies using
distributed hidden state.
 Recent developments (+ more compute) have made them much
more feasible to train.
(Figure: a standard recurrent neural network — input units, distributed hidden state, output units; from [Sutskever, 2013])
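A minimal single-step sketch of the recurrence (tanh hidden units assumed):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One time step: the hidden state depends on the input AND the previous
    hidden state, letting the network carry context across time."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # distributed hidden state
    y_t = W_hy @ h_t                           # output units
    return h_t, y_t
```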
TRAINING AN RNN ON WIKIPEDIA
 Train RNN to predict next character (not word)
 Multiplicative RNN [Sutskever, 2013]
 Text generation demo: http://www.cs.toronto.edu/~ilya/rnn.html
 The machine learning meetup in San Francisco is considered enormously
emphasised. While as a consequence of these messages are allocated in the
environment to see the ideas and pollentium changes possible with the
Machinese gamma integrals increase, then the prefix is absent by a variety of
fresh deeperwater or matter level on 2 and 14, yet the…
ONSET DETECTION: STATE-OF-THE-ART
 Using Recurrent Neural Networks (RNN)
[Eyben, 2010], [Böck, 2012]
 RNN output trained to predict onset locations.
 80-90% accurate
 Can improve with more labeled training data, or possibly more
unsupervised training.
(Figure: spectrogram + Δ's at 3 time-scales feeding a 3-layer RNN whose output is the onset detection function)
OTHER EXAMPLES OF RNNS IN MUSIC
 Examples:
 Blues improvisation (with LSTM RNNs) [Eck, 2002]
 Polyphonic piano note transcription [Böck, 2012]
 Classical music generation and transcription [Boulanger-
Lewandowski, 2012]
 Other distributed-state sequence models include:
 Recurrent Temporal Restricted Boltzmann Machine
[Sutskever, 2013]
 Conditional Restricted Boltzmann Machine [Taylor, 2011]
 Used in drum pattern analysis later.
 RNNs are a promising way to model longer-term contextual
and temporal dependencies present in music.
TALK SUMMARY
 Introduction to MIR
 Common techniques:
 Genre/Mood Classification, Onset Detection
 Exciting new research directions:
 Feature learning, temporal models (RNNs)
 Live Drum Understanding
 Drum detection/transcription
 Drum pattern analysis
TOWARD COMPREHENSIVE RHYTHMIC
UNDERSTANDING
 Or “Live Drum Understanding”
 Goal: Go beyond simple beat tracking to provide context-
aware, instrument-aware information in real-time, e.g.
 “This rhythm is in 5/4 time”
 “This drummer is playing syncopated notes on the hi-hat”
 “The ride cymbal pattern has a swing feel”
 “This is a Samba rhythm”
LIVE DRUM UNDERSTANDING SYSTEM
(Figure: system flow — drum audio → Drum Detection → drum-wise activations → Beat Tracking → beat grid → Drum Pattern Analysis → beat locations, pattern analysis. Drum detection: Gamma Mixture Model training of drum templates; non-negative decomposition onto templates. Pattern analysis: generative deep learning of drum patterns; stacked Conditional Restricted Boltzmann Machines.)
REQUIREMENTS FOR DRUM DETECTION
 Real-Time/Live operation
 Useful with any percussion setup.
 Before a performance, we can quickly train the system for a
particular percussion setup.
 Amplitude (dynamics) information.
DRUM DETECTION: MAIN POINTS
 Gamma Mixture Model
 For learning spectral drum templates.
 Cheaper to train than GMM
 More stable than GMM
 Non-negative Vector Decomposition (NVD)
 For computing template activations from drum onsets.
 Learning multiple templates per drum improves separation.
 The use of “tail” templates reduces false positives.
DRUM DETECTION SYSTEM
(Figure: drum detection flow. Training: training data (drum-wise audio) → Onset Detection → Spectrogram Slice Extraction → Gamma Mixture Model → cluster parameters (drum templates). Performance: performance (raw audio) → Onset Detection → Spectrogram Slice Extraction → Non-negative Vector Decomposition, using the learned templates → drum activations.)
SPECTROGRAM SLICES
 Extracted at onsets.
 Each slice contains 100ms (~17 frames) of audio
 80 Bark-spaced bands per channel [Battenberg, 2008]
 During training, both “head” and “tail” slices are extracted.
 Tail templates serve as decoys during non-negative vector decomposition.
(Figure: 80-band spectrogram slice around a detected onset; the head slice covers roughly the first 33 ms and the tail slice the remaining 33–100 ms)
TRAINING DRUM TEMPLATES
 Instead of taking an “average” of
all training slices for a single
drum…
 Cluster them and use the cluster
centers as the drum templates.
 This gives us multiple
templates per drum…
 Which helps represent the variety
of sounds that can be made by a
single drum.
CLUSTERING USING MIXTURE MODELS
 Train using the Expectation-Maximization (EM)
algorithm.
 Gaussian Mixture Model (GMM)
 Covariance matrix – expensive to compute, possibly
unstable when data is lacking.
 Enforces a (scaled, squared) Euclidean distance measure.
 Gamma Mixture Model
 Single mean vector per component
 Variance increases with mean (like human hearing).
 Enforces an Itakura-Saito (IS) divergence measure
 A scale-invariant perceptual distance between audio
spectra.
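A sketch of the IS divergence itself; note that it is invariant to a common scaling of both spectra:

```python
import numpy as np

def itakura_saito(x, y, eps=1e-12):
    """d_IS(x || y) = sum(x/y - log(x/y) - 1); zero iff x == y elementwise."""
    r = (x + eps) / (y + eps)
    return np.sum(r - np.log(r) - 1.0)
```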
AGGLOMERATIVE CLUSTERING
 How many clusters to train?
 We use Minimum Description Length (MDL), aka BIC, to choose the
number of clusters.
 Negative log-likelihood
 + penalty term for number of clusters.
 1. Run EM to convergence.
 2. Merge the two most similar clusters.
 3. Repeat 1,2 until we have a single cluster.
 4. Choose parameter set with smallest MDL.
(Figure: example clusterings at successive merge steps with their MDL values; the minimum MDL selects the number of clusters.)
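In outline, the loop looks like the sketch below; `init_mixture`, `run_em`, `neg_log_likelihood`, `merge_most_similar`, and `penalty` are hypothetical helpers standing in for the gamma-mixture specifics:

```python
def agglomerative_mdl(data, k_max):
    """Train mixtures at every cluster count; keep the one with smallest MDL."""
    params = init_mixture(data, k_max)            # hypothetical helper
    best, best_mdl = None, float("inf")
    for k in range(k_max, 0, -1):
        params = run_em(data, params)             # 1. EM to convergence
        mdl = neg_log_likelihood(data, params) + penalty(k)  # MDL score
        if mdl < best_mdl:
            best, best_mdl = params, mdl          # 4. smallest MDL wins
        if k > 1:
            params = merge_most_similar(params)   # 2. merge two closest clusters
    return best
```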
DECOMPOSING ONSETS ONTO TEMPLATES
 Non-negative Vector Decomposition (NVD)
 A simplification of Non-negative Matrix Factorization (NMF)
 W matrix contains drum templates in its columns.
 Adding a sparsity penalty (L1) on h improves NVD.
(Figure: x ≈ W · h — the columns of W hold head and tail templates for bass, snare, hi-hat (closed), hi-hat (open), and ride)
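A hedged sketch of the decomposition using the common IS-cost multiplicative update with an L1 penalty on h (the exact rule used in the talk appears on an extra slide; this is the standard form from the NMF literature):

```python
import numpy as np

def nvd(x, W, lam=0.1, n_iter=200, eps=1e-12):
    """Find h >= 0 with x ≈ W h, minimizing d_IS(x, W h) + lam * ||h||_1."""
    h = np.full(W.shape[1], x.mean() / W.shape[1] + eps)
    for _ in range(n_iter):
        v = W @ h + eps                 # current reconstruction
        num = W.T @ (x / v**2)
        den = W.T @ (1.0 / v) + lam     # L1 penalty enters the denominator
        h *= num / (den + eps)          # multiplicative: h stays non-negative
    return h
```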
DECOMPOSING ONSETS ONTO TEMPLATES
 What do we do with the output of NVD?
 The head template activations for a single drum are summed to
get the total activation of that drum.
 The tail template activations are discarded.
 They simply serve as “decoys” so that the long decay of a previous
onset does not affect the current decomposition as drastically.
(Figure: the NVD output h spans head and tail templates; head activations are summed per drum to produce the drum activations, and tail activations are discarded)
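The post-processing is then a sketch like the following, where `head_cols` is a hypothetical mapping from each drum to its head-template columns in W:

```python
def drum_activations(h, head_cols):
    """Sum head-template activations per drum; tail activations are ignored."""
    return {drum: float(h[cols].sum()) for drum, cols in head_cols.items()}
```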
EVALUATION
 Test Data:
 23 minutes total, 8022 drum onsets
 8 different drums/cymbals:
 Bass, snare, hi-hat (open/closed), ride, 2 toms, crash.
 Recorded to stereo using multiple
microphones.
 50-100 training hits per drum.
 Parameters to vary for testing:
 Maximum number of templates per drum
{0,1,30}
 Result Metrics
 Detection accuracy:
 F-Score
 Amplitude Fidelity
 Cosine similarity
DETECTION RESULTS
 Varying maximum templates per drum.
 Adding tail templates helps the most.
 > 1 head template helps too.
AUDIO EXAMPLES
 100 BPM, rock, syncopated snare drum, fills/flams.
 Note the messy fills in the KH=1, KT=0 version
(Audio: original performance; KH=30, KT=30 resynthesis; KH=1, KT=0 resynthesis)
AUDIO EXAMPLES
 181 BPM, fast rock, open hi-hat
 Note the extra hi-hat notes in the KH=1, KT=0 version.
(Audio: original performance; KH=30, KT=30 resynthesis; KH=1, KT=0 resynthesis)
AUDIO EXAMPLES
 94 BPM, snare drum march, accented notes
 Note the extra bass drum notes and many extra cymbals in the KH=1, KT=0 version.
(Audio: original performance; KH=30, KT=30 resynthesis; KH=1, KT=0 resynthesis)
DRUM DETECTION SUMMARY
 Drum detection front end for a complete drum understanding
system.
 Gamma Mixture Model
 Cheaper to train than GMM (no covariance matrix)
 More stable than GMM (no covariance)
 Allows soft clustering with perceptual Itakura-Saito distance in the
linear domain (important for learning NVD templates).
 Non-negative Vector Decomposition
 Greatly improved with tail templates and multiple head templates
per drum.
 Next steps
 Online training of templates.
 Training a more general model using many templates.
LIVE DRUM UNDERSTANDING SYSTEM
(Figure: the system flow revisited — drum audio → Drum Detection → drum-wise activations → Beat Tracking → candidate beat periods/phases and beat grid → Drum Pattern Analysis → beat locations, pattern analysis. The remaining component is Drum Pattern Analysis: generative deep learning of drum patterns with stacked Conditional Restricted Boltzmann Machines.)
DRUM PATTERN ANALYSIS
 Desired information:
 What style is this?
 What is the meter? (4/4,3/4…)
 Double/half time feel?
 Where is the “one”?
 Typical approach:
 Drum pattern template correlation.
 Align one or more templates with drum onsets.
 Instead: use a Deep Neural Network.
CONDITIONAL RESTRICTED BOLTZMANN MACHINE
(Figure: past notes conditioning current notes)
 Modeling motion, language models, other sequences.
 Models: P(v|y)
 For music:
 P(current notes|past notes)
 Useful for generating drum patterns.
 Used to pre-train the neural network.
 Intuition for drums:
 Hidden units model the drummer's options given the recent past.
 And therefore, they encode information about the state of the drumming.
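A hedged sketch of the conditional structure, following Taylor's CRBM: the context of past notes y shifts the biases of an RBM over the current notes v.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def crbm_hidden_probs(v, y, W, A, c):
    """p(h = 1 | v, y): past notes y condition the hidden biases via A."""
    return sigmoid(c + A @ y + W.T @ v)

def crbm_visible_probs(h, y, W, B, b):
    """p(v = 1 | h, y): past notes y condition the visible biases via B."""
    return sigmoid(b + B @ y + W @ h)
```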
TEST SETUP
 173 twelve-measure sequences
 Recorded on Roland V-Drums
 33,216 beat subdivisions (16 per measure).
 Rock, funk, drum 'n' bass, metal, Brazilian rhythms.
 Network configurations
 Each with 2 measures of context
 ~90,000 weights each
 Conditional RBM (CRBM) is a variation on the RBM
 1. CRBM (3 layers)
 800 hidden units
 2. CRBMRBM (4 layers)
 600, 50 hidden units
 3. CRBMCRBM (4 layers)
 200, 50 hidden units
 4. CRBMCRBMRBM (5 layers)
 200, 25, 25 hidden units
TRAINING
 Training tricks
 L2 weight decay, momentum, dynamic learning rate.
 Implementation
 Python with Gnumpy (numerical routines on GPU)
 GPU computing very important to contemporary neural network
research.
 Around 30 minutes to train one model.
MEASURE ALIGNMENT RESULTS
 3-fold cross-validated results.
 Neural network models eliminate half the errors of template correlation.
 Not a significant difference between NN models.
 This will change with a larger, more diverse dataset.
QUARTER NOTE ALIGNMENT RESULTS
 Can compute quarter note alignment from full measure alignment
probabilities.
 Quarter note alignment can help correct a beat tracker when it gets out of
phase.
 Again, half the errors are eliminated.
EXAMPLE OUTPUT
 Label Probabilities (Red = 1, Blue = 0)
 White lines denote measure boundaries
 Each column shows the probability of each label at that time.
EXAMPLE OUTPUT
 Label Estimates (Red denotes estimate)
 White lines denote measure boundaries
HMM FILTERING
 Can use label probabilities as observation posteriors in HMM.
 Assign high probability to sequential label transitions.
 Models producing lower cross-entropy improve more when using
HMM-filtering.
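A sketch of the filtering step, treating the network outputs as observation posteriors (the transition matrix A, which assigns high probability to sequential label transitions, is an assumption about the setup):

```python
import numpy as np

def hmm_filter(posteriors, A, prior):
    """Forward filtering: posteriors is (T, K) per-frame label probabilities."""
    T, K = posteriors.shape
    alpha = np.empty((T, K))
    alpha[0] = prior * posteriors[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * posteriors[t]  # predict, then update
        alpha[t] /= alpha[t].sum()                     # keep it a distribution
    return alpha
```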
EXAMPLE OUTPUT
 HMM-Filtered Label Probabilities
ANALYSIS OF RESULTS
 Generatively pre-trained neural network models eliminate
about half the errors compared to a baseline template
correlation method.
 HMM-filtering can be used to improve accuracy.
 Overfitting present in backpropagation.
 Address overfitting with:
 Larger dataset
 “Dropout” [Hinton, 2013]
 Many other regularization techniques
 Next steps:
 Important next step is evaluation with much larger, more diverse
dataset.
 Evaluate ability to do style and meter classification.
SUMMARY
 Content-Based Music Information Retrieval
 Mood, genre, onsets, beats, transcription, recommendation, and
much, much more!
 Exciting directions include feature learning and RNNs.
 Drum Detection
 Learn multiple templates per drum using agglomerative gamma
mixture model training.
 The use of “tail” templates reduces false positives.
 Drum Pattern Analysis
 A deep neural network can be used to make all kinds of rhythmic
inferences.
 For measure alignment, reduces errors by ~50% compared to
template correlation.
 Generalization can be improved using more data and regularization
techniques during backpropagation.
GETTING INVOLVED IN MUSIC INFORMATION
RETRIEVAL
 Check out the proceedings of ISMIR (free online):
 http://www.ismir.net/
 Participate in MIREX (annual MIR eval):
 http://www.music-ir.org/mirex/wiki/MIREX_HOME
 Join the Music-IR mailing list:
 http://listes.ircam.fr/wws/info/music-ir
 Join the Music Information Retrieval Google Plus community
(just started it):
 https://plus.google.com/communities/109771668656894350107
REFERENCES
 Genre/Mood Classification:
 [1] P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” Proc. of the
11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 339–344,
2010.
 [2] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, “Unsupervised learning of sparse features
for scalable audio classification,” Proceedings of International Symposium on Music Information
Retrieval (ISMIR'11), 2011.
 [3] J. Anden and S. Mallat, “Deep Scattering Spectrum,” arXiv.org. 2013.
 [4] E. Schmidt and Y. Kim, “Learning rhythm and melody features with deep belief networks,” ISMIR,
2013.
REFERENCES
 Onset Detection:
 [1] J. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. Davies, and M. Sandler, “A tutorial on onset
detection in music signals,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, p.
1035, 2005.
 [2] A. Klapuri, A. Eronen, and J. Astola, “Analysis of the meter of acoustic musical signals,” IEEE
Transactions on Speech and Audio Processing, vol. 14, no. 1, p. 342, 2006.
 [3] J. Bello, C. Duxbury, M. Davies, and M. Sandler, “On the use of phase and energy for musical
onset detection in the complex domain,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–
556, 2004.
 [4] S. Böck, A. Arzt, F. Krebs, and M. Schedl, “Online realtime onset detection with recurrent neural
networks,” presented at the International Conference on Digital Audio Effects (DAFx-12), 2012.
 [5] F. Eyben, S. Böck, B. Schuller, and A. Graves, “Universal onset detection with bidirectional long-
short term memory neural networks,” Proc. of ISMIR, 2010.
REFERENCES
 Neural Networks:
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional
neural networks,” Advances in neural information processing systems, 2012.
 Unsupervised Feature Learning:
 [2] G. E. Hinton and S. Osindero, “A fast learning algorithm for deep belief nets,” Neural
Computation, 2006.
 [3] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust
features with denoising autoencoders,” pp. 1096–1103, 2008.
 Recurrent Neural Networks:
 [4] I. Sutskever, “Training Recurrent Neural Networks,” 2013.
 [6] D. Eck and J. Schmidhuber, “Finding temporal structure in music: Blues improvisation with LSTM
recurrent networks,” pp. 747–756, 2002.
 [7] S. Bock and M. Schedl, “Polyphonic piano note transcription with recurrent neural networks,” pp.
121–124, 2012.
 [8] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-
dimensional sequences: Application to polyphonic music generation and transcription,” arXiv preprint
arXiv:1206.6392, 2012.
 [9] G. Taylor and G. E. Hinton, “Two Distributed-State Models For Generating High-Dimensional Time
Series,” Journal of Machine Learning Research, 2011.
REFERENCES
 Drum Understanding:
 [1] E. Battenberg, “Techniques for Machine Understanding of Live Drum Performances,” PhD Thesis,
University of California, Berkeley, 2012.
 [2] E. Battenberg and D. Wessel, “Analyzing drum patterns using conditional deep belief networks,”
presented at the International Society for Music Information Retrieval Conference, Porto, Portugal,
2012.
 [3] E. Battenberg, V. Huang, and D. Wessel, “Toward live drum separation using probabilistic spectral
clustering based on the Itakura-Saito divergence,” presented at the Audio Engineering Society
Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio,
2012, vol. 3.
THANK YOU!
EXTRA SLIDES
ONSET DETECTION
•Onset Detection Function (ODF): Differentiated log-energy of multiple perceptual sub-bands.
•Onsets are located using the ODF and a dynamic, causal peak-picking threshold.
(Figure: pipeline from audio in to the ODF o(n) — FFT (window size 23 ms, hop size 5.8 ms) → sub-band energy ci(n), 20 Bark bands per channel → mu-law compression si(n), µ = 10^8 → smoothing zi(n), Hann window with 20 Hz cutoff → half-wave rectified derivative dzi(n) → mean across sub-bands.)
DRUM SEPARATION
 Some Approaches to “Drum Transcription”
 Feature-based classification [Gouyon 2001]
 NMF with general drum templates [Paulus 2005]
 General “average” drum templates.
 Match and Adapt [Yoshii 2007]
 Offline, iterative
 Requirements for “Drum Separation”
 Online/Live operation
 Useful with any percussion setup.
 Which drum is playing when?
And at what dynamic level?
GAMMA DISTRIBUTION
 Mixture model is composed of gamma distributions.
 The gamma distribution models the sum of k independent
exponential distributions.
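The slide's equation was an image; a hedged reconstruction for integer shape k (the Erlang form of the gamma, matching the sum-of-exponentials description):

```latex
% Gamma (Erlang) pdf: sum of k i.i.d. exponentials with mean theta
p(x \mid k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^{k} (k-1)!},
\qquad x \ge 0
```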
GAMMA MIXTURE MODEL
 Multivariate Gamma (independent components):
 Mixture density:
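The equations here were images as well; a hedged reconstruction consistent with the slides (independent components, a single mean vector μ per component, shared shape k):

```latex
% Independent-component multivariate gamma, parameterized by its mean vector:
p(\mathbf{x} \mid k, \boldsymbol{\mu}) =
  \prod_{d} p\left(x_d \mid k, \theta_d = \mu_d / k\right)

% Mixture density with weights pi_j:
p(\mathbf{x}) = \sum_{j} \pi_j \, p(\mathbf{x} \mid k, \boldsymbol{\mu}_j)
```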
THE EM ALGORITHM: GAMMA EDITION
 E-step: (compute posteriors)
 M-step: (update parameters)
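The update equations were images; a hedged reconstruction using the standard mixture-model EM form, specialized to the mean-parameterized gamma components above:

```latex
% E-step: posterior responsibility of component j for sample x_n
\gamma_{nj} = \frac{\pi_j \, p(\mathbf{x}_n \mid k, \boldsymbol{\mu}_j)}
                   {\sum_{l} \pi_l \, p(\mathbf{x}_n \mid k, \boldsymbol{\mu}_l)}

% M-step: responsibility-weighted parameter updates
\pi_j = \frac{1}{N} \sum_{n} \gamma_{nj},
\qquad
\boldsymbol{\mu}_j = \frac{\sum_{n} \gamma_{nj} \, \mathbf{x}_n}
                          {\sum_{n} \gamma_{nj}}
```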
DECOMPOSING ONSETS ONTO TEMPLATES
 To solve this problem (add L1 penalty):
 I use the IS distance as the cost function in the above.
 While the IS distance is not strictly convex, in practice the cost is
non-increasing under the following update rule:
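The update rule itself was an image; a hedged reconstruction of the standard multiplicative IS update with the L1 penalty λ (⊙ and ∘ denote elementwise operations):

```latex
% Multiplicative update for h >= 0 under d_IS(x, W h) + \lambda ||h||_1:
\mathbf{h} \leftarrow \mathbf{h} \odot
  \frac{W^{\top}\left[ (W\mathbf{h})^{\circ -2} \odot \mathbf{x} \right]}
       {W^{\top} (W\mathbf{h})^{\circ -1} + \lambda}
```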
RESTRICTED BOLTZMANN MACHINE
 Stochastic autoencoder [Hinton 2006]
 Probabilistic graphical model
 Weights/biases define:
 Factorial conditionals:
 Logistic activation probabilities
 Binary units (as opposed to real-valued units)
 Act as a strong regularizer (prevent overfitting)
 Training: maximize likelihood of data under the model
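The defining equations were images; a hedged reconstruction of the standard binary-RBM forms:

```latex
% Energy and joint probability defined by weights W and biases b, c:
E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^{\top}\mathbf{v}
  - \mathbf{c}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h},
\qquad
p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}

% Factorial conditionals with logistic activation probabilities:
p(h_j = 1 \mid \mathbf{v}) = \sigma\!\left(c_j + \mathbf{v}^{\top} W_{:,j}\right),
\qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\!\left(b_i + W_{i,:} \mathbf{h}\right)
```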
CONTRASTIVE DIVERGENCE
 ML learning can be done using Gibbs sampling to draw samples from the joint p(v, h)
 By taking alternating samples from p(h | v) and p(v | h)
 But, this can take a very long time to
converge.
 Approximation:
“Contrastive Divergence” [Hinton 2006]
 Take only a few alternating samples (k)
for each update. (CD-k)
DEEP BELIEF NETWORKS
(STACKING RBMS)
 Deep Belief Network (4+ layers)
 Pre-train each layer as an RBM
 Greedy Pre-training
 Train first level RBM
 Use real-valued hidden unit
activations of an RBM as input to
subsequent RBM.
 Train next RBM
 Repeat until deep enough.
 Fine-Tuning
 After pre-training, network is
discriminatively fine-tuned using
backpropagation of the cross-entropy error.
EXAMPLE HMM FILTERING
 White lines denote measure boundaries.
(Figure: HMM-filtered label probabilities for LCRBM-800 on test sequence 88 — classifier label probabilities, HMM-filtered probabilities, and ground truth over beat subdivisions 0–144, 16 per measure.)
EXAMPLE OUTPUT
 HMM-Filtered Label Estimates
(Figure: HMM-filtered label estimates on test sequence 67 for LCRBM-800, LRBM-50, CRBM-600, LCRBM-50, CRBM-200, LRBM-25, CRBM-25, and 2-template correlation, compared with ground truth over beat subdivisions 0–144, 16 per measure.)

More Related Content

What's hot

Speech based password authentication system on FPGA
Speech based password authentication system on FPGASpeech based password authentication system on FPGA
Speech based password authentication system on FPGARajesh Roshan
 
Speaker recognition systems
Speaker recognition systemsSpeaker recognition systems
Speaker recognition systemsNamratha Dcruz
 
Speaker identification using mel frequency
Speaker identification using mel frequency Speaker identification using mel frequency
Speaker identification using mel frequency Phan Duy
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognitionphyuhsan
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkeSAT Journals
 
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...Ahmed Ayman
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisRushin Shah
 
MLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to MusicMLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to MusicEric Battenberg
 
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...MediaEval2012
 
딥러닝 개요 (2015-05-09 KISTEP)
딥러닝 개요 (2015-05-09 KISTEP)딥러닝 개요 (2015-05-09 KISTEP)
딥러닝 개요 (2015-05-09 KISTEP)Keunwoo Choi
 
Fun with MATLAB
Fun with MATLABFun with MATLAB
Fun with MATLABritece
 
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016Keunwoo Choi
 
Automatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approachAutomatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approachAbdullah al Mamun
 
44 i9 advanced-speaker-recognition
44 i9 advanced-speaker-recognition44 i9 advanced-speaker-recognition
44 i9 advanced-speaker-recognitionsunnysyed
 
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...François Rigaud
 

What's hot (20)

Speech based password authentication system on FPGA
Speech based password authentication system on FPGASpeech based password authentication system on FPGA
Speech based password authentication system on FPGA
 
SPEAKER VERIFICATION
SPEAKER VERIFICATIONSPEAKER VERIFICATION
SPEAKER VERIFICATION
 
Speaker recognition systems
Speaker recognition systemsSpeaker recognition systems
Speaker recognition systems
 
Et25897899
Et25897899Et25897899
Et25897899
 
Speaker identification using mel frequency
Speaker identification using mel frequency Speaker identification using mel frequency
Speaker identification using mel frequency
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognition
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural network
 
Mini Project- Audio Enhancement
Mini Project-  Audio EnhancementMini Project-  Audio Enhancement
Mini Project- Audio Enhancement
 
Speaker recognition.
Speaker recognition.Speaker recognition.
Speaker recognition.
 
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition...
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech Analysis
 
MLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to MusicMLConf2013: Teaching Computer to Listen to Music
MLConf2013: Teaching Computer to Listen to Music
 
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene...
 
딥러닝 개요 (2015-05-09 KISTEP)
딥러닝 개요 (2015-05-09 KISTEP)딥러닝 개요 (2015-05-09 KISTEP)
딥러닝 개요 (2015-05-09 KISTEP)
 
Fun with MATLAB
Fun with MATLABFun with MATLAB
Fun with MATLAB
 
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
Automatic Tagging using Deep Convolutional Neural Networks - ISMIR 2016
 
Automatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approachAutomatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approach
 
44 i9 advanced-speaker-recognition
44 i9 advanced-speaker-recognition44 i9 advanced-speaker-recognition
44 i9 advanced-speaker-recognition
 
Mjfg now
Mjfg nowMjfg now
Mjfg now
 
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
 

Similar to Teaching Computers to Listen to Music

Ml conf2013 teaching_computers_share
Ml conf2013 teaching_computers_shareMl conf2013 teaching_computers_share
Ml conf2013 teaching_computers_shareMLconf
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Implemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit
Implemetation of parallelism in HMM DNN based state of the art kaldi ASR ToolkitImplemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit
Implemetation of parallelism in HMM DNN based state of the art kaldi ASR ToolkitShubham Verma
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice RecognitionAmrita More
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
 
Te electronics & telecommunication
Te electronics & telecommunicationTe electronics & telecommunication
Te electronics & telecommunicationkapilkotangale
 
Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...SungminYou
 
Generating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningGenerating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningVarad Meru
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)IRJET Journal
 
Big Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learningBig Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learningJulien TREGUER
 
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"Lviv Startup Club
 
IRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVMIRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVMIRJET Journal
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
NMR Automation
NMR AutomationNMR Automation
NMR Automationcknoxrun
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 

Similar to Teaching Computers to Listen to Music (20)

Ml conf2013 teaching_computers_share
Ml conf2013 teaching_computers_shareMl conf2013 teaching_computers_share
Ml conf2013 teaching_computers_share
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Implemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit
Implemetation of parallelism in HMM DNN based state of the art kaldi ASR ToolkitImplemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit
Implemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
 
ISSCS2011
ISSCS2011ISSCS2011
ISSCS2011
 
Te electronics & telecommunication
Te electronics & telecommunicationTe electronics & telecommunication
Te electronics & telecommunication
 
Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...
 
Generating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningGenerating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep Learning
 
AC overview
AC overviewAC overview
AC overview
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)
 
Big Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learningBig Sky Earth 2018 Introduction to machine learning
Big Sky Earth 2018 Introduction to machine learning
 
T26123129
T26123129T26123129
T26123129
 
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
 
IRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVMIRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVM
 
som
somsom
som
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
NMR Automation
NMR AutomationNMR Automation
NMR Automation
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Teaching Computers to Listen to Music

  • 1. TEACHING COMPUTERS TO LISTEN TO MUSIC Eric Battenberg http://ericbattenberg.com Gracenote, Inc. Previously: UC Berkeley, Dept. of EECS Parallel Computing Laboratory CNMAT (Center for New Music and Audio Technologies)
  • 2. MACHINE LISTENING  Speech processing  Speech processing makes up the vast majority of funded machine listening research.  Just as there‟s more to Computer Vision than OCR, there‟s more to machine listening than speech recognition!  Audio content analysis  Audio fingerprinting (e.g. Shazam, Gracenote)  Audio classification (music, speech, noise, laughter, cheering)  Audio event detection (new song, channel change, hotword)  Content-based Music Information Retrieval (MIR)  Today‟s topic 2
  • 3. GETTING COMPUTERS TO “LISTEN” TO MUSIC  Not trying to get computers to “listen” for enjoyment.  More accurate: Analyzing music with computers.  What kind of information to get out of the analysis?  What instruments are playing?  What is the mood?  How fast/slow is it (tempo)?  What does the singer sound like?  How can I play this song on the guitar/drums? 3 Ben Harper James Brown
  • 4. CONTENT-BASED MUSIC INFORMATION RETRIEVAL  Many tasks:  Genre, mood classification, auto-tagging  Beat tracking, tempo detection  Music similarity, playlisting, recommendation  Heavily aided by collaborative filtering  Automatic music transcription  Source separation (voice, drums, bass…)  Music segmentation (verse, chorus) 4
  • 5. TALK SUMMARY  Introduction to Music Information Retrieval  Some common techniques  Exciting new research directions  Live Drum Understanding  Drum detection/transcription  Drum pattern analysis 5
  • 6. TALK SUMMARY  Introduction to Music Information Retrieval  Some common techniques  Exciting new research directions  Live Drum Understanding  Drum detection/transcription  Drum pattern analysis 6
  • 7. QUICK LESSON: THE SPECTROGRAM  The spectrogram: Very common feature used in audio analysis.  Time-frequency representation of audio.  Take FFT of adjacent frames of audio samples, put them in a matrix.  Each column shows frequency content at a particular time. 7 Time Frequency
  • 8. GENRE/MOOD CLASSIFICATION: TYPICAL APPROACHES  Typical approach:  Extract a bunch of hand-designed features describing small windows of the signal (e.g., spectral centroid, kurtosis, harmonicity, percussiveness, MFCCs, 100‟s more…).  Train a GMM or SVM to predict genre/mood/tags by either:  Summarizing a song using mean/variance of each feature  Log-likelihood sum across frames (GMM) or frame-wise voting (SVM)  Pros:  Works fairly well, was state of the art for a while.  Well understood models, implementations widely available.  Cons:  Bag-of-frames style approach lacks ability to describe rhythm and temporal dynamics.  Getting further improvements requires hand designing more features. 8
  • 9. LEARNING FEATURES: NEURAL NETWORKS  Each layer computes a non-linear transformation of the previous layer.  Linear transformation (weight matrices)  + non-linearity (e.g. sigmoid [σ])  h = σ(W1v), o = σ(W2h)  Train to minimize output error.  Each hidden layer can be thought of as a set of features.  Train using backpropagation.  Iterative steps:  Compute activations  Compute output error  Backprop. error signal  Compute gradients  Update all weights.  Resurgence of neural networks:  More compute  More data  A few new tricks… 9 Feedfordward Backpropagate Sigmoid non-linearity 1.0 0.0
  • 10. DEEP NEURAL NETWORKS  Deep Neural Networks  Millions to billions of parameters  Many layers of “features”  Achieving state of the art performance in vision and speech tasks.  Problem: Vanishing error signal.  Weights of lower layers do not change much.  Solutions:  Train for a really long time.   Pre-train each hidden layer as an autoencoder. [Hinton, 2006]  Rectified Linear Units [Krizhevsky, 2012] 10 Feedfordward Backpropagate Sigmoid non-linearity Rectifier non-linearity
  • 11. AUTOENCODERS AND UNSUPERVISED FEATURE LEARNING  Many ways to learn features in an unsupervised way:  Autoencoders – train a network to reconstruct the input  Restricted Boltzmann Machine (RBM) [Hinton, 2006]  Denoising Autoencoders [Vincent, 2008]  Sparse Autoencoders  Clustering – K-Means, mixture models, etc.  Sparse Coding – learn overcomplete dictionary of features with sparsity constraint 11 Autoencoder: Train to reconstruct input Hiddens (Features) Data Reconstructed Data
  • 12. GENRE/MOOD CLASSIFICATION: NEWER APPROACHES  Newer approaches to feature extraction:  Learn spectral features using Restricted Boltzmann Machines (RBMs) and Deep Neural Networks (DNN) [Hamel, 2010] – good genre performance.  Learn sparse features using Predictive Sparse Decomposition (PSD) [Henaff, 2011] – good genre performance  Learn beat-synchronous rhythm and timbre features with RBMs and DNNs [Schmidt, 2013] – improved mood performance  Tune multi-layer wavelet features called Deep Scattering Spectrum [Anden, 2013]* - state-of-the-art genre performance  Pros:  Hand-designing individual features is not required.  Computers can learn complex high-order features that humans cannot hand code.  Further work:  More work on incorporating context, rhythm, and temporal dynamics into feature learning 12 * Current state-of-the art on GTZAN genre dataset [Henaff, 2011] Some learned frequency features
  • 13. ONSET DETECTION  Onset detection is important for music transcription, beat tracking, tempo detection, and rhythm summarization.  Describing an onset:  Transient  Short interval when music signal evolves unpredictably.  Attack  Amplitude envelope increasing.  Decay  Amplitude envelope decreasing.  Onset  Point in time chosen to represent beginning of transient.  Onset detection can be hard for certain instruments with ambiguous attacks or when a note changes without a new attack (legato). 13 from (Bello, 2005) Audio Signal Onset Envelope
  • 14. ONSET DETECTION: TYPICAL APPROACHES  Computing an Onset Detection Function (ODF):  Derivative of energy envelope  Derivative of band-wise log-energies [Klapuri, 2006]  Complex spectral domain (difference with predictions of phase and magnitude) [Bello, 2004]  Choosing onsets using the ODF:  Peak-pick local maxima above a dynamic threshold [Bello, 2005], as in the sketch below.  Pros:  Simple to implement.  Works fairly well (60-80% accurate)  Cons:  Lots of hand tuning of thresholds  No machine learning 14
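For illustration, a minimal sketch of this style of pipeline: a simple spectral-flux ODF followed by peak picking against a median-based dynamic threshold. The window and offset parameters (`win`, `delta`) are hypothetical hand-tuned values, which is exactly the "con" noted above.

```python
import numpy as np

def spectral_flux_odf(spectrogram):
    """Half-wave-rectified frame-to-frame increase in magnitude, summed
    over frequency: a simple energy-derivative-style ODF."""
    diff = np.diff(spectrogram, axis=1)        # bins x (frames - 1)
    return np.maximum(diff, 0.0).sum(axis=0)

def pick_onsets(odf, win=10, delta=0.1):
    """Local maxima above a dynamic (median-based) threshold."""
    onsets = []
    for n in range(1, len(odf) - 1):
        lo, hi = max(0, n - win), min(len(odf), n + win + 1)
        threshold = np.median(odf[lo:hi]) + delta
        if odf[n] > threshold and odf[n] >= odf[n - 1] and odf[n] > odf[n + 1]:
            onsets.append(n)
    return onsets
```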
  • 15. RECURRENT NEURAL NETWORKS  Non-linear sequence model  Hidden units have connections to the previous time step  Unlike HMMs, can model long-term dependencies using distributed hidden state.  Recent developments (+ more compute) have made them much more feasible to train. 15  [Figure from [Sutskever, 2013]: standard recurrent neural network — input units, distributed hidden state, output units]
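A minimal sketch of the standard RNN in the figure (random, untrained weights; sizes chosen arbitrarily): the hidden state h is carried across time steps, which is what gives the model its distributed memory, in contrast to an HMM's single discrete state.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 80, 100, 1          # e.g. spectral frame in, onset prob. out
W_xh = rng.normal(scale=0.01, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.01, size=(n_hid, n_hid))  # recurrent weights: the "memory"
W_hy = rng.normal(scale=0.01, size=(n_out, n_hid))

def rnn_forward(inputs):
    """Run a standard RNN over a sequence of input frames:
    h_t = tanh(W_xh x_t + W_hh h_{t-1})."""
    h = np.zeros(n_hid)
    outputs = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        outputs.append(1.0 / (1.0 + np.exp(-(W_hy @ h))))  # sigmoid output
    return outputs
```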
  • 16. TRAINING AN RNN ON WIKIPEDIA  Train RNN to predict next character (not word)  Multiplicative RNN [Sutskever, 2013]  Text generation demo: http://www.cs.toronto.edu/~ilya/rnn.html  The machine learning meetup in San Francisco is considered enormously emphasised. While as a consequence of these messages are allocated in the environment to see the ideas and pollentium changes possible with the Machinese gamma integrals increase, then the prefix is absent by a variety of fresh deeperwater or matter level on 2 and 14, yet the… 16
  • 17. ONSET DETECTION: STATE-OF-THE-ART  Using Recurrent Neural Networks (RNN) [Eyben, 2010], [Böck, 2012]  RNN output trained to predict onset locations.  80-90% accurate  Can improve with more labeled training data, or possibly more unsupervised training. 17  [Figure: spectrogram + Δ's at 3 time-scales → 3-layer RNN → onset detection function]
  • 18. OTHER EXAMPLES OF RNNS IN MUSIC  Examples:  Blues improvisation (with LSTM RNNs) [Eck, 2002]  Polyphonic piano note transcription [Böck, 2012]  Classical music generation and transcription [Boulanger-Lewandowski, 2012]  Other distributed-state sequence models include:  Recurrent Temporal Restricted Boltzmann Machine [Sutskever, 2013]  Conditional Restricted Boltzmann Machine [Taylor, 2011]  Used in drum pattern analysis later.  RNNs are a promising way to model longer-term contextual and temporal dependencies present in music. 18
  • 19. TALK SUMMARY  Introduction to MIR  Common techniques:  Genre/Mood Classification, Onset Detection  Exciting new research directions:  Feature learning, temporal models (RNNs)  Live Drum Understanding  Drum detection/transcription  Drum pattern analysis 19
  • 20. TALK SUMMARY  Introduction to MIR  Common techniques:  Genre/Mood Classification, Onset Detection  Exciting new research directions:  Feature learning, temporal models (RNNs)  Live Drum Understanding  Drum detection/transcription  Drum pattern analysis 20
  • 21. TOWARD COMPREHENSIVE RHYTHMIC UNDERSTANDING  Or “Live Drum Understanding”  Goal: Go beyond simple beat tracking to provide context-aware, instrument-aware information in real-time, e.g.  “This rhythm is in 5/4 time”  “This drummer is playing syncopated notes on the hi-hat”  “The ride cymbal pattern has a swing feel”  “This is a Samba rhythm” 21
  • 22. LIVE DRUM UNDERSTANDING SYSTEM  [System diagram: drum audio → Drum Detection → drum-wise activations → Beat Tracking → beat grid → Drum Pattern Analysis → beat locations, pattern analysis]  Drum Detection: Gamma Mixture Model training of drum templates; non-negative decomposition onto templates.  Drum Pattern Analysis: generative deep learning of drum patterns; stacked Conditional Restricted Boltzmann Machines. 22
  • 23. REQUIREMENTS FOR DRUM DETECTION  Real-Time/Live operation  Useful with any percussion setup.  Before a performance, we can quickly train the system for a particular percussion setup.  Amplitude (dynamics) information. 23
  • 24. DRUM DETECTION: MAIN POINTS  Gamma Mixture Model  For learning spectral drum templates.  Cheaper to train than GMM  More stable than GMM  Non-negative Vector Decomposition (NVD)  For computing template activations from drum onsets.  Learning multiple templates per drum improves separation.  The use of “tail” templates reduces false positives. 24
  • 25. DRUM DETECTION SYSTEM  [System diagram — Training: training data (drum-wise audio) → Onset Detection → Spectrogram Slice Extraction → Gamma Mixture Model → cluster parameters (drum templates). Performance: performance (raw audio) → Onset Detection → Spectrogram Slice Extraction → Non-negative Vector Decomposition (using the drum templates) → drum activations] 25
  • 26. DRUM DETECTION SYSTEM  [System diagram — Training: training data (drum-wise audio) → Onset Detection → Spectrogram Slice Extraction → Gamma Mixture Model → cluster parameters (drum templates). Performance: performance (raw audio) → Onset Detection → Spectrogram Slice Extraction → Non-negative Vector Decomposition (using the drum templates) → drum activations] 26
  • 27. SPECTROGRAM SLICES  Extracted at onsets.  Each slice contains 100ms (~17 frames) of audio  80 bark-spaced bands per channel [Battenberg 2008]  During training, both “head” and “tail” slices are extracted.  Tail templates serve as decoys during non-negative vector decomposition.  [Figure: 80-band spectrogram around a detected onset — the head slice covers the first 33ms, the tail slice the following 67ms of the 100ms window] 27
  • 28. DRUM DETECTION SYSTEM  [System diagram — Training: training data (drum-wise audio) → Onset Detection → Spectrogram Slice Extraction → Gamma Mixture Model → cluster parameters (drum templates). Performance: performance (raw audio) → Onset Detection → Spectrogram Slice Extraction → Non-negative Vector Decomposition (using the drum templates) → drum activations] 28
  • 29. TRAINING DRUM TEMPLATES  Instead of taking an “average” of all training slices for a single drum…  Cluster them and use the cluster centers as the drum templates.  This gives us multiple templates per drum…  Which helps represent the variety of sounds that can be made by a single drum. 29
  • 30. CLUSTERING USING MIXTURE MODELS  Train using the Expectation-Maximization (EM) algorithm.  Gaussian Mixture Model (GMM)  Covariance matrix – expensive to compute, possibly unstable when data is lacking.  Enforces a (scaled, squared) Euclidean distance measure.  Gamma Mixture Model  Single mean vector per component  Variance increases with mean (like human hearing).  Enforces an Itakura-Saito (IS) divergence measure  A scale-invariant perceptual distance between audio spectra. 30
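For reference, the standard definition of the Itakura-Saito divergence between two spectra (a known formula, not taken from the slide). It is scale-invariant: rescaling both spectra by the same factor leaves the divergence unchanged, which is why it makes a good perceptual distance for audio.

```latex
% Itakura-Saito divergence between spectra x and y (elementwise over bins m);
% note d_IS(\lambda x \| \lambda y) = d_IS(x \| y).
d_{IS}(\vec{x} \,\|\, \vec{y}) \;=\; \sum_{m} \left( \frac{x_m}{y_m} - \log\frac{x_m}{y_m} - 1 \right)
```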
  • 31. AGGLOMERATIVE CLUSTERING  How many clusters to train?  We use Minimum Description Length (MDL), aka BIC, to choose the number of clusters.  Negative log-likelihood  + penalty term for number of clusters.  1. Run EM to convergence.  2. Merge the two most similar clusters.  3. Repeat 1, 2 until we have a single cluster.  4. Choose the parameter set with the smallest MDL.  [Figure: clusterings at several model orders, with MDL values 187.8, −739.8, −835.6, −201.5, 1373] 31
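The slide's MDL equations were images and are not recoverable, so the following is a common MDL/BIC-style criterion consistent with the description above (negative log-likelihood plus a complexity penalty); the exact penalty used in the thesis may differ.

```latex
% N data points, K clusters, P_K free parameters in the K-cluster model:
\mathrm{MDL}(K, \Theta) \;=\; -\sum_{n=1}^{N} \log p(\vec{x}_n \mid \Theta)
\;+\; \frac{P_K}{2} \log N
```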
  • 32. DRUM DETECTION SYSTEM  [System diagram — Training: training data (drum-wise audio) → Onset Detection → Spectrogram Slice Extraction → Gamma Mixture Model → cluster parameters (drum templates). Performance: performance (raw audio) → Onset Detection → Spectrogram Slice Extraction → Non-negative Vector Decomposition (using the drum templates) → drum activations] 32
  • 33. DECOMPOSING ONSETS ONTO TEMPLATES  Non-negative Vector Decomposition (NVD)  A simplification of Non-negative Matrix Factorization (NMF): x ≈ Wh  W matrix contains drum templates in its columns.  Adding a sparsity penalty (L1) on h improves NVD. 33  [Figure: template matrix W — head and tail templates for bass, snare, hi-hat (closed), hi-hat (open), ride]
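As a sketch of NVD with fixed templates, the following uses the standard multiplicative update for a least-squares cost with an L1 penalty on h (the thesis actually uses the IS divergence; see the backup slides near the end). The function name, iteration count, and penalty weight are illustrative.

```python
import numpy as np

def nvd(x, W, n_iter=200, l1=0.1):
    """Non-negative decomposition of one spectrogram slice x onto fixed
    templates (the columns of W): x ~= W h with h >= 0. This is NMF with
    W held fixed, solved by multiplicative updates for the cost
    0.5*||x - W h||^2 + l1*||h||_1."""
    h = np.full(W.shape[1], 1e-2)
    for _ in range(n_iter):
        h *= (W.T @ x) / (W.T @ (W @ h) + l1 + 1e-12)  # stays non-negative
    return h
```

Because the update is purely multiplicative, entries of h can only shrink toward zero, never go negative, which is what enforces the non-negativity constraint.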
  • 34. DECOMPOSING ONSETS ONTO TEMPLATES  What do we do with the output of NVD?  The head template activations for a single drum are summed to get the total activation of that drum.  The tail template activations are discarded.  They simply serve as “decoys” so that the long decay of a previous onset does not affect the current decomposition as drastically. 34  [Figure: template matrix W and NVD output h — head and tail template activations for bass, snare, hi-hat (closed), hi-hat (open), ride; head activations are summed into drum activations]
  • 35. EVALUATION  Test Data:  23 minutes total, 8022 drum onsets  8 different drums/cymbals:  Bass, snare, hi-hat (open/closed), ride, 2 toms, crash.  Recorded to stereo using multiple microphones.  50-100 training hits per drum.  Parameters to vary for testing:  Maximum number of templates per drum {0,1,30}  Result Metrics  Detection accuracy:  F-Score  Amplitude Fidelity  Cosine similarity 35
  • 36. DETECTION RESULTS  Varying maximum templates per drum.  Adding tail templates helps the most.  > 1 head template helps too. 36
  • 37. AUDIO EXAMPLES  100 BPM, rock, syncopated snare drum, fills/flams.  Note the messy fills in the KH=1, KT=0 version Original Performance KH=30, KT=30 KH=1, KT=0 37
  • 38. AUDIO EXAMPLES  181 BPM, fast rock, open hi-hat  Note the extra hi-hat notes in the KH=1, KT=0 version. Original Performance KH=30, KT=30 KH=1, KT=0 38
  • 39. AUDIO EXAMPLES  94 BPM, snare drum march, accented notes  Note the extra bass drum notes and many extra cymbals in the KH=1, KT=0 version. Original Performance KH=30, KT=30 KH=1, KT=0 39
  • 40. DRUM DETECTION SUMMARY  Drum detection front end for a complete drum understanding system.  Gamma Mixture Model  Cheaper to train than GMM (no covariance matrix)  More stable than GMM (no covariance)  Allows soft clustering with perceptual Itakura-Saito distance in the linear domain (important for learning NVD templates).  Non-negative Vector Decomposition  Greatly improved with tail templates and multiple head templates per drum.  Next steps  Online training of templates.  Training a more general model using many templates 40
  • 41. LIVE DRUM UNDERSTANDING SYSTEM  [System diagram: drum audio → Drum Detection → drum-wise activations → Beat Tracking → candidate beat periods/phases → Drum Pattern Analysis → beat locations, pattern analysis]  Drum Detection: Gamma Mixture Model training of drum templates; non-negative decomposition onto templates.  Drum Pattern Analysis: generative deep learning of drum patterns; stacked Conditional Restricted Boltzmann Machines. 41
  • 42. LIVE DRUM UNDERSTANDING SYSTEM  [System diagram: drum audio → Drum Detection → drum-wise activations → Beat Tracking → beat grid → Drum Pattern Analysis → beat locations, pattern analysis]  Drum Detection: Gamma Mixture Model training of drum templates; non-negative decomposition onto templates.  Drum Pattern Analysis: generative deep learning of drum patterns; stacked Conditional Restricted Boltzmann Machines. 42
  • 43. DRUM PATTERN ANALYSIS  Desired information:  What style is this?  What is the meter? (4/4, 3/4…)  Double/half time feel?  Where is the “one”?  Typical approach:  Drum pattern template correlation.  Align one or more templates with drum onsets.  Instead: a deep neural network. 43
  • 44. CONDITIONAL RESTRICTED BOLTZMANN MACHINE  Modeling motion, language models, other sequences.  Models: P(v|y) — current notes v given past notes y.  For music:  P(current notes|past notes)  Useful for generating drum patterns.  Used to pre-train the neural network.  Intuition for drums  Hidden units model the drummer's options given the recent past.  And therefore, code information about the state of the drumming.  [Figure: past notes and current notes both connect to the hidden units] 44
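A minimal sketch of the CRBM's conditionals (illustrative sizes and random weights, not the trained model): the past notes y enter only through dynamic biases, so the conditionals stay factorial and inference remains as cheap as in a plain RBM.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)

# Illustrative sizes only: v = current notes, y = two measures of past notes.
n_vis, n_past, n_hid = 8, 8 * 32, 200
W = rng.normal(scale=0.01, size=(n_hid, n_vis))    # visible-hidden weights
A = rng.normal(scale=0.01, size=(n_vis, n_past))   # past -> visible (autoregressive)
B = rng.normal(scale=0.01, size=(n_hid, n_past))   # past -> hidden
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def crbm_conditionals(v, y):
    """Factorial conditionals of a CRBM: the past y only shifts the biases."""
    p_h = sigmoid(W @ v + B @ y + b_h)      # P(h_j = 1 | v, y)
    p_v = sigmoid(W.T @ p_h + A @ y + b_v)  # mean-field reconstruction of v
    return p_h, p_v
```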
  • 45. TEST SETUP  173 twelve-measure sequences  Recorded on Roland V-Drums  33,216 beat subdivisions (16 per measure).  Rock, funk, drum 'n' bass, metal, Brazilian rhythms.  Network configurations  Each with 2 measures of context  ~90,000 weights each  Conditional RBM (CRBM) is a variation on the RBM  1. CRBM (3 layers)  800 hidden units  2. CRBM + RBM (4 layers)  600, 50 hidden units  3. CRBM + CRBM (4 layers)  200, 50 hidden units  4. CRBM + CRBM + RBM (5 layers)  200, 25, 25 hidden units 45
  • 46. TRAINING  Training tricks  L2 weight decay, momentum, dynamic learning rate.  Implementation  Python with Gnumpy (numerical routines on GPU)  GPU computing very important to contemporary neural network research.  Around 30 minutes to train one model. 46
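A sketch of what one such parameter update might look like (function and parameter names are illustrative, not from the talk's code): L2 weight decay is folded into the gradient and a velocity term implements momentum; the learning rate `lr` would be decayed over training ("dynamic learning rate").

```python
import numpy as np

def sgd_update(W, grad, vel, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD step with momentum and L2 weight decay.
    `vel` is a persistent velocity array the same shape as W."""
    vel[:] = momentum * vel - lr * (grad + weight_decay * W)
    return W + vel
```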
  • 47. MEASURE ALIGNMENT RESULTS  3-fold cross-validated results.  Neural network models eliminate half the errors of template correlation.  Not a significant difference between NN models.  This will change with a larger, more diverse dataset. 47
  • 48. QUARTER NOTE ALIGNMENT RESULTS  Can compute quarter note alignment from full measure alignment probabilities.  Quarter note alignment can help correct a beat tracker when it gets out of phase.  Again, half the errors are eliminated. 48
  • 49. EXAMPLE OUTPUT  Label Probabilities (Red = 1, Blue = 0)  White lines denote measure boundaries  Each column shows the probability of each label at that time. 49
  • 50. EXAMPLE OUTPUT  Label Estimates (Red denotes estimate)  White lines denote measure boundaries 50
  • 51. HMM FILTERING  Can use label probabilities as observation posteriors in HMM.  Assign high probability to sequential label transitions.  Models producing lower cross-entropy improve more when using HMM-filtering. 51
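A minimal sketch of this kind of HMM filtering, treating the network outputs as per-subdivision observation posteriors and favoring the sequential transition from label l to l+1 (mod L), i.e., the alignment label advancing one subdivision per step. The transition probability here is an illustrative value, not the one used in the experiments.

```python
import numpy as np

def hmm_filter(posteriors, p_seq=0.99):
    """Forward-filter per-frame label posteriors (T x L array) through an
    HMM that assigns high probability to sequential label transitions."""
    T, L = posteriors.shape
    A = np.full((L, L), (1.0 - p_seq) / (L - 1))
    for l in range(L):
        A[l, (l + 1) % L] = p_seq               # sequential label transition
    alpha = posteriors[0] / posteriors[0].sum()
    filtered = [alpha]
    for t in range(1, T):
        alpha = (alpha @ A) * posteriors[t]     # predict, then reweight
        alpha /= alpha.sum()
        filtered.append(alpha)
    return np.array(filtered)
```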
  • 52. EXAMPLE OUTPUT  HMM-Filtered Label Probabilities 52
  • 53. ANALYSIS OF RESULTS  Generatively pre-trained neural network models eliminate about half the errors compared to a baseline template correlation method.  HMM-filtering can be used to improve accuracy.  Overfitting present in backpropagation.  Address overfitting with:  Larger dataset  “Dropout” [Hinton, 2013]  Many other regularization techniques  Next steps:  Important next step is evaluation with much larger, more diverse dataset.  Evaluate ability to do style, meter classification. 53
  • 54. SUMMARY  Content-Based Music Information Retrieval  Mood, genre, onsets, beats, transcription, recommendation, and much, much more!  Exciting directions include feature learning and RNNs.  Drum Detection  Learn multiple templates per drum using agglomerative gamma mixture model training.  The use of “tail” templates reduces false positives.  Drum Pattern Analysis  A deep neural network can be used to make all kinds of rhythmic inferences.  For measure alignment, reduces errors by ~50% compared to template correlation.  Generalization can be improved using more data and regularization techniques during backpropagation. 54
  • 55. GETTING INVOLVED IN MUSIC INFORMATION RETRIEVAL  Check out the proceedings of ISMIR (free online):  http://www.ismir.net/  Participate in MIREX (annual MIR eval):  http://www.music-ir.org/mirex/wiki/MIREX_HOME  Join the Music-IR mailing list:  http://listes.ircam.fr/wws/info/music-ir  Join the Music Information Retrieval Google Plus community (just started it):  https://plus.google.com/communities/109771668656894350107 55
  • 56. REFERENCES  Genre/Mood Classification:  [1] P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” Proc. of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 339–344, 2010.  [2] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, “Unsupervised learning of sparse features for scalable audio classification,” Proceedings of the International Symposium on Music Information Retrieval (ISMIR'11), 2011.  [3] J. Anden and S. Mallat, “Deep Scattering Spectrum,” arXiv preprint, 2013.  [4] E. Schmidt and Y. Kim, “Learning rhythm and melody features with deep belief networks,” ISMIR, 2013. 56
  • 57. REFERENCES  Onset Detection:  [1] J. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. Davies, and M. Sandler, “A tutorial on onset detection in music signals,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, p. 1035, 2005.  [2] A. Klapuri, A. Eronen, and J. Astola, “Analysis of the meter of acoustic musical signals,” IEEE Transactions on Speech and Audio Processing, vol. 14, no. 1, p. 342, 2006.  [3] J. Bello, C. Duxbury, M. Davies, and M. Sandler, “On the use of phase and energy for musical onset detection in the complex domain,” IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553–556, 2004.  [4] S. Böck, A. Arzt, F. Krebs, and M. Schedl, “Online realtime onset detection with recurrent neural networks,” presented at the International Conference on Digital Audio Effects (DAFx-12), 2012.  [5] F. Eyben, S. Böck, B. Schuller, and A. Graves, “Universal onset detection with bidirectional long short-term memory neural networks,” Proc. of ISMIR, 2010. 57
  • 58. REFERENCES  Neural Networks:  [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, 2012.  Unsupervised Feature Learning:  [2] G. E. Hinton and S. Osindero, “A fast learning algorithm for deep belief nets,” Neural Computation, 2006.  [3] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” pp. 1096–1103, 2008.  Recurrent Neural Networks:  [4] I. Sutskever, “Training Recurrent Neural Networks,” 2013.  [5] D. Eck and J. Schmidhuber, “Finding temporal structure in music: Blues improvisation with LSTM recurrent networks,” pp. 747–756, 2002.  [6] S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural networks,” pp. 121–124, 2012.  [7] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” arXiv preprint arXiv:1206.6392, 2012.  [8] G. Taylor and G. E. Hinton, “Two Distributed-State Models For Generating High-Dimensional Time Series,” Journal of Machine Learning Research, 2011. 58
  • 59. REFERENCES  Drum Understanding:  [1] E. Battenberg, “Techniques for Machine Understanding of Live Drum Performances,” PhD Thesis, University of California, Berkeley, 2012.  [2] E. Battenberg and D. Wessel, “Analyzing drum patterns using conditional deep belief networks,” presented at the International Society for Music Information Retrieval Conference, Porto, Portugal, 2012.  [3] E. Battenberg, V. Huang, and D. Wessel, “Toward live drum separation using probabilistic spectral clustering based on the Itakura-Saito divergence,” presented at the Audio Engineering Society Conference: 45th International Conference: Applications of Time-Frequency Processing in Audio, 2012, vol. 3. 59
  • 62. ONSET DETECTION  Onset Detection Function (ODF): differentiated log-energy of multiple perceptual sub-bands.  Onsets are located using the ODF and a dynamic, causal peak-picking threshold.  [Pipeline: audio in → FFT (window size = 23ms, hop size = 5.8ms) → sub-band energy ci(n) (20 Bark bands/channel) → mu-law compression si(n) (µ = 10^8) → smoothing zi(n) (Hann window, 20 Hz cutoff) → half-wave rectified derivative dzi(n) → mean across sub-bands → onset detection function o(n)] 62
  • 63. DRUM SEPARATION  Some Approaches to “Drum Transcription”  Feature-based classification [Gouyon 2001]  NMF with general drum templates [Paulus 2005]  General “average” drum templates.  Match and Adapt [Yoshii 2007]  Offline, iterative  Requirements for “Drum Separation”  Online/Live operation  Useful with any percussion setup.  Which drum is playing when? And at what dynamic level? 63
  • 64. GAMMA DISTRIBUTION  Mixture model is composed of gamma distributions.  The gamma distribution models the sum of k independent exponential distributions. 64
  • 65. GAMMA MIXTURE MODEL  Multivariate Gamma (independent components):  Mixture density: 65
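The equations on this slide were images and are not recoverable, so the following is a reconstruction from standard definitions: a gamma density parameterized by shape k and mean µ (one common parameterization, consistent with the "variance increases with mean" property noted earlier, since Var = µ²/k), its independent-component multivariate form, and the mixture density.

```latex
% Gamma density with shape k and mean \mu:
p(x \mid k, \mu) = \frac{(k/\mu)^k}{\Gamma(k)}\, x^{k-1} e^{-k x / \mu}, \quad x > 0

% Multivariate form with independent components, and the K-component mixture:
p(\vec{x} \mid k, \vec{\mu}) = \prod_{m=1}^{M} p(x_m \mid k, \mu_m), \qquad
p(\vec{x}) = \sum_{j=1}^{K} \pi_j \, p(\vec{x} \mid k_j, \vec{\mu}_j)
```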
  • 66. THE EM ALGORITHM: GAMMA EDITION  E-step: (compute posteriors)  M-step: (update parameters) 66
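Again the slide's equations are lost; below is a standard EM step for such a mixture, offered as a sketch. The E-step computes responsibilities, and with the shape parameters held fixed, the M-step mean update is just a responsibility-weighted average; the thesis's exact updates (e.g., for the shapes) may differ.

```latex
% E-step: posterior responsibility of cluster j for data point x_n.
\gamma_{nj} = \frac{\pi_j \, p(\vec{x}_n \mid k_j, \vec{\mu}_j)}
                   {\sum_{l=1}^{K} \pi_l \, p(\vec{x}_n \mid k_l, \vec{\mu}_l)}

% M-step: update means and mixing weights (shapes held fixed here).
\vec{\mu}_j = \frac{\sum_n \gamma_{nj}\, \vec{x}_n}{\sum_n \gamma_{nj}}, \qquad
\pi_j = \frac{1}{N} \sum_n \gamma_{nj}
```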
  • 67. AGGLOMERATIVE CLUSTERING  How many clusters to train?  We use Minimum Description Length (MDL) to choose the number of clusters.  Negative log-likelihood  + penalty term for number of clusters.  1. Run EM to convergence.  2. Merge the two most similar clusters.  3. Repeat 1,2 until we have a single cluster.  4. Choose parameter set with smallest MDL. 67
  • 68. DECOMPOSING ONSETS ONTO TEMPLATES  To solve this problem (add an L1 penalty):  I use the IS distance as the cost function in the above.  While the IS distance is not strictly convex, in practice it is non-increasing under the following update rule: 68
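The update rule itself was an image on the slide; the following is the standard multiplicative update for minimizing the IS cost with an L1 penalty, i.e., d_IS(x ∥ Wh) + λ∥h∥₁ over h ≥ 0 (products, divisions, and powers taken elementwise). It matches the slide's description and may differ from the original only in notation.

```latex
% Multiplicative update for h under the IS divergence with L1 penalty \lambda:
\vec{h} \;\leftarrow\; \vec{h} \odot
\frac{W^{\top}\!\left[ (W\vec{h})^{-2} \odot \vec{x} \right]}
     {W^{\top} (W\vec{h})^{-1} + \lambda}
```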
  • 69. RESTRICTED BOLTZMANN MACHINE  Stochastic autoencoder [Hinton 2006]  Probabilistic graphical model  Weights/biases define:   Factorial conditionals:  Logistic activation probabilities  Binary units (as opposed to real-valued units)  Act as a strong regularizer (prevent overfitting)  Training: maximize likelihood of data under the model 69
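The defining equations were images on the slide; these are the standard binary-RBM formulas the bullets refer to: the energy function and joint distribution defined by the weights and biases, and the factorial conditionals with logistic activation probabilities.

```latex
% Energy and joint distribution (Z is the intractable partition function):
E(\vec{v}, \vec{h}) = -\vec{v}^{\top} W \vec{h} - \vec{b}^{\top}\vec{v} - \vec{c}^{\top}\vec{h},
\qquad
P(\vec{v}, \vec{h}) = \frac{e^{-E(\vec{v}, \vec{h})}}{Z}

% Factorial conditionals with logistic activation probabilities:
P(h_j = 1 \mid \vec{v}) = \sigma\Big(c_j + \sum_i v_i W_{ij}\Big), \qquad
P(v_i = 1 \mid \vec{h}) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big)
```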
  • 70. CONTRASTIVE DIVERGENCE  ML learning can be done using Gibbs sampling to compute samples from:  The joint  By taking alternating samples from:  But, this can take a very long time to converge.  Approximation: “Contrastive Divergence” [Hinton 2006]  Take only a few alternating samples (k) for each update. (CD-k) 70
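A minimal NumPy sketch of CD-1 for a binary RBM (illustrative learning rate; parameters updated in place): one up-down-up pass of alternating Gibbs sampling stands in for running the chain to equilibrium.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)

def cd1_update(W, b, c, v0, lr=0.01):
    """One CD-1 update for a binary RBM. W: (n_vis, n_hid), b: visible
    biases, c: hidden biases, v0: one binary data vector."""
    ph0 = sigmoid(c + v0 @ W)                    # P(h | v0), positive phase
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample hiddens
    pv1 = sigmoid(b + h0 @ W.T)                  # P(v | h0)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0     # one reconstruction sample
    ph1 = sigmoid(c + v1 @ W)                    # negative phase
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))  # approximate gradient
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```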
  • 71. DEEP BELIEF NETWORKS (STACKING RBMS)  Deep Belief Network (4+ layers)  Pre-train each layer as an RBM  Greedy Pre-training  Train first level RBM  Use real-valued hidden unit activations of an RBM as input to the subsequent RBM.  Train next RBM  Repeat until deep enough.  Fine-Tuning  After pre-training, the network is discriminatively fine-tuned using backpropagation of the cross-entropy error. 71
  • 72. EXAMPLE HMM FILTERING  White lines denote measure boundaries.  [Figure: classifier label probabilities vs. HMM-filtered label probabilities, LCRBM-800, test sequence 88; x-axis: beat subdivision (16 per measure); ground truth shown] 72
  • 73. EXAMPLE OUTPUT  HMM-Filtered Label Estimates  [Figure: HMM-filtered label estimates for test sequence 67 across models — LCRBM-800, LRBM-50 CRBM-600, LCRBM-50 CRBM-200, LRBM-25 CRBM-25 CRBM-200, and 2-Template Correlation; x-axis: beat subdivision (16 per measure); ground truth shown] 73

Editor's Notes

  1. GMM = Gaussian Mixture Model; SVM = Support Vector Machine
  2. In order to accomplish this we first need to detect when each drum is playing.
  3. Snare, bass, hi-hat is not all there is to drumming
  4. 80 Bark bands down from 513 positive FFT coefficients. Work in 2008 shows that this type of spectral dimensionality reduction speeds up NMF while retaining/improving separation performance
  5. Sound example is clearly not just amplitude scaling of a single sound [snare training sound clip]
  6. N – training points; M – dimensionality of training data; K – number of clusters; Theta – model parameters. We can afford to use K0 = N (number of training points)
  7. \vec{h}_i are template activations
  9. Superior Drummer 2.0 is highly multi-sampled, so it is a good approximation of a real-world signal (though it's probably a little too professional-sounding for most people)
  10. Actual optimal number of templates (bass, snare, closed hi-hat, open hi-hat, ride) — Head: {3,4,3,2,2}; Tail: {2,4,3,2,2}
  11. Actual optimal number of templates (bass, snare, closed hi-hat, open hi-hat, ride) — Head: {3,4,3,2,2}; Tail: {2,4,3,2,2}
  12. Actual optimal number of templates (bass, snare, closed hi-hat, open hi-hat, ride) — Head: {3,4,3,2,2}; Tail: {2,4,3,2,2}
  13. Dataset should be bigger
  14. Not a big difference between models. Which one to pick? Need a bigger dataset.
  15. Incorporate new results
  16. Snare, bass, hi-hat is not all there is to drumming
  17. Interesting to note that: as the mean increases, the variance increases. For constant mean, as k increases, variance decreases. As k → ∞, it approaches a Gaussian distribution
  18. Posterior is a weighted mean of exponentially scaled IS distances between cluster centers and the data point. Notice that there's no complex covariance update (or even a variance update)
  19. N – training points; M – dimensionality of training data; K – number of clusters; Theta – model parameters. We can afford to use K0 = N (number of training points)
  20. \vec{h}_i are template activations
  21. Computing Z is intractable, so computing the joint is difficult. Strong regularizer: learn binary features in a distributed representation