MLConf2013: Teaching Computers to Listen to Music

Shortened version of my previous talk for MLConf2013, Nov 15 in San Francisco.

Slide notes:
  • And ultimately, transform how people experience their favorite movies, TV shows and music
  • GMM = Gaussian Mixture Model; SVM = Support Vector Machine
  • Snare, bass, hi-hat is not all there is to drumming
  • Feature extraction -> machine learning
  • 80 Bark bands down from 513 positive FFT coefficients. Work in 2008 shows that this type of spectral dimensionality reduction speeds up NMF while retaining/improving separation performance.
  • \vec{h}_i are template activations
  • Superior Drummer 2.0 is highly multi-sampled, so it is a good approximation of a real-world signal (though it's probably a little too professional-sounding for most people)
  • Actual optimal number of templates per drum (bass, snare, closed hi-hat, open hi-hat, ride): head {3, 4, 3, 2, 2}, tail {2, 4, 3, 2, 2}

    1. TEACHING COMPUTERS TO LISTEN TO MUSIC
       Eric Battenberg, ebattenberg@gracenote.com, http://ericbattenberg.com
       Gracenote, Inc. Previously: UC Berkeley, Dept. of EECS, Parallel Computing Laboratory; CNMAT (Center for New Music and Audio Technologies)
    2. BUSINESS VERTICALS
       • Rich, diverse multimedia metadata
       • Cloud music services, mobile apps and devices
       • Smart TVs, second-screen apps, targeted ads
       • Connected infotainment systems on the road
    3. SOME OF GRACENOTE'S PRODUCTS
       • Original product: CDDB (1998) - recognize CDs and retrieve associated metadata
       • Scan & Match to precisely identify digital music quickly in large catalogs
       • ACR (Automatic Content Recognition): audio/video fingerprinting enabling second-screen experiences; real-time audio classification systems for mobile devices
       • Contextually aware cross-modal personalization: taste profiling for music, TV, movies, and social networks; music mood descriptors that combine machine learning and editorial training data
    4. GRACENOTE'S CUSTOMERS: Music, Video, Auto
    5. (Image-only slide)
    6. MACHINE LISTENING
       • Speech processing: makes up the vast majority of funded machine listening research. Just as there is more to computer vision than OCR, there is more to machine listening than speech recognition!
       • Audio content analysis: audio classification (music, speech, noise, laughter, cheering); audio fingerprinting (e.g., Gracenote, Shazam); audio event detection (new song, channel change, hotword)
       • Content-based Music Information Retrieval (MIR): today's topic
    7. GETTING COMPUTERS TO "LISTEN" TO MUSIC
       • Not trying to get computers to "listen" for enjoyment. More accurate: analyzing music with computers.
       • What kind of information to get out of the analysis? What instruments are playing? What is the mood? How fast/slow is it (tempo)? What does the singer sound like? How can I play this song on my instrument?
       (Images: Ben Harper, James Brown)
    8. CONTENT-BASED MUSIC INFORMATION RETRIEVAL
       Many tasks:
       • Genre and mood classification, auto-tagging
       • Beat tracking, tempo detection
       • Music similarity, playlisting, recommendation (heavily aided by collaborative filtering)
       • Automatic music transcription
       • Source separation (voice, drums, bass, ...)
       • Music segmentation (verse, chorus)
    9. TALK SUMMARY
       • Introduction to Music Information Retrieval (MIR)
       • Some common techniques: auto-tagging, onset detection
       • Exciting new research directions: deep learning!
       • Example: Live Drum Understanding - drum detection/transcription
    10. TALK SUMMARY (same outline as slide 9)
    11. QUICK LESSON: THE SPECTROGRAM
       • The spectrogram: very common feature used in audio analysis.
       • Time-frequency representation of audio.
       • Take the FFT of adjacent frames of audio samples and put them in a matrix.
       • Each column shows the frequency content at a particular time.
       (Figure: spectrogram, frequency vs. time)
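       A minimal spectrogram sketch in Python/NumPy (frame and hop lengths are illustrative; with frame_len=1024 each column holds the 513 positive FFT coefficients mentioned in the slide notes):

         import numpy as np

         def spectrogram(x, frame_len=1024, hop=512):
             """Magnitude spectrogram: FFT of overlapping, windowed frames."""
             window = np.hanning(frame_len)
             n_frames = 1 + (len(x) - frame_len) // hop
             frames = np.stack([x[i * hop : i * hop + frame_len] * window
                                for i in range(n_frames)])
             spec = np.abs(np.fft.rfft(frames, axis=1))  # keep positive frequencies only
             return spec.T   # rows = frequency bins, columns = time frames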
    12. MUSIC AUTO-TAGGING: TYPICAL APPROACHES
       • Typical approach: extract a bunch of hand-designed features describing small windows of the signal (e.g., spectral centroid, kurtosis, harmonicity, percussiveness, MFCCs, hundreds more), then train a GMM or SVM to predict genre/mood/tags.
       • Pros: works fairly well and was state of the art for a while; well-understood models, implementations widely available.
       • Cons: the bag-of-frames style approach lacks the ability to describe rhythm and temporal dynamics; getting further improvements requires hand-designing more features.
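       A hedged sketch of this bag-of-frames pipeline, assuming librosa and scikit-learn are available; the MFCC mean/std summary and the names train_paths, train_labels, and test_path are illustrative placeholders, not the exact features from the talk:

         import numpy as np
         import librosa
         from sklearn.svm import SVC

         def clip_features(path):
             """Bag-of-frames features: summarize per-frame MFCCs over the whole clip."""
             y, sr = librosa.load(path, sr=22050)
             mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, n_frames)
             return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

         X = np.stack([clip_features(p) for p in train_paths])    # one row per training clip
         clf = SVC(kernel="rbf").fit(X, train_labels)             # predict genre/mood/tag
         prediction = clf.predict(clip_features(test_path)[None, :])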
    13. LEARNING FEATURES: NEURAL NETWORKS
       • Each layer computes a non-linear transformation of the previous layer.
       • Each hidden layer can be thought of as a set of features.
       • Train to minimize output error, using backpropagation. Iterative steps: compute activations (feedforward), compute output error, backpropagate the error signal, compute gradients, update all weights.
       • Resurgence of neural networks: more compute (GPUs, Google, etc.), more data, a few new tricks...
       (Figure: feedforward/backpropagation diagram with sigmoid non-linearity)
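       A minimal one-hidden-layer network trained with backpropagation, sketched in NumPy; the layer sizes, learning rate, and the random (x, t) pair are placeholders rather than anything from the talk:

         import numpy as np

         def sigmoid(z):
             return 1.0 / (1.0 + np.exp(-z))

         rng = np.random.default_rng(0)
         n_in, n_hid, n_out = 80, 64, 10            # e.g., 80 spectral bands in, 10 tags out
         lr, n_steps = 0.1, 1000
         x, t = rng.random(n_in), rng.random(n_out) # one (input, target) training pair
         W1 = rng.normal(0, 0.1, (n_hid, n_in))
         W2 = rng.normal(0, 0.1, (n_out, n_hid))

         for _ in range(n_steps):
             h = sigmoid(W1 @ x)                    # feedforward: hidden activations
             y = sigmoid(W2 @ h)                    # feedforward: output activations
             dy = (y - t) * y * (1 - y)             # output delta (squared-error loss)
             dh = (W2.T @ dy) * h * (1 - h)         # backpropagated hidden delta
             W2 -= lr * np.outer(dy, h)             # gradient step on all weights
             W1 -= lr * np.outer(dh, x)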
    14. DEEP NEURAL NETWORKS
       • Millions to billions of parameters; many layers of "features".
       • Achieving state-of-the-art performance in vision and speech tasks.
       • Problem: vanishing error signal - the weights of the lower layers do not change much.
       • Solutions: train for a really long time; pre-train each hidden layer as an autoencoder [Hinton, 2006]; Rectified Linear Units [Krizhevsky, 2012].
       (Figure: feedforward/backpropagation diagram comparing sigmoid and rectifier non-linearities)
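       A tiny illustration of why the error signal shrinks through stacks of saturating units but not through rectifiers (gradient values only, no training):

         import numpy as np

         z = np.linspace(-8, 8, 5)
         sig = 1 / (1 + np.exp(-z))
         print(sig * (1 - sig))              # sigmoid gradient: at most 0.25, near 0 for large |z|
         print(np.where(z > 0, 1.0, 0.0))    # rectifier gradient: exactly 1 for every positive z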
    15. AUTOENCODERS AND UNSUPERVISED FEATURE LEARNING
       Many ways to learn features in an unsupervised way:
       • Autoencoders - train a network to reconstruct the input; the hidden activations become the features. Variants: denoising autoencoders [Vincent, 2008], sparse autoencoders.
       • Restricted Boltzmann Machine (RBM) [Hinton, 2006]
       • Clustering - K-means, mixture models, etc.
       • Sparse coding - learn an overcomplete dictionary of features with a sparsity constraint.
       (Figure: autoencoder trained to reconstruct its input - data, hiddens (features), reconstructed data)
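       A small autoencoder sketch, assuming scikit-learn's MLPRegressor; the random X stands in for a matrix of spectrogram frames, and the 32-unit bottleneck is an arbitrary choice:

         import numpy as np
         from sklearn.neural_network import MLPRegressor

         X = np.random.rand(5000, 80)        # placeholder: one 80-band spectrogram frame per row

         # Autoencoder: train the network to reproduce its own input through
         # a narrow hidden layer; the hidden activations are the learned features.
         ae = MLPRegressor(hidden_layer_sizes=(32,), activation="logistic",
                           max_iter=200).fit(X, X)

         # Hidden-layer activations = learned features for each frame
         features = 1 / (1 + np.exp(-(X @ ae.coefs_[0] + ae.intercepts_[0])))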
    16. MUSIC AUTO-TAGGING: NEWER APPROACHES
       Newer approaches to feature extraction:
       • Learn spectral features using Restricted Boltzmann Machines (RBMs) and Deep Neural Networks (DNNs) [Hamel, 2010] - good genre performance.
       • Learn sparse features using Predictive Sparse Decomposition (PSD) [Henaff, 2011] - good genre performance.
       • Learn beat-synchronous rhythm and timbre features with RBMs and DNNs [Schmidt, 2013] - improved mood performance.
       Pros:
       • Hand-designing individual features is not required.
       • Computers can learn complex high-order features that humans cannot hand-code.
       Missing pieces:
       • More work on incorporating time (context, rhythm, and long-term structure) into feature learning.
       (Figure: some learned frequency features, from [Henaff, 2011])
    17. RECURRENT NEURAL NETWORKS
       • Non-linear sequence model.
       • Hidden units have connections to the previous time step.
       • Unlike HMMs, can model long-term dependencies using a distributed hidden state.
       • Recent developments (plus more compute) have made them much more feasible to train.
       (Figure: standard recurrent neural network - input units, distributed hidden state, output units; from [Sutskever, 2013])
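       A minimal vanilla-RNN forward pass in NumPy, just to show how the hidden state carries context from frame to frame; the shapes and the random input sequence are illustrative:

         import numpy as np

         def rnn_forward(inputs, W_xh, W_hh, W_hy):
             """Vanilla RNN: each hidden state sees the current input and the previous state."""
             h = np.zeros(W_hh.shape[0])
             outputs = []
             for x_t in inputs:
                 h = np.tanh(W_xh @ x_t + W_hh @ h)   # recurrence on the hidden state
                 outputs.append(W_hy @ h)
             return np.array(outputs), h

         rng = np.random.default_rng(0)
         n_in, n_hid, n_out = 80, 32, 1
         W_xh = rng.normal(0, 0.1, (n_hid, n_in))
         W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
         W_hy = rng.normal(0, 0.1, (n_out, n_hid))
         seq = rng.random((100, n_in))                # e.g., 100 spectrogram frames
         y, _ = rnn_forward(seq, W_xh, W_hh, W_hy)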
    18. APPLYING RNNS TO ONSET DETECTION
       • Onset detection: detect the beginning of notes.
       • Important for music transcription, beat tracking, etc.
       • Onset detection can be hard for certain instruments with ambiguous attacks or in the presence of background interference.
       (Figure: audio signal and its onset envelope, from [Bello, 2005])
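       For reference, a classic non-neural onset detection function is spectral flux, sketched below; this is a simple baseline, not the RNN detector described on the next slide, and the fixed threshold is illustrative:

         import numpy as np

         def onset_envelope(spec):
             """Spectral flux: sum of positive frame-to-frame magnitude increases."""
             diff = np.diff(spec, axis=1)                 # spec: (n_bins, n_frames)
             return np.maximum(diff, 0.0).sum(axis=0)

         def pick_onsets(env, threshold):
             """Frames that are local maxima of the envelope and exceed the threshold."""
             mid = env[1:-1]
             peaks = (mid > env[:-2]) & (mid >= env[2:]) & (mid > threshold)
             return np.flatnonzero(peaks) + 1             # frame indices of detected onsets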
    19. ONSET DETECTION: STATE-OF-THE-ART
       • Uses recurrent neural networks (RNNs) [Eyben, 2010], [Böck, 2012]: spectrogram plus deltas at 3 time-scales feed a 3-layer RNN whose output is trained to predict onset locations.
       • 80-90% accurate (state-of-the-art), compared to 60-80% for earlier methods.
       • Can improve with more labeled training data, or possibly more unsupervised training.
    20. OTHER EXAMPLES OF RNNS IN MUSIC
       • Polyphonic piano note transcription [Böck, 2012]
       • Classical music generation and transcription [Boulanger-Lewandowski, 2012]
       • NMF-based source separation with predictive constraints from an RNN [ICASSP 2014 "Deep Learning in Music"]
       • RNNs are a promising way to model the long-term contextual and temporal dependencies present in music.
    21. TALK SUMMARY (same outline as slide 9)
    22. LIVE DRUM TRANSCRIPTION
       • Real-time/live operation.
       • Useful with any percussion setup: before a performance, we can quickly train the system for a particular percussion setup, or train a more general model for common drum types.
       • Amplitude (dynamics) information: very important for musical understanding.
    23. DRUM DETECTION SYSTEM
       (Block diagram. Training path: training data (drum-wise audio) -> onset detection -> spectrogram slice extraction -> gamma mixture model -> cluster parameters (drum templates). Performance path: performance (raw audio) -> onset detection -> spectrogram slice extraction -> non-negative vector decomposition using the drum templates -> drum activations -> source separation.)
    24. DRUM DETECTION SYSTEM (same block diagram as slide 23)
    25. SPECTROGRAM SLICES
       • Extracted at detected onsets.
       • Each slice contains 100 ms of audio across 80 frequency bands.
       (Figure: a detected onset and the extracted head slice, with markers at 33 ms and 67 ms)
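       A rough sketch of slice extraction; how frames are pooled into head/tail feature vectors is an assumption here, not a detail given on the slide:

         import numpy as np

         def extract_slices(spec, onset_frames, frames_per_100ms):
             """Grab a fixed-length spectrogram slice starting at each detected onset."""
             slices = []
             for f in onset_frames:
                 sl = spec[:, f : f + frames_per_100ms]    # spec: (80 bands, n_frames)
                 if sl.shape[1] < frames_per_100ms:         # skip onsets too close to the end
                     continue
                 # head = roughly the first third (~33 ms), tail = the remainder (~67 ms)
                 head, tail = np.hsplit(sl, [frames_per_100ms // 3])
                 slices.append((head.mean(axis=1), tail.mean(axis=1)))
             return slices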
    26. DRUM DETECTION SYSTEM (same block diagram as slide 23)
    27. MODELING DRUM SOUNDS WITH GAMMA MIXTURE MODELS
       • Instead of taking an "average" of all training slices for a single drum, model each drum using the means from a gamma mixture model.
       • Captures the range of sounds produced by each drum.
       • Cheaper to train and more stable than a GMM (no covariance matrix).
       • Cluster according to human auditory perception, not Euclidean distance.
       • Keep the audio spectrogram in the linear domain (not the log domain): important for linear source separation.
    28. DRUM DETECTION SYSTEM (same block diagram as slide 23)
    29. DECOMPOSING ONSETS ONTO TEMPLATES
       • Non-negative vector decomposition (NVD): a simplification of non-negative matrix factorization (NMF).
       • Solve x ≈ W h, where the input x is a spectrogram-slice vector, the columns of W contain the drum templates (bass, snare, closed hi-hat, open hi-hat, ride), and the output h holds the template activations.
       • Adding a sparsity penalty (L1) on h improves NVD.
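       A minimal NVD sketch using multiplicative updates for a Euclidean objective with an L1 penalty on h; the actual system may use a different divergence or solver, and the template count and sparsity weight below are illustrative:

         import numpy as np

         def nvd(x, W, sparsity=0.1, n_iter=200):
             """Find h >= 0 minimizing ||x - W h||^2 + sparsity * sum(h)."""
             h = np.full(W.shape[1], 1e-3)
             for _ in range(n_iter):
                 # multiplicative update keeps h non-negative at every step
                 h *= (W.T @ x) / (W.T @ (W @ h) + sparsity + 1e-12)
             return h    # activation (amplitude) of each drum template

         # Usage: x is one onset slice (80 bands), W is (80 bands x total templates)
         rng = np.random.default_rng(0)
         W = rng.random((80, 14))           # e.g., a few learned templates per drum
         x = W @ rng.random(14)             # synthetic mixture, for illustration only
         activations = nvd(x, W)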
    30. EVALUATION
       • Test data: recorded to stereo using multiple microphones; 8 different drums/cymbals; 50-100 training hits per drum.
       • Vary the maximum number of templates per drum.
    31. DETECTION RESULTS
       • Varying the maximum number of templates per drum: more templates = better accuracy.
       (Figure: detection accuracy (F-score) and amplitude accuracy (cosine similarity) vs. templates per drum)
    32. DRUM DETECTION: MAIN POINTS
       • Gamma mixture model for learning spectral drum templates: cheaper to train and more stable than a GMM; keep the audio spectrogram in the linear domain (not the log domain).
       • Non-negative vector decomposition (NVD) for computing template activations from drum onsets.
       • Learning multiple templates per drum improves source separation and detection.
    33. AUDIO EXAMPLES
       • 100 BPM, rock, syncopated snare drum, fills/flams.
       • Note the messy fills in the single-template version.
       (Audio: original performance; multiple templates per drum; single template per drum)
    34. AUDIO EXAMPLES
       • 181 BPM, fast rock, open hi-hat.
       • Note the extra hi-hat notes in the single-template version.
       (Audio: original performance; multiple templates per drum; single template per drum)
    35. AUDIO EXAMPLES
       • 94 BPM, snare drum march, accented notes.
       • Note the extra bass drum notes and many extra cymbals in the single-template version.
       (Audio: original performance; multiple templates per drum; single template per drum)
    36. SUMMARY
       • Content-based Music Information Retrieval: mood, genre, onsets, beats, transcription, recommendation, and much, much more! Exciting directions include deep/feature learning and RNNs.
       • Drum detection: gamma mixture model template learning; learning multiple templates = better source separation performance.
    37. TEACHING COMPUTERS TO LISTEN TO MUSIC
       Eric Battenberg, ebattenberg@gracenote.com, http://ericbattenberg.com
       Gracenote, Inc. Previously: UC Berkeley, Dept. of EECS, Parallel Computing Laboratory; CNMAT (Center for New Music and Audio Technologies)
    38. GETTING INVOLVED IN MUSIC INFORMATION RETRIEVAL
       • Check out the proceedings of ISMIR (free online): http://www.ismir.net/
       • Participate in MIREX (annual MIR evaluation): http://www.music-ir.org/mirex/wiki/MIREX_HOME
       • Join the Music-IR mailing list: http://listes.ircam.fr/wws/info/music-ir
       • Join the Music Information Retrieval Google Plus community (just started it): https://plus.google.com/communities/109771668656894350107
    39. REFERENCES - Genre/Mood Classification
       [1] P. Hamel and D. Eck, "Learning features from music audio with deep belief networks," Proc. of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 339-344, 2010.
       [2] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, "Unsupervised learning of sparse features for scalable audio classification," Proceedings of the International Symposium on Music Information Retrieval (ISMIR'11), 2011.
       [3] J. Anden and S. Mallat, "Deep Scattering Spectrum," arXiv.org, 2013.
       [4] E. Schmidt and Y. Kim, "Learning rhythm and melody features with deep belief networks," ISMIR, 2013.
    40. REFERENCES - Onset Detection
       [1] J. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, p. 1035, 2005.
       [2] A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Transactions on Speech and Audio Processing, vol. 14, no. 1, p. 342, 2006.
       [3] J. Bello, C. Duxbury, M. Davies, and M. Sandler, "On the use of phase and energy for musical onset detection in the complex domain," IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553-556, 2004.
       [4] S. Böck, A. Arzt, F. Krebs, and M. Schedl, "Online real-time onset detection with recurrent neural networks," International Conference on Digital Audio Effects (DAFx-12), 2012.
       [5] F. Eyben, S. Böck, B. Schuller, and A. Graves, "Universal onset detection with bidirectional long short-term memory neural networks," Proc. of ISMIR, 2010.
    41. REFERENCES - Neural Networks
       [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 2012.
       Unsupervised Feature Learning:
       [2] G. E. Hinton and S. Osindero, "A fast learning algorithm for deep belief nets," Neural Computation, 2006.
       [3] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," pp. 1096-1103, 2008.
       Recurrent Neural Networks:
       [4] I. Sutskever, "Training Recurrent Neural Networks," 2013.
       [6] D. Eck and J. Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," pp. 747-756, 2002.
       [7] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," pp. 121-124, 2012.
       [8] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv preprint arXiv:1206.6392, 2012.
       [9] G. Taylor and G. E. Hinton, "Two Distributed-State Models For Generating High-Dimensional Time Series," Journal of Machine Learning Research, 2011.
    42. REFERENCES - Drum Understanding
       [1] E. Battenberg, "Techniques for Machine Understanding of Live Drum Performances," PhD thesis, University of California, Berkeley, 2012.
       [2] E. Battenberg and D. Wessel, "Analyzing drum patterns using conditional deep belief networks," International Society for Music Information Retrieval Conference, Porto, Portugal, 2012.
       [3] E. Battenberg, V. Huang, and D. Wessel, "Toward live drum separation using probabilistic spectral clustering based on the Itakura-Saito divergence," Audio Engineering Society 45th International Conference: Applications of Time-Frequency Processing in Audio, 2012, vol. 3.
    43. THANK YOU!