A Computational Framework for Sound Segregation in Music Signals using Marsyas

Slides of my talk @ Google - Auditory Modeling Workshop, Nov. 19, 2010


  1. A Computational Framework for Sound Segregation in Music Signals
     Luís Gustavo Martins
     CITAR / Escola das Artes da UCP, Porto, Portugal
     lmartins@porto.ucp.pt
     Auditory Modeling Workshop, Google, Mountain View, CA, USA, 19.11.2010
  2. Acknowledgments
     • This work is the result of a collaboration with:
       - University of Victoria, BC, Canada: George Tzanetakis, Mathieu Lagrange, Jennifer Murdock, and the whole Marsyas team
       - INESC Porto: Luis Filipe Teixeira, Jaime Cardoso, Fabien Gouyon
       - Technical University of Berlin, Germany: Juan José Burred
       - FEUP PhD advisor: Professor Aníbal Ferreira
     • Supporting entities:
       - Fundação para a Ciência e a Tecnologia (FCT)
       - Fundação Calouste Gulbenkian
       - VISNET II, NoE European Project
  3. Research Project
     • FCT R&D project (approved for funding): "A Computational Auditory Scene Analysis Framework for Sound Segregation in Music Signals"
     • 3-year project (starting Jan. 2011)
     • Partners:
       - CITAR (Porto, Portugal): Luís Gustavo Martins (PI), Álvaro Barbosa, Daniela Coimbra
       - INESC Porto (Porto, Portugal): Fabien Gouyon
       - UVic (Victoria, BC, Canada): George Tzanetakis
       - IRCAM (Paris, France): Mathieu Lagrange
     • Consultants:
       - FEUP (Porto, Portugal): Prof. Aníbal Ferreira, Prof. Jaime Cardoso
       - McGill University / CIRMMT (Montreal, QC, Canada): Prof. Stephen McAdams
  4. Summary
     • Problem Statement
     • The Main Challenges
     • Current State
     • Related Research Areas
     • Main Contributions
     • Proposed Approach
     • Results
     • Software Implementation
     • Conclusions and Future Work
  5. Problem Statement
     • Propose a computational sound segregation framework that is:
       - Focused on music signals (but not necessarily limited to them)
       - Perceptually inspired, so it can build upon current knowledge of how listeners perceive sound events in music signals
       - Causal, so it mimics the human auditory system and allows online processing of sounds
       - Flexible, so it can accommodate different perceptually inspired grouping cues
       - Generic, so it can be used in different audio and MIR application scenarios
       - Effective, so it can improve the extraction of perceptually relevant information from musical mixtures
       - Efficient, so it can find practical use in audio processing and MIR tasks
  6. The Main Challenges
     [Figure 2: The main types of auditory processing and their interactions - transduction, auditory grouping processes, extraction of attributes, event structure processing, abstract knowledge structures, attentional processes, and the mental representation of the sound environment (adapted from [McAdams and Bigand, 1993]).]
     • Human listeners are able to perceive individual sound events in complex mixtures
       - Even when listening to monaural music recordings, or to unknown sounds, timbres or instruments
     • Perception is influenced by several complex factors
       - Listener's prior knowledge, context, attention, ...
       - Based on both low-level and high-level cues
     • Difficult to replicate computationally...
  7. The Main Challenges
     • Why music signals?
     • Music is, in some senses, more challenging to analyse than non-musical sound
       - High time-frequency overlap of sources and sound events: music composition and orchestration means sources often play simultaneously (polyphony) and favor consonant pitch intervals, so the sound sources are highly correlated
       - High variety of spectral and temporal characteristics: musical instruments present a wide range of sound production mechanisms
     • Techniques traditionally used for monophonic, non-musical or speech signals perform poorly
     • Yet, music signals are usually well organized and structured
  8. Current State
     • Typical systems in MIR
       - Represent the entire sound mixture statistically
       - Analysis and retrieval performance has reached a "glass ceiling" [Aucouturier and Pachet, 2004]
     • New paradigm
       - Attempt to individually characterize the different sound events in a sound mixture
       - Performance still quite limited when compared to the human auditory system
       - But already provides alternative and improved approaches to common sound analysis and MIR tasks
  9. Applications
     • "Holy grail" applications
       - "The Listening Machine"
       - "The Robotic Ear"
     • "Down to earth" applications
       - Sound and music description
       - Sound manipulation
       - Robust speech and speaker recognition
       - Object-based audio coding
       - Automatic music transcription
       - Audio and music information retrieval
       - Auditory scene reconstruction
       - Hearing prostheses
       - Up-mixing
       - ...
  10. Related Research Areas
     • Sound and Music Computing (SMC) [Serra et al., 2007]
     • Computational Auditory Scene Analysis (CASA) [Wang and Brown, 2006]
     • Perception research
       - Psychoacoustics [Stevens, 1957]
       - Auditory Scene Analysis (ASA) [Bregman, 1990]
     • Digital Signal Processing [Oppenheim and Schafer, 1975]
     • Music Information Retrieval (MIR) [Downie, 2003]
     • Machine Learning [Duda et al., 2000]
     • Computer Vision [Marr, 1982]
  11. Related Areas
     • Auditory Scene Analysis (ASA) [Bregman, 1990]
       - How do humans "understand" sound mixtures?
       - Find packages of acoustic evidence such that each package has arisen from a single sound source
     • Grouping cues
       - Integration: simultaneous vs. sequential; primitive vs. schema-based
       - Cues: common amplitude and frequency modulation (common fate), harmonicity, time continuity, ...
  12. Related Areas
     • Computational Auditory Scene Analysis (CASA) [Wang and Brown, 2006]
       - "Field of computational study that aims to achieve human performance in ASA by using one or two microphone recordings of the acoustic scene." [Wang and Brown, 2006]
     [Figure 3: System architecture of a typical CASA system - acoustic mixture → analysis front-end → mid-level representation → scene organization (driven by grouping cues and source models) → stream resynthesis → segregated signals.]
  13. Main Contributions
     • Proposal and experimental validation of a flexible and efficient framework for sound segregation
       - Focused on "real-world" polyphonic music
       - Inspired by ideas from CASA
       - Causal and data-driven
     • Definition of a novel harmonicity cue
       - Termed Harmonically Wrapped Peak Similarity (HWPS)
       - Experimentally shown to be a good grouping criterion
     • Software implementation of the proposed sound segregation framework
       - Modular, extensible and efficient
       - Made available as free and open source software (FOSS)
       - Based on the Marsyas framework
  14. Proposed Approach
     • Assumptions
       - Perception primarily depends on the use of low-level sensory information
       - Does not necessarily require prior knowledge (i.e. training)
       - Still able to perform primitive identification and segregation of sound events in a sound mixture
     • Prior knowledge and high-level information can still be used
       - To award additional meaning to the primitive observations
       - To consolidate primitive observations as relevant sound events
       - To modify the listener's focus of attention
  15. Proposed Approach
     • System overview (processing pipeline):
       Sinusoidal Analysis (spectral peaks per 46 ms frame) → Texture Window (spectral peaks over ~150 ms) → Similarity Computation → Normalized Cut → Cluster Selection → Sinusoidal Synthesis
  16. Analysis Front-end
     • Sinusoidal modeling
       - Sum of the highest-amplitude sinusoids at each frame → peaks
       - Maximum of 20 peaks/frame
       - Window = 46 ms; hop = 11 ms
       - Parametric model: estimate the amplitude, frequency and phase of each peak
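As a rough illustration of this front-end, here is a minimal NumPy sketch (not the Marsyas implementation): it picks at most 20 magnitude local maxima per frame and returns their estimated frequency, amplitude and phase.

```python
import numpy as np

def extract_peaks(frame, sr, max_peaks=20):
    """Pick the highest-amplitude spectral peaks of one analysis frame.

    A simplified stand-in for the slide's sinusoidal front-end: Hann
    window, magnitude local maxima, at most 20 peaks per frame.
    (Refinements such as parabolic interpolation are omitted.)
    """
    windowed = frame * np.hanning(len(frame))
    spec = np.fft.rfft(windowed)
    mag = np.abs(spec)
    # local maxima: bins larger than both neighbours
    locs = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    # keep only the max_peaks strongest maxima
    locs = locs[np.argsort(mag[locs])[::-1][:max_peaks]]
    freqs = locs * sr / len(frame)       # bin index -> Hz
    amps = mag[locs]
    phases = np.angle(spec[locs])
    return freqs, amps, phases

# usage sketch: 46 ms analysis frames with an 11 ms hop
sr = 44100
frame_len, hop = int(0.046 * sr), int(0.011 * sr)
```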
  17. Time Segmentation
     • Texture windows
       - Construct a graph over a texture window of the sound mixture
       - Provides time integration: approaches partial tracking and source separation jointly (traditionally two separate, consecutive stages)
  18. Time Segmentation
     • Fixed-length texture windows (e.g. 150 ms)
     • Dynamically adjusted texture windows
       - Onset detector
       - Perceptually more relevant
       - 50 ms ~ 300 ms
     [Figure: waveform and spectral flux over time, with detected onsets delimiting texture windows 1-7.]
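A minimal sketch of the idea behind dynamically adjusted texture windows, assuming a simple spectral-flux detector with a median threshold (the actual Marsyas PeakerOnset chain, shown on slide 46, is more elaborate):

```python
import numpy as np

def spectral_flux_onsets(x, sr, frame_len=1024, hop=512, min_gap=0.05):
    """Toy spectral-flux onset detector (an assumption-level sketch).

    Flux = summed positive magnitude differences between adjacent
    frames; onsets = local maxima of the flux above its median, at
    least min_gap seconds apart. Returned onset times can delimit
    50-300 ms texture windows.
    """
    win = np.hanning(frame_len)
    mags = np.array([np.abs(np.fft.rfft(win * x[i:i + frame_len]))
                     for i in range(0, len(x) - frame_len, hop)])
    flux = np.sum(np.maximum(mags[1:] - mags[:-1], 0.0), axis=1)
    thresh = np.median(flux)
    onsets, last = [], -np.inf
    for i in range(1, len(flux) - 1):
        t = (i + 1) * hop / sr
        if (flux[i] > thresh and flux[i] >= flux[i - 1]
                and flux[i] >= flux[i + 1] and t - last >= min_gap):
            onsets.append(t)
            last = t
    return onsets
```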
  19. Perceptual Cues as Similarity Functions
     [Figure: Similarity Computation block - amplitude similarity, frequency similarity, harmonic similarity (HWPS), azimuth proximity, common onset/offset, source models, ... are combined into an overall similarity matrix over the spectral peaks of a 150 ms texture window, which feeds the Normalized Cut.]
  20. Perceptual Cues as Similarity Functions
     • Grouping cues (inspired by ASA)
       - Similarity between time-frequency components in a texture window: frequency proximity, amplitude proximity, harmonicity proximity (HWPS), ...
     • Encode topological knowledge into a similarity graph/matrix
       - Simultaneous integration (peaks within the same frame)
       - Sequential integration over the texture window
     [Figure: spectral peaks A0-A4, B0-B4 mapped to a weighted peak graph (w_ij = w_ji) and its similarity matrix.]
  21. Perceptual Cues as Similarity Functions
     • Defining a generic similarity function
       - Fully connected graphs with Gaussian edge weights:
         w_ij = exp(-(d(x_i, x_j) / σ)^2)
       - How to define the neighborhood width (σ)?
         Local statistics from the data in a texture window; prior knowledge (e.g. JNDs); use σ as weights (after normalizing the similarity function to [0,1])
     [Figure: Gaussian similarity w_ij as a function of d(x_i, x_j) for σ = 0.4, 1.0 and 1.2.]
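The generic similarity function translates directly into code; a small NumPy sketch of the edge weight above, plus its pairwise-matrix form:

```python
import numpy as np

def gaussian_similarity(xi, xj, sigma):
    """Edge weight w_ij = exp(-(d(x_i, x_j)/sigma)^2) of a fully
    connected peak graph; sigma sets the neighbourhood width."""
    d = np.linalg.norm(np.asarray(xi, float) - np.asarray(xj, float))
    return np.exp(-(d / sigma) ** 2)

def similarity_matrix(features, sigma):
    """All pairwise Gaussian similarities for an (n_peaks, dim)
    feature array, e.g. one feature per spectral peak."""
    diffs = features[:, None, :] - features[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    return np.exp(-(d / sigma) ** 2)
```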
  22. Perceptual Cues as Similarity Functions
     • Amplitude and frequency similarity
       - Gaussian functions of the Euclidean distances between peak amplitudes (in dB) and peak frequencies (in Bark), both perceptually more relevant scales:
         W_a(p_l^k, p_m^(k+n)) = exp(-((a_l^k - a_m^(k+n)) / σ_a)^2)
         W_f(p_l^k, p_m^(k+n)) = exp(-((f_l^k - f_m^(k+n)) / σ_f)^2)
       - Amplitudes are measured in decibels (dB) and frequencies in Barks (a frequency scale approximately linear below 500 Hz and logarithmic above), since these scales have been shown to better model the sensibility of the human ear [Hartmann, 1998]
     • Not sufficient to segregate harmonic events
     • Nevertheless important to group peaks from:
       - Inharmonic or noisy frequency components in harmonic sounds
       - Non-harmonic (unpitched) sounds
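A sketch of the two cues, assuming the Zwicker & Terhardt approximation for the Bark scale (the slides only specify "in Barks") and 20·log10 for dB:

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker & Terhardt approximation of the Bark scale (one of
    several published variants)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(7.6e-4 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def amp_similarity(a_l, a_m, sigma_a):
    """W_a: Gaussian similarity of peak amplitudes measured in dB."""
    db = lambda a: 20.0 * np.log10(np.maximum(a, 1e-12))
    return np.exp(-((db(a_l) - db(a_m)) / sigma_a) ** 2)

def freq_similarity(f_l, f_m, sigma_f):
    """W_f: Gaussian similarity of peak frequencies measured in Barks."""
    return np.exp(-((hz_to_bark(f_l) - hz_to_bark(f_m)) / sigma_f) ** 2)
```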
  23. Perceptual Cues as Similarity Functions
     • Harmonically Wrapped Peak Similarity (HWPS)
       - Harmonicity is one of the most powerful ASA cues [Wang and Brown, 2006]
       - Proposal of a novel harmonicity similarity function
         Does not rely on prior knowledge of the f0s in the signal
         Takes spectral information into account in a global manner (spectral patterns), for peaks in the same frame or in different frames of a texture window, and considers the amplitudes of the spectral peaks
       - 3-step algorithm: shifted spectral pattern → wrapped frequency space and histogram computation → discrete cosine similarity → [0,1]
     • Step 3 - Discrete cosine similarity: each shifted and harmonically wrapped spectral pattern F̂_l^k is discretized into an amplitude-weighted, 20-bin histogram H_l^k over the wrapped frequency range [0,1] (each peak contributes its amplitude; a 12- or 24-bin histogram would give a more musically meaningful chroma-based representation, but preliminary tests showed better results with 20 bins). The HWPS between peaks p_l^k and p_m^(k+n) is then a Gaussian function (σ_h = 1 in the current implementation) of the cosine distance between the two histograms:
         W_h(p_l^k, p_m^(k+n)) = HWPS(p_l^k, p_m^(k+n)) = exp(-(1 - c(H_l^k, H_m^(k+n)) / sqrt(c(H_l^k, H_l^k) · c(H_m^(k+n), H_m^(k+n))))^2)
         with c(H_a^b, H_c^d) = Σ_i H_a^b(i) × H_c^d(i)
       Due to the wrapping operation the histograms can be relatively small (e.g. 20 bins), which keeps the computation efficient. (A purely algorithmic correlation, as proposed in [Lagrange and Marchand, 2006], was found not to be reliable or robust in practice.)
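The following sketch is one reading of the three HWPS steps, under the assumptions stated in the comments (the shift aligns each pattern on the peak under consideration; h = min(f1, f2) as on slide 27); it is not a transcription of the Marsyas code:

```python
import numpy as np

def hwps(f1, freqs1, amps1, f2, freqs2, amps2, n_bins=20, sigma_h=1.0):
    """Sketch of Harmonically Wrapped Peak Similarity between two peaks
    (frequencies f1, f2), each with its frame's spectral pattern
    (freqs*, amps*). Assumptions: step 1 shifts each pattern so the
    peak under consideration sits at the origin; step 2 wraps shifted
    frequencies modulo h = min(f1, f2) onto [0, 1) (conservative f0
    estimate); step 3 compares amplitude-weighted 20-bin histograms
    with a cosine score mapped through a Gaussian with sigma_h = 1.
    """
    h = min(f1, f2)

    def histogram(peak_f, freqs, amps):
        shifted = np.asarray(freqs, float) - peak_f      # step 1
        wrapped = np.mod(shifted / h, 1.0)               # step 2
        hist, _ = np.histogram(wrapped, bins=n_bins, range=(0.0, 1.0),
                               weights=np.asarray(amps, float))
        return hist

    H1 = histogram(f1, freqs1, amps1)
    H2 = histogram(f2, freqs2, amps2)
    c = lambda a, b: float(np.dot(a, b))
    denom = np.sqrt(c(H1, H1) * c(H2, H2))
    cos = c(H1, H2) / denom if denom > 0 else 0.0        # step 3
    return np.exp(-((1.0 - cos) / sigma_h) ** 2)
```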
  24. Perceptual Cues as Similarity Functions
     • HWPS between peaks of the same harmonic "source", in the same frame → high similarity (~1.0)
     [Figure: peaks A0 (f0A) and A1 (2f0A) of source A in frame k; after shifting and harmonic wrapping, their spectral patterns produce matching histograms, so HWPS(A1, A0)|h=f0A is high.]
  25. Perceptual Cues as Similarity Functions
     • HWPS between peaks of different harmonic "sources", in the same frame → low similarity (~0.0)
     [Figure: peak A1 (2f0A) of source A and peak B0 (f0B) of source B in frame k; their shifted and wrapped spectral patterns produce mismatched histograms, so HWPS(A1, B0)|h=f0A is low.]
  26. Perceptual Cues as Similarity Functions
     • HWPS between peaks of the same harmonic "source", in different frames → mid-high similarity
       - Interfering spectral content may differ between frames
       - This degrades the HWPS... only consider bin 0?
     [Figure: peak A1 in frame k and peak A0 in frame k+n (where another source C is also active); the wrapped histograms still largely agree, so HWPS(A1^k, A0^(k+n))|h=f0A is mid-high.]
  27. Perceptual Cues as Similarity Functions
     • HWPS: impact of the f0 estimate (h)
       - Ideal: with prior knowledge of the fundamental frequencies, the wrapping operation would be perfect, parametrized as
         h = min(f0_l^k, f0_m^(k+n))
         where f0_l^k is the fundamental frequency of the source of peak p_l^k
       - Min peak frequency: without such a prior, a conservative approach is used instead, although it tends to over-estimate the fundamental frequency:
         h = min(f_l^k, f_m^(k+n))
         The wrapping frequency h is the same for both patterns under consideration, so the resulting shifted and wrapped frequency patterns are more similar if the peaks belong to the same harmonic "source"; they are also pitch invariant.
       - Highest-amplitude peak: select the highest-amplitude peak in the union of the two spectral patterns as the f0 estimate, i.e. h = f_i with i = argmax_i(A_i), A = A_l^k ∪ A_m^(k+n). The highest-amplitude partial in musical signals often corresponds to the fundamental frequency of the most prominent harmonic "source" active in that frame, although this assumption will not always hold.
       - Histogram-based f0 estimates (more robust, though more computationally expensive): compute all frequency differences between all peaks in each spectral pattern and build a histogram; its peaks are good f0 candidates (to avoid octave ambiguities, a second histogram of the differences between the candidate f0 values can be computed, keeping the highest peaks as final candidates). The HWPS can then be iteratively calculated for each f0 candidate, selecting the one with the best value. This technique could also prove an interesting way to robustly estimate the number of harmonic "sources" in each frame, including their pitches (pitch estimates == nr. of sources?), but experimental evaluations are still required to validate these approaches.
     [Figure: spectrum of the Tones A+B example with labeled peaks A0-A4 and B0-B4.]
  28. Similarity Combination
     [Figure: same Similarity Computation block as slide 19 - the individual cue similarities (amplitude, frequency, HWPS, azimuth proximity, common onset/offset, source models, ...) are combined into an overall similarity matrix that feeds the Normalized Cut.]
  29. Similarity Combination
     • Combining cues
       - Product operator [Shi and Malik, 2000], originally proposed for image segmentation: the overall similarity is high only if all cues are high. The current system combines the amplitude, frequency and HWPS grouping cues into a combined similarity function W:
         W(p_l, p_m) = W_afh(p_l, p_m) = W_a(p_l, p_m) × W_f(p_l, p_m) × W_h(p_l, p_m)
       - More expressive operators? E.g.
         W_afh = [(W_f ∧ W_a) ∨ W_h] ∧ W_s
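In code, the two combination rules might look as follows; mapping ∧ to an elementwise product and ∨ to an elementwise max is an assumption, suggested by the `.*` and `max` nodes in the Marsyas graph of slide 48 (W_s would be e.g. a stereo/azimuth cue):

```python
import numpy as np

def combined_similarity(Wa, Wf, Wh):
    """Product combination after Shi & Malik: the joint weight is
    high only if every cue-wise similarity matrix is high."""
    return Wa * Wf * Wh

def expressive_similarity(Wa, Wf, Wh, Ws):
    """One possible reading of W = [(W_f AND W_a) OR W_h] AND W_s,
    with AND as elementwise product and OR as elementwise max."""
    return np.maximum(Wf * Wa, Wh) * Ws
```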
  30. Segregating Sound Events
     • Segregation task
       - Carried out by clustering components that are close in the similarity space
       - Novel method based on spectral clustering: the Normalized Cut (Ncut) criterion
         Originally proposed for Computer Vision
         Takes cues as pair-wise similarities
         Clusters the peaks into groups taking all cues into account simultaneously
  31. Segregating Sound Events
     • Segregation task
       - Normalized Cut criterion
         Achieves a balanced clustering of elements
         Relies on the eigenstructure of a similarity matrix to partition points into disjoint clusters: points in the same cluster have high similarity, points in different clusters have low similarity
     [Figure: weighted peak graph (w_ij = w_ji), contrasting a minimum cut with the better, balanced normalized cut.]
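A compact sketch of the two-way Ncut step, using the standard symmetric reformulation of the Shi & Malik generalized eigenproblem; splitting at the median is a simplification (Shi & Malik search over thresholds):

```python
import numpy as np

def ncut_bipartition(W):
    """Two-way normalized cut (Shi & Malik): threshold the second
    smallest generalized eigenvector of (D - W) v = lambda D v, solved
    here via the symmetric form D^{-1/2} (D - W) D^{-1/2}. Returns a
    boolean cluster indicator per peak. Recursing on each side until a
    target cluster count or a stopping criterion is reached gives the
    divisive scheme of the following slides.
    """
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_isqrt[:, None] * W * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)        # ascending eigenvalues
    v = d_isqrt * vecs[:, 1]                  # 2nd smallest eigenvector
    return v > np.median(v)                   # balanced median split
```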
  32. Segregating Sound Events
     • Spectral clustering
       - Alternative to the traditional EM and k-means algorithms:
         Does not assume a convex-shaped data representation
         Does not assume a Gaussian distribution of the data
         Does not present multiple minima in the log-likelihood, avoiding multiple restarts of the iterative process
         Correctly handles complex and unknown cluster shapes
       - Already used for audio signals [Bach and Jordan, 2004]
  33. Segregating Sound Events
     • Divisive clustering approach
       - Recursive two-way cut: hierarchical partition of the data
         Recursively partitions the data into two sets
         Until a pre-defined number of clusters is reached (requires prior knowledge!), or until a stopping criterion is met
       - Current implementation
         Requires the definition of the number of clusters [Martins et al., 2007]
         Or alternatively partitions the data into 5 clusters and selects the 2 "denser" ones → segregation of the dominant clusters in the mixture [Lagrange et al., 2008a]
  34. Segregation Results
     [Figures: for the Tones A+B and Jazz1 examples - a) input spectral peaks; b)-c) clusters 1 and 2 using amplitude similarity; d)-e) clusters using frequency similarity; f)-g) clusters using HWPS similarity; h)-i) clusters using the combined similarities (frequency vs. time plots); plus the Tones A+B spectrum with labeled peaks A0-A4, B0-B4.]
  35. Results
     • Predominant melodic source segregation
       - Dataset of real-world polyphonic music recordings, with the original isolated tracks available as ground truth
       - Results (the higher the better): HWPS improves results, both when combined with the other similarity features and when compared with other state-of-the-art harmonicity features [Srinivasan and Kankanhalli, 2003] [Virtanen and Klapuri, 2000]
     [Figure: mean SDR (dB) over a 10-song dataset for A+F+HWPS, A+F+rHWPS, A+F+HV, A+F+HS and A+F.]
  36. Results
     • Predominant melodic source segregation: on the use of dynamic texture windows
       - Results (the higher the better): smaller improvement (0.15 dB) than expected, probably due to the cluster selection approach being used...
       - More computationally intensive (for longer texture windows)
  37. Results
     • Main melody pitch estimation
       - Resynthesize the segregated main-voice clusters and perform pitch estimation using a well-known monophonic pitch estimation technique (Praat)
       - Comparison with two techniques:
         Monophonic pitch estimation applied to the mixture audio (from Praat)
         State-of-the-art multi-pitch and main melody estimation algorithm applied to the mixture audio [Klapuri, 2006]
       - Results (the lower the better)
  38. Results
     • Voicing detection: identifying the portions of a music file containing vocals
       - Evaluated three feature sets:
         MFCC features extracted from the polyphonic signal
         MFCC features extracted from the segregated main voice
         Cluster Peak Ratio (CPR) feature, extracted from the segregated main-voice clusters
  39. Results
     • Timbre identification in polyphonic music signals [Martins et al., 2007]
       - Polyphonic, multi-instrumental audio signals: artificial mixtures of 2, 3 and 4 notes from real instruments
       - Automatic separation of the sound sources: sound sources and events are reasonably captured, corresponding in most cases to played notes
       - Matching of the separated events to a collection of 6 timbre models
     [Figure: pipeline - sinusoidal analysis and peak picking → sound source formation → notes 1..n → matching against timbre models → note/instrument labels.]
  40. Results
     • Timbre identification in polyphonic music signals [Martins et al., 2007]
       - Sound sources and events are reasonably captured, corresponding in most cases to played notes
  41. Results
     • Timbre identification in polyphonic music signals [Martins et al., 2007]
       - 6 instruments modeled [Burred et al., 2006]: piano, violin, oboe, clarinet, trumpet and alto sax
       - Modeled as sets of time-frequency templates that describe the typical evolution in time of the spectral envelope of a note, matched against the salient peaks of the spectrum
     [Figure: time-frequency templates (amplitude in dB vs. frequency and normalized time) for piano and oboe.]
  42. Results
     • Timbre identification in polyphonic music signals [Martins et al., 2007]
       - Instrument presence detection in mixtures of notes
       - 56% of instrument occurrences correctly detected, with a precision of 64% [Martins et al., 2007]
     [Figure: weak matching of an alto sax cluster to the piano prototype vs. strong matching of a piano cluster to the piano prototype.]
  43. Software Implementation
     • Modular, flexible and efficient software implementation
     • Based on Marsyas, a free and open source framework for audio analysis and processing: http://marsyas.sourceforge.net
     • Command-line tool: peakClustering myAudio.wav
  44. Software Implementation
     • Marsyas: peakClustering overview
     [Figure: the top-level Series/mainNet network - an Accumulator/textWinNet accumulates sinusoidal analysis frames over a texture window, a FlowThru/clustNet computes the similarities and the Normalized Cut, a PeakLabeler/labeler assigns cluster labels, and a Shredder/synthNet resynthesizes the selected clusters; PeakConvert/conv and PeakViewSink/peSink handle peak data (controls: frameMaxNumPeaks, totalNumPeaks, peakLabels, nTimes).]
  45. Software Implementation
     • Marsyas: sinusoidal analysis front-end
     [Figure: inside Accumulator/textWinNet - a FanOutIn/mixer sums the (optionally delayed and gain-scaled) input sources, followed by ShiftInput, Windowing and Spectrum stages, a stereo branch computing an EnhADRess stereo spectrum, and a FlowThru/onsetdetector that flags texture-window boundaries (controls: onsetDetected, flush).]
  46. Software Implementation
     • Marsyas: onset detection
     [Figure: FlowThru/onsetdetector - Windowing → Spectrum → PowerSpectrum → Flux → ShiftInput → forward/backward smoothing (Filter and Reverse pairs) → PeakerOnset, which emits the onsetDetected flag.]
  47. Software Implementation
     • Marsyas: similarity matrix computation and clustering
     [Figure: FlowThru/clustNet - a FanOutIn/simNet multiplies per-cue branches (PeakFeatureSelect → SimilarityMatrix with a Metric → RBF, for frequency, amplitude, HWPS and panning), feeding a Series/NCutNet whose NormCut and PeakClusterSelect stages produce the cluster selection indicator, which PeakLabeler/labeler writes back as peak labels.]
  48. Software Implementation
     • Marsyas: more flexible similarity expression
     [Figure: FanOutIn/simNet rewired as nested combiners - an ANDnet (elementwise product, .*) of the frequency and amplitude similarities, an ORnet (max) with the HWPS similarity, and a final product (.*) with the panning similarity.]
  49. Software Implementation
     • Marsyas: cluster resynthesis
     [Figure: Shredder/synthNet - a Series/postNet chaining PeakSynthOsc → Windowing → OverlapAdd → Gain → SoundFileSink.]
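A simplified stand-in for this resynthesis chain, assuming per-frame peak parameters as produced by the front-end sketch above (a bank of windowed oscillators plus overlap-add, not the PeakSynthOsc code itself):

```python
import numpy as np

def synthesize_cluster(peaks, n_frames, frame_len, hop, sr):
    """Bank-of-oscillators resynthesis of one selected cluster.

    `peaks` is assumed to be a list of (frame_index, freq_hz, amp,
    phase) tuples for the peaks whose cluster label was selected.
    Each peak contributes a windowed sinusoid to its frame, and the
    frames are overlap-added.
    """
    out = np.zeros(n_frames * hop + frame_len)
    t = np.arange(frame_len) / sr
    win = np.hanning(frame_len)
    win /= win.sum() / hop            # rough overlap-add normalization
    for k, f, a, phi in peaks:
        start = k * hop
        out[start:start + frame_len] += win * a * np.cos(2 * np.pi * f * t + phi)
    return out
```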
  50. Software Implementation
     • Marsyas: data structures
     [Figure: audio frames (N+1 samples, two channels), shifted analysis windows (ShiftInput), complex spectra (N points) and the stereo spectrum (N/2+1 points) per texture-window frame; peak matrices storing frequency, amplitude, phase, group and track for up to frameMaxNumPeaks peaks per frame.]
  51. Software Implementation
     • Marsyas: data structures
     [Figure: the similarity matrix over the total number of peaks in a texture window; the Ncut indicator and cluster selection indicator vectors; per-peak feature vectors (frequencies C1, amplitudes C2) and the per-peak spectral patterns (C3) used by the HWPS.]
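For illustration, the per-texture-window peak storage could be realized as parallel matrices like these; the names and the NaN convention are assumptions for the sketch, not the Marsyas memory layout:

```python
import numpy as np

n_frames, max_peaks = 14, 20   # e.g. a ~150 ms texture window, 20 peaks/frame

# Parallel (n_frames x frameMaxNumPeaks) matrices, one row per analysis
# frame; NaN (or -1 for labels) marks unused peak slots.
peak_freq  = np.full((n_frames, max_peaks), np.nan)   # Hz
peak_amp   = np.full((n_frames, max_peaks), np.nan)   # linear amplitude
peak_phase = np.full((n_frames, max_peaks), np.nan)   # radians
peak_group = np.full((n_frames, max_peaks), -1)       # cluster label from Ncut
```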
  52. Conclusions
     • Proposal of a framework for sound source segregation
       - Inspired by ideas from CASA
       - Focused on "real-world" music signals
       - Designed to be causal and efficient
       - Data-driven: does not require any training or prior knowledge about the audio signals under analysis
       - Approaches partial tracking and source separation jointly
       - Flexible enough to include new perceptually motivated auditory cues
       - Based on a spectral clustering technique
     • Shows good potential for applications
       - Source segregation/separation
       - Monophonic or polyphonic instrument classification
       - Main melody estimation
       - Pre-processing for polyphonic transcription, ...
  53. Conclusions
     • Definition of a novel harmonicity cue
       - Termed Harmonically Wrapped Peak Similarity (HWPS)
       - Experimentally shown to be a good grouping criterion for sound segregation in polyphonic music signals
       - Compares favorably to other state-of-the-art harmonicity cues
     • Software development of the sound segregation framework
       - Used for validation and evaluation
       - Made available as Free and Open Source Software (FOSS), based on Marsyas
       - Free for everyone to try, evaluate, modify and improve
  54. Future Work
     • Analysis front-end
       - Evaluate alternative analysis front-ends: perceptually informed filterbanks, sinusoid+transient representations, a different auditory front-end (as long as it is invertible), ...
       - Evaluate alternative frequency estimation methods for the spectral peaks: parabolic interpolation, subspace methods, ...
     • Use of a beat-synchronous approach
       - Based on onset detectors and beat estimators for the dynamic adjustment of texture windows
       - Perceptually motivated
  55. Future Work
     • Grouping cues
       - Improve the HWPS: better f0 candidate estimation; reduce the negative impact of sound events in different audio frames
       - Inclusion of new perceptually motivated auditory cues: time and frequency masking; stereo placement of spectral components (for stereo signals); timbre models as a priori information; peak tracking as pre- and post-processing; common fate (onsets, offsets, modulation)
  56. Future Work
     • Implement sequential integration between texture windows
       - Cluster the segregated clusters?
       - Timbre similarity [Martins et al., 2007]
  57. Future Work
     • Clustering
       - Definition of the neighborhood width (σ) in the similarity functions: JNDs?
       - Define and evaluate more expressive combinations of similarity functions
       - Automatic estimation of the number of clusters in each texture window
       - Extraction of new descriptors directly from segregated cluster parameters (e.g. CPR): pitch, spectral features, frequency tracks, timing information
  58. Future Work
     • Creation of a sound/music evaluation dataset
       - Simple and synthetic sound examples, for preliminary testing, fine tuning and validation
       - "Real-world" polyphonic recordings: more complex signals, for final stress-test evaluations
       - To be made publicly available
     • Software framework
       - Analysis and processing framework based on Marsyas
       - FOSS, C++, multi-platform, real-time
       - Feature-rich software visualization and sonification tools
  59. Related Publications
     • PhD thesis: Martins, L. G. (2009). A Computational Framework for Sound Segregation in Music Signals. PhD thesis, FEUP.
     • Book: Martins, L. G. (2009). A Computational Framework for Sound Segregation in Music Signals - An Auditory Scene Analysis Approach for Modeling Perceptual Grouping in Music Listening. Lambert Academic Publishing.
     • Book chapter: Martins, L. G., Lagrange, M., and Tzanetakis, G. (2010). Modeling grouping cues for auditory scene analysis using a spectral clustering formulation. In Machine Audition: Principles, Algorithms and Systems. IGI Global.
  60. 60. Related Publications A Computational Framework for Sound Segregation in Music Signals79 }  Lagrange, M., Martins, L. G., Murdoch, J., and Tzanetakis, G. (2008). Normalized cuts for predominant melodic source separation. IEEETransactions on Audio, Speech, and Language Processing, 16(2). Special Issue on MIR. }  Martins, L. G., Burred, J. J.,Tzanetakis, G., and Lagrange, M. (2007). Polyphonic instrument recognition using spectral clustering. In Proc. International Conference on Music Information Retrieval (ISMIR),Vienna,Austria. }  Lagrange, M., Martins, L. G., and Tzanetakis, G. (2008).A computationally efficient scheme for dominant harmonic source separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), LasVegas, Nevada, USA. }  Tzanetakis, G., Martins, L. G.,Teixeira, L. F., Castillo, C., Jones, R., and Lagrange, M. (2008). Interoperability and the Marsyas 0.2 runtime. In Proc. International Computer Music Conference (ICMC), Belfast, Northern Ireland. }  Lagrange, M., Martins, L. G., and Tzanetakis, G. (2007). Semi-automatic mono to stereo up-mixing using sound source formation. In Proc. 112th Convention of the Audio Engineering Society,Vienna,Austria.
  61. Thank you
     Questions?
     lmartins@porto.ucp.pt
     http://www.artes.ucp.pt/citar/
