- 1. Audio Source Separation Based on Low-Rank Structure and Statistical Independence. Daichi Kitamura, Research Associate, The University of Tokyo. Lecture at Nagoya University, May 30, 2017
- 2. Introduction • Daichi Kitamura (北村大地) • Research Associate at The University of Tokyo • Academic background – Kagawa National College of Technology (2005 ~ 2012) • B.S. in Engineering (March 2012) – Nara Institute of Science and Technology (2012 ~ 2014) • M.S. in Engineering (March 2014) – SOKENDAI (2014 ~ 2017) • Ph.D. in Informatics (March 2017) • Research topics – Media signal processing – Audio source separation
- 3. Contents • Research background – Audio source separation and its applications – Demonstration • Structural modeling of audio sources – Time-frequency representation – Low-rank modeling of audio spectrogram – Supervised audio source separation • Statistical modeling between sources – Blind audio source separation – Audio distribution and central limit theorem – Maximization of independence • Conclusion and future work
- 5. Research background • Audio source separation – A signal processing technique – Separates speech, music sounds, background noise, … – Reproduces the cocktail party effect on a computer
- 7. Research background • Applications of audio source separation – Hearing aids • Make it easier to talk in a loud environment – Speech recognition systems • Siri, Google search, Cortana, Amazon Echo, … – Automatic music transcription • Musical part separation (Vo., Gt., Ba., …) – Remixing of live-recorded music • Professional use (improving quality), personal use (DJ remixing), … [Figure: a CD recording is separated and then automatically transcribed]
- 8. Demonstration: speech source separation • Real-time speech source separation (video)
- 9. Demonstration: music source separation • A mixture of guitar, vocal, and keyboard is separated into its three parts • Listen carefully for the three parts in the mixture.
- 10. Contents • Research background – Audio source separation and its applications – Demonstration • Structural modeling of audio sources (for monaural signals) – Time-frequency representation – Low-rank modeling of audio spectrogram – Supervised audio source separation • Statistical modeling between sources (for stereo or multichannel signals) – Blind audio source separation – Audio distribution and central limit theorem – Maximization of independence • Conclusion and future work
- 12. Time-frequency representation of audio signals • Audio waveform in the time domain (speech)
- 13. Time-frequency representation of audio signals • Time-varying frequency structure – Short-time Fourier transform (STFT) • The time-domain waveform is cut into windowed frames (FFT length, shift length), and each frame is Fourier transformed • Spectrogram: complex-valued matrix (frequency × time) in the time-frequency domain • Power spectrogram: nonnegative real-valued matrix, obtained by the entry-wise absolute value and power
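The STFT pipeline on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecturer's code: the function name `stft_power`, the Hann window, the FFT length of 1024, the shift of 256, and the 1 kHz test tone are all assumptions chosen for the example.

```python
import numpy as np

def stft_power(x, fft_len=1024, shift=256):
    """Return (complex spectrogram, power spectrogram), each shaped (bins, frames)."""
    window = np.hanning(fft_len)                    # assumed window type
    n_frames = 1 + (len(x) - fft_len) // shift      # frames that fit entirely
    frames = np.stack(
        [x[j * shift : j * shift + fft_len] * window for j in range(n_frames)],
        axis=1,
    )
    spec = np.fft.rfft(frames, axis=0)              # Fourier transform of each frame
    power = np.abs(spec) ** 2                       # entry-wise absolute value and power
    return spec, power

# Hypothetical test signal: one second of a 1 kHz sine sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
spec, power = stft_power(x)
peak_bin = int(np.argmax(power.mean(axis=1)))
print(spec.shape, peak_bin * fs / 1024)
```

The spectrogram is complex-valued, while the power spectrogram is nonnegative, which is what makes it a valid input for the nonnegative factorization discussed later.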
- 14. Power spectrogram of speech
- 15. Power spectrogram of music
- 16. Structural properties (speech and music) • Sparse (for both speech and music) – Strong (yellow) components are few – Weak (darker) components are dominant • Continuous contour (only in speech) – The spectrum changes continuously and dynamically • Low rank (especially in music) – Similar patterns (similar timbres) appear many times
- 17. Comparison of low-rankness [Figure: spectrograms of drums, guitar, vocals, and speech]
- 18. Comparison of low-rankness • Low-rankness (simplicity of a matrix) – can be measured by the cumulative singular value (CSV) – Drums and guitar are quite low-rank • Vocals and speech are also low-rank to some extent – A music spectrogram can be modeled by a few patterns • Number of bases when the CSV reaches the 95% line (spectrogram size is 1025×1883): values marked on the plot are 7, 29, and around 90
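The low-rankness measure on this slide can be illustrated as follows. The sketch assumes that the CSV is the cumulative sum of singular values normalized by their total; the slide does not give the exact definition, and the function name `bases_at_csv` and the random test matrices are hypothetical.

```python
import numpy as np

def bases_at_csv(X, ratio=0.95):
    """Number of singular values needed for the normalized cumulative sum to reach `ratio`."""
    s = np.linalg.svd(X, compute_uv=False)      # singular values, descending
    csv = np.cumsum(s) / np.sum(s)              # assumed definition of the CSV
    return int(np.searchsorted(csv, ratio)) + 1

rng = np.random.default_rng(0)
low_rank = rng.random((60, 3)) @ rng.random((3, 200))   # built from 3 patterns
full_rank = rng.random((60, 200))                       # no shared patterns
n_low = bases_at_csv(low_rank)
n_full = bases_at_csv(full_rank)
print(n_low, n_full)
```

A matrix built from a few repeated patterns reaches the 95% line with very few bases, whereas an unstructured matrix needs many more, which mirrors the contrast between the instrument and speech spectrograms.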
- 19. Modeling technique of low-rank structures • Nonnegative matrix factorization (NMF) [Lee, 1999] – is a low-rank approximation using a limited number of bases • Bases and their coefficients must be nonnegative – can be applied to a power spectrogram • Spectral patterns (typical timbres) and their time-varying gains • Nonnegative matrix (power spectrogram, frequency × time) ≈ basis matrix (spectral patterns, frequency × basis) × activation matrix (time-varying gains, basis × time) • Dimensions: # of frequency bins, # of time frames, # of bases
- 20. Modeling technique of low-rank structures • Parameter optimization in NMF – Minimize a dissimilarity (distance) measure between the power spectrogram and the product of the basis and activation matrices – An arbitrary measure can be used • Squared Euclidean distance, etc. – A closed-form solution is still an open problem – Iterative calculation can minimize the measure • Multiplicative update rules [Lee, 2000] (shown for the case of the squared Euclidean distance)
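The multiplicative update rules for the squared Euclidean distance can be sketched as below. The symbols `X` (power spectrogram), `W` (basis matrix), and `H` (activation matrix) are assumed names, since the slide's own symbols were lost in extraction; the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np

def nmf_euclid(X, n_bases, n_iter=200, seed=0, eps=1e-12):
    """NMF with multiplicative updates for the squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_bases)) + eps     # spectral patterns
    H = rng.random((n_bases, X.shape[1])) + eps     # time-varying gains
    costs = []
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)        # update bases
        H *= (W.T @ X) / (W.T @ W @ H + eps)        # update activations
        costs.append(float(np.sum((X - W @ H) ** 2)))
    return W, H, costs

# Hypothetical nonnegative test matrix built from 4 patterns.
rng = np.random.default_rng(1)
X = rng.random((40, 4)) @ rng.random((4, 60))
W, H, costs = nmf_euclid(X, n_bases=4)
print(costs[0], costs[-1])
```

Because every update multiplies by a nonnegative ratio, the factors stay nonnegative without any explicit projection step.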
- 21. Modeling technique of low-rank structures • Example – A mixture of Pf. and Cl. is expressed as a superposition of rank-1 spectrograms
- 22. Modeling technique of low-rank structures • Example – Pf. and Cl. are separated! – Source separation based on NMF • is a clustering problem of the obtained spectral bases in the basis matrix – But how?
- 23. Supervised audio source separation with NMF • If source-wise training data is available: supervised NMF [Smaragdis, 2007], [Kitamura1, 2014] – Training stage: a spectral dictionary of Pf. is learned from the training data – Separation stage: the dictionary is given (fixed), and only its activations, the other bases, and their activations are optimized
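The separation stage can be sketched as follows, under loud assumptions: the names `F` (pre-trained dictionary), `G` (its activations), `H` (other bases), and `U` (their activations) are chosen for this example, the cost is the plain squared Euclidean distance, and any penalty terms used in the cited papers are omitted.

```python
import numpy as np

def supervised_nmf(X, F, n_other, n_iter=300, seed=0, eps=1e-12):
    """Fit X ~ F @ G + H @ U with the dictionary F kept fixed."""
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], X.shape[1])) + eps  # activations of the target
    H = rng.random((X.shape[0], n_other)) + eps     # bases of the other sources
    U = rng.random((n_other, X.shape[1])) + eps     # activations of the others
    costs = []
    for _ in range(n_iter):
        R = F @ G + H @ U
        G *= (F.T @ X) / (F.T @ R + eps)
        R = F @ G + H @ U
        H *= (X @ U.T) / (R @ U.T + eps)
        R = F @ G + H @ U
        U *= (H.T @ X) / (H.T @ R + eps)
        costs.append(float(np.sum((X - F @ G - H @ U) ** 2)))
    return G, H, U, costs

# Hypothetical data: a target built from F plus an interfering source.
rng = np.random.default_rng(2)
F = rng.random((40, 3))
X = F @ rng.random((3, 80)) + rng.random((40, 2)) @ rng.random((2, 80))
F_before = F.copy()
G, H, U, costs = supervised_nmf(X, F, n_other=2)
target_estimate = F @ G     # estimated spectrogram of the trained source
```

Only `G`, `H`, and `U` change during the iterations; the estimate of the trained source is the product of the fixed dictionary and its learned activations.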
- 24. Supervised audio source separation with NMF • Demonstration – Stereo music separation with supervised NMF [Kitamura, 2015] • Audio examples: original song, training sound of Pf., separated sound (Pf.), training sound of Ba., separated sound (Ba.)
- 25. Problem of supervised approach • Performance is limited – when the difference in timbre between the training data and the target source in the mixture becomes large • Example: the mixture contains an actual Pf. (with Tb.), while the training data is an artificial Pf. sound produced by MIDI with a slightly different timbre [Figure: spectra of the real sound and the artificial MIDI sound, amplitude [dB] versus frequency [kHz]; separated signal obtained by supervised NMF using the artificial Pf. as training data]
- 26. Adaptive supervised audio source separation • Supervised NMF with basis deformation [Kitamura, 2013] – employs a deformation term (with positive and negative entries) to adaptively deform the pre-trained bases in the separation stage • Training stage: bases are trained from slightly different training data • Separation stage: the trained bases are given
- 27. Adaptive supervised audio source separation • Constraint on the deformation term – The range of deformation is restricted (±30% in the illustrated case) – To avoid excess deformation of the pre-trained bases • Comparison on the mixture (actual Pf. & Tb.) with the same training data (artificial Pf. sound): separated signal by supervised NMF versus separated signal by supervised NMF with basis deformation
- 28. Adaptive supervised audio source separation • Demonstration – Separate actual instrumental sounds using artificial training data produced by a MIDI synthesizer • Audio examples: original song (actual instruments), training sound of Sax. (produced by MIDI), separated sound (Sax.), training sound of Ba. (produced by MIDI), separated sound (Ba.), and the residual sounds. Copyright © 2014 Yamaha Corp. All rights reserved.
- 29. Contents • Research background – Audio source separation and its applications – Demonstration • Structural modeling of audio sources (for monaural signals) – Time-frequency representation – Low-rank modeling of audio spectrogram – Supervised audio source separation • Statistical modeling between sources (for stereo or multichannel signals) – Blind audio source separation – Audio distribution and central limit theorem – Maximization of independence • Conclusion and future work
- 30. Multichannel recording using microphone array • Number of microphones and sources – Overdetermined situation (# of sources ≤ # of mics.) – Underdetermined situation (# of sources > # of mics.) • A priori information – Training data of the source, position of sources, room geometry, music scores, etc. – Blind source separation (BSS): without any a priori information • Signal flow: sources → mixing system → observed signals (microphone array) → demixing system → estimated signals • Examples: a CD holds a stereo signal (2-ch: L-ch and R-ch); one microphone gives a monaural signal (1-ch)
- 31. BSS and independent component analysis • Blind source separation (BSS) – Estimate the demixing system without any prior information about the mixing system • Typical BSS is based on statistical independence • Independent component analysis (ICA) [Comon, 1994] – How do we measure statistical independence? – Define a “distribution of audio signals” – Find the demixing system that maximizes independence
- 32. What is the distribution of audio signals? • Distribution of a speech waveform – Spikier and more heavy-tailed than a Gaussian (normal) distribution [Figure: waveform (amplitude versus time samples) and amplitude histogram; reference curve: Gaussian distribution]
- 33. What is the distribution of audio signals? • Distribution of a piano waveform – Spikier and more heavy-tailed than a Gaussian distribution [Figure: waveform (amplitude versus time samples) and amplitude histogram; reference curve: Laplace distribution]
- 34. What is the distribution of audio signals? • Distribution of a drums waveform – Spikier and more heavy-tailed than a Gaussian distribution [Figure: waveform (amplitude versus time samples) and amplitude histogram; reference curve: Cauchy distribution]
- 35. Central limit theorem • Audio source distributions are basically non-Gaussian – But we still don't know the source distribution • How should we model them for source separation? • Central limit theorem – “A sum of any kind of random variables always approaches a Gaussian distribution.”* (more precisely, this holds for the normalized sum of many independent random variables with finite variance) • Hard to believe? Let's see [Figure: r.v.s generated from a Laplace distribution and a uniform distribution are summed, and the result is compared with a Gaussian distribution] * Some r.v.s do not obey it, e.g., Cauchy r.v.s
- 36. Central limit theorem • One random variable is the pips of the first die, and another is the pips of the second die – The probability of each face is always 1/6 • Results of 1 million trials for each die [Figure: histograms (amount) for each die] – What about their sum?
- 37. Central limit theorem • One random variable is the pips of the first die, and another is the pips of the second die – The probability of each face is always 1/6 • Results of 1 million trials for each die – What about their sum? [Figure: histogram (amount) of the sum] Not a uniform distribution any more
- 38. Central limit theorem • One random variable is the pips of the first die, and another is the pips of the second die – The probability of each face is always 1/6 • Results of 1 million trials for each die [Figure: histograms (amount)]
- 39. Central limit theorem • One random variable is the pips of the first die, and another is the pips of the second die – The probability of each face is always 1/6 • Results of 1 million trials for each die – The distribution of the sum approaches a Gaussian distribution (central limit theorem)
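The dice experiment is easy to reproduce numerically. The sketch below is an illustration under assumptions: it uses 200,000 trials instead of 1 million to keep it light, sums 1, 2, and 30 dice, and measures closeness to a Gaussian with the excess kurtosis (which is 0 for a Gaussian), a measure the slides do not name.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    v = np.asarray(v, dtype=float)
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
n_trials = 200_000
kurt = {}
for n_dice in (1, 2, 30):
    # Each row is one trial; each column is one fair die with faces 1..6.
    total = rng.integers(1, 7, size=(n_trials, n_dice)).sum(axis=1)
    kurt[n_dice] = excess_kurtosis(total)

# Histogram of the sum of two dice: triangular, no longer uniform.
two_dice = rng.integers(1, 7, size=(n_trials, 2)).sum(axis=1)
counts = np.bincount(two_dice, minlength=13)
print(kurt, counts[2:])
```

A single die is uniform, the sum of two is triangular with its peak at 7, and the sum of many dice has an excess kurtosis close to 0, as expected from the central limit theorem.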
- 40. Central limit theorem in audio signals • Each signal is one speaker's speech signal – Signal length is around 3.3 s [Figure: waveforms (amplitude versus time samples) and amplitude histograms (amount)]
- 41. Central limit theorem in audio signals • Each signal is one speaker's speech signal – Signal length is around 3.3 s [Figure: waveform (amplitude versus time samples) and amplitude histogram (amount)]
- 42. Central limit theorem in audio signals • Each signal is one speaker's speech signal – Signal length is around 3.3 s [Figure: waveforms (amplitude versus time samples) and amplitude histograms (amount)]
- 43. Central limit theorem in audio signals • Each signal is one speaker's speech signal – Signal length is around 3.3 s [Figure: waveform (amplitude versus time samples) and amplitude histogram (amount)]
- 44. Central limit theorem in audio signals • Each signal is one speaker's speech signal – Signal length is around 3.3 s [Figure: waveform (amplitude versus time samples) and amplitude histogram (amount)] – Almost a Gaussian distribution (central limit theorem)
- 45. Principle of ICA • What we can say from the central limit theorem – The Gaussian distribution is the limit of a mixture of sources – If we maximize the non-Gaussianity of all signals, the signals will be the original sources before they were mixed • Basic principle of ICA: maximizing non-Gaussianity; more generally, maximizing independence between components • Mixing approaches a Gaussian (central limit theorem); ICA departs from a Gaussian
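The claim that mixing pushes signals toward a Gaussian can be checked with synthetic data. This is a minimal sketch, assuming Laplace-distributed sources as a stand-in for audio, a hypothetical 2×2 mixing matrix `A`, and excess kurtosis as the non-Gaussianity measure.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 200_000))        # independent, heavy-tailed
A = np.array([[1.0, 0.8],
              [0.7, 1.0]])                      # hypothetical mixing matrix
mixtures = A @ sources

kurt_src = [excess_kurtosis(s) for s in sources]
kurt_mix = [excess_kurtosis(x) for x in mixtures]
print(kurt_src, kurt_mix)
```

Both mixtures are less heavy-tailed than either source, so a demixing matrix that increases non-Gaussianity moves the signals back toward the original sources.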
- 46. Principle of ICA • Assumptions in ICA – 1. Sources are mutually independent – 2. Each source distribution is non-Gaussian – 3. The mixing system is invertible and time-invariant • Mixtures (observed signals) = mixing matrix × sources (latent components); the demixing matrix is the inverse of the mixing matrix
- 47. Principle of ICA • Uncertainty in ICA – 1. Signal scale (volume) cannot be determined – 2. Signal permutation (order) cannot be determined [Figure: for each case, sources (latent components) → mixtures (observed signals) → ICA → separated signals (estimated by ICA)]
- 48. Principle of ICA • Estimation in ICA – Maximize independence between source distributions – Log-likelihood function – Minimize the distance between the distribution of the separated signals and a non-Gaussian source distribution • Generally, the source distribution is set to an appropriate non-Gaussian distribution
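One common way to maximize the ICA log-likelihood is natural-gradient ascent. The sketch below is an illustration rather than the method shown in the lecture: it assumes an instantaneous (non-reverberant) mixture, a super-Gaussian source model proportional to 1/cosh(y) (whose score function is tanh), Laplace test sources, and a hypothetical mixing matrix `A`.

```python
import numpy as np

def ica_natural_gradient(X, n_iter=500, lr=0.1):
    """Estimate a demixing matrix W by natural-gradient maximum likelihood.

    Maximizes sum_t sum_n log p(y_n(t)) + T * log|det W| with y = W x,
    assuming p(y) proportional to 1 / cosh(y).
    """
    n_src, n_samples = X.shape
    W = np.eye(n_src)
    for _ in range(n_iter):
        Y = W @ X
        score = np.tanh(Y)      # -d/dy log p(y) for the assumed source model
        W += lr * (np.eye(n_src) - score @ Y.T / n_samples) @ W
    return W

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 20_000))       # independent non-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # hypothetical mixing matrix
W = ica_natural_gradient(A @ S)
G = np.abs(W @ A)                       # ideally a scaled permutation matrix
print(np.round(G, 3))
```

Each row of `W @ A` should have a single dominant entry. Which row picks which source, and at what scale, is not determined, which is exactly the scale and permutation uncertainty noted on the previous slide.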
- 49. ICA-based separation of reverberant mixture • Audio mixture in an actual environment – Convolutive mixture with reverberation • E.g., the reverberation of an office room lasts about 300 ms, and that of a concert hall more than 2000 ms – Each mixing coefficient becomes a mixing filter • How do we deconvolve them? – 1. Estimate a deconvolution filter • At 16 kHz sampling, a 300 ms filter has 4800 taps • The number of parameters to be estimated explodes – 2. Estimate demixing coefficients in the frequency domain • A frequency-wise demixing matrix is estimated by ICA • This encounters the permutation problem • Reverberation length = length of the convolution filter • Instantaneous mixture versus convolutive mixture
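The tap count quoted on this slide follows directly from the sampling rate and the reverberation length. The short sketch below reproduces the arithmetic; the helper names are hypothetical, and the parameter count assumes one FIR filter per source-microphone pair, which the slide does not state.

```python
def filter_taps(sampling_rate_hz, reverb_ms):
    """Number of filter taps needed to cover a reverberation of `reverb_ms`."""
    return sampling_rate_hz * reverb_ms // 1000

def time_domain_parameters(n_sources, n_mics, taps):
    """Assumed count: one FIR filter per source-microphone pair."""
    return n_sources * n_mics * taps

office = filter_taps(16000, 300)        # office room, about 300 ms
hall = filter_taps(16000, 2000)         # concert hall, 2000 ms
print(office, hall, time_domain_parameters(2, 2, office))
```

Even for two sources and two microphones the time-domain approach needs tens of thousands of coefficients, which motivates estimating a small demixing matrix per frequency bin instead.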
- 50. ICA-based separation of reverberant mixture • Frequency-domain ICA (FDICA) [Smaragdis, 1998] – Apply simple ICA to each frequency bin of the spectrogram (frequency bin × time frame) – The frequency-wise demixing matrix is the inverse of the frequency-wise mixing matrix
- 51. ICA-based separation of reverberant mixture • Permutation problem in frequency-domain ICA – The order of the separated signals in each frequency bin is shuffled* – The order must be aligned across all frequencies • Flow: sources 1 and 2 → mixtures 1 and 2 → ICA in all frequencies → permutation solver → separated signals 1 and 2 *Scales are also inconsistent, but they can be easily fixed.
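To make the alignment step concrete, here is a toy permutation solver for two sources. It is a deliberately simple heuristic, not one of the solvers named in this lecture: it assumes that the power envelopes of one source are similar across frequency bins and compares each bin with a running reference. The function name and the synthetic envelopes are hypothetical.

```python
import numpy as np

def align_two_sources(P):
    """Align source order across bins; P has shape (n_freq, 2, n_frames)."""
    aligned = P.copy()
    ref = aligned[0].copy()                     # running reference envelopes
    for f in range(1, aligned.shape[0]):
        keep = np.sum(ref * aligned[f])         # similarity in current order
        swap = np.sum(ref * aligned[f, ::-1])   # similarity if swapped
        if swap > keep:
            aligned[f] = aligned[f, ::-1].copy()
        ref += aligned[f]
    return aligned

# Synthetic envelopes: source 0 is active in the first half, source 1 in the second.
rng = np.random.default_rng(0)
n_freq, n_frames = 64, 100
env = np.full((2, n_frames), 0.01)
env[0, :50] = 1.0
env[1, 50:] = 1.0
clean = rng.uniform(0.5, 1.5, size=(n_freq, 2, 1)) * env

# Shuffle the source order in a random subset of bins, as FDICA would.
shuffled = clean.copy()
swapped_bins = rng.random(n_freq) < 0.5
shuffled[swapped_bins] = shuffled[swapped_bins][:, ::-1]
restored = align_two_sources(shuffled)
```

The result is consistent across frequency but may still be globally swapped, since the overall source order remains undetermined.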
- 52. ICA-based separation of reverberant mixture • Popular permutation solvers – Based on direction of arrival (DOA) • Frequency-domain ICA + DOA alignment [Saruwatari, 2006] – Based on a relative correlation among frequencies • Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] – Based on a low-rank modeling of each source • Independent low-rank matrix analysis (ILRMA) [Kitamura, 2016] • Demonstration of BSS using ILRMA – http://d-kitamura.net/en/demo_rank1_en.htm
- 53. Contents • Research background – Audio source separation and its applications – Demonstration • Structural modeling of audio sources – Time-frequency representation – Low-rank modeling of audio spectrogram – Supervised audio source separation • Statistical modeling between sources – Blind audio source separation – Audio distribution and central limit theorem – Maximization of independence • Conclusion and future work
- 54. Conclusions and future work • Audio source separation based on – Low-rank property • Nonnegative matrix factorization – Statistical independence • Blind source separation • For further improvement – Separation based on training with a huge dataset • Deep learning, denoising autoencoder, etc. • But each recording condition occurs just once – Informed source separation • Music scores could be powerful information • The user can guide the system, which leads to more accurate separation • Performance is still insufficient – Almost there? Not at all! • Make our life better. That's engineering.