Experimental analysis of optimal window length for independent low-rank matrix analysis

Daichi Kitamura, Nobutaka Ono, and Hiroshi Saruwatari, "Experimental analysis of optimal window length for independent low-rank matrix analysis," Proceedings of The 2017 European Signal Processing Conference (EUSIPCO 2017), pp. 1210–1214, Kos, Greece, August 2017 (Invited Special Session).
Presented at 25th European Signal Processing Conference (EUSIPCO) 2017, "SS14: Multivariate Analysis for Audio Signal Source Enhancement," 14:30-16:10, August 30, 2017.

  1. Experimental analysis of optimal window length for independent low-rank matrix analysis
     Daichi Kitamura (The University of Tokyo, Japan), Nobutaka Ono (National Institute of Informatics, Japan), Hiroshi Saruwatari (The University of Tokyo, Japan)
     25th European Signal Processing Conference (EUSIPCO) 2017, SS14: Multivariate Analysis for Audio Signal Source Enhancement, August 30, 14:30-16:10
  2. Contents
     • Background
       – Blind source separation (BSS) for audio signals
       – Motivation: fundamental limitation in frequency-domain BSS
     • Methods
       – Frequency-domain independent component analysis (FDICA)
       – Independent vector analysis (IVA)
       – Independent low-rank matrix analysis (ILRMA)
     • Experimental analysis
       – Optimal window length
         • Music signals and speech signals
         • Ideal case and more practical case
     • Conclusion
  4. Background
     • Blind source separation (BSS) for audio signals
       – separates original audio sources
       – does not require prior information about the recording conditions
         • locations of mics and sources, room geometry, timbres, etc.
       – is applicable to many audio applications
     • Only the “determined” situation is considered (# of mics = # of sources)
     • Diagram: sources → mixing system → observed signals (recording mixture) → demixing system (BSS) → estimated signals (e.g., separated guitar)
  5. History of BSS for audio signals
     • Basic theories and their evolution (*depicting only popular methods)
       – 1994: Independent component analysis (ICA)
       – 1998: Frequency-domain ICA (FDICA)
       – 1999: Nonnegative matrix factorization (NMF)
       – 2006: Independent vector analysis (IVA)
       – 2009: Itakura–Saito NMF (ISNMF)
       – 2011: Auxiliary-function-based IVA (AuxIVA)
       – 2012: Time-varying Gaussian IVA
       – 2013: Multichannel NMF
       – 2016: Independent low-rank matrix analysis (ILRMA)
     • Annotations on the timeline: many permutation solvers for FDICA; apply NMF to many tasks; generative models in NMF; many extensions of NMF
  7. Motivation: fundamental limitation of BSS
     • Mixing assumption in frequency-domain BSS
       – “Linear time-invariant mixture” or “rank-1 spatial model”: in each frequency bin and time frame, the observed multichannel signal is the frequency-wise mixing matrix applied to the source signals
       – Valid only when the window length used in STFT is sufficiently longer than the length of room reverberation
     • Too long a window also causes another problem
       – Number of time frames (samples) decreases
       – Statistical bias will increase and estimation becomes unstable
     • Trade-off between short and long windows [S. Araki+, 2003]
       – FDICA suffers from the trade-off
       – What about BSS methods with a structural source model?
         • IVA and ILRMA
     • Figure: performance versus window length, peaking at an optimal length
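The dependence of the mixing assumption on window length can be checked numerically. The following is a minimal single-channel sketch, not the authors' experiment: it convolves a white-noise source with a toy impulse response and measures how well the STFT of the result is approximated by a per-frequency multiplication. The signal, the 64-tap impulse response, the window lengths in samples, and the names `stft` and `narrowband_error` are all assumptions made for illustration.

```python
import numpy as np

def stft(x, win_len, shift):
    """Return a (freq bins, time frames) STFT using a Hamming window."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // shift
    frames = np.stack([x[j * shift : j * shift + win_len] * window
                       for j in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def narrowband_error(s, h, win_len):
    """Relative error of the approximation X[i, j] ~= H[i] * S[i, j]."""
    x = np.convolve(s, h)[: len(s)]          # time-domain convolutive mixing
    shift = win_len // 4                     # quarter-window shift
    X = stft(x, win_len, shift)
    S = stft(s, win_len, shift)
    H = np.fft.rfft(h, n=win_len)            # frequency response (zero-padded)
    return np.linalg.norm(X - H[:, None] * S) / np.linalg.norm(X)

rng = np.random.default_rng(0)
s = rng.standard_normal(32768)               # hypothetical white-noise source
# Hypothetical toy impulse response: 64 taps of exponentially decaying noise
h = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)

errors = {w: narrowband_error(s, h, w) for w in (128, 512, 2048)}
for w, e in errors.items():
    print(f"window = {w:5d} samples, relative error = {e:.3f}")
```

The error should shrink as the window grows relative to the impulse response, which is the "valid only when" condition on the slide. The cost of the longer window, fewer frames, is not visible in this error measure.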
  9. BSS methods: FDICA and IVA
     • Frequency-domain ICA (FDICA) [P. Smaragdis, 1998]
       – Source model: scalar r.v.s with a non-Gaussian source distribution
       – Update the separation filter (demixing matrix) so that the estimated signals obey the non-Gaussian distribution we assumed
       – Sources are mutually independent and each obeys a non-Gaussian distribution; the mixture is close to a Gaussian signal because of the central limit theorem (CLT)
     • Independent vector analysis (IVA) [A. Hiroe, 2006], [T. Kim, 2006]
       – Source model: vector (multivariate) r.v.s with a non-Gaussian spherical source distribution
       – Update the separation filter (demixing matrix) so that the estimated signals obey the non-Gaussian distribution we assumed
     • Figure (both methods): observed → STFT (frequency × time) → demixing matrix → estimated; the current empirical distribution is compared with the assumed source distribution
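The difference between a scalar and a vector source model can be seen in how each cost reacts to a permutation across frequency bins. The sketch below is illustrative only: it assumes a Laplace-type contrast for FDICA and a spherical Laplace contrast for IVA, omits the log-determinant term of the demixing matrix (a permutation does not change it), and uses hypothetical toy signals and function names.

```python
import numpy as np

def fdica_cost(Y):
    """Sum of |y| over all bins: every TF bin is treated as a scalar variable."""
    return np.sum(np.abs(Y))

def iva_cost(Y):
    """Sum over frames and sources of the L2 norm taken across frequency
    (spherical contrast on each frequency vector)."""
    return np.sum(np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)))

rng = np.random.default_rng(0)
I, J, N = 64, 200, 2                         # freq bins, frames, sources (toy sizes)

# Hypothetical sources: each has a frequency-uniform, time-varying scale
scale = rng.exponential(size=(1, J, N))
Y = scale * (rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N)))

# Swap the two sources in the upper half of the frequency bins
Y_perm = Y.copy()
Y_perm[I // 2 :] = Y[I // 2 :, :, ::-1]

print("FDICA cost:", fdica_cost(Y), "vs permuted:", fdica_cost(Y_perm))
print("IVA cost:  ", iva_cost(Y), "vs permuted:", iva_cost(Y_perm))
```

The scalar cost is identical for both orderings, which is why FDICA needs a separate permutation solver, whereas the vector cost penalizes the misaligned ordering.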
  10. Extension of source distribution in IVA
      • Spherical Laplace distribution in IVA
        – Frequency vector (I-dimensional) with frequency-uniform scale
        – Figure: spherical Laplace (bivariate)
      • Zero-mean complex Gaussian distribution with TF-varying variance (Itakura–Saito NMF) [C. Févotte+, 2009]
        – Zero-mean complex Gaussian in each TF bin
        – Time-frequency matrix (IJ-dimensional) with time-frequency-varying variance
        – Low-rank decomposition with NMF
      • The spherical Laplace model is extended to this more flexible model
  11. Generative source model in ISNMF
      • Power spectrogram corresponds to variances in the TF plane
        – Complex Gaussian distribution with TF-varying variance
        – Figure: power spectrogram (frequency bin × time frame); grayscale shows the value of the variance, from small to large values of power
      • If we marginalize in terms of time or frequency, the distribution becomes non-Gaussian even though each TF bin follows a Gaussian distribution
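The claim that pooling Gaussian bins with different variances gives a non-Gaussian distribution can be checked by sampling. This is a minimal sketch under assumed toy settings: the sizes, the exponential draws for the hypothetical matrices `T` and `V`, and pooling over all time-frequency bins at once are choices made here, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 64, 256, 2                         # freq bins, frames, bases (toy sizes)

T = rng.exponential(size=(I, K))             # hypothetical frequency-by-basis matrix
V = rng.exponential(size=(K, J))             # hypothetical basis-by-time matrix
R = T @ V                                    # low-rank variance in each TF bin

# Zero-mean circular complex Gaussian with variance R[i, j] in each TF bin
Z = np.sqrt(R / 2) * (rng.standard_normal((I, J)) + 1j * rng.standard_normal((I, J)))

def kurtosis(x):
    """Fourth standardized moment; it equals 3 for a Gaussian."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

k_pooled = kurtosis(Z.real.ravel())                          # bins pooled together
k_normalized = kurtosis((Z.real / np.sqrt(R / 2)).ravel())   # per-bin variance removed

print(f"pooled kurtosis:     {k_pooled:.2f}")
print(f"normalized kurtosis: {k_normalized:.2f}")
```

Once the per-bin variance is divided out, the samples look Gaussian again (kurtosis near 3); the pooled samples are heavier-tailed because they form a scale mixture of Gaussians.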
  12. BSS methods: ILRMA
      • Independent low-rank matrix analysis (ILRMA) [D. Kitamura+, 2016]
        – Unification of IVA and ISNMF
        – Source model in ILRMA: low-rank decomposition of each time-frequency matrix into a frequency × basis matrix and a basis × time matrix
        – Number of bases can be set to an arbitrary value
      • Update the demixing matrix so that the estimated signals have a low-rank structure in the time-frequency domain
      • Figure: observed → STFT (frequency × time) → demixing matrix → estimated, with a low-rank decomposition of each estimated spectrogram
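The low-rank part of the source model can be illustrated with a plain Itakura–Saito NMF fit of a single power spectrogram. The sketch below uses the widely known multiplicative updates with a square-root step; it covers only the spectrogram factorization, not the demixing-matrix update, and it should not be read as the authors' implementation. The function names, the toy data, and the `eps` floor are assumptions.

```python
import numpy as np

def is_divergence(P, R):
    """Itakura-Saito divergence between power spectrogram P and model R."""
    ratio = P / R
    return np.sum(ratio - np.log(ratio) - 1)

def isnmf(P, n_bases, n_iter=50, seed=0, eps=1e-12):
    """Fit P ~= T @ V with multiplicative updates (square-root step)."""
    rng = np.random.default_rng(seed)
    I, J = P.shape
    T = rng.uniform(eps, 1.0, size=(I, n_bases))   # frequency-by-basis matrix
    V = rng.uniform(eps, 1.0, size=(n_bases, J))   # basis-by-time matrix
    costs = []
    for _ in range(n_iter):
        R = T @ V
        T *= np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T))
        T = np.maximum(T, eps)                     # keep entries strictly positive
        R = T @ V
        V *= np.sqrt((T.T @ (P / R**2)) / (T.T @ (1.0 / R)))
        V = np.maximum(V, eps)
        costs.append(is_divergence(P, T @ V))
    return T, V, costs

# Toy power spectrogram: rank-3 variance times exponential (power) noise
rng = np.random.default_rng(1)
variance = rng.exponential(size=(40, 3)) @ rng.exponential(size=(3, 120))
P = variance * rng.exponential(size=(40, 120)) + 1e-12

T, V, costs = isnmf(P, n_bases=3)
print(f"IS divergence: {costs[0]:.1f} -> {costs[-1]:.1f}")
```

The number of bases is a free parameter here, matching the note on the slide that it can be set to an arbitrary value.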
  13. Comparison of source models
      • FDICA source model: non-Gaussian scalar variable
      • IVA source model: non-Gaussian vector variable with higher-order correlation
      • ILRMA source model: non-Gaussian matrix variable with low-rank time-frequency structure
      • Figure labels: rank of TF matrix of mixture; rank of TF matrix of each source
  15. Experimental analysis
      • Window length in STFT
        – If window length is too short
          • Mixing assumption does not hold anymore
        – If window length is too long
          • Estimation becomes unstable (# of time frames decreases)
      • Figure: the waveform is cut into frames by a window function of a given window length (= DFT length), advanced by the shift length; a DFT of each frame gives one column of the spectrogram (frequency × time)
      • Our expectation
        – Full time-frequency modeling of sources in ILRMA may improve the robustness to a decrease in the number of time frames
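How quickly the number of time frames falls as the window grows can be worked out directly. The sketch below assumes a 16 kHz sampling rate (not stated on the slides), no padding at the signal edges, and a hypothetical helper named `stft_shape`; the listed window lengths are merely examples inside the 32 to 2048 ms range.

```python
def stft_shape(n_samples, fs, win_ms, shift_ratio=0.25):
    """Return (frequency bins, time frames) for a window length given in ms."""
    win_len = int(round(fs * win_ms / 1000))   # window length = DFT length
    shift = max(1, int(win_len * shift_ratio))
    if n_samples < win_len:
        return win_len // 2 + 1, 0             # signal shorter than one window
    n_frames = 1 + (n_samples - win_len) // shift
    return win_len // 2 + 1, n_frames

fs = 16000                       # assumed sampling rate
n_samples = 10 * fs              # a 10.0 s signal
for win_ms in (32, 128, 512, 2048):
    n_bins, n_frames = stft_shape(n_samples, fs, win_ms)
    print(f"{win_ms:5d} ms: {n_bins:6d} bins x {n_frames:5d} frames")
```

With the shift fixed at a quarter of the window, doubling the window roughly halves the number of frames while doubling the number of frequency bins, so each frequency-wise demixing matrix has to be estimated from fewer samples.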
  16. Experimental analysis
      • Dataset: 4 music and 4 speech signals from SiSEC [S. Araki+, 2012]
      • Mixing: convolution with RIR in RWCP [S. Nakamura+, 2000]

      Signal  Data name                         Source (1/2)               Length [s]
      Music   bearlin-roads                     acoustic_guit_main/vocals  14.6
      Music   another_dreamer-the_ones_we_love  guitar/vocals              25.6
      Music   fort_minor-remember_the_name      violins_synth/vocals       24.6
      Music   ultimate_nz_tour                  guitar/synth               18.6
      Speech  dev1_female4                      src_1/src_2                10.0
      Speech  dev1_female4                      src_3/src_4                10.0
      Speech  dev1_male4                        src_1/src_2                10.0
      Speech  dev1_male4                        src_3/src_4                10.0

      • Impulse responses (figure): E2A (reverberation time: T60 = 300 ms) and JR2 (reverberation time: T60 = 470 ms)
      • Labels in the two recording diagrams: Source 1, Source 2, 2 m, 5.66 cm; 50, 50 in the first diagram and 60, 60 in the second
  17. Experimental analysis
      • Compared methods
        – FDICA+IPS (ideal permutation solver)
          • Align permutation of estimated components using the reference (oracle) source spectrogram (upper limit performance of FDICA)
        – FDICA+DOA (DOA-based permutation solver) [S. Kurita+, 2000]
          • Align permutation of estimated components using DOA after FDICA
        – IVA [N. Ono, 2011]
          • using auxiliary function method (a.k.a. MM algorithm) in optimization
        – ILRMA [D. Kitamura+, 2016]
          • with several numbers of bases
      • Other conditions
        – Window function: Hamming window
        – Window length: 32 ~ 2048 ms
        – Shift length: always a quarter of the window length
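An oracle permutation alignment of the kind described for FDICA+IPS can be sketched as follows. The similarity measure (normalized inner product of power envelopes) and the exhaustive search over permutations are assumptions chosen for simplicity; the slides do not specify the criterion the authors used, and `align_with_reference` is a hypothetical name.

```python
import itertools
import numpy as np

def align_with_reference(Y, S_ref, eps=1e-12):
    """Reorder the sources in each frequency bin to best match the reference.

    Y, S_ref: complex arrays of shape (freq bins, frames, sources).
    """
    I, J, N = Y.shape
    Y_aligned = np.empty_like(Y)
    for i in range(I):
        p_est = np.abs(Y[i]) ** 2                    # (frames, sources)
        p_ref = np.abs(S_ref[i]) ** 2
        # Normalize each power envelope so that source scale does not matter
        p_est = p_est / (np.linalg.norm(p_est, axis=0, keepdims=True) + eps)
        p_ref = p_ref / (np.linalg.norm(p_ref, axis=0, keepdims=True) + eps)
        sim = p_est.T @ p_ref                        # sim[m, n]: estimate m vs reference n
        best = max(itertools.permutations(range(N)),
                   key=lambda perm: sum(sim[perm[n], n] for n in range(N)))
        Y_aligned[i] = Y[i][:, list(best)]           # column n takes estimate best[n]
    return Y_aligned

# Toy check: shuffle the source order independently in every frequency bin
rng = np.random.default_rng(0)
S_ref = rng.standard_normal((16, 100, 2)) + 1j * rng.standard_normal((16, 100, 2))
Y = np.stack([S_ref[i][:, rng.permutation(2)] for i in range(16)])

Y_aligned = align_with_reference(Y, S_ref)
print("recovered reference order:", np.allclose(Y_aligned, S_ref))
```

Because the reference spectrograms are not available in practice, this solver only gives an upper bound on what FDICA can achieve, as noted on the slide.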
  18. Comparison using ideal initialization: condition
      • Set initial value of demixing matrix to oracle
        – This initial value provides the best separation performance under the assumption
      • Set initial value of source model to oracle (only for ILRMA)
        – Power spectrogram of each source
      • FDICA+DOA & IVA: spatial oracle initialization
      • FDICA+IPS & ILRMA: spatial and spectral oracle initialization
  19. Comparison using ideal initialization: results
      • Figure panels: Music, T60 = 0.30 s; Music, T60 = 0.47 s; Speech, T60 = 0.30 s; Speech, T60 = 0.47 s
  20. Comparison using random initialization: condition
      • Set initial value of demixing matrix to identity matrix
      • Set initial value of source model to uniform random values in [0, 1] (only for ILRMA)
      • FDICA+DOA, IVA, & ILRMA: fully blind methods
      • FDICA+IPS: using oracle spectrogram
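The blind initialization described above is simple to write down. In this sketch the array layout, the names `W`, `T`, `V`, and `init_blind`, and the toy sizes are assumptions; only the identity demixing matrix and the uniform random source model come from the slide. NumPy draws from the half-open interval [0, 1), a negligible difference from the closed interval on the slide.

```python
import numpy as np

def init_blind(n_bins, n_frames, n_sources, n_bases, seed=0):
    """Return identity demixing matrices and a random NMF source model."""
    rng = np.random.default_rng(seed)
    # One identity demixing matrix per frequency bin
    W = np.tile(np.eye(n_sources, dtype=complex), (n_bins, 1, 1))
    # Source model (ILRMA only): uniform random values in [0, 1)
    T = rng.uniform(0.0, 1.0, size=(n_sources, n_bins, n_bases))
    V = rng.uniform(0.0, 1.0, size=(n_sources, n_bases, n_frames))
    return W, T, V

W, T, V = init_blind(n_bins=257, n_frames=100, n_sources=2, n_bases=4)
print(W.shape, T.shape, V.shape)
```

With an identity demixing matrix the initial estimates are just the observed channels, so all separation has to come from the iterations that follow.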
  21. Comparison using random initialization: results
      • Figure panels: Music, T60 = 0.30 s; Music, T60 = 0.47 s; Speech, T60 = 0.30 s; Speech, T60 = 0.47 s
  22. Conclusion
      • In the case of ILRMA with oracle initialization, the robustness to long windows (fewer time frames) can be improved
        – optimal window length is longer than that in FDICA or IVA
        – thanks to employing not only the independence between sources but also a full modeling of time-frequency structure for the estimation of the demixing matrix
      • In a practical situation (fully blind case),
        – optimal window length is similar to that in FDICA or IVA
        – because of the difficulty of blindly estimating a precise spectral model in ILRMA
      Thank you for your attention!
