Daichi Kitamura, "Blind source separation based on independent low-rank matrix analysis and its extension to Student's t-distribution," Télécom ParisTech, Invited Lecture, September 4th, 2017.

- 1. Blind source separation based on independent low-rank matrix analysis and its extension to Student's t-distribution Télécom ParisTech Visiting September 4th The University of Tokyo, Japan Project Research Associate Daichi Kitamura
- 2. • Name: Daichi Kitamura • Age: 27 (born in 1990) – Kagawa Pref. in Japan • Background: – NAIST, Japan • Master degree (received in 2014) – SOKENDAI, Japan • Ph.D. degree (received in 2017) – The University of Tokyo, Japan • Project Research Associate • Research topics – Acoustic signal processing, statistical signal processing, audio source separation, etc. Self introduction 2 Japan Kagawa Tokyo
- 3. Contents • Background – Blind source separation (BSS) for audio signals – Motivation • Related Methods – Frequency-domain independent component analysis (FDICA) – Independent vector analysis (IVA) – Itakura–Saito nonnegative matrix factorization (ISNMF) • Independent Low-Rank Matrix Analysis (ILRMA) – Employ low-rank TF structures of each source in BSS – Gaussian source model with TF-varying variance – Relationship between ILRMA and multichannel NMF – Student’s t source model with TF-varying scale parameters • Conclusion 3
- 5. • Blind source separation (BSS) for audio signals – separates original audio sources – does not require prior information of recording conditions • locations of mics and sources, room geometry, timbres, etc. – can be available for many audio app. • Consider only “determined” situation Background 5 Recording mixture Separated guitar BSS Sources Observed Estimated Mixing system Demixing system # of mics # of sources
- 6. • Basic theories and their evolution History of BSS for audio signals 6 1994 1998 2013 1999 2012 Age Many permutation solvers for FDICA Apply NMF to many tasks Generative models in NMF Many extensions of NMF Independent component analysis (ICA) Frequency-domain ICA (FDICA) Itakura–Saito NMF (ISNMF) Independent vector analysis (IVA) Multichannel NMF Independent low-rank matrix analysis (ILRMA) *Depicting only popular methods 2016 2009 2006 2011 Auxiliary-function-based IVA (AuxIVA) Time-varying Gaussian IVA Nonnegative matrix factorization (NMF)
- 7. Motivation of ILRMA • Conventional BSS techniques based on ICA – Minimum distortion (linear demixing) – Relatively fast and stable optimization • FastICA [A. Hyvarinen, 1999], natural gradient [S. Amari, 1996], and auxiliary function technique [N. Ono+, 2010], [N. Ono, 2011] – Could not use “specific” assumption of sources • Only assumes non-Gaussian p.d.f. for sources – Permutation problem is crucial and still difficult to solve • IVA often fails causing a “block permutation problem” [Y. Liang+, 2012] • Better to use a “specific source model” in TF domain – Independent low-rank matrix analysis (ILRMA) employs a low-rank property [Figure: observed signal, source signals, and estimated signal as arrays over frequency bins and time frames, linked by the frequency-wise mixing matrix and the frequency-wise demixing matrix]
- 9. Related methods: ICA • Independent component analysis (ICA) [P. Comon, 1994] – estimates the demixing matrix without knowing the mixing matrix – Source model (scalar) • Each source is non-Gaussian, and the sources are mutually independent – Spatial model • Mixing system is a time-invariant matrix • Mixing system in audio signals – Convolutive mixture with room reverberation [Figure: sources, mixing matrix, observed signals, demixing matrix, and estimated signals, annotated with the source model and the spatial model]
- 10. • Frequency-domain ICA (FDICA) [P. Smaragdis, 1998] – estimates frequency-wise demixing matrix – Source model (scalar) • is complex-valued, non-Gaussian, and mutually independent – Spatial model • Frequency-wise mixing matrix is time-invariant – Instantaneous mixture in each frequency band – A.k.a. rank-1 spatial model [N.Q.K. Duong, 2010] • Permutation problem? – Order of estimated signals cannot be determined by ICA – Alignment of frequency-wise estimated signals is required • Many permutation solvers were proposed Related methods: FDICA 10 Spectrograms ICA1 … Frequencybin Time frame … ICA2 ICA I
- 11. • FDICA requires signal alignment for all frequency – Order of estimated signals cannot be determined by ICA* Permutation problem 11 ICA All frequency components Source 1 Source 2 Observed 1 Observed 2 Permutation Solver Estimated signal 1 Estimated signal 2 Time *Signal scale should also be restored by a back-projection technique
- 12. Related methods: IVA • Independent vector analysis (IVA)[A. Hiroe, 2006], [T. Kim, 2006] – extends ICA to multivariate probabilistic model to consider sourcewise frequency vector as a variable – Source model (vector) • is multivariate, spherical, complex-valued, non-Gaussian, and mutually independent – Spatial model • Mixing system is a time-invariant matrix (rank-1 spatial model) 12 … … Mixing matrix … … … Observed vector Demixing matrix Estimated vector Multivariate non- Gaussian dist. Have higher-order correlations Permutation-free estimation of is achieved! Source vector
- 13. • Spherical multivariate distribution[T. Kim, 2007] • Why spherical distribution? – Frequency bands that have similar activations will be merged together as one source avoid permutation problem Higher-order correlation assumed in IVA 13 x1 and x2 are mutually independent Spherical Laplace dist. Mutually independent two Laplace dist.s x1 and x2 have higher-order correlation Probability depends on only the norm
- 14. Comparison of source models • Frequency-domain ICA (FDICA) [P. Smaragdis, 1998]: scalar random variables in each frequency bin; the separation filter (demixing matrix) is updated so that the estimated signals obey the assumed non-Gaussian source distribution • Independent vector analysis (IVA) [A. Hiroe, 2006], [T. Kim, 2006]: vector (multivariate) random variables spanning all frequency bins; the separation filter is updated so that the estimated signals obey the assumed non-Gaussian spherical source distribution • In both cases the sources are mutually independent and obey a non-Gaussian distribution, whereas the mixture is close to a Gaussian signal because of the central limit theorem (CLT) [Figure: STFT spectrograms of the observed and estimated signals with the current empirical distribution and the assumed source distribution]
- 15. Related method: NMF • Nonnegative matrix factorization (NMF) [D. D. Lee, 1999] – Low-rank decomposition with nonnegative constraint • Limited number of nonnegative bases and their coefficients – Spectrogram is decomposed in acoustic signal processing • Frequently appearing spectral patterns and their activations 15 Amplitude Amplitude Nonnegative matrix (power spectrogram) Basis matrix (spectral patterns) Activation matrix (time-varying gains) Time : # of freq. bins : # of time frames : # of bases Time Frequency Frequency
- 16. • ISNMF[C. Févotte, 2009] – can be decomposed using “stable property” of • If we define , Related method: ISNMF 16 Equivalent Circularly symmetric complex Gaussian dist. Complex-valued observed signal Nonnegative variance Variance is also decomposed!
- 17. • Power spectrogram corresponds to variances in TF plane Related method: ISNMF 17 Frequencybin Time frame : Power spectrogram Small value of power Large value of power Complex Gaussian distribution with TF-varying variance If we marginalize in terms of time or frequency, the distribution becomes non-Gaussian even though each TF grid is defined in Gaussian distribution Grayscale shows the value of variance
- 19. Extension of source model in IVA • Source model in IVA – has a frequency-uniform scale • Multivariate Laplace with fixed scale • Since scale cannot be determined, it is not equivalent to the flat spectral basis – Almost an NMF with only one basis • Extend to ISNMF-based source model – NMF with arbitrary number of bases • can represent complicated TF structures – can learn “co-occurrence” of each source in TF domain • Co-occurrence is captured as the variance – The structure can easily be estimated by NMF 19 Frequency Time Frequency Time
- 20. • Spherical Laplace distribution in IVA • Gaussian distribution with TF-varying variance in ISNMF[C. Févotte+, 2009] 20 Frequency-uniform scale Extension of source model in IVA Complex-valued Gaussian in each TF bin Low-rank decomposition with NMF Spherical Laplace (bivariate) Frequency vector (I-dimensional) Time-frequency-varying variance Time-frequency matrix (IJ-dimensional)
- 21. • Negative log-likelihood in ILRMA Cost function in ILRMA and partitioning function 21 All the variables can easily be optimized by an alternative update Update rules in ICA Update rules in ISNMF Estimated signal: Cost function in ICA (estimates demixing matrix) Cost function in ISNMF (estimates low-rank source model)
- 22. Update rules of ILRMA • ML-based iterative update rules – The update rule for the demixing matrix (spatial model) is based on iterative projection [N. Ono, 2011], which uses a one-hot vector that has 1 at the element of the source being updated – The update rules for the NMF variables (source model) are based on the MM algorithm – Pseudo code is available at • http://d-kitamura.net/pdf/misc/AlgorithmsForIndependentLowRankMatrixAnalysis.pdf
- 23. Cost function in ILRMA and partitioning function • ILRMA with partitioning function – Appropriate number of bases for each source can automatically be determined – Useful when various types of sources are mixed • Ex. drums are very low-rank but vocals are not so low-rank
- 25. Optimization process in ILRMA • Demixing matrix and source model are alternatively updated – The precise modeling of low-rank TF structures will improve the estimation accuracy of demixing matrix 25 Estimating demixing matrix Mixture Separated Source model Update NMF NMF Estimating NMF variables
- 26. Comparison of source models 26 FDICA source model Non-Gaussian scalar variable IVA source model Non-Gaussian vector variable with higher-order correlation ILRMA source model Non-Gaussian matrix variable with low-rank time-frequency structure Rank of TF matrix of mixture Rank of TF matrix of each source
- 27. Multichannel extension of NMF • Multichannel NMF [A. Ozerov+, 2010], [H. Sawada+, 2013] – Observed multichannel signal (multichannel vector): its spatial covariances are modeled in each time-frequency slot – Spatial model: spatial covariances of each source (spatial property of each source) – Source model: basis matrix (spectral patterns), activation matrix (gains), and partitioning function, which together represent the timbre patterns of all sources
- 28. Relationship b/w ILRMA and multichannel NMF • Difference b/w ILRMA and multichannel NMF? – Source distribution: complex Gaussian distribution (same) – ILRMA assumes a rank-1 spatial covariance – Multichannel NMF assumes a full-rank spatial covariance • Assumption: rank-1 spatial model – Spatial covariance of each source is a rank-1 matrix formed from the sourcewise steering vector – Equivalent to the instantaneous mixing assumption in each frequency bin
- 29. Relationship b/w ILRMA and multichannel NMF • Multichannel NMF with rank-1 spatial model – Substitute the rank-1 spatial model into the cost function – Transform the variables from the mixing system to the demixing matrix
- 30. Relationship b/w MNMF, IVA, and ILRMA • From multichannel NMF side, – Rank-1 spatial model is introduced, which transforms the problem from the estimation of the mixing system to that of the demixing matrix • From IVA side, – Increase the number of spectral bases in source model [Figure: IVA, multichannel NMF, and ILRMA placed on two axes, source model (limited to flexible) and spatial model (limited to flexible), with arrows labeled NMF source model and rank-1 spatial model]
- 31. Experimental evaluation • Conditions – Source signals: music signals obtained from SiSEC, convolved with impulse responses (two microphones and two sources) – Window: Hamming window, 512 ms long – Shift length: 128 ms (1/4 shift) – Number of bases: 30 per source (ILRMA w/o partitioning function); 60 for all sources (ILRMA with partitioning function) – Evaluation score: improvement of signal-to-distortion ratio (SDR) • Impulse responses – E2A (reverberation time: 300 ms) – JR2 (reverberation time: 470 ms) [Figure: recording layouts with two sources 2 m from a microphone pair spaced 5.66 cm apart; angle labels 50 and 50 for E2A, 60 and 60 for JR2]
- 32. Results: fort_minor-remember_the_name [Figure: bar charts of SDR improvement [dB] for the violin synth. and vocals under impulse responses E2A (300 ms) and JR2 (470 ms); higher is better. Methods compared: directional clustering, IVA, Ozerov’s MNMF, Ozerov’s MNMF with random initialization, Sawada’s MNMF, Sawada’s MNMF initialized by ILRMA, ILRMA w/o partitioning function, and ILRMA with partitioning function]
- 33. Results: ultimate_nz_tour [Figure: bar charts of SDR improvement [dB] for the guitar and synth. under impulse responses E2A (300 ms) and JR2 (470 ms); higher is better. Methods compared: directional clustering, IVA, Ozerov’s MNMF, Ozerov’s MNMF with random initialization, Sawada’s MNMF, Sawada’s MNMF initialized by ILRMA, ILRMA w/o partitioning function, and ILRMA with partitioning function]
- 34. Results: bearlin-roads • Signal length: 14 s [Figure: SDR improvement [dB] versus iteration steps (0 to 400) for IVA, MNMF, ILRMA without Z, and ILRMA with Z; higher is better. Computation times annotated on the plot: 11.5 s, 15.1 s, 60.7 s, and 7647.3 s]
- 35. Demonstration: music source separation • Music source separation – Listen for the three parts in the mixture: guitar, vocal, and keyboard – Source separation yields the separated guitar, vocal, and keyboard signals – Another demo is available at http://d-kitamura.net/en/index_en.html
- 36. Stable and Student’s t-distributions • Source model based on symmetric α-stable (SαS) distribution [A. Liutkus+, 2015], [U. Şimşekli+, 2015], [S. Leglaive+, 2017], [M. Fontaine+, 2017] – which can validate the decomposition of complex-valued r.v.s as the decomposition of their parameters – Heavy tail (sparse) when α approaches 0 • Student’s t-distribution is also used as a source model [C. Févotte+, 2006], [K. Yoshii+, 2016], [K. Kitamura+, 2016], [S. Leglaive+, 2017] – that includes the Cauchy distribution (ν = 1) and the Gaussian distribution (ν → ∞) [Figure: the SαS (stable family) and Student’s t (partially stable) families, which both contain the Cauchy and Gaussian distributions]
- 37. Source model of Student’s t-distribution • Degree-of-freedom parameter ν – Heavy tail when ν approaches 0 • Complex Student’s t-dist. – Circularly symmetric (phase is assumed to be uniform) – Defined in each TF slot – Student’s t NMF (t-NMF) [K. Yoshii+ 2016]: the scale corresponds to the NMF model
- 38. Motivation for using Student’s t-dist. • Better separation with t-NMF was reported[K. Yoshii+, 2016] – in a very simple experiment using only C4, E4, and G4 piano tones • NMF with heavy tail distribution – tends to provide excessive low-rank approximation • Sparse components (which may increase the rank of model data) are considered as outliers • ILRMA based on Student’s t source model (t-ILRMA) – may improve the separation accuracy by forcing NMF source model to be excessively low-rank – will be presented at MLSP2017! (preprint is available on arXiv) • https://arxiv.org/abs/1708.04795 39
- 39. • th power spectrogram corresponds to scales in TF plane Source model based on Student’s t-distribution 40 Frequencybin Time frame : th power spectrogram Small value of power Large value of power Complex Student’s t-distribution with TF-varying scale Grayscale shows the value of scale
- 40. • Negative log-likelihood in ILRMA Cost function in ILRMA based on Student’s t-dist. 41 Gaussian ILRMA modeling power spectrogram by variance Student’s t ILRMA modeling pth power spectrogram by scale Generalization of p.d.f. and model domain
- 41. Experimental results: randomized t-ILRMA • Examples – Improved when – Stable when but score is not sufficient – Root spectrogram ( ) is preferable for speech signals • In the case of – Source model is over-fitted to mixture 42 Music signals Speech signals
- 42. Tempering parameter • Random initialization (previous result) • Initialization based on Gaussian ILRMA – (Tempering approach of parameter) 43 t-ILRMA (iteration: 200) Identity matrix Uniform random values Gauss ILRMA (iteration: 100) Identity matrix Uniform random values t-ILRMA (iteration: 100) t-NMF (iteration: 100)Uniform random values arbitrary val.
- 43. Experimental results: initialized t-ILRMA • Examples – Improved for all value of – Could avoid over- fitting problem in the case • Best parameter? – Completely depending on data 44 Music signals Speech signals
- 44. Average results: music signals 45
- 45. Average results: speech signals 46
- 47. Conclusion • Independent low-rank matrix analysis (ILRMA) – Assumption • Statistical independence between sources • Low-rank time-frequency structure of each source – Equivalent to multichannel NMF • when the mixing assumption is valid • Student’s t-distribution is newly introduced – including two symmetric α-stable distributions • Complex Cauchy distribution (ν = 1) • Complex Gaussian distribution (ν → ∞) • Further extensions – Relaxation of rank-1 spatial model? – Employ another distribution? – Supervised ILRMA? User-guided ILRMA?

- This talk treats the blind source separation (BSS) problem, that is, the separation of individual sources from a recorded mixture. The word “blind” means that the method does not require any prior information about the recording conditions, such as the locations of the microphones and sources or the room geometry. This kind of technique is very useful as front-end processing in many applications. In this talk, we only consider the “determined” situation, namely, the number of microphones equals the number of sources.
- This is a history of the basic theories in the audio BSS field. For acoustic signals, independent component analysis (ICA) was applied to frequency-domain signals as FDICA. After that, many permutation solvers for FDICA were proposed, and eventually an elegant solution, independent vector analysis (IVA), was proposed. It is still being extended to more flexible models. On the other hand, nonnegative matrix factorization (NMF) has also been developed and extended to multichannel signals for source separation problems. Recently, we have developed a new framework that unifies these two powerful theories, called independent low-rank matrix analysis (ILRMA). I will explain it in detail.
- Here I explain the motivation of this talk.
- I briefly explain the separation mechanism in FDICA and IVA. In FDICA, ICA is applied to each frequency bin considering the scalar time-series as random variables, and we maximize its non-Gaussianity to estimate the frequency-wise demixing matrix. In IVA, we consider a vector time-series variable of all frequencies like this figure, then assume a multivariate non-Gaussian distribution, which has a spherical property. Since spherical property ensures higher-order correlation in frequency variable, namely among frequency bins, the permutation problem can be avoided.
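The difference between the two source models can be illustrated with a small numerical sketch. It assumes Laplace-type models with constants omitted: a bin-wise cost for FDICA and a spherical (frequency-vector norm) cost for IVA. The function names and the toy signals are hypothetical, not taken from the talk. Swapping the two estimated sources in a single frequency bin leaves the FDICA cost unchanged but raises the IVA cost, which is why IVA can avoid the permutation problem.

```python
import numpy as np

def fdica_nll(Y):
    # Independent Laplace model in every bin: cost is the sum of |y| (constants omitted).
    return float(np.sum(np.abs(Y)))

def iva_nll(Y):
    # Spherical Laplace model: cost is the L2 norm over frequency, summed over sources and frames.
    return float(np.sum(np.sqrt(np.sum(np.abs(Y) ** 2, axis=0))))

rng = np.random.default_rng(0)
I, N, J = 4, 2, 100                      # frequency bins, sources, time frames
Y = rng.standard_normal((I, N, J)) + 1j * rng.standard_normal((I, N, J))
Y[:, 0, J // 2:] = 0.0                   # toy source 0 is active only in the first half
Y[:, 1, :J // 2] = 0.0                   # toy source 1 is active only in the second half

Y_perm = Y.copy()
Y_perm[0] = Y[0, ::-1]                   # swap the two sources in frequency bin 0 only

fdica_aligned, fdica_permuted = fdica_nll(Y), fdica_nll(Y_perm)
iva_aligned, iva_permuted = iva_nll(Y), iva_nll(Y_perm)
print(fdica_aligned, fdica_permuted)     # same value: FDICA cannot detect the swap
print(iva_aligned, iva_permuted)         # the permuted alignment costs more under IVA
```

The log-determinant term of the full cost is left out here because a row swap does not change the absolute value of the determinant.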
- This is a graphical image of the source model in ISNMF. In each time-frequency slot, a zero-mean complex Gaussian distribution is defined, and these distributions are mutually independent over all times and frequencies. The variances of these Gaussians correspond to the power spectrogram. Therefore, in a slot that has strong power, such as a spectral peak or its harmonics, a Gaussian with a large variance becomes the generative model. Note that, even though each slot is Gaussian, the marginal distribution is non-Gaussian because the variance fluctuates. So, we can use this model as a source model in ICA-based methods.
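The claim that a Gaussian with a fluctuating variance is marginally non-Gaussian can be checked numerically. The sketch below is a simplified illustration that assumes real-valued samples and a two-level variance; the function name and the chosen scales are hypothetical. It compares the excess kurtosis of fixed-variance and varying-variance samples.

```python
import numpy as np

def excess_kurtosis(x):
    # Zero for a Gaussian; positive for heavier-tailed (super-Gaussian) data.
    x = x - x.mean()
    return float(np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0)

rng = np.random.default_rng(0)
n = 200_000

# Gaussian samples that all share one variance.
fixed = rng.normal(0.0, 1.0, n)

# Gaussian samples whose standard deviation switches between a small and a large value,
# mimicking weak and strong time-frequency slots.
sigma = rng.choice([0.1, 3.0], size=n)
varying = rng.normal(0.0, 1.0, n) * sigma

k_fixed = excess_kurtosis(fixed)
k_varying = excess_kurtosis(varying)
print(k_fixed, k_varying)
```

The fixed-variance samples give an excess kurtosis near zero, while the varying-variance samples give a clearly positive value, so the pooled data look non-Gaussian even though every individual sample is drawn from a Gaussian.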
- This figure shows the difference between the source models in IVA and ILRMA. Since IVA assumes a frequency-uniform scale, it is almost an NMF with only one flat basis. On the other hand, ILRMA has a more flexible source model with an arbitrary number of spectral bases. So we can capture a more precise TF structure of each source.
- The spherical source distribution in IVA can be extended to a more flexible model. This is called the local Gaussian model, which employs a zero-mean complex Gaussian distribution with time-frequency-varying variances. Namely, in each time-frequency slot (i, j), a complex Gaussian is defined, and its variance varies depending on the time and frequency. This model is equivalent to Itakura-Saito NMF, and the variance can be decomposed into the basis matrix T and the activation matrix V.
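The decomposition of the variance into T and V can be sketched with Itakura-Saito NMF. The code below is a minimal illustration, assuming the commonly used square-root multiplicative (MM-type) updates; the function names, the number of bases, and the random test spectrogram are hypothetical choices, not the settings used in the talk.

```python
import numpy as np

def is_divergence(X, Y):
    # Itakura-Saito divergence between a power spectrogram X and its model Y.
    ratio = X / Y
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def isnmf(X, n_bases=4, n_iter=100, seed=0, eps=1e-12):
    """Factorize a positive matrix X (I x J) as T (I x K) times V (K x J)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    T = rng.uniform(0.1, 1.0, (I, n_bases))   # spectral patterns
    V = rng.uniform(0.1, 1.0, (n_bases, J))   # time-varying gains
    costs = [is_divergence(X, T @ V)]
    for _ in range(n_iter):
        R = T @ V                              # current variance model
        T *= np.sqrt(((X / R ** 2) @ V.T) / ((1.0 / R) @ V.T))
        T = np.maximum(T, eps)
        R = T @ V
        V *= np.sqrt((T.T @ (X / R ** 2)) / (T.T @ (1.0 / R)))
        V = np.maximum(V, eps)
        costs.append(is_divergence(X, T @ V))
    return T, V, costs

# Toy power spectrogram: squared magnitude of complex Gaussian noise.
rng = np.random.default_rng(1)
X = np.abs(rng.standard_normal((20, 50)) + 1j * rng.standard_normal((20, 50))) ** 2 + 1e-9
T, V, costs = isnmf(X)
print(costs[0], costs[-1])
```

Minimizing the Itakura-Saito divergence between the power spectrogram and TV corresponds to maximum-likelihood estimation of the variances under the local Gaussian model.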
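- The log-likelihood function of the proposed method, ILRMA, is obtained as shown here. (click) The spatial demixing filter W, circled in blue, and the NMF source model TV, circled in red, are the variables to be estimated. (click) Furthermore, (click) the first half of this expression is equivalent to the cost function of conventional IVA, and (click) the second half is equivalent to the cost function of conventional NMF. (click) Therefore, all the variables can easily be estimated by alternating the iterative update rules of IVA and NMF. In addition, the appropriate rank for each source can be determined adaptively by a latent variable. As I showed at the beginning, even in music signals, vocals are not so low-rank whereas drum signals are low-rank, so the appropriate rank differs from source to source. The role of this latent variable is to allocate the bases automatically under the maximum-likelihood criterion in such situations.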
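For reference, the cost function described above can be written out as follows. This is a sketch for the case without the partitioning function, with constant terms omitted; the index and symbol names ($i$ for frequency bins, $j$ for time frames, $n$ for sources, $k$ for bases, $r_{ij,n}$ for the modeled variance) are notational assumptions chosen to match the slides.

```latex
\mathcal{L}
  = \sum_{i,j} \Biggl[ \sum_{n} \Biggl( \frac{|y_{ij,n}|^{2}}{r_{ij,n}} + \log r_{ij,n} \Biggr)
    - 2 \log \bigl| \det \mathbf{W}_{i} \bigr| \Biggr],
\qquad
r_{ij,n} = \sum_{k} t_{ik,n}\, v_{kj,n},
\qquad
y_{ij,n} = \mathbf{w}_{i,n}^{\mathsf{H}} \mathbf{x}_{ij}
```

With $r_{ij,n}$ held fixed, the expression depends only on the demixing matrices $\mathbf{W}_i$ and is optimized with ICA-type updates; with $\mathbf{W}_i$ held fixed, the terms in the inner parentheses are the ISNMF cost for each source up to constants.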
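- The iterative update rules of ILRMA can be derived as shown here. All the variables are optimized by alternating the update of the spatial demixing filter and the update of the source model. Because these iterations are guaranteed to increase the likelihood monotonically, convergence to a local solution near the initial value is guaranteed.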
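The alternating updates can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the author's reference implementation: it omits the partitioning function and any scale normalization or back-projection, uses square-root MM updates for the NMF variables and iterative projection for the demixing matrix, and runs on a hypothetical synthetic mixture. The array layout (frequency, channel, frame) and all names are assumptions.

```python
import numpy as np

EPS = 1e-12

def ilrma_cost(Y, R, W):
    # Negative log-likelihood up to constants; R holds the modeled variances.
    J = Y.shape[2]
    _, logabsdet = np.linalg.slogdet(W)            # log|det W_i| for every frequency bin
    return float(np.sum(np.abs(Y) ** 2 / R + np.log(R)) - 2.0 * J * np.sum(logabsdet))

def ilrma(X, n_bases=2, n_iter=30, seed=0):
    """X: complex array (I, M, J). Returns Y (I, N, J), W (I, N, M), and the cost history."""
    rng = np.random.default_rng(seed)
    I, M, J = X.shape
    N = M                                          # determined case: sources = microphones
    W = np.tile(np.eye(N, dtype=complex), (I, 1, 1))
    T = rng.uniform(0.1, 1.0, (N, I, n_bases))     # bases of each source
    V = rng.uniform(0.1, 1.0, (N, n_bases, J))     # activations of each source
    Y = W @ X
    costs = [ilrma_cost(Y, np.transpose(T @ V, (1, 0, 2)), W)]
    for _ in range(n_iter):
        for n in range(N):
            # Source model: MM updates of the NMF variables for source n.
            P = np.abs(Y[:, n, :]) ** 2
            Rn = T[n] @ V[n]
            T[n] = np.maximum(T[n] * np.sqrt(((P / Rn ** 2) @ V[n].T) / ((1.0 / Rn) @ V[n].T)), EPS)
            Rn = T[n] @ V[n]
            V[n] = np.maximum(V[n] * np.sqrt((T[n].T @ (P / Rn ** 2)) / (T[n].T @ (1.0 / Rn))), EPS)
            Rn = T[n] @ V[n]
            # Spatial model: iterative projection for the n-th demixing filter in every bin.
            U = (X / Rn[:, None, :]) @ X.conj().transpose(0, 2, 1) / J      # (I, M, M)
            e_n = np.tile(np.eye(N, dtype=complex)[:, [n]], (I, 1, 1))      # one-hot vectors
            w = np.linalg.solve(W @ U, e_n)                                 # (I, M, 1)
            w = w / np.sqrt(np.real(w.conj().transpose(0, 2, 1) @ U @ w))
            W[:, n, :] = w[:, :, 0].conj()
            Y[:, n, :] = (W[:, n:n + 1, :] @ X)[:, 0, :]
        costs.append(ilrma_cost(Y, np.transpose(T @ V, (1, 0, 2)), W))
    return Y, W, costs

# Hypothetical synthetic mixture: two sources with rank-1 variance, random mixing per bin.
rng = np.random.default_rng(1)
I, J, N = 6, 200, 2
var = rng.uniform(0.1, 1.0, (I, N, 1)) * rng.uniform(0.1, 1.0, (1, N, J))
S = np.sqrt(var / 2) * (rng.standard_normal((I, N, J)) + 1j * rng.standard_normal((I, N, J)))
A = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))
X = A @ S
Y, W, costs = ilrma(X)
print(costs[0], costs[-1])
```

Each inner update does not increase the cost, which is the monotonicity property mentioned in the notes; in practice the estimated signals would also need their scale restored by back-projection.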
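- In other words, the proposed method first learns the spatial demixing filter, then learns the timbre structure of the separated signals with NMF, and reuses the resulting source model to learn the spatial demixing filter, which yields more accurate separated signals. By repeating this process many times, a clear timbre structure is captured for each source, and the performance of the spatial demixing filter is expected to improve.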
- This is a comparison of the source models in FDICA, IVA, and ILRMA again. The important idea in ILRMA is that the rank of the TF matrix of the mixture signal is greater than the rank of the TF matrix of each source. So, if we assume not only independence between the sources but also a low-rank TF structure for each source, the separation will be done accurately.
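The rank argument can be illustrated with a toy example. The sketch below assumes that each source has an exactly rank-1 power spectrogram and that the powers of the sources simply add in the mixture; both are simplifications, and the helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 32, 64                                   # frequency bins, time frames

def rank1_spectrogram():
    # One spectral pattern times one activation: an exactly rank-1 nonnegative matrix.
    return np.outer(rng.uniform(0.1, 1.0, I), rng.uniform(0.1, 1.0, J))

source_a = rank1_spectrogram()
source_b = rank1_spectrogram()
mixture = source_a + source_b                   # assumption: source powers add

rank_a = np.linalg.matrix_rank(source_a)
rank_b = np.linalg.matrix_rank(source_b)
rank_mix = np.linalg.matrix_rank(mixture)
print(rank_a, rank_b, rank_mix)
```

Real spectrograms are only approximately low-rank, so the comparison of ranks holds in an approximate sense rather than exactly as in this toy case.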
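- In the paper, we also reveal that multichannel NMF, an extension of NMF to multichannel signals, is closely related to ILRMA. Briefly, ILRMA is equivalent to conventional multichannel NMF when the spatial covariance matrix, which is the model of spatial information defined in multichannel NMF, is constrained to be rank-1. However, multichannel NMF estimates the mixing system, unlike ILRMA and IVA, which estimate the demixing system. For this reason, multichannel NMF is somewhat impractical in terms of computational efficiency and stability. I will show this in the comparative experiments.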
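The rank-1 constraint mentioned above can be written compactly. The notation below is an assumption chosen to match the slides: $\mathbf{a}_{i,n}$ is the steering vector of source $n$ in frequency bin $i$, $\mathbf{R}_{i,n}$ its spatial covariance, and $\mathbf{A}_i$ the frequency-wise mixing matrix.

```latex
\mathbf{R}_{i,n} = \mathbf{a}_{i,n}\,\mathbf{a}_{i,n}^{\mathsf{H}}
\quad\Longleftrightarrow\quad
\mathbf{x}_{ij} = \sum_{n} \mathbf{a}_{i,n}\, s_{ij,n} = \mathbf{A}_{i}\,\mathbf{s}_{ij},
\qquad
\mathbf{y}_{ij} = \mathbf{W}_{i}\,\mathbf{x}_{ij},
\quad
\mathbf{W}_{i} = \mathbf{A}_{i}^{-1}
```

In the determined case with an invertible $\mathbf{A}_i$, estimating the mixing system can therefore be replaced by estimating the demixing matrix $\mathbf{W}_i$, which is the change of variables that links multichannel NMF to ILRMA.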
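- We conducted a music source separation experiment. These are the experimental conditions. Two music signals were played in this arrangement and recorded with two microphones. The reverberation time was 300 ms. We use SDR as the evaluation score, which is a measure of overall performance that includes both sound quality and the degree of separation.
- This is an example of the separation results for three sources. The horizontal axis shows the number of optimization iterations, and the vertical axis shows the separation accuracy. As you can see, the convergence speed with respect to the number of iterations is completely different from that of multichannel NMF: IVA and ILRMA converge very quickly. In addition, the computational cost per iteration is also very different, so the actual computation time is much shorter. ILRMA gives good separation accuracy, and the version with the latent variable performs best, although its convergence is slightly slower.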
- Anyway, the next one is music source separation. Here we have a mixture signal of three parts, just like typical music. Please listen carefully for the three parts, guitar, vocal, and keyboard, OK? Let’s listen. Then, if we apply source separation, we can obtain these kinds of signals. So, we can remix them, re-edit them, or do anything we want. This is source separation.
- I will not explain the detailed derivation of the update rules, but they can easily be derived in the same manner as for the previous Gaussian ILRMA.
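For orientation, the cost function whose update rules are referred to here can be sketched from the circularly symmetric complex Student's t density. The form below is an assumption reconstructed from that standard density, with constants omitted; $\nu$ is the degree-of-freedom parameter, $\sigma_{ij,n}$ the scale, and $p$ the domain parameter of the NMF model.

```latex
p(y_{ij,n}) = \frac{1}{\pi \sigma_{ij,n}^{2}}
  \Biggl( 1 + \frac{2}{\nu} \frac{|y_{ij,n}|^{2}}{\sigma_{ij,n}^{2}} \Biggr)^{-\frac{2+\nu}{2}},
\qquad
\sigma_{ij,n}^{p} = \sum_{k} t_{ik,n}\, v_{kj,n}

\mathcal{L}_{t}
  = \sum_{i,j} \Biggl[ \sum_{n} \Biggl( \Bigl( 1 + \frac{\nu}{2} \Bigr)
      \log \Biggl( 1 + \frac{2}{\nu} \frac{|y_{ij,n}|^{2}}{\sigma_{ij,n}^{2}} \Biggr)
      + 2 \log \sigma_{ij,n} \Biggr)
    - 2 \log \bigl| \det \mathbf{W}_{i} \bigr| \Biggr]
```

Setting $\nu = 1$ gives the complex Cauchy case, and letting $\nu \to \infty$ with $p = 2$ recovers the Gaussian ILRMA cost, since the logarithmic term tends to $|y_{ij,n}|^{2} / \sigma_{ij,n}^{2}$.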
- The source model in IVA, the spherical Laplace distribution, was extended to this ISNMF model, resulting in independent low-rank matrix analysis (ILRMA). So, ILRMA is a unification of IVA and ISNMF, and we employed the NMF source model to capture the low-rank time-frequency structure of each source. This source model can improve the estimation accuracy of the demixing matrix.
- As I already explained, the window length in the STFT affects the performance of ICA-based separation. If we use a short window, x=As no longer holds, and if we use a long window, the estimation becomes unstable because the number of time frames J decreases. However, ILRMA employs full time-frequency modeling of the sources. This model may improve the robustness to a decrease in J. This is our expectation. Let’s check this issue.
- Here we used 4 music and 4 speech signals obtained from the SiSEC database, and we produced the observed signals by convolving them with the impulse responses shown at the bottom. We used two types of impulse response: one has a 300-ms-long reverberation, and the other has a 470-ms-long reverberation.
- We compare 4 methods: FDICA + ideal permutation solver (IPS), FDICA + DOA-based permutation solver, IVA, and ILRMA. In FDICA+IPS we used the reference, that is, the oracle source spectrogram. So this is an upper limit of FDICA. FDICA+DOA is a blind method that uses DOA clustering for solving the permutation problem. Of course, IVA and ILRMA are also blind methods. Then, we used a Hamming window with various window lengths.
- First, we show the results with ideal initialization. Namely, we first give a correct answer of demixing matrix using the oracle source. So, the initial value provides the best separation performance here. In addition, only for ILRMA, we set the initial value of NMF model T and V as the oracle values. Therefore, FDICA+DOA and IVA are using the spatial oracle initialization, and FDICA+IPS and ILRMA are using spatial and spectral oracle initialization.
- This is the result. The left plots are for music and the right plots are for speech, and the reverberation time is short (top) and long (bottom). The horizontal axis shows the window length, and the vertical axis shows the separation performance. The colored lines are the results of ILRMA with various numbers of NMF bases. In the music results, we can see that FDICA and IVA could not achieve good separation when the window becomes long. In ILRMA, the performance is maintained even with very long windows. This comes from the full modeling of the time-frequency structure of each source. However, for the speech signals, the performance of ILRMA becomes worse. We guess this is because speech is not low-rank, and the source model could not capture the precise TF structures.
- Next, we show the results in the fully blind situation. The initial W is set to the identity matrix, and the initial source model is randomized. Note that FDICA+IPS still uses the oracle spectrogram for solving the permutation.
- This is the result. We could not obtain the same results as in the previous case. The performance of all the methods is degraded when the window length becomes long. Therefore, at least we can say that ILRMA has good potential to separate the sources even in the long-window case, but in practice, the blind estimation of a precise source model is a difficult problem.