- 1. 13th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2021) Overview Session OS-1: Acoustic Signal Processing Blind Audio Source Separation Based on Time-Frequency Structure Models Daichi Kitamura National Institute of Technology, Kagawa College Japan
- 2. Self introduction • Daichi Kitamura • National Institute of Technology, Kagawa College • Research interests – Audio source separation – Array signal processing – Machine learning – Music signal processing – Biosignal processing
- 3. Contents • Background – Blind source separation (BSS) for audio signals and its history – Motivation • Preliminaries – Frequency-domain independent component analysis (FDICA) – Independent vector analysis (IVA) – Independent low-rank matrix analysis (ILRMA) • Time-frequency-masking-based BSS (TFMBSS) – Reformulation of BSS problems and its optimization – BSS based on primal-dual splitting method – Interpretation of TF masking and application • Conclusion
- 4. Contents (outline repeated from slide 3; section: Background)
- 5. Background: BSS for audio signals • Blind source separation (BSS) for audio signals – estimates specific audio sources in the observed mixture – does not require prior information about the recording conditions • locations of mics and sources, room geometry, timbres, etc. • The word “blind” means “unsupervised”. – is applicable to many audio applications • Hearing aid systems • Automatic speech recognition (ASR) • Preprocessing for music analysis, etc. [Figure: observed mixture → BSS → estimated source signals]
- 6. Background: BSS for audio signals • Music BSS using ILRMA [Figure: mixture of guitar, vocal, and keyboard → BSS → separated guitar, vocal, and keyboard] Please listen for the three parts in the mixture. MATLAB code: https://github.com/d-kitamura/ILRMA Python code: Implemented in “Pyroomacoustics” library
- 7. Background: BSS for audio signals • Numbers of mics and sources – Monaural recording: 1ch, single-channel signal – Mic array: 1ch, 2ch, …, Mch, multichannel signal • Consider only the “determined” situation – # of mics = # of sources – BSS estimates the “demixing system” (inverse of the mixing system) [Figure: source signals → mixing system → observed signals → demixing system → estimated signals]
- 8. Historical overview (only the methods related to this talk) [Figure: timeline with three columns (determined, underdetermined, single-channel) marked with the years 1994, 1998, 1999, 2006, 2009, 2010, 2011, 2012, 2013, 2016, 2018, and 2021. Methods shown: ICA, frequency-domain ICA, permutation solvers, IVA, AuxIVA, time-varying IVA, ILRMA, IDLMA, time-frequency-masking-based BSS (TFMBSS), multichannel NMF, spatial covariance model, spatial covariance model + DNN, NMF, Itakura-Saito NMF, extension of models, generative models, spectral subtraction, time-frequency masking, beamforming, sparse coding, DOA clustering, supervised approaches based on deep neural networks (DNN), and many other methods. Citations shown: [Comon], [Bell and Sejnowski], [Cardoso], [Amari], [Cichocki], [Smaragdis], [Saruwatari], [Murata], [Morgan], [Sawada], [Hiroe], [Kim], [Ono], [Kitamura], [Nugraha], [Ozerov, Sawada], [Duong], [Févotte], [Lee], [Virtanen], [Kameoka], [Ozerov], [Mogami], [Yatabe&Kitamura]. Gray-colored methods are “supervised” (not fully blind).]
- 9. Motivation of determined BSS • Conventional BSS: IVA, AuxIVA, and ILRMA – Minimum distortion (linear demixing) – Relatively fast and stable optimization • Iterative projection (AuxIVA) [Ono+, 2010], [Ono, 2011] – Time-frequency (TF) structure model affects performance • IVA: co-occurrence along frequency axis • ILRMA: NMF-based low-rank time-frequency structure – Optimization algorithm depends on the TF model • Difficult to derive update rules • Goal: easily replace the TF model and search for the best one – Time-frequency-masking-based BSS (TFMBSS) [Yatabe & Kitamura, 2021] [Figure: source signals → frequency-wise mixing matrix → observed signal → frequency-wise demixing matrix → estimated signal, over frequency bins and time frames]
- 10. Contents (outline repeated from slide 3; section: Preliminaries)
- 11. Independence-based BSS in time domain • Independent component analysis (ICA) [Comon, 1994] – If we assume that the sources (latent components) are 1. mutually independent and 2. non-Gaussian, and that the mixing matrix is 3. invertible and time-invariant, – then we can estimate the demixing matrix (ideally the inverse of the mixing matrix) • by maximizing independence between the estimates [Figure: sources (latent components) → mixing matrix → mixtures (observed signals)]
- 12. Independence-based BSS in time domain • Independent component analysis (ICA) [Comon, 1994] – Maximizes independence between source distributions (minimizes the similarity between them) – Optimization problem in ICA, which uses a non-Gaussian source distribution (e.g., Laplace distribution)
- 13. Independence-based BSS in time domain • Independent component analysis (ICA) [Comon, 1994] – However, • 1. Signal scales (volumes) cannot be determined • 2. Signal permutation cannot be determined [Figures, one per ambiguity: sources (latent components) → mixtures (observed signals) → separated signals (estimated by ICA)]
- 14. Independence-based BSS in frequency domain • General audio mixture – Convolution with room reverberation • To deconvolve (separate) them, – apply short-time Fourier transform (STFT) and convert signals to TF domain – estimate frequency-wise demixing matrix [Figure: mixture without reverb. vs. mixture with reverb.; convolutive mixture in time domain (reverb. length) and mixture in TF domain (freq. index, time index)]
- 15. 15 • Frequency-domain ICA (FDICA) [Smaragdis, 1998] – applies ICA to each of frequencies separately – estimates frequency-wise demixing matrix Inverse matrix Frequency-wise mixing matrix Frequency-wise demixing matrix FDICA : freq. index : time index
- 16. FDICA • Frequency-domain ICA (FDICA) [Smaragdis, 1998] – Optimization problem in FDICA, which uses a non-Gaussian complex-valued source distribution (e.g., circularly symmetric complex Laplace distribution) – By assuming the circularly symmetric complex Laplace distribution, – the minimization problem in FDICA becomes • separable w.r.t. frequency
- 17. 17 • Permutation problem in FDICA – Order of separated signals is messed up – Alignment along the frequency *Signal scales are also messed up, but they can be easily fixed by applying projection back technique. ICA In all frequency Source 1 Source 2 Mixture 1 Mixture 2 Permutation Solver Separated signal 1 Separated signal 2 Time Permutation problem
- 18. 18 Popular permutation solvers • Signal correlation between frequencies – FDICA + correlation-based clustering [Murata+, 2001], [Sawada+, 2011] • Direction of arrival of each source (DOA) – FDICA + DOA-based alignment [Saruwatari+, 2006] • Co-occurrence among frequencies of each source – Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] , [Kim, 2007] • Low-rank TF modeling of each source – Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016] • DNN-based supervised TF modeling of each source – Independent deeply learned matrix analysis (IDLMA) [Makishima+, 2019] • DNN-based permutation solver – Generalized permutation solver with training [Yamaji&Kitamura, 2020] • Spectrogram consistency – Consistent IVA and consistent ILRMA [Yatabe, 2020], [Kitamura+, 2020]
- 19. IVA • Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] – utilizes sourcewise frequency vector as a random variable – Vector source model in IVA • A multivariate distribution with the spherical property groups components that co-occur across all frequencies as one source • Permutation-problem-free estimation of the demixing matrices can be achieved! [Figure: source vector (multivariate distribution with internal correlations) → mixing matrix → observed vector → demixing matrix → estimated vector; co-occurrence of all frequencies in each source over time]
- 20. IVA • Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] – How valid is IVA’s TF structure model? • Typical audio sources have co-occurrence of all frequencies • Can be interpreted as “group sparsity” in TF domain [Figure: spectrograms of a speech source (conversation) and a vocal source (pop music)]
- 21. IVA • Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] – Optimization problem in IVA, which uses a non-Gaussian multivariate and spherical complex-valued source distribution (e.g., spherical Laplace distribution) – By assuming the spherical Laplace distribution [Hiroe, 2006], [Kim, 2006], – the minimization problem in IVA is obtained
- 22. Efficient optimization for IVA • Auxiliary-function-based IVA (AuxIVA) [Ono, 2011] – Fast and stable optimization called iterative projection (IP) • Auxiliary function technique (or majorization-minimization algorithm) – Convergence-guaranteed fast and stable optimization without stepsize parameters • Alternates the update of auxiliary variables and the update of original variables Python code: Implemented in “Pyroomacoustics” library https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.bss.auxiva.html
- 23. Extension of TF structure assumed in IVA [Figure, TF structure in IVA: a single frequency-uniform vector with a time activation] [Figure, TF structure in ILRMA: multiple frequency bases with time activations; the # of bases can be set arbitrarily] To represent more complicated TF structure, NMF modeling can be introduced, resulting in independent low-rank matrix analysis (ILRMA)
- 24. ILRMA • Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016] – assumes each source has a low-rank TF structure – is a unification of • independence-based estimation of demixing matrix (FDICA or IVA) • low-rank TF modeling of each source (NMF) – avoids encountering the permutation problem • TF structure is introduced as in IVA [Figure: observed signal → STFT → frequency-wise demixing matrix → estimated signals → low-rank approximation by NMF; the mixture is not low rank, each estimated source is low rank] Update the demixing matrix so that the estimated signals 1. are mutually independent (ICA) and 2. have low-rank TF structures (NMF)
- 25. 25 • Independent low-rank matrix analysis (ILRMA) – Optimization problem in ILRMA – Convergence-guaranteed update rules • NMF’s multiplicative update • AuxIVA (IP) ILRMA [Kitamura+, 2016] Cost function in FDICA or IVA Estimates frequency-wise demixing matrix Cost function in NMF Estimates low-rank TF structure of each source MATLAB code: https://github.com/d-kitamura/ILRMA Python code: Implemented in “Pyroomacoustics” library
- 26. Contents (outline repeated from slide 3; section: Time-frequency-masking-based BSS (TFMBSS))
- 27. 27 Reformulation of BSS • Cost functions of independence-based BSS – FDICA w/ Laplace – IVA w/ spherical Laplace – ILRMA w/ Itakura-Saito NMF
- 28. 28 Reformulation of BSS • All of them are coming from ICA’s cost • Source generative model – corresponds to TF structure model for each source – is necessary for avoiding the permutation problem • Better assumption of TF structures – provides better BSS performance Freq. Time Low-rank Freq. Time Sparse Freq. Time Group-sparse and more
- 29. Reformulation of BSS • Derivation of optimization algorithm – is problem dependent (depends on TF structure model) – requires technical knowledge and math skills • To try various TF structures in a plug-and-play manner, – let’s reformulate BSS problems in a more general form – then solve it using a TF-structure-independent algorithm [Figure: sparse, low-rank, and group-sparse models plugged into one BSS algorithm]
- 30. Reformulation of BSS • Generalized optimization problem [Yatabe&Kitamura, 2018] – Function P • TF structure model for each source • Often called “source model” in the context of BSS • Replace this function in a plug-and-play manner – Negative log-determinant term • Comes from ICA theory (Jacobian between the observed and estimated signals) • Interpreted as a “barrier function” that keeps the demixing matrices from becoming rank-deficient
- 31. 31 Reformulation of BSS • Generalized optimization problem [Yatabe&Kitamura, 2018] – FDICA w/ Laplace (L1 sparse regularizer) – IVA w/ spherical Laplace (L2,1 group-sparse regularizer) – ILRMA w/ Itakura-Saito NMF (low-rank approximation) Freq. vector
- 32. 32 Reformulation of BSS • Generalized optimization problem [Yatabe&Kitamura, 2018] – But, how? • Apply convex optimization technique – Primal-dual splitting method – Proximity operator • If is “proximable”, then we obtain optimization algorithm! If we change the TF structure model , its optimization algorithm can easily be obtained! Objective [Condat, 2013], [Vu, 2013], [Komodakis+, 2015]
- 33. 33 Primal-dual splitting method • Primal-dual splitting method [Condat, 2013], [Vu, 2013], – considers following problem – Iterative optimization algorithm – Proximity operator • If a proximity operator of can easily be calculated, is called “proximable” [Komodakis+, 2015] Step size parameters and : proper lower-semicontinuous convex function
- 34. BSS using Primal-dual splitting method • Convert BSS to primal-dual-splitting-applicable form – Vectorization of demixing matrices (mat to vec, collect all freqs.) – Matrixization (inverse operation: vec back to matrices) – The determinant is rewritten using the singular values of each demixing matrix
- 35. 35 BSS using Primal-dual splitting method • Convert BSS to primal-dual-splitting-applicable form Introduce vectorized notation ( is a reshaped matrix that includes ) Ready to apply primal-dual splitting! C.f. problem for primal-dual splitting
- 36. 36 BSS using Primal-dual splitting method • General BSS algorithm using primal-dual splitting – Function is always proximable [Yatabe&Kitamura, 2018] Singular value decomposition
- 37. 37 BSS using Primal-dual splitting method • General BSS algorithm using primal-dual splitting – L2,1 Group sparse BSS (IVA) – Nuclear-norm-based low-rank BSS (ILRMA?) Nuclear norm (sum of singular values)
- 38. 38 BSS using Primal-dual splitting method • Multiple TF structures can also be utilized – L2,1 group-sparse + L1 sparse BSS (sparse IVA) – Low-rank + L1 sparse BSS (sparse ILRMA?) Proximable Proximable Proximable Proximable If TF structure models are proximable, you can use them in a plug-and-play manner! Advantage of proposed BSS
- 39. 39 BSS using Primal-dual splitting method • Experiment of two-speech-source BSS – Compare improvement of source-to-distortion ratio (SDR) Mixture A Mixture B Group-sparse Group-sparse + sparse Low-rank + sparse Low-rank Group-sparse Group-sparse + sparse Low-rank + sparse Low-rank
- 40. 40 Interpretation of TF masking • Proximity operators of many sparsity-inducing functions are obtained as thresholding operators – L1 norm: – L2,1 norm: – They have the same form: TF masking to the variable Proximity operator TF mask (0~1 values) determined by TF structure model Variable in TF shape Elementwise product
- 41. TFMBSS • Time-frequency-masking-based BSS (TFMBSS) – Skip designing TF structure model function – TF mask of intended TF structure is employed in the optimization algorithm [Yatabe&Kitamura, 2021] • BSS based on primal-dual splitting method: 1. Design intended TF structure model 2. Calculate proximal operator 3. Optimize the problem • TFMBSS: 1. (skipped) 2. Design intended TF mask 3. Optimize the problem [Yatabe&Kitamura, 2019]
- 42. TFMBSS • Time-frequency-masking-based BSS (TFMBSS) – Intended TF structure model is input to TFMBSS as a TF mask – Demixing matrix is optimized so that the estimated signals have the intended TF structures – Iterative update of TF masks is also interesting [Figure: mixture → STFT → frequency-wise demixing matrix → estimates → enhancement by TF masking] Update the demixing matrix so that the estimated signals have TF structures enhanced by the input TF masks [Yatabe&Kitamura, 2021] [Yatabe&Kitamura, 2019]
- 43. Application of TFMBSS • HPSS-based TFMBSS [Oyabu&Kitamura, 2021] – utilizes TF mask that is obtained via harmonic-percussive sound separation (HPSS) in TFMBSS
- 44. Application of TFMBSS • HPSS-based TFMBSS [Oyabu&Kitamura, 2021] [Figure: estimated percussive and harmonic sounds from the mixture for four methods. Nonlinear, single-channel: optimization-based HPSS [Ono+, 2008] and median-based HPSS [FitzGerald, 2010]. Linear, multichannel: optimization-based HPSS + TFMBSS and median-based HPSS + TFMBSS]
- 45. Contents (outline repeated from slide 3; section: Conclusion)
- 46. Conclusion • Audio BSS with TF structure model – TF structure model is necessary for avoiding the permutation problem • Conventional algorithms (IVA, ILRMA, and so on) – Which TF structure is the best? Trial and error – The optimization algorithm is problem-dependent • Changing TF structure model requires derivation of the algorithm • Proposed generalized BSS using primal-dual splitting – Easy to replace TF structure model • (if the function is “proximable”) – Easy to search the best TF structure for each BSS problem • TFMBSS – Explicitly define TF structure as TF masking

- Hi everyone, thank you for coming to my overview presentation. The title is “Blind Audio Source Separation Based on Time-Frequency Structure Models.”
- First of all, let me introduce myself.
- These are the contents of this talk: background, preliminaries, the main topic, and the conclusion.
- The first topic is background.
- This talk deals with blind source separation (BSS), a technique for separating the individual sources from a recorded mixture. The word “blind” means “unsupervised”. Thus, a BSS method does not require any prior information about the recording conditions or the sources, such as the locations of the microphones and sources, the room geometry, or a training dataset of sound sources. This kind of technique is useful in many applications, for example hearing aid systems, automatic speech recognition, and preprocessing for music analysis.
- This is a demonstration of music BSS using the method called ILRMA. Here we have a mixture signal of three parts, which was recorded using three microphones. Please listen for the three parts: guitar, vocal, and keyboard. OK? Let’s listen. Then, if we apply ILRMA to this multichannel signal, we obtain estimates like these. So, we can remix them, re-edit them, or do anything we want. This is source separation. By the way, the source code of ILRMA is available here, so please check it.
- In BSS for audio signals, the numbers of microphones and sources are important. In this talk, we only consider the “determined” situation, namely, the numbers of microphones and sources are equal. If we want to separate three sources, we have to use three microphones. In the determined situation, the BSS problem becomes the estimation of the demixing system W, which is the inverse of the mixing system A.
- This slide shows a historical overview, restricted to the methods related to this talk. There are three columns: determined, underdetermined, and single-channel. The origin of determined BSS is independent component analysis, ICA. The important methods in this talk are IVA, AuxIVA, and ILRMA. We review this column, from ICA to the newest method, called TFMBSS, from the viewpoint of the time-frequency structure model used in each method.
- Here I explain the motivation of this talk. Conventional determined BSS has two advantages. One is minimum distortion: since these algorithms separate the sources by multiplying frequency-wise demixing matrices, we can avoid artificial distortion as much as possible. The other is fast and stable optimization: a very efficient algorithm called iterative projection was proposed in AuxIVA, and ILRMA inherited this advantage. IVA and ILRMA each assume their own time-frequency structure model. However, if this model does not fit the actual sources in the mixture, the BSS performance is degraded. So, we want to try various TF structure models in BSS, but we need to derive an optimization algorithm for each TF structure model. Motivated by this issue, we propose a new BSS algorithm in which the TF structure model can easily be replaced, so that we can easily search for the best one. This is the main topic of this talk.
- (5 min) The next part is the preliminaries. I am going to review the conventional methods from ICA to ILRMA.
- ICA is a fundamental algorithm for BSS. ICA assumes that the source distributions are mutually independent and non-Gaussian. Also, the mixing system is modeled by a multiplication of mixing matrix A, which is invertible and time-invariant. Based on these assumptions, ICA estimates the demixing matrix W, which is ideally an inverse matrix of A.
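The mixing and demixing model described here can be sketched numerically. This is a minimal illustration, not code from the talk: the mixing matrix `A`, the Laplace-distributed sources, and the sample count are made-up values, and `W` is set to the exact inverse of `A` (which ICA would only estimate, up to scale and order).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two mutually independent, non-Gaussian (Laplace) sources: shape (sources, samples).
s = rng.laplace(size=(2, 1000))

# Hypothetical invertible, time-invariant mixing matrix.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

x = A @ s                 # observed mixtures
W = np.linalg.inv(A)      # ideal demixing matrix (the inverse of A)
y = W @ x                 # separated signals

max_error = np.max(np.abs(y - s))
print("largest deviation from the true sources:", max_error)
```

In practice `A` is unknown, so `W` has to be found from `x` alone using the independence assumption.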
- The estimation theory in ICA is here. ICA minimizes the similarity between these distributions. This is equivalent to a maximization of independence between the separated sources. Since the separated signal y includes the demixing matrix, the optimization problem in ICA can be formulated as this problem, where p(y) is a non-Gaussian source distribution we need to assume. So, we find W that minimizes this function.
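As a rough sketch of such a cost function, the following assumes a unit-scale Laplace source prior, so the negative log-likelihood of each sample is its absolute value plus a constant, and a log-determinant term with weight one. The function name `ica_cost`, the normalization, and the toy data are assumptions for illustration; the slide's exact formula may use different constants.

```python
import numpy as np

def ica_cost(W, x):
    """Negative log-likelihood of y = W x under a unit Laplace prior (constants dropped)."""
    y = W @ x
    neg_log_prior = np.mean(np.sum(np.abs(y), axis=0))   # average over samples
    log_abs_det = np.linalg.slogdet(W)[1]                # log |det W|
    return neg_log_prior - log_abs_det

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])     # hypothetical mixing matrix
x = A @ s

cost_identity = ica_cost(np.eye(2), x)          # no separation at all
cost_inverse = ica_cost(np.linalg.inv(A), x)    # ideal separation
print(cost_identity, cost_inverse)
```

With these toy data the ideal demixing matrix gives the lower cost, which is what makes the cost usable as a separation criterion.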
- However, ICA has two ambiguities: scales and permutation. ICA cannot determine the scales and the order of the estimated signals. In particular, the permutation ambiguity will be a serious problem in an audio BSS problem.
- For audio mixture signals, simple ICA cannot separate the sources. This is because the mixture of audio signals is not a multiplication by A but a convolution with mixing filters, which is due to the room reverberation. To deconvolve the mixture, we apply the short-time Fourier transform and convert the signals to the TF domain. Since convolution in the time domain becomes multiplication in the TF domain, we can apply ICA and estimate a frequency-wise demixing matrix.
- This method is called frequency-domain ICA, FDICA for short. We apply ICA to each frequency separately. Then, we estimate the demixing matrix Wi, where i is the index of frequencies and j is the index of time frames.
- The optimization problem in FDICA is formulated like this, where p(y) is a source distribution in the TF domain. The complex Laplace distribution, shown here, is often used for this assumption, and the minimization problem is then obtained like this.
- However, FDICA encounters a serious problem, the so-called permutation problem. In FDICA, simple ICA is performed in each frequency separately. Therefore, the order of the estimated signals is inconsistent along the frequency axis. Even if we completely separate the sources in each frequency, we have to align their order along the frequency axis. Several permutation solvers have been proposed so far.
- Here I list popular permutation solvers. Before 2006, the permutation solver was a post-processing step, (go back) as shown in this figure, which uses the correlation between frequencies or the direction of arrival. Then, independent vector analysis, IVA, and independent low-rank matrix analysis, ILRMA, were proposed. These methods unify ICA and the permutation solver.
- From this slide, we review the important BSS algorithms, IVA and ILRMA, from the viewpoint of the TF structure models. IVA is a multivariate extension of FDICA: it uses the sourcewise frequency vector as a random variable to unify all the frequency components in the ICA estimation. IVA assumes a joint distribution of all the frequency components as the source distribution p(s). In addition, this distribution p(s) has an inner structure, a co-occurrence of all the frequency components. This model is called the “spherical property” of a multivariate distribution, but anyway, IVA assumes the co-occurrence of all the frequency components in the same source, which is depicted in this figure. By assuming this TF structure for each source, Wi is estimated so that the permutation problem does not arise. (10 min)
- The question is how valid IVA’s TF structure model is. Here I show the time-frequency powers of speech and vocal sources. As you can see, typical audio sources have co-occurrence of all the frequencies when the source is active, so IVA’s assumption seems to be valid. Also, this structure can be interpreted as group sparsity in the TF domain.
- The optimization problem in IVA can be defined like this, and the joint distribution p enforces the previous TF structure by assuming a spherical distribution here. For example, when we assume a spherical Laplace distribution, this model, the minimization problem in IVA becomes as shown at the bottom. In the original IVA paper, this problem was optimized by a simple gradient descent, but
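Applying one demixing matrix per frequency bin can be sketched with a single `einsum`. The array layout (frequency, frame, channel) and the random test data are assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, n_ch = 4, 6, 2   # toy sizes: i = frequency, j = time frame

# Observed multichannel STFT X[i, j, :] and one demixing matrix W[i] per frequency.
X = rng.standard_normal((n_freq, n_frames, n_ch)) \
    + 1j * rng.standard_normal((n_freq, n_frames, n_ch))
W = rng.standard_normal((n_freq, n_ch, n_ch)) \
    + 1j * rng.standard_normal((n_freq, n_ch, n_ch))

# y_ij = W_i x_ij for every frequency i and frame j.
Y = np.einsum("inm,ijm->ijn", W, X)

print(Y.shape)
```

Because each `W[i]` is estimated independently in FDICA, nothing ties the source order in one frequency to the order in another.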
- in 2011, an efficient update algorithm for IVA was proposed, which is called AuxIVA. It provides elegant update rules called iterative projection, IP, which established convergence-guaranteed fast optimization without step-size parameters. This graph shows the value of the cost function against the number of iterations. AuxIVA sufficiently converges in fewer than 20 updates. I play the sound demo of AuxIVA.
- In 2016, we extended the TF structure model in IVA to a richer one. IVA assumes the uniform co-occurrence of all the frequencies. This can be considered a rank-1 time-frequency structure, namely, a frequency-uniform vector is activated along the time axis. As we have already shown, this model is valid for typical audio signals, but it may be too simple because audio sources have a harmonic frequency structure. To represent more complicated TF structures, we proposed independent low-rank matrix analysis, ILRMA, which employs NMF modeling as the TF structure. In ILRMA, the single uniform frequency vector in IVA is extended to multiple, more complicated vectors, and a more accurate spectrogram can be modeled as a low-rank matrix. Such an accurate TF model will improve the estimation performance of the frequency-wise demixing matrices.
- ILRMA assumes that each source has a low-rank TF structure, whereas mixing the sources increases the rank of the mixture spectrogram. Thus, by enforcing the low-rankness of each estimated signal in the TF domain, the demixing matrix can avoid encountering the permutation problem, and a richer TF structure model than that of IVA will improve the BSS performance. (14 min)
- The optimization problem in ILRMA is shown here. We find Wi and the NMF variables Tn and Vn that minimize this cost function. (click) The first and second terms of this function coincide with the cost function in NMF, (click) and the second and third terms coincide with the cost function in FDICA or IVA. (click) Thus, we can alternate the NMF update rules and the IP-based update of the demixing matrix. Convergence of this iteration is theoretically guaranteed. This graph shows the behavior of the cost function value. ILRMA converges in fewer than 100 iterations. Let’s play the sample sounds. This result is better than that of IVA. (about 15 min)
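The iterative projection update can be sketched as follows. This is a simplified reading of AuxIVA with a spherical Laplace source model, not the Pyroomacoustics implementation: the function names, the cost normalization (log-determinant weight of one), the weight 1/r, and the synthetic test mixture are all assumptions made for this sketch.

```python
import numpy as np

def demix(W, X):
    # y_ij = W_i x_ij; X has shape (freq, frames, channels).
    return np.einsum("inm,ijm->ijn", W, X)

def iva_cost(W, X):
    # Spherical Laplace cost: mean frame-wise vector norm minus log|det W_i|.
    r = np.sqrt(np.sum(np.abs(demix(W, X)) ** 2, axis=0))     # (frames, sources)
    return np.sum(np.mean(r, axis=0)) - np.sum(np.linalg.slogdet(W)[1])

def auxiva_ip(X, n_iter=20, eps=1e-10):
    n_freq, n_frames, n_src = X.shape
    W = np.tile(np.eye(n_src, dtype=complex), (n_freq, 1, 1))
    for _ in range(n_iter):
        for n in range(n_src):
            # Auxiliary variable: norm of the n-th source over all frequencies.
            r = np.sqrt(np.sum(np.abs(demix(W, X)[:, :, n]) ** 2, axis=0))
            weight = 1.0 / np.maximum(r, eps)
            # Weighted covariance V_i = (1/J) sum_j weight_j x_ij x_ij^H.
            V = np.einsum("j,ija,ijb->iab", weight, X, X.conj()) / n_frames
            # Iterative projection: w = (W_i V_i)^{-1} e_n, then normalize.
            e_n = np.zeros((n_freq, n_src, 1), dtype=complex)
            e_n[:, n, 0] = 1.0
            w = np.linalg.solve(W @ V, e_n)[:, :, 0]
            scale = np.sqrt(np.real(np.einsum("ia,iab,ib->i", w.conj(), V, w)))
            w /= np.maximum(scale, eps)[:, None]
            W[:, n, :] = w.conj()        # rows of W_i store w^H
    return W

# Synthetic determined mixture: sources share a time activation across frequencies.
rng = np.random.default_rng(0)
n_freq, n_frames, n_src = 8, 200, 2
activation = rng.gamma(0.5, size=(1, n_frames, n_src))
S = activation * (rng.standard_normal((n_freq, n_frames, n_src))
                  + 1j * rng.standard_normal((n_freq, n_frames, n_src)))
A = rng.standard_normal((n_freq, n_src, n_src)) \
    + 1j * rng.standard_normal((n_freq, n_src, n_src))
X = np.einsum("inm,ijm->ijn", A, S)

W_init = np.tile(np.eye(n_src, dtype=complex), (n_freq, 1, 1))
cost_before = iva_cost(W_init, X)
W_est = auxiva_ip(X)
cost_after = iva_cost(W_est, X)
print(cost_before, cost_after)
```

Note that no step size appears anywhere: each update solves a small linear system per frequency, which is the property the talk highlights.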
- Let’s move on to the main topic of this talk.
- So far, we showed the cost functions of FDICA, IVA, and ILRMA, which are listed in this slide. We can see that they have the similar forms. This is because
- all of them are coming from the original ICA’s cost function, this one, and the difference is just an assumption of the source distribution p(Y), which is often called source generative model. This generative model corresponds to the TF structure model for each source, and this model is necessary for avoiding the permutation problem. Of course, better assumption of TF structures provides better BSS performance, but the suitable TF structure model depends on the type of sources, such as speech, music, harmonic source, percussive source, noise source, and so on. Therefore, we have to search the best TF structure model with a try-and-error approach.
- However, in the conventional method, it is difficult to replace the TF structure model because we have to derive the optimization algorithm, which requires technical knowledges and math skills. If we derive a general BSS algorithm, and if we can replace the TF structure model in a plug-and-play manner, it is very useful to search the best model for each problem. So, to try various TF structure models in a “plug-and-play manner”, first, we reformulate the BSS problem in a more general form. Then, we solve it using a TF-structure-independent algorithm. 17分
- This problem is our proposed generalized BSS problem, which includes FDICA, IVA, and ILRMA. The function P(W, X) corresponds to the TF structure model we assume, which is often called the source model. By replacing the function P, we can try various TF structure models. The negative log-determinant term is coming from an original ICA theory. We can interpret this function as a “barrier function” avoiding to be rank-deficient of Wi. If Wi becomes a rank-deficient matrix, its determinant becomes zero, and this term becomes infinity. So, we can avoid such solution in the optimization. 18分
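The structure of this generalized cost, a pluggable penalty plus a log-determinant barrier, can be sketched as below. The function name `generalized_bss_cost`, the unit weight on the log-determinant term, and the toy penalty are assumptions for illustration; the slide's formula may scale the terms differently.

```python
import numpy as np

def generalized_bss_cost(W, X, penalty):
    """penalty(Y) plays the role of the TF structure model P; W has shape (freq, N, N)."""
    Y = np.einsum("inm,ijm->ijn", W, X)          # estimated signals
    barrier = -np.sum(np.linalg.slogdet(W)[1])   # -sum_i log|det W_i|
    return penalty(Y) + barrier

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 50, 2)) + 1j * rng.standard_normal((4, 50, 2))

# Any function of the estimated spectrograms can be plugged in; here an L1-type penalty.
l1_penalty = lambda Y: np.mean(np.sum(np.abs(Y), axis=(0, 2)))

W_identity = np.tile(np.eye(2, dtype=complex), (4, 1, 1))
W_singular = np.tile(np.ones((2, 2), dtype=complex), (4, 1, 1))   # rank-deficient

cost_identity = generalized_bss_cost(W_identity, X, l1_penalty)
cost_singular = generalized_bss_cost(W_singular, X, l1_penalty)
print(cost_identity, cost_singular)
```

The rank-deficient matrices give an infinite cost, which is the barrier behavior described above.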
- For the conventional BSS algorithm, the function P(W, X) corresponds to these functions, respectively. FDICA corresponds to an L1-norm sparse regularizer, and IVA is an L2,1-norm group-sparse regularizer. ILRMA is a little bit difficult, but still we can represent it using an argument minimum as shown here, where DIS is an Itakura-Saito divergence.
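The two norm-based penalties can be written down directly. In this sketch the estimated spectrograms are stored as an array `Y` of shape (frequency, frame, source), and the groups of the L2,1 norm are the frequency vectors of each source in each frame; both the layout and the function names are assumptions for illustration.

```python
import numpy as np

def l1_penalty(Y):
    # FDICA-type model: every TF bin is penalized independently (sparsity).
    return np.sum(np.abs(Y))

def l21_penalty(Y):
    # IVA-type model: all frequencies of one source in one frame form a group.
    return np.sum(np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)))

Y = np.ones((4, 3, 2), dtype=complex)   # toy "spectrograms"
l1_value = l1_penalty(Y)
l21_value = l21_penalty(Y)
print(l1_value, l21_value)
```

The L2,1 penalty couples the frequencies inside each group, which is how the co-occurrence assumption of IVA enters the cost.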
- The objective of this reformulation is that / if we change the TF structure model P, its optimization algorithm can easily be obtained. This is because we want to establish a new BSS algorithm with plug-and-play TF structure models. But the question is, how can we do that? The idea is coming from a convex optimization field. We utilize an algorithm called “primal-dual splitting method”. In this algorithm, we need a proximity operator of the function P. The function whose proximity operator can easily be calculated / is called “proximable”. So, if the TF structure model P is proximable, we can obtain the optimization algorithm for this generalized BSS problem.
- Primal-dual splitting method considers this problem. Minimize the vector w for the function g(w) + h(Lw), where L is just a matrix. This minimization can be solved by this iterative optimization algorithm. This is a primal-dual splitting method. In the first line, we calculate the proximity operator of the function g with this input. Then, the second line calculates the new input z, and in the third line, we calculate the proximity operator of the function h with the input z. By iterating these three steps, we can minimize this cost function. Prox is a regularized minimization of the function f in the neighborhood of input x, which always has a unique solution. We do not dive into the details of this algorithm in this overview, but you can referrer some papers to know the theory of the method. The important point is that, we can use any function P, any TF structure P if the functions P are all proximable. We just switch the proximity operator of P according to the recipe of well-known proximity operators of popular functions. 21分
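A generic three-step iteration of this kind can be sketched on a toy problem. The code below uses a standard textbook form of primal-dual splitting for minimizing g(w) + h(Lw), which may place the step sizes differently from the slide; the toy functions, the step sizes `tau` and `sigma`, and all names are assumptions. The toy problem (quadratic g, L1-norm h, identity L) is chosen because its solution is known in closed form as soft thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def primal_dual_splitting(prox_g, prox_h, L, w0, tau, sigma, n_iter=500):
    """prox_g(v, t) and prox_h(v, t) return the proximity operators of t*g and t*h at v."""
    w = w0.copy()
    y = np.zeros(L.shape[0])
    for _ in range(n_iter):
        w_new = prox_g(w - tau * (L.T @ y), tau)          # step 1: prox of g
        z = y + sigma * (L @ (2.0 * w_new - w))           # step 2: new input z
        y = z - sigma * prox_h(z / sigma, 1.0 / sigma)    # step 3: prox of h
        w = w_new
    return w

# Toy problem: minimize 0.5 * ||w - b||^2 + lam * ||w||_1.
b = np.array([3.0, -0.5, 1.2, -2.0])
lam = 1.0
L = np.eye(4)

prox_g = lambda v, t: (v + t * b) / (1.0 + t)        # prox of t * 0.5 * ||. - b||^2
prox_h = lambda v, t: soft_threshold(v, t * lam)     # prox of t * lam * ||.||_1

# Step sizes chosen so that tau * sigma * ||L||^2 < 1.
w_pds = primal_dual_splitting(prox_g, prox_h, L, np.zeros(4), tau=0.9, sigma=0.9)
w_closed_form = soft_threshold(b, lam)
print(w_pds, w_closed_form)
```

Only `prox_g` and `prox_h` depend on the problem, so replacing h by another proximable function changes one line and leaves the loop untouched.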
- The goal is to convert this minimization problem into a form to which the primal-dual splitting method is applicable. So, we convert this function (go back) to this. As a first step, we rewrite the determinant of Wi in terms of its singular values sigma using this equation. Next, we vectorize the demixing matrices Wi with this computation, where V is a linear operator converting the matrices Wi into a vector w. We also define the inverse operation M, namely, a linear operator converting the vector w back into the matrices Wi.
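Both steps can be checked numerically. The sketch below assumes the demixing matrices are stacked as a (frequency x source x source) array; the element ordering used by `vec` and `mat` is an arbitrary illustrative choice, since the slides' exact definition of V and M is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_src = 4, 2
W = (rng.standard_normal((n_freq, n_src, n_src))
     + 1j * rng.standard_normal((n_freq, n_src, n_src)))

def vec(W):
    """V: stack all demixing matrices into one long vector (assumed ordering)."""
    return W.reshape(-1)

def mat(w, n_freq, n_src):
    """M: inverse of vec, back to one matrix per frequency bin."""
    return w.reshape(n_freq, n_src, n_src)

# |det W_i| equals the product of the singular values of W_i, so the
# log-determinant term becomes a sum of logs of singular values.
neg_logdet = -np.sum(np.log(np.abs(np.linalg.det(W))))
sigma = np.linalg.svd(W, compute_uv=False)  # shape (n_freq, n_src)
neg_logsigma = -np.sum(np.log(sigma))
W_roundtrip = mat(vec(W), n_freq, n_src)
```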
- By introducing the vectorization, we get this function. It's almost there. Then, we define I(w) like this, and we are ready to apply the primal-dual splitting method, because we now have the same form as the problem it considers.
- In summary, we defined the general BSS algorithm as this minimization problem, and we can optimize it using the primal-dual splitting method. The algorithm is shown here. We have the proximity operator of a new function I in this line. I(w) is a sum of the logarithms of singular values. The proximity operators of the logarithm function and of functions of singular values are well-known. Thus, we can easily obtain the proximity operator of I(w), as shown at the bottom of this slide.
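A minimal sketch of such a proximity operator for one demixing matrix follows. It assumes the term has the form -sum(log sigma) scaled by a step size `mu`; the sign, the scaling, and the function names are assumptions, and the formula on the slide may be written differently. It uses two standard facts: the proximity operator of -mu*log is the positive root of a quadratic, and for a function of singular values the operator acts on the singular values while keeping the singular vectors.

```python
import numpy as np

def prox_neg_log(x, mu):
    """prox of -mu*log(.): positive root of p**2 - x*p - mu = 0."""
    return 0.5 * (x + np.sqrt(x ** 2 + 4.0 * mu))

def prox_neg_logdet(W, mu):
    """prox of -mu * sum(log(singular values of W)) for one square matrix."""
    U, s, Vh = np.linalg.svd(W)
    # U * s_new scales the columns of U, i.e. U @ diag(s_new).
    return (U * prox_neg_log(s, mu)) @ Vh

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
mu = 0.7
W_new = prox_neg_logdet(W, mu)
s_old = np.linalg.svd(W, compute_uv=False)
s_new = np.linalg.svd(W_new, compute_uv=False)
```

Every singular value is pushed upward, which keeps the demixing matrix away from singularity.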
- OK, let us see how IVA and ILRMA are defined in this BSS formulation. The TF structure assumed in IVA is group sparseness, which can be defined as the L2,1 norm of the estimated spectrogram Yn. So, we replace the function P with the L2,1 norm, and we do not have to re-derive the algorithm. The proximity operator of the L2,1 norm is obtained like this, so we use this calculation in the third line of the algorithm. Next, ILRMA assumes a low-rank TF structure by applying NMF to the estimated spectrogram Yn. Instead of NMF, we use a nuclear norm to represent the low-rank regularization. Again, the proximity operator of the nuclear norm is well-known. We can obtain the optimization algorithm by replacing the third line with this calculation. From this, we can see that the proposed algorithm can handle various TF structures in a unified manner, which is very useful when searching for the best TF structure.
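The two proximity operators mentioned here have standard closed forms, sketched below for one spectrogram stored as a (frequency x time) array. The grouping of each time frame as one group, the weight `lam`, and the function names are assumptions for illustration.

```python
import numpy as np

def prox_l21(Z, lam):
    """Group soft thresholding: each column (time frame) is one group."""
    norms = np.sqrt(np.sum(np.abs(Z) ** 2, axis=0, keepdims=True))
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * Z

def prox_nuclear(Z, lam):
    """Singular value soft thresholding (prox of lam * nuclear norm)."""
    U, s, Vh = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vh

Z = np.array([[3.0, 0.3], [4.0, 0.4]])
Z_group = prox_l21(Z, 1.0)    # frame norms are 5 and 0.5
Z_lowrank = prox_nuclear(Z, 1.0)
```

In the toy input the second frame has norm 0.5, below the threshold, so group thresholding removes the whole frame at once rather than bin by bin.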
- In addition, multiple TF structures can also be utilized. For example, group-sparse + sparse BSS can be defined like this function, which can be interpreted as a sparse IVA. Since these functions are both proximable, we can obtain the optimization algorithm. As another example, low-rank + sparse BSS can also be defined as a sparse ILRMA, like this problem. As you can see, the important point is that, when you want to utilize a new TF structure model P, you check whether P is proximable. If P is proximable, you can use it in the proposed BSS algorithm in a plug-and-play manner. This is a strong advantage of the proposed BSS. (25.5 min)
- These graphs show the BSS performance for two-speech mixtures with AuxIVA and various TF structures. The vertical axis shows the SDR improvement, which indicates the separation performance, and the horizontal axis shows the number of iterations of each algorithm. Since the group-sparse model is equivalent to the IVA model, it provides exactly the same performance at convergence. The low-rank model is similar to ILRMA, and the group-sparse + sparse model is a sparsity-induced IVA. Also, low-rank + sparse is a sparse version of ILRMA. Again, we can easily compare which TF structure model is the best for speech source separation. In this experiment, the low-rank + sparse model provides the best performance for both mixture samples. (26.5 min)
- We have also extended the proposed BSS algorithm to a more explicit formulation: we do not assume a function P, but directly introduce a TF mask as the intended TF structure. Let me explain this extension as the final topic of this talk. It is known that the proximity operators of many sparsity-inducing functions are obtained as thresholding operators. For example, the prox of the L1 norm is obtained like this, and this calculation is soft thresholding of the input variable because this term takes a value between 0 and 1. The prox of the L2,1 norm is also soft thresholding. Since the input vector z includes the spectrograms of the estimated signals, this element-wise soft thresholding can be interpreted as time-frequency soft masking. Namely, the calculation of the proximity operator, (go back) the third line of the algorithm, just applies a TF soft mask defined by the intended TF model and the current optimization variable z. This fact tells us that we don’t have to design a TF structure function P. All we have to do is design a TF mask for the intended TF structure. (28 min)
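The masking interpretation can be seen directly by writing the L1 proximity operator as "mask times input". This is a minimal sketch for complex spectrogram values; the weight `lam` and the function names are illustrative.

```python
import numpy as np

def l1_soft_mask(Z, lam):
    """Soft mask implied by the L1 prox: max(1 - lam/|z|, 0), always in [0, 1]."""
    mag = np.abs(Z)
    return np.maximum(1.0 - lam / np.maximum(mag, 1e-12), 0.0)

def prox_l1(Z, lam):
    """Soft thresholding written as element-wise masking of the input."""
    return l1_soft_mask(Z, lam) * Z

Z = np.array([[2.0 + 0.0j, 0.0 + 0.5j],
              [-4.0 + 0.0j, 1.0 + 0.0j]])
mask = l1_soft_mask(Z, 1.0)
Z_thresholded = prox_l1(Z, 1.0)
```

The mask depends only on magnitudes, so the phase of each bin is kept and only its gain is reduced.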
- From this motivation, we proposed time-frequency-masking-based BSS, or TFMBSS for short. The difference between the previous general BSS and TFMBSS is shown here. In the previous algorithm, we had to design the TF model function P and then obtain its proximity operator. In TFMBSS, we skip designing the function P and directly design the intended TF mask. Therefore, we do not care about what kind of cost function is minimized in this algorithm.
- This figure shows the concept of TFMBSS. We input TF masks as the TF structure model, and the demixing matrix is optimized so that the estimated signals have the intended TF structures.
- Let me introduce one application of TFMBSS. We utilized a well-known music BSS algorithm called harmonic-percussive sound separation, HPSS, to accurately separate drum sounds from the other musical instruments. In this method, we apply HPSS to the temporary estimated signals Zharmonic and Zpercussive independently and produce the masks in a Wiener-filtering manner. These masks are input to TFMBSS as the TF structure model. This process is iterated until it converges, so HPSS is performed twice in each iteration of TFMBSS.
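As a rough illustration of how HPSS outputs can be turned into Wiener-like masks, the sketch below uses median-filtering HPSS on a single magnitude spectrogram. This is a swapped-in, generic technique: the talk does not state which HPSS variants were used or the exact mask formula, and here only one signal is processed rather than the two temporary estimates. The filter length `size`, the helper names, and the (frequency x time) layout are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(A, size, axis):
    """Running median of odd length `size` along one axis, edge-padded."""
    pad = size // 2
    widths = [(0, 0), (0, 0)]
    widths[axis] = (pad, pad)
    padded = np.pad(A, widths, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

def hpss_wiener_masks(Z, size=5, eps=1e-12):
    """Return (harmonic_mask, percussive_mask) for a (freq x time) spectrogram."""
    mag = np.abs(Z)
    harm = median_filter_1d(mag, size, axis=1)  # smooth along time
    perc = median_filter_1d(mag, size, axis=0)  # smooth along frequency
    denom = harm ** 2 + perc ** 2 + eps
    return harm ** 2 / denom, perc ** 2 / denom

# Toy spectrogram: one sustained tone (row 2) and one click (column 6).
Z = np.zeros((9, 9))
Z[2, :] = 1.0
Z[:, 6] = 1.0
mask_h, mask_p = hpss_wiener_masks(Z)
```

Horizontal ridges survive the median filter along time and vertical ridges survive the one along frequency, so the two masks pick out sustained and impulsive components respectively.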
- This is a demonstration. We utilized two types of HPSS. Since HPSS is a single-channel nonlinear algorithm, artificial distortion may arise. If we have a multichannel observation, we can use these HPSS algorithms in TFMBSS and achieve linear, distortionless separation. The red cells are the harmonic estimates, and the blue ones are the percussive estimates. (Play audio.) As you can see, TFMBSS provides better separation.
- This is the conclusion.