- 1. Relaxation of Rank-1 Spatial Constraint in Overdetermined Blind Source Separation. Daichi Kitamura (SOKENDAI), Nobutaka Ono (NII/SOKENDAI), Hiroshi Sawada (NTT), Hirokazu Kameoka (The Univ. of Tokyo/NTT), Hiroshi Saruwatari (The Univ. of Tokyo). EUSIPCO 2015, 2 Sept., 14:30 - 16:10, SS30 Acoustic scene analysis using microphone array
- 2. Research Background • Blind source separation (BSS) – Estimation of original sources from the mixture signal – We only focus on overdetermined situations • Number of sources ≤ Number of microphones • Ex) Independent component analysis, independent vector analysis • Applications of BSS – Acoustic scene analysis, speech enhancement, music analysis, reproduction of sound field, etc. [Figure: original sources → unknown mixing system → observation (mixture) → BSS → estimated sources]
- 3. Problems and Motivations • For reverberant signals – ICA-based methods cannot separate sources well because a linear time-invariant mixing system (instantaneous mixing in the time-frequency domain) is assumed – When the number of microphones is greater than the number of sources, PCA is often applied before BSS to remove weak (reverberant) components of all the sources • Reverberation is also important information to analyze acoustic scenes – We should separate the sources with their own reverberations. [Figure: original sources → mixing → observed signals → PCA → dimension-reduced signals → BSS → estimated sources]
- 4. Conventional Methods (1/4) • Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006] – assumes independence between source vectors – assumes linear time-invariant mixing system • The mixing system can be represented by a mixing matrix in each frequency bin. – can efficiently be optimized [Ono, 2011] [Figure: original sources → mixing matrices → observed signals → demixing matrices → estimated sources]
- 5. Conventional Methods (2/4) • Nonnegative matrix factorization (NMF) [Lee, 2001] – decomposes a spectrogram into spectral bases – Decomposed bases should be clustered into each source. • Very difficult problem – Multichannel extension of NMF has been proposed. [Figure: observed matrix (power spectrogram, frequency × time) ≈ basis matrix (spectral patterns, frequency × basis) × activation matrix (time-varying gains, basis × time); the dimensions are the number of frequency bins, the number of time frames, and the number of bases]
- 6. Conventional Methods (3/4) • Multichannel NMF (MNMF) [Ozerov, 2010], [Sawada, 2013] [Figure: the multichannel observation (multichannel vector) gives time-frequency-wise channel correlations (instantaneous covariance), which are decomposed into a spatial model (source-frequency-wise spatial covariances, cluster-indicator) and a source model (basis matrix of spectral patterns, activation matrix of gains)]
- 7. Conventional Methods (4/4) • MNMF with rank-1 spatial model (Rank-1 MNMF) [Kitamura, ICASSP 2015] – Spatial model can be optimized by IVA – Source model (basis matrix and activation matrix) can be optimized by simple NMF • We can optimize all the variables using update rules of IVA and simple NMF [Figure: the same decomposition as MNMF (time-frequency-wise channel correlations into source-frequency-wise spatial covariances, cluster-indicator, basis matrix of spectral patterns, activation matrix of gains), with the spatial covariances modeled by rank-1 matrices (constraint) = linear mixing assumption, as in IVA]
- 8. Rank-1 Spatial Constraint • Rank-1 spatial constraint = Linear mixing assumption – Instantaneous mixture in a time-frequency domain – Mixing system can be represented by a mixing matrix • Assumptions: 1. Sources can be modeled as point sources 2. Reverberation time is shorter than FFT length [Figure: observed spectrogram (frequency × time); observed signal = time-invariant mixing matrix × source signal]
- 9. Problem of Rank-1 Spatial Model • When reverberation time is longer than FFT length, – the impulse response becomes long – reverberant components leak into the next time frame • Mixing system cannot be represented by using only the mixing matrix Ai. The separation performance markedly degrades. [Figure: observed spectrogram (frequency × time); the observed signal contains the source signal and the leaked components]
- 10. Summary of Conventional Methods • MNMF [Ozerov, 2010], [Sawada, 2013] – Full-rank spatial model • does not use rank-1 spatial constraint – high computational cost – strong dependence on initial values • IVA [Hiroe, 2006], [Kim, 2006] & Rank-1 MNMF [Kitamura, 2015] – Rank-1 spatial constraint (linear mixing assumption) • Separation performance degrades for the reverberant signals – Faster and more stable optimization • To achieve good and stable separation even for the reverberant signals, relax the rank-1 spatial constraint while maintaining efficient optimization
- 11. Proposed Approach • Dimensionality reduction with principal component analysis (PCA) – removes reverberant components of all the sources by PCA – But the reverberant components are important! • Utilize extra observations to model direct and reverberant components simultaneously. – M microphones for N sources, where M = PN [Figure: Ex. 2 sources, 4 microphones (P = 2): original sources → mixing → observed signals → PCA → dimension-reduced signals → BSS → estimated sources]
- 12. Proposed Approach • Utilize extra observations to model direct and reverberant components simultaneously. – M microphones for N sources, where M = PN [Figure: Ex. 2 sources, 4 microphones (P = 2): original sources → mixing → observed signals → BSS (IVA or Rank-1 MNMF) → separated components → reconstruction → estimated sources]
- 13. Proposed Approach • Utilize extra observations to model direct and reverberant components simultaneously. – M microphones for N sources, where M = PN • We assume independence not only between sources but also between the direct and reverberant components of the same source. [Figure: Ex. 2 sources, 4 microphones (P = 2): original sources → mixing → observed signals → BSS → separated components (direct and reverb. for each source) → reconstruction → estimated sources]
- 14. Clustering of Separated Components • Permutation problem of separated components – Order of separated components depends on initial values – Which separated components belong to which source? • We propose two methods to cluster the components – 1. Using cross-correlations for IVA – 2. Sharing basis matrices for Rank-1 MNMF [Figure: separated components in unknown order]
- 15. Clustering of Separated Components • Permutation problem of separated components – Order of separated components depends on initial values • We propose two methods to cluster the components – 1. Using cross-correlations for IVA – 2. Sharing basis matrices for Rank-1 MNMF [Figure: separated components → clustering → clustered components (direct and reverb. components of source 1, direct and reverb. components of source 2) → reconstruction → estimated sources]
- 16. Clustering Using Spectrogram Correlation • Direct and reverberant components of the same source have a strong cross-correlation. • Cross-correlation of two power spectrograms – Calculate it for all combinations of separated components – Merge the components in descending order of the cross-correlation [Figure: power spectrograms of two separated components]
- 17. Auto-Clustering by Sharing Basis Matrix • Direct and reverberant components can be modeled by the same bases (spectral patterns) • Estimate signals with Basis-Shared Rank-1 MNMF – Only for Rank-1 MNMF • because IVA doesn’t have NMF source model – By imposing basis-shared source model, Rank-1 MNMF can automatically cluster the components. [Figure: source model of Basis-Shared Rank-1 MNMF: the direct and reverb. components of source 1 use the shared basis matrix for source 1, and those of source 2 use the shared basis matrix for source 2; separated components → reconstruction → estimated sources]
- 18. Experiments • Conditions – Original source: professionally-produced music signals from SiSEC database; JR2 impulse response in RWCP database is used; two sources and four microphones – Sampling frequency: down sampled from 44.1 kHz to 16 kHz – FFT length in STFT: 8192 points (128 ms, Hamming window) – Shift length in STFT: 2048 points (64 ms) – Number of bases: 15 bases for each source (30 bases for all the sources) – Number of iterations: 200 – Number of trials: 10 times with various seeds of random initialization – Evaluation criterion: average SDR improvement and its deviation [Figure: recording setup for the JR2 impulse response: reverberation time 470 ms, distance label 2 m, source 1 and source 2 with angle labels 80 and 60, microphone spacing 2.83 cm]
- 19. Experiments • Compared methods (7 methods) – PCA + 2ch IVA (conventional) • Apply PCA before IVA – PCA + 2ch Rank-1 MNMF (conventional) • Apply PCA before Rank-1 MNMF – 4ch IVA + Clustering (proposed) • Apply IVA without PCA, and cluster the components – 4ch Basis-Shared Rank-1 MNMF (proposed) • Apply Basis-Shared Rank-1 MNMF without PCA – 4ch MNMF-based BF (beam forming) (conventional) • Apply maximum SNR beam forming (time-invariant filtering) using full-rank covariance estimated by 4ch MNMF – 4ch MNMF (conventional) • Apply conventional MNMF (full-rank model), and apply multichannel Wiener filtering (time-variant filtering) – Ideal time-invariant filtering (reference score) • The upper limit of time-invariant filtering (supervised)
- 20. Experiments • Results (song: ultimate_nz_tour__snip_43_61) – Source 1: Guitar – Source 2: Vocals [Figure: bar chart of SDR improvement [dB] for source 1 and source 2. PCA + 2ch IVA and PCA + 2ch Rank-1 MNMF: rank-1 spatial model, time-invariant filter (1/src). 4ch IVA + Clustering and 4ch Basis-Shared Rank-1 MNMF: rank-1 spatial model, time-invariant filter (2/src). 4ch MNMF-based BF: full-rank model, time-invariant filter (1/src). 4ch MNMF: full-rank spatial model, time-variant filter (1/src). Ideal time-invariant filtering (supervised): upper limit of time-invariant filter (1/src)]
- 21. Experiments • Results (song: bearlin-roads__snip_85_99) – Source 1: Acoustic guitar – Source 2: Piano [Figure: bar chart of SDR improvement [dB] for source 1 and source 2, comparing PCA + 2ch IVA, PCA + 2ch Rank-1 MNMF, 4ch IVA + Clustering, 4ch Basis-Shared Rank-1 MNMF, 4ch MNMF-based BF, 4ch MNMF, and ideal time-invariant filtering (supervised)]
- 22. Experiments • Comparison of computational times – Conditions • CPU: Intel Core i7-4790 (3.60GHz) • MATLAB 8.3 (64-bit) • Song: ultimate_nz_tour__snip_43_61 (18s, 16kHz sampling) – Results • PCA + 2ch IVA: 23.4 s • PCA + 2ch Rank-1 MNMF: 29.4 s • 4ch IVA + Clustering: 60.1 s • 4ch Basis-Shared Rank-1 MNMF: 143.9 s (about 2.4 min) • 4ch MNMF: 3611.8 s (about 1 h) – The proposed methods achieve efficient optimization compared with MNMF (the performance is comparable with MNMF)
- 23. Conclusion • For the case of reverberant signals – Achieve both good performance and efficient optimization • The proposed method – can be applied when the number of microphones is greater than twice the number of sources – separately estimates direct and reverberant components utilizing extra observations – can be thought of as a relaxation of rank-1 spatial constraint • Experimental results show better performance – The proposed method outperforms the upper limit of time-invariant filtering in some cases • Thank you for your attention!

- Blind source separation is a technique to estimate original sources from the observed mixture signal, where the mixing system is unknown. Therefore, we cannot use any information about the recording environment or the locations of the sources and microphones. In this presentation, we only focus on the overdetermined situations, which means the number of sources is equal to or smaller than the number of microphones. As you know, independent component analysis is a very famous method for overdetermined BSS. There are many applications for BSS. A major one is acoustic scene analysis, because we have to separate the sources before we analyze the observations.
- However, for the reverberant signals, ICA-based methods cannot separate the sources well. This is because these methods assume a linear time-invariant mixing system, which is an instantaneous mixing in the time-frequency domain. Also, when the number of microphones is greater than the number of sources, PCA is often applied before BSS. This process is expected to remove the reverberant components of all the sources and to make the linear time-invariant assumption valid. However, reverberation is also important information to analyze acoustic scenes. Therefore, we should separate the sources with their own reverberations.
- Let me introduce some conventional methods of BSS. The first one is independent vector analysis, IVA. This is an extension of frequency-domain ICA. IVA assumes independence between the source vectors. In addition, IVA assumes a linear time-invariant mixing system. Under this assumption, the mixing system can be represented by the mixing matrix A in each frequency bin. Recently, an efficient optimization scheme for IVA has been proposed by Prof. Ono.
- Another famous method is NMF. NMF decomposes a power spectrogram into two nonnegative matrices, T and V. T is the basis matrix, which contains spectral patterns, and V is the activation matrix, which contains the time-varying gains of each basis. So we can extract some significant spectra from the mixture spectrogram. Then, the decomposed bases should be clustered into each source, but this is a very difficult problem. So, the multichannel extension of NMF has been proposed.
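The decomposition of a spectrogram into T and V can be sketched as follows. This is a minimal illustration that assumes the Kullback-Leibler multiplicative update rules of Lee and Seung; the talk does not state which divergence is used, and the function name `nmf_kl` and its parameters are hypothetical.

```python
import numpy as np

def nmf_kl(X, n_bases, n_iter=200, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix X (frequency x time) as T @ V.

    Assumption: KL-divergence multiplicative updates, chosen only for illustration.
    T (frequency x bases) holds spectral patterns, V (bases x time) holds gains.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    T = rng.random((n_freq, n_bases)) + eps
    V = rng.random((n_bases, n_time)) + eps
    for _ in range(n_iter):
        R = T @ V + eps                                   # current model of X
        T *= ((X / R) @ V.T) / (V.sum(axis=1) + eps)      # update spectral patterns
        R = T @ V + eps
        V *= (T.T @ (X / R)) / (T.sum(axis=0)[:, None] + eps)  # update gains
    return T, V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((6, 2)) @ rng.random((2, 8))           # synthetic rank-2 "spectrogram"
    T, V = nmf_kl(X, n_bases=2)
    print(np.linalg.norm(X - T @ V) / np.linalg.norm(X))  # relative reconstruction error
```

The updates only multiply by nonnegative factors, so T and V stay nonnegative throughout. Nothing in the factorization says which basis belongs to which source, which is the clustering problem mentioned above.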
- For the multichannel signal, we have M spectrograms, where M is the number of microphones. We can calculate an M by M instantaneous covariance matrix like this. This matrix can be calculated at each time and frequency, which gives a tensor X as in this figure. Multichannel NMF decomposes X into the spatial covariance H, cluster-indicator z, basis matrix T and activation matrix V. T and V are the same as in simple NMF: spectral patterns and their activations. H includes source-wise spatial covariances. So, MNMF clusters bases into the sources using the spatial model and the cluster-indicator. The problem of this method is that the optimization of these variables is very difficult. The result strongly depends on the initial values.
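As a small sketch of the instantaneous covariance described here, the M by M matrix at each time-frequency point is the outer product of the multichannel vector with its own conjugate. The array layout (frequency, time, microphone) and the function name are assumptions made for this example.

```python
import numpy as np

def instantaneous_covariance(stft):
    """stft: complex array of shape (n_freq, n_time, n_mics).

    Returns X with shape (n_freq, n_time, n_mics, n_mics), where
    X[i, j] = x_ij x_ij^H is the outer product of the multichannel vector.
    """
    return np.einsum("ijm,ijn->ijmn", stft, stft.conj())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.standard_normal((3, 5, 4)) + 1j * rng.standard_normal((3, 5, 4))
    X = instantaneous_covariance(S)
    print(X.shape)  # one 4 x 4 matrix per time-frequency point
```

Each X[i, j] is Hermitian and has rank 1 by construction; MNMF then explains the whole tensor with source-wise spatial covariances and an NMF source model.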
- Then, at the last ICASSP, we proposed a new efficient optimization scheme for MNMF, which utilizes the rank-1 spatial constraint. In this model, all of the spatial covariances in H must be rank-1 matrices. This means that we assume a linear time-invariant mixing system, as in IVA. This new model can be optimized efficiently by alternately applying the update rules of IVA and simple NMF.
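For reference, the MNMF model and the rank-1 constraint can be written as below. This is a sketch in notation that is common in the MNMF literature and may differ from the slides: the indices i (frequency bin), j (time frame), k (basis), n (source) and the steering vector a are assumptions of this example.

```latex
\hat{\mathbf{X}}_{ij} = \sum_{k} \Big( \sum_{n} \mathbf{H}_{i,n}\, z_{nk} \Big)\, t_{ik}\, v_{kj},
\qquad
\mathbf{H}_{i,n} = \boldsymbol{a}_{i,n}\, \boldsymbol{a}_{i,n}^{\mathsf{H}}
\quad \text{(rank-1 constraint)}
```

Under the constraint, the vectors a for all sources form the columns of a frequency-wise mixing matrix, which is why the spatial model can be handled by the IVA update rules.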
- As I already said, the rank-1 spatial constraint is equal to the linear mixing assumption. This can be thought of as an instantaneous mixture in the time-frequency domain. So, the frequency-wise mixing matrix Ai can be defined, which is time-invariant. In this model, we assume that the sources can be modeled as point sources, and that the reverberation time is shorter than the FFT length.
- However, when the reverberation time is longer than the FFT length, the reverberant components leak into the next time frame, as in this figure. Therefore, we cannot represent the mixture signal x by using only Ai. The leaked component n, which comes from the previous time frame, is added. Since IVA and Rank-1 MNMF estimate the inverse of Ai, the separation performance markedly degrades in this reverberant case.
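The two cases can be contrasted in symbols. This is a sketch that assumes i indexes frequency bins and j indexes time frames, that Ai is square and invertible, and that W denotes a hypothetical demixing matrix equal to the inverse of Ai.

```latex
\begin{aligned}
\boldsymbol{x}_{ij} &= \boldsymbol{A}_i\, \boldsymbol{s}_{ij}
  && \text{(reverberation shorter than the FFT length)} \\
\boldsymbol{x}_{ij} &= \boldsymbol{A}_i\, \boldsymbol{s}_{ij} + \boldsymbol{n}_{ij}
  && \text{(reverberation longer than the FFT length)} \\
\boldsymbol{y}_{ij} &= \boldsymbol{W}_i\, \boldsymbol{x}_{ij}
  = \boldsymbol{s}_{ij} + \boldsymbol{W}_i\, \boldsymbol{n}_{ij},
  && \boldsymbol{W}_i = \boldsymbol{A}_i^{-1}
\end{aligned}
```

Even with a perfect estimate of the inverse of Ai, the leaked term remains in every output, which is one way to see why the separation degrades.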
- This is a summary of the problems. MNMF can estimate a full-rank spatial model, so it works for the reverberant signals to some extent. But it requires a high computational cost, and it strongly depends on the initial values. IVA and Rank-1 MNMF use the rank-1 spatial constraint, that is, the linear mixing assumption, so the separation performance degrades for the reverberant signals. But they have an efficient optimization method. To achieve good and stable separation even for the reverberant signals, we propose to relax the rank-1 spatial constraint while maintaining efficient optimization.
- In this presentation, we propose to utilize extra observations to model direct and reverberant components simultaneously. Now we assume that there are M microphones for N sources, where M = PN. For example, there are 4 microphones and 2 sources, where P = 2. In general, (click) we apply PCA to reduce the dimension of the signal (click). Then, we apply BSS. This dimensionality reduction is expected to remove the reverberant components of all the sources. But the reverberant components are important for acoustic scene analysis, so we shouldn’t ignore them.
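The conventional PCA step can be sketched as follows, assuming it is applied independently in each frequency bin to a complex STFT of shape (frequency, time, microphone). The function name and the array layout are hypothetical, and the talk does not specify these details.

```python
import numpy as np

def pca_reduce(stft, n_sources):
    """Keep the n_sources strongest principal components in each frequency bin.

    stft: complex array of shape (n_freq, n_time, n_mics).
    Returns a complex array of shape (n_freq, n_time, n_sources).
    """
    n_freq, n_time, n_mics = stft.shape
    out = np.empty((n_freq, n_time, n_sources), dtype=complex)
    for i in range(n_freq):
        x = stft[i]                                # (n_time, n_mics)
        cov = x.T @ x.conj() / n_time              # spatial covariance E[x x^H]
        _, vecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
        U = vecs[:, ::-1][:, :n_sources]           # strongest directions first
        out[i] = x @ U.conj()                      # y_j = U^H x_j for every frame
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.standard_normal((3, 50, 4)) + 1j * rng.standard_normal((3, 50, 4))
    print(pca_reduce(S, 2).shape)  # 4 microphones reduced to 2 channels
```

With M = 4 and N = 2, this discards the two weakest spatial directions. The proposed approach instead keeps all four channels and lets BSS use them.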
- In the proposed method (click), we apply IVA or Rank-1 MNMF with extra observations (click). We expect that the 2 original sources are separately obtained as (click) direct and reverberant components like this. Therefore, we assume independence not only between the sources but also between these components. Finally, we reconstruct each source by adding its components.
- However, in this method, there is a permutation problem of the separated components because the order of the separated components depends on the initial values. This means that we don’t know which separated components belong to which source. So we have to cluster them into each source, and reconstruct the estimated sources by adding the components of the same source. Here, we propose two methods to cluster these components, one for IVA and one for Rank-1 MNMF.
- The first method is for the IVA. We can expect that the direct and reverberant components of the same source have a strong cross-correlation in the power spectrogram domain. So we calculate the cross-correlations between all the combinations of the components. Then, we merge them in a descending order of the cross-correlations. This is a very simple way, and it actually works well.
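The merging procedure can be sketched as below. Several details are assumptions made for illustration: the cross-correlation is taken as the Pearson correlation coefficient over all time-frequency bins, clusters are capped at P = M / N components, and the greedy merge is only intended for the P = 2 case used in the talk. All function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def cluster_components(components, n_sources):
    """Greedily merge separated components in descending order of correlation.

    components: complex array (n_comp, n_freq, n_time) of separated spectrograms.
    Returns a list of index lists, one per estimated source.
    Assumption: n_comp = 2 * n_sources (P = 2); larger P may need a different merge.
    """
    n_comp = components.shape[0]
    max_size = n_comp // n_sources
    power = (np.abs(components) ** 2).reshape(n_comp, -1)   # power spectrograms
    corr = np.corrcoef(power)                                # all pairwise correlations
    pairs = sorted(combinations(range(n_comp), 2),
                   key=lambda p: corr[p], reverse=True)
    label = list(range(n_comp))                              # one cluster per component
    n_clusters = n_comp
    for a, b in pairs:
        if n_clusters == n_sources:
            break
        la, lb = label[a], label[b]
        if la == lb or label.count(la) + label.count(lb) > max_size:
            continue
        label = [la if l == lb else l for l in label]        # merge cluster lb into la
        n_clusters -= 1
    groups = {}
    for idx, l in enumerate(label):
        groups.setdefault(l, []).append(idx)
    return list(groups.values())

def reconstruct(components, clusters):
    """Sum the components of each cluster to obtain one spectrogram per source."""
    return np.stack([components[idx].sum(axis=0) for idx in clusters])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    env_a, env_b = rng.random((5, 20)), rng.random((5, 20))
    comps = np.stack([np.sqrt(env_a), np.sqrt(env_b),
                      0.5 * np.sqrt(env_a), 0.3 * np.sqrt(env_b)]).astype(complex)
    clusters = cluster_components(comps, n_sources=2)
    print(clusters, reconstruct(comps, clusters).shape)
```

In the synthetic example, components 0 and 2 share one spectral envelope and components 1 and 3 share another, so they are expected to end up in the same clusters.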
- For Rank-1 MNMF, we can use another way for the clustering. We can expect that the direct and reverberant components can be modeled by the same bases. So we propose to share the basis matrix T for each source in Rank-1 MNMF. By imposing the basis-shared source model in advance, Rank-1 MNMF can automatically cluster the components into the sources. This means that the separated components are already clustered.
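One way to picture the basis-shared source model: every component of a source uses that source's basis matrix T, while each component keeps its own activations. The sketch below assumes exactly this structure (separate activations per component are an assumption, as are the array shapes and the function name).

```python
import numpy as np

def shared_basis_model(T, V):
    """Model the power spectrogram of every separated component.

    T: (n_sources, n_freq, n_bases)           one basis matrix per source (shared)
    V: (n_sources, n_parts, n_bases, n_time)  activations per component
                                              (e.g. n_parts = 2: direct, reverb.)
    Returns (n_sources, n_parts, n_freq, n_time), where entry [n, p] = T[n] @ V[n, p].
    """
    return np.einsum("nik,npkj->npij", T, V)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.random((2, 6, 3))        # 2 sources, 6 frequency bins, 3 bases each
    V = rng.random((2, 2, 3, 7))     # 2 components per source, 7 time frames
    print(shared_basis_model(T, V).shape)
```

Because the component index is tied to a source through T, no separate clustering step is needed after the separation.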
- We conducted a separation experiment. This table shows the experimental conditions. We used actual music signals and impulse responses. We produced 4-channel observed signals that include 2 sources. The important point is that the reverberation time is much longer than the FFT length. Also, we used the SDR value, which indicates the total separation quality.
- We compared 7 methods. The first and second methods are the conventional methods that apply PCA before BSS. The red ones are the proposed methods that utilize extra observations. We also evaluate the performance of conventional MNMF that estimates full-rank spatial model. This one applies maximum SNR beam forming after MNMF. The other one applies multichannel Wiener filtering after MNMF, so this method uses time-variant post filtering. In addition, we show the upper limit of time-invariant filtering methods as a supervised method.
- This is a result of song 1. These methods have (click) different models like this. The proposed approaches, which separately estimate direct and reverberant components, achieve better results than the methods with PCA. Since the proposed methods use time-invariant filters for each of the direct and reverberant components, they utilize 2 filters for one source. Therefore, the proposed methods have the potential to outperform the upper limit of the ideal time-invariant filter. The performance of conventional MNMF is also high. It is comparable with the proposed method, but its results are not stable.
- This is the result of song 2. It is similar to the previous result.
- This table shows the actual computational times of each method. From these results, the proposed methods can achieve separation performance comparable to MNMF even for the reverberant signals while maintaining efficient optimization.
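To put the reported times in proportion, the following snippet computes how much faster each method is than 4ch MNMF. The times are the ones given on the computational-time slide; the variable names are only for this example.

```python
# Computational times in seconds, as reported on the comparison slide.
times_s = {
    "PCA + 2ch IVA": 23.4,
    "PCA + 2ch Rank-1 MNMF": 29.4,
    "4ch IVA + Clustering": 60.1,
    "4ch Basis-Shared Rank-1 MNMF": 143.9,
    "4ch MNMF": 3611.8,
}

baseline = times_s["4ch MNMF"]
# Ratio of the MNMF time to each other method's time.
speedup = {name: baseline / t for name, t in times_s.items() if name != "4ch MNMF"}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x faster than 4ch MNMF")
```

For the two proposed methods, this gives roughly 60 times (4ch IVA + Clustering) and 25 times (4ch Basis-Shared Rank-1 MNMF) shorter computation than 4ch MNMF on this song.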
- This is a result of song 3. We can confirm that the proposed method outperforms the upper limit.