- 1. Hybrid Multichannel Signal Separation Using Supervised Nonnegative Matrix Factorization Daichi Kitamura, (The Graduate University for Advanced Studies, Japan) Hiroshi Saruwatari, (The University of Tokyo, Japan) Satoshi Nakamura, (Nara Institute of Science and Technology, Japan) Yu Takahashi, (Yamaha Corporation, Japan) Kazunobu Kondo, (Yamaha Corporation, Japan) Hirokazu Kameoka, (The University of Tokyo, Japan) Asia-Pacific Signal and Information Processing Association ASC 2014 Special session – Recent Advances in Audio and Acoustic Signal processing
- 2. Outline • 1. Research background • 2. Conventional methods – Nonnegative matrix factorization – Supervised nonnegative matrix factorization – Multichannel NMF • 3. Proposed method – SNMF with spectrogram restoration and its Hybrid method • 4. Experiments – Closed data experiment – Open data experiment • 5. Conclusions 2
- 4. Research background • Signal separation has received much attention. – Applications: automatic music transcription, 3D audio systems, etc. • Music signal separation based on nonnegative matrix factorization (NMF) is a very active research area. • Supervised NMF (SNMF) achieves the highest separation performance. • To improve its performance, an SNMF-based multichannel signal separation method is required. • Goal: separate the target signal from multichannel signals with high accuracy.
- 6. NMF [Lee, et al., 2001] • NMF can extract significant spectral patterns. – The basis matrix F has the frequently appearing spectral patterns in Y. • Y ≈ FG, where Y is the observed matrix (spectrogram), F is the basis matrix (spectral patterns), and G is the activation matrix (time-varying gains). • Ω: number of frequency bins, T: number of time frames, K: number of bases. [Figure: spectrogram Y (frequency versus time) decomposed into bases F (amplitude versus frequency) and activations G (amplitude versus time)]
- 7. Supervised NMF [Smaragdis, et al., 2007] • SNMF – Supervised spectral separation method • Training process: sample sounds of the target signal are used to build the supervised basis matrix (spectral dictionary). • Separation process: the supervised basis matrix is fixed, and the other matrices are optimized to split the mixed signal into the target signal and the other signal.
- 8. Problems of SNMF • SNMF handles only single-channel signals. – For a multichannel signal, SNMF cannot use information between channels. • When many interference sources exist, the separation performance of SNMF markedly degrades. [Figure: residual components remain after separation]
- 9. Multichannel NMF [Sawada, et al., 2013] • Multichannel NMF – is a natural extension of NMF for a multichannel signal (microphone array) – uses spatial information for the clustering of bases to achieve the unsupervised separation task. • Problems: multichannel NMF depends strongly on the initial values and lacks robustness.
- 10. Outline • 1. Research background • 2. Conventional methods – Nonnegative matrix factorization – Supervised nonnegative matrix factorization – Multichannel NMF • 3. Proposed method – Motivation and strategy – SNMF with spectrogram restoration and its Hybrid method • 4. Experiments – Closed data experiment – Open data experiment • 5. Conclusions 10
- 11. Motivation and strategy • Sawada’s multichannel NMF – is a unified method that solves the spatial and spectral separations together. – Maximizes a likelihood over the spatial direction of the target signal and the source components of all signals (target and other), given the observed spectrograms. – In the supervised situation, the target spectral patterns are given. – Too difficult to solve (lacks robustness) – Computationally inefficient (much computational time)
- 12. Motivation and strategy • Proposed hybrid method – approximately divides the problem into unsupervised spatial separation (classical D.O.A. estimation) and supervised spectral separation (SNMF-based method). – The spatial separation should be carried out with classical D.O.A. estimation methods. • These methods are very efficient and stable. – Divide-and-conquer method
- 13. Directional clustering [Araki, et al., 2007] • Directional clustering – Unsupervised spatial separation method – k-means clustering (fast and stable) • Problems – Artificial distortion arises owing to the binary masking. [Figure: each time-frequency grid of the input stereo spectrogram is labeled left (L), center (C), or right (R); the binary mask is 1 where the label matches the target direction and 0 elsewhere, and the separated signal is the entry-wise product of the spectrogram and the binary mask]
- 14. Proposed method: hybrid separation • Hybrid separation method: input stereo signal (L, R) → spatial separation method (directional clustering) → SNMF-based separation method (SNMF with spectrogram restoration) → separated signal
- 15. SNMF with spectrogram restoration • The separated cluster has spectral holes (lost components). • The proposed SNMF treats these holes as unseen observations. • The fittest bases are extrapolated from the supervised basis (dictionary of the target signal) to fix up the holes. [Figure: separated cluster (frequency versus time) with holes marked]
- 16. SNMF with spectrogram restoration [Figure: frequency of source components versus direction (left, center, right): (a) input signal, with the target in the center; (b) after directional clustering (binary masking); (c) after SNMF with spectrogram restoration (labeled "super-resolution-based SNMF" in the figure), showing the extrapolated components] [Figure: observed spectrogram (target and interference) → directional clustering → separated cluster → SNMF with spectrogram restoration, which extrapolates from the supervised spectral bases → reconstructed data]
- 20. Decomposition model and cost function • The divergence is defined at all grids except for the holes by using the binary mask matrix I. • Decomposition model: Y ≈ FG + HU, where the supervised bases F are fixed. • Cost function: [equation not preserved] a divergence term with a binary index that excludes the holes, a regularization term, and a penalty term [Kitamura, et al., 2014]. • Notation on the slide: entries of the matrices; weighting parameters; binary complement; Frobenius norm; I: binary masking matrix obtained from directional clustering.
- 21. Generalized divergence: β-divergence • β-divergence [Eguchi, et al., 2001] – EUC-distance (β = 2) – KL-divergence (β = 1): the best criterion for signal separation [Kitamura, et al., 2014] – IS-divergence (β = 0)
- 22. Decomposition model and cost function • We used two β-divergences, with β_NMF for the main cost and β_reg for the regularization cost. • Decomposition model: Y ≈ FG + HU, where the supervised bases F are fixed. • Cost function: [equation not preserved]
- 23. Update rules • We can obtain the update rules for the optimization of the variable matrices G, H, and U. • Update rules: [equations not preserved]
- 25. Experimental condition • The mixed signal includes four melodies (sources). • Three compositions of instruments – We evaluated the average score of 36 patterns. • Dataset (Melody 1 / Melody 2 / Midrange / Bass): No. 1: Oboe / Flute / Piano / Trombone; No. 2: Trumpet / Violin / Harpsichord / Fagotto; No. 3: Horn / Clarinet / Piano / Cello. • Supervision signal: 24 notes that cover all the notes in the target melody. [Figure: positions of sources 1 to 4 across left, center, and right, with the target source in the center]
- 26. Experimental result: closed data • Signal-to-distortion ratio (SDR) – total quality of the separation, which includes the degree of separation and absence of artificial distortion. [Figure: SDR [dB] (0 to 14, higher is better) versus β_NMF (0 to 4, with KL-divergence and EUC-distance marked) for the proposed hybrid method, conventional SNMF (single-channel SNMF), directional clustering, and supervised multichannel NMF [Sawada]]
- 27. SNMF with spectrogram restoration • SNMF with spectrogram restoration has two tasks: source separation and basis extrapolation. • The optimal divergence for source separation is KL-divergence (β = 1). • In contrast, a divergence with a higher β value is suitable for the basis extrapolation.
- 28. Trade-off: separation and restoration • The optimal divergence for SNMF with spectrogram restoration and its hybrid method is based on the trade-off between separation and restoration abilities. [Figure: two basis spectra, amplitude [dB] (−10 to 0) versus frequency [kHz] (0 to 5), one with strong sparseness and one with weak sparseness] [Figure: separation performance, restoration performance, and total performance of the hybrid method versus β (0 to 4)]
- 29. Experimental condition • Open data experiment – used different tone generators for the training and test signals • Supervision signal: 24 notes that cover all the notes in the target melody, provided by tone generator A. • Test signal: provided by tone generator B (more realistic sound) plus background noise (SNR = 10 dB). [Figure: positions of sources 1 to 4 across left, center, and right, with the target source in the center]
- 30. Experimental result: open data • Signal-to-distortion ratio (SDR) – total quality of the separation, which includes the degree of separation and absence of artificial distortion. [Figure: SDR [dB] (−4 to 10, higher is better) versus β_NMF (0 to 4, with KL-divergence and EUC-distance marked) for the proposed hybrid method, conventional SNMF (single-channel SNMF), directional clustering, and supervised multichannel NMF [Sawada]]
- 31. Conclusions • We proposed a hybrid multichannel signal separation method combining directional clustering and SNMF with spectrogram restoration. • There is a trade-off between separation and restoration abilities. Thank you for your attention! You can hear a demonstration on my website.

- This is the outline of my talk.
- Recently, signal separation technologies have received much attention, and they are useful for many applications. Nonnegative matrix factorization, NMF for short, is a very active area of signal separation research. In particular, supervised NMF (SNMF) achieves good separation performance. However, SNMF can be used only for single-channel signals. To improve its performance, an SNMF-based multichannel signal separation method is required.
- Before explaining supervised NMF, I will explain the basics of simple NMF. NMF is a powerful method for extracting significant features from a spectrogram. This method decomposes the input spectrogram Y into a product of a basis matrix F and an activation matrix G, where the basis matrix F has frequently appearing spectral patterns as basis vectors, like this, and the activation matrix G has the time-varying gains of each spectral pattern.
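The decomposition Y ≈ FG described above can be sketched with the standard multiplicative updates for the Euclidean distance. This is a minimal illustration, not the code used in the talk: the function name `nmf_euclidean`, the toy data, and the iteration count are assumptions chosen for the example.

```python
import numpy as np

def nmf_euclidean(Y, K, n_iter=500, eps=1e-12, seed=0):
    """Factorize a nonnegative matrix Y (freq x time) as F @ G with K bases."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = Y.shape
    F = rng.random((n_freq, K)) + eps    # basis matrix (spectral patterns)
    G = rng.random((K, n_frames)) + eps  # activation matrix (time-varying gains)
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        G *= (F.T @ Y) / (F.T @ F @ G + eps)
        F *= (Y @ G.T) / (F @ G @ G.T + eps)
    return F, G

# Toy "spectrogram": an exactly rank-4 nonnegative matrix (hypothetical data).
rng = np.random.default_rng(1)
Y = rng.random((16, 4)) @ rng.random((4, 30))
F, G = nmf_euclidean(Y, K=4)
rel_err = np.linalg.norm(Y - F @ G) / np.linalg.norm(Y)
```

On real audio, Y would be a magnitude or power spectrogram, and the columns of F would be inspected as spectral patterns.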
- To separate the target signal with NMF, supervised NMF has been proposed. In SNMF, we first train on sample sounds of the target signal, such as a musical scale, and construct the supervised basis F. This is a spectral dictionary of the target sound. Next, we separate the mixed signal using the supervised basis F, as FG + HU. The target signal is then obtained as FG, and the other signal is reconstructed as HU.
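The separation step Y ≈ FG + HU with F fixed can be sketched as follows. This is a simplified illustration that assumes the Euclidean distance and plain multiplicative updates; the name `snmf_separate`, the number of other bases `L`, and the toy data are hypothetical, and in practice F would come from a training run on the sample sounds.

```python
import numpy as np

def snmf_separate(Y, F, L, n_iter=300, eps=1e-12, seed=0):
    """Fit Y ~ F @ G + H @ U while keeping the supervised basis F fixed."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = Y.shape
    G = rng.random((F.shape[1], n_frames)) + eps  # target activations
    H = rng.random((n_freq, L)) + eps             # bases for the other signals
    U = rng.random((L, n_frames)) + eps           # activations of the other bases
    for _ in range(n_iter):
        X = F @ G + H @ U
        G *= (F.T @ Y) / (F.T @ X + eps)
        X = F @ G + H @ U
        H *= (Y @ U.T) / (X @ U.T + eps)
        X = F @ G + H @ U
        U *= (H.T @ Y) / (H.T @ X + eps)
    return G, H, U

# Toy mixture: a target built from a known dictionary plus another source.
rng = np.random.default_rng(2)
F_sup = rng.random((20, 3))                      # stands in for a trained dictionary
Y_mix = F_sup @ rng.random((3, 40)) + rng.random((20, 2)) @ rng.random((2, 40))
G, H, U = snmf_separate(Y_mix, F_sup, L=2)
target_est = F_sup @ G                           # estimated target spectrogram
other_est = H @ U                                # estimated other signals
fit_err = np.linalg.norm(Y_mix - target_est - other_est) / np.linalg.norm(Y_mix)
```

Only G, H, and U are updated here, which mirrors the statement that the supervised basis is fixed during separation.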
- The problem of SNMF is that it works only for single-channel signals, so we cannot use any information between channels. However, almost all music signals are in stereo format, so we should extend simple SNMF to a multichannel SNMF. In addition, when many interfering sources exist, the separation performance of SNMF markedly degrades.
- As another means of multichannel signal separation, multichannel NMF has also been proposed by Sawada. This is a natural extension of NMF, and it uses spatial information for the clustering of bases to achieve unsupervised separation. However, this method is a very difficult optimization problem mathematically, so it strongly depends on the initial values.
- Sawada’s multichannel NMF is a unified method that solves the spatial and spectral separations simultaneously. This method maximizes a likelihood like this, where theta is the spatial direction of the target signal; F, G, H, and U are the source components of the target and other signals; and Y is the observed spectrogram of both channels. In the supervised situation, the target spectral patterns F are given, like this. However, even if F is given, this optimization is too difficult to solve, so it lacks robustness. It also requires much computational time.
- Our proposed method approximately divides the problem into unsupervised spatial separation and supervised spectral separation, because we can use classical D.O.A. estimation methods for the spatial separation, and these are very efficient and stable. SNMF is then applied to the spectral separation problem. Therefore, this method can be considered a divide-and-conquer method, in which the most suitable method is applied to each separation.
- For the spatial separation, we used directional clustering because it is very fast and stable. This method uses the level difference between the left and right channels as a clustering cue, so we can separate the sources direction by direction. This is equivalent to binary masking in the spectrogram domain: we get the binary mask from the clustering result and calculate an entry-wise product with the spectrogram. Finally, we obtain the signal of the separated direction. However, this signal has artificial distortion owing to the binary masking.
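The clustering and masking steps can be sketched in a few lines. This is only an illustration under stated assumptions: the cue is a pan angle computed from the two channel magnitudes, which is one possible encoding of the level difference and not necessarily the feature used in the cited work, and the 1-D k-means loop, the function name `directional_masks`, and the toy data are hypothetical.

```python
import numpy as np

def directional_masks(mag_left, mag_right, n_clusters=3, n_iter=50):
    """Cluster time-frequency grids by a level-difference cue; return binary masks."""
    # Assumed cue: pan angle, 0 = hard left, pi/4 = center, pi/2 = hard right.
    angle = np.arctan2(mag_right, mag_left)
    x = angle.ravel()
    centers = np.linspace(x.min(), x.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean()
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    labels = labels.reshape(angle.shape)
    return [labels == c for c in range(n_clusters)], centers

# Toy stereo magnitudes: each grid is dominated by one of three directions.
rng = np.random.default_rng(0)
pan = rng.integers(0, 3, size=(8, 10))            # 0 = left, 1 = center, 2 = right
amp = rng.random((8, 10)) + 0.5
mag_left = amp * np.array([1.0, 0.7, 0.1])[pan]
mag_right = amp * np.array([0.1, 0.7, 1.0])[pan]

masks, centers = directional_masks(mag_left, mag_right)
center_mask = masks[1].astype(float)
# Entry-wise product of a spectrogram and the binary mask.
center_spec = center_mask * (mag_left + mag_right)
```

The zeros in `center_mask` are the grids that become spectral holes in the separated cluster.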
- So we propose a new SNMF-based method named SNMF with spectrogram restoration. This is the concept of our proposed hybrid method. First, the target direction is separated. Then, the target signal is extracted by this new SNMF.
- Here, the signal separated by directional clustering has many spectral holes owing to the binary masking. This spectrum is an example. However, the proposed SNMF treats these holes as unseen observations, like this: we exclude these components from the cost function. Then, the target bases are extrapolated using the fittest spectral patterns from the supervised bases F. As a result, the lost components are restored by the supervised basis extrapolation.
- This figure shows the directional distribution of the input stereo signal. The target source is in the center direction, and the other interfering sources are distributed like this. After directional clustering, left and right source components leak into the center cluster, and the center sources lose some of their components. These lost components correspond to the spectral holes. After SNMF with spectrogram restoration, the target components are separated and restored using the supervised bases. In other words, the resolution of the target spectrogram is recovered.
- This is the decomposition model of SNMF with spectrogram restoration. It is the same as that of simple SNMF. J is the cost function of the proposed SNMF.
- We introduce the binary index i, which excludes the holes from the total cost. This index is obtained from the binary mask matrix. Therefore, the divergence is defined at all spectrogram grids except for the spectral holes.
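Excluding the holes from the cost can be sketched as a masked fit. This is a simplified illustration, not the update rules from the talk: it assumes the Euclidean distance, leaves out the regularization and penalty terms, and uses hypothetical names (`masked_snmf`, `I_mask`) and toy data.

```python
import numpy as np

def masked_snmf(Y, I, F, L, n_iter=300, eps=1e-12, seed=0):
    """Fit Y ~ F @ G + H @ U only on grids where I == 1; F stays fixed."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = Y.shape
    G = rng.random((F.shape[1], n_frames)) + eps
    H = rng.random((n_freq, L)) + eps
    U = rng.random((L, n_frames)) + eps
    IY = I * Y
    for _ in range(n_iter):
        # Holes (I == 0) contribute nothing to the numerators or denominators.
        X = F @ G + H @ U
        G *= (F.T @ IY) / (F.T @ (I * X) + eps)
        X = F @ G + H @ U
        H *= (IY @ U.T) / ((I * X) @ U.T + eps)
        X = F @ G + H @ U
        U *= (H.T @ IY) / (H.T @ (I * X) + eps)
    return G, H, U

# Toy data: a target spectrogram with about 30% of its grids masked out.
rng = np.random.default_rng(3)
F_sup = rng.random((20, 3))
target = F_sup @ rng.random((3, 40))
I_mask = (rng.random(target.shape) > 0.3).astype(float)
Y_obs = I_mask * target                           # binary-masked observation

G, H, U = masked_snmf(Y_obs, I_mask, F_sup, L=1)
restored = F_sup @ G                              # also defined inside the holes
X_fit = restored + H @ U
obs_err = np.linalg.norm(I_mask * (Y_obs - X_fit)) / np.linalg.norm(Y_obs)
```

Because `restored` is computed from the full columns of F, it has values in the hole positions as well, which is the sense in which the bases are extrapolated.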
- For the grids of the holes, we impose a regularization term to avoid the extrapolation error.
- The third term is a penalty term to avoid sharing the same basis between F and H. This penalty improves the separation performance in SNMF.
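The three terms described above can be written out as a sketch. The equation on the slide was not preserved, so this is a reconstruction under assumptions: the index names k and l, the weights λ and μ, and especially the exact form of the regularization term (a divergence measured from zero on the hole grids) are one plausible reading, not a confirmed transcription.

```latex
\begin{align}
\boldsymbol{Y} &\simeq \boldsymbol{F}\boldsymbol{G} + \boldsymbol{H}\boldsymbol{U}, \\
\mathcal{J} &= \sum_{\omega,t} i_{\omega,t}\,
  \mathcal{D}_{\beta_{\mathrm{NMF}}}\!\Bigl( y_{\omega,t} \,\Big\|\,
  \sum_{k} f_{\omega,k}\, g_{k,t} + \sum_{l} h_{\omega,l}\, u_{l,t} \Bigr) \nonumber \\
&\quad + \lambda \sum_{\omega,t} \bar{i}_{\omega,t}\,
  \mathcal{D}_{\beta_{\mathrm{reg}}}\!\Bigl( 0 \,\Big\|\, \sum_{k} f_{\omega,k}\, g_{k,t} \Bigr)
  + \mu \bigl\| \boldsymbol{F}^{\mathsf{T}} \boldsymbol{H} \bigr\|_{\mathrm{Fr}}^{2}.
\end{align}
```

Here i is the binary index (1 on observed grids, 0 on holes), the bar denotes its binary complement, and the last term discourages F and H from sharing the same basis.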
- For the divergence measure, we propose to use the beta-divergence. This is a generalized distance function that includes EUC-distance, KL-divergence, and IS-divergence when beta = 2, 1, and 0, respectively. In SNMF, it has been reported that KL-divergence is the best criterion for signal separation.
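The three special cases can be checked numerically with a small function. This sketch uses a common textbook parameterization of the beta-divergence, in which beta = 2 gives half the squared error; the talk does not state its scaling, so the constant factor is an assumption.

```python
import numpy as np

def beta_divergence(y, x, beta):
    """Entry-wise beta-divergence D_beta(y || x) for positive y and x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if beta == 0:                      # IS-divergence
        r = y / x
        return r - np.log(r) - 1.0
    if beta == 1:                      # KL-divergence (generalized)
        return y * np.log(y / x) - y + x
    # General case; beta = 2 reduces to 0.5 * (y - x) ** 2 (EUC-distance).
    return (y ** beta / (beta * (beta - 1))
            + x ** beta / beta
            - y * x ** (beta - 1) / (beta - 1))

d_euc = beta_divergence(3.0, 1.0, 2)   # 0.5 * (3 - 1) ** 2
d_kl = beta_divergence(1.0, np.e, 1)   # e - 2
d_is = beta_divergence(2.0, 1.0, 0)    # 2 - log(2) - 1
```

For any beta, the divergence is zero when y equals x, which is a quick sanity check for an implementation.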
- We used two beta-divergences, one for the main cost and one for the regularization cost, with parameters beta_NMF and beta_reg.
- By minimizing the cost function, we can obtain the update rules for optimizing the variable matrices G, H, and U.
- These are the experimental conditions. The mixed signal includes four melodies. Each sound source is located as in this figure, where the target source is always located in the center direction together with another interfering source. We prepared three compositions of instruments and evaluated the average score of 36 patterns. In addition, the supervision signal has 24 notes, like this score, which cover all the notes in the target melody.
- This is the result of the experiment. We show the average SDR score, where SDR indicates the total quality of the separation. Directional clustering cannot separate sources in the same direction, so its result was not good. Multichannel NMF strongly depends on the initial values, so its average score is poor. The hybrid method outperforms the conventional SNMF. The conventional SNMF achieves its highest score when beta equals 1, that is, KL-divergence. Surprisingly, however, EUC-distance is preferable for the proposed hybrid method.
- This is because SNMF with spectrogram restoration has two tasks, namely, separation of the target signal and basis extrapolation for the restoration of the spectrogram. It has been reported that KL-divergence is suitable for source separation. In contrast, a divergence with a higher beta value is suitable for the basis extrapolation. This fact is shown experimentally in our paper.
- The reason is that if we use a smaller beta value, such as KL-divergence, the obtained basis becomes sparse. (pointing at the figure) On the other hand, if we use a higher beta value, the sparseness of the basis becomes weak. A sparse basis is not suitable for basis extrapolation using only the observable data. Therefore, the optimal divergence for the hybrid method is around EUC-distance because of the trade-off between separation and restoration abilities, as in this graph. The optimal beta is shifted from 1 to 2.
- We also conducted an open data experiment. Here we used different MIDI tone generators for the training and test signals. Therefore, the waveforms are similar but not the same. In addition, we added background noise to the test signals at an SNR of 10 dB.
- This is the result. Even when we use a different training sound, we can achieve good results. Sawada’s multichannel NMF does not work because this method cannot reduce the diffuse noise.
- These are the conclusions of my talk. Thank you for your attention.
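- The other experimental conditions are as shown here. We compare the evaluation scores for all combinations when the NMF divergence criterion beta_NMF is varied from 0 to 4. For the regularization divergence, we show only beta_reg = 1, which gave the highest performance. We use SDR as the evaluation score. SDR is the total separation accuracy, which includes the degree of separation and the absence of artificial distortion.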
- The supervised method has an inherent problem: we cannot get a perfect supervision sound for the target signal. Even if the supervision sounds come from the same type of instrument as the target sound, these sounds differ according to various conditions, for example, individual styles of playing and the individual timbre of each instrument. When we want to separate this piano sound from the mixed signal, we may only be able to prepare a similar piano sound whose timbre is slightly different. However, supervised NMF cannot separate the target well because of the difference between the spectra of the supervision sound and the target sound.
- To solve this problem, we have proposed a new supervised method that adapts the supervised bases to the target spectra by a basis deformation. This is the decomposition model of this method. We introduce the deformable term, which has both positive and negative values, like this. Then we optimize the matrices D, G, H, and U. This figure indicates the spectral difference between the real sound and the artificial sound.
- In NMF decomposition, the cost function is defined as a distance or divergence between the input matrix Y and the decomposed matrix FG. J_NMF denotes the cost function of NMF, and we minimize it to find F and G under the nonnegativity constraint. There are several criteria for the distance used in the cost function. These three criteria are often used in NMF decomposition.
- The NMF decomposition is equivalent to a maximum likelihood estimation that implicitly assumes a generation model for the input data Y. Once we select the parameter beta, the assumed generation model is fixed. In other words, the parameter beta defines the generation model of the input data.
- In this analysis, to compare the net extrapolation ability, we generated random input data Y that obey each generation model. We also prepared the binary-masked random data YI and attempted to restore them. In the training process, we construct the supervised basis F using the random data Y. Then we attempt to restore the binary-masked data using the trained basis F.
- The binary mask I was generated uniformly at random, and we generated two types of binary masks whose hole densities are 75% and 98%. By calculating the similarity between the input data Y and the restored data FG, we can evaluate the extrapolation ability and the accuracy of restoration. Here, SAR indicates the accuracy of restoration.
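Generating a mask with a prescribed density of holes can be sketched as follows. The function name `random_hole_mask`, the matrix size, and the stand-in random data are hypothetical, and the SAR computation is not reproduced because its definition is not given in the talk.

```python
import numpy as np

def random_hole_mask(shape, hole_density, seed=0):
    """Binary mask with an exact fraction of zeros placed uniformly at random."""
    rng = np.random.default_rng(seed)
    n = int(np.prod(shape))
    n_holes = int(round(hole_density * n))
    mask = np.ones(n, dtype=int)
    mask[rng.choice(n, size=n_holes, replace=False)] = 0  # 0 marks a hole
    return mask.reshape(shape)

I75 = random_hole_mask((20, 50), 0.75)
I98 = random_hole_mask((20, 50), 0.98)

rng = np.random.default_rng(1)
Y_rand = rng.random((20, 50))   # stand-in for the random input data Y
Y_masked = Y_rand * I75         # binary-masked data to be restored
```

With 98% holes only 20 of the 1000 grids remain observed in this toy size, which gives a sense of how little data the extrapolation has to work with.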
- These are the results of the analysis. The left one is the result for the 75%-binary-masked data, and the right one is for the 98%-binary-masked data. Beta equal to 1, which means KL-divergence, is the optimal divergence for source separation. Surprisingly, however, the optimal divergence for the restoration is at beta equal to around 3.
- We also conducted an experiment using real-recorded signals. In this experiment, the binaural mixed signal was recorded in a real environment. The other conditions are the same as those in the previous experiment.
- This is the result of the experiment using real-recorded signals. From this result, we can confirm that the optimal divergence for the hybrid method is EUC-distance.
- As I already said, the best divergence depends on the number of holes. If there are many holes, beta = 2 should be used, and if there are not so many holes, beta = 1 should be used. Therefore, the divergence should be switched to the optimal one using a threshold value. We propose a frame-wise multi-divergence.
- We define the multi-divergence by cases at each time frame, where r_t is the density of holes at frame t. The divergence is switched according to the threshold value tau.
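The frame-wise switch can be sketched as a small helper. The values beta = 2 for frames with many holes and beta = 1 otherwise follow the previous note; the function name `framewise_beta`, the use of "greater than or equal to" at the threshold, and the toy mask are assumptions made for the example.

```python
import numpy as np

def framewise_beta(mask, tau, beta_many_holes=2.0, beta_few_holes=1.0):
    """Pick a beta per time frame from the hole density r_t of a binary mask."""
    mask = np.asarray(mask, dtype=float)
    r = 1.0 - mask.mean(axis=0)            # r_t: fraction of holes in frame t
    betas = np.where(r >= tau, beta_many_holes, beta_few_holes)
    return betas, r

# Toy mask (freq x time): frame 0 has no holes, frame 1 half, frame 2 all holes.
toy_mask = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [1, 1, 0],
                     [1, 0, 0]])
betas, r = framewise_beta(toy_mask, tau=0.6)
```

The resulting vector `betas` would then select, frame by frame, which divergence enters the cost function.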
- Then we evaluated various patterns of the spatial location of the sources, SP1 to SP4. SP4 leads to more spectral holes than SP1. From this result, we can confirm that the multi-divergence always achieves the highest performance.
- SDR is the total evaluation score of the separation performance.