- 1. Superresolution-Based Stereo Signal Separation via Supervised Nonnegative Matrix Factorization Daichi Kitamura, Hiroshi Saruwatari, Yusuke Iwao, Kiyohiro Shikano (Nara Institute of Science and Technology, Nara, Japan) Kazunobu Kondo, Yu Takahashi (Yamaha Corporation Research & Development Center, Shizuoka, Japan) 18th International conference on Digital Signal Processing 2013
- 2. Outline • 1. Research background • 2. Conventional method – Nonnegative matrix factorization – Penalized supervised nonnegative matrix factorization – Directional clustering – Multichannel NMF – Hybrid method • 3. Proposed method – Regularized superresolution-based nonnegative matrix factorization • 4. Experiments • 5. Conclusions
- 4. Background • Music signal separation technologies have received much attention. • Applications: automatic music transcription, 3D audio systems, etc. • Music signal separation based on nonnegative matrix factorization (NMF) has been a very active area of research. • The extraction performance of NMF markedly degrades for mixtures of many sources. • We propose a new method for multichannel signal separation with NMF that utilizes both the spectral and spatial cues included in mixtures of multiple instruments.
- 6. NMF • NMF is a type of sparse representation algorithm that decomposes a nonnegative matrix into two nonnegative matrices: 𝒀 ≈ 𝑭𝑮. [D. D. Lee, et al., 2001] • 𝒀 (Ω × 𝑇): Observed matrix (spectrogram; frequency × time) • 𝑭 (Ω × 𝐾): Basis matrix (spectral bases; frequency × basis) • 𝑮 (𝐾 × 𝑇): Activation matrix (time-varying gains; basis × time) • Ω: Number of frequency bins • 𝑇: Number of frames • 𝐾: Number of bases
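The decomposition on this slide can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the generalized KL divergence and the multiplicative updates of Lee and Seung; the function name `nmf_kl`, the iteration count, and the toy data are illustrative choices, not taken from the slides.

```python
import numpy as np

def nmf_kl(Y, K, n_iter=500, eps=1e-12, seed=0):
    """Factorize a nonnegative Y (Omega x T) as F (Omega x K) @ G (K x T)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = Y.shape
    F = rng.random((n_freq, K)) + eps    # basis matrix (spectral bases)
    G = rng.random((K, n_frames)) + eps  # activation matrix (time-varying gains)
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Multiplicative updates for the generalized KL divergence
        F *= ((Y / (F @ G + eps)) @ G.T) / (ones @ G.T + eps)
        G *= (F.T @ (Y / (F @ G + eps))) / (F.T @ ones + eps)
    return F, G

# Toy "spectrogram" that is exactly a product of two nonnegative matrices
rng = np.random.default_rng(1)
Y = rng.random((6, 2)) @ rng.random((2, 8))
F, G = nmf_kl(Y, K=2)
rel_err = np.linalg.norm(Y - F @ G) / np.linalg.norm(Y)
```

Because the updates only multiply by nonnegative ratios, `F` and `G` stay nonnegative as long as they are initialized that way.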
- 7. Penalized Supervised NMF (PSNMF) • In PSNMF, the decomposition 𝒀 ≈ 𝑭𝑮 + 𝑯𝑼 is addressed under the condition that 𝑭 is known in advance. [Yagi, et al., 2012] • Training process: the supervised bases 𝑭 of the target sound are trained from the supervision sound. • Separation process: fix the trained bases 𝑭 and update 𝑮, 𝑯, and 𝑼; 𝑯 is forced to become uncorrelated with 𝑭. • Problem of PSNMF: when the signal includes many sources, the extraction performance markedly degrades.
- 8. Directional Clustering • Directional clustering can estimate sources and their directions in a multichannel signal. [Araki, et al., 2007] [Miyabe, et al., 2009] • [Figure: scatter plot of source components, L-ch input signal vs. R-ch input signal, with centroid vectors for the left, center, and right clusters] • Problem of directional clustering: this method cannot separate sources in the same direction.
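One way to picture directional clustering for a stereo signal is a k-means-style clustering of the per-bin (L, R) amplitude vectors. The sketch below is an assumption-laden illustration: the cosine-similarity assignment, the three fixed initial centroids, and the function name `directional_clustering` are hypothetical choices, not the exact procedure of the cited papers.

```python
import numpy as np

def directional_clustering(L_mag, R_mag, n_iter=20, eps=1e-12):
    """Label each time-frequency bin as 0 (left), 1 (center), or 2 (right)."""
    # One (L, R) amplitude vector per bin, normalized to unit length
    X = np.stack([L_mag.ravel(), R_mag.ravel()], axis=1)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    # Assumed initial centroids: left-heavy, balanced, right-heavy
    C = np.array([[1.0, 0.2], [1.0, 1.0], [0.2, 1.0]])
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    for _ in range(n_iter):
        labels = np.argmax(X @ C.T, axis=1)  # nearest centroid by cosine similarity
        for c in range(3):
            members = X[labels == c]
            if len(members) > 0:             # keep the old centroid if a cluster is empty
                m = members.mean(axis=0)
                C[c] = m / (np.linalg.norm(m) + eps)
    labels = np.argmax(X @ C.T, axis=1)
    return labels.reshape(L_mag.shape)

# Toy magnitudes: column 0 is left-heavy, column 1 balanced, column 2 right-heavy
L_mag = np.array([[1.0, 0.7, 0.1], [0.9, 0.5, 0.2]])
R_mag = np.array([[0.1, 0.7, 1.0], [0.2, 0.5, 0.9]])
labels = directional_clustering(L_mag, R_mag)
center_mask = (labels == 1).astype(float)  # binary mask for the center cluster
```

Two sources panned to the same direction produce the same (L, R) ratio, so they fall into the same cluster, which is the limitation stated on the slide.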
- 9. Multichannel NMF • Multichannel NMF has also been proposed. [Ozerov, et al., 2010] [Sawada, et al., 2012] • It is a natural extension of NMF to multichannel signals. • This method uses spectral and spatial cues to achieve the unsupervised separation task. • Problem of multichannel NMF: this unified method poses a very difficult optimization problem, because many variables must be optimized using only one cost function. Multichannel NMF depends strongly on initial values and lacks robustness.
- 10. Hybrid method • The conventional hybrid method applies PSNMF after directional clustering. [Iwao, et al., 2012] • This method consists of two techniques. – Directional clustering (spatial separation) – PSNMF (source separation) • [Figure: conventional hybrid method; L/R input → directional clustering → PSNMF]
- 11. Problem of hybrid method • The signal extracted by the hybrid method has considerable distortion. • There are many spectral chasms in the spectrogram obtained by directional clustering. • The resolution of the spectrogram is degraded. • [Figure: directional clustering as binary masking; input spectrogram ⊙ binary mask (1: target direction, 0: other direction) = separated cluster with chasms, where ⊙ is the Hadamard product (product of each element); all panels are time × frequency]
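The masking step that creates the chasms is an element-wise product. The toy numbers below are made up for illustration; rows are frequency bins and columns are frames.

```python
import numpy as np

# Toy input spectrogram (frequency x time); values are arbitrary
Y = np.array([[4.0, 3.0, 2.0],
              [1.0, 5.0, 2.0],
              [3.0, 1.0, 6.0]])

# Binary mask from directional clustering: 1 = target direction, 0 = other direction
mask = np.array([[1, 0, 1],
                 [0, 1, 1],
                 [1, 0, 0]])

separated = Y * mask                       # Hadamard product; zeros are the chasms
chasms_per_frame = (1 - mask).sum(axis=0)  # number of chasms in each frame
chasm_ratio = 1.0 - mask.mean()            # fraction of grids lost to masking
```

Every zero in `mask` discards that grid entirely, including whatever target energy it contained, which is why the separated cluster has degraded resolution.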
- 13. Proposed hybrid method • Employ a new supervised NMF algorithm as an alternative to the conventional PSNMF in the hybrid method. • Conventional hybrid method: input stereo signal (L-ch, R-ch) → STFT → directional clustering → center component (L-ch, R-ch) → PSNMF on each channel → ISTFT on each channel → mixing → extracted signal • Proposed hybrid method: input stereo signal (L-ch, R-ch) → STFT → directional clustering → center component (L-ch, R-ch) and index of the center cluster → superresolution-based SNMF on each channel → ISTFT on each channel → mixing → extracted signal
- 14. Superresolution-based supervised NMF • In the proposed supervised NMF, the spectral chasms are treated as unseen observations using an index matrix. • [Figure: separated cluster (time × frequency) with chasms, and the corresponding binary index matrix; 1: grid of separated component, 0: grid of chasm (hole)]
- 15. Superresolution-based supervised NMF • The components of the target sound lost after directional clustering can be extrapolated using supervised bases. • [Figure: separated cluster with chasms → superresolution using supervised bases → reconstructed spectrogram; both panels are time × frequency]
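The extrapolation idea can be sketched as a masked fit: the activations are estimated from the observed grids only, and the product of the supervised bases and those activations then fills in the chasms. This is a simplified sketch that assumes the KL divergence, a fixed known basis matrix `F`, no interference term, and no regularization; the name `masked_activation_update` and the toy matrices are hypothetical.

```python
import numpy as np

def masked_activation_update(Y, I, F, n_iter=500, eps=1e-12, seed=0):
    """Estimate activations G from the grids where the index matrix I is 1."""
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], Y.shape[1])) + eps
    for _ in range(n_iter):
        X = F @ G + eps
        # Weighted KL update: grids with I == 0 contribute nothing
        G *= (F.T @ (I * Y / X)) / (F.T @ I + eps)
    return G

# Supervised bases (assumed known) and a clean target spectrogram
F = np.array([[1.0, 0.0], [0.8, 0.1], [0.6, 0.2], [0.3, 0.3],
              [0.2, 0.6], [0.1, 0.8], [0.0, 1.0]])
G_true = np.array([[1.0, 0.5, 2.0, 0.3],
                   [0.5, 1.5, 0.4, 1.0]])
Y_clean = F @ G_true

# Index matrix: 1 = observed grid, 0 = chasm
I = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1],
              [1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1]], dtype=float)

G = masked_activation_update(Y_clean * I, I, F)
reconstructed = F @ G                         # chasms are filled by extrapolation
max_err = np.abs(reconstructed - Y_clean).max()
```

In this idealized case every frame keeps enough observed grids to determine its activations, so the chasms are recovered almost exactly. A frame with almost no observed grids would not be constrained in this way.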
- 16. Superresolution-based supervised NMF • Signal flow of the proposed hybrid method • [Figure (a): observed spectra; frequency of source components vs. direction (left, center, right), with the target source in the center]
- 17. Superresolution-based supervised NMF • Signal flow of the proposed hybrid method • [Figure (a): observed spectra → directional clustering → Figure (b): after directional clustering; only the target direction is kept, and center sources lose some of their components]
- 18. Superresolution-based supervised NMF • Signal flow of the proposed hybrid method • [Figure (b): after directional clustering; center sources lose some of their components]
- 19. Superresolution-based supervised NMF • Signal flow of the proposed hybrid method • [Figure (b): after directional clustering → superresolution-based NMF → Figure (c): after superresolution-based SNMF; extrapolated target source]
- 20. Superresolution-based supervised NMF • The basis extrapolation includes an underlying problem. • If the time-frequency spectra are almost unseen in the spectrogram, which means that almost all index entries are zero, a large extrapolation error may occur. • It is necessary to regularize the extrapolation. • [Figure: separated cluster with an almost unseen frame, and a spectrogram (time [s] × frequency [kHz]) showing the extrapolation error, i.e., incorrectly modified activations]
- 21. Superresolution-based supervised NMF • We propose to introduce a regularization term in the cost function (regularization by norm minimization). • The intensity of this regularization is proportional to the number of chasms in each frame. • 𝑰: Index matrix • Overbar: Binary complement • 𝑖𝜔,𝑡: Entry of index matrix 𝑰 • 𝑔𝑘,𝑡: Entry of matrix 𝑮 • 𝑓𝜔,𝑘: Entry of matrix 𝑭
- 22. Superresolution-based supervised NMF • The cost function in regularized superresolution-based NMF is defined using the index matrix; it consists of a divergence term, a regularization term, and a penalty term. • Since the divergence is only defined in grids whose index is one, the chasms in the spectrogram are ignored. • The divergence can be an arbitrary divergence function. • The penalty term forces 𝑭 and 𝑯 to become uncorrelated with each other. • The regularization and penalty terms are scaled by weighting parameters.
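The equation itself is not preserved in this transcript, so the following is only a sketch of a cost with the three described terms. It assumes the Euclidean distance as the divergence, a regularization of the form "squared target reconstruction summed over chasm grids", and a penalty of the form "squared Frobenius norm of the product of 𝑭 transposed and 𝑯". The weight names `lam` and `mu` and the function name are hypothetical.

```python
import numpy as np

def regularized_cost(Y, I, F, G, H, U, lam, mu):
    """Cost = masked divergence + regularization on chasms + F/H penalty."""
    X = F @ G + H @ U
    # Divergence (Euclidean here) evaluated only where the index is one
    data_term = np.sum(I * (Y - X) ** 2)
    # Regularization: target energy placed in chasm grids (index is zero)
    reg_term = lam * np.sum((1.0 - I) * (F @ G) ** 2)
    # Penalty: discourages bases in H that correlate with the supervised bases F
    pen_term = mu * np.sum((F.T @ H) ** 2)
    return data_term + reg_term + pen_term

# Tiny worked example: perfect fit, one chasm, orthogonal F and H
Y = np.array([[1.0, 2.0], [3.0, 4.0]])
I = np.array([[1.0, 0.0], [1.0, 1.0]])
F = np.array([[1.0], [0.0]])
G = np.array([[1.0, 2.0]])
H = np.array([[0.0], [1.0]])
U = np.array([[3.0, 4.0]])
value = regularized_cost(Y, I, F, G, H, U, lam=0.5, mu=1.0)
```

Here the data term and the penalty are both zero, so only the regularization contributes: the target reconstruction places a value of 2 in the single chasm grid, giving 0.5 × 2² = 2.0. Summing the regularization over chasm grids is what makes its strength grow with the number of chasms in a frame.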
- 23. Superresolution-based supervised NMF • The update rules that minimize the cost function based on KL divergence are obtained as follows: [Equations: update rules for the KL divergence]
- 24. Superresolution-based supervised NMF • The update rules that minimize the cost function based on Euclidean distance are obtained as follows: [Equations: update rules for the Euclidean distance]
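Since the update equations are not preserved here, the sketch below shows one possible set of multiplicative updates for the Euclidean case. They are obtained by the usual split of the gradient into positive and negative parts, under the same assumed cost as a masked Euclidean term plus `lam` and `mu` weighted regularization and penalty terms. They are not claimed to be the authors' rules, and all names are hypothetical.

```python
import numpy as np

def cost(Y, I, F, G, H, U, lam, mu):
    """Masked Euclidean term + regularization + penalty (assumed forms)."""
    X = F @ G + H @ U
    return (np.sum(I * (Y - X) ** 2)
            + lam * np.sum((1.0 - I) * (F @ G) ** 2)
            + mu * np.sum((F.T @ H) ** 2))

def separate(Y, I, F, n_other=2, lam=0.1, mu=0.1, n_iter=100, eps=1e-12, seed=0):
    """Fix the supervised bases F; update G, H, and U multiplicatively."""
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], Y.shape[1])) + eps
    H = rng.random((Y.shape[0], n_other)) + eps
    U = rng.random((n_other, Y.shape[1])) + eps
    I_bar = 1.0 - I                      # binary complement: 1 marks a chasm
    history = [cost(Y, I, F, G, H, U, lam, mu)]
    for _ in range(n_iter):
        X = F @ G + H @ U
        G *= (F.T @ (I * Y)) / (F.T @ (I * X) + lam * (F.T @ (I_bar * (F @ G))) + eps)
        X = F @ G + H @ U
        H *= ((I * Y) @ U.T) / ((I * X) @ U.T + mu * (F @ (F.T @ H)) + eps)
        X = F @ G + H @ U
        U *= (H.T @ (I * Y)) / (H.T @ (I * X) + eps)
        history.append(cost(Y, I, F, G, H, U, lam, mu))
    return G, H, U, history

# Toy mixture: target (known bases F) plus interference, then random masking
rng = np.random.default_rng(1)
F = rng.random((8, 2))
Y = F @ rng.random((2, 10)) + rng.random((8, 2)) @ rng.random((2, 10))
I = (rng.random(Y.shape) > 0.3).astype(float)
G, H, U, history = separate(Y * I, I, F)
target_estimate = F @ G                  # extracted (and extrapolated) target
```

The `history` list makes it easy to check that the cost decreases over the iterations for a given choice of `lam` and `mu`.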
- 26. Evaluation experiment • We compared five methods. – Simple directional clustering – Simple PSNMF – Multichannel NMF based on IS-divergence – Conventional hybrid method using PSNMF – Proposed hybrid method using superresolution-based SNMF • [Figure: signal flows of the conventional and proposed hybrid methods, as on slide 13]
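- 27. Evaluation experiment • We used stereo-panning signals with the left and right sources located at 40° and at 15°. • The mixture consists of four instruments (Ob., Fl., Tb., and Pf.) generated by a MIDI synthesizer. • We used the same type of MIDI sounds as the target instruments as supervision for the training process. • Supervision sound: two octaves of notes that cover all notes of the target signal. • [Figure: layout of sources 1 to 4 across left, center, and right, with the target source in the center; scores of the target source and the supervision sound]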
- 28. Experimental results (40°) • Average SDR, SIR, and SAR scores for each method, where the four instruments are shuffled with 12 combinations. • SDR: quality of the separated target sound • SIR: degree of separation between the target and other sounds • SAR: absence of artificial distortion • [Figure: bar charts of SDR, SIR, and SAR for each method; higher is better]
- 29. Experimental results (15°) • Average SDR, SIR, and SAR scores for each method, where the four instruments are shuffled with 12 combinations. • SDR: quality of the separated target sound • SIR: degree of separation between the target and other sounds • SAR: absence of artificial distortion • [Figure: bar charts of SDR, SIR, and SAR for each method; higher is better]
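For readers unfamiliar with these scores, the sketch below computes a simplified version of SDR, SIR, and SAR. It assumes the instantaneous-gain form of the BSS Eval decomposition (the estimate is projected onto the true sources without any filtering); the slides do not say which implementation was used, and the function name and test signals are illustrative.

```python
import numpy as np

def bss_eval_simple(est, target, interferers):
    """Return (SDR, SIR, SAR) in dB for one estimated signal."""
    S = np.column_stack([target] + list(interferers))
    # Part of the estimate explained by the target alone
    s_target = (est @ target) / (target @ target) * target
    # Part explained by all true sources together (least-squares projection)
    coef, *_ = np.linalg.lstsq(S, est, rcond=None)
    p_all = S @ coef
    e_interf = p_all - s_target          # leakage from the other sources
    e_artif = est - p_all                # everything no source can explain

    def db(num, den):
        return 10.0 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))

    sdr = db(s_target, e_interf + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf, e_artif)
    return sdr, sir, sar

# Mutually orthogonal test tones: target, one interferer, and an artifact
t = np.arange(1000) / 1000.0
target = np.sin(2 * np.pi * 5 * t)
interf = np.sin(2 * np.pi * 9 * t)
artifact = 0.01 * np.sin(2 * np.pi * 13 * t)
est = target + 0.1 * interf + artifact
sdr, sir, sar = bss_eval_simple(est, target, [interf])
```

In this example the interference is 20 dB below the target and the artifact is much weaker, so SIR is 20 dB, SAR is about 40 dB, and SDR sits just under SIR because it accounts for both error types.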
- 30. Conclusions • We propose a new supervised NMF algorithm for the hybrid method to separate stereo or multichannel signals. • The proposed supervised method recovers the resolution of the spectrogram, which is degraded by the binary masking in directional clustering, using supervised basis extrapolation. • The proposed hybrid method separates the target signal with higher performance than conventional methods. • Thank you for your attention!

- Good afternoon everyone, // I’m Daichi Kitamura from Nara Institute of Science and Technology, Japan. Today // I’d like to talk about superresolution-based stereo signal separation via supervised nonnegative matrix factorization.
- This is the outline of my talk.
- First, // I will talk about the research background.
- Recently, // music signal separation technologies have received much attention. Music signal separation based on nonnegative matrix factorization, // NMF in short, // has been a very active area of research. NMF can extract the target signal to some extent, // especially in the case of a small number of instruments. However, // for mixtures of many sources, / like more realistic musical tunes, / the extraction performance markedly degrades. To solve this problem, // we propose a new method for multichannel signal separation / with NMF utilizing both spectral and spatial cues / included in mixtures of multiple instruments.
- Next, // I will talk about conventional methods.
- NMF is a type of sparse representation algorithm // that decomposes a nonnegative matrix / into two nonnegative matrices like this. Here, Y is an observed spectrogram, // F is a basis matrix / that contains the spectral patterns of the observed signal as column vectors, // and G is an activation matrix / that corresponds to the activation of each spectral pattern.
- Penalized supervised NMF, / PSNMF in short, / has been proposed by Yagi and others. In PSNMF, // an observed matrix is decomposed like this. Here, F is the set of bases / trained on the target supervision sound in the training process. So, the target signal is extracted as the product of F and G. This method uses spectral cues for the separation. However, when the input signal includes many instrumental sources, // the extraction performance markedly degrades, because several similar bases arise in both the target and the other instruments. (In addition, // to prevent the simultaneous generation / of similar spectral patterns in the matrices F and H, // a specific penalty is imposed between F and H.)
- Next, // I will explain the directional clustering method. Directional clustering can estimate sources and their directions in a multichannel signal. This method can separate sources using the spatial information in an observed signal. However, this method cannot separate sources in the same direction, like this.
- As another means of addressing multichannel signal separation, multichannel NMF has also been proposed by Ozerov and Sawada. This method is a natural extension of NMF, and uses spectral and spatial cues. But this unified method poses a very difficult optimization problem / because many variables must be optimized with one cost function. So, this method strongly depends on the initial values.
- To solve these problems, / a hybrid method that applies PSNMF after directional clustering / has been proposed. This method consists of two techniques. First, / directional clustering is applied to the input signal / to separate the target direction. Then, / PSNMF is applied after the directional clustering / to separate the target source. (This method uses a suitable decomposition / for each separation problem, i.e., this hybrid method is a divide-and-conquer method.)
- But / the hybrid method also has a problem. The signal extracted by the hybrid method / suffers from considerable distortion. This is because / directional clustering is binary masking in the time-frequency domain. So, / the separated cluster / has many spectral chasms. In other words, the resolution of the spectrogram is degraded.
- Next, // I will talk about the proposed method.
- In the proposed method, / we employ a new supervised NMF algorithm / as an alternative to the conventional PSNMF in the hybrid method.
- This is an example of a spectrogram obtained by directional clustering. There are many spectral chasms. And this matrix is the index of the separated cluster. Here, ones indicate the grids of components separated by directional clustering, and zeros indicate the grids of chasms in the spectrogram. In the proposed supervised NMF, / these spectral chasms are treated as unseen observations / using this index matrix, like this. Therefore, / supervised NMF is applied only to the observed valid components, / not to unseen observations like these chasms. (Directional clustering is hard clustering, that is, binary masking. The index matrix of directional clustering is obtained from the separated results. So, we know where the chasms are. The ones mean observations, and the zeros mean unseen observations.)
- In addition, the components of the target sound lost after directional clustering / can be extrapolated using supervised bases. In other words, / the resolution of the target spectrogram / is recovered with the superresolution / by the supervised basis extrapolation.
- (pointing (a)) This is the directional source distribution of the observed stereo signal. The target source is in the center direction, / and the other sources are distributed like this.
- Directional clustering is binary masking in the time-frequency domain. So, / the boundary lines are determined by k-means clustering like this, and the separated cluster is obtained. Here, / left and right source components / leak into the center cluster, // and center sources lose some of their components. These lost components / correspond to the spectral chasms in the time-frequency domain. In addition, the interference source in the same direction remains.
- Then, after the directional clustering,
- the superresolution-based NMF is applied. This NMF separates the target source / and reconstructs lost components with basis extrapolation using supervised bases.
- However, / this basis extrapolation includes an underlying problem. If the time-frequency spectra are almost unseen in the spectrogram, / a large extrapolation error may occur. So, it is necessary to regularize / this extrapolation.
- To reduce this extrapolation error, we propose to introduce a regularization term. This regularization is based on the assumption that // a frame / that has many spectral chasms / intrinsically does not contain much of the target components. Here, I bar means the binary complement of the index. So, / I bar represents the grids of chasms. Therefore, the intensity of this regularization is proportional to the number of chasms in each frame.
- The cost function is defined using the index matrix like this. Here, D is an arbitrary divergence function, and we used the generalized Kullback-Leibler divergence and the Euclidean distance. This term is the regularization, and this is the penalty term to avoid sharing the same basis between F and H. Since the divergence is only defined in grids whose index is one, the chasms in the spectrogram are ignored.
- The update rules that minimize the cost function based on KL divergence are obtained like this.
- In addition, the update rules based on Euclidean distance are obtained like this.
- Then, // I will talk about the experiments.
- In the experiment, we compared five methods. Simple directional clustering, / simple PSNMF, / multichannel NMF, / conventional hybrid method using PSNMF, / and proposed hybrid method using superresolution-based SNMF.
- And we used stereo-panning signals. The left and right sources are located at 40 degrees and at 15 degrees. The signal contains 4 instruments, namely, oboe, flute, trombone, and piano, / generated by a MIDI synthesizer. These sources are mixed with the same power, / and the target source is always located in the center. In addition, / we used the same type of MIDI sounds as the target instruments / as the supervision sound, / like this (pointing supervision score). This supervision sound consists of two octaves of notes that cover all notes of the target signal.
- These results are the averages of the evaluation scores / for the 40-degree signals. Here, / SDR indicates the quality of the separated target sound, / SIR indicates the degree of separation / between the target and other sounds, / and SAR indicates the absence of artificial distortion. Therefore, SDR is the total evaluation score that involves SIR and SAR. From these results, the proposed hybrid method outperforms the other methods.
- And this is the result for the 15-degree signals. Similar to the results for the 40-degree signals, / the proposed hybrid method is effective and robust for multichannel signal separation. We can confirm that directional clustering and multichannel NMF do not have sufficient performance because they cannot discriminate the sources in the same direction. In contrast, the methods using SNMF give better results, and the proposed method with superresolution-based SNMF outperforms all other methods.
- These are the conclusions of my talk. Thank you for your attention.
- In addition, / the spectrogram of the target sound is reconstructed / using better-matched bases / in the proposed NMF. (pointing (a)) This is the directional source distribution of the observed stereo signal. The target source is in the center direction, / and the other sources are distributed like this. After directional clustering, / the separated cluster loses some of its components. And after superresolution-based NMF, the target components are restored using supervised bases. In other words, / the resolution of the target spectrogram / is recovered with the superresolution / by the supervised basis extrapolation.
- If the number of sources in the same direction as the target instrument increases, the separation performance of supervised NMF markedly degrades. This is because several similar bases arise in both the target and the other instruments.
- If the left and right sources are close to the center direction, the separation becomes difficult, because directional clustering cannot separate them well. In addition, basis extrapolation also becomes difficult because the number of chasms in the separated cluster / increases in this case. In contrast, if theta becomes larger, the separation becomes easy.
- This is the signal flow of the proposed hybrid method. In our experiment, superresolution-based supervised NMF is applied only to the center direction because the target source is located in the center direction. However, if the target source is located on the left or right side, we should apply this NMF to the direction that has the target source, whether or not there is another source in that direction.
- SDR: quality of the separated target sound; SIR: degree of separation between the target and other sounds; SAR: absence of artificial distortion
- SDR is the total evaluation score as the performance of separation.
- Penalized supervised NMF, / PSNMF in short, / has been proposed by Yagi and others. In PSNMF, // an observed matrix is decomposed like this. Here, F is a nonnegative matrix / that contains the target sound bases as column vectors, // G is an activation matrix / that corresponds to F, // and H and U are nonnegative matrices. So, the target signal is extracted as the product of F and G. In addition, // to prevent the simultaneous formation / of similar spectral patterns in the matrices F and H, // a specific penalty is imposed between F and H. However, // PSNMF has a problem. When the input signal includes many instrumental sources, // the extraction performance markedly degrades (because several similar bases arise in both the target and the other instruments).