Presented at Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2014 (APSIPA 2014, international conference)
Daichi Kitamura, Hiroshi Saruwatari, Satoshi Nakamura, Yu Takahashi, Kazunobu Kondo, Hirokazu Kameoka, "Hybrid multichannel signal separation using supervised nonnegative matrix factorization with spectrogram restoration," Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2014 (APSIPA 2014), Siem Reap, Cambodia, December 2014 (invited paper).
Hybrid multichannel signal separation using supervised nonnegative matrix factorization with spectrogram restoration
1. Hybrid Multichannel Signal Separation Using
Supervised Nonnegative Matrix Factorization
Daichi Kitamura, (The Graduate University for Advanced Studies, Japan)
Hiroshi Saruwatari, (The University of Tokyo, Japan)
Satoshi Nakamura, (Nara Institute of Science and Technology, Japan)
Yu Takahashi, (Yamaha Corporation, Japan)
Kazunobu Kondo, (Yamaha Corporation, Japan)
Hirokazu Kameoka, (The University of Tokyo, Japan)
Asia-Pacific Signal and Information Processing Association ASC 2014
Special session – Recent Advances in Audio and Acoustic Signal processing
2. Outline
• 1. Research background
• 2. Conventional methods
– Nonnegative matrix factorization
– Supervised nonnegative matrix factorization
– Multichannel NMF
• 3. Proposed method
– SNMF with spectrogram restoration and its Hybrid method
• 4. Experiments
– Closed data experiment
– Open data experiment
• 5. Conclusions
4. Research background
• Signal separation has received much attention.
• Music signal separation based on nonnegative matrix
factorization (NMF) is a very active research area.
• Supervised NMF (SNMF) achieves the highest
separation performance.
• To improve its performance, an SNMF-based
multichannel signal separation method is required.
Applications
• Automatic music transcription
• 3D audio system, etc.
Goal: separate the target signal from multichannel
signals with high accuracy.
6. NMF [Lee, et al., 2001]
• NMF can extract significant spectral patterns.
– Basis matrix F has frequently appearing spectral patterns
in Y.
Decomposition: Y ≈ FG
Y (Ω × 𝑇): Observed matrix (spectrogram)
F (Ω × 𝐾): Basis matrix (spectral patterns)
G (𝐾 × 𝑇): Activation matrix (time-varying gain)
Ω: Number of frequency bins
𝑇: Number of time frames
𝐾: Number of bases
[Figure: observed spectrogram (frequency vs. time) decomposed into the bases (amplitude vs. frequency) and their activations (amplitude vs. time)]
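The decomposition on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the KL-divergence as the cost and uses the standard multiplicative updates; the function name `nmf_kl`, the iteration count, and the toy data are hypothetical.

```python
import numpy as np

def nmf_kl(Y, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative matrix Y (freq x time) as Y ~= F @ G."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = Y.shape
    F = rng.random((n_freq, n_bases)) + eps    # basis matrix (spectral patterns)
    G = rng.random((n_bases, n_frames)) + eps  # activation matrix (time-varying gains)
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Multiplicative updates for the generalized KL-divergence
        F *= ((Y / (F @ G + eps)) @ G.T) / (ones @ G.T + eps)
        G *= (F.T @ (Y / (F @ G + eps))) / (F.T @ ones + eps)
    return F, G

# Toy "spectrogram" built from two spectral patterns (assumed data)
rng = np.random.default_rng(1)
Y = rng.random((16, 2)) @ rng.random((2, 40))
F, G = nmf_kl(Y, n_bases=2)
rel_err = np.linalg.norm(Y - F @ G) / np.linalg.norm(Y)
```

Because the updates only multiply by nonnegative ratios, F and G stay nonnegative without any explicit projection.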
7. Supervised NMF [Smaragdis, et al., 2007]
• SNMF
– Supervised spectral separation method
Training process: sample sounds of the target signal → supervised basis matrix F
(spectral dictionary)
Separation process: Y ≈ FG + HU, where F is fixed and G, H, and U are optimized
Mixed signal: Y, target signal: FG, other signal: HU
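The two-stage procedure (train F on sample sounds, then fit Y ≈ FG + HU with F fixed) might look like the sketch below. It assumes KL multiplicative updates; the function names `train_basis` and `snmf_separate`, the numbers of bases, and the synthetic data are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

EPS = 1e-12

def train_basis(Y_train, n_bases, n_iter=200, seed=0):
    """Training process: learn the supervised basis F from sample sounds."""
    rng = np.random.default_rng(seed)
    F = rng.random((Y_train.shape[0], n_bases)) + EPS
    G = rng.random((n_bases, Y_train.shape[1])) + EPS
    for _ in range(n_iter):
        R = Y_train / (F @ G + EPS)
        F *= (R @ G.T) / (G.sum(axis=1) + EPS)
        R = Y_train / (F @ G + EPS)
        G *= (F.T @ R) / (F.sum(axis=0)[:, None] + EPS)
    return F

def snmf_separate(Y, F, n_other, n_iter=200, seed=0):
    """Separation process: fit Y ~= F @ G + H @ U while F stays fixed."""
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], Y.shape[1])) + EPS
    H = rng.random((Y.shape[0], n_other)) + EPS
    U = rng.random((n_other, Y.shape[1])) + EPS
    for _ in range(n_iter):
        R = Y / (F @ G + H @ U + EPS)
        G *= (F.T @ R) / (F.sum(axis=0)[:, None] + EPS)
        R = Y / (F @ G + H @ U + EPS)
        H *= (R @ U.T) / (U.sum(axis=1) + EPS)
        R = Y / (F @ G + H @ U + EPS)
        U *= (H.T @ R) / (H.sum(axis=0)[:, None] + EPS)
    return F @ G, H @ U  # target estimate, other-signal estimate

# Synthetic example (assumed data): the target shares its spectral patterns with the training sound
rng = np.random.default_rng(1)
F_true = rng.random((32, 3))
Y_train = F_true @ rng.random((3, 50))
mixed = F_true @ rng.random((3, 60)) + rng.random((32, 2)) @ rng.random((2, 60))
F = train_basis(Y_train, n_bases=3)
target_est, other_est = snmf_separate(mixed, F, n_other=2)
```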
8. Problems of SNMF
• SNMF is only for single-channel signals.
– For a multichannel signal, SNMF cannot use information
between channels.
• When many interference sources exist, the separation
performance of SNMF markedly degrades.
[Figure: separated signal with residual components]
9. Multichannel NMF [Sawada, et al., 2013]
• Multichannel NMF
– is a natural extension of NMF for a multichannel signal
– uses spatial information for the clustering of bases to
achieve the unsupervised separation task.
Problems:
Multichannel NMF depends strongly on the initial values
and lacks robustness.
[Figure: microphone array]
10. Outline
• 1. Research background
• 2. Conventional methods
– Nonnegative matrix factorization
– Supervised nonnegative matrix factorization
– Multichannel NMF
• 3. Proposed method
– Motivation and strategy
– SNMF with spectrogram restoration and its Hybrid method
• 4. Experiments
– Closed data experiment
– Open data experiment
• 5. Conclusions
11. Motivation and strategy
• Sawada’s multichannel NMF
– is a unified method to solve spatial and spectral separations.
– Maximizes a likelihood of the observed spectrograms Y with respect to
the spatial direction of the target signal θ and the source components
of all signals (target: F, G; other: H, U).
– For the supervised situation, the target spectral patterns F are given.
– Too difficult to solve (lacks robustness)
– Computationally inefficient (much computational time)
12. Motivation and strategy
• Proposed hybrid method
– approximately divides the problem as follows:
Unsupervised spatial separation (classical D.O.A. estimation)
+ Supervised spectral separation (SNMF-based method)
– The spatial separation should be carried out with classical
D.O.A. estimation methods.
• These methods are very efficient and stable.
– Divide and conquer method
13. Directional clustering [Araki, et al., 2007]
• Directional clustering
– Unsupervised spatial separation method
– k-means clustering (fast and stable)
• Problems
– Artificial distortion arises owing to the binary masking.
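Input signal (stereo): sources in the left (L), center (C), and right (R) directions
Separated signal: center cluster only
Cluster labels of the spectrogram (rows: frequency, columns: time):
C C C R L R
C L L L R R
C C C C R R
C R R L L L
C C C C C C
Binary mask for the center cluster (rows: frequency, columns: time):
1 1 1 0 0 0
1 0 0 0 0 0
1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
Binary masking: entry-wise product of the spectrogram and the binary mask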
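The clustering and masking steps can be sketched as follows. This is a simplified illustration under stated assumptions: the clustering cue is a panning angle computed from the left/right magnitudes, the k-means runs in one dimension with a fixed initialization, and the function name, pan gains, and toy label grid are hypothetical (the grid reuses the labels shown on this slide, with 0 = left, 1 = center, 2 = right).

```python
import numpy as np

def directional_clustering(L_mag, R_mag, n_clusters=3, n_iter=50):
    """Cluster time-frequency bins by the inter-channel level difference.

    Returns integer labels ordered by direction: 0 = left, ..., n_clusters - 1 = right.
    """
    # Panning angle in [0, pi/2]: 0 = left only, pi/4 = center, pi/2 = right only
    angle = np.arctan2(R_mag, L_mag).ravel()
    centroids = np.linspace(0.0, np.pi / 2, n_clusters)  # deterministic initialization
    for _ in range(n_iter):
        labels = np.argmin(np.abs(angle[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = angle[labels == c].mean()
    labels = np.argmin(np.abs(angle[:, None] - centroids[None, :]), axis=1)
    return labels.reshape(L_mag.shape)

# Toy stereo magnitudes: each bin is dominated by one source (assumed pan gains)
true = np.array([[1, 1, 1, 2, 0, 2],
                 [1, 0, 0, 0, 2, 2],
                 [1, 1, 1, 1, 2, 2],
                 [1, 2, 2, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1]])
gains_L = np.array([1.0, 0.7, 0.1])
gains_R = np.array([0.1, 0.7, 1.0])
amp = np.random.default_rng(0).random(true.shape) + 0.5
L_mag, R_mag = amp * gains_L[true], amp * gains_R[true]

labels = directional_clustering(L_mag, R_mag)
center_mask = (labels == 1).astype(float)  # binary mask of the center cluster
separated_L = center_mask * L_mag          # entry-wise product (binary masking)
```

Every bin outside the center cluster is set to zero, which is exactly where the spectral holes come from.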
14. Proposed method: hybrid separation
• Hybrid separation method
Input stereo signal (L, R)
→ Spatial separation method (Directional clustering)
→ SNMF-based separation method (SNMF with spectrogram restoration)
→ Separated signal
15. SNMF with spectrogram restoration
• The separated cluster (frequency vs. time) has spectral holes (lost components).
• The proposed SNMF treats these holes as unseen observations.
• The fittest bases are extrapolated from the supervised basis
(dictionary of target signal) to fix up the holes.
16. SNMF with spectrogram restoration
[Figure: frequency of source components vs. direction (left, center, right): (a) input signal, with the target in the center; (b) after directional clustering (binary masking); (c) after super-resolution-based SNMF, with extrapolated components]
[Figure: observed spectrogram (target and interference) → directional clustering → separated cluster → SNMF with spectrogram restoration (extrapolation with the supervised spectral bases) → reconstructed data]
17. Decomposition model and cost function
• The divergence is defined at all grids except for the
holes by using the binary mask matrix I.
Decomposition model: Y ≈ FG + HU, where F is the supervised bases (fixed)
Cost function: J = Σ_{ω,t} i_{ω,t} D_β(y_{ω,t} ‖ Σ_k f_{ω,k} g_{k,t} + Σ_l h_{ω,l} u_{l,t}) + λ Σ_{ω,t} ī_{ω,t} D_β(0 ‖ Σ_k f_{ω,k} g_{k,t}) + μ ‖FᵀH‖²_Fr
– First term: main cost; i_{ω,t} is the binary index to exclude the holes
– Second term: regularization term
– Third term: penalty term [Kitamura, et al. 2014]
I: Binary masking matrix obtained from directional clustering
i_{ω,t}, y_{ω,t}, f_{ω,k}, g_{k,t}, h_{ω,l}, u_{l,t}: Entries of matrices I, Y, F, G, H, and U, respectively
λ, μ: Weighting parameters, ī_{ω,t}: Binary complement of i_{ω,t}, ‖·‖_Fr: Frobenius norm
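To make the roles of the three terms concrete, the sketch below evaluates one plausible instance of such a cost. The specific choices are assumptions for illustration: the KL-divergence for the main term, a squared-Euclidean regularization of FG at the holes, a squared Frobenius penalty on FᵀH, and the hypothetical weights `lam` and `mu`.

```python
import numpy as np

def restoration_cost(Y, I, F, G, H, U, lam=0.1, mu=0.1, eps=1e-12):
    """Masked cost: KL on observed grids + regularization at holes + basis penalty."""
    target = F @ G
    X = target + H @ U
    kl = Y * (np.log(Y + eps) - np.log(X + eps)) - Y + X
    main = np.sum(I * kl)                         # divergence on observed grids only
    reg = lam * np.sum((1.0 - I) * target ** 2)   # regularization on the holes
    pen = mu * np.sum((F.T @ H) ** 2)             # penalty on bases shared by F and H
    return main + reg + pen

# Tiny example: a perfect model with orthogonal F and H has zero cost
F = np.array([[1.0], [2.0], [0.0], [0.0]])
H = np.array([[0.0], [0.0], [1.0], [3.0]])
G = np.array([[1.0, 2.0, 3.0]])
U = np.array([[2.0, 1.0, 1.0]])
Y = F @ G + H @ U
I_full = np.ones_like(Y)
J_full = restoration_cost(Y, I_full, F, G, H, U)

# Punch one hole at (0, 0): only the regularization term contributes there
I_hole = I_full.copy()
I_hole[0, 0] = 0.0
J_hole = restoration_cost(Y, I_hole, F, G, H, U)
```

With the hole at (0, 0), the mismatch at that grid no longer enters the main term, and the cost there comes only from the size of the extrapolated component FG.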
21. Generalized divergence: β-divergence
• D_β: β-divergence [Eguchi, et al., 2001]
– EUC-distance (β = 2)
– KL-divergence (β = 1)
– IS-divergence (β = 0)
KL-divergence: the best criterion for signal separation in SNMF
[Kitamura, et al., 2014]
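For reference, a common parameterization of the β-divergence and its three special cases can be written as below. The scaling convention (β = 2 gives half the squared Euclidean distance) is an assumption, since the slide does not show the formula, and the inputs are assumed to be strictly positive.

```python
import numpy as np

def beta_divergence(y, x, beta):
    """Sum of entry-wise beta-divergences D_beta(y || x) for positive y and x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if beta == 0:   # Itakura-Saito divergence
        r = y / x
        return np.sum(r - np.log(r) - 1.0)
    if beta == 1:   # generalized Kullback-Leibler divergence
        return np.sum(y * np.log(y / x) - y + x)
    # General case; beta = 2 gives 0.5 * (y - x)^2 (Euclidean distance)
    return np.sum((y ** beta + (beta - 1.0) * x ** beta
                   - beta * y * x ** (beta - 1.0)) / (beta * (beta - 1.0)))

y = np.array([1.0, 2.0])
x = np.array([2.0, 1.0])
d_euc = beta_divergence(y, x, 2)  # 0.5 * (1 + 1)
d_kl = beta_divergence(y, x, 1)   # log(2)
d_is = beta_divergence(y, x, 0)
```

The general expression approaches the KL and IS cases continuously as β tends to 1 and 0, which is why a single parameter can interpolate between the three criteria.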
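22. Decomposition model and cost function
• We used two β-divergences for the main cost and
the regularization cost as β_NMF and β_reg.
Decomposition model: Y ≈ FG + HU, where F is the supervised bases (fixed)
Cost function: J = Σ_{ω,t} i_{ω,t} D_{β_NMF}(y_{ω,t} ‖ Σ_k f_{ω,k} g_{k,t} + Σ_l h_{ω,l} u_{l,t}) + λ Σ_{ω,t} ī_{ω,t} D_{β_reg}(0 ‖ Σ_k f_{ω,k} g_{k,t}) + μ ‖FᵀH‖²_Fr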
23. Update rules
• We can obtain the update rules for the optimization of
the variable matrices G, H, and U.
Update rules:
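The update rules themselves are not legible in this transcript. As a stand-in, the sketch below shows what multiplicative updates look like for a simplified version of the masked cost, assuming squared-Euclidean distances for both the main and the regularization terms. It is a hedged illustration, not the rules from the paper; the function names, weights, and random data are hypothetical.

```python
import numpy as np

def cost(Y, I, F, G, H, U, lam, mu):
    """Masked squared error + regularization of FG at holes + penalty on F^T H."""
    FG = F @ G
    X = FG + H @ U
    return (np.sum(I * (Y - X) ** 2) + lam * np.sum((1 - I) * FG ** 2)
            + mu * np.sum((F.T @ H) ** 2))

def update(Y, I, F, G, H, U, lam=0.1, mu=0.1, eps=1e-12):
    """One round of multiplicative updates of G, H, and U (F stays fixed)."""
    IY = I * Y
    FG = F @ G
    X = FG + H @ U
    G = G * (F.T @ IY) / (F.T @ (I * X) + lam * (F.T @ ((1 - I) * FG)) + eps)
    X = F @ G + H @ U
    H = H * (IY @ U.T) / ((I * X) @ U.T + mu * (F @ (F.T @ H)) + eps)
    X = F @ G + H @ U
    U = U * (H.T @ IY) / (H.T @ (I * X) + eps)
    return G, H, U

# Random toy problem (assumed data): about 30% of the grids are holes
rng = np.random.default_rng(0)
F = rng.random((20, 3))
Y = F @ rng.random((3, 30)) + rng.random((20, 2)) @ rng.random((2, 30))
I = (rng.random(Y.shape) > 0.3).astype(float)
G, H, U = rng.random((3, 30)), rng.random((20, 2)), rng.random((2, 30))
history = [cost(Y, I, F, G, H, U, 0.1, 0.1)]
for _ in range(100):
    G, H, U = update(Y, I, F, G, H, U)
    history.append(cost(Y, I, F, G, H, U, 0.1, 0.1))
```

Each rule divides the negative part of the gradient by its positive part, so the variables remain nonnegative and the holes influence G only through the regularization term.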
25. Experimental condition
• Mixed signal includes four melodies (sources).
• Three compositions of instruments
– We evaluated the average score of 36 patterns.
[Figure: spatial layout of sources 1 to 4 over the left, center, and right directions; the target source is in the center]
Supervision signal: 24 notes that cover all the notes in the target melody
Dataset | Melody 1 | Melody 2 | Midrange | Bass
No. 1 | Oboe | Flute | Piano | Trombone
No. 2 | Trumpet | Violin | Harpsichord | Fagotto
No. 3 | Horn | Clarinet | Piano | Cello
26. Experimental result: closed data
• Signal-to-distortion ratio (SDR)
– total quality of the separation, which includes the degree of
separation and absence of artificial distortion.
[Figure: SDR [dB] (0 to 14; higher is better) vs. β_NMF (0 to 4; β_NMF = 1: KL-divergence, β_NMF = 2: EUC-distance) for directional clustering, supervised multichannel NMF [Sawada], conventional SNMF (single-channel SNMF), and the proposed hybrid method]
27. SNMF with spectrogram restoration
• SNMF with spectrogram restoration has two tasks:
source separation and basis extrapolation.
• The optimal divergence for source separation is the
KL-divergence (β = 1).
• In contrast, a divergence with a higher β value is
suitable for the basis extrapolation.
28. Trade-off: separation and restoration
• The optimal divergence for SNMF with spectrogram
restoration and its hybrid method is based on the
trade-off between separation and restoration abilities.
[Figure: two basis spectra, amplitude [dB] (-10 to 0) vs. frequency [kHz] (0 to 5); one with strong sparseness and one with weak sparseness]
[Figure: performance vs. β (0 to 4): separation, restoration, and total performance of the hybrid method]
29. Experimental condition
• Open data experiment
– used different tone generators for the training and test signals
Supervision signal: 24 notes that cover all the notes in the target melody,
provided by tone generator A
Test signal: provided by tone generator B (more realistic sound)
+ background noise (SNR = 10 dB)
[Figure: spatial layout of sources 1 to 4 over the left, center, and right directions; the target source is in the center]
30. Experimental result: open data
• Signal-to-distortion ratio (SDR)
– total quality of the separation, which includes the degree of
separation and absence of artificial distortion.
[Figure: SDR [dB] (-4 to 10; higher is better) vs. β_NMF (0 to 4; β_NMF = 1: KL-divergence, β_NMF = 2: EUC-distance) for directional clustering, supervised multichannel NMF [Sawada], conventional SNMF (single-channel SNMF), and the proposed hybrid method]
31. Conclusions
• We proposed a hybrid multichannel signal separation
method combining directional clustering and SNMF
with spectrogram restoration.
• There is a trade-off between separation and
restoration abilities.
Thank you for your attention!
You can hear a demonstration on my home page!
Editor's Notes
This is the outline of my talk.
Recently, signal separation technologies have received much attention.
These technologies can be used in many applications.
Nonnegative matrix factorization, NMF for short, has been a very active area of signal separation.
In particular, supervised NMF (SNMF) achieves good separation performance.
However, SNMF can be used only for single-channel signals.
To improve its performance, an SNMF-based multichannel signal separation method is required.
Before explaining supervised NMF, I will explain the basics of simple NMF.
NMF is a powerful method for extracting significant features from a spectrogram.
This method decomposes the input spectrogram Y into a product of the basis matrix F and the activation matrix G,
where the basis matrix F has frequently appearing spectral patterns as basis vectors like this,
and the activation matrix G has the time-varying gains of each spectral pattern.
To separate the target signal with NMF, supervised NMF has been proposed.
In SNMF, we first train on a sample sound of the target signal, which is like a musical scale.
Then we construct the supervised basis F. This is a spectral dictionary of the target sound.
Next, we separate the mixed signal using the supervised basis F, as FG + HU.
Therefore, the target signal is obtained as FG, and the other signal is reconstructed as HU.
The problem of SNMF is that it is only for single-channel signals. We cannot use any information between channels.
However, almost all music signals are in stereo format, so we should extend simple SNMF to a multichannel SNMF.
In addition, when many interfering sources exist, the separation performance of SNMF markedly degrades.
As another means of multichannel signal separation, multichannel NMF has also been proposed by Sawada.
This is a natural extension of NMF, and it uses spatial information for the clustering of bases to achieve unsupervised separation.
However, this method is a mathematically very difficult optimization problem.
So, this method strongly depends on the initial values.
Sawada’s multichannel NMF is a unified method to solve spatial and spectral separations simultaneously.
This method maximizes a likelihood like this, where theta is the spatial direction of the target signal, F, G, H, and U are the source components of the target and other signals, and Y is the observed spectrogram of both channels.
For the supervised situation, the target spectral patterns F are given like this.
However, even if F is given, this optimization is too difficult to solve.
So it lacks robustness.
Also, it requires much computational time.
Our proposed method approximately divides the problem into unsupervised spatial separation and supervised spectral separation.
This is because we can use classical D.O.A. estimation methods for the spatial separation, which are very efficient and stable.
Then SNMF is applied to the spectral separation problem.
Therefore, this method can be considered a divide-and-conquer method.
The optimal method is applied to each separation.
For the spatial separation, we used directional clustering because it is very fast and stable.
This method utilizes the level difference between the left and right channels as a clustering cue.
So, we can separate the sources direction-wise.
This is equivalent to binary masking in the spectrogram domain.
We get the binary mask from the result of clustering, and we calculate an entry-wise product.
Finally, we obtain the separated direction.
However, the separated direction has an artificial distortion owing to the binary masking.
So we proposed a new SNMF-based method named SNMF with spectrogram restoration.
This is the concept of our proposed hybrid method.
First, the target direction is separated.
Then, the target signal is extracted by this new SNMF.
Here, the signal separated by directional clustering has many spectral holes owing to the binary masking.
This spectrum is an example. There are many spectral holes owing to the binary masking.
However, the proposed SNMF treats these holes as unseen observations like this.
We exclude these components from the cost function.
Then, the target bases are extrapolated using the fittest spectral pattern from the supervised bases F.
As a result, the lost components are restored by the supervised basis extrapolation.
This figure shows the directional distribution of the input stereo signal.
The target source is in the center direction, and the other interfering sources are distributed like this.
After directional clustering, the left and right source components leak into the center cluster, and the center sources lose some of their components.
These lost components correspond to the spectral holes.
After SNMF with spectrogram restoration, the target components are separated and restored using the supervised bases.
In other words, the resolution of the target spectrogram is recovered.
This is the decomposition model of SNMF with spectrogram restoration. It is the same as that of simple SNMF.
J is the cost function of the proposed SNMF.
In this cost function, we introduce the binary index i, which excludes the holes from the total cost.
This index is obtained from the binary mask matrix.
Therefore, the divergence is defined at all spectrogram grids except for the spectral holes.
For the grids of the holes, we impose a regularization term to avoid the extrapolation error.
The third term is a penalty term to avoid sharing the same basis between F and H.
This penalty improves the separation performance in SNMF.
For the divergence measure, we propose to use the beta-divergence.
This is a generalized distance function, which includes the EUC-distance, KL-divergence, and IS-divergence when beta = 2, 1, and 0, respectively.
In SNMF, it is reported that the KL-divergence is the best criterion for signal separation.
We used two beta-divergences for the main cost and the regularization cost, as beta_NMF and beta_reg.
From the minimization of the cost function, we can obtain the update rules for the optimization of the variable matrices G, H, and U.
This is the experimental condition. The mixed signal includes four melodies.
Each sound source is located as in this figure, where the target source is always located in the center direction with another interfering source.
We prepared three compositions of instruments and evaluated the average score of 36 patterns.
In addition, the supervision signal has 24 notes like this score, which cover all the notes in the target melody.
This is the result of the experiment.
We show the average SDR score, where SDR indicates the total quality of the separation.
Directional clustering cannot separate the sources in the same direction, so the result was not good.
Multichannel NMF strongly depends on the initial values, and its average score is poor.
The hybrid method outperforms the conventional SNMF.
The conventional SNMF achieves the highest score when beta equals 1, that is, the KL-divergence.
However, surprisingly, the EUC-distance is preferable for the proposed hybrid method.
This is because SNMF with spectrogram restoration has two tasks, namely, separation of the target signal and basis extrapolation for the restoration of the spectrogram.
It is reported that the KL-divergence is suitable for source separation.
In contrast, a divergence with a higher beta value is suitable for the basis extrapolation.
This fact is shown experimentally in our paper.
The reason is that if we use a smaller beta value, such as the KL-divergence, the obtained basis becomes sparse. (pointing at the figure)
On the other hand, if we use a higher beta value, the sparseness of the basis becomes weak.
A sparse basis is not suitable for the basis extrapolation using only the observable data.
Therefore, the optimal divergence for the hybrid method is around the EUC-distance because of the trade-off between separation and restoration abilities, as in this graph.
The optimal beta is shifted from 1 to 2.
We also conducted an open data experiment.
Here we used different MIDI tone generators for the training and test signals.
Therefore, the waveforms are not the same, but similar.
In addition, we added background noise to the test signals at SNR = 10 dB.
This is the result.
Even if we use a different training sound, we can achieve good results.
Sawada’s multichannel NMF does not work because this method cannot reduce the diffuse noise.
These are the conclusions of my talk.
Thank you for your attention.
Supervised methods have an inherent problem.
That is, we cannot get a perfect supervision sound of the target signal.
Even if the supervision sounds are from the same type of instrument as the target sound, these sounds differ according to various conditions,
for example, individual styles of playing and the timbre individuality of each instrument.
When we want to separate this piano sound from the mixed signal, maybe we can only prepare a similar piano sound whose timbre is slightly different.
However, supervised NMF cannot separate the target because of the difference in the spectra of the target sound.
To solve this problem, we have proposed a new supervised method that adapts the supervised bases to the target spectra by basis deformation.
This is the decomposition model in this method.
We introduce a deformable term, which has both positive and negative values like this.
Then we optimize the matrices D, G, H, and U.
This figure indicates the spectral difference between the real sound and the artificial sound.
This is the result of the experiment using real-recorded signals.
From this result, we can confirm that the optimal divergence for the hybrid method is the EUC-distance.
In NMF decomposition, the cost function is defined as a distance or a divergence between the input matrix Y and the decomposed matrix FG.
J_NMF indicates the cost function in NMF, and we minimize it to find F and G under the nonnegativity constraint.
There are several criteria for the distance used in the cost function.
These three criteria are often used in NMF decomposition.
The decomposition of NMF is equivalent to a maximum likelihood estimation, which implicitly assumes a generation model of the input data Y.
If we select the parameter beta, the assumed generation model is fixed.
In other words, the parameter beta defines the generation model of the input data.
In this analysis, to compare the net extrapolation ability, we generated random input data Y that obey each generation model.
We also prepared binary-masked random data YI and attempted to restore them.
In the training process, we construct the supervised basis F using the random data Y.
Then we attempt to restore the binary-masked data using the trained basis F.
The binary mask I was generated in a uniform manner, and we generated two types of binary masks whose densities of holes are 75% and 98%.
Therefore, by calculating the similarity between the input data Y and the restored data FG, we can evaluate the extrapolation ability and the accuracy of restoration.
So SAR indicates the accuracy of restoration.
These are the results of the analysis.
The left one is the result for the 75%-binary-masked data, and the right one is for the 98%-binary-masked data.
Beta = 1, which means the KL-divergence, is the optimal divergence for source separation.
But, surprisingly, the optimal divergence for the restoration is around beta = 3.
We also conducted an experiment using real-recorded signals.
In this experiment, the binaural mixed signal was recorded in a real environment.
The other conditions are the same as those in the previous experiment.
This is the result of the experiment using real-recorded signals.
From this result, we can confirm that the optimal divergence for the hybrid method is the EUC-distance.
As I already said, the best divergence depends on the number of holes.
If there are many holes, beta = 2 should be used.
If there are not so many holes, beta = 1 should be used.
Therefore, the divergence should be switched to the optimal one using a threshold value.
We propose a frame-wise multi-divergence.
We define the multi-divergence case by case at each time frame,
where r_t is the density of holes at frame t.
The divergence is adapted according to the threshold value tau.
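The frame-wise switching can be illustrated with a short sketch. It assumes that r_t is the fraction of hole entries in frame t of the binary mask and that frames with r_t above tau use beta = 2 while the others use beta = 1; whether the comparison is strict, and the function name `framewise_beta`, are assumptions.

```python
import numpy as np

def framewise_beta(I, tau, beta_many_holes=2.0, beta_few_holes=1.0):
    """Choose beta for each time frame from the density of holes r_t."""
    # I: binary mask (freq x time), 1 = observed grid, 0 = hole
    r = 1.0 - I.mean(axis=0)
    betas = np.where(r > tau, beta_many_holes, beta_few_holes)
    return betas, r

# Three frames with no holes, half holes, and only holes
I = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 0],
              [1, 0, 0]], dtype=float)
betas, r = framewise_beta(I, tau=0.6)
```

Here the densities are 0, 0.5, and 1, so only the last frame exceeds tau = 0.6 and switches to beta = 2.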
Then we evaluated various patterns of the spatial location of the sources, SP1 to SP4.
SP4 leads to more spectral holes than SP1.
From this result, we can confirm that the multi-divergence always achieves the highest performance.
SDR is the total evaluation score of the separation performance.