Presented at 5th International Conference on 3D Systems and Applications (3DSA 2013) (international conference)
Daichi Kitamura, Hiroshi Saruwatari, Kiyohiro Shikano, Kazunobu Kondo, Yu Takahashi, "Regularized superresolution-based binaural signal separation with nonnegative matrix factorization," Proceedings of 5th International Conference on 3D Systems and Applications (3DSA 2013), S10-4, Osaka, Japan, June 2013.
Regularized superresolution-based binaural signal separation with nonnegative matrix factorization
1. Regularized Superresolution-Based Binaural Signal Separation with Nonnegative Matrix Factorization
Daichi Kitamura, Hiroshi Saruwatari,
Yusuke Iwao, Kiyohiro Shikano
(Nara Institute of Science and Technology, Nara, Japan)
Kazunobu Kondo, Yu Takahashi
(Yamaha Corporation Research & Development Center, Shizuoka, Japan)
4. Background
• Music signal separation technologies have received much
attention.
• Music signal separation based on nonnegative matrix
factorization (NMF) has been a very active area of the
research.
• The extraction performance of NMF markedly degrades for
mixtures of many sources.
Applications:
• Automatic music transcription
• 3D audio systems, etc.
We propose a new method for multichannel signal
separation with NMF utilizing both spectral and spatial
cues included in mixtures of multiple instruments.
6. NMF
• NMF is a type of sparse representation algorithm that
decomposes a nonnegative matrix into two nonnegative
matrices. [D. D. Lee, et al., 2001]
[Figure: the observed matrix 𝒀 (spectrogram, frequency × time) is decomposed as 𝒀 ≈ 𝑭𝑮 into the basis matrix 𝑭 (spectral bases, frequency × basis) and the activation matrix 𝑮 (time-varying gains, basis × time).]
Ω: Number of frequency bins
𝑇: Number of frames
𝐾: Number of bases
𝒀: Observed matrix
𝑭: Basis matrix
𝑮: Activation matrix
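The decomposition of 𝒀 into 𝑭 and 𝑮 can be sketched with the multiplicative update rules of Lee and Seung. This is a minimal illustration that assumes the squared Euclidean distance as the cost; the slide does not state which divergence the authors use, and the function name `nmf` and its parameters are hypothetical.

```python
import numpy as np

def nmf(Y, K, n_iter=200, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix Y (Omega x T) as F (Omega x K) @ G (K x T).

    Assumption: squared Euclidean cost with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = Y.shape
    F = rng.random((n_bins, K)) + eps    # basis matrix (spectral bases)
    G = rng.random((K, n_frames)) + eps  # activation matrix (time-varying gains)
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so F and G stay nonnegative.
        G *= (F.T @ Y) / (F.T @ F @ G + eps)
        F *= (Y @ G.T) / (F @ G @ G.T + eps)
    return F, G

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Y = rng.random((64, 3)) @ rng.random((3, 100))  # synthetic rank-3 "spectrogram"
    F, G = nmf(Y, K=3)
    print("relative error:", np.linalg.norm(Y - F @ G) / np.linalg.norm(Y))
```

In practice 𝒀 would be an amplitude or power spectrogram obtained by STFT, and 𝐾 would be chosen by the user.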
7. Penalized Supervised NMF (PSNMF)
• In PSNMF, the decomposition 𝒀 ≈ 𝑭𝑮 + 𝑯𝑼 is addressed under
the condition that 𝑭 is known in advance. [Yagi, et al., 2012]
Training process: the supervised bases 𝑭 of the target sound are trained from the supervision sound.
Separation process: fix the trained bases 𝑭 and update 𝑮, 𝑯, and 𝑼.
𝑯 is forced to become uncorrelated with 𝑭.
Problem of PSNMF: When the signal includes many sources,
the extraction performance markedly degrades.
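The separation process of PSNMF can be sketched as follows. This is not the update rule of Yagi et al.; it is a simplified variant that assumes a squared Euclidean cost plus a penalty mu * ||F^T H||^2, where the function names, `n_other`, and `mu` are all hypothetical.

```python
import numpy as np

def psnmf_cost(Y, F, G, H, U, mu):
    """Assumed cost: ||Y - FG - HU||_F^2 + mu * ||F^T H||_F^2."""
    return np.linalg.norm(Y - F @ G - H @ U) ** 2 + mu * np.linalg.norm(F.T @ H) ** 2

def psnmf_separate(Y, F, n_other, mu=1.0, n_iter=200, seed=0, eps=1e-12):
    """Fit Y ~= F G + H U with the trained bases F kept fixed.

    n_other is the (assumed) number of bases in H for the non-target sounds.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = Y.shape
    G = rng.random((F.shape[1], n_frames)) + eps
    H = rng.random((n_bins, n_other)) + eps
    U = rng.random((n_other, n_frames)) + eps
    for _ in range(n_iter):
        X = F @ G + H @ U
        G *= (F.T @ Y) / (F.T @ X + eps)
        X = F @ G + H @ U
        # The penalty gradient mu * F F^T H goes into the denominator,
        # which pushes H away from the span of the target bases F.
        H *= (Y @ U.T) / (X @ U.T + mu * (F @ (F.T @ H)) + eps)
        X = F @ G + H @ U
        U *= (H.T @ Y) / (H.T @ X + eps)
    return G, H, U

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.random((32, 4))                                   # stands in for trained bases
    Y = F @ rng.random((4, 50)) + rng.random((32, 3)) @ rng.random((3, 50))
    G, H, U = psnmf_separate(Y, F, n_other=3)
    target_estimate = F @ G                                   # extracted target spectrogram
    print(target_estimate.shape)
```

The target signal is then taken as the product of 𝑭 and 𝑮, while 𝑯𝑼 absorbs the remaining sounds.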
9. Directional Clustering
• Directional clustering can estimate sources and their directions
in a multichannel signal. [Araki, et al., 2007] [Miyabe, et al., 2009]
• This method can separate sources using the spatial information in
an observed signal.
[Figure: scatter plot of source components in the plane of L-ch input signal versus R-ch input signal, with a centroid vector for each cluster.]
Problem of directional clustering:
This method cannot separate sources in the same direction.
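The idea of directional clustering can be sketched as a k-means-style clustering of normalized (L-ch, R-ch) amplitude vectors, one vector per time-frequency bin. The cited methods differ in their details; the function name, the cosine-similarity assignment, and the initialization by evenly spaced angles are assumptions made for illustration.

```python
import numpy as np

def directional_clustering(L_mag, R_mag, n_clusters=3, n_iter=20, eps=1e-12):
    """Assign each time-frequency bin to a direction cluster.

    L_mag, R_mag: amplitude spectrograms of the two channels (same shape).
    Returns integer labels with the input shape and the unit centroid vectors.
    With n_clusters=3 the initial centroids are left (0), center (1), right (2).
    """
    X = np.stack([L_mag.ravel(), R_mag.ravel()], axis=1)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)  # keep direction only
    theta = np.linspace(0.0, np.pi / 2, n_clusters)
    C = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # initial centroids
    for _ in range(n_iter):
        labels = np.argmax(X @ C.T, axis=1)                   # nearest centroid by cosine
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:
                m = members.mean(axis=0)
                C[c] = m / (np.linalg.norm(m) + eps)
    labels = np.argmax(X @ C.T, axis=1)  # silent bins (all-zero vectors) fall into cluster 0
    return labels.reshape(L_mag.shape), C

if __name__ == "__main__":
    L_mag = np.array([[1.0, 1.0, 0.0], [0.9, 0.1, 0.5]])
    R_mag = np.array([[0.0, 1.0, 1.0], [0.1, 0.9, 0.5]])
    labels, centroids = directional_clustering(L_mag, R_mag)
    print(labels)
```

Because every bin receives exactly one label, two sources that share a direction end up in the same cluster, which is the limitation stated above.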
11. Hybrid method
• The conventional hybrid method applies PSNMF after
directional clustering. [Iwao, et al., 2012]
• This method consists of two techniques.
– Directional clustering
– PSNMF
[Figure: conventional hybrid method. Input (L, R) → directional clustering (spatial separation) → PSNMF (source separation).]
12. Problem of hybrid method
• The signal extracted by the hybrid method suffers from the
generation of considerable distortion due to the binary
masking in directional clustering.
• The signal in the target direction, which is obtained by
directional clustering, has many spectral chasms.
• The resolution of the spectrogram is degraded.
[Figure: directional clustering as binary masking. Input spectrogram ∘ binary mask = separated cluster, where ∘ is the Hadamard product (product of each element). Each panel is a time-frequency grid; the mask is 1 for bins of the target direction and 0 for bins of other directions.]
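The binary masking step can be written as a Hadamard product between the input spectrogram and a 0/1 matrix. The small example below is hypothetical; `apply_binary_mask` and the label values are illustrative names, not part of the original method.

```python
import numpy as np

def apply_binary_mask(spectrogram, labels, target_label):
    """Keep only the bins assigned to the target direction.

    Returns the separated cluster and the binary mask (index matrix):
    1 = observed bin, 0 = chasm.
    """
    index = (labels == target_label).astype(float)
    cluster = index * spectrogram  # Hadamard (element-wise) product
    return cluster, index

if __name__ == "__main__":
    spectrogram = np.array([[1.0, 2.0, 3.0],
                            [4.0, 5.0, 6.0]])
    labels = np.array([[1, 0, 1],      # 1 = target direction, other values = other directions
                       [2, 1, 1]])
    cluster, index = apply_binary_mask(spectrogram, labels, target_label=1)
    print(cluster)
    print("chasms:", int((index == 0).sum()))
```

Every zero in the mask removes a bin outright, including any target energy in that bin, which is how the spectral chasms arise.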
14. Proposed hybrid method
[Figure: signal flows of the two hybrid methods.
Conventional hybrid method: input stereo signal (L-ch, R-ch) → STFT → directional clustering → center component (L-ch, R-ch) → PSNMF on each channel → ISTFT on each channel → mixing → extracted signal.
Proposed hybrid method: the same flow, except that each PSNMF is replaced by superresolution-based SNMF, which also receives the index of the center cluster from directional clustering.]
Employ a new supervised NMF algorithm as an alternative
to the conventional PSNMF in the hybrid method.
15. Regularized superresolution-based NMF
• In the proposed supervised NMF, the spectral chasms are treated
as unseen observations using an index matrix.
[Figure: the separated cluster (time × frequency) contains chasms. A binary index matrix of the same size marks them with zeros, and the chasms are treated as unseen observations.]
16. Regularized superresolution-based NMF
• The spectrogram of the target sound is reconstructed using
more matched bases because chasms are treated as unseen.
• The components of the target sound lost after directional
clustering can be extrapolated using supervised bases.
[Figure: separated cluster with chasms (time × frequency) → superresolution using supervised bases → reconstructed spectrogram.]
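The extrapolation idea can be illustrated with a deliberately simplified sketch: only the supervised bases 𝑭 are used (no bases for other sources, no penalty, no regularization), the cost is an assumed squared Euclidean distance evaluated only where the index matrix is 1, and the product 𝑭𝑮 is then read out at every bin, including the chasms. The function name and all parameters are hypothetical.

```python
import numpy as np

def masked_supervised_nmf(Y, F, index, n_iter=500, seed=0, eps=1e-12):
    """Estimate activations G from the observed bins only (index == 1).

    Assumed cost: || index * (Y - F G) ||_F^2 with F fixed.
    """
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], Y.shape[1])) + eps
    IY = index * Y
    for _ in range(n_iter):
        # Chasm bins are multiplied by zero, so they never influence G.
        G *= (F.T @ IY) / (F.T @ (index * (F @ G)) + eps)
    return G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.random((30, 3))                       # stands in for supervised bases
    target = F @ (rng.random((3, 40)) + 0.1)      # synthetic target spectrogram
    index = (rng.random(target.shape) < 0.6).astype(float)
    cluster = index * target                      # separated cluster with chasms
    restored = F @ masked_supervised_nmf(cluster, F, index)
    chasm = index == 0
    err = np.linalg.norm((restored - target)[chasm]) / np.linalg.norm(target[chasm])
    print("relative error inside chasms:", err)
```

In this toy setting the target is exactly representable by 𝑭, so the chasm values are recovered from the observed bins of the same frame; a frame with very few observed bins leaves 𝑮 poorly determined.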
17. Regularized superresolution-based NMF
• Signal flow of the proposed hybrid method
[Figure, panels (a) to (c): frequency of source components versus direction (left, center, right).
(a) Observed spectra: the target source lies in the center direction.
(b) After directional clustering: only the target direction is kept, and the center sources lose some of their components.
(c) After superresolution-based SNMF: the target source is extrapolated.]
21. Regularized superresolution-based NMF
• The basis extrapolation includes an underlying problem.
• If the time-frequency spectra of a frame are almost unseen in the
spectrogram, that is, almost all of its index entries are zero, a
large extrapolation error may occur.
• It is necessary to regularize the extrapolation.
[Figure: spectrogram (time [s] versus frequency [kHz]) of a separated cluster with an almost unseen frame, where an extrapolation error occurs (the activation is incorrectly modified).]
22. Regularized superresolution-based NMF
• We propose two types of regularizations.
• Regularization of the temporal continuity (with respect to the previous frame)
• Regularization of the norm minimization
𝑰: index matrix; i_{ω,t}: entry of the index matrix 𝑰; an overbar denotes the binary complement
g_{k,t}: entry of matrix 𝑮; f_{ω,k}: entry of matrix 𝑭
The intensity of these regularizations is proportional to the
number of chasms in each frame.
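The two regularizers can be illustrated with the sketch below. The exact formulas are not reproduced on this slide, so both penalties are assumed forms that only follow the verbal description: a per-frame weight equal to the number of chasms, a squared difference from the previous frame, and the energy of the reconstruction inside the chasms. All function names are hypothetical.

```python
import numpy as np

def chasm_counts(index):
    """Number of chasms (zeros of the binary index matrix) in each frame."""
    return (1 - index).sum(axis=0)

def temporal_continuity_penalty(G, index):
    """Assumed form: sum_t (#chasms in frame t) * sum_k (g[k,t] - g[k,t-1])^2."""
    weights = chasm_counts(index)[1:]            # frames that have a previous frame
    diff = G[:, 1:] - G[:, :-1]
    return float((weights * (diff ** 2).sum(axis=0)).sum())

def norm_penalty(F, G, index):
    """Assumed form: squared norm of F G restricted to the chasm bins."""
    return float((((1 - index) * (F @ G)) ** 2).sum())

if __name__ == "__main__":
    index = np.array([[1, 0],
                      [1, 0]])                   # second frame is entirely chasms
    F = np.eye(2)
    G = np.array([[1.0, 2.0],
                  [1.0, 3.0]])
    print(chasm_counts(index))
    print(temporal_continuity_penalty(G, index))
    print(norm_penalty(F, G, index))
```

Both penalties vanish for frames without chasms and grow with the number of chasms, so fully observed frames are left to the data term alone.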
23. Regularized superresolution-based NMF
• The cost function in regularized superresolution-based NMF is
defined using the index matrix as
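R_n: Regularization term (n = 1: temporal continuity, n = 2: norm minimization)
|𝑭ᵀ𝑯|²: Penalty term to force 𝑭 and 𝑯 to
become uncorrelated with each other
The regularization and penalty terms are scaled by weighting parameters.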
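A cost function with the structure described on this slide could be written as follows. This is a sketch under assumptions: the data term is taken to be a squared Frobenius norm restricted by the index matrix, and the weighting parameters are written as \lambda and \mu, symbols chosen here for illustration rather than taken from the slide.

```latex
\mathcal{J}
  = \left\| \boldsymbol{I} \circ
      \left( \boldsymbol{Y} - \boldsymbol{F}\boldsymbol{G} - \boldsymbol{H}\boldsymbol{U} \right)
    \right\|_{\mathrm{Fr}}^{2}
  + \lambda \, R_{n}
  + \mu \left\| \boldsymbol{F}^{\mathsf{T}} \boldsymbol{H} \right\|_{\mathrm{Fr}}^{2}
```

Here \circ is the Hadamard product, so bins whose index entry is zero (the chasms) contribute nothing to the data term, R_n is the chosen regularization term, and the last term penalizes correlation between the supervised bases 𝑭 and the other bases 𝑯.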
26. Evaluation experiment
• We compared four methods.
– Conventional hybrid method using PSNMF (Conventional method)
– Proposed hybrid method using superresolution-based NMF without
regularization (Proposed method 1)
– Proposed hybrid method using superresolution-based NMF with
regularization of the temporal continuity (Proposed method 2)
– Proposed hybrid method using superresolution-based NMF with
regularization of the norm minimization (Proposed method 3)
[Figure: signal flows of the conventional hybrid method (PSNMF) and the proposed hybrid method (superresolution-based SNMF with the index of the center cluster), as on slide 14.]
27. Evaluation experiment
• We used stereo-panning signals ( ) and binaural-
recorded signals ( ) containing four instruments, oboe (Ob.),
flute (Fl.), trombone (Tb.), and piano (Pf.), generated by a MIDI synthesizer.
• The sources are mixed with equal power.
• The target source is always located in the center direction (No. 1).
• We used the same type of MIDI sounds as the target
instruments as the supervision sound for the training process.
[Figure: source positions Nos. 1 to 4 arranged between left and right; the target source (No. 1) is in the center direction.]
Supervision sound: two octaves of notes that cover all notes of the target signal.
28. Experimental results (panning signal)
• Average SDR, SIR, and SAR scores for each method, where the 4
instruments are shuffled with 12 combinations.
[Figure: bar charts of SDR [dB], SIR [dB], and SAR [dB] for the conventional method and Proposed methods 1 to 3; higher is better.]
SDR: quality of the separated target sound
SIR: degree of separation between the target and other sounds
SAR: absence of artificial distortion
Proposed method 1: no regularization
Proposed method 2: regularization of temporal continuity
Proposed method 3: regularization of norm minimization
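The three scores can be sketched in the style of BSS Eval, where the estimate is split into a target part, an interference part, and an artifact part. The slide does not say which tool or variant was used; this version assumes time-invariant gains (plain projections, no filters), and `bss_scores` is a hypothetical name.

```python
import numpy as np

def bss_scores(estimate, sources, target=0, eps=1e-12):
    """Return (SDR, SIR, SAR) in dB for one estimated signal.

    estimate: 1-D array; sources: array of shape (n_sources, n_samples)
    holding the true source signals. Assumes time-invariant gains.
    """
    s = sources[target]
    s_target = (estimate @ s) / (s @ s + eps) * s          # projection onto the target
    coeffs, *_ = np.linalg.lstsq(sources.T, estimate, rcond=None)
    projection = sources.T @ coeffs                        # projection onto all sources
    e_interf = projection - s_target                       # leakage from other sources
    e_artif = estimate - projection                        # everything else: artifacts

    def db(num, den):
        return 10.0 * np.log10((num + eps) / (den + eps))

    def energy(x):
        return float(x @ x)

    sdr = db(energy(s_target), energy(e_interf + e_artif))
    sir = db(energy(s_target), energy(e_interf))
    sar = db(energy(s_target + e_interf), energy(e_artif))
    return sdr, sir, sar

if __name__ == "__main__":
    sources = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0, 0.0]])
    estimate = sources[0] + 0.1 * sources[1] + np.array([0.0, 0.0, 0.01, 0.0])
    print(bss_scores(estimate, sources))
```

SIR only measures leakage from the other sources and SAR only measures artifacts, while SDR accounts for both, which is why a method can score a high SIR and still have a low SDR.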
29. Experimental results (binaural signal)
• Average SDR, SIR, and SAR scores for each method, where the 4
instruments are shuffled with 12 combinations.
[Figure: bar charts of SDR [dB], SIR [dB], and SAR [dB] for the conventional method and Proposed methods 1 to 3; higher is better.]
30. Conclusions
• We proposed a new superresolution-based supervised NMF
algorithm for the hybrid method to separate stereo or
binaural signals.
• The proposed hybrid method separates the target signal
with higher performance than the conventional method.
• The regularization of norm minimization is effective for the
proposed supervised NMF algorithm.
Thank you for your attention!
Editor's Notes
Thank you, chair.
Good afternoon everyone, // I’m Daichi Kitamura from Nara Institute of Science and Technology, Japan.
Today // I’d like to talk about binaural signal separation / using regularized superresolution-based nonnegative matrix factorization.
This is the outline of my talk.
First, // I will talk about the research background.
Recently, // music signal separation technologies have received much attention.
These technologies can be used / to control each source in a music signal / in 3D audio systems.
Music signal separation based on nonnegative matrix factorization, // NMF in short, // has been a very active area of research.
NMF can extract the target signal to some extent, // especially when the number of instruments is small.
However, // for mixtures of many sources, / as in more realistic musical tunes, / the extraction performance markedly degrades.
To solve this problem, // we propose a new method for multichannel signal separation / with NMF utilizing both spectral and spatial cues / included in mixtures of multiple instruments.
Next, // I will talk about conventional methods.
NMF is a type of sparse representation algorithm // that decomposes a nonnegative matrix / into two nonnegative matrices like this.
Here, Y is an observed spectrogram.
F is a nonnegative matrix / that contains the spectral patterns of the observed signal as column vectors, //
and G is a nonnegative matrix / that corresponds to the activation of each spectral pattern.
Penalized supervised NMF, / PSNMF in short, / has been proposed by Yagi and others.
In PSNMF, // an observed matrix is decomposed like this.
Here, F contains the bases trained / using the target supervision sound in the training process.
So, the target signal is extracted as the product of F and G.
In addition, // to prevent the simultaneous generation / of similar spectral patterns in the matrices F and H, // a specific penalty is imposed between F and H.
This method uses spectral cues for the separation.
However, // PSNMF has a problem.
When the input signal includes many instrumental sources, // the extraction performance markedly degrades because several similar bases arise in both the target and the other instruments.
Next, // I will explain the directional clustering method.
Directional clustering can estimate sources and their directions in a multichannel signal.
This method can separate sources with spatial information in an observed signal.
However, this method cannot separate sources in the same direction, like this.
To solve these problems, / a hybrid method that applies PSNMF after directional clustering / has been proposed.
This method consists of two techniques.
First, / directional clustering is applied to the input signal / to separate the target direction.
However, / directional clustering cannot separate the sources in the same direction.
So, / we added PSNMF after the directional clustering / to separate the target source.
(This method uses a suitable decomposition / for each separation problem, i.e., this hybrid method is a divide-and-conquer method.)
But / there is also a problem with the hybrid method.
The signal extracted by the hybrid method / suffers from the generation of considerable distortion / due to the binary masking in directional clustering.
So, / the separated cluster / has many spectral chasms.
In other words, the resolution of the spectrogram is degraded.
Next, // I will talk about the proposed method.
In the proposed method, / we employ a new supervised NMF algorithm / as an alternative to the conventional PSNMF in the hybrid method.
This is an example of the spectrum in one frame.
There are many spectral chasms.
And, this matrix is the index matrix of the separated cluster.
Index entries of zero indicate the grid points of chasms in the spectrogram.
In the proposed supervised NMF, / the spectral chasms are treated as unseen observations / using this index matrix, like this.
Therefore, / supervised NMF is applied only to the valid observed components, / not to unseen observations like these chasms.
(Directional clustering is hard clustering, that is, binary masking. The index matrix of directional clustering is obtained from the separated results.
So, we know where the chasms are. Ones mean observations, and zeros mean unseen observations.)
In addition, / the spectrogram of the target sound is reconstructed / using more matched bases / in the proposed NMF.
The components of the target sound lost after directional clustering / can be extrapolated using supervised bases.
In other words, / the resolution of the target spectrogram / is recovered with the superresolution / by the supervised basis extrapolation.
(pointing (a)) This is the directional source distribution of the observed stereo signal.
The target source is in the center direction, / and other sources are distributed like this.
Directional clustering is binary masking in the time-frequency domain.
So, / the separated cluster is obtained like this.
Left and right source components / leak into the center cluster, // and center sources lose some of their components.
These lost components / correspond to the spectral chasms in the time-frequency domain.
Then, after the directional clustering,
we apply the superresolution-based NMF.
This NMF separates the target source / and reconstructs lost components with basis extrapolation using supervised bases.
However, / this basis extrapolation includes an underlying problem.
If the time-frequency spectra are almost unseen in the spectrogram, / a large extrapolation error may occur.
So, it is necessary to regularize / this extrapolation.
We propose two types of regularizations.
The first one / uses temporal continuity / with the previous frame in the spectrogram.
The second one, / norm minimization, / is based on the assumption that // a frame / that has many spectral chasms / intrinsically does not contain many target components.
Here, I bar denotes the binary complement of the index matrix.
So, / I bar represents the grid of chasms.
Therefore, the intensity of these regularizations is proportional to the number of chasms in each frame.
The cost function in regularized superresolution-based NMF / is defined like this.
Here, / Rn is the regularization term, and n represents the type of regularization.
n equals one / corresponds to the regularization of temporal continuity.
And n equals two / corresponds to the norm minimization.
In addition, this (pointing |FtH|^2) term is a penalty term / that forces F and H / to become uncorrelated with each other to avoid sharing the same basis.
The update rules that minimize the cost function are obtained like this.
Then, // I will talk about the experiments.
In the experiment, we compared 4 methods, / namely, conventional hybrid method using PSNMF, / proposed hybrid method using superresolution-based NMF without regularization, / and proposed hybrid method with two types of the regularizations.
And, we used stereo-panning and binaural-recorded signals / containing 4 instruments, namely, oboe, flute, trombone, and piano, / generated by MIDI.
These sources are mixed with the same power, / and the target source is always located in the center.
No.1 is the target source / and Nos.2,3,4 are the other sources.
In addition, / we used the same type of MIDI sounds of the target instruments / as the supervision sound / like this (pointing supervision score).
This supervision sound consists of two octaves of notes that cover all notes of the target signal.
These results are the averages of the evaluation scores / for the stereo-panning signal.
Here, / SDR indicates the quality of the separated target sound, / SIR indicates the degree of separation / between the target and other sounds, / and SAR indicates the absence of artificial distortion.
From these results, Proposed method 3, / superresolution-based NMF with norm minimization, / outperforms all other methods.
And, this is the result for the binaural signal.
Similar to the results for the panning signal, / Proposed method 3 achieved the highest scores.
The SIR of the conventional method was high, / but the quality of the separated signal is not good because of the spectral chasms.
Also, Proposed method 1 risks / causing an extrapolation error.
From the SAR results, the proposed regularizations can avoid such errors, / and norm minimization is better for the hybrid method overall.
(This is because / the norm minimization suppresses residual components of the other sources. This phenomenon is a side effect / of the regularization.)
These are the conclusions of my talk.
Thank you for your attention.
After directional clustering, / the separated cluster loses some of its components.
And after superresolution-based NMF, the target components are restored using supervised bases.
If the number of sources in the same direction as the target instrument increases, the separation performance of supervised NMF markedly degrades.
This is because several similar bases arise in both the target and the other instruments.
If the left and right sources are close to the center direction, the separation becomes difficult, because directional clustering cannot separate them well.
In addition, basis extrapolation also becomes difficult because the number of chasms in the separated cluster / increases in this case.
In contrast, if theta becomes larger, the separation becomes easier.
This is a signal flow of the proposed hybrid method.
In our experiment, superresolution-based supervised NMF is applied to only the center direction because the target source is located in the center direction.
However, if the target source is located on the left or right side, we should apply this NMF to the direction that contains the target source, whether or not there are other sources in that direction.
SDR is the overall evaluation score of the separation performance.
Penalized supervised NMF, / PSNMF in short, / has been proposed by Yagi and others.
In PSNMF, // an observed matrix is decomposed like this.
Here, F is a nonnegative matrix / that contains the target sound bases as column vectors.
G is an activation matrix / that corresponds to F, // and H and U are nonnegative matrices.
So, the target signal is extracted as the product of F and G.
In addition, // to prevent the simultaneous formation / of similar spectral patterns in the matrices F and H, // a specific penalty is imposed between F and H.
However, // PSNMF has a problem.
When the input signal includes many instrumental sources, // the extraction performance markedly degrades
(because several similar bases arise in both the target and the other instruments).