Daichi Kitamura, Nobutaka Ono, and Hiroshi Saruwatari, "Experimental analysis of optimal window length for independent low-rank matrix analysis," Proceedings of The 2017 European Signal Processing Conference (EUSIPCO 2017), pp. 1210–1214, Kos, Greece, August 2017 (Invited Special Session).
Presented at 25th European Signal Processing Conference (EUSIPCO) 2017, "SS14: Multivariate Analysis for Audio Signal Source Enhancement," 14:30-16:10, August 30, 2017.
Experimental analysis of optimal window length for independent low-rank matrix analysis
1. Experimental analysis of optimal window length for independent low-rank matrix analysis
Daichi Kitamura (The University of Tokyo, Japan)
Nobutaka Ono (National Institute of Informatics, Japan)
Hiroshi Saruwatari (The University of Tokyo, Japan)
25th European Signal Processing Conference (EUSIPCO) 2017
SS14: Multivariate Analysis for Audio Signal Source Enhancement
August 30, 14:30-16:10
2. Contents
• Background
– Blind source separation (BSS) for audio signals
– Motivation: fundamental limitation in frequency-domain BSS
• Methods
– Frequency-domain independent component analysis (FDICA)
– Independent vector analysis (IVA)
– Independent low-rank matrix analysis (ILRMA)
• Experimental analysis
– Optimal window length
• Music signals and speech signals
• Ideal case and more practical case
• Conclusion
4. Background
• Blind source separation (BSS) for audio signals
– separates original audio sources
– does not require prior information of recording conditions
• locations of mics and sources, room geometry, timbres, etc.
– is applicable to many audio applications
• Consider only the “determined” situation
– # of mics = # of sources
[Figure: sources pass through the mixing system to give the observed recording mixture; BSS applies a demixing system to obtain the estimated signals, e.g., a separated guitar]
5. History of BSS for audio signals
• Basic theories and their evolution (*depicting only popular methods)
– 1994: Independent component analysis (ICA)
– 1998: Frequency-domain ICA (FDICA)
– 1999: Nonnegative matrix factorization (NMF)
– 2006: Independent vector analysis (IVA)
– 2009: Itakura–Saito NMF (ISNMF)
– 2011: Auxiliary-function-based IVA (AuxIVA)
– 2012: Time-varying Gaussian IVA
– 2013: Multichannel NMF
– 2016: Independent low-rank matrix analysis (ILRMA)
• Intermediate developments noted on the timeline
– Many permutation solvers for FDICA
– Apply NMF to many tasks
– Generative models in NMF
– Many extensions of NMF
7. Motivation: fundamental limitation of BSS
• Mixing assumption in frequency-domain BSS
– x_ij = A_i s_ij (x_ij: observed multichannel signal, A_i: frequency-wise mixing matrix, s_ij: source signals, i: frequency bins, j: time frames)
– “Linear time-invariant mixture” or “rank-1 spatial model”
– Valid only when the window length used in STFT is much longer than the length of room reverberation
• Too long window also causes another problem
– Number of time frames (samples) decreases
– Statistical bias will increase and estimation becomes unstable
• Trade-off between short and long window [S. Araki+, 2003]
– FDICA suffers from the trade-off
– What about BSS methods with a structural source model?
• IVA and ILRMA
[Figure: performance versus window length, with a peak at the optimal length]
9. BSS methods: FDICA and IVA
• Sources are mutually independent and obey a non-Gaussian distribution; the mixture is close to a Gaussian signal because of the central limit theorem (CLT)
• Frequency-domain ICA (FDICA) [P. Smaragdis, 1998]
– Scalar r.v.s: a non-Gaussian source distribution is assumed in each frequency bin
– Update separation filter (demixing matrix) so that the estimated signals obey the non-Gaussian distribution we assumed
• Independent vector analysis (IVA) [A. Hiroe, 2006], [T. Kim, 2006]
– Vector (multivariate) r.v.s: a non-Gaussian spherical source distribution is assumed over all frequency bins
– Update separation filter (demixing matrix) so that the estimated signals obey the non-Gaussian distribution we assumed
[Figure: for each method, the observed signal is transformed by the STFT (frequency vs. time), the demixing matrix gives the estimated signals, and the current empirical distribution is compared with the assumed source distribution]
10. Extension of source distribution in IVA
• Spherical Laplace distribution in IVA
– Frequency vector (I-dimensional)
– Frequency-uniform scale
• Zero-mean complex Gaussian distribution with TF-varying variance (Itakura-Saito NMF) [C. Févotte+, 2009]
– Time-frequency matrix (IJ-dimensional)
– Zero-mean complex Gaussian in each TF bin
– Time-frequency-varying variance
– Low-rank decomposition with NMF
• Extended to a more flexible model
[Figure: spherical Laplace distribution (bivariate)]
11. Generative source model in ISNMF
• Power spectrogram corresponds to variances in TF plane
– Complex Gaussian distribution with TF-varying variance
– If we marginalize in terms of time or frequency, the distribution becomes non-Gaussian even though each TF grid is defined in Gaussian distribution
[Figure: power spectrogram (frequency bin vs. time frame); grayscale shows the value of variance, from small to large value of power]
12. BSS methods: ILRMA
• Independent low-rank matrix analysis (ILRMA) [D. Kitamura+, 2016]
– Unification of IVA and ISNMF
– Source model in ILRMA
• Low-rank decomposition of each source's time-frequency matrix using bases
• Number of bases can be set to arbitrary value
– Update demixing matrix so that estimated signals have low-rank structure in time-frequency domain
[Figure: the observed signal is transformed by the STFT (frequency vs. time), and the demixing matrix gives the estimated signals]
13. Comparison of source models
• FDICA source model: non-Gaussian scalar variable
• IVA source model: non-Gaussian vector variable with higher-order correlation
• ILRMA source model: non-Gaussian matrix variable with low-rank time-frequency structure
– Rank of TF matrix of mixture > rank of TF matrix of each source
15. Experimental analysis
• Window length in STFT
– If window length is too short
• Mixing assumption does not hold anymore
– If window length is too long
• Estimation becomes unstable (# of time frames decreases)
• Our expectation
– Full time-frequency modeling of sources in ILRMA may improve the robustness to a decrease in the number of time frames
[Figure: STFT; the waveform is cut by a window function of a given window length (= DFT length) at every shift length, and the DFT of each segment forms the spectrogram (frequency vs. time)]
16. Experimental analysis
• Dataset: 4 music and 4 speech from SiSEC [S. Araki+, 2012]
Signal | Data name | Source (1/2) | Length [s]
Music | bearlin-roads | acoustic_guit_main/vocals | 14.6
Music | another_dreamer-the_ones_we_love | guitar/vocals | 25.6
Music | fort_minor-remember_the_name | violins_synth/vocals | 24.6
Music | ultimate_nz_tour | guitar/synth | 18.6
Speech | dev1_female4 | src_1/src_2 | 10.0
Speech | dev1_female4 | src_3/src_4 | 10.0
Speech | dev1_male4 | src_1/src_2 | 10.0
Speech | dev1_male4 | src_3/src_4 | 10.0
• Mixing: convolution with RIR in RWCP [S. Nakamura+, 2000]
– Impulse response E2A (reverberation time: T60 = 300 ms): sources 1 and 2 at 2 m, microphone spacing 5.66 cm, source directions marked 50 and 50
– Impulse response JR2 (reverberation time: T60 = 470 ms): sources 1 and 2 at 2 m, microphone spacing 5.66 cm, source directions marked 60 and 60
17. Experimental analysis
• Compared methods
– FDICA+IPS (ideal permutation solver)
• Align permutation of estimated components using the reference
(oracle) source spectrogram (upper limit performance of FDICA)
– FDICA+DOA (DOA-based permutation solver) [S. Kurita+, 2000]
• Align permutation of estimated components using DOA after FDICA
– IVA [N. Ono, 2011]
• using auxiliary function method (a.k.a. MM algorithm) in optimization
– ILRMA [D. Kitamura+, 2016]
• with several numbers of bases
• Other conditions
– Window function: Hamming window
– Window length: 32 ~ 2048 ms
– Shift length: Always quarter of window length
18. Comparison using ideal initialization: condition
• Set initial value of demixing matrix to oracle
– This initial value provides the best separation performance under the mixing assumption x = As
• Set initial value of source model (NMF variables T and V) to oracle values based on the power spectrogram of each source (only for ILRMA)
• FDICA+DOA & IVA: spatial oracle initialization
• FDICA+IPS & ILRMA: spatial and spectral oracle initialization
19. Comparison using ideal initialization: results
[Figure: separation performance versus window length, four panels: music (T60 = 0.30 s), music (T60 = 0.47 s), speech (T60 = 0.30 s), speech (T60 = 0.47 s)]
20. Comparison using random initialization: condition
• Set initial value of demixing matrix to identity matrix
• Set initial value of source model to uniform random values in [0, 1] (only for ILRMA)
• FDICA+DOA, IVA, & ILRMA: fully blind method
• FDICA+IPS: using oracle spectrogram
21. Comparison using random initialization: results
[Figure: separation performance versus window length, four panels: music (T60 = 0.30 s), music (T60 = 0.47 s), speech (T60 = 0.30 s), speech (T60 = 0.47 s)]
22. Conclusion
• In the case of ILRMA with oracle initialization, the
robustness to long windows (fewer time frames) can
be improved
– optimal window length is longer than that in FDICA or IVA
– thanks to employing not only the independence between
sources but also a full modeling of time-frequency structure
for the estimation of the demixing matrix
• In a practical situation (fully blind case),
– optimal window length is similar to that in FDICA or IVA
– difficulty of the blind estimation of a precise spectral model
in ILRMA
Thank you for your attention!
Editor's Notes
This talk deals with blind source separation (BSS), a technique for separating individual sources from a recorded mixture.
The word “blind” means that the method does not require any prior information about the recording conditions, such as the locations of the microphones and sources or the room geometry.
This kind of technique is very useful for many audio applications as a front-end system.
In this talk, we only consider a “determined” situation, namely, the number of microphones is always equal to the number of sources.
This slide shows the history of basic theories in audio BSS.
For acoustic signals, independent component analysis (ICA) was applied to frequency-domain signals as FDICA. After that, many permutation solvers for FDICA were proposed, and eventually an elegant solution, independent vector analysis (IVA), was proposed. It is still being extended to more flexible models.
On the other hand, nonnegative matrix factorization (NMF) was also developed and extended to multichannel signals for source separation problems.
Recently, we developed a new framework called independent low-rank matrix analysis (ILRMA), which unifies these two powerful theories.
I will explain the details later, but in this talk we focus only on these algorithms: FDICA, IVA, and ILRMA.
Here I explain the motivation of this talk.
In many frequency-domain BSS techniques, this equation, x=As, is always assumed, where x is the multichannel mixture signal in the frequency domain, i is the frequency bin index, j is the time frame index, A is the frequency-wise mixing matrix, and s is the original source signal.
This is often called a “linear time-invariant mixture” or a “rank-1 spatial model,” and this assumption is valid only when the window length in the STFT is much longer than the length of the room reverberation. So, we must use a long window in the STFT for this mixing assumption to hold.
However, if we use too long a window in the STFT, the statistical bias will increase and the estimation becomes unstable. This is because the number of time frames J decreases.
Therefore, there is a trade-off between short and long window lengths, as in this figure. In the 2003 paper, this trade-off was revealed only for FDICA. But we do not know about this issue for BSS methods that employ a structural source model, namely IVA and ILRMA.
So, in this talk, we experimentally investigate the optimal window length for the newer BSS techniques, including ILRMA.
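To make the trade-off concrete, here is a rough sketch that counts frequency bins I and time frames J for a 10-second signal with a quarter-window shift. The 16 kHz sampling rate, the function name `stft_shape`, and the absence of padding are assumptions for illustration only; they are not stated in the talk.

```python
def stft_shape(duration_s, window_ms, fs=16000, shift_ratio=0.25):
    """Return (frequency bins I, time frames J) for an unpadded STFT.

    fs = 16 kHz is an assumed sampling rate, not a value from the talk.
    """
    n_samples = int(round(duration_s * fs))
    win = int(round(window_ms * fs / 1000))  # window length = DFT length
    hop = int(round(win * shift_ratio))      # shift = quarter of the window
    n_frames = 1 + (n_samples - win) // hop  # frames that fit without padding
    n_bins = win // 2 + 1                    # non-negative frequency bins
    return n_bins, n_frames


for window_ms in (32, 128, 512, 2048):
    n_bins, n_frames = stft_shape(10.0, window_ms)
    print(f"{window_ms:5d} ms window: I = {n_bins:6d}, J = {n_frames:5d}")
```

Under these assumptions, a long window leaves only a handful of frames per frequency bin while the number of demixing matrices to estimate grows, which is the instability described above.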
I briefly explain the separation mechanism of FDICA and IVA.
In FDICA, ICA is applied to each frequency bin, treating the scalar time-series signal as a random variable, and we maximize its non-Gaussianity to estimate the frequency-wise demixing matrix.
In IVA, we consider a vector time-series random variable that includes all frequencies, as in this figure, and we assume a multivariate non-Gaussian distribution with a spherical property. Since the spherical property ensures higher-order correlation among frequency bins, the permutation problem can be avoided in IVA.
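The effect of the spherical property can be illustrated with a small numerical sketch. It compares a bin-wise Laplace-type cost (sum of magnitudes, as one might use in FDICA) with a spherical cost (norm across frequency, as in IVA) on synthetic signals. The data, the helper names, and the specific cost definitions are assumptions chosen for illustration, not the implementation used in the talk.

```python
import numpy as np


def binwise_cost(Y):
    """Sum of magnitudes over all sources, bins, and frames."""
    return np.sum(np.abs(Y))


def spherical_cost(Y):
    """Sum over sources and frames of the l2 norm taken across frequency.

    Y has shape (sources, frequency bins, frames).
    """
    return np.sum(np.sqrt(np.sum(np.abs(Y) ** 2, axis=1)))


rng = np.random.default_rng(0)
n_src, n_bins, n_frames = 2, 64, 200

# Synthetic sources: each source has one time envelope shared by all bins.
envelope = rng.exponential(size=(n_src, 1, n_frames))
noise = (rng.standard_normal((n_src, n_bins, n_frames))
         + 1j * rng.standard_normal((n_src, n_bins, n_frames)))
Y_aligned = envelope * noise

# Swap the two sources in the upper half of the frequency bins.
Y_permuted = Y_aligned.copy()
Y_permuted[:, n_bins // 2:] = Y_aligned[::-1, n_bins // 2:]

print("bin-wise cost :", binwise_cost(Y_aligned), binwise_cost(Y_permuted))
print("spherical cost:", spherical_cost(Y_aligned), spherical_cost(Y_permuted))
```

The bin-wise cost cannot distinguish the two orderings, whereas the spherical cost is lower when the bins of each source are grouped consistently, which is why IVA does not need a separate permutation solver.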
The spherical source distribution in IVA can be extended to a more flexible model. We have extended it to a local Gaussian model, which employs a zero-mean complex Gaussian distribution with time-frequency-varying variance.
Namely, in each time-frequency slot (i, j), a complex Gaussian distribution is defined, and its variance r can fluctuate depending on time and frequency.
This generative model is equivalent to that of Itakura-Saito NMF, and the variance r can be decomposed into a basis matrix T and an activation matrix V.
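As a rough sketch of this decomposition, the following code fits r ≈ TV to a power spectrogram by minimizing the Itakura-Saito divergence with multiplicative updates. The square-root (majorization-minimization) form of the updates is a commonly used variant and an assumption here; the function names, sizes, and synthetic data are hypothetical.

```python
import numpy as np


def is_divergence(P, R):
    """Itakura-Saito divergence between power spectrogram P and model R."""
    ratio = P / R
    return float(np.sum(ratio - np.log(ratio) - 1.0))


def isnmf(P, n_bases, n_iter=50, seed=0, eps=1e-12):
    """Fit P ~ T @ V with multiplicative updates (square-root MM variant)."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = P.shape
    T = rng.uniform(0.1, 1.0, size=(n_bins, n_bases))    # basis matrix
    V = rng.uniform(0.1, 1.0, size=(n_bases, n_frames))  # activation matrix
    for _ in range(n_iter):
        R = T @ V + eps
        T *= np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T))
        R = T @ V + eps
        V *= np.sqrt((T.T @ (P / R**2)) / (T.T @ (1.0 / R)))
    return T, V


# Synthetic rank-2 "power spectrogram" used only for this demonstration.
rng = np.random.default_rng(1)
P = rng.uniform(0.1, 1.0, (30, 2)) @ rng.uniform(0.1, 1.0, (2, 40))

T1, V1 = isnmf(P, n_bases=2, n_iter=1)
T50, V50 = isnmf(P, n_bases=2, n_iter=50)
d1 = is_divergence(P, T1 @ V1)
d50 = is_divergence(P, T50 @ V50)
print("IS divergence after 1 iteration  :", d1)
print("IS divergence after 50 iterations:", d50)
```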
This is a graphical interpretation of the source model in ISNMF.
In each time-frequency slot, a zero-mean complex Gaussian distribution is defined, and the slots are mutually independent across time, frequency, and sources.
Here, the variance of each Gaussian corresponds to the power spectrogram.
Therefore, in a slot that has strong power, such as a spectral peak, the Gaussian becomes wider, and a large power component can easily be generated.
Note that, even though each slot is Gaussian, the marginal distribution in terms of time is non-Gaussian, because the variance fluctuates.
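This point can be checked numerically with a minimal sketch. For a circular complex Gaussian, the normalized fourth moment E|z|^4 / (E|z|^2)^2 equals 2; drawing the variance of each sample from a gamma distribution (an arbitrary choice made only for this illustration) pushes the value well above 2, which indicates a heavier-tailed, non-Gaussian marginal.

```python
import numpy as np


def complex_gaussian(variance, rng):
    """Draw zero-mean circular complex Gaussian samples with given variances."""
    std = np.sqrt(variance / 2.0)
    return std * (rng.standard_normal(variance.shape)
                  + 1j * rng.standard_normal(variance.shape))


def normalized_fourth_moment(z):
    """E|z|^4 / (E|z|^2)^2; equals 2 for a circular complex Gaussian."""
    return np.mean(np.abs(z) ** 4) / np.mean(np.abs(z) ** 2) ** 2


rng = np.random.default_rng(0)
n = 200_000

# Same variance in every slot: the marginal stays Gaussian.
z_fixed = complex_gaussian(np.ones(n), rng)

# Variance fluctuates from slot to slot (gamma-distributed, an assumption).
variance = rng.gamma(shape=0.5, scale=2.0, size=n)
z_varying = complex_gaussian(variance, rng)

k_fixed = normalized_fourth_moment(z_fixed)
k_varying = normalized_fourth_moment(z_varying)
print("fixed variance  :", k_fixed)
print("varying variance:", k_varying)
```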
So, since this matrix generative model is non-Gaussian, we can use this distribution as a source model in an ICA-based method, resulting in independent low-rank matrix analysis (ILRMA). Therefore, ILRMA is a unification of IVA and ISNMF, and we employ the NMF source model to capture the low-rank time-frequency structure of each source.
This source model can improve the estimation accuracy of the demixing matrix.
This is a comparison of source models in FDICA, IVA, and ILRMA again.
The important idea used in ILRMA is that the rank of the TF matrix of the mixture signal is always greater than the rank of the TF matrix of each source before mixing.
So, if we assume not only the independence between sources but also a low-rank TF structure for each source, the separation will be done accurately.
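A toy example of the rank argument, under a simplifying assumption: each source spectrogram is an exactly rank-2 nonnegative matrix, and the mixture spectrogram is modeled as the plain sum of the two (cross terms between sources are ignored). All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_frames, n_bases = 20, 30, 2


def low_rank_spectrogram(rng):
    """Nonnegative TF matrix built from n_bases spectral bases."""
    T = rng.uniform(size=(n_bins, n_bases))
    V = rng.uniform(size=(n_bases, n_frames))
    return T @ V


source1 = low_rank_spectrogram(rng)
source2 = low_rank_spectrogram(rng)
mixture = source1 + source2  # simplified additive model of the mixture

ranks = [int(np.linalg.matrix_rank(m)) for m in (source1, source2, mixture)]
print("rank of source 1, source 2, mixture:", ranks)
```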
As I already explained, the window length in STFT affects the performance of ICA-based separation.
If we use too short a window, the mixing assumption, x=As, does not hold anymore, and if we use too long a window, the estimation becomes unstable because the number of time frames J decreases.
However, ILRMA employs a full time-frequency modeling of sources. This model may improve the robustness to a decrease in J. This is our expectation.
Let’s check this issue experimentally.
Here we used 4 music and 4 speech signals obtained from the SiSEC database, and we produced the observed signals by convolving the sources with the impulse responses shown at the bottom.
We used two types of impulse responses: one has a reverberation time of 300 ms, and the other 470 ms.
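The mixing step can be sketched as follows. The impulse responses here are synthetic decaying noise with a 300 ms decay, not the RWCP recordings, and the sources are white-noise stand-ins; the 16 kHz sampling rate and all names are assumptions for illustration.

```python
import numpy as np


def convolutive_mixture(sources, rirs):
    """Mix sources with room impulse responses.

    sources: (n_src, n_samples); rirs: (n_mics, n_src, rir_len).
    Returns an array of shape (n_mics, n_samples + rir_len - 1).
    """
    n_mics, n_src, rir_len = rirs.shape
    mix = np.zeros((n_mics, sources.shape[1] + rir_len - 1))
    for m in range(n_mics):
        for n in range(n_src):
            mix[m] += np.convolve(sources[n], rirs[m, n])
    return mix


rng = np.random.default_rng(0)
fs = 16000                        # assumed sampling rate
t60 = 0.3                         # reverberation time in seconds
rir_len = int(round(t60 * fs))

sources = rng.standard_normal((2, fs // 2))  # 0.5 s noise stand-ins

# Synthetic RIRs: noise whose amplitude falls by 60 dB at t = t60.
t = np.arange(rir_len) / fs
decay = 10.0 ** (-3.0 * t / t60)
rirs = rng.standard_normal((2, 2, rir_len)) * decay

mix = convolutive_mixture(sources, rirs)
print("mixture shape:", mix.shape)
```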
We compared 4 methods: FDICA + ideal permutation solver, FDICA + DOA-based permutation solver, IVA, and ILRMA.
In FDICA+IPS, we used the reference (oracle) source spectrogram. So this is an upper limit of FDICA.
FDICA+DOA is a blind method that uses DOA clustering for solving the permutation problem.
Of course, IVA and ILRMA are also blind methods.
We used a Hamming window with various window lengths.
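One way an oracle-based permutation alignment could look is sketched below: in each frequency bin, the ordering of the estimated components is chosen to maximize the correlation of their power with the reference source power. The correlation criterion and the function name `align_with_reference` are assumptions; the talk only states that the reference spectrogram is used.

```python
import numpy as np
from itertools import permutations


def align_with_reference(Y, S_ref):
    """Reorder components per frequency bin using reference spectrograms.

    Y, S_ref: complex arrays of shape (n_src, n_bins, n_frames).
    """
    n_src, n_bins, _ = Y.shape
    aligned = np.empty_like(Y)
    for i in range(n_bins):
        est_pow = np.abs(Y[:, i]) ** 2
        ref_pow = np.abs(S_ref[:, i]) ** 2
        # corr[n, p]: correlation between reference n and estimate p
        corr = np.corrcoef(ref_pow, est_pow)[:n_src, n_src:]
        best = max(permutations(range(n_src)),
                   key=lambda perm: sum(corr[n, p] for n, p in enumerate(perm)))
        aligned[:, i] = Y[list(best), i]
    return aligned


rng = np.random.default_rng(0)
shape = (2, 16, 300)
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Simulate the permutation problem: swap the sources in every odd bin.
Y = S.copy()
Y[:, 1::2] = S[::-1, 1::2]

Y_aligned = align_with_reference(Y, S)
print("permutation resolved:", np.allclose(Y_aligned, S))
```

Because this needs the true sources, it is not usable in practice; it only serves as an upper bound on what a permutation solver for FDICA could achieve.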
First, we show the results for the ideal initialization case. Namely, we give the correct demixing matrix as the initial value, which can be calculated using the oracle source s. So, the initial value provides the best separation performance here.
In addition, only for ILRMA, we set the initial values of the NMF model, T and V, to the oracle values.
Therefore, FDICA+DOA and IVA use the spatial oracle initialization, and FDICA+IPS and ILRMA use the spatial and spectral oracle initialization.
This is the result. The left panels are music and the right panels are speech, with the short reverberation time at the top and the long one at the bottom.
The horizontal axis shows the window length, and the vertical axis shows the separation performance.
The colored lines are the results of ILRMA with various numbers of NMF bases.
In the music results, we can see that FDICA and IVA could not achieve good separation when the window becomes long.
In ILRMA, the performance is maintained even with very long windows. This comes from the full modeling of the time-frequency structure of each source.
However, for the speech signals, the performance of ILRMA becomes worse. We guess this is because speech does not have a low-rank time-frequency structure, and the source model could not capture the precise speech structure even when we set the source model to the oracle one.
Next, we show the results for the fully blind situation. The initial W is set to the identity matrix, and the initial source model is randomized.
Note that FDICA+IPS still uses the oracle spectrogram for solving the permutation.
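A minimal sketch of this blind initialization is given below. The array layout (one demixing matrix per frequency bin, one T and V per source) and the function name `blind_init` are assumptions; a practical implementation might also add a small floor to avoid values too close to zero.

```python
import numpy as np


def blind_init(n_src, n_bins, n_frames, n_bases, seed=0):
    """Identity demixing matrices and uniform random NMF variables."""
    rng = np.random.default_rng(seed)
    # One identity demixing matrix W_i per frequency bin.
    W = np.tile(np.eye(n_src, dtype=complex), (n_bins, 1, 1))
    # Basis and activation matrices for each source, uniform in [0, 1).
    T = rng.uniform(0.0, 1.0, size=(n_src, n_bins, n_bases))
    V = rng.uniform(0.0, 1.0, size=(n_src, n_bases, n_frames))
    return W, T, V


W, T, V = blind_init(n_src=2, n_bins=257, n_frames=100, n_bases=4)
print(W.shape, T.shape, V.shape)
```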
This is the result. We could not obtain the same results as in the previous case with ideal initialization.
The performance of all the methods is degraded when the window length becomes long.
Therefore, at least we can say that ILRMA has good potential to separate the sources even with a long window, but in practice, the blind estimation of a precise source model is a difficult problem.
This figure shows the difference between the source models in IVA and ILRMA.
Since IVA assumes a frequency-uniform scale, it is almost an NMF with only one flat basis.
On the other hand, ILRMA has a more flexible source model with an arbitrary number of spectral bases. So we can capture a more precise TF structure of each source.
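The "one flat basis" remark can be illustrated with a small sketch. With a single all-ones basis, the modeled variance is the same in every frequency bin and varies only over time; with several bases it can vary over both axes. The sizes and variable names are hypothetical, and this only mirrors the analogy in the notes rather than the exact IVA model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_frames = 6, 5

# IVA-like model: one flat basis, so the scale is uniform over frequency.
T_flat = np.ones((n_bins, 1))
V_flat = rng.uniform(0.1, 1.0, size=(1, n_frames))
R_iva_like = T_flat @ V_flat

# ILRMA-like model: several spectral bases (3 here, chosen arbitrarily).
T = rng.uniform(0.1, 1.0, size=(n_bins, 3))
V = rng.uniform(0.1, 1.0, size=(3, n_frames))
R_ilrma_like = T @ V

print("IVA-like variance identical across bins  :",
      np.allclose(R_iva_like, R_iva_like[0]))
print("ILRMA-like variance identical across bins:",
      np.allclose(R_ilrma_like, R_ilrma_like[0]))
```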