This document proposes a new method for linear multichannel blind source separation (BSS) based on time-frequency masks obtained from harmonic/percussive sound separation (HPSS). The proposed method applies HPSS independently to temporarily estimated sources to generate harmonic and percussive masks, then smooths the masks and uses them in time-frequency masking-based BSS. Experiments show the proposed method achieves higher source separation quality than single-channel HPSS and outperforms other multichannel BSS methods, demonstrating the effectiveness of integrating HPSS with multichannel BSS.
Linear multichannel blind source separation using harmonic/percussive separation time-frequency masks
1. Linear multichannel blind source separation based on time-frequency mask obtained by harmonic/percussive sound separation
Soichiro Oyabu¹, Daichi Kitamura¹, and Kohei Yatabe²
¹National Institute of Technology, Kagawa College, Japan
²Waseda University, Japan
2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
2. Background
• Blind source separation (BSS)
– aims at separating audio sources such as speech, noise,
musical instruments, and so on
– can be used for many audio applications
– extracts individual sources without any prior information or
training (unsupervised technique)
[Figure: mixture signal and separated drum source]
3. Background
• Multichannel BSS
– estimates demixing system W (inverse system of mixing system A)
– High separation quality because of spatial information
• Independent vector analysis (IVA) [Kim+, 2007]
• Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
• Time-frequency-masking-based BSS (TFMBSS) [Yatabe+, 2019]
• Single-channel BSS
– Monaural audio source separation problem
– More difficult than multichannel BSS
• Harmonic/percussive sound separation (HPSS) [Ono+, 2008]
[Figure: mixing system A and demixing system W]
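As a minimal sketch of the relation between the mixing system A and the demixing system W, the example below works in a single frequency bin with a known 2x2 mixing matrix. The matrix values, the random sources, and the helper names `mix` and `demix` are assumptions made for illustration. In actual BSS, A is unknown, and W is estimated from the observations alone, typically only up to scaling and permutation of the sources.

```python
import numpy as np

def mix(A, s):
    """Apply the mixing system: x = A s (sources -> observed channels)."""
    return A @ s

def demix(W, x):
    """Apply the demixing system: y = W x (observed channels -> estimates)."""
    return W @ x

# Hypothetical 2-source, 2-microphone example in one frequency bin.
rng = np.random.default_rng(0)
s = rng.standard_normal((2, 100)) + 1j * rng.standard_normal((2, 100))
A = np.array([[1.0, 0.6], [0.4, 1.0]], dtype=complex)  # assumed mixing matrix
W = np.linalg.inv(A)  # ideal demixing system; BSS has to estimate this blindly
x = mix(A, s)
y = demix(W, x)  # recovers s because W is the exact inverse of A here
```

Because the separation is a matrix product, the estimate is a linear function of the observation, which is what "linear" refers to in linear multichannel BSS.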
4. Background
• HPSS [Ono+, 2008], [FitzGerald, 2010], [Duong+, 2011], [Tachibana+, 2012], etc.
– separates harmonic and percussive sound sources in a
fully blind manner
– can be used for music analysis and remixing
• Ex. estimation of chords, tempo, rhythm, notes, and genre
[Figure: mixture signal separated into harmonic estimate and percussive estimate]
Aim: high-quality multichannel blind HPSS
5. Source Models in BSS
• In BSS, assumption of time-frequency structure in
each source (source model) is required
– IVA [Kim+, 2007]
• All the frequencies of each source
simultaneously have large power
• Linear spatial demixing
– ILRMA [Kitamura+, 2016]
• Power spectrogram of each source
has a low-rank time-frequency
structure
• Linear spatial demixing
– HPSS [Ono+, 2008], [Tachibana+, 2012]
• Harmonic: time-continuous structure
• Percussive: freq.-continuous structure
• Non-linear separation
[Figures: time-frequency structure (axes: time and frequency) of Source1 and Source2 for IVA and for ILRMA, and of the harmonic and percussive sources for HPSS]
6. Conventional Method: TFMBSS [Yatabe+, 2019]
• Time-frequency-masking-based BSS (TFMBSS)
– Linear multichannel BSS with plug-and-play source models
– Source model is input as a time-frequency mask
[Algorithm listing: TFMBSS iteration. Annotated steps: generate time-frequency mask based on temporarily estimated sources; masking process (entrywise product)]
7. Conventional Method: HPSS [Ono+, 2008]
• Harmonic/percussive sound separation (HPSS)
– Separate sources by focusing on “smoothness” along the time or frequency direction in the spectrogram
– Estimate H and P by iteratively minimizing the cost function
[Figure: spectrogram (axes: time and frequency) of the mixture signal showing harmonic components and percussive components, with the harmonic estimate and the percussive estimate]
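The smoothness idea can be illustrated with a small sketch. The cost function on the slide is not reproduced in this transcript, so the code below is not the cost of [Ono+, 2008]; it only computes two assumed roughness measures (sums of squared differences along time and along frequency) on toy spectrograms. The function names and the toy data are hypothetical.

```python
import numpy as np

def time_roughness(S):
    """Sum of squared differences between adjacent time frames (axis 1)."""
    return float(np.sum(np.diff(S, axis=1) ** 2))

def freq_roughness(S):
    """Sum of squared differences between adjacent frequency bins (axis 0)."""
    return float(np.sum(np.diff(S, axis=0) ** 2))

# Toy spectrograms with shape (frequency bins, time frames).
harmonic = np.zeros((8, 8))
harmonic[3, :] = 1.0    # horizontal stripe: constant along time
percussive = np.zeros((8, 8))
percussive[:, 5] = 1.0  # vertical stripe: constant along frequency
```

The horizontal stripe has zero roughness along time and the vertical stripe has zero roughness along frequency, so a cost that penalizes time roughness of H and frequency roughness of P favors assigning stripes of each orientation to the matching component.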
8. Proposed Algorithm: Process Flow
[Figure: process flow. Labels: observed signal (Ch 1, Ch 2), STFT, TFMBSS, temporal estimates, HPSS (two blocks, each with an initialization step), masks, smoothing (two blocks) with old masks, smoothed masks, back projection, inverse STFT, harmonic estimate, percussive estimate]
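The order of operations in the process flow can be sketched as a loop skeleton. Everything passed in (`hpss`, `make_masks`, `smooth`, `tfmbss_step`) is a hypothetical stand-in supplied by the caller, not an implementation of the actual blocks, and skipping the smoothing in the first iteration (when no old mask exists yet) is an assumption of this sketch.

```python
def proposed_loop(Z_H, Z_P, hpss, make_masks, smooth, tfmbss_step, n_iter):
    """Skeleton of the outer loop; every callable is a hypothetical stand-in.

    hpss(Z)                    -> (H, P) components of one temporary estimate
    make_masks(H_H, P_H, H_P, P_P) -> (M_H, M_P)
    smooth(M_new, M_old)       -> smoothed mask
    tfmbss_step(M_H, M_P)      -> updated temporary estimates (Z_H, Z_P)
    """
    M_H_old, M_P_old = None, None
    for _ in range(n_iter):
        # Apply HPSS independently to each temporary estimate.
        H_H, P_H = hpss(Z_H)
        H_P, P_P = hpss(Z_P)
        # Build the harmonic and percussive masks from the HPSS results.
        M_H, M_P = make_masks(H_H, P_H, H_P, P_P)
        # Smooth with the masks of the previous iteration, if any (assumption).
        if M_H_old is not None:
            M_H = smooth(M_H, M_H_old)
            M_P = smooth(M_P, M_P_old)
        # Return the masks to TFMBSS, which updates the temporary estimates.
        Z_H, Z_P = tfmbss_step(M_H, M_P)
        M_H_old, M_P_old = M_H, M_P
    return Z_H, Z_P
```

Passing the blocks in as callables keeps the skeleton independent of any particular HPSS or TFMBSS implementation; the STFT, back projection, and inverse STFT stages of the figure are outside this loop.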
9. Proposed Algorithm: Mask Calculation
• Two HPSS are independently applied to each of the temporarily estimated signals ZH and ZP
• Two Wiener-like masks, MH and MP, are constructed using the results of HPSS
– These masks enhance the harmonic or percussive components by eliminating the other components
[Figure: two HPSS blocks]
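The mask equations on the slide are not reproduced in this transcript, so the following is only a sketch of one common Wiener-like form: the power of the wanted component divided by the total power. The exact exponents, the small constant `eps`, and the function name are assumptions. The inputs follow the naming in the speaker notes: HPSS applied to ZH gives H_H and P_H, and HPSS applied to ZP gives H_P and P_P.

```python
import numpy as np

def wiener_like_masks(H_H, P_H, H_P, P_P, eps=1e-12):
    """Assumed Wiener-like masks built from the two HPSS results.

    M_H keeps the harmonic part of the harmonic estimate,
    M_P keeps the percussive part of the percussive estimate.
    eps avoids division by zero in silent time-frequency bins.
    """
    M_H = np.abs(H_H) ** 2 / (np.abs(H_H) ** 2 + np.abs(P_H) ** 2 + eps)
    M_P = np.abs(P_P) ** 2 / (np.abs(H_P) ** 2 + np.abs(P_P) ** 2 + eps)
    return M_H, M_P

# Tiny example with one time-frequency bin per spectrogram.
M_H, M_P = wiener_like_masks(
    H_H=np.array([[3.0]]), P_H=np.array([[4.0]]),
    H_P=np.array([[0.0]]), P_P=np.array([[2.0]]),
)
```

With this form every mask entry lies between 0 and 1, so bins dominated by the unwanted component are attenuated and bins dominated by the wanted component pass almost unchanged.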
10. Proposed Algorithm: Mask Smoothing
• In TFMBSS, drastic change of masks in each iteration will cause instability of parameter optimization
• Introduce mask smoothing process based on weighted geometric mean
– Intensity of smoothing can be controlled by β and βold
[Equation: weighted geometric mean, by entrywise product, of the mask calculated in the current iteration and the mask calculated in the previous iteration]
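A minimal sketch of the smoothing step, assuming the weighted geometric mean takes the entrywise form M^β ⊙ M_old^βold with β + βold = 1 (the speaker notes state that the two weights sum to unity; the exact equation is not reproduced in this transcript). The function name `smooth_mask` is hypothetical.

```python
import numpy as np

def smooth_mask(M_new, M_old, beta_old):
    """Entrywise weighted geometric mean of the current and previous masks.

    beta_old = 0 returns the current mask unchanged (no smoothing);
    larger beta_old keeps the result closer to the previous mask.
    """
    beta = 1.0 - beta_old  # the two weights sum to one
    return (M_new ** beta) * (M_old ** beta_old)

M_new = np.array([[16.0, 0.25]])
M_old = np.array([[1.0, 1.0]])
no_smoothing = smooth_mask(M_new, M_old, beta_old=0.0)
half = smooth_mask(M_new, M_old, beta_old=0.5)
strong = smooth_mask(M_new, M_old, beta_old=0.75)
```

The values 16 and 0.25 are chosen only to make the arithmetic easy to follow; masks in the method would normally lie between 0 and 1.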
11. Proposed Algorithm: Process Flow
[Figure: process flow, same as slide 8]
12. Experiments: Conditions
• Mixing condition of dry sources
Music dataset (dry sources): SiSEC2016 MUS [Liutkus+, 2016], “Drums” and “Other” sources of 20 songs
Impulse response: E2A in RWCP database [Nakamura+, 2000] (reverberation time: 300 ms)
Windowing in STFT: 128-ms-long Hann window with half-overlap shifting
Number of iterations in TFMBSS: 500
Evaluation score: Improvement of source-to-distortion ratio (SDR) [Vincent+, 2006]
[Figure: recording setup with two microphones 5.66 cm apart and two sources at 2 m, Other source (harmonic) and Drum source (percussive), with direction labels 50 and 50]
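To make "improvement of SDR" concrete, the sketch below uses a simplified energy-ratio definition of SDR. This is an assumption for illustration and is not the BSS Eval procedure of [Vincent+, 2006], which decomposes the estimate before computing the ratio. The function names and the toy signals are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2)."""
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))

def sdr_improvement(reference, mixture, estimate):
    """SDR of the separated estimate minus SDR of the unprocessed mixture."""
    return simple_sdr(reference, estimate) - simple_sdr(reference, mixture)

# Toy signals: the mixture carries interference of the same energy as the
# reference (0 dB), and the estimate carries one tenth of that amplitude.
reference = np.array([1.0, -1.0, 1.0, -1.0])
interference = np.ones(4)
mixture = reference + interference
estimate = reference + 0.1 * interference
improvement = sdr_improvement(reference, mixture, estimate)
```

Reporting the improvement rather than the raw SDR removes the dependence on how difficult each input mixture was to begin with.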
13. Experiment 1: Conditions
• Investigate the optimal number of iterations in HPSS blocks
• Compare average SDR improvements of the proposed method
14. Experiment 1: Results
• Average SDR improvements of the proposed method with various numbers of iterations in HPSS blocks
[Figure: horizontal axis: number of iterations in HPSS blocks (1, 3, 5, 7, 9, 11, 13, 15, 20); vertical axis: average SDR improvements [dB] (0 to 12)]
15. Experiment 2: Conditions
• Investigate the optimal smoothing parameter
• Compare average SDR improvements of the proposed method
16. Experiment 2: Results
• Typical example of SDR behaviors in the proposed method with various smoothing parameters
[Figure: horizontal axis: number of iterations of proposed method (0 to 500); vertical axis: SDR improvements [dB] (-2 to 16); curves: βold = 0, 0.5, 0.75, 0.875, 0.9375]
17. Experiment 2: Results
• Average SDR improvements of the proposed method with various smoothing parameters
[Figure: horizontal axis: smoothing parameter (0, 0.5, 0.75, 0.875, 0.9375); vertical axis: average SDR improvements [dB] (0 to 12)]
21. Conclusion
• Novelty
– Integration of conventional HPSS with TFMBSS
– Wiener-filter-based iterative mask update
– Iterative mask smoothing for stabilizing optimization
• Results
– Mask smoothing process drastically stabilizes the TFMBSS
optimization and improves the separation performance
– Achieved high-quality linear multichannel HPSS
Thank you for your attention!
Editor's Notes
Blind source separation, BSS in short, aims at separating audio sources such as speech, noise, musical instruments, and so on.
This technique can be used for many audio applications.
BSS is an unsupervised method that extracts individual sources without any prior information or training.
When we use multiple microphones in a recording, the BSS problem is called “multichannel BSS.”
This method estimates the demixing system W, which is an inverse system of the mixing system A, as shown in this figure. (point with the pointer)
Multichannel BSS provides better separation quality because we can use “spatial information” for estimating the demixing system W.
Many algorithms have been proposed, and the most successful methods are independent vector analysis (IVA), independent low-rank matrix analysis (ILRMA), and time-frequency-masking-based BSS (TFMBSS).
On the other hand, when the observed mixture is obtained in a monaural format, single-channel BSS must be applied to achieve the separation.
This is a more difficult problem than multichannel BSS.
In this presentation, we focus only on the method called “harmonic/percussive sound separation,” HPSS in short.
HPSS aims to separate harmonic and percussive audio sources, such as vocals and drums, in a fully blind manner.
Such a technique is crucial for many applications including music analysis, for example, estimation of chords, tempo, rhythm, notes, and genre.
Our research aim is to propose a high-quality multichannel blind HPSS method, which is useful for the applications we explained.
To achieve source separation in a fully blind manner, an assumption about the time-frequency structure of each source is required; this assumption is called the “source model.”
For example, in IVA, we assume that all the frequencies of each source simultaneously have large power, as in this figure. (point with the pointer)
In ILRMA, the power spectrogram of each source has a low-rank time-frequency structure. (point with the pointer)
Both IVA and ILRMA estimate the spatial demixing matrix, which provides multichannel linear separation results.
In HPSS, we assume two source models. (point with the pointer)
For the harmonic source, a time-continuous structure is assumed, whereas a frequency-continuous structure is assumed for the percussive source.
Since HPSS is a single-channel BSS, the separation mechanism is non-linear, and artificial distortions are sometimes generated.
As a generalized multichannel linear BSS framework, time-frequency-masking-based BSS, TFMBSS in short, has been proposed.
In TFMBSS, the source models can be easily replaced in a plug-and-play manner, where the source model is input as a time-frequency mask.
This slide shows the optimization algorithm in TFMBSS.
For a detailed explanation of this algorithm, please see the reference paper.
In the fourth line, we generate a time-frequency mask, that is, a source model, based on the temporarily estimated sources z.
Then, we apply the time-frequency masking in the fifth line to update the estimated sources.
Thus, any kind of time-frequency mask can be used as a source model in this BSS framework, and we can achieve linear multichannel BSS just as with IVA and ILRMA.
Next, I introduce the details of HPSS.
HPSS separates harmonic and percussive sounds by focusing on the “smoothness” along the time or frequency direction in the spectrogram.
This is because the harmonic components appear as horizontal stripe patterns, and the percussive components appear as vertical stripe patterns. (point at each with the pointer)
HPSS optimizes the separated signals H and P by iteratively minimizing this cost function (point with the pointer), which enhances the horizontal (time-direction) smoothness of H and the vertical (frequency-direction) smoothness of P.
The smoothness assumption in HPSS is reasonable for separating harmonic and percussive sounds.
However, since this single-channel BSS is a non-linear process, artificial distortions may be noticeable.
To achieve a high-quality linear multichannel HPSS, we propose to integrate conventional HPSS with the TFMBSS framework.
This is the process flow of the proposed linear multichannel HPSS.
In our method, TFMBSS (while pointing) estimates the linear demixing system using iterative optimization.
In each iteration of TFMBSS, the time-frequency masks of the harmonic and percussive sources are updated by the processes shown above (while pointing).
In each iteration of TFMBSS, the temporary estimates of the harmonic and percussive sources, ZH and ZP, are respectively input to the HPSS blocks (while pointing).
Then we obtain H and P components for each of ZH and ZP (while pointing).
From these components, we calculate new time-frequency masks, MH and MP, that enhance harmonic and percussive components in ZH and ZP, respectively (while pointing).
After that, a smoothing process is applied with the old masks calculated in the previous iteration.
Finally, the smoothed masks are returned to TFMBSS (while pointing).
This operation is iterated until TFMBSS converges.
The estimated harmonic and percussive source signals are obtained by inverse STFT.
In the next two slides, we explain the details of “the mask calculation,” here (while pointing), and “the smoothing process,” here (while pointing).
In the proposed algorithm, two HPSS processes are independently applied to each of the temporarily estimated signals, ZH and ZP.
From ZH, we obtain H super H and P super H. (“super” is short for “superscript”; “H super H” means H with superscript H)
Also, from ZP, we obtain H super P and P super P.
Using these HPSS results, we construct two Wiener-like masks, MH and MP, as shown in these equations.
These masks enhance the harmonic or percussive components by eliminating the other components.
In TFMBSS, since the optimization is performed by the primal-dual splitting algorithm, a drastic change of the masks in each iteration will cause instability of the parameter optimization.
To avoid this problem, we introduce a “mask smoothing process” based on the weighted geometric mean, like this calculation (while pointing), where M on the left-hand side is the mask of the current iteration, Mold is the mask of the previous iteration, and this operation (while pointing) is an entrywise product.
Beta and beta old are the parameters that control the intensity of smoothing, and the sum of beta and beta old equals unity.
After the mask calculation and smoothing processes, we input the new mask to TFMBSS.
This mask update is iterated until TFMBSS converges.
To evaluate our proposed method, we conducted BSS experiments.
This slide shows the experimental conditions.
For the dry sources, we used the SiSEC2016 MUS dataset.
In particular, we used the “drums” and “other” musical sources of 20 songs, where the “other” sources include various types of harmonic musical instruments, such as guitar, synthesizer, and so on.
These dry sources were convolved with the impulse responses recorded in this condition (while pointing) and spatially mixed, and we produced two-channel, two-source observed signals.
As the evaluation score, we used the improvement of the source-to-distortion ratio, SDR, which shows both the degree of separation and the sound quality of the separated signals.
We conducted three experiments.
In the first experiment, we investigated the optimal number of iterations in the HPSS blocks.
Since HPSS is also an iterative optimization algorithm, we compared the average SDR improvements of the proposed method with various numbers of HPSS iterations.
This is the result of the first experiment.
The horizontal axis shows the number of iterations in the HPSS blocks, and the vertical axis shows the average SDR improvements.
We can confirm that a small number of iterations in the HPSS blocks is not preferable (while pointing), and the performance saturates for more than 9 iterations. (while pointing)
For the other experiments, we set the number of iterations in HPSS to 15.
In the second experiment, we investigated the optimal smoothing parameter “beta old” in the smoothing blocks.
This parameter controls the intensity of mask smoothing.
If we set “beta old” to zero, no smoothing is applied.
Again, we compare the SDR improvements of the proposed method.
This graph is an example of SDR behaviors of the proposed method, where the horizontal axis shows the global iteration of the proposed method.
The colors show the intensity of the smoothing process.
The brightest color corresponds to “beta old equals zero,” that is, no smoothing, and the darkest color represents strong smoothing.
We can confirm that the smoothing process can stabilize the behavior of the SDR improvement, although the convergence speed slows down.
This behavior was common to all the songs.
This figure shows the average SDR results of 20 songs with 500 global iterations.
The horizontal axis shows the value of “beta old.”
The condition “beta old equals 0.75” (while pointing) achieves the highest performance.
Thus, in the third experiment, we used this value.
In the final experiment, we compared the proposed method with conventional BSS methods, namely, single-channel HPSS, multichannel HPSS, AuxIVA, and ILRMA.
This is the average result of each method.
Please note that only the single-channel HPSS is a non-linear method.
From this result, the proposed method greatly outperforms the other methods on the average of 20 songs.
Finally, we demonstrate some examples of blind HPSS.
The observed signal is a mixture of drums and guitar.
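(Play in order: mixture → single-channel HPSS H → AuxIVA H → ILRMA H → proposed H; no narration needed)
(Then play in order: single-channel HPSS P → AuxIVA P → ILRMA P → proposed P; no narration needed)
(Note: ILRMA may be skipped if time is short)
(When playback ends, advance to the summary slide without speaking)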
This is the conclusion.
(Explain if time permits)
That’s all. Thank you for your attention.