Stereophonic Music Separation Based on Non-negative Tensor Factorization with Cepstrum Regularization
1. Stereophonic Music Separation
Based on Non-negative Tensor Factorization
with Cepstrum Regularization
Shogo Seki, Tomoki Toda, Kazuya Takeda
(Nagoya University, Japan)
AASP-L4
2. Background
Music signals in CDs or streaming media
‐ Composed of many source signals
(e.g. bass, drum set, vocals)
‐ Represented as a two-channel (stereophonic) signal
Source separation for music signals
‐ Automatic music transcription [Smaragdis+03]
‐ Source localization [Ohtani+16]
‐ Vocal extraction [Vembu&Baumann05, Ikemiya+15]
Stereophonic music signal separation
[Figure: left (L) and right (R) channels of a stereophonic signal]
EUSIPCO 2017, Aug. 30, 14:30-16:10, AASP-L4
3. What is a stereophonic music signal?
Two-channel signal
→ Multichannel signal processing
Contains many source signals
(# of channel signals) < (# of source signals)
→ Underdetermined Blind Source Separation (BSS) problem
Usually manually synthesized (e.g. CD music)
‐ Individual source signals recorded separately
→ mixed with gain controls (i.e. panning)
‐ Pseudo spatial information:
No valid phase information for the separation
→ Use only magnitude spectrum information
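The points above can be illustrated with a minimal sketch, assuming NumPy and hypothetical gain values (the function name `pan_mix` and all numbers are illustrative, not taken from the slides). Panning is modeled as a pure gain matrix: with 2 channels and 3 sources the mixing matrix has rank 2, so the problem is underdetermined, and a single panned source appears in both channels as the same waveform at different levels, with no inter-channel delay to exploit.

```python
import numpy as np

def pan_mix(sources, gains):
    """Mix mono sources into channels with per-channel gains (panning only).

    sources: array of shape (n_sources, n_samples)
    gains:   array of shape (n_channels, n_sources)
    returns: array of shape (n_channels, n_samples)
    """
    sources = np.asarray(sources, dtype=float)
    gains = np.asarray(gains, dtype=float)
    return gains @ sources

rng = np.random.default_rng(0)
sources = rng.standard_normal((3, 16000))   # 3 hypothetical source signals
gains = np.array([[0.8, 0.3, 0.5],          # left-channel gains (illustrative)
                  [0.2, 0.7, 0.5]])         # right-channel gains (illustrative)

stereo = pan_mix(sources, gains)
print(stereo.shape)                          # (2, 16000)
print(np.linalg.matrix_rank(gains))          # 2, fewer than the 3 sources

# One source alone: left and right differ only by a scale factor.
only_first = pan_mix(sources[:1], gains[:, :1])
same_waveform = np.allclose(only_first[0] / gains[0, 0],
                            only_first[1] / gains[1, 0])
```

Because the channels of each source differ only in level, the phase of the two channels carries no extra spatial cue, which is consistent with working on magnitude spectra only.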
4. Research purpose
Develop a stereophonic music signal separation method
BSS method            Signal(s)      Condition        Spatial cues
IVA [Kim06]           Multichannel   Overdetermined   Use
NMF [Lee&Seung99]     Single         Underdetermined  None
MNMF [Sawada+13]      Multichannel   Underdetermined  Use
ILRMA [Kitamura+16]   Multichannel   Overdetermined   Use
Proposed              Multichannel   Underdetermined  None
5. Modeling music generation process
1. Multi-channel signal: linear combinations of sources with mixing gains
→ ① Panning operation
2. Source signals: low-rank structures in the magnitude spectral domain
→ ② NMF decomposition
Breaking the spectrogram into:
‐ Spectral patterns
‐ Time-varying gains
[Figure: NTF framework (tensor decomposition). Magnitude spectrograms of the multi-ch. signal ≈ Gain × Sources (① panning operation), with Sources ≈ Basis × Activation (② NMF decomposition)]
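The two modeling steps can be sketched as a small tensor factorization in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `G` (gains), `W` (basis), `H` (activation), the array shapes, and the use of standard multiplicative updates for the KL divergence are all choices made here for the example.

```python
import numpy as np

EPS = 1e-12

def reconstruct(G, W, H):
    """Model: Xhat[c,f,t] = sum_k G[c,k] * sum_b W[k,f,b] * H[k,b,t]."""
    return np.einsum('ck,kfb,kbt->cft', G, W, H)

def kl_divergence(X, Xhat):
    """Generalized KL divergence between observation and estimate."""
    return float(np.sum(X * np.log((X + EPS) / (Xhat + EPS)) - X + Xhat))

def ntf_step(X, G, W, H):
    """One round of multiplicative updates for G, W and H (KL divergence)."""
    # Gains: panning operation (step 1 of the model)
    S = np.einsum('kfb,kbt->kft', W, H)            # source spectrograms
    R = X / (reconstruct(G, W, H) + EPS)
    G = G * np.einsum('cft,kft->ck', R, S) / (S.sum(axis=(1, 2))[None, :] + EPS)
    # Basis: spectral patterns (step 2 of the model)
    R = X / (reconstruct(G, W, H) + EPS)
    W = W * np.einsum('cft,ck,kbt->kfb', R, G, H) / (
        np.einsum('ck,kbt->kb', G, H)[:, None, :] + EPS)
    # Activation: time-varying gains (step 2 of the model)
    R = X / (reconstruct(G, W, H) + EPS)
    H = H * np.einsum('cft,ck,kfb->kbt', R, G, W) / (
        np.einsum('ck,kfb->kb', G, W)[:, :, None] + EPS)
    return G, W, H

# Toy data: 2 channels, 20 bins, 30 frames, 3 sources, 4 basis vectors each.
rng = np.random.default_rng(0)
C, F, T, K, B = 2, 20, 30, 3, 4
X = reconstruct(rng.random((C, K)), rng.random((K, F, B)), rng.random((K, B, T)))

G, W, H = rng.random((C, K)), rng.random((K, F, B)), rng.random((K, B, T))
kl_start = kl_divergence(X, reconstruct(G, W, H))
for _ in range(100):
    G, W, H = ntf_step(X, G, W, H)
kl_end = kl_divergence(X, reconstruct(G, W, H))
print(kl_end < kl_start)
```

All factors stay non-negative because each update multiplies by a ratio of non-negative quantities. The toy sizes are far smaller than the experimental settings and are only meant to keep the example quick to run.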
6. Utilizing source information
Data accessibility
‐ Hard to prepare actual source information of target signals
‐ Possible to utilize similar source information
Supervised separation framework [Smaragdis+07]
‐ Learn basis spectra from training data
‐ Use them as:
‐ Fixed value (Supervised)
‐ Initial value (Lightly supervised)
[Figure: basis spectra learned from training data are used to fix or initialize Basis in the model Multi-ch. signal ≈ Gain × Basis × Activation]
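The difference between fixing and initializing the basis can be sketched with a single-channel KL-NMF in NumPy. This is a simplified stand-in for the multi-channel model; the function `nmf_kl`, its `update_basis` flag, and the random "spectrograms" are hypothetical and only illustrate the two modes.

```python
import numpy as np

EPS = 1e-12

def nmf_kl(V, W, n_iter=100, update_basis=True, seed=0):
    """KL-NMF with multiplicative updates, V ≈ W @ H.

    update_basis=False keeps W fixed (supervised);
    update_basis=True uses W only as an initial value (lightly supervised).
    """
    rng = np.random.default_rng(seed)
    W = W.copy()                                   # do not modify the caller's basis
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        R = V / (W @ H + EPS)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + EPS)
        if update_basis:
            R = V / (W @ H + EPS)
            W *= (R @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    return W, H

rng = np.random.default_rng(1)
V_train = rng.random((40, 5)) @ rng.random((5, 60))   # hypothetical training data
V_test = rng.random((40, 5)) @ rng.random((5, 50))    # hypothetical, mismatched target

# Training: learn basis spectra from the training data.
W_train, _ = nmf_kl(V_train, rng.random((40, 5)) + EPS)

# Separation: use the learned basis as a fixed value or as an initial value.
W_s, H_s = nmf_kl(V_test, W_train, update_basis=False)    # supervised
W_ss, H_ss = nmf_kl(V_test, W_train, update_basis=True)   # lightly supervised

print(np.array_equal(W_s, W_train))    # basis untouched when fixed
print(np.array_equal(W_ss, W_train))   # basis adapted when only initialized
```

Fixing the basis relies entirely on the training data matching the target, while initializing lets the basis drift toward the target signal at the cost of weaker constraints.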
7. Regularization for source timbre
Cepstral Distance Regularization [Li+16]
‐ Used in semi-supervised speech enhancement
‐ Jointly enhances both the spectrogram & features (MFCCs)
→ Constrains spectral envelopes (timbre information) of the sources
Summary of the proposed method
[Figure: training data → MFCC extraction → GMM modeling → prior information on Sources; training data → fix or initialize Basis; model: Multi-ch. signal ≈ Gain × Basis × Activation]
8. Objective function (to be minimized)
Formulation: (KL divergence) + (regularization parameter) × (CDR term)
‐ KL divergence: measured b/w observation & estimate
‐ Regularization parameter: weight of the CDR term
‐ Cepstral Distance Regularization (CDR) term:
negative log-likelihood of GMMs for MFCCs of sources
→ Can be optimized by the auxiliary function method
[Figure: MFCC sequences extracted from Sources in the model (Multi-ch. signal ≈ Gain × Basis × Activation) via Mel-filterbank (×), log, and IDCT]
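The structure of the objective can be sketched in NumPy as a KL term plus a weighted negative log-likelihood of GMMs evaluated on MFCCs of each source spectrogram. Everything named here is an assumption for illustration: the function names, the diagonal-covariance GMMs, the random matrix standing in for a mel filterbank, and the cosine transform matrix used for the final step (the figure labels it IDCT). The sketch only evaluates the objective; it does not include the auxiliary function updates.

```python
import numpy as np

EPS = 1e-12

def dct_matrix(n_ceps, n_mels):
    """Cosine transform matrix (DCT-II form, unnormalized), shape (n_ceps, n_mels)."""
    n = np.arange(n_mels)
    k = np.arange(n_ceps)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))

def mfcc_from_spectrogram(S, mel_fb, dct):
    """S: (F, T) magnitude spectrogram -> (n_ceps, T) cepstral features."""
    return dct @ np.log(mel_fb @ S + EPS)

def gmm_nll(C, weights, means, variances):
    """Negative log-likelihood of frames C (Q, T) under a diagonal GMM."""
    diff = C.T[:, None, :] - means[None, :, :]                       # (T, J, Q)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))  # (T, J)
    m = log_comp.max(axis=1, keepdims=True)                           # log-sum-exp
    log_lik = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(-log_lik.sum())

def kl_term(X, Xhat):
    return float(np.sum(X * np.log((X + EPS) / (Xhat + EPS)) - X + Xhat))

def objective(X, Xhat, source_specs, gmms, mel_fb, dct, lam):
    """KL(observation, estimate) + lam * sum of GMM NLLs over sources."""
    reg = sum(gmm_nll(mfcc_from_spectrogram(S, mel_fb, dct), *g)
              for S, g in zip(source_specs, gmms))
    return kl_term(X, Xhat) + lam * reg

# Toy example with hypothetical sizes and parameters.
rng = np.random.default_rng(0)
F, T, M, Q, J, K = 32, 10, 8, 5, 2, 3
mel_fb = rng.random((M, F))                 # stand-in for a mel filterbank
dct = dct_matrix(Q, M)
sources = [rng.random((F, T)) + 0.1 for _ in range(K)]
gains = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0]])
X = np.einsum('ck,kft->cft', gains, np.stack(sources))
Xhat = 1.1 * X                              # deliberately imperfect estimate
gmms = [(np.full(J, 1.0 / J), rng.standard_normal((J, Q)), np.ones((J, Q)))
        for _ in range(K)]

obj_plain = objective(X, Xhat, sources, gmms, mel_fb, dct, lam=0.0)
obj_reg = objective(X, Xhat, sources, gmms, mel_fb, dct, lam=0.1)
print(obj_plain, obj_reg)
```

Setting `lam=0.0` recovers the unregularized KL objective, and larger values put more weight on how well the source timbre matches the trained GMMs.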
9. Experimental evaluation
Investigation
‐ Effect of regularization
‐ Effect of supervised vs. lightly supervised separation on performance
• Updating the basis: lightly supervised, i.e. semi-supervised (SS)
• Fixing the basis: supervised (S)
Performance measurements (Larger is better)
‐ SDR (Signal-to-distortion ratio): sound quality
‐ SIR (Signal-to-interference ratio): suppression of non-target sources
‐ SAR (Signal-to-artifacts ratio): distortion introduced by the process
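As a rough intuition for these "larger is better" ratios, the sketch below computes a simplified energy ratio in dB between a reference source and its estimate. It is an assumption-laden illustration, not the evaluation procedure behind the reported numbers: full SDR/SIR/SAR measures first decompose the estimate into target, interference, and artifact components, which this hypothetical `simple_sdr_db` does not do.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Simplified distortion ratio: 10*log10(||ref||^2 / ||ref - est||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)   # hypothetical reference source
estimate = 0.9 * reference                  # estimate with a 10 % amplitude error

sdr = simple_sdr_db(reference, estimate)
print(round(sdr, 2))                         # 20.0
```

A 10 % amplitude error leaves an error energy of 1 % of the reference energy, which corresponds to 20 dB.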
3 songs (of 1 artist) in Cambridge Music Technology
‐ 2 songs : training (Dictionary/GMM for regularization)
‐ 1 song : evaluation (30 - 45 s) & development (GMM parameter)
10. Mixing setting
Experimental conditions
# of sources                           3
Mixing gains (L:R)                     2:1 (Ba), 1:2 (Dr), 1:1 (Vo)
Sampling frequency                     16 kHz
Frame size                             32 ms
Shift size                             16 ms
# of basis vectors/source              50
# of iterations (parameter updating)   400
# of mel-filter banks                  64
[Figure: source positions: Ba (left), Vo (center), Dr (right)]
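The conditions above translate into concrete array sizes. The sketch below, assuming NumPy, a Hann window, and random noise in place of real recordings (all assumptions for illustration), converts the frame and shift sizes to samples and builds the multi-channel magnitude spectrogram tensor for a mix with the listed L:R gain ratios.

```python
import numpy as np

fs = 16000                          # sampling frequency [Hz]
frame_len = fs * 32 // 1000         # 32 ms -> 512 samples
hop_len = fs * 16 // 1000           # 16 ms -> 256 samples
n_bins = frame_len // 2 + 1         # 257 non-negative frequency bins

def magnitude_stft(x, frame_len, hop_len):
    """Magnitude spectrogram of a 1-D signal, shape (n_bins, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop_len
    frames = np.stack([x[i * hop_len:i * hop_len + frame_len] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

# L:R ratios from the table: 2:1 (Ba), 1:2 (Dr), 1:1 (Vo); used unnormalized here.
gains = np.array([[2.0, 1.0, 1.0],   # left channel
                  [1.0, 2.0, 1.0]])  # right channel

rng = np.random.default_rng(0)
sources = rng.standard_normal((3, fs))        # 1 s of noise per source (placeholder)
stereo = gains @ sources                      # panned two-channel mix

X = np.stack([magnitude_stft(ch, frame_len, hop_len) for ch in stereo])
print(frame_len, hop_len, n_bins)             # 512 256 257
print(X.shape)                                # (2, 257, 61)
```

The resulting tensor has axes (channel, frequency, time), which is the shape a tensor factorization of the stereo magnitude spectrograms would operate on.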
11. Results
[Figure: separation results; annotations: "Stronger regularization", "Better performance", "w/o regularization"]
Comparison w/ or w/o regularization
‐ Better performance w/ regularization
→ Effective constraint on timbre
Comparison b/w semi-/supervised
‐ Large improvement in semi-supervised w/ regularization
→ Effective mismatch compensation
Effect of hyperparameter setting
‐ Optimum setting shared for any sources
‐ Needs to be optimized manually
→ Hyperparameter setting to be tuned
12. Conclusion
Proposed stereophonic music signal separation method
‐ NTF-based decomposition
‐ Panning operation for observed multi-channel signal
‐ Low-rankness for the source spectrograms
‐ Supervised separation framework
‐ Regularization on timbre of individual sources by CDR
Demonstrated effectiveness in supervised framework
‐ SS w/o reg. < S w/o reg. < S w/ reg. < SS w/ reg.
‐ Better separation performance w/ the regularization
Future work
‐ Hyperparameter setting investigation
‐ Investigation of various other music sources
Thank you for listening!