Stereophonic Music Separation Based on Non-negative Tensor Factorization with Cepstrum Regularization

Shogo Seki, Tomoki Toda, Kazuya Takeda
(Nagoya University, Japan)
AASP-L4

Background
 Music signals in CDs or streaming media
‐ Composed of many source signals
(e.g. bass, drum set, vocals)
‐ Represented as a two-channel (stereophonic) signal
 Source separation for music signals
‐ Automatic music transcription [Smaragdis+03]
‐ Source localization [Ohtani+16]
‐ Vocal extraction [Vembu&Baumann05, Ikemiya+15]
[Figure: stereophonic music signal separation, with left (L) and right (R) channels]
EUSIPCO 2017, Aug. 30, 14:30-16:10, AASP-L4

What is a stereophonic music signal?
 Two-channel signal
→ Multichannel signal processing
 Contains many source signals
(# of channel signals) < (# of source signals)
→ Underdetermined Blind Source Separation (BSS) problem
 Usually manually synthesized (e.g. CD music)
‐ Individual source signals recorded separately
→ mixed with gain controls (i.e. panning)
‐ Pseudo spatial information:
 No valid phase information for the separation
→ Use only magnitude spectrum information

Research purpose
 Develop stereophonic music signal separation method
BSS method            Signal(s)      Condition         Spatial cues
IVA [Kim06]           Multichannel   Overdetermined    Use
NMF [Lee&Seung99]     Single         Underdetermined   None
MNMF [Sawada+13]      Multichannel   Underdetermined   Use
ILRMA [Kitamura+16]   Multichannel   Overdetermined    Use
Proposed              Multichannel   Underdetermined   None
3

Modeling music generation process
1. Multi-channel signal: linear combinations of sources with mixing gains
→ ① Panning operation
2. Source signals: low-rank structures in the magnitude spectral domain
→ ② NMF decomposition, breaking each spectrogram into:
- Spectral patterns
- Time-varying gains
[Figure: NTF framework (tensor decomposition). The multi-channel signal is factorized into gains (① panning operation) and source magnitude spectrograms; each source spectrogram is factorized into basis and activation (② NMF decomposition).]
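The two-stage model above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the slides do not give update rules, so the standard multiplicative updates for the KL divergence are assumed, and the names `G`, `W`, `H`, `ntf_reconstruct`, and `ntf_update` are hypothetical.

```python
import numpy as np

def ntf_reconstruct(G, W, H):
    """G: (C, J) panning gains, W: (J, F, K) bases, H: (J, K, T) activations."""
    S = np.einsum("jfk,jkt->jft", W, H)      # (2) NMF: source magnitude spectrograms
    X_hat = np.einsum("cj,jft->cft", G, S)   # (1) panning: weighted sum over sources
    return X_hat, S

def kl_divergence(X, X_hat, eps=1e-12):
    return float(np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat))

def ntf_update(X, G, W, H, eps=1e-12):
    """One round of multiplicative KL updates (assumed, not taken from the slides)."""
    R = X / (ntf_reconstruct(G, W, H)[0] + eps)
    W = W * np.einsum("cj,cft,jkt->jfk", G, R, H) / (
        np.einsum("cj,jkt->jk", G, H)[:, None, :] + eps)
    R = X / (ntf_reconstruct(G, W, H)[0] + eps)
    H = H * np.einsum("cj,cft,jfk->jkt", G, R, W) / (
        np.einsum("cj,jfk->jk", G, W)[:, :, None] + eps)
    X_hat, S = ntf_reconstruct(G, W, H)
    R = X / (X_hat + eps)
    G = G * np.einsum("cft,jft->cj", R, S) / (S.sum(axis=(1, 2))[None, :] + eps)
    return G, W, H

# Toy problem: 2 channels, 3 sources, 16 bins, 4 bases per source, 30 frames.
rng = np.random.default_rng(0)
C, J, F, K, T = 2, 3, 16, 4, 30
X, _ = ntf_reconstruct(rng.random((C, J)), rng.random((J, F, K)), rng.random((J, K, T)))
G, W, H = rng.random((C, J)), rng.random((J, F, K)), rng.random((J, K, T))
kl_start = kl_divergence(X, ntf_reconstruct(G, W, H)[0])
for _ in range(50):
    G, W, H = ntf_update(X, G, W, H)
X_hat, S = ntf_reconstruct(G, W, H)
kl_end = kl_divergence(X, X_hat)
```

After fitting, `S[j]` plays the role of the estimated magnitude spectrogram of source `j`. The scale ambiguity between `G`, `W`, and `H` is left unresolved in this sketch.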

Utilizing source information
 Data accessibility
‐ Hard to prepare actual source information of target signals
‐ Possible to utilize similar source information
 Supervised separation framework [Smaragdis+07]
‐ Learn basis spectra from training data
‐ Use them as:
‐ Fixed value (Supervised)
‐ Initial value (Lightly supervised)
[Figure: basis spectra learned from training data are used to fix or initialize the basis in the factorization of the multi-channel signal into gain, basis, and activation.]
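The difference between fixing and initializing the basis can be sketched with a plain single-channel KL-NMF. This is a simplified stand-in for the multichannel model, and the function `kl_nmf` and its `update_basis` flag are hypothetical names, not taken from the slides.

```python
import numpy as np

def kl_nmf(V, W, n_iter=100, update_basis=True, seed=0, eps=1e-12):
    """KL-NMF with multiplicative updates; W is the starting (or fixed) basis."""
    rng = np.random.default_rng(seed)
    W = W.copy()
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        if update_basis:  # skipped when the basis is held fixed
            R = V / (W @ H + eps)
            W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

rng = np.random.default_rng(1)
V_train = rng.random((32, 40))   # stand-in for a training spectrogram
V_test = rng.random((32, 25))    # stand-in for the signal to separate

# Learn basis spectra from training data.
W_train, _ = kl_nmf(V_train, rng.random((32, 5)) + 1e-3)

# Supervised: the trained basis is a fixed value, only activations are estimated.
W_fixed, H_fixed = kl_nmf(V_test, W_train, update_basis=False)

# Lightly supervised: the trained basis is only the initial value and is adapted.
W_adapted, H_adapted = kl_nmf(V_test, W_train, update_basis=True)
```

Keeping the basis fixed ties the result to the training data, while updating it lets the model adapt to a mismatch between training and target sources.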

Regularization for source timbre
 Cepstral Distance Regularization [Li+16]
‐ Used in semi-supervised speech enhancement
‐ Jointly enhance both spectrogram & features (MFCCs)
→ Constrains spectral envelopes (timbre information) of the sources
 Summary of the proposed method
[Figure: MFCCs are extracted from the training data and modeled by GMMs, which serve as prior information on the sources; the training data are also used to fix or initialize the basis.]

Formulation
 Objective function (to be minimized), consisting of:
‐ KL divergence between the observation and its estimate
‐ Regularization parameter
‐ Cepstral Distance Regularization term
(Negative log-likelihood of GMMs for MFCCs of sources)
→ Can be optimized by auxiliary function method
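The symbols of the objective function did not survive in this text. One way to write an objective of the described form is given here; every symbol ($\mathbf{X}$, $\hat{\mathbf{X}}$, $\mu$, $\mathcal{R}$, $\mathbf{c}_{j,t}$, and the GMM parameters) is an assumed notation, not the authors' own.

```latex
\mathcal{J}(\Theta)
  = \mathcal{D}_{\mathrm{KL}}\bigl(\mathbf{X} \,\|\, \hat{\mathbf{X}}\bigr)
  + \mu \, \mathcal{R}(\Theta),
\qquad
\mathcal{D}_{\mathrm{KL}}\bigl(\mathbf{X} \,\|\, \hat{\mathbf{X}}\bigr)
  = \sum_{c,f,t} \Bigl( x_{cft} \log \frac{x_{cft}}{\hat{x}_{cft}} - x_{cft} + \hat{x}_{cft} \Bigr),
\qquad
\mathcal{R}(\Theta)
  = - \sum_{j} \sum_{t} \log \sum_{m} \pi_{j,m} \,
    \mathcal{N}\bigl(\mathbf{c}_{j,t} ;\, \boldsymbol{\mu}_{j,m}, \boldsymbol{\Sigma}_{j,m}\bigr)
```

Here $\mathbf{c}_{j,t}$ stands for the MFCC vector of the estimated source $j$ at frame $t$, and $\mu \ge 0$ weights the regularization term against the KL term.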
[Figure: MFCC sequences are extracted from the estimated source spectrograms through a mel-filterbank, a logarithm, and an IDCT.]
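The regularization term can be illustrated as follows: extract MFCCs from a source magnitude spectrogram, then score them with a GMM. This is a rough sketch under several assumptions: the triangular mel filterbank, the 13 cepstral coefficients, and the diagonal covariances are illustrative choices, and a DCT-II is used for the cepstral transform (the slide labels this step IDCT). All function names are hypothetical.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rise = (freqs - lo) / (mid - lo)
        fall = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rise, fall))
    return fb

def mfcc_from_magnitude(S, fb, n_ceps=13, eps=1e-10):
    """S: (F, T) magnitude spectrogram -> (n_ceps, T) cepstral coefficients."""
    log_mel = np.log(fb @ S + eps)
    n_mels = fb.shape[0]
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None]
                 * (np.arange(n_mels)[None, :] + 0.5) / n_mels)
    return dct @ log_mel

def gmm_neg_log_likelihood(C, weights, means, variances):
    """C: (D, T); diagonal-covariance GMM with means/variances of shape (M, D)."""
    diff = C.T[:, None, :] - means[None, :, :]                    # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_comp = log_comp + np.log(weights)                         # (T, M)
    mx = log_comp.max(axis=1, keepdims=True)
    log_lik = mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))
    return float(-log_lik.sum())

rng = np.random.default_rng(0)
fb = mel_filterbank(n_mels=64, n_fft=512, fs=16000)
S = rng.random((257, 20))            # stand-in for one estimated source spectrogram
mfcc = mfcc_from_magnitude(S, fb)    # (13, 20)
nll = gmm_neg_log_likelihood(
    mfcc, weights=np.array([0.5, 0.5]),
    means=rng.standard_normal((2, 13)), variances=np.ones((2, 13)))
```

In the proposed method the GMM would be trained on MFCCs of the training data; the random parameters above only exercise the computation.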

Experimental evaluation
 Investigation
‐ Effect of regularization
‐ Effect of supervised vs. lightly supervised separation on performance
• Updating the basis: lightly supervised (SS)
• Fixing the basis: supervised (S)
 Performance measurements (Larger is better)
‐ SDR (Signal-to-distortion ratio): sound quality
‐ SIR (Signal-to-interference ratio): suppression of non-target sources
‐ SAR (Signal-to-artifacts ratio): distortion introduced by the separation process
 3 songs (of 1 artist) from Cambridge Music Technology
‐ 2 songs : training (Dictionary/GMM for regularization)
‐ 1 song : evaluation (30 - 45 s) & development (GMM parameter)
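As a rough illustration of the "larger is better" measures, a simplified SDR can be computed as the energy ratio between a reference source and the estimation error. This is not the full BSS Eval procedure, which first decomposes the error into interference and artifact components to obtain SIR and SAR; the function name `simple_sdr_db` is hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over error energy (no BSS Eval projections)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)   # stand-in for a true source signal
estimate = 1.1 * reference               # estimate with a 10 % amplitude error
sdr = simple_sdr_db(reference, estimate)
```

An error amplitude of 10 % of the reference corresponds to an energy ratio of 100, that is, 20 dB.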

Experimental conditions
 Mixing setting
# of sources                           3
Mixing gains (L:R)                     2:1 (Ba), 1:2 (Dr), 1:1 (Vo)
Sampling frequency                     16 kHz
Frame size                             32 ms
Shift size                             16 ms
# of basis vectors/source              50
# of iterations (parameter updating)   400
# of mel-filter banks                  64
[Figure: source positions. Left: Ba, Center: Vo, Right: Dr]
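The mixing and analysis settings above translate into code roughly as follows. This is a sketch with assumptions: the gains are normalized so that each source's two channel gains sum to one, a Hann window is used, and white noise stands in for the actual bass, drum, and vocal tracks. None of these choices are stated in the slides.

```python
import numpy as np

FS = 16000                   # sampling frequency [Hz]
FRAME = FS * 32 // 1000      # 32 ms frame -> 512 samples
SHIFT = FS * 16 // 1000      # 16 ms shift -> 256 samples

# Rows: L, R channels. Columns: Ba (2:1), Dr (1:2), Vo (1:1).
GAINS = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0]])
GAINS = GAINS / GAINS.sum(axis=0, keepdims=True)   # assumed normalization

def magnitude_stft(x, frame=FRAME, shift=SHIFT):
    """Magnitude spectrogram of shape (frame // 2 + 1, n_frames)."""
    window = np.hanning(frame)                     # assumed window
    n_frames = 1 + (len(x) - frame) // shift
    frames = np.stack([x[i * shift:i * shift + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

rng = np.random.default_rng(0)
sources = rng.standard_normal((3, FS))             # 1 s noise stand-ins for Ba, Dr, Vo
stereo = GAINS @ sources                           # panning: (2, FS) stereo mixture
X = np.stack([magnitude_stft(ch) for ch in stereo])  # (channels, bins, frames)
```

Because panning only scales each source per channel, the sources keep the same phase in both channels, which is why only magnitude information is used for the separation.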

Results
[Figure: separation performance for different regularization strengths, compared with the result w/o regularization. Annotations: "Stronger regularization", "Better performance".]
 Comparison w/ or w/o regularization
‐ Better performance w/ regularization
→ Effective constraint on timbre
 Comparison b/w semi-supervised and supervised
‐ Large improvement in semi-supervised w/ regularization
→ Effective mismatch compensation
 Effect of hyperparameter setting
‐ Optimum setting shared across all sources
‐ Needs to be optimized manually
→ Hyperparameter setting to be tuned

Conclusion
 Proposed stereophonic music signal separation method
‐ NTF-based decomposition
‐ Panning operation for observed multi-channel signal
‐ Low-rankness for the source spectrograms
‐ Supervised separation framework
‐ Regularization on timbre of individual sources by CDR
 Demonstrated effectiveness in supervised framework
‐ SS w/o reg. < S w/o reg. < S w/ reg. < SS w/ reg.
‐ Better separation performance w/ the regularization
 Future work
‐ Hyperparameter setting investigation
‐ Investigation with a wider variety of music sources
Thank you for listening!