Apsipa2016for ss

Audio Signal Separation Using Supervised NMF with
Time-Variant All-Pole-Model-Based Basis Deformation
H. Nakajima (UTokyo), D. Kitamura (SOKENDAI),
N. Takamune (UTokyo), S. Koyama (UTokyo), H. Saruwatari (UTokyo),
Y. Takahashi (Yamaha R&D), K. Kondo (Yamaha R&D)
APSIPA2016 Organized Session on Advances in Acoustic Signal Processing

Nonnegative Matrix Factorization (NMF) [Lee, et al., 2001]
• Feature extraction based on low-rank representation
[Figure: the observed spectrogram 𝒀 (frequency × time) is approximated by the product 𝑭𝑮 of the basis matrix 𝑭 (frequently appearing spectra) and the activation matrix 𝑮 (gain variation over time).]
𝑓 : frequency bin
𝑡 : time frame
𝑘 : # of bases
• The extracted bases can be used for informed source separation,
e.g., music demixing and speech enhancement.
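The decomposition 𝒀 ≈ 𝑭𝑮 can be sketched with the standard multiplicative updates for the KL divergence. This is a minimal illustration under assumed settings (random toy data, 3 bases, 200 iterations); the function names `nmf_kl` and `kl_div` are hypothetical and not taken from the slides.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (assumption)

def kl_div(Y, X):
    """Generalized KL divergence D(Y || X) summed over all bins."""
    return np.sum(Y * np.log((Y + EPS) / (X + EPS)) - Y + X)

def nmf_kl(Y, n_bases, n_iter=200, seed=0):
    """Factorize a nonnegative matrix Y (freq x time) as Y ~ F @ G."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    F = rng.random((n_freq, n_bases)) + EPS  # basis matrix
    G = rng.random((n_bases, n_time)) + EPS  # activation matrix
    for _ in range(n_iter):
        R = Y / (F @ G + EPS)
        F *= (R @ G.T) / (G.sum(axis=1)[None, :] + EPS)
        R = Y / (F @ G + EPS)
        G *= (F.T @ R) / (F.sum(axis=0)[:, None] + EPS)
    return F, G

# Toy "spectrogram" built from 3 nonnegative components.
rng = np.random.default_rng(1)
Y = rng.random((64, 3)) @ rng.random((3, 40))
F, G = nmf_kl(Y, n_bases=3)
print("KL divergence after fitting:", kl_div(Y, F @ G))
```

Each column of `F` plays the role of one frequently appearing spectrum, and each row of `G` is its gain over time.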

Supervised NMF (SNMF) [Smaragdis, et al., 2007]
• Source separation using target-signal basis (supervision)
• Training: the basis is trained using target-signal samples.
• Separation: given the supervised basis, the separated spectrogram is estimated from the mixture 𝒀mix.
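The two SNMF stages can be sketched as follows. The sketch assumes the mixture model 𝒀mix ≈ 𝑭𝑮 + 𝑯𝑼 with the supervised basis 𝑭 fixed, the KL divergence as the cost, and toy random data; `train_basis` and `separate` are hypothetical names, and the iteration counts are arbitrary.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (assumption)

def train_basis(Y_train, n_bases, n_iter=200, seed=0):
    """Training stage: learn a basis F from target-only samples (KL-NMF)."""
    rng = np.random.default_rng(seed)
    F = rng.random((Y_train.shape[0], n_bases)) + EPS
    G = rng.random((n_bases, Y_train.shape[1])) + EPS
    for _ in range(n_iter):
        R = Y_train / (F @ G + EPS)
        F *= (R @ G.T) / (G.sum(axis=1)[None, :] + EPS)
        R = Y_train / (F @ G + EPS)
        G *= (F.T @ R) / (F.sum(axis=0)[:, None] + EPS)
    return F

def separate(Y_mix, F, n_interf, n_iter=200, seed=1):
    """Separation stage: fit Y_mix ~ F G + H U while keeping F fixed."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y_mix.shape
    G = rng.random((F.shape[1], n_time)) + EPS  # target activations
    H = rng.random((n_freq, n_interf)) + EPS    # interference basis
    U = rng.random((n_interf, n_time)) + EPS    # interference activations
    for _ in range(n_iter):
        R = Y_mix / (F @ G + H @ U + EPS)
        G *= (F.T @ R) / (F.sum(axis=0)[:, None] + EPS)
        R = Y_mix / (F @ G + H @ U + EPS)
        H *= (R @ U.T) / (U.sum(axis=1)[None, :] + EPS)
        R = Y_mix / (F @ G + H @ U + EPS)
        U *= (H.T @ R) / (H.sum(axis=0)[:, None] + EPS)
    return F @ G, H @ U  # estimated target and interference spectrograms

# Toy data: the target shares its spectra with the training samples.
rng = np.random.default_rng(2)
F_true = rng.random((40, 4))
Y_train = F_true @ rng.random((4, 60))
Y_mix = F_true @ rng.random((4, 50)) + rng.random((40, 2)) @ rng.random((2, 50))
F = train_basis(Y_train, n_bases=4)
Y_target_hat, Y_interf_hat = separate(Y_mix, F, n_interf=2)
```

Only `G`, `H`, and `U` are updated during separation, which is what makes the trained basis act as supervision.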

Objective of This Study
• Drawback of SNMF
→ Separation accuracy decreases when the trained basis differs from the target sound in the open data (mismatch between training and separation).
• We propose a new algorithm that deforms the trained basis to make it fit the open data.

SNMF with Additive Basis Deformation (SNMF-ABD)
[Kitamura, et al., 2013]
• Open-data adaptation by modifying supervised
basis 𝑭 with additive term 𝑫
Signal model: 𝒀mix ≈ (𝑭 + 𝑫)𝑮 + 𝑯𝑼
• Many orthogonality-penalty parameters are needed, but they are hard to control.
• Strong sensitivity to the initial value

SNMF with Time-Invariant Basis Deformation (TID)
[Nakajima, et al., EUSIPCO2016]
Training: supervision 𝑭org
Separation (iterated):
1. Generation of target by generalized MMSE-STSA estimator [Breithaupt, et al., 2008], using the interference 𝒀mix − 𝑭𝑮; this gives the estimated target 𝒀 and the binary mask 𝑰.
2. Basis deformation 𝑭 ← 𝑨𝑭org, with 𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑨𝑭org 𝑮)
𝑨 : Diagonal matrix with all-pole-model-based deformation
・Source separation and basis deformation are independently processed.
・Basis deformation is performed via target given by generalized MMSE-STSA estimator.
・Iterative basis deformation to extract reliable 𝒀
Hereafter we propose an improved algorithm introducing time variance.

Proposed Discriminative Time-Variant Deformation
① Supervised basis is classified into 2 parts, capturing time-variant nature.
   Supervision: 𝑭org = [𝑭atk, 𝑭sus]
   Basis deformation: 𝑭 ← [𝑨𝑭atk, 𝑩𝑭sus]
② Excessive deformation is avoided by discriminative training.
   Discriminative basis deformation considering interference
[Figure: training and separation flow with the estimator, the interference, and the binary mask 𝑰; ① and ② mark the two proposed parts.]

Proposed ①: Time Variance in Instruments
• The basis deformation model should be changed in accordance
with the difference in the physical mechanism of articulation.
• The physical mechanism differs between attack and sustain in musical
instruments. [N. H. Fletcher, 1991]
Ex: Piano articulation
[Figure: string and hammer in three stages: initial state, string is struck (transitional), free vibration]

Proposed ①: Basis Classification
• Bases are classified in accordance with how frequently each one appears in the attack part and in the sustain part.
• In each basis group, we apply a different deformation model.
Procedure:
1. Truncate the sustain part of the training sample and decompose the remaining attack part as ≈ 𝑭org 𝑮atk, giving the frequency of the attack part for each basis.
2. Truncate the attack part of the training sample and decompose the remaining sustain part as ≈ 𝑭org 𝑮sus, giving the frequency of the sustain part for each basis.
3. Classify 𝑭org into 𝑭1 and 𝑭2 based on the k-means method.
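The classification step can be sketched as below. The slides do not specify the exact feature, so this sketch assumes that the "frequency" of a basis is the fraction of frames in which its activation exceeds a relative threshold, and it uses a tiny two-cluster k-means written in NumPy. The names `usage_frequency` and `two_means`, the threshold, and the toy activations are all hypothetical.

```python
import numpy as np

def usage_frequency(G, rel_threshold=0.1):
    """Fraction of frames in which each basis is active (assumed feature)."""
    return np.mean(G > rel_threshold * G.max(), axis=1)

def two_means(X, n_iter=50):
    """Two-cluster k-means; cluster 0 starts from the most attack-like basis."""
    score = X[:, 0] - X[:, 1]  # attack frequency minus sustain frequency
    centers = np.stack([X[np.argmax(score)], X[np.argmin(score)]])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Toy activations: bases 0-2 are active in the attack part, 3-5 in the sustain part.
rng = np.random.default_rng(0)
G_atk = np.vstack([rng.random((3, 20)) + 0.5, 0.01 * rng.random((3, 20))])
G_sus = np.vstack([0.01 * rng.random((3, 80)), rng.random((3, 80)) + 0.5])

features = np.stack([usage_frequency(G_atk), usage_frequency(G_sus)], axis=1)
labels = two_means(features)

F_org = rng.random((16, 6))   # stand-in for the trained basis
F1 = F_org[:, labels == 0]    # attack group
F2 = F_org[:, labels == 1]    # sustain group
```

Initializing the two centers from the most attack-like and most sustain-like bases keeps the toy example deterministic; a library k-means with random restarts would serve the same purpose.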

Proposed ①: Deformation Model
• We prepare different deformation models for attack and sustain.
𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2)
𝒀 : Estimated target by generalized MMSE-STSA estimator
𝑰 : Binary mask for sampling reliable components
𝑭1 : Supervised basis trained using attack part only
𝑭2 : Supervised basis trained using sustain part only
𝑨 : Diagonal matrix with all-pole-model spectrum to deform 𝑭1 (deformation parameter)
𝑩 : Diagonal matrix with all-pole-model spectrum to deform 𝑭2 (deformation parameter)
𝑮1, 𝑮2 : Activation matrices corresponding to 𝑭1, 𝑭2
∘ : Hadamard product
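The deformation model can be sketched as below. The parametrization of the all-pole spectrum is not given in the slides, so this sketch assumes the amplitude response 1/|1 + Σ_p a_p e^{−jωp}| sampled on bins spanning [0, π]. The function name `all_pole_diag`, the coefficient values, and the random toy matrices are hypothetical.

```python
import numpy as np

def all_pole_diag(coeffs, n_freq):
    """Diagonal matrix holding an all-pole amplitude response (assumed form)."""
    w = np.linspace(0.0, np.pi, n_freq)          # normalized angular frequency
    p = np.arange(1, len(coeffs) + 1)            # pole orders 1..P
    denom = 1.0 + np.exp(-1j * np.outer(w, p)) @ np.asarray(coeffs, dtype=float)
    return np.diag(1.0 / np.abs(denom))

n_freq, n_time = 8, 5
rng = np.random.default_rng(0)
F1, F2 = rng.random((n_freq, 2)), rng.random((n_freq, 3))   # attack / sustain bases
G1, G2 = rng.random((2, n_time)), rng.random((3, n_time))   # their activations
I = (rng.random((n_freq, n_time)) > 0.3).astype(float)      # binary mask

A = all_pole_diag([-0.5], n_freq)        # deformation for the attack basis
B = all_pole_diag([0.3, 0.1], n_freq)    # deformation for the sustain basis

# Masked model I ∘ (A F1 G1 + B F2 G2)
X_masked = I * (A @ F1 @ G1 + B @ F2 @ G2)
```

Because 𝑨 and 𝑩 are diagonal, each one rescales the rows (frequency bins) of its basis group, so a few pole coefficients change the spectral envelope of every basis in the group at once.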

Proposed ①: Parameter Update
• Cost function: based on KL divergence
• Parameter update: by auxiliary-function method
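The update equations themselves are not recoverable from the slides. As an illustration of the general idea only, the sketch below evaluates a masked KL cost and applies the multiplicative activation update that the auxiliary-function method yields for a plain masked model 𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑭𝑮) with 𝑭 fixed. It is not the authors' update for 𝑨 and 𝑩; the names `masked_kl` and `update_activation` and the toy data are hypothetical.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (assumption)

def masked_kl(Y, X, I):
    """Generalized KL divergence D(I∘Y || I∘X) over bins where the mask is 1."""
    return np.sum(I * (Y * np.log((Y + EPS) / (X + EPS)) - Y + X))

def update_activation(Y, I, F, G):
    """One multiplicative update of G for fixed F under the masked KL cost."""
    X = F @ G + EPS
    return G * (F.T @ (I * Y / X)) / (F.T @ I + EPS)

rng = np.random.default_rng(0)
F = rng.random((32, 4)) + 0.1
Y = F @ rng.random((4, 50))
I = (rng.random(Y.shape) > 0.3).astype(float)
G = rng.random((4, 50)) + 0.1

costs = []
for _ in range(30):
    costs.append(masked_kl(Y, F @ G, I))
    G = update_activation(Y, I, F, G)
```

The recorded `costs` should not increase from one iteration to the next, which is the property that makes auxiliary-function updates attractive: no step size has to be tuned.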

Proposed ②: Discriminative Basis Deformation
• Large degree of freedom in 𝑨, 𝑩 often allows them to represent interference,
resulting in deterioration of separation accuracy.
• Discriminative deformation can mitigate such side effects.
Formulation as Bilevel Optimization
(𝑨, 𝑩) = arg min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2))   (fitness for target 𝒀 only)
subject to
(𝑮1, 𝑮2) = arg min_{𝑮1,𝑮2,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 + 𝑯𝑼))   (fitness for mixture 𝒀mix)
Here 𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 is the target component and 𝑯𝑼 is the interference component.
Owing to this cost, target and interference components are separately modeled.
→ 𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 is unlikely to represent the interference component in 𝒀.
Unfortunately, this problem is hard to solve, so we propose
an approximate solver algorithm.

Proposed ②: Approximated Algorithm
• Step 1: Initialization (the same as conventional one)
  min_{𝑨,𝑮1,𝑩,𝑮2} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2))
• Step 2: Modeling of mixture 𝒀mix
  min_{𝑮1,𝑮2,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 + 𝑯𝑼))
  Fixing the basis deformation matrices, we estimate the activations.
• Step 3: Modeling of target 𝒀
  min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2))
  Fixing the activation matrices, we estimate the deformation matrices.
We iteratively search for a set of deformation matrices that represent
the target spectrogram in the vicinity of those that fit the mixture.
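The alternation between the three steps can be sketched as follows. This is a simplified illustration, not the authors' solver: it assumes that 𝑨 and 𝑩 are free nonnegative diagonal gains (vectors `a`, `b`) rather than all-pole-model spectra, that all updates are KL multiplicative updates, and that Steps 2 and 3 are repeated a fixed number of times. All function names, iteration counts, and toy data are hypothetical.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (assumption)

def mu_act(Y, I, W, G):
    """Multiplicative update of activations G for a fixed basis W (masked KL)."""
    X = W @ G + EPS
    return G * (W.T @ (I * Y / X)) / (W.T @ I + EPS)

def mu_diag(Y, I, a, b, P1, P2):
    """Update diagonal gains a, b for Y ~ diag(a) P1 + diag(b) P2 (masked KL)."""
    R = I * Y / (a[:, None] * P1 + b[:, None] * P2 + EPS)
    d1, d2 = (I * P1).sum(axis=1), (I * P2).sum(axis=1)
    a = np.where(d1 > 0, a * (R * P1).sum(axis=1) / (d1 + EPS), a)
    b = np.where(d2 > 0, b * (R * P2).sum(axis=1) / (d2 + EPS), b)
    return a, b

def deform(a, b, F1, F2):
    """Deformed basis [A F1, B F2] with diagonal A = diag(a), B = diag(b)."""
    return np.hstack([a[:, None] * F1, b[:, None] * F2])

def discriminative_deformation(Y, Y_mix, I, F1, F2, n_interf=5,
                               n_outer=10, n_inner=10, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    k1, k = F1.shape[1], F1.shape[1] + F2.shape[1]
    a, b = np.ones(n_freq), np.ones(n_freq)
    G = rng.random((k, n_time)) + EPS          # stacked [G1; G2]
    H = rng.random((n_freq, n_interf)) + EPS
    U = rng.random((n_interf, n_time)) + EPS
    # Step 1: initialization, fit a, b, G1, G2 to the estimated target Y.
    for _ in range(n_inner):
        G = mu_act(Y, I, deform(a, b, F1, F2), G)
        a, b = mu_diag(Y, I, a, b, F1 @ G[:k1], F2 @ G[k1:])
    for _ in range(n_outer):
        # Step 2: a, b fixed; fit G1, G2, H, U to the mixture Y_mix.
        for _ in range(n_inner):
            W = np.hstack([deform(a, b, F1, F2), H])
            GU = mu_act(Y_mix, I, W, np.vstack([G, U]))
            G, U = GU[:k], GU[k:]
            X = W[:, :k] @ G + H @ U + EPS
            H = H * ((I * Y_mix / X) @ U.T) / (I @ U.T + EPS)
        # Step 3: G1, G2 fixed; fit a, b to the estimated target Y.
        for _ in range(n_inner):
            a, b = mu_diag(Y, I, a, b, F1 @ G[:k1], F2 @ G[k1:])
    return a, b, G[:k1], G[k1:], H, U

# Toy data: the target is a frequency-dependent rescaling of the trained bases.
rng = np.random.default_rng(1)
F1, F2 = rng.random((24, 3)), rng.random((24, 3))
gain = np.linspace(0.5, 2.0, 24)  # hypothetical mismatch between training and open data
Y = gain[:, None] * (F1 @ rng.random((3, 30)) + F2 @ rng.random((3, 30)))
Y_mix = Y + rng.random((24, 2)) @ rng.random((2, 30))
I = np.ones_like(Y)
a, b, G1, G2, H, U = discriminative_deformation(Y, Y_mix, I, F1, F2, n_interf=2)
```

Because the activations used in Step 3 come from the mixture fit in Step 2, the gains are only asked to explain the target through activations that already share the mixture with 𝑯𝑼, which is the discriminative idea described above.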

Experimental Evaluation: Condition
Instruments: Oboe (Ob.), Piano (Pf.), Trombone (Tb.)
Training (MIDI): Garritan Professional Orchestra
Open target (MIDI): Microsoft GS Wavetable SW Synth
Sampling freq.: 44100 Hz
FFT length: 4096 points (100 ms)
Shift length: 512 points (15 ms)
# of bases: Target: 100, Interference: 30
Truncation period for extraction of attack: 50 ms
Comparison: Conventional methods (SNMF, SNMF-ABD, TID) and the proposed method
Evaluation score: Signal-to-Distortion Ratio (SDR) [dB]
(for evaluating total quality of separated signal)
• Different MIDI generators were used for training and open data.
• Source separation for 2-sound mixture using supervised basis.
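The slides do not say how the SDR score was computed. As a rough illustration only, the sketch below uses a simplified energy-ratio definition, 10 log10(‖s‖² / ‖s − ŝ‖²). Toolkits such as BSS Eval decompose the estimate further, so values from this sketch are not directly comparable with the reported ones. The function name `simple_sdr` and the test tone are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified SDR in dB: energy of the reference over energy of the error."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

# A 440 Hz tone sampled at 44100 Hz, estimated with a 10 % gain error.
s = np.sin(2.0 * np.pi * 440.0 * np.arange(4410) / 44100.0)
print(simple_sdr(s, 1.1 * s))  # approximately 20 dB
```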

Music Score Used in Experiment
・Training samples (Oboe, Piano, Trombone): 2-octave chromatic scale
・Open data (mixture; Oboe, Piano, Trombone): test song for NMF research [Kitamura, 2014]

Results 1: Example
Ex. Piano-sound extraction from mixture of oboe and piano
→ Better SDR than the conventional methods

Results 2: Overall Evaluation
Task        SNMF [dB]   SNMF-ABD [dB]   TID [dB]   Proposed [dB]
Ob. & Pf.   6.7         8.1             6.7        7.0
Ob. & Tb.   2.4         2.6             2.8        2.9
Pf. & Ob.   4.1         3.6             5.2        6.1
Pf. & Tb.   3.1         3.2             4.5        4.5
Tb. & Ob.   0.7         0.2             2.4        2.8
Tb. & Pf.   2.9         2.6             3.9        4.4
“A & B” means the task of extracting “A” from a mixture of A and B.
SNMF-ABD: Basis deformation NMF in parallel with separation
TID: Time-invariant deformation NMF without considering interference

• The proposed method matches or outperforms SNMF and TID in all combinations (it ties with TID for Pf. & Tb.).
• SNMF-ABD wins in only one case (Ob. & Pf.) and loses in the other cases.
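The comparison can be checked with simple arithmetic over the SDR values in the table. The per-method averages printed below are derived here from those values and are not reported in the slides; the variable names are hypothetical.

```python
# SDR [dB] per task, in the order (SNMF, SNMF-ABD, TID, Proposed), copied from the table.
methods = ("SNMF", "SNMF-ABD", "TID", "Proposed")
sdr = {
    "Ob. & Pf.": (6.7, 8.1, 6.7, 7.0),
    "Ob. & Tb.": (2.4, 2.6, 2.8, 2.9),
    "Pf. & Ob.": (4.1, 3.6, 5.2, 6.1),
    "Pf. & Tb.": (3.1, 3.2, 4.5, 4.5),
    "Tb. & Ob.": (0.7, 0.2, 2.4, 2.8),
    "Tb. & Pf.": (2.9, 2.6, 3.9, 4.4),
}

# Average SDR per method (derived, not reported in the slides).
mean_sdr = {m: sum(row[i] for row in sdr.values()) / len(sdr)
            for i, m in enumerate(methods)}

# Tasks where SNMF-ABD beats the proposed method, and where TID ties with it.
abd_wins = [task for task, row in sdr.items() if row[1] > row[3]]
tid_ties = [task for task, row in sdr.items() if row[2] == row[3]]

# The proposed method is never worse than SNMF or TID.
never_worse = all(row[3] >= max(row[0], row[2]) for row in sdr.values())

for m in methods:
    print(f"{m:9s} mean SDR = {mean_sdr[m]:.2f} dB")
print("SNMF-ABD wins:", abd_wins)
print("Ties with TID:", tid_ties)
```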

Conclusion
• In this study, we propose a new advanced SNMF that
includes time-variant (attack & sustain) deformation of
the trained basis to make it fit the target sound.
• Also, to avoid excessive deformation, we propose
a discriminative basis deformation. To solve
the bilevel optimization problem, we introduce an
approximate algorithm.
• From the experimental results, it was confirmed that
the proposed method outperforms the conventional
methods in many cases.
Thank you for your attention!
