Apsipa2016for ss

Invited Talk in APSIPA2016 "Advances in Acoustic Signal Processing"

Published in: Engineering

  1. 1. Audio Signal Separation Using Supervised NMF with Time-Variant All-Pole-Model-Based Basis Deformation. H. Nakajima (UTokyo), D. Kitamura (SOKENDAI), N. Takamune (UTokyo), S. Koyama (UTokyo), H. Saruwatari (UTokyo), Y. Takahashi (Yamaha R&D), K. Kondo (Yamaha R&D). APSIPA2016 Organized Session on Advances in Acoustic Signal Processing
  2. 2. Nonnegative Matrix Factorization (NMF) [Lee, et al., 2001] • Feature extraction based on low-rank representation: 𝒀 ≈ 𝑭𝑮, where 𝒀 is the observation (amplitude spectrogram, frequency × time), 𝑭 is the basis matrix (frequently appearing spectra), and 𝑮 is the activation matrix (gain variation over time). 𝑓 : frequency bin, 𝑡 : time frame, 𝑘 : number of bases. • The extracted bases can be used for informed source separation, e.g., music demixing and speech enhancement.
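The factorization on this slide can be sketched with the standard multiplicative updates for the generalized KL divergence. This is a minimal illustration, not the authors' code: the function names `kl_div` and `nmf_kl`, the random initialization, and the iteration count are assumptions, and the choice of KL divergence follows the cost used later in the deck.

```python
import numpy as np

def kl_div(Y, X, eps=1e-12):
    """Generalized KL divergence D(Y || X), summed over all entries."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))

def nmf_kl(Y, n_bases, n_iter=200, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix Y (freq x time) as F @ G."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    F = rng.random((n_freq, n_bases)) + eps   # basis matrix (spectra)
    G = rng.random((n_bases, n_time)) + eps   # activation matrix (gains)
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Multiplicative updates keep F and G nonnegative.
        F *= ((Y / (F @ G + eps)) @ G.T) / (ones @ G.T + eps)
        G *= (F.T @ (Y / (F @ G + eps))) / (F.T @ ones + eps)
    return F, G
```

In practice `Y` would be an amplitude spectrogram; any nonnegative 2-D array works for trying the sketch out.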
  3. 3. Supervised NMF (SNMF) [Smaragdis, et al., 2007] • Source separation using a target-signal basis (supervision). • Training: the basis 𝑭 is trained using target-signal samples. • Separation: with the supervised basis 𝑭 given, the remaining factors are estimated from the mixture 𝒀mix, which yields the separated spectrogram.
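A rough sketch of the separation stage, assuming the mixture model 𝒀mix ≈ 𝑭𝑮 + 𝑯𝑼 with the trained basis 𝑭 held fixed and an interference part 𝑯𝑼 learned from the mixture (the symbols 𝑯 and 𝑼 appear on later slides). The function name `snmf_separate`, the KL multiplicative updates, and the absence of any penalty term are assumptions made for illustration.

```python
import numpy as np

def snmf_separate(Y_mix, F, n_interf, n_iter=200, seed=0, eps=1e-12):
    """Fit Y_mix ~ F @ G + H @ U with F fixed; return target and interference parts."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y_mix.shape
    G = rng.random((F.shape[1], n_time)) + eps   # target activations
    H = rng.random((n_freq, n_interf)) + eps     # interference bases
    U = rng.random((n_interf, n_time)) + eps     # interference activations
    ones = np.ones_like(Y_mix)
    for _ in range(n_iter):
        R = Y_mix / (F @ G + H @ U + eps)
        G *= (F.T @ R) / (F.T @ ones + eps)
        R = Y_mix / (F @ G + H @ U + eps)
        H *= (R @ U.T) / (ones @ U.T + eps)
        R = Y_mix / (F @ G + H @ U + eps)
        U *= (H.T @ R) / (H.T @ ones + eps)
    return F @ G, H @ U
```

The first returned matrix plays the role of the separated target spectrogram; the supervised basis is never updated, which is why a mismatch between training and open data hurts accuracy.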
  4. 4. Objective of This Study • Drawback of SNMF → Separation accuracy decreases when the trained basis differs from the target sound in the open data. • We propose a new algorithm that deforms the trained basis so that it fits the open data.
  5. 5. SNMF with Additive Basis Deformation (SNMF-ABD) [Kitamura, et al., 2013] • Open-data adaptation by modifying the supervised basis 𝑭 with an additive term 𝑫. Signal model: 𝒀mix ≈ (𝑭 + 𝑫)𝑮 + 𝑯𝑼. • Drawbacks: many orthogonality penalty parameters are needed and they are hard to control; the result is strongly sensitive to the initial values.
  6. 6. SNMF with Time-Invariant Basis Deformation (TID) [Nakajima, et al., EUSIPCO2016] (Diagram: Training → Separation, with supervision 𝑭org.) ・Source separation and basis deformation are processed independently. ・Basis deformation is performed toward a target given by the generalized MMSE-STSA estimator [Breithaupt, et al., 2008]. ・Basis deformation is iterative.
  7. 7. SNMF with Time-Invariant Basis Deformation (TID) [Nakajima, et al., EUSIPCO2016] (Diagram: Training → Separation → generation of target → basis deformation, with supervision 𝑭org.) ・Source separation and basis deformation are processed independently. ・Generation of target: the generalized MMSE-STSA estimator [Breithaupt, et al., 2008] gives the estimated target 𝒀, using 𝒀mix − 𝑭𝑮 as the interference; a binary mask 𝑰 extracts the reliable components of 𝒀. ・Basis deformation (iterative): fit 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭org 𝑮), then set 𝑭 ← 𝑨𝑭org, where 𝑨 is a diagonal matrix with all-pole-model-based deformation. Hereafter we propose an improved algorithm that introduces time variance.
  8. 8. Proposed Discriminative Time-Variant Deformation ① The supervised basis is classified into two parts to capture the time-variant nature. ② Excessive deformation is avoided by discriminative training. (Diagram, as on the previous slide: Training → Separation → generation of target by generalized MMSE-STSA estimator (interference 𝒀mix − 𝑭𝑮, estimated target 𝒀, binary mask 𝑰) → basis deformation with supervision 𝑭org: 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭org 𝑮), 𝑭 ← 𝑨𝑭org.)
  9. 9. Proposed Discriminative Time-Variant Deformation ① The supervised basis is classified into two parts to capture the time-variant nature: 𝑭org = [𝑭atk, 𝑭sus], 𝑭 ← [𝑨𝑭atk, 𝑩𝑭sus]. ② Excessive deformation is avoided by discriminative training: discriminative basis deformation that considers the interference. (Diagram: Training → Separation → generation of target by generalized MMSE-STSA estimator (interference 𝒀mix − 𝑭𝑮, estimated target 𝒀, binary mask 𝑰); 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭org 𝑮).)
  10. 10. Proposed ①: Time Variance in Instruments • The physical mechanism differs between attack and sustain in musical instruments [N. H. Fletcher, 1991]. Ex: piano articulation (string and hammer): initial state → flip string (transitional) → free vibration. • The basis deformation model should therefore change in accordance with the difference in the physical mechanism of articulation.
  11. 11. Proposed ①: Basis Classification • Bases are classified according to how frequently they are used in the attack part and in the sustain part. • A different deformation model is applied to each basis group. Procedure: truncate the sustain part of the training sample and decompose the remaining attack part as ≈ 𝑭org 𝑮atk (frequency of attack part for each basis); truncate the attack part of the training sample and decompose the remaining sustain part as ≈ 𝑭org 𝑮sus (frequency of sustain part for each basis); classify 𝑭org into 𝑭1 and 𝑭2 using the k-means method.
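One possible reading of this classification step, sketched under several assumptions: each basis is described by how much of its total activation falls in the attack-only and sustain-only decompositions, and a two-cluster k-means splits the bases. The feature definition, the deterministic initialization, and the names `two_means` and `classify_bases` are hypothetical; the slide does not spell out these details.

```python
import numpy as np

def two_means(X, n_iter=50):
    """Plain Lloyd's algorithm with k=2 on the rows of X."""
    # Deterministic init: rows with the smallest and largest first feature.
    centers = X[[np.argmin(X[:, 0]), np.argmax(X[:, 0])]].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def classify_bases(G_atk, G_sus):
    """Return (attack-like basis indices, sustain-like basis indices)."""
    use_atk = G_atk.sum(axis=1)          # usage of each basis in attack parts
    use_sus = G_sus.sum(axis=1)          # usage of each basis in sustain parts
    total = use_atk + use_sus + 1e-12
    feats = np.stack([use_atk / total, use_sus / total], axis=1)
    labels = two_means(feats)
    # Cluster 1 was seeded with the largest attack share, so call it "attack".
    return np.flatnonzero(labels == 1), np.flatnonzero(labels == 0)
```

With the returned index sets, `F_org[:, idx_atk]` and `F_org[:, idx_sus]` would play the roles of 𝑭1 and 𝑭2.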
  12. 12. Proposed ①: Deformation Model • We prepare different deformation models for attack and sustain: 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2), where 𝑨 and 𝑩 are the deformation parameters.
     𝒀 : estimated target given by the generalized MMSE-STSA estimator
     𝑰 : binary mask for sampling reliable components
     𝑭1 : supervised basis trained using the attack part only
     𝑭2 : supervised basis trained using the sustain part only
     𝑨 : diagonal matrix with an all-pole-model spectrum to deform 𝑭1
     𝑩 : diagonal matrix with an all-pole-model spectrum to deform 𝑭2
     𝑮1, 𝑮2 : activation matrices corresponding to 𝑭1, 𝑭2
     ○ : Hadamard product
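To make the deformation model concrete, the sketch below builds an all-pole magnitude envelope and applies it as the diagonal of 𝑨 or 𝑩. The exact parametrization used by the authors is not given in the transcript, so the filter form 1/|1 + Σ a_p e^{-jωp}|, the uniform frequency grid on [0, π], and the names `all_pole_envelope` and `deformed_model` are assumptions.

```python
import numpy as np

def all_pole_envelope(ar_coefs, n_freq, gain=1.0):
    """gain / |1 + sum_p a_p exp(-j w p)| sampled on n_freq bins in [0, pi]."""
    coefs = np.asarray(ar_coefs, dtype=float)
    omega = np.linspace(0.0, np.pi, n_freq)
    orders = np.arange(1, len(coefs) + 1)
    denom = 1.0 + np.exp(-1j * np.outer(omega, orders)) @ coefs
    return gain / np.abs(denom)

def deformed_model(F1, G1, F2, G2, a, b):
    """A F1 G1 + B F2 G2, with A = diag(a) and B = diag(b)."""
    # Multiplying row f of each basis by a[f] (or b[f]) equals diag(a) @ F1.
    return (a[:, None] * F1) @ G1 + (b[:, None] * F2) @ G2
```

Because the envelope is controlled by a handful of coefficients, the deformation is smooth across frequency, which is presumably why an all-pole model is used instead of a free gain per bin.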
  13. 13. Proposed ①: Parameter Update • Cost function based on the KL divergence. • Parameter update by the auxiliary-function method.
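The equations of this slide did not survive in the transcript. As a hedged illustration of what a masked KL-divergence cost looks like for the model 𝑰 ○ 𝒀 ≈ 𝑰 ○ 𝑿, the sketch below evaluates the generalized KL divergence on the masked entries only. The function name `masked_kl` is hypothetical, and the update rules derived by the authors are not reproduced here.

```python
import numpy as np

def masked_kl(Y_hat, X, mask, eps=1e-12):
    """Generalized KL divergence D(mask*Y_hat || mask*X), summed over entries."""
    Ym = mask * Y_hat
    Xm = mask * X
    # Entries where mask == 0 contribute exactly zero to the sum.
    return float(np.sum(Ym * np.log((Ym + eps) / (Xm + eps)) - Ym + Xm))
```

The cost is zero when the model matches the estimated target on every unmasked entry and ignores whatever the model does elsewhere.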
  14. 14. Proposed ②: Discriminative Basis Deformation • The large degree of freedom in 𝑨 and 𝑩 often lets the deformed basis represent interference, which degrades separation accuracy. • Discriminative deformation can mitigate this side effect. Formulation as bilevel optimization:
     𝑨, 𝑩 = arg min over 𝑨, 𝑩 of 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2)) (fitness for target 𝒀 only)
     subject to 𝑮1, 𝑮2 = arg min over 𝑮1, 𝑮2, 𝑯, 𝑼 of 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 + 𝑯𝑼)) (fitness for mixture 𝒀mix)
     Here 𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 is the target component and 𝑯𝑼 is the interference component. Owing to the mixture cost, the target and interference components are modeled separately, so 𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 can hardly represent the interference component remaining in 𝒀. Unfortunately this problem is hard to solve, so we propose an approximate algorithm.
  15. 15. Proposed ②: Approximated Algorithm
     • Step 1: Initialization (the same as the conventional method): min over 𝑨, 𝑮1, 𝑩, 𝑮2 of 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2))
     • Step 2: Modeling of mixture 𝒀mix: min over 𝑮1, 𝑮2, 𝑯, 𝑼 of 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2 + 𝑯𝑼)). Fixing the basis deformation matrices, we estimate the activations.
     • Step 3: Modeling of target 𝒀: min over 𝑨, 𝑩 of 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭1𝑮1 + 𝑩𝑭2𝑮2)). Fixing the activation matrices, we estimate the deformation matrices.
     We iteratively search for the set of deformation matrices that represent the target spectrogram in the vicinity of those that fit the mixture.
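The three steps can be sketched as an alternating loop. This is a simplified illustration under stated assumptions, not the authors' algorithm: the diagonals of 𝑨 and 𝑩 are treated as free nonnegative gains `a` and `b` (the all-pole parametrization is dropped), all updates are plain KL multiplicative rules, and the names `step2`, `step3`, and `deform` are hypothetical.

```python
import numpy as np

EPS = 1e-12

def _ratio(Y, I, X):
    """Masked ratio I * Y / X that appears in KL multiplicative updates."""
    return I * Y / (X + EPS)

def _factor(num, den):
    """num / den, or 1 where den is zero (e.g. a fully masked frequency bin)."""
    return np.where(den > 0, num / np.maximum(den, EPS), 1.0)

def step2(Y_mix, I, a, b, F1, F2, G1, G2, H, U):
    """Deformation (a, b) fixed: update G1, G2, H, U on the masked mixture."""
    F1d, F2d = a[:, None] * F1, b[:, None] * F2
    G1 = G1 * _factor(F1d.T @ _ratio(Y_mix, I, F1d @ G1 + F2d @ G2 + H @ U), F1d.T @ I)
    G2 = G2 * _factor(F2d.T @ _ratio(Y_mix, I, F1d @ G1 + F2d @ G2 + H @ U), F2d.T @ I)
    H = H * _factor(_ratio(Y_mix, I, F1d @ G1 + F2d @ G2 + H @ U) @ U.T, I @ U.T)
    U = U * _factor(H.T @ _ratio(Y_mix, I, F1d @ G1 + F2d @ G2 + H @ U), H.T @ I)
    return G1, G2, H, U

def step3(Y_hat, I, a, b, F1, F2, G1, G2):
    """Activations fixed: update the diagonal gains a, b on the masked target."""
    P, Q = F1 @ G1, F2 @ G2
    R = _ratio(Y_hat, I, a[:, None] * P + b[:, None] * Q)
    a = a * _factor((R * P).sum(axis=1), (I * P).sum(axis=1))
    R = _ratio(Y_hat, I, a[:, None] * P + b[:, None] * Q)
    b = b * _factor((R * Q).sum(axis=1), (I * Q).sum(axis=1))
    return a, b

def deform(Y_mix, Y_hat, I, F1, F2, n_interf=2, n_outer=30, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y_mix.shape
    a, b = np.ones(n_freq), np.ones(n_freq)
    G1 = rng.random((F1.shape[1], n_time)) + EPS
    G2 = rng.random((F2.shape[1], n_time)) + EPS
    H = rng.random((n_freq, n_interf)) + EPS
    U = rng.random((n_interf, n_time)) + EPS
    no_H, no_U = np.zeros((n_freq, 1)), np.zeros((1, n_time))
    # Step 1: initialization on the estimated target only (no interference term).
    for _ in range(n_outer):
        G1, G2 = step2(Y_hat, I, a, b, F1, F2, G1, G2, no_H, no_U)[:2]
        a, b = step3(Y_hat, I, a, b, F1, F2, G1, G2)
    # Steps 2 and 3, alternated: fit the mixture, then refit the target.
    for _ in range(n_outer):
        G1, G2, H, U = step2(Y_mix, I, a, b, F1, F2, G1, G2, H, U)
        a, b = step3(Y_hat, I, a, b, F1, F2, G1, G2)
    return a, b, G1, G2, H, U
```

Because Step 3 always starts from activations that were fitted to the mixture with an explicit interference term, the gains are only refined near a solution in which interference is already explained by `H @ U`.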
  19. 19. Experimental Evaluation: Condition
     Instruments: Oboe (Ob.), Piano (Pf.), Trombone (Tb.)
     Training (MIDI): Garritan Professional Orchestra
     Open target (MIDI): Microsoft GS Wavetable SW Synth
     Sampling freq.: 44100 Hz
     FFT length: 4096 points (100 ms)
     Shift length: 512 points (15 ms)
     Number of bases: Target: 100, Interference: 30
     Truncation period for extraction of attack: 50 ms
     Comparison: conventional methods (SNMF, SNMF-ABD, TID) and the proposed method
     Evaluation score: Signal-to-Distortion Ratio (SDR) [dB] (for evaluating the total quality of the separated signal)
     • Different MIDI generators were used for the training data and the open data.
     • Source separation was performed for 2-sound mixtures using the supervised basis.
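As a quick arithmetic check of the analysis settings, the snippet below converts the frame sizes to durations at 44100 Hz. It suggests that the slide's 100 ms and 15 ms are rounded figures (roughly 92.9 ms and 11.6 ms). The helper names are illustrative only.

```python
FS_HZ = 44100  # sampling frequency from the condition table

def samples_to_ms(n_samples, fs_hz=FS_HZ):
    """Duration of n_samples at sampling rate fs_hz, in milliseconds."""
    return 1000.0 * n_samples / fs_hz

def ms_to_samples(duration_ms, fs_hz=FS_HZ):
    """Number of samples spanned by duration_ms at sampling rate fs_hz."""
    return round(duration_ms * fs_hz / 1000.0)

fft_ms = samples_to_ms(4096)        # about 92.9 ms
shift_ms = samples_to_ms(512)       # about 11.6 ms
attack_samples = ms_to_samples(50)  # 2205 samples in the 50 ms attack period
```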
  20. 20. Music Score Used in Experiment ・Training samples (Oboe, Piano, Trombone): 2-octave chromatic scale ・Open data (mixture; Oboe, Piano, Trombone): test song for NMF research [Kitamura, 2014]
  21. 21. Results 1: Example. Ex.: piano-sound extraction from a mixture of oboe and piano. The proposed method achieves a better SDR than the conventional methods.
  22. 22. Results 2: Overall Evaluation (SDR)
     Task        SNMF [dB]   SNMF-ABD [dB]   TID [dB]   Proposed [dB]
     Ob. & Pf.   6.7         8.1             6.7        7.0
     Ob. & Tb.   2.4         2.6             2.8        2.9
     Pf. & Ob.   4.1         3.6             5.2        6.1
     Pf. & Tb.   3.1         3.2             4.5        4.5
     Tb. & Ob.   0.7         0.2             2.4        2.8
     Tb. & Pf.   2.9         2.6             3.9        4.4
     “A & B” means the task of extracting “A” from a mixture of A and B.
     SNMF-ABD: basis deformation NMF performed in parallel with separation
     TID: time-invariant deformation NMF without considering interference
  23. 23. Results 2: Overall Evaluation (same SDR table as slide 22). The proposed method outperforms SNMF in all combinations and matches or outperforms TID in all combinations (the two tie at 4.5 dB for Pf. & Tb.). SNMF-ABD wins in only one case (Ob. & Pf.) and loses in all the others.
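The comparison can be checked directly from the SDR table on slide 22. The snippet below tallies wins, ties, and losses of the proposed method against each baseline and also computes per-method means. The means are only a rough summary added here for illustration (averaging dB scores is not something the slides report).

```python
methods = ("SNMF", "SNMF-ABD", "TID", "Proposed")

# SDR [dB] per task, in the order of `methods`, copied from the results table.
sdr = {
    "Ob. & Pf.": (6.7, 8.1, 6.7, 7.0),
    "Ob. & Tb.": (2.4, 2.6, 2.8, 2.9),
    "Pf. & Ob.": (4.1, 3.6, 5.2, 6.1),
    "Pf. & Tb.": (3.1, 3.2, 4.5, 4.5),
    "Tb. & Ob.": (0.7, 0.2, 2.4, 2.8),
    "Tb. & Pf.": (2.9, 2.6, 3.9, 4.4),
}

def compare(method):
    """(wins, ties, losses) of Proposed against `method` over all tasks."""
    i = methods.index(method)
    wins = sum(row[3] > row[i] for row in sdr.values())
    ties = sum(row[3] == row[i] for row in sdr.values())
    return wins, ties, len(sdr) - wins - ties

means = {m: sum(row[i] for row in sdr.values()) / len(sdr)
         for i, m in enumerate(methods)}
```

For example, `compare("TID")` counts the single tie on Pf. & Tb., and `compare("SNMF-ABD")` counts the single loss on Ob. & Pf.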
  24. 24. Conclusion • In this study, we propose a new advanced SNMF that includes time-variant (attack and sustain) deformation of the trained basis to make it fit the target sound. • To avoid excessive deformation, we also propose discriminative basis deformation. To solve the resulting bilevel optimization problem, we introduce an approximate algorithm. • The experimental results confirm that the proposed method outperforms the conventional methods in many cases. Thank you for your attention!
