Missing Component Restoration
For Masked Speech Signals
Based On Time-Domain Spectrogram Factorization
Shogo Seki†
Hirokazu Kameoka‡
Tomoki Toda†
Kazuya Takeda†
(†Nagoya University, Japan)
(‡NTT Communication Science Laboratories, Japan)
MLSP 2017
Sep. 28, 14:00-16:00
Lecture Session 6: Special session on new extensions and
applications of non-negative audio modeling
• Time-frequency masking (TF-masking)
  - Estimates a "mask" that passes the TF slots where the target speech dominates
  - Aggressively suppresses noise (pro: powerful in terms of SNR)
  ⇔
  - Results in overly sparse spectrograms
  - Can severely damage acoustical features (con)
  - Degrades downstream systems that use acoustical features,
    e.g. speech recognition, voice conversion, etc.
⇒ Missing component restoration for masked speech
Background
[Spectrograms: masked speech vs. clean speech]
What is missing component restoration?
• Observation (input)
  Complex spectrogram with missing components
• Assumption
  Missing components take zero values
  (notation: frequency index, time index, and the set of unmasked, i.e. observable, components)
• Estimate (output)
  Waveform signal associated with the restored complex spectrogram
[Diagram: an STFT spectrogram (frequency × frame) with missing components is restored to a complete spectrogram.]
Conventional method
• NMF-based spectrogram restoration [Smaragdis+10]
  Algorithm
  1. Restore the magnitude spectrogram of the observation by NMF
  2. Reconstruct phase information [Griffin&Lim84]
  3. Obtain the waveform signal by inverse transformation (e.g. ISTFT)
  - NMF approximates the spectrogram as a low-rank matrix
  (con) Uses too few acoustical clues to achieve accurate restoration
  (con) No guarantee of enhancing speech in the feature domain
[Diagram: magnitude spectrogram ≈ spectral templates (basis) × activations]
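To make this pipeline concrete, below is a minimal sketch of the NMF restoration step followed by Griffin-Lim phase reconstruction. It is not the exact [Smaragdis+10] algorithm; the masked multiplicative KL updates, the function name, and the STFT parameters (chosen to match the 32 ms / 16 ms framing and rank 30 reported later in the deck) are illustrative assumptions.

```python
import numpy as np
import librosa

def restore_with_nmf(X_masked, mask, n_basis=30, n_iter=200, eps=1e-10):
    """Fill the missing TF slots of a masked spectrogram with a low-rank NMF
    model fitted on the observed slots only (KL-divergence multiplicative
    updates). `mask` is 1 where a slot is observed and 0 where it is missing."""
    V = np.abs(X_masked)
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, n_basis)) + eps   # spectral templates (basis)
    H = rng.random((n_basis, T)) + eps   # activations
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((mask * V / WH) @ H.T) / (mask @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ (mask * V / WH)) / (W.T @ mask + eps)
    # Keep observed magnitudes; use the NMF model only where slots are missing
    V_restored = mask * V + (1 - mask) * (W @ H)
    # Griffin-Lim phase reconstruction, then back to a time-domain waveform
    return librosa.griffinlim(V_restored, n_iter=60,
                              hop_length=256, win_length=512)
```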
Proposed ideas
• How to restore missing components and improve features?
  Focus on the following 3 clues:
  1. Low-rank structure assumption on the magnitude spectrogram ← NMF
  +
  2. "Redundancy" of spectrograms
  3. Distribution of clean speech features
What is redundancy of spectrogram?
• Relationship between the waveform signal and its spectrogram
  - Described by a matrix calculation
  - (# of TF slots) > (# of time-domain samples)
  → The spectrogram is a redundant representation of the waveform signal:
    elements of the complex spectrogram are interdependent
[Diagram: the entire signal is split into overlapping frames (frame 1, frame 2, ...); a set of TF basis functions and a reshape relate the time-domain signal to its complex spectrogram.]
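A quick numerical check of this redundancy, using illustrative parameters that match the 32 ms frame / 16 ms shift at 16 kHz reported later in the deck:

```python
import numpy as np
import librosa

# With overlapping frames, the number of TF slots exceeds the number of
# time-domain samples, so the complex spectrogram is a redundant,
# interdependent representation of the waveform.
n_fft, hop = 512, 256                 # 32 ms frame, 16 ms shift at 16 kHz
x = np.random.randn(16000)            # 1 s of audio as a stand-in signal
X = librosa.stft(x, n_fft=n_fft, hop_length=hop)
n_slots = X.shape[0] * X.shape[1]     # (n_fft // 2 + 1) * n_frames
print(f"{len(x)} samples vs. {n_slots} complex TF slots "
      f"({2 * n_slots} real values)")
```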
• TSF: Time-domain Spectrogram Factorization [Kameoka15]
  - Originally proposed as a source separation method
  - Realizes NMF-like signal decomposition in the time domain
  - Directly decomposes the time-domain signal, taking into account
    • the low-rank assumption on the magnitude spectrogram
    • the redundancy of the complex spectrogram
Related works (1/2)
[Diagram: TSF decomposes the time-domain observation into components whose magnitude spectrograms (|STFT|) follow the low-rank model.]
Hirokazu Kameoka, "Multi-resolution signal decomposition with time-domain spectrogram factorization," in Proc. IEEE ICASSP, 2015, pp. 86–90.
Proposed ideas
• How to restore missing components and improve features?
  Focus on the following 3 clues:
  1. Low-rank structure assumption on the magnitude spectrogram ← NMF
  +
  2. "Redundancy" of spectrograms ← TSF
     (interdependency of the elements of the complex spectrogram)
  3. Distribution of clean speech features
• Mapping from the spectrogram to the feature space
  - Masked spectra can deviate far from the distribution of clean features
• How to enhance features
  1. Build a GMM for clean features
  2. Use the GMM log-likelihood as a regularizer that forces the restored
     spectrogram to follow the clean speech distribution in the feature domain
Distribution of clean speech feature
[Diagram: in the feature space, masked speech lies far from the clean speech distribution modeled by the GMM.]
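A minimal sketch of this idea in isolation: fit a GMM to MFCC frames of clean speech and score restored speech with its negative log-likelihood. In the actual method this quantity enters the objective as a regularization term inside the optimization, not as a post-hoc score; the helper names and the use of librosa/scikit-learn are assumptions, while the 30 mixtures and 13 MFCC dimensions follow the experimental conditions later in the deck.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def train_clean_gmm(clean_waveforms, sr=16000, n_mfcc=13, n_mix=30):
    """Fit a GMM to MFCC frames extracted from clean training speech."""
    feats = [librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
             for y in clean_waveforms]
    gmm = GaussianMixture(n_components=n_mix, covariance_type="diag")
    gmm.fit(np.vstack(feats))
    return gmm

def gmm_regularizer(gmm, y_restored, sr=16000, n_mfcc=13):
    """Negative mean GMM log-likelihood of the restored signal's MFCC frames.
    Adding this (weighted) to the objective pulls the restoration toward the
    clean-speech feature distribution."""
    mfcc = librosa.feature.mfcc(y=y_restored, sr=sr, n_mfcc=n_mfcc).T
    return -gmm.score(mfcc)  # score() returns the mean log-likelihood per frame
```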
• CDR: Cepstral Distance Regularization [Li+16]
  - Regularization used in NMF-based semi-supervised speech enhancement
  - Enhances both spectral components and cepstral features
  Objective function (to be minimized) includes:
  • the negative log-likelihood of a GMM for the features of the reconstructed components
  • a GMM trained in advance on speech of the target speaker
Related works (2/2)
Li Li, Hirokazu Kameoka, Takuya Higuchi, and Hiroshi Saruwatari, "Semi-supervised joint enhancement of spectral and cepstral sequences of noisy speech," in Proc. Interspeech, 2016, pp. 3753–3757.
Proposed ideas
• How to restore missing components and improve features?
  Focus on the following 3 clues:
  1. Low-rank structure assumption on the magnitude spectrogram ← NMF
  +
  2. "Redundancy" of spectrograms ← TSF
     (interdependency of the elements of the complex spectrogram)
  3. Distribution of clean speech features ← CDR
     (prior information on the target speech in the feature space)
Formulation overview
[Diagram: the time-domain estimate is linked to the complex spectrogram domain by STFT/ISTFT, to the magnitude spectrogram domain by |·|, and on to the feature domain; the observation enters in the complex spectrogram domain. The four links correspond to terms ①–④ of the objective.]
Formulation
• Objective function (to be minimized)
  - parameters
  - weights
  - error function (KL divergence)
  - Iteratively optimized by the majorization-minimization principle
  - The effect depends on the representation power of the GMM
[Equation annotations ①–④: global structures in the magnitude spectrogram, interdependencies in the complex spectrogram, a term relating two of these representations, and CDR.]
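The equation itself does not survive in this transcript. Purely as a reading aid, the LaTeX below sketches one common way to assemble an objective from the four ingredients named above; it is not the paper's equation, and every symbol ($x$, $S$, $Y$, $W$, $H$, $c(\cdot)$, $\Omega$, $\lambda_i$) is a placeholder introduced here.

```latex
% Schematic only: x = time-domain estimate, S = complex spectrogram variable,
% Y = masked observation, W,H = NMF factors, c(.) = cepstral features,
% Omega = observed TF slots, lambda_i = weights.
\begin{align*}
\mathcal{J}(x, S, W, H)
  ={}& \underbrace{D_{\mathrm{KL}}\!\bigl(|S| \,\big\|\, WH\bigr)}_{\text{low-rank magnitude structure}}
     + \lambda_1 \underbrace{\bigl\|S - \mathrm{STFT}(x)\bigr\|_2^2}_{\text{spectrogram--waveform consistency}} \\
   &+ \lambda_2 \underbrace{\sum_{(k,l)\in\Omega} \bigl|S_{k,l} - Y_{k,l}\bigr|^2}_{\text{fit to observed slots}}
     \;+\; \lambda_3 \underbrace{\bigl(-\log p_{\mathrm{GMM}}\bigl(c(x)\bigr)\bigr)}_{\text{CDR}}
\end{align*}
```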
Experimental evaluation
• Restoration performance for masked spectrograms
• Evaluation dataset construction
  1. Prepare clean speech and noise data
  2. Synthesize noisy speech
  3. Build Ideal Binary Masks (IBMs)
     + a random binary mask (missing rate: 0–90 %)
  4. Make masked spectrograms ← restoration target (sketched below)
• Data
  - Clean speech: 200 utterances from 20 speakers (female and male)
  - Noise data: babble noise
  - (An extra 50 utterances per speaker for GMM training of CDR)
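As referenced in step 4, here is a sketch of the masking procedure under stated assumptions: the slides only specify IBMs combined with a random binary mask at a 0–90 % missing rate, so the IBM criterion (clean magnitude exceeding noise magnitude per TF slot) and the way the two masks are combined are illustrative choices.

```python
import numpy as np
import librosa

def make_masked_spectrogram(clean, noise, missing_rate=0.3,
                            n_fft=512, hop=256, seed=0):
    """Build a masked observation from clean speech and noise: an ideal
    binary mask (IBM) combined with a random binary mask is applied to the
    noisy complex spectrogram, so that masked slots take zero values."""
    noise = noise[:len(clean)]
    S_clean = librosa.stft(clean, n_fft=n_fft, hop_length=hop)
    S_noise = librosa.stft(noise, n_fft=n_fft, hop_length=hop)
    S_noisy = librosa.stft(clean + noise, n_fft=n_fft, hop_length=hop)
    ibm = (np.abs(S_clean) > np.abs(S_noise)).astype(float)  # speech-dominant slots
    rng = np.random.default_rng(seed)
    rand_mask = (rng.random(ibm.shape) > missing_rate).astype(float)
    mask = ibm * rand_mask                  # drop additional slots at random
    return mask * S_noisy, mask             # masked observation and its mask
```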
Experimental condition
Sampling frequency: 16 kHz
Frame size: 32 ms
Shift size: 16 ms
# of bases: 30
# of iterations: 200
CDR settings:
  # of filter bank dimensions: 20
  # of MFCC dimensions: 0-13
  # of GMM mixtures: 30
  # of initializations: 3
Experimental settings
• Comparison (4 patterns)
  - NMF-based method (NMF)
  - TSF-based method w/o or w/ CDR (TSF & TSF w/ Reg.)
  - Direct inversion w/o any restoration (Unprocessed)
• Performance measure
  - MFCC distance between clean speech and restored speech (lower is better)
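The slides give the metric only as "MFCC distance [dB]". Below is a hedged sketch of such a measure in the spirit of mel-cepstral distortion; the exact dimensions, normalization, and dB constant used in the paper are not stated here, so these are assumptions.

```python
import numpy as np
import librosa

def mfcc_distance_db(ref, deg, sr=16000, n_mfcc=13):
    """Frame-averaged cepstral distance in dB between reference (clean) and
    restored speech, following the usual mel-cepstral distortion convention."""
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    c_deg = librosa.feature.mfcc(y=deg, sr=sr, n_mfcc=n_mfcc)
    T = min(c_ref.shape[1], c_deg.shape[1])     # align the two frame sequences
    diff = c_ref[:, :T] - c_deg[:, :T]
    # 10*sqrt(2)/ln(10) converts a Euclidean cepstral distance to dB
    return float(np.mean(10.0 / np.log(10) * np.sqrt(2.0 * np.sum(diff**2, axis=0))))
```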
Experimental results
[Plot: MFCC distance [dB] (0–12, lower is better) versus missing rate [%] (0–90, higher means more suppressed) for Unprocessed, NMF, TSF, and TSF w/ Reg.]
Conclusion
• Proposed a TSF-based restoration focusing on the following 3 clues:
  1. Low-rank structure assumption on the magnitude spectrogram ← NMF
  2. "Redundancy" of spectrograms ← TSF
  3. Distribution of clean speech features ← CDR
• Demonstrated effectiveness in an experimental evaluation
  - TSF-based restorations > NMF-based restoration
• Future work
  - Restoration performance with practical TF-masking methods
  - Weight optimization in the objective function
Thank you for your attention!