Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
Presenter : 何冠勳 61047017s
Date : 2022/09/29
Danilo de Oliveira et al.,
“Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes,”
in INTERSPEECH 2022
Outline
1 Introduction
2 Architecture
3 Experiments
4 Results & Discussion
5 Conclusion
Introduction
What’s the task?
1
Introduction
◉ A neural network-based approach that has shown good results and flourished with the increased
capacity and computational power of modern processors is learned-domain masking-based
speech processing.
(learn latent representations of the audio inputs and perform processing steps on them)
◉ In <GTF>, equivalent results were obtained by replacing the learned encoder with a multi-phase
gammatone analysis filterbank. In <Demy>, the authors show that gains from ConvTasNet can be
attributed to the high time resolution and the time-domain loss.
◉ Learned-domain methods usually work on short frames (2 ms), so they must process far more
frames than methods using traditional STFT frame sizes (∼32 ms).
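To make the frame-count difference concrete, here is a minimal sketch that counts frames for a short-frame and a long-frame front end. The 16 kHz sample rate and the 2 ms / 32 ms frame lengths come from the slides; the 10-second signal length, the hop sizes, and the helper name `num_frames` are assumptions chosen for illustration.

```python
def num_frames(num_samples: int, frame_len: int, hop: int) -> int:
    """Number of full frames that fit in the signal (no padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


SAMPLE_RATE = 16_000                      # Hz
DURATION_S = 10                           # assumed example length in seconds
n_samples = SAMPLE_RATE * DURATION_S      # 160000 samples

# 2 ms frames (32 samples) with a hop of 16 samples.
short_frames = num_frames(n_samples, frame_len=32, hop=16)

# 32 ms frames (512 samples) with 50% and 75% overlap.
long_frames_50 = num_frames(n_samples, frame_len=512, hop=256)
long_frames_75 = num_frames(n_samples, frame_len=512, hop=128)

print(short_frames, long_frames_50, long_frames_75)
```

Under these assumptions the short-frame front end produces roughly 8 to 16 times more frames than the long-frame one, which is the sequence length the masker then has to model.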
Introduction
◉ Dual-path methods have managed to alleviate some issues related to the modeling of long
sequences for speech applications.
◉ Another drawback of learned-domain approaches is that the models usually work with 8 kHz
audio data, which is a considerable disadvantage compared with wideband methods.
◉ Additionally, the learned-encoder features are less interpretable than well-established,
fixed filters such as the STFT.
◉ Therefore, time-frequency representations with larger frames remain desirable,
although working with complex-valued representations brings additional challenges.
Introduction
◉ Multiple studies point to the same phenomenon:
➢ at very short frames (≤ 2 ms), the loss of spectral resolution renders the magnitude less relevant;
➢ at longer frames (∼32 ms), the magnitude is more important than the phase.
◉ In this paper, we investigate what compromises and benefits can be attained when working with
magnitudes of longer frames.
Architecture
- Masking-based approach
- Encoder/Decoder
- Masker
2
Masking-based Approach
Encoder/Decoder
◉ Learned en/decoder pair:
◉ STFT/iSTFT:
where f, l, and n are the indices of frequency bin, frame, and local time, respectively,
W is the window length, and H is the hop size.
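The equation on this slide did not survive extraction, so the following is only a sketch of the textbook STFT magnitude using the same symbols (f, l, n, W, H). The Hann window matches the configuration listed later in the deck; the exact indexing convention, the function names, and the toy sizes are assumptions, and a direct DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def hann_window(W):
    """Periodic Hann window of length W."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / W) for n in range(W)]


def stft_magnitude(x, W, H):
    """Return |X(f, l)| as a list of frames, each holding W // 2 + 1 bins.

    Frame l covers samples l*H .. l*H + W - 1; n is the local time index.
    """
    w = hann_window(W)
    num_bins = W // 2 + 1
    frames = []
    for start in range(0, len(x) - W + 1, H):
        segment = [w[n] * x[start + n] for n in range(W)]
        mags = []
        for f in range(num_bins):
            acc = sum(segment[n] * cmath.exp(-2j * math.pi * f * n / W)
                      for n in range(W))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Toy example: a sinusoid centred exactly on bin 8 of a 64-point window.
W, H = 64, 32
x = [math.sin(2.0 * math.pi * 8 * n / W) for n in range(256)]
mags = stft_magnitude(x, W, H)
peak_bin = max(range(len(mags[0])), key=lambda f: mags[0][f])
print(len(mags), len(mags[0]), peak_bin)
```

Only the magnitudes are returned here, which is the input representation the deck argues for; how the phase is handled at synthesis time is not covered by this sketch.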
Masker
Experiments
- Dataset
- Model
3
Experiments
◉ Dataset:
○ The models were trained on the DNS-Challenge dataset. We generated 100 hours of 4-second
noisy mixtures sampled at 16 kHz, with 20% reserved for validation.
○ Testing was performed on clean samples from a subset of the WSJ0 corpus mixed with noise
from the CHiME3 Challenge dataset, at SNRs ranging from -10 dB to 15 dB in 5 dB steps.
◉ Model:
○ Learned en/decoder: basis 256; kernel size 32 (2 ms at 16 kHz); stride 16
○ STFT: window function Hann; window length 512 (32 ms at 16 kHz); overlap 50% / 75%
○ Masker: chunk size 250; Transformers 4; SepFormer blocks 2; FFW dimension 256
○ Training: optimizer Adam; loss SI-SNR; learning rate 1e-3
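The slide names SI-SNR as the training loss without defining it. Below is a minimal sketch of the commonly used scale-invariant SNR definition; the function name, the `eps` stabiliser, and the toy signals are assumptions, and in training the value would typically be negated so that it can be minimised.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a target signal."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [v - mean_e for v in estimate]      # remove DC offset
    t = [v - mean_t for v in target]

    # Project the estimate onto the target to get the "target" component.
    dot = sum(a * b for a, b in zip(e, t))
    scale = dot / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(v * v for v in s_target)
    den = sum(v * v for v in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


# Toy signals: a sinusoid plus a small disturbance.
target = [math.sin(0.1 * n) for n in range(200)]
estimate = [s + 0.1 * math.cos(0.37 * n) for n, s in enumerate(target)]

score = si_snr(estimate, target)
score_scaled = si_snr([3.0 * v for v in estimate], target)
print(round(score, 2), round(score_scaled, 2))
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes SI-SNR from a plain SNR.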
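The test mixtures described on this slide combine clean speech with noise at fixed SNRs. A minimal sketch of how such a mixture can be built is shown here: the noise is rescaled so that the clean-to-noise power ratio hits the requested SNR. The helper names and the synthetic signals are assumptions for illustration; the slides do not describe the authors' actual mixing script.

```python
import math
import random


def power(signal):
    """Mean power of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean/noise power equals `snr_db`, then add it."""
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [gain * v for v in noise]
    noisy = [c + v for c, v in zip(clean, scaled_noise)]
    return noisy, scaled_noise


# SNR grid from the slide: -10 dB to 15 dB in 5 dB steps.
SNRS_DB = list(range(-10, 16, 5))

rng = random.Random(0)                     # fixed seed for reproducibility
clean = [math.sin(0.05 * n) for n in range(1000)]       # stand-in for speech
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]   # stand-in for noise

measured = {}
for snr_db in SNRS_DB:
    _, scaled = mix_at_snr(clean, noise, snr_db)
    measured[snr_db] = 10.0 * math.log10(power(clean) / power(scaled))
print(measured)
```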
Results & Discussion
- Enhancement
- Execution profiling
4
Enhancement
◉ The estimated utterances were evaluated on instrumental perceptual metrics:
- POLQA for speech quality
- ESTOI for intelligibility.
Enhancement
◉ In the learned-domain case, the chunk size of 250 as in [16] performs best against a setup with
shorter chunks, hinting at the importance of modeling short-term relations in the sequence.
◉ The configuration with chunk size 50 seems to strike a balance between short- and long-term
modeling, compared with the models using chunk sizes 25 and 100.
◉ The learned-domain method produces estimates containing a buzzing sound that is absent from
the STFT-magnitude outputs.
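The chunk sizes discussed above refer to how a dual-path model splits the frame sequence before applying attention within and across chunks. The sketch below shows one common way to do that segmentation; the 50% chunk overlap, the zero padding, the function name `segment`, and the frame counts are assumptions, since the slides give only the chunk sizes.

```python
import math


def segment(frames, chunk_size, hop=None):
    """Split a frame sequence into overlapping chunks, zero-padding the tail."""
    if hop is None:
        hop = chunk_size // 2              # assumed 50% overlap between chunks
    n = len(frames)
    if n <= chunk_size:
        num_chunks = 1
    else:
        num_chunks = 1 + math.ceil((n - chunk_size) / hop)
    padded_len = chunk_size + (num_chunks - 1) * hop
    padded = list(frames) + [0] * (padded_len - n)
    return [padded[i * hop: i * hop + chunk_size] for i in range(num_chunks)]


# Illustrative sequence lengths: many short frames vs. fewer long frames.
learned_chunks = segment(list(range(9999)), chunk_size=250)
stft_chunks = segment(list(range(1247)), chunk_size=50)

# Intra-chunk attention sees `chunk_size` items; inter-chunk attention sees
# one item per chunk, i.e. len(chunks) items.
print(len(learned_chunks), len(stft_chunks))
```

With shorter sequences, smaller chunks still leave a manageable number of chunks for the inter-chunk stage, which is one way to read the balance between short- and long-term modeling mentioned above.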
<Audio Samples>
Execution profiling
Conclusions
5
Conclusions
◉ We obtained equivalent quality and intelligibility evaluation scores while reducing the number of
operations by a factor of approximately 8 for a 10-second utterance.
◉ Motivated by previous contributions on learned and traditional filterbanks and on the relation
between frame size and magnitude/phase processing, we show that by replacing the learned
features with STFT magnitudes, we can obtain equivalent performance in terms of
perceptually-motivated metrics while considerably reducing resource allocation and processing time.
◉ These findings are a big step towards making the implementation of state-of-the-art
transformer-based speech enhancement systems possible in real-life applications, especially on
embedded devices.
Bi-Sep:
A Multi-Resolution Cross-Domain
Monaural Speech Separation Framework
Presenter : 何冠勳 61047017s
Date : 2022/09/29
Kuan-Hsun Ho et al.,
“Bi-Sep: A Multi-Resolution Cross-Domain Monaural Speech Separation Framework,”
in TAAI 2022
Brief
Any questions?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!
