Efficient Transformer-based
Speech Enhancement
Using Long Frames and STFT
Magnitudes
Presenter : 何冠勳 61047017s
Date : 2022/09/29
Danilo de Oliveira et al.,
“Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes,”
in INTERSPEECH 2022
Outline
1. Introduction
2. Architecture
3. Experiments
4. Results & Discussion
5. Conclusion
Introduction
What’s the task?
Introduction
◉ Learned-domain, masking-based speech processing is a neural network-based approach that has
shown good results and has flourished with the increased capacity and computational power of
modern processors.
(These models learn latent representations of the audio input and process the signal in that domain.)
◉ In <GTF>, equivalent results were obtained by replacing the learned encoder with a multi-phase
gammatone analysis filterbank. In <Demy>, the authors show that gains from ConvTasNet can be
attributed to the high time resolution and the time-domain loss.
◉ Learned-domain methods usually work on short frames (2 ms), so they must handle many more
frames than methods that use traditional STFT frame sizes (∼32 ms).
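To make the frame-count difference concrete, here is a minimal sketch that counts frames for both frame sizes. The 16 kHz sample rate, the 10-second utterance, the 50% overlap, and the helper name `num_frames` are assumptions chosen for illustration.

```python
def num_frames(num_samples: int, frame_len: int, hop: int) -> int:
    """Count the full frames of frame_len samples taken every hop samples (no padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


SAMPLE_RATE = 16_000            # Hz (assumed)
num_samples = 10 * SAMPLE_RATE  # hypothetical 10-second utterance

# 2 ms frames are 32 samples at 16 kHz; 32 ms frames are 512 samples.
short = num_frames(num_samples, frame_len=32, hop=16)
long_ = num_frames(num_samples, frame_len=512, hop=256)

print(short, long_, round(short / long_, 1))
```

Under these assumptions the short-frame front end yields roughly 16 times as many frames as the long-frame one, which is the sequence length the masker then has to model.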
Introduction
◉ Dual-path methods have managed to alleviate some issues related to the modeling of long
sequences for speech applications.
◉ Another drawback of learned-domain approaches is that the models usually work with 8 kHz
audio data, which is a considerable disadvantage compared with wideband methods.
◉ Additionally, the learned-encoder features are less interpretable than well-established,
fixed filters such as the STFT.
◉ Therefore, time-frequency representations with longer frames remain desirable, although
working with complex representations brings additional challenges.
Introduction
◉ Multiple studies have observed the following:
➢ at very short frames (≤ 2 ms), the loss of spectral resolution renders the magnitude less relevant;
➢ for longer frames (~32 ms), the magnitude is more important than the phase.
◉ In this paper, we investigate what compromises and benefits can be attained when working with
magnitudes of longer frames.
Architecture
- Masking-based approach
- Encoder/Decoder
- Masker
Masking-based Approach
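In a masking-based system, an encoder produces a representation of the noisy signal, a masker estimates a mask, the mask is multiplied element-wise with the representation, and a decoder returns to the time domain. The sketch below shows only the masking step for the magnitude-STFT case. Reusing the noisy phase for reconstruction is an assumption here, and `apply_magnitude_mask` and the toy values are hypothetical.

```python
import cmath


def apply_magnitude_mask(noisy_spec, mask):
    """Scale each bin's magnitude by its mask value and keep the noisy phase."""
    enhanced = []
    for spec_frame, mask_frame in zip(noisy_spec, mask):
        frame = []
        for x, m in zip(spec_frame, mask_frame):
            magnitude, phase = abs(x), cmath.phase(x)
            # Recombine the masked magnitude with the unmodified phase.
            frame.append(cmath.rect(m * magnitude, phase))
        enhanced.append(frame)
    return enhanced


# Toy complex spectrogram (2 frames x 2 bins) and a mask with values in [0, 1].
noisy_spec = [[1 + 1j, 0.5j], [2 + 0j, -1 - 1j]]
mask = [[1.0, 0.0], [0.5, 0.25]]
enhanced = apply_magnitude_mask(noisy_spec, mask)
print(enhanced)
```

Because the phase is left untouched, the masker only has to predict real-valued gains per time-frequency bin.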
Encoder/Decoder
◉ Learned en/decoder pair:
◉ STFT/iSTFT:
where f , l, and n are the indices of frequency bin,
frame, and local time, respectively, W is the window
length, and H is the hop size.
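A common textbook form of the STFT that is consistent with these symbols is given below. It is an assumption rather than the slide's exact notation: w denotes the analysis window and x the time-domain signal, neither of which is named in the text above.

```latex
X(f, l) = \sum_{n=0}^{W-1} w(n)\, x(n + lH)\, e^{-j 2\pi f n / W},
\qquad f = 0, \dots, W-1
```

With this convention, a hop size of H = W/2 corresponds to 50% overlap and H = W/4 to 75% overlap.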
Masker
Experiments
- Dataset
- Model
Experiments
◉ Dataset:
○ The models were trained on the DNS-Challenge dataset. We generated 100 hours of 4-second-long
noisy mixtures sampled at 16 kHz, with 20% reserved for validation.
○ Testing was performed on clean samples from a subset of the WSJ0 corpus mixed with noise
from the CHiME3 Challenge dataset, at SNRs ranging from -10 dB to 15 dB in 5 dB steps.
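Mixing at a fixed SNR means scaling the noise so that the clean-to-noise power ratio hits the target before adding the two signals. The following is a minimal sketch of that step, not the authors' data pipeline. The synthetic signals and the helper names `power` and `mix_at_snr` are hypothetical.

```python
import math


def power(signal):
    """Mean power of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that power(clean) / power(scaled noise) equals snr_db, then add."""
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [gain * v for v in noise]
    mixture = [s + v for s, v in zip(clean, scaled_noise)]
    return mixture, scaled_noise


# Synthetic stand-ins for a clean utterance and a noise recording.
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise = [math.sin(0.9 * t) * math.cos(3.7 * t) for t in range(1600)]

SNRS_DB = list(range(-10, 16, 5))  # -10, -5, 0, 5, 10, 15 dB
measured = []
for snr_db in SNRS_DB:
    _, scaled_noise = mix_at_snr(clean, noise, snr_db)
    measured.append(10 * math.log10(power(clean) / power(scaled_noise)))

print(SNRS_DB)
print([round(m, 3) for m in measured])
```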
◉ Model:
○ Learned en/decoder: basis 256; kernel size 32 (2 ms at 16 kHz); stride 16
○ STFT: window function Hann; window length 512 (32 ms at 16 kHz); overlap 50% / 75%
○ Masker: chunk size 250; Transformers 4; SepFormer blocks 2; FFW dimension 256
○ Training: optimizer Adam; loss SI-SNR; learning rate 1e-3
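The SI-SNR loss projects the estimate onto the target so that rescaling the estimate does not change the score. Below is a minimal sketch of the usual definition. Removing the mean first and the `eps` stabilizer are common practice assumed here, and during training the negative of this value would be minimized.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a target signal."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the target-aligned component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(v * v for v in e_noise) + eps
    return 10 * math.log10(num / den + eps)


# Toy signals: a sinusoidal target and two estimates with different error levels.
target = [math.sin(0.1 * i) for i in range(200)]
error = [math.cos(0.37 * i) for i in range(200)]
far = [t + 0.1 * v for t, v in zip(target, error)]
close = [t + 0.01 * v for t, v in zip(target, error)]

print(round(si_snr(far, target), 2), round(si_snr(close, target), 2))
```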
Results & Discussion
- Enhancement
- Execution profiling
Enhancement
◉ The estimated utterances were evaluated with instrumental perceptual metrics:
- POLQA for speech quality
- ESTOI for intelligibility
Enhancement
◉ In the learned-domain case, the chunk size of 250 as in [16] performs better than setups with
shorter chunks, hinting at the importance of modeling short-term relations in the sequence.
◉ The configuration with a chunk size of 50 seems to strike a balance between short- and long-term
modeling, compared with the models using chunk sizes of 25 and 100.
◉ The learned-domain method produces estimates containing a buzzing sound that is absent from
the magnitude-STFT outputs.
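In a dual-path model the frame sequence is cut into chunks: attention within a chunk covers short-term relations, and attention across chunks covers long-term ones. The sketch below shows how the chunk size trades one against the other. The 50% chunk overlap, the zero padding, the 624-frame input, and the name `split_into_chunks` are assumptions for illustration.

```python
import math


def split_into_chunks(frames, chunk_size):
    """Split a frame sequence into 50%-overlapping chunks, zero-padding the tail."""
    hop = chunk_size // 2
    n = len(frames)
    num_chunks = max(1, math.ceil((n - chunk_size) / hop) + 1)
    padded_len = (num_chunks - 1) * hop + chunk_size
    padded = list(frames) + [0] * (padded_len - n)
    return [padded[i * hop : i * hop + chunk_size] for i in range(num_chunks)]


# Hypothetical sequence of 624 frame indices (about 10 s of 32 ms frames at 50% overlap).
frames = list(range(624))
chunk_counts = {}
for chunk_size in (25, 50, 100):
    chunks = split_into_chunks(frames, chunk_size)
    chunk_counts[chunk_size] = len(chunks)
    print(f"chunk size {chunk_size}: {len(chunks)} chunks")
```

Larger chunks lengthen the intra-chunk sequence and shorten the inter-chunk one, so the two attention stages see different context lengths for the same input.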
<Audio Samples>
Execution profiling
Conclusions
Conclusions
◉ We obtained equivalent quality and intelligibility evaluation scores while reducing the number of
operations by a factor of approximately 8 for a 10-second utterance.
◉ Motivated by previous contributions on learned and traditional filterbanks and on the relation
between frame size and magnitude/phase processing, we show that replacing the learned
features with STFT magnitudes yields equivalent performance in terms of perceptually motivated
metrics while considerably reducing resource allocation and processing time.
◉ These findings are a big step towards making the implementation of state-of-the-art
transformer-based speech enhancement systems possible in real-life applications, especially on
embedded devices.
Bi-Sep:
A Multi-Resolution Cross-Domain
Monaural Speech Separation Framework
Presenter : 何冠勳 61047017s
Date : 2022/09/29
Kuan-Hsun Ho et al.,
“Bi-Sep: A Multi-Resolution Cross-Domain Monaural Speech Separation Framework,”
in TAAI 2022
Brief
Any questions?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!