Speech Separation under
Reverberant Condition
Presenter : 何冠勳 61047017s
Date : 2022/08/31
1: Tobias Cord-Landwehr et al., “Monaural Source Separation: from Anechoic to Reverberant Environments”, IWAENC 2022
2: Cem Subakan et al., “On Using Transformers for Speech-Separation”, IEEE Signal Processing Letters
3: Jens Heitkaemper et al., “Demystifying TasNet: a Dissecting Approach”, ICASSP 2020
Outline
2
1. Introduction
2. Prior Knowledge
3. Architecture
4. Experiments
5. Conclusions
Introduction
What is Speech Separation? Why? How?
1
3
What is Speech Separation?
See pages 1 ~ 5 in the slides made by Professor HUNG-YI LEE.
4
<Speech Separation - HUNG-YI LEE>
History
◉ Starting with the seminal papers on deep clustering and permutation invariant training (PIT),
improvements have been achieved by combining the two in a multi-objective training criterion,
or replacing the STFT with a learnable encoder and decoder, e.g. Conv-TasNet,
or accounting for short- and longer-term correlations in the signal, e.g. DPRNN,
or conditioning on simultaneously computed speaker centroids, e.g. Wavesplit.
◉ Overall, this has led to an improvement in SI-SDR from roughly 10 dB to over 20 dB on the standard
WSJ0-2mix dataset, which consists of artificial mixtures of anechoic speech.
anechoic (adj.): free from echoes and reverberations
5
6
[Figure: SI-SDR progress on the WSJ0-2mix benchmark over time, highlighting SepFormer, DPTNet and Sudormrf.]
Recommended article: <WSJ0-2mix Benchmark (Speech Separation) | Papers With Code>
7
Datasets
◉ Common datasets include:
➢ anechoic: WSJ0mix, LibriMix
➢ noisy: WHAM!, LibriMix
➢ reverberant: SMS-WSJ
➢ noisy & reverberant: WHAMR!
◉ There are also data for the speech enhancement task in WHAM!, WHAMR! and LibriMix.
◉ Besides utterance-level mixtures, there are also continuous speech separation corpora, such as LibriCSS.
8
Non-anechoic
◉ However, an anechoic environment is a rather unrealistic assumption as in a real-world scenario, the
superposition of the speech of two or more speakers typically occurs in a distant microphone setting.
◉ In particular, reverberation has been considered more challenging than noise.
◉ WHAMR! and SMS-WSJ are two widely used datasets for research on source separation for
reverberant mixtures. Both contain artificially reverberated utterances from the WSJ corpus.
◉ Source separation performance is 2 ~ 8 dB on WHAMR! and 5 ~ 6 dB on SMS-WSJ for single-channel
input and single-stage processing, which is much worse than the performance on anechoic mixtures.
9
Non-anechoic
◉ In [1], they aim to explore which of the recent innovations that proved useful for the separation of
anechoic mixtures are also beneficial in the reverberant case, in order to propose some guidelines on
how to adjust a separation system to reverberated input.
◉ They take the SepFormer architecture, which achieves state-of-the-art performance on both
WSJ0-2mix and WHAMR!, and the traditional PIT-BLSTM model, and evaluate both models on the
SMS-WSJ dataset.
◉ By modifying and optimizing each model w.r.t. loss function, encoder/decoder architecture,
resolution and representation, they adapt the models to mitigate the performance degradation between
the anechoic and reverberant scenarios.
10
Non-anechoic
◉ In [2], they expand the study by providing additional experiments and insights on more realistic and
challenging datasets such as Libri2/3-Mix, which include longer mixtures, WHAM! and WHAMR!,
featuring noisy and noisy & reverberant conditions respectively.
◉ Moreover, on WHAM! and on WHAMR! datasets they also provide results for speech enhancement.
◉ Another contribution of [2] is investigating different types of self-attention for speech separation with
and without the dual-path mechanism. Namely, they compare the vanilla Transformer with
Longformer, Linformer, and Reformer.
11
Reverberation
12
<Reverberation-Wikipedia> <測量殘響時間 RT60>
WHAMR!: RT60 ∈ U(200 ms, 500 ms)
SMS-WSJ: RT60 ∈ U(100 ms, 1000 ms)
Objectives
13
◉ In this presentation, we expect to obtain:
✓ separation performance on WHAM!, WHAMR!, LibriMix and SMS-WSJ,[1][2]
✓ enhancement performance on WHAM! and WHAMR!, [2]
✓ modifications that pay off when adapting to the reverberant condition,[1]
✓ analysis on those modifications[1]
and
✓ a comparison between Transformer variants. [2]
Prior Knowledge
- Basic architecture
- Metrics
- Permutation invariant training
- Transformer
- Time Complexity
- Variety
2
14
Basic Architecture
◉ The encoder module transforms short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for each
source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
using a decoder module.
15
conv1d transpose-conv1d
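To make this pipeline concrete, here is a minimal PyTorch sketch of the encoder/masker/decoder structure described above (a Conv-TasNet-style skeleton; the layer sizes and the trivial 1x1-conv masker are illustrative placeholders, not any paper's exact configuration):

```python
import torch
import torch.nn as nn

class MaskingSeparator(nn.Module):
    """Minimal encoder/masker/decoder skeleton (sketch, sizes illustrative)."""
    def __init__(self, n_basis=256, kernel=16, n_spk=2):
        super().__init__()
        self.n_spk = n_spk
        # Encoder: short waveform segments -> frames in a latent feature space.
        self.encoder = nn.Conv1d(1, n_basis, kernel, stride=kernel // 2, bias=False)
        # Masker: stand-in for the separation network (BLSTM, SepFormer, ...).
        self.masker = nn.Sequential(nn.Conv1d(n_basis, n_basis * n_spk, 1),
                                    nn.Sigmoid())
        # Decoder: masked latent frames -> waveform, one pass per speaker.
        self.decoder = nn.ConvTranspose1d(n_basis, 1, kernel,
                                          stride=kernel // 2, bias=False)

    def forward(self, mix):                              # mix: (batch, samples)
        w = torch.relu(self.encoder(mix.unsqueeze(1)))   # (batch, N, frames)
        masks = self.masker(w).chunk(self.n_spk, dim=1)  # n_spk x (batch, N, frames)
        # Reconstruct each source from its masked representation.
        return [self.decoder(w * m).squeeze(1) for m in masks]
```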
Dual-path Mechanism
16
Process intra-chunk
dependencies
Process inter-chunk
dependencies
Metrics
See pages 6 ~ 9 in the slides made by Professor HUNG-YI LEE.
17
<Speech Separation - HUNG-YI LEE>
PIT
See pages 10 ~ 11 in the slides made by Professor HUNG-YI LEE.
18
<Speech Separation - HUNG-YI LEE>
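As a concrete illustration, a minimal sketch of the PIT objective (brute force over speaker permutations, which is fine for the usual 2 ~ 3 speakers; `pair_loss` is any pairwise loss such as negative SI-SNR that returns one value per utterance):

```python
import itertools
import torch

def pit_loss(est, ref, pair_loss):
    """Permutation invariant training objective (sketch).
    est, ref: (batch, n_spk, samples); pair_loss maps two (batch, samples)
    tensors to a per-utterance loss of shape (batch,)."""
    n_spk = est.shape[1]
    per_perm = []
    for perm in itertools.permutations(range(n_spk)):
        # Average the pairwise losses under this speaker assignment.
        losses = [pair_loss(est[:, i], ref[:, j]) for i, j in enumerate(perm)]
        per_perm.append(torch.stack(losses, dim=0).mean(dim=0))   # (batch,)
    # For every utterance, keep the best (lowest-loss) permutation.
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()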
◉ Transformers enable direct and accurate context modeling of
long-term dependencies, which makes them suitable for
audio processing, especially for speech separation, where
long-term modeling has been shown to impact performance
significantly.
◉ However, to avoid confusion, the Transformer in this paper
refers specifically to the encoder part, and it comprises
three core modules: scaled dot-product attention, multi-head
attention and a position-wise feed-forward network.
◉ 【機器學習2021】Transformer (上)、
【機器學習2021】Transformer (下)、
Attention 及 Transformer 架構理解
19
Transformer
20
Transformer
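For reference, a textbook sketch of the scaled dot-product attention at the core of these modules (shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    q, k, v: (..., time, d_k); mask: True where attention is disallowed."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```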
Time Complexity
21
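For reference, the standard complexity results for sequence length T:
➢ vanilla self-attention: O(T²) time and memory;
➢ Longformer (sliding window of width w): O(T·w);
➢ Linformer (projection rank k): O(T·k);
➢ Reformer (LSH attention): roughly O(T log T).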
Variety - Longformer
22
<Longformer: The Long-Document Transformer>
<Longformer論文筆記-知乎 >
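A minimal sketch of the core idea, reusing the attention helper above: restrict each position to a local window. (The real Longformer also adds global tokens and a banded implementation so that the compute itself is O(T·w); the naive mask below still materializes T×T scores.)

```python
import torch

def sliding_window_mask(T, w):
    # True marks pairs farther than w positions apart; these are masked out.
    idx = torch.arange(T)
    return (idx[None, :] - idx[:, None]).abs() > w

# Usage with the scaled_dot_product_attention sketch above (illustrative):
# mask = sliding_window_mask(q.shape[-2], w=128)
# out = scaled_dot_product_attention(q, k, v, mask=mask)
```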
Variety - Linformer
23
<Linformer: Self-Attention with Linear Complexity>
<Linformer論文筆記-知乎 >
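A single-head sketch of Linformer's trick, with illustrative sizes: keys and values are linearly projected from length T down to a fixed rank k before attention, so the score matrix is T×k instead of T×T. Note that the learned projection fixes the maximum sequence length.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head Linformer self-attention (sketch; sizes illustrative)."""
    def __init__(self, d_model=256, seq_len=1000, k=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        # Shared projection E compresses the time axis from seq_len to k.
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.scale = d_model ** -0.5

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=-1)
        k, v = self.E @ k, self.E @ v        # (batch, k, d): compressed length
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                      # (batch, seq_len, d_model)
```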
24
Variety - Reformer
<Reformer: The Efficient Transformer>
<圖解Reformer>
Architecture
- SepFormer[2]
- PIT-BLSTM[1]
- Comparison[1]
3
25
SepFormer
26
SepFormer
◉ For mixture x ∈ ℝ^T, the encoder learns an STFT-like representation: h = ReLU(Conv1D(x)) ∈ ℝ^{F×T'}.
◉ The masking network is fed by h and estimates masks m_1, …, m_{N_s} for each of the N_s
speakers.
1. Linear + Chunking (segmentation + overlap) : h' = Chunk(Linear(h)) ∈ ℝ^{F×C×N_c}
2. SepFormer block (intra-Transformer + inter-Transformer) : h'' = SepFormer(h')
3. Linear + PReLU : h''' = Linear(PReLU(h'')) ∈ ℝ^{(F·N_s)×C×N_c}
27
SepFormer
4. Overlap-add : OverlapAdd(h''') ∈ ℝ^{F×N_s×T'}
5. FFW + PReLU : m_1, …, m_{N_s} = PReLU(FFW(OverlapAdd(h''')))
◉ The input to the decoder is the element-wise multiplication between each mask and the initial
representation: ŝ_k = Conv1DTranspose(m_k ⊙ h).
28
SepFormer Block
◉ The Intra-Transformer processes the second dimension of h', and thus acts on each chunk
independently, modeling the short-term dependencies within each chunk.
◉ Next, we permute the last two dimensions (denoted as P(·)), and the Inter-Transformer is
applied to model the transitions across chunks.
◉ Overall transformation : h'' = P(f_inter(P(f_intra(h'))))
◉ The above constitutes one SepFormer block, and N such blocks can be stacked.
29
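A minimal sketch of this intra/inter alternation (illustrated on the following slide), assuming `intra_block` and `inter_block` are any sequence models that map (batch', seq, d) to the same shape, e.g. Transformer encoders:

```python
import torch

def dual_path_pass(h, intra_block, inter_block):
    """One dual-path pass (sketch). h: (batch, n_chunks, chunk_len, d)."""
    B, Nc, C, D = h.shape
    # Intra: short-term dependencies within each chunk.
    h = intra_block(h.reshape(B * Nc, C, D)).reshape(B, Nc, C, D)
    # Permute so chunks become the sequence axis, then model long-term
    # transitions across chunks (inter).
    h = h.permute(0, 2, 1, 3)                                  # (B, C, Nc, D)
    h = inter_block(h.reshape(B * C, Nc, D)).reshape(B, C, Nc, D)
    return h.permute(0, 2, 1, 3)                               # (B, Nc, C, D)
```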
Dual-path mechanism
30
Process intra-chunk
dependencies
Process inter-chunk
dependencies
SepFormer Block
◉ Transformer procedure (Pre-LN setting):
1. Positional encoding : z = x + PE(x)
2. Layer norm + Multi-head attention + residual : z' = z + MHA(LN(z))
3. Layer norm + Feed forward + residual : z'' = z' + FFW(LN(z'))
4. Repeat for the number of stacked layers
31
<On Layer Normalization in the Transformer Architecture>
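A minimal PyTorch sketch of one such Pre-LN layer (dimensions are illustrative; SpeechBrain's actual implementation differs in details):

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Minimal Pre-LN Transformer encoder layer (sketch)."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                         batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                 # x: (batch, time, d_model)
        h = self.ln1(x)                   # normalize BEFORE attention (Pre-LN)
        x = x + self.mha(h, h, h, need_weights=False)[0]  # attention + residual
        x = x + self.ffw(self.ln2(x))                     # feed-forward + residual
        return x
```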
SepFormer
◉ Parameters:
○ Encoder basis 256 ; Kernel size 16 ; Chunk size 250 ;
○ Transformers 8 ; SepFormer blocks 2 ; Attention heads 8 ; Dimension 1024 ;
○ Optimizer Adam ; Loss negative SI-SNR ; Learning rate 1.5e-4 ; Batch size 1 ; Epochs 200 .
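For reference, a standard formulation of the negative SI-SNR loss named in the parameters above (a sketch, not SpeechBrain's exact code):

```python
import torch

def negative_si_snr(est, ref, eps=1e-8):
    """Negative scale-invariant SNR. est, ref: (..., samples)."""
    ref = ref - ref.mean(dim=-1, keepdim=True)      # zero-mean both signals
    est = est - est.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target).
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (
        (ref ** 2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        (s_target ** 2).sum(-1) / ((e_noise ** 2).sum(-1) + eps) + eps)
    return -si_snr.mean()
```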
◉ They explored dynamic mixing data augmentation, which consists of the on-the-fly creation of
new mixtures from single-speaker sources, along with speed perturbation in [95%, 105%].
◉ The training process also applies learning-rate halving, gradient clipping and mixed precision.
Mixed-precision training is a technique for substantially reducing neural-net training time by
performing as many operations as possible in fp16 instead of fp32.
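A minimal sketch of such a mixed-precision loop with PyTorch AMP, including the gradient clipping mentioned above (`model`, `optimizer` and `loader` are placeholders; `negative_si_snr` reuses the sketch above, assuming the model returns stacked (batch, n_spk, samples) estimates):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
for mix, refs in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass largely in fp16
        loss = negative_si_snr(model(mix), refs)
    scaler.scale(loss).backward()              # scale loss against fp16 underflow
    scaler.unscale_(optimizer)                 # restore fp32 grads for clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)
    scaler.update()
```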
32
SepFormer
◉ When using dynamic mixing, SepFormer
achieves state-of-the-art performance.
◉ Without dynamic mixing, SepFormer
outperforms all previous systems
except Wavesplit, which uses speaker
identity as additional information.
33
SepFormer
◉ A respectable performance of 19.2 dB is obtained
even with a single-layer Transformer for
the Inter-Transformer. This suggests that the
Intra-Transformer, and thus local processing, has
a greater influence on the performance.
◉ It also emerges that positional encoding is helpful.
A similar outcome has been observed in T-gsa for
speech enhancement.
◉ Finally, it can be observed that dynamic mixing
helps the performance drastically.
34
SepFormer
❗ However, my experience is that, on the contrary, the memory consumption of SepFormer
during training is so large that it cannot fit on one 24 GB card.
❗ As a result, the authors could only train with a batch size of one on a 32 GB card.
35
PIT-BLSTM
36
STFT iSTFT
Masking net 3 BLSTM + 2 FC ;
Hidden units 600 ;
Window size 512 ;
Frame size 128 ;
Feature dim 257 ;
Loss MSE (on magnitude).
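A minimal sketch of this STFT-domain baseline with the sizes from the slide (the exact BLSTM/FC wiring and the Hann window are assumptions; [1] may differ in details):

```python
import torch
import torch.nn as nn

class PITBLSTM(nn.Module):
    """STFT-masking baseline sketch: 3 BLSTM + 2 FC estimate magnitude masks
    on 257-dim features; window 512, shift 128, trained with PIT + MSE."""
    def __init__(self, n_fft=512, hop=128, hidden=600, n_spk=2):
        super().__init__()
        self.n_fft, self.hop, self.n_spk = n_fft, hop, n_spk
        feat = n_fft // 2 + 1                       # 257 feature bins
        self.register_buffer("win", torch.hann_window(n_fft))
        self.blstm = nn.LSTM(feat, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, 2 * hidden), nn.ReLU(),
            nn.Linear(2 * hidden, feat * n_spk), nn.Sigmoid())

    def forward(self, mix):                         # mix: (batch, samples)
        spec = torch.stft(mix, self.n_fft, self.hop, window=self.win,
                          return_complex=True)      # (batch, 257, frames)
        mag = spec.abs().transpose(1, 2)            # (batch, frames, 257)
        masks = self.fc(self.blstm(mag)[0])
        masks = masks.view(mag.shape[0], mag.shape[1], self.n_spk, -1)
        # Mask the complex mixture STFT per speaker, then invert with iSTFT.
        return [torch.istft(spec * masks[:, :, k].transpose(1, 2),
                            self.n_fft, self.hop, window=self.win,
                            length=mix.shape[-1])
                for k in range(self.n_spk)]
```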
PIT-BLSTM
◉ In [1], SepFormer is trained with a soft-thresholded time-domain negative SDR:
L = −10 log10( ‖s‖² / ( ‖ŝ − s‖² + τ ‖s‖² ) ), with τ = 10^(−SDRmax/10)
❓ This loss decreases the contribution of well-separated examples to the gradient, encouraging
the model to focus more on enhancing examples with a low SDR than on those that already
show a good separation.
(I argue that such modification is mainly for stability, e.g. when the distance between estimation
and ground truth is close enough, like the evaluation in Sudormrf.)
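A sketch of this soft-thresholded loss form, assuming the common parameterization τ = 10^(−SDRmax/10) (the threshold value below is an illustrative default, not necessarily the one used in [1]):

```python
import torch

def soft_thresholded_neg_sdr(est, ref, sdr_max=30.0, eps=1e-8):
    """Soft-thresholded time-domain negative SDR (sketch). The tau term caps
    the achievable SDR, so well-separated examples contribute little gradient."""
    tau = 10.0 ** (-sdr_max / 10.0)
    num = (ref ** 2).sum(-1)
    den = ((est - ref) ** 2).sum(-1) + tau * num + eps
    return -(10 * torch.log10(num / den + eps)).mean()
```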
37
38
Schematic diagram
Comparison
◉ In [1], SepFormer (small) reduces the original number of
intra- and inter-Transformer layers to 4 each. This
modification yields about 1 dB lower SDR on
WSJ0-2mix, but significantly reduces the number of
parameters.
39
❗ Note that the memory footprint of SepFormer (small) is still 16 times larger than that of the PIT-BLSTM, so a
complexity comparison based purely on the parameter count is not fair.
Experiments
- Non-anechoic[1][2]
- Optimization[1]
- Summary[1]
- My perspective
- Variety[2]
4
40
LibriMix
41
◉ LibriMix is proposed as an alternative to WSJ0-2/3Mix with more diverse and longer utterances.
◉ The network is trained to perform separation and de-noising at the same time.
WHAM! & WHAMR!
◉ In WHAM! dataset, the model learns to de-noise while
also separating, and in WHAMR! dataset, the model learns
to de-noise and de-reverberate in addition to separating.
◉ SepFormer outperforms the previously proposed methods
even without the use of DM, which further improves the
performance.
42
Speech Enhancement
◉ In addition to training SepFormer, we also trained
Bidirectional LSTM, CNNTransformer and CFDN models
from SpeechBrain, which are trained to minimize the MSE
between the estimated magnitude spectrogram and the
spectrogram of the clean signal.
43
<The SpeechBrain Toolkit>
SMS-WSJ
◉ It can be seen that both systems degrade under the presence of reverberation. However, the
separation performance of the SepFormer degrades by more than 10 dB in terms of SDR and
almost 30 percentage points regarding the WER.
(WERs are evaluated with the Kaldi baseline provided in SMS-WSJ.)
44
Optimization
◉ Apart from the discrepancy in the masking-net structure, there are also other differences
between the two models, namely the loss function, encoder/decoder architecture and resolution.
◉ The choice of representation, e.g. magnitude, phase or both, is also worth exploring.
◉ But first, they check whether the degradation is on account of artifacts. The mitigation is
to add white Gaussian noise at an SNR of 25 dB to the separated audio
files before they are input to the speech recognizer.
45
<White Gaussian Noise>
<How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR>
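A minimal sketch of mixing white Gaussian noise into a signal at a target SNR (names are illustrative):

```python
import torch

def add_noise_at_snr(signal, snr_db=25.0):
    """Add white Gaussian noise at the given SNR. signal: (..., samples)."""
    noise = torch.randn_like(signal)
    p_sig = (signal ** 2).mean(dim=-1, keepdim=True)
    p_noise = (noise ** 2).mean(dim=-1, keepdim=True)
    # Scale the noise so that 10*log10(p_sig / p_scaled_noise) == snr_db.
    scale = torch.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise
```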
Optim.
◉ Even more so, these modifications work well both with and without reverberation and lead to a
reduction in WER of more than 20 percentage points for both scenarios.
◉ As for SepFormer, it is striking that it is no longer superior to PIT-BLSTM for reverberant data.
46
Optim. - En/Decoder
◉ There is a large mismatch between the two models' resolutions (an STFT window of 512 samples for the
PIT-BLSTM vs. an encoder kernel of 16 samples for the SepFormer). It is necessary
to verify whether a violation of the Multiplicative Transfer Function Approximation (MTFA)
caused by the small window size contributes to the system deterioration under
reverberation.
◉ The MTFA is a long-standing problem for any STFT-based application in a reverberant scenario.
When switching from a fixed STFT encoder to a learnable encoder, the overall structure of the
system stays the same. Therefore, it can be assumed that similar issues arise with the learnable
encoder.
47
MTFA
◉ Mask-based separation relies on the sparsity and orthogonality of the sources in the domain
where the masks are computed. In the case of the STFT domain, this means that a time-frequency bin
of a mixture will typically be:
X(t, f) ≈ Σ_k S_k(t, f) · H_k(f),
where S_k(t, f) and H_k(f) are the STFT representation of the k-th
source signal and the transfer function of the Room Impulse Response (RIR) from the k-th source to the
microphone, respectively.
◉ Note that the equation makes the assumption that the convolution of the source signal with the
RIR corresponds to a multiplication in the STFT domain. This so-called Multiplicative Transfer
Function Approximation, however, only holds true if the temporal extent of the RIR is smaller than the
STFT analysis window.
48
49
Proof of “convolution in time = multiplication in frequency”
A convolution of f and g is defined as:
(f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ
So, its Fourier transform will be:
F{f ∗ g}(ω) = ∫∫ f(τ) g(t − τ) e^(−jωt) dτ dt
Since ∫∫ |f(τ) g(t − τ)| dt dτ < ∞, we may apply Fubini's theorem (i.e. interchange the order of
integration):
F{f ∗ g}(ω) = ∫ f(τ) e^(−jωτ) ( ∫ g(t − τ) e^(−jω(t−τ)) dt ) dτ = F(ω) G(ω)
(For details, see any DSP-related course.)
“
The multiplication is NOT accurate in the STFT domain!
In particular, this is on account of the short-time
windowing.
50
<On MTF Approximation in the STFT Domain>
<STFT: Influence of Window Function>
51
◉ Expected behavior in anechoic conditions :
✓ Reducing the frame shift leads to an improvement in SDR and WER. The recommended
analysis window size and shift of 16 and 8 samples (i.e. 2 ms and 1 ms), respectively, provide
the best results.
✓ Furthermore, the learnable encoder proves superior to the STFT encoder.
52
◉ Unexpected behavior:
✗ While the STFT encoder is significantly worse than a learnable encoder for a small window
size and shift, it becomes on par with or even outperforms the learnable encoder for an
increased window size of 32 ms. (→ violation of the MTFA)
✗ Interestingly, the overall best results of the SepFormer are achieved with the STFT.
53
◉ W-Disjoint Orthogonality (WDO) is a score which
measures the orthogonality of the single-speaker utterances
in the latent space.
◉ This indicates that the learnable encoder is able to compensate for the effects to some degree, but that
choosing a large enough window size is mandatory to stabilize the performance under
reverberation.
(though not highly correlated)
from high orthogonality to relatively low
mitigated by larger window
Optim. - Representation
◉ A significant difference between PIT-BLSTM and SepFormer is that the PIT model estimates the
masks based on the magnitude spectrogram only, whereas the SepFormer mask estimator
implicitly has access to both magnitude and phase.
◉ Results show that the phase information is not helpful for the SepFormer in the reverberant
scenario. Even more so, omitting the phase information leads to a better system performance.
54
Optim. - Representation
◉ Firstly, using only the magnitude spectrogram allows a larger window of 512 samples while keeping the
size of the separator identical, increasing the temporal context of each frame even further.
◉ Secondly, papers have shown that the phase becomes less informative while the magnitude
becomes more informative as the frame size increases.
◉ This matches the large window sizes that were shown to be necessary for the reverberant scenario.
◉ An interesting finding is that, when using the magnitude, altering the frame shift from 16 to 128 samples
improves both signal-level metrics and WER.
55
<Phase-Aware Deep Speech Enhancement: It's All About The Frame Length>
Summary
◉ When only the network of the separator is different, i.e., intra/inter-transformer layers vs BLSTM
layers, the performances are close.
◉ However, this modified SepFormer only shows a marginally better SDR and an improvement of 1
percentage point in the WER.
56
STFT, SDR loss, +Gaussian
Learned en/decoder, 16 window
STFT, 512 window
My Perspective
◉ What possibly causes the inconsistent performance on WHAMR! (SOTA) and SMS-WSJ (marginally
better)?
➜ In [1] and [2], the configurations are different.
➜ If learned properly, the obvious benefit of having more parameters is that the model can
represent much more complicated functions than one with fewer parameters.
(If not, having more parameters increases the risk of overfitting.)
◉ What are the consensuses in separation, even under reverberation? (cont.)
57
My Perspective
58
[Table: consensus check across models.
Anechoic — SepFormer[2] (16/8k = 2 ms); Conv-TasNet[3] (0.5 ms, improv. < 2 dB).
Reverberant — SepFormer (small)[1] (64 ms, improv. < 1 dB); SepFormer[2] (2 ms); Conv-TasNet[3] (8 ms, improv. < 2 dB).
Columns: Time-domain loss / Fine-grained resolution / Learned en/decoder; legend: Agree / Disagree (cell marks lost).]
<Preference for 20-40 ms window duration in speech analysis>
Typical frame lengths in speech processing, around 32 ms, correspond to
an interval that is short enough to be considered quasi-stationary but
long enough to cover multiple fundamental periods of voiced speech,
whose fundamental period lies between 2 ms and 12.5 ms.
My Perspective
59
[Same consensus table as on slide 58.]
My Perspective
60
[Same consensus table, highlighting:]
consensus on anechoic env.
My Perspective
61
[Same consensus table, highlighting:]
not a crucial factor, but FFC may help
<Fast Fourier Convolution>
My Perspective
62
[Same consensus table, highlighting:]
Perhaps a multi-resolution encoder is a solution!
<Multi-encoder multi-resolution framework for end-to-end speech recognition>
My Perspective
◉ Is SDR the right metric for obtaining a better WER?
63
<Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR>
Variety
64
Variety
65
Conclusions
5
66
Conclusions
◉ In [1][2], we have explored speech separation and enhancement built upon SepFormer, an
attention-only masking network that is able to achieve state-of-the-art performance in anechoic
conditions.
◉ The experiments in [1] and [3] raise the issue of whether the improvements that have been
praised for the separation of anechoic mixtures are futile for the more realistic case of
reverberant source separation.
◉ We also investigate the use of more efficient self-attention mechanisms such as Longformer,
Linformer and Reformer.
◉ Can we find a versatile model that can separate speech in both anechoic and reverberant
conditions, but with fewer parameters or an even smaller model size?
67
Appendix
◉ <My research notes>
68
◉ “Speech Separation” video from Professor HUNG-YI LEE:
Any questions?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!
69

More Related Content

What's hot

音声感情認識の分野動向と実用化に向けたNTTの取り組み
音声感情認識の分野動向と実用化に向けたNTTの取り組み音声感情認識の分野動向と実用化に向けたNTTの取り組み
音声感情認識の分野動向と実用化に向けたNTTの取り組みAtsushi_Ando
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020Yuki Saito
 
Dynamic Routing Between Capsules
Dynamic Routing Between CapsulesDynamic Routing Between Capsules
Dynamic Routing Between Capsulesharmonylab
 
DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相Takuya Yoshioka
 
A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...
A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...
A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...Akira Tamamori
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022NU_I_TODALAB
 
音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用NU_I_TODALAB
 
【 xpaper.challenge 】ブレインストーミング法
【 xpaper.challenge 】ブレインストーミング法【 xpaper.challenge 】ブレインストーミング法
【 xpaper.challenge 】ブレインストーミング法cvpaper. challenge
 
深層学習を利用した音声強調
深層学習を利用した音声強調深層学習を利用した音声強調
深層学習を利用した音声強調Yuma Koizumi
 
敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワークNU_I_TODALAB
 
Trends of ICASSP 2022
Trends of ICASSP 2022Trends of ICASSP 2022
Trends of ICASSP 2022Kwanghee Choi
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現NU_I_TODALAB
 
Hands on Voice Conversion
Hands on Voice ConversionHands on Voice Conversion
Hands on Voice ConversionNU_I_TODALAB
 
国際会議 interspeech 2020 報告
国際会議 interspeech 2020 報告国際会議 interspeech 2020 報告
国際会議 interspeech 2020 報告Shinnosuke Takamichi
 
リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法
リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法
リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法Shinnosuke Takamichi
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...NU_I_TODALAB
 
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパスJTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパスShinnosuke Takamichi
 
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析Shinnosuke Takamichi
 

What's hot (20)

音声感情認識の分野動向と実用化に向けたNTTの取り組み
音声感情認識の分野動向と実用化に向けたNTTの取り組み音声感情認識の分野動向と実用化に向けたNTTの取り組み
音声感情認識の分野動向と実用化に向けたNTTの取り組み
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020
 
Dynamic Routing Between Capsules
Dynamic Routing Between CapsulesDynamic Routing Between Capsules
Dynamic Routing Between Capsules
 
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
Tutorial on end-to-end text-to-speech synthesis: Part 1 – Neural waveform mod...
 
Advancements in Neural Vocoders
Advancements in Neural VocodersAdvancements in Neural Vocoders
Advancements in Neural Vocoders
 
DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相
 
A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...
A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...
A Method of Speech Waveform Synthesis based on WaveNet considering Speech Gen...
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用音声の声質を変換する技術とその応用
音声の声質を変換する技術とその応用
 
【 xpaper.challenge 】ブレインストーミング法
【 xpaper.challenge 】ブレインストーミング法【 xpaper.challenge 】ブレインストーミング法
【 xpaper.challenge 】ブレインストーミング法
 
深層学習を利用した音声強調
深層学習を利用した音声強調深層学習を利用した音声強調
深層学習を利用した音声強調
 
敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク
 
Trends of ICASSP 2022
Trends of ICASSP 2022Trends of ICASSP 2022
Trends of ICASSP 2022
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現
 
Hands on Voice Conversion
Hands on Voice ConversionHands on Voice Conversion
Hands on Voice Conversion
 
国際会議 interspeech 2020 報告
国際会議 interspeech 2020 報告国際会議 interspeech 2020 報告
国際会議 interspeech 2020 報告
 
リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法
リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法
リアルタイムDNN音声変換フィードバックによるキャラクタ性の獲得手法
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
 
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパスJTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
 
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
やさしく音声分析法を学ぶ: ケプストラム分析とLPC分析
 

Similar to Speech Separation under Reverberant Condition.pdf

Sepformer&DPTNet.pdf
Sepformer&DPTNet.pdfSepformer&DPTNet.pdf
Sepformer&DPTNet.pdfssuser849b73
 
Transformer-based SE.pptx
Transformer-based SE.pptxTransformer-based SE.pptx
Transformer-based SE.pptxssuser849b73
 
Implementation of High Speed OFDM Transceiver using FPGA
Implementation of High Speed OFDM Transceiver using FPGAImplementation of High Speed OFDM Transceiver using FPGA
Implementation of High Speed OFDM Transceiver using FPGAMangaiK4
 
Analysis of Women Harassment inVillages Using CETD Matrix Modal
Analysis of Women Harassment inVillages Using CETD Matrix ModalAnalysis of Women Harassment inVillages Using CETD Matrix Modal
Analysis of Women Harassment inVillages Using CETD Matrix ModalMangaiK4
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)Susang Kim
 
An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...Videoguy
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchSatoru Katsumata
 
Deep Learning Project.pptx
Deep Learning Project.pptxDeep Learning Project.pptx
Deep Learning Project.pptxTasnimRahman54
 
Semantic Mask for Transformer Based End-to-End Speech Recognition
Semantic Mask for Transformer Based End-to-End Speech RecognitionSemantic Mask for Transformer Based End-to-End Speech Recognition
Semantic Mask for Transformer Based End-to-End Speech RecognitionWhenty Ariyanti
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...kevig
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...kevig
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)IRJET Journal
 
Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits
Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI CircuitsOptimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits
Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI CircuitsVLSICS Design
 
A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...
A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...
A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...AM Publications
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkeSAT Journals
 
Iaetsd gmsk modulation implementation for gsm in dsp
Iaetsd gmsk modulation implementation for gsm in dspIaetsd gmsk modulation implementation for gsm in dsp
Iaetsd gmsk modulation implementation for gsm in dspIaetsd Iaetsd
 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...taeseon ryu
 

Similar to Speech Separation under Reverberant Condition.pdf (20)

Sepformer&DPTNet.pdf
Sepformer&DPTNet.pdfSepformer&DPTNet.pdf
Sepformer&DPTNet.pdf
 
Transformer-based SE.pptx
Transformer-based SE.pptxTransformer-based SE.pptx
Transformer-based SE.pptx
 
Conformer review
Conformer reviewConformer review
Conformer review
 
Implementation of High Speed OFDM Transceiver using FPGA
Implementation of High Speed OFDM Transceiver using FPGAImplementation of High Speed OFDM Transceiver using FPGA
Implementation of High Speed OFDM Transceiver using FPGA
 
Analysis of Women Harassment inVillages Using CETD Matrix Modal
Analysis of Women Harassment inVillages Using CETD Matrix ModalAnalysis of Women Harassment inVillages Using CETD Matrix Modal
Analysis of Women Harassment inVillages Using CETD Matrix Modal
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)
 
An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...An efficient transcoding algorithm for G.723.1 and G.729A ...
An efficient transcoding algorithm for G.723.1 and G.729A ...
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam search
 
Deep Learning Project.pptx
Deep Learning Project.pptxDeep Learning Project.pptx
Deep Learning Project.pptx
 
Semantic Mask for Transformer Based End-to-End Speech Recognition
Semantic Mask for Transformer Based End-to-End Speech RecognitionSemantic Mask for Transformer Based End-to-End Speech Recognition
Semantic Mask for Transformer Based End-to-End Speech Recognition
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)
 
Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits
Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI CircuitsOptimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits
Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits
 
A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...
A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...
A BER Performance Analysis of Shift Keying Technique with MMSE/MLSE estimatio...
 
Isolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural networkIsolated words recognition using mfcc, lpc and neural network
Isolated words recognition using mfcc, lpc and neural network
 
Iaetsd gmsk modulation implementation for gsm in dsp
Iaetsd gmsk modulation implementation for gsm in dspIaetsd gmsk modulation implementation for gsm in dsp
Iaetsd gmsk modulation implementation for gsm in dsp
 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
 
Sudormrf.pdf
Sudormrf.pdfSudormrf.pdf
Sudormrf.pdf
 
EEND-SS.pdf
EEND-SS.pdfEEND-SS.pdf
EEND-SS.pdf
 

Recently uploaded

Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Speech Separation under Reverberant Condition.pdf

  • 1. Speech Separation under Reverberant Condition Presenter : 何冠勳 61047017s Date : 2022/08/31 1: Tobias Cord-Landwehr et al., “Monaural Source Separation: from Anechoic to Reverberant Enviorments”, IWAENC 2022 2: Cem Subakan et al., “On Using Transformers for Speech-Separation”, IEEE Signal Processing Letters 3: Jens Heitkaemper et al., “Demystifying TasNet: a Dissecting Approach”, ICASSP 2020
  • 2. Outline 2 1 3 5 Introduction Architecture Conclusions 4 2 Prior Knowledge Experiments
  • 3. Introduction What is Speech Separation? Why? How? 1 3
  • 4. What is Speech Separation? See page 1 ~ 5 in PPT made by professor HUNG-YI LEE. 4 <Speech Separation - HUNG-YI LEE>
  • 5. History ◉ Starting with the seminal papers on deep clustering and permutation invariant training (PIT), improvements have been achieved by combining the two in a multi-objective training criterion, or replacing the STFT with a learnable encoder and decoder, e.g. Conv-TasNet, or accounting for short- and longer-term correlations in the signal, e.g. DPRNN, or conditioning on simultaneously computed speaker centroids, e.g. Wavesplit. ◉ Overall, this has led to an improvement in SI-SDR from roughly 10 ~ 20+ dB on the standard WSJ0-2mix dataset, which consists of artificial mixtures of anechoic speech. anechoic (adj.): free from echoes and reverberations 5
  • 7. 7
  • 8. Datasets ◉ Common datasets include: ➢ anechoic: WSJ0mix, LibriMix ➢ noisy: WHAM!, LibriMix ➢ reverberant: SMS-WSJ ➢ noisy & reverberant: WHAMR! ◉ There are also data related to task of enhancing speech in WHAM!, WHAM! and LibriMix. ◉ Instead of utterance manner, there are also continuous speech separation corpus, like LibriCSS. 8
  • 9. Non-anechoic ◉ However, an anechoic environment is a rather unrealistic assumption as in a real-world scenario, the superposition of the speech of two or more speakers typically occurs in a distant microphone setting. ◉ In particular, reverberation has been considered more challenging than noise. ◉ WHAMR! and SMS-WSJ are two widely used datasets for research on source separation for reverberant mixtures. Both contain artificially reverberated utterances from the WSJ corpus. ◉ Source separation performance on WHAMR! is 2 ~ 8 dB, on SMS-WSJ is 5 ~ 6 dB for single-channel input and single-stage processing, which is much worse than the performance on anechoic mixtures. 9
  • 10. Non-anechoic ◉ In [1], they aim to explore which of the recent innovations that proved useful for the separation of anechoic mixtures are also beneficial in the reverberant case, in order to propose some guidelines on how to adjust a separation system to reverberated input. ◉ They take the SepFormer architecture, which achieves state-of-the-art performance both on WSJ0-2mix and WHAMR!, and the traditional model PIT-BLSTM then adopt both models on SMS-WSJ datasets. ◉ By modifying and optimizing the each model, w.r.t. loss function, encoder/decoder architecture, resolution and representation, they adapt the model to mitigate the performance degradation between the anechoic and reverberant scenario. 10
  • 11. Non-anechoic ◉ In [2], they expand the study by providing additional experiments and insights on more realistic and challenging datasets such as Libri2/3-Mix, which include longer mixtures, WHAM! and WHAMR!, featuring noisy and noisy & reverberant conditions respectively. ◉ Moreover, on WHAM! and on WHAMR! datasets they also provide results for speech enhancement. ◉ Another contribution of [2] is investigating different types of self-attention for speech separation with and without the dual-path mechanism. Namely, they compare the vanilla Transformer with Longformer, Linformer, and Reformer. 11
  • 12. Reverberation 12 <Reverberation-Wikipedia> <測量殘響時間 RT60> WHAMR!: RT60 ∈ U(200 ms, 500ms) SMS-WSJ: RT60 ∈ U(100 ms, 1000ms)
  • 13. Objectives 13 ◉ In this presentation, we expect to obtain: ✓ separation performance on WHAM!, WHAMR!, LibriMix and SMS-WSJ,[1][2] ✓ enhancement performance on WHAM! and WHAMR!, [2] ✓ modifications that are rewarding to adapt to the reverberant condition,[1] ✓ analysis on those modifications[1] and ✓ comparison between transformer variety. [2]
  • 14. Prior Knowledge - Basic architecture - Metrics - Permutation invariant training - Transformer - Time Complexity - Variety 2 14
  • 15. Basic Architecture ◉ Encoder module is used to transform short segments of the mixture waveform into their corresponding representations in an intermediate feature space. ◉ This representation is then used to estimate a multiplicative function (separation mask) for each source at each time step. ◉ The source waveforms are then reconstructed by transforming the masked encoder features using a decoder module. 15 conv1d transpose-conv1d
  • 17. Metrics See page 6 ~ 9 in PPT made by professor HUNG-YI LEE. 17 <Speech Separation - HUNG-YI LEE>
  • 18. PIT See page 10 ~ 11 in PPT made by professor HUNG-YI LEE. 18 <Speech Separation - HUNG-YI LEE>
  • 19. ◉ Transformers enable direct and accurate context modeling of longer term dependencies which render them suitable for audio processing, especially for speech separation, where long-term modeling has been shown to impact performance significantly. ◉ However, to avoid confusions, the Transformer in this paper refers especially to the encoder part, and it is comprised of three core modules: scaled dot-product attention, multi-head attention and position-wise feed-forward network. ◉ 【機器學習2021】Transformer (上)、 【機器學習2021】Transformer (下)、 Attention 及 Transformer 架構理解 19 Transformer
  • 22. Variety - Longformer 22 <Longformer: The Long-Document Transformer> <Longformer論文筆記-知乎 >
  • 23. Variety - Linformer 23 <Linformer: Self-Attention with Linear Complexity> <Linformer論文筆記-知乎 >
  • 24. 24 Variety - Reformer <Reformer: The Efficient Transformer> <圖解Reformer>
  • 27. SepFormer ◉ For mixture , the encoder learns an STFT-like representation: ◉ The masking network is fed by and estimates masks for each of the speakers. 1. Linear + Chunking (segmentation + overlap) : 2. SepFormer block (intra-Transformer + inter-Transformer) : 3. Linear + PReLU : 27
  • 28. SepFormer 4. Overlap-add : 5. FFW + PReLU : ◉ The input to the decoder is the element-wise multiplication between masks and initial representation: 28
  • 29. SepFormer Block ◉ Intra-Transformer processes the second dimension of , and thus acts on each chunk independently, modeling the short-term dependencies within each chunk. ◉ Next, we permute the last two dimensions (denoted as ), and the Inter-Transformer is applied to model the transitions across chunks. ◉ Overall transformation : ◉ The above consist a SepFormer block, and there can be N blocks repeatedly. 29
  • 31. SepFormer Block ◉ Transformer procedure (Pre-LN setting): 1. Positional encoding : 2. Layer norm + Multi-head attention : 3. Layer norm + Feed forward + Residuals : 4. Repeat blocks 31 <On Layer Normalization in the Transformer Architecture>
  • 32. SepFormer ◉ Parameters: ○ Encoder basis 256 ; Kernel size 16 ; Chunk size 250 ; ○ Transformers 8 ; SepFormer blocks 2 ; Attention heads 8 ; Dimension 1024 ; ○ Optimizer Adam ; Loss negative SI-SNR ; Learning rate 1.5e-4 ; Batch size 1 ; Epochs 200 . ◉ They explored the use of dynamic mixing data augmentation which consists in on-the-fly creation of new mixtures from single speaker sources, along with speed perturbation in [95%, 105%]. ◉ Training process also apply learning rate halving, gradient clipping and mixed-precision. Mixed-precision training is a technique for substantially reducing neural net training time by performing as many operations as possible in fp16, instead of fp32. 32
  • 33. SepFormer ◉ When using dynamic mixing, SepFormer achieves state-of- the-art performance. ◉ SepFormer outperforms previous systems without using dynamic mixing except Wavesplit, which uses speaker identity as additional information. 33
  • 34. SepFormer ◉ A respectable performance of 19.2 dB is obtained even when we use a single layer Transformer for the Inter-Transformer. This suggests that the Intra-Transformer, and thus local processing, has a greater influence on the performance. ◉ It also emerges that positional encoding is helpful. A similar outcome has been observed in T-gsa for speech enhancement. ◉ Finally, it can be observed that dynamic mixing helps the performance drastically. 34
  • 35. SepFormer ❗ However, my inference experience shows that during training the memory consumption of SepFormer is, on the contrary, large enough to be unable to fit in one 24 GB card. ❗ As a result, they can only train with batch size of one in 32 GB card. 35
  • 36. PIT-BLSTM 36 STFT iSTFT Masking net 3 BLSTM + 2 FC ; Hidden units 600 ; Window size 512 ; Frame size 128 ; Feature dim 257 ; Loss MSE (on magnitude).
  • 37. PIT-BLSTM ◉ In [1], SepFormer is trained with a soft-thresholded time-domain negative SDR: ❓ This loss decreases the contribution of well separated examples to the gradient, encouraging the model to focus more on enhancing examples with a low SDR than those that already show a good separation. (I argue that such modification is mainly for stability, e.g. when the distance between estimation and ground truth is close enough, like the evaluation in Sudormrf.) 37
  • 39. Comparison ◉ In [1], SepFormer (small) reduce the original number of intra- and inter-Transformer layers to 4, each. This modification yields an about 1 dB lower SDR on WSJ0-2mix, but significantly reduces the number of parameters. 39 ❗ Note that the memory footprint of SepFormer (small) still is 16 times larger than the PIT-BLSTM, so that a complexity comparison purely based on the parameters is not fair.
  • 40. Experiments - Non-anechoic[1][2] - Optimization[1] - Summary[1] - My perspective - Variety[2] 4 40
  • 41. LibriMix 41 ◉ LibriMix is proposed as an alternative to WSJ0-2/3Mix with more diverse and longer utterances. ◉ The network is trained to perform separation and de-noising at the same time.
  • 42. WHAM! & WHAMR! ◉ In WHAM! dataset, the model learns to de-noise while also separating, and in WHAMR! dataset, the model learns to de-noise and de-reverberate in addition to separating. ◉ SepFormer outperforms the previously proposed methods even without the use of DM, which further improves the performance. 42
  • 43. Speech Enhancement ◉ In addition to training SepFormer, we also trained a Bidirectional LSTM, CNNTransformer, CFDN models from Speechbrain, which are trained to minimize MSE of the estimated magnitude spectrogram and the spectrogram of the clean signals. 43 <The SpeechBrain Toolkit>
  • 44. SMS-WSJ ◉ It can be seen that both systems degrade under the presence of reverberation. However, the separation performance of the SepFormer degrades by more than 10 dB in terms of SDR and almost 30 percentage points regarding the WER. (WERs are evaluated by kaldi baseline provided in SMS-WSJ.) 44
  • 45. Optimization ◉ Apart from the discrepancy in masking net structure, there are also other differences between both model, namely loss function, encoder/decoder architecture and resolution. ◉ The choice of representation, e.g. magnitude, phase or both, is also worth exploring. ◉ But first, they speculate whether the degradation is on account of artifacts. The method to mitigate that is by adding white Gaussian noise at an SNR of 25 dB to the separated audio files before they were input to the speech recognizer. 45 <White Gaussian Noise> <How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR>
  • 46. Optim. ◉ Even more so, these modifications work well both with and without reverberation and lead to a reduction in WER of more than 20 percentage points for both scenarios. ◉ As for SepFormer, it is striking that it is no longer superior to PIT-BLSTM for reverberant data. 46
  • 47. Optim. - En/Decoder ◉ There is a large mismatch between the window size and the frame size in between. It is necessary to verify whether the violation of the Multiplicative Transfer Function Approximation (MTFA) related problem caused by the small window size contributes to the system deterioration under reverberation. ◉ MTFA is a long standing problem when facing any application using STFT in reverberant scenario. When switching from a fixed STFT encoder to a learnable encoder, the overall structure of the system stays the same. Therefore, it can be assumed that similar issues arise with the learnable encoder. 47
• 48. MTFA ◉ Mask-based separation relies on the sparsity and orthogonality of the sources in the domain where the masks are computed. In the case of the STFT domain, this means that a time-frequency bin of a mixture will typically be: $X(t,f) \approx \sum_k S_k(t,f)\, H_k(f)$, where $S_k(t,f)$ and $H_k(f)$ are the STFT representations of the k-th source signal and of the Room Impulse Response (RIR) from the k-th source to the microphone, respectively. ◉ Note that this equation assumes that the convolution of the source signal with the RIR corresponds to a multiplication of $S_k$ and $H_k$ in the STFT domain. This so-called Multiplicative Transfer Function Approximation, however, only holds true if the temporal extent of the RIR is smaller than the STFT analysis window. 48
• 49. 49 Proof of "convolution in time = multiplication in frequency": A convolution of $f$ and $g$ is defined as $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau$. So, its Fourier transform will be $\mathcal{F}\{f * g\}(\omega) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, e^{-j\omega t}\, d\tau\, dt$. Since $f, g \in L^1$, we may apply Fubini's theorem (i.e. interchange the order of integration) and substitute $u = t - \tau$: $\mathcal{F}\{f * g\}(\omega) = \int f(\tau)\, e^{-j\omega\tau}\, d\tau \cdot \int g(u)\, e^{-j\omega u}\, du = F(\omega)\, G(\omega)$. (For details, see any DSP course.)
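A quick numerical sanity check of this identity (illustrative, numpy only; both sequences are zero-padded to the full linear-convolution length so the circular FFT product matches np.convolve):

```python
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.normal(size=100), rng.normal(size=30)

n = len(f) + len(g) - 1            # length of the linear convolution
time_domain = np.convolve(f, g)    # (f * g)(t)
freq_domain = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(time_domain, freq_domain))  # True
```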
• 50. “ The multiplication is NOT accurate in the STFT domain! In particular, this is on account of the short-time windowing. 50 <On MTF Approximation in the STFT Domain> <STFT: Influence of Window Function>
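The windowing point can be checked numerically: when the RIR is longer than the analysis window, multiplying the source STFT by the RIR's frequency response no longer reproduces the STFT of the reverberant signal. An illustrative sketch (my own, assuming scipy; the decaying-noise RIR is a stand-in for a real room response):

```python
import numpy as np
from scipy.signal import stft

fs, nperseg = 8000, 256                      # 32 ms analysis window at 8 kHz
rng = np.random.default_rng(0)
s = rng.normal(size=fs)                      # 1 s of "source" signal

def mtfa_error(rir_len):
    # exponentially decaying noise as a toy room impulse response
    h = rng.normal(size=rir_len) * np.exp(-np.arange(rir_len) / (rir_len / 4))
    x = np.convolve(s, h)[: len(s)]          # reverberant signal
    _, _, S = stft(s, fs, nperseg=nperseg)
    _, _, X = stft(x, fs, nperseg=nperseg)
    H = np.fft.rfft(h, nperseg)              # one-sided, matches the STFT bins
    X_mtfa = S * H[:, None]                  # multiplicative approximation
    return np.linalg.norm(X - X_mtfa) / np.linalg.norm(X)

print(mtfa_error(32))    # RIR shorter than the window: noticeably smaller error
print(mtfa_error(2048))  # RIR longer than the window: approximation breaks down
```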
• 51. 51 ◉ Expected behavior in anechoic conditions: ✓ Reducing the frame shift leads to an improvement in SDR and WER. The recommended analysis window size and shift of 16 and 8 samples (i.e. 2 ms and 1 ms), respectively, provide the best results. ✓ Furthermore, the learnable encoder proves superior to the STFT encoder.
• 52. 52 ◉ Unexpected behaviour: ✗ While the STFT encoder is significantly worse than a learnable encoder for a small window size and shift (a violation of the MTFA), it is on par with or even outperforms the learnable encoder for an increased window size of 32 ms. ✗ Interestingly, the overall best results of the SepFormer are achieved with the STFT.
• 53. 53 ◉ W-Disjoint Orthogonality (WDO) is a score which measures the orthogonality of the single-speaker utterances in the latent space. Under reverberation it drops from high orthogonality to relatively low, which is mitigated by a larger window (though WDO is not highly correlated with the separation performance). ◉ This indicates that the learnable encoder is able to compensate for these effects to some degree, but that choosing a large enough window size is mandatory to stabilize the performance under reverberation.
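The exact metric used in [1] is not spelled out on the slide; a standard way to probe W-disjoint orthogonality (in the spirit of Yilmaz & Rickard's definition, which I assume here) is to build an ideal binary mask in the chosen representation and measure preserved versus leaked energy:

```python
import numpy as np

def wdo_score(S1: np.ndarray, S2: np.ndarray) -> float:
    """Approximate W-disjoint orthogonality of two sources given their
    representations (STFT or latent). 1.0 = perfectly disjoint sources."""
    mask1 = np.abs(S1) > np.abs(S2)              # ideal binary mask, source 1
    preserved = np.sum(np.abs(S1[mask1]) ** 2)   # source-1 energy kept
    leaked = np.sum(np.abs(S2[mask1]) ** 2)      # source-2 energy leaking in
    return (preserved - leaked) / np.sum(np.abs(S1) ** 2)
```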
• 54. Optim. - Representation ◉ A significant difference between PIT-BLSTM and SepFormer is that the PIT model estimates the masks based on the magnitude spectrogram only, whereas the SepFormer mask estimator implicitly has access to both magnitude and phase. ◉ The results show that the phase information is not helpful for the SepFormer in the reverberant scenario. Even more so, omitting the phase information leads to better system performance. 54
• 55. Optim. - Representation ◉ Firstly, using only the magnitude spectrogram allows a larger window of 512 samples while keeping the size of the separator identical, increasing the temporal context of each frame even further. ◉ Secondly, papers have shown that, as the frame size increases, the phase becomes less informative while the magnitude becomes more informative. ◉ This matches the large window sizes that were shown to be necessary for the reverberant scenario. ◉ An interesting finding is that, when using the magnitude, increasing the frame shift from 16 to 128 samples improves both the signal-level metrics and the WER. The sketch below contrasts the two feature choices. 55 <Phase-Aware Deep Speech Enhancement: It's All About The Frame Length>
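To make the representation choice concrete, a minimal sketch (my own illustration, assuming PyTorch; the window sizes are only indicative of the trade-off above) of magnitude-only features versus stacked real/imaginary features that carry the phase implicitly:

```python
import torch

def stft_features(wav: torch.Tensor, n_fft: int, hop: int, use_phase: bool):
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft, hop, window=window, return_complex=True)
    if use_phase:
        # real/imag stacked along the frequency axis: phase enters implicitly
        return torch.cat([spec.real, spec.imag], dim=-2)
    return spec.abs()  # magnitude only, as in the PIT-BLSTM setup

# magnitude-only with a larger window keeps the separator input size similar:
# mag  = stft_features(wav, n_fft=512, hop=128, use_phase=False)  # 257 bins
# cplx = stft_features(wav, n_fft=256, hop=128, use_phase=True)   # 258 channels
```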
• 56. Summary ◉ When only the separator network differs, i.e. intra-/inter-Transformer layers vs. BLSTM layers, the performances are close. ◉ However, this modified SepFormer only shows a marginally better SDR and an improvement of 1 percentage point in WER. 56 (Configurations compared: PIT-BLSTM: STFT, SDR loss, +Gaussian; SepFormer: learned en/decoder, 16-sample window; modified SepFormer: STFT, 512-sample window.)
• 57. My Perspective ◉ What possibly causes the inconsistent performance on WHAMR! (SOTA) and SMS-WSJ (marginally better)? ➜ In [1] and [2], the configurations are different. ➜ If learned properly, the obvious benefit of having more parameters is that the model can represent much more complicated functions than one with fewer parameters. (If not, having more parameters increases the risk of overfitting.) ◉ What are the consensuses in separation, even under reverberation? (cont.) 57
• 58. My Perspective 58 ◉ Anechoic: time-domain loss, fine-grained resolution, and a learned en/decoder are the common recipe: SepFormer[2] (16/8k = 2 ms), Conv-TasNet[3] (0.5 ms) (improv. < 2 dB). ◉ Reverberant: the papers disagree, mainly on resolution: SepFormer (small)[1] (64 ms) (improv. < 1 dB) vs. SepFormer[2] (2 ms) and Conv-TasNet[3] (8 ms) (improv. < 2 dB). <Preference for 20-40 ms window duration in speech analysis> Typical frame lengths in speech processing, around 32 ms, correspond to an interval that is short enough to be considered quasi-stationary but long enough to cover multiple fundamental periods of voiced speech, whose fundamental period lies between 2 ms and 12.5 ms (i.e. an F0 of roughly 80 to 500 Hz).
• 60. My Perspective (cont.) ◉ Revisiting the comparison on slide 58: there is a consensus on the anechoic environment (time-domain loss, fine-grained resolution, learned en/decoder). 60
• 61. My Perspective (cont.) ◉ Under reverberation, a fine-grained resolution is not a crucial factor, but FFC may help. 61 <Fast Fourier Convolution>
• 62. My Perspective (cont.) ◉ Perhaps a multi-resolution encoder is a solution, as sketched below! 62 <Multi-encoder multi-resolution framework for end-to-end speech recognition>
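A hedged sketch of what such a multi-resolution learnable encoder could look like (my own illustration, loosely following the multi-encoder idea in the linked paper, not an implementation from [1]-[3]): parallel 1-D convolutions with short and long windows whose outputs are concatenated.

```python
import torch
import torch.nn as nn

class MultiResolutionEncoder(nn.Module):
    """Parallel Conv1d encoders with different window sizes (in samples).

    Short windows give the fine time resolution that helps in anechoic
    conditions; long windows give the temporal context that reverberation
    seems to require.
    """
    def __init__(self, out_channels=128, win_sizes=(16, 256, 512), hop=8):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, out_channels, kernel_size=w, stride=hop,
                      padding=w // 2, bias=False)
            for w in win_sizes
        )

    def forward(self, wav):                    # wav: (batch, samples)
        x = wav.unsqueeze(1)                   # (batch, 1, samples)
        feats = [torch.relu(b(x)) for b in self.branches]
        n = min(f.shape[-1] for f in feats)    # align the frame counts
        return torch.cat([f[..., :n] for f in feats], dim=1)

# enc = MultiResolutionEncoder(); latent = enc(torch.randn(2, 8000))
```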
• 63. My Perspective ◉ Is SDR the right metric for obtaining a better WER? 63 <Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR>
• 67. Conclusions ◉ In [1][2], we have explored speech separation and enhancement upon SepFormer, an attention-only masking network that achieves state-of-the-art performance in anechoic conditions. ◉ The experiments in [1] and [3] raise the issue of whether the improvements that have been praised for the separation of anechoic mixtures are futile for the more realistic case of reverberant source separation. ◉ We also investigate the use of more efficient self-attention mechanisms such as Longformer, Linformer, and Reformer. ◉ Can we find a versatile model that separates speech in both anechoic and reverberant conditions, but with fewer parameters or an even smaller model size? 67
• 68. Appendix ◉ <My research notes> ◉ “Speech Separation” video from professor HUNG-YI LEE: 68
  • 69. Any questions ? You can find me at ◉ jasonho610@gmail.com ◉ NTNU-SMIL Thanks! 69