Conv-TasNet:
Surpassing Ideal Time-Frequency
Magnitude Masking for Speech Separation
Presenter : 何冠勳 61047017s
Date : 2022/04/14
Authors: Yi Luo, Nima Mesgarani
Published as arXiv:1809.07454v3 [cs.SD], 15 May 2019
Outline
1. Introduction
2. Architecture
3. Experiments
4. Results
5. Samples
6. Discussions & Conclusions
Introduction
- Why Conv-TasNet?
- What is Conv-TasNet?
Why Conv-TasNet?
Plot shows scale-invariant SDR improvement (SI-SDRi) on WSJ0-2mix (higher is better).
Source: https://paperswithcode.com/task/speech-separation
Why Conv-TasNet?
(Figure: the same benchmark plot, with two milestone models highlighted.)
Why Conv-TasNet?
◉ Robust speech processing in real-world acoustic environments often requires automatic speech
separation.
◉ Most previous speech separation approaches have been formulated in the time-frequency
representation (spectrogram) of the mixture signal.
◉ Alternatively, a weighting function (mask) can be estimated for each source to multiply each T-F bin in
the mixture spectrogram to recover the individual sources.
(Diagram: a noisy spectrogram is passed through nonlinear regression techniques to produce
separated spectrograms 1 and 2, which are compared against clean spectrograms 1 and 2 to
compute the loss.)
Why Conv-TasNet?
◉ In both the direct method and the mask-estimation method, the waveform of each source is
calculated using the STFT and iSTFT.
◉ This approach has several shortcomings:
○ The STFT is a generic signal transformation, not necessarily optimal for speech separation.
○ The phase problem: accurate reconstruction of the phase of the clean sources is a
nontrivial problem.
○ Successful separation from the time-frequency representation requires a
high-resolution frequency decomposition.
◉ Because these issues arise from formulating the separation problem in the time-frequency
domain, a logical approach is to formulate the separation directly in the time domain.
What is Conv-TasNet?
◉ Convolutional Time-domain Audio Separation Network.
◉ The main idea is to replace the STFT and iSTFT used for feature extraction with a data-driven
representation that is jointly optimized in an end-to-end training paradigm.
◉ The original TasNet uses an LSTM as its separation module, which significantly limits its applicability:
○ the segment (kernel) size is hard to choose well (small values make training unmanageable);
○ high computational cost;
○ inconsistent accuracy when the starting point of the same mixture is shifted.
◉ Conv-TasNet uses a temporal convolutional network (TCN) to replace the deep LSTM networks in the
separation step.
◉ Its structure is simple: Conv-TasNet consists only of an encoder, a separation (mask-estimation)
module, and a decoder.
What is Conv-TasNet?
(Figure: overview of the Conv-TasNet pipeline.)
Architecture
- Architecture
- Formulating
- Separation module
- Normalization
Architecture
(Figure: the encoder, separation, and decoder modules of Conv-TasNet.)
Architecture
◉ The encoder module transforms short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for
each source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
with a decoder module.
Formulating
◉ The mixture waveform is divided into overlapping segments x ∈ ℝ^{1×L}.
◉ Encoder: each segment is mapped to the encoder representation w = H(xUᵀ) ∈ ℝ^{1×N}, where
U ∈ ℝ^{N×L} holds the N encoder basis functions and H(·) is an optional nonlinear function.
◉ Separation: the TCN, which uses depthwise separable convolution in every layer, produces a mask
mᵢ ∈ ℝ^{1×N} for each source. If C = 2, two speakers are involved and two masks are estimated;
if C = 3, three masks are estimated, optionally with the constraint Σᵢ mᵢ = 1.
◉ Masking: the masked representation for source i is dᵢ = w ⊙ mᵢ.
◉ Decoder: each separated source is reconstructed as ŝᵢ = dᵢV, where V ∈ ℝ^{N×L} holds the N
decoder basis functions.
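Below is a minimal PyTorch sketch of this encoder → mask → decoder pipeline. It is illustrative only: the class name, hyperparameters, and the tiny stand-in separator are my assumptions, not the authors' implementation (which uses a full TCN as the separator).

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Skeleton of encoder / mask / decoder (illustrative, not the paper's code)."""
    def __init__(self, N=512, L=16, C=2):
        super().__init__()
        self.C = C
        # Encoder: a strided 1-D conv plays the role of w = xU^T (linear encoder).
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        # Stand-in for the TCN separation module: estimates C masks from w.
        self.separator = nn.Sequential(
            nn.Conv1d(N, N, kernel_size=1), nn.PReLU(),
            nn.Conv1d(N, C * N, kernel_size=1),
        )
        # Decoder: a transposed conv plays the role of s_i = d_i V (overlap-add).
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, mixture):                      # mixture: (batch, 1, T)
        w = self.encoder(mixture)                    # (batch, N, T')
        masks = torch.sigmoid(self.separator(w))     # (batch, C*N, T')
        masks = masks.view(-1, self.C, w.size(1), w.size(2))
        sources = [self.decoder(w * masks[:, i]) for i in range(self.C)]
        return torch.stack(sources, dim=1)           # (batch, C, 1, ~T)
```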
Architecture
(Figure: the separation module as a stack of 1-D dilated convolutional blocks.)
Separation module
◉ Motivated by the TCN, we propose a fully convolutional separation module that consists of stacked
1-D dilated convolutional blocks.
◉ Each layer in a TCN consists of 1-D convolutional blocks with increasing dilation factors. The
dilation factors increase exponentially to ensure a sufficiently large temporal context window to
take advantage of the long-range dependencies of the speech signal.
◉ The input to each block is zero-padded accordingly to ensure the output length is the same as the
input.
◉ Each 1-D convolutional block has two outputs:
○ the residual path of a block serves as the input to the next block;
○ the skip-connection paths of all blocks are summed and used as the output of the TCN.
Dilated?
(Figure: normal convolution vs. dilated convolution.)
◉ Enables the network to consider context beyond the immediate neighborhood.
◉ Preserves the input resolution throughout the network.
◉ Increasing the dilation factor exponentially results in exponential receptive-field growth with
depth, as the sketch below illustrates.
“The receptive field, or sensory space, is a delimited medium where some physiological
stimuli can evoke a sensory neuronal response in specific organisms.”
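A small sketch (assuming PyTorch; the channel count, kernel size, and depth are arbitrary) showing that exponentially growing dilations preserve the input length while the receptive field grows exponentially:

```python
import torch
import torch.nn as nn

# Stack of 1-D dilated convolutions with exponentially growing dilation.
# Padding is chosen so every layer preserves the input length (non-causal).
kernel, channels, num_layers = 3, 8, 5
layers = []
for i in range(num_layers):
    d = 2 ** i                                # dilation: 1, 2, 4, 8, 16
    layers.append(nn.Conv1d(channels, channels, kernel,
                            dilation=d, padding=d * (kernel - 1) // 2))
net = nn.Sequential(*layers)

x = torch.randn(1, channels, 100)
print(net(x).shape)   # torch.Size([1, 8, 100]) -- resolution preserved
# Receptive field: 1 + (kernel - 1) * (2**num_layers - 1) = 63 samples.
```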
Residual & Skip connections
◉ Both residual and parameterised skip connections are
used throughout the network, to speed up convergence
and enable training of much deeper models.
◉ To further decrease the number of parameters, depthwise
separable convolution is used to replace standard
convolution in each convolutional block.
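A sketch of one such 1-D convolutional block (assuming PyTorch; the paper's block also includes PReLU activations and normalization, omitted here for brevity):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1x1 conv -> depthwise dilated conv -> residual and skip outputs."""
    def __init__(self, channels=64, hidden=128, kernel=3, dilation=1):
        super().__init__()
        self.expand = nn.Conv1d(channels, hidden, 1)           # pointwise
        self.depthwise = nn.Conv1d(hidden, hidden, kernel,     # depthwise part of
                                   dilation=dilation,          # a separable conv
                                   padding=dilation * (kernel - 1) // 2,
                                   groups=hidden)
        self.res_out = nn.Conv1d(hidden, channels, 1)          # residual path
        self.skip_out = nn.Conv1d(hidden, channels, 1)         # skip path

    def forward(self, x):
        h = self.depthwise(self.expand(x))
        return x + self.res_out(h), self.skip_out(h)
```

The residual output feeds the next block; the skip outputs of all blocks are summed to form the TCN output.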
Residual?
◉ In traditional neural networks, each layer feeds into the next layer. In a network with residual
blocks, each layer feeds into the next layer and directly into the layers about 2–3 hops away.
Optimizing the residual F(x) := H(x) − x is more tractable for
deeper networks.
Normalization
The choice of the normalization method in the network depends on the causality requirement.
◉ For noncausal configuration, we found
empirically that global layer normalization
(gLN) outperforms all other normalization
methods.
◉ In causal configuration, gLN cannot be applied
since it relies on the future values of the signal at
any time step. Instead, we designed a cumulative
layer normalization (cLN) operation.
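A sketch of cumulative layer normalization (assuming PyTorch; the trainable gain and bias of the paper's cLN are omitted):

```python
import torch

def cumulative_layer_norm(x, eps=1e-8):
    """Causal normalization: each frame is normalized by the mean and
    variance of all channels in frames up to and including itself."""
    batch, channels, time = x.shape
    count = torch.arange(1, time + 1, device=x.device) * channels  # values seen so far
    cum_sum = x.sum(dim=1).cumsum(dim=1)             # (batch, time)
    cum_sq = (x ** 2).sum(dim=1).cumsum(dim=1)       # (batch, time)
    mean = cum_sum / count
    var = (cum_sq / count - mean ** 2).clamp(min=0.0)
    return (x - mean.unsqueeze(1)) / torch.sqrt(var + eps).unsqueeze(1)
```

Because the statistics at step k use only frames up to k, cLN keeps a causal configuration valid, unlike gLN.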
Experiments
- Dataset
- Training configurations
- Evaluation metrics
- Comparison
Dataset
◉ We evaluated our system on two-speaker and three-speaker speech separation problems using
the WSJ0-2mix and WSJ0-3mix datasets (30 hours of training and 10 hours of validation data).
◉ The speech mixtures are generated by randomly selecting utterances from different speakers in
WSJ0 and mixing them at random signal-to-noise ratios (SNR) between -5 dB and 5 dB.
◉ A 5-hour evaluation set is generated in the same way using utterances from 16 unseen speakers.
◉ All the waveforms are resampled to 8 kHz.
Training configurations
◉ The networks are trained for 100 epochs on
4-second-long segments.
◉ The initial learning rate is set to 1e-3 and is
halved if the validation accuracy does not
improve for 3 consecutive epochs.
◉ Adam is used as the optimizer. A 50% stride size
is used in the convolutional autoencoder.
◉ Gradient clipping with a maximum L2-norm of 5 is
applied during training.
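A sketch of this training loop (assuming PyTorch; model, train_loader, validate, and loss_fn are placeholders for pieces defined elsewhere):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate after 3 epochs without validation improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)

for epoch in range(100):
    for mixture, sources in train_loader:         # 4-second segments
        loss = loss_fn(model(mixture), sources)   # e.g. negative uPIT SI-SNR
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()
    scheduler.step(validate(model))               # drives the lr halving
```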
Training objectives
◉ The objective of training the end-to-end system is to maximize the improvement of the
scale-invariant source-to-noise ratio (SI-SNR), which replaces the standard source-to-distortion
ratio (SDR).
◉ Utterance-level permutation invariant training (uPIT) is applied during training to address the
source permutation problem; a sketch of both appears below.
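A sketch of the SI-SNR objective and a two-speaker uPIT wrapper (assuming PyTorch; this follows the standard definitions, with function names of my choosing):

```python
import torch

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between zero-meaned waveforms of shape (..., T)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = <s_hat, s> s / ||s||^2
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    return 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def upit_si_snr_loss(est, ref):
    """uPIT for C = 2: score both speaker assignments, keep the better one."""
    # est, ref: (batch, 2, T)
    straight = si_snr(est[:, 0], ref[:, 0]) + si_snr(est[:, 1], ref[:, 1])
    swapped = si_snr(est[:, 0], ref[:, 1]) + si_snr(est[:, 1], ref[:, 0])
    return -torch.maximum(straight, swapped).mean() / 2   # negative mean SI-SNR
```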
Evaluation metrics
◉ Report SI-SNRi & SDRi as objective measures of separation accuracy.
◉ Evaluate the quality of the separated mixtures using
○ perceptual evaluation of speech quality (PESQ), and
○ mean opinion score (MOS), obtained by asking 40 normal-hearing subjects to rate the quality of the
separated mixtures.
Comparison
◉ Following the common configurations mentioned above, the ideal time-frequency masks were
calculated using an STFT with a 32 ms window size and an 8 ms hop size with a Hanning window.
◉ The ideal masks include the ideal binary mask (IBM), the ideal ratio mask (IRM), and the Wiener
filter-like mask (WFM); sketches of their definitions follow.
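For reference, a sketch of the three ideal masks for one speaker, computed from the clean sources' spectrograms (my rendering of the standard definitions; the function name and eps are arbitrary):

```python
import numpy as np

def ideal_masks(spec1, spec2, eps=1e-8):
    """Ideal masks for speaker 1 given complex spectrograms of both sources."""
    m1, m2 = np.abs(spec1), np.abs(spec2)
    ibm = (m1 > m2).astype(float)            # ideal binary mask
    irm = m1 / (m1 + m2 + eps)               # ideal ratio mask
    wfm = m1**2 / (m1**2 + m2**2 + eps)      # Wiener-filter-like mask
    return ibm, irm, wfm
```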
Results
- Encoder & decoder
- Best parameters
- Comparison
- Dissection of autoencoder
(Figure: visualization of the encoder and decoder basis functions, encoder representation, and
source masks for a sample 2-speaker mixture.)
The estimated masks for the two speakers highly resemble their encoder representations.
Encoder & decoder
◉ Review: w = H(xUᵀ), where the term H(·) is an optional nonlinear function.
◉ Compare five different encoder-decoder configurations:
a. Linear encoder with its pseudo-inverse (Pinv) as decoder, with a Softmax function for mask
estimation: w = xUᵀ and ŝᵢ = dᵢ(Uᵀ)⁺.
b. Linear encoder and decoder, with a Softmax or Sigmoid function for mask estimation:
w = xUᵀ and ŝᵢ = dᵢV.
c. Encoder with ReLU activation and linear decoder, with a Softmax or Sigmoid function for
mask estimation: w = ReLU(xUᵀ) and ŝᵢ = dᵢV.
Encoder & decoder
◉ The pseudo-inverse autoencoder leads
to the worst performance, indicating
that an explicit autoencoder
configuration does not necessarily
improve the separation score.
◉ We use a linear encoder and decoder with
the Sigmoid function in all the following
experiments.
Best parameters
(Table: SI-SNRi for different hyperparameter configurations; the best five systems are highlighted.)
Best parameters
Increasing the number of basis signals in the encoder/decoder increases the overcompleteness of
the basis signals and improves the performance.
Best parameters
Shorter segment lengths consistently improve performance.
Note that the best system uses a filter length of only 2 ms (L = 16 samples at 8 kHz = 0.002 s),
which makes it very difficult to train a deep LSTM network with the same L due to the large
number of time steps in the encoder output.
Best parameters
The ratio of hidden to bottleneck channels (H/B) was found to be best around 5.
Regardless of receptive field, deeper networks lead to better performance, possibly due to the
increased model capacity.
Different models
◉ While BLSTM-TasNet already outperformed the IRM
and IBM, the non-causal Conv-TasNet significantly
surpasses the performance of all three ideal T-F masks
in the SI-SNRi and SDRi metrics.
◉ Moreover, it does so with a significantly smaller model size
compared with all previous methods.
Different models
◉ The non-causal Conv-TasNet system
significantly outperforms all previous
STFT-based systems in SDRi.
◉ The causal Conv-TasNet significantly
outperforms even the other two non-causal
STFT-based systems.
Quality evaluation - PESQ
◉ In addition to SDRi and SI-SNRi, we evaluated the subjective (by MOS) and objective (by PESQ)
quality of the separated speech and compared them with three ideal time-frequency magnitude
masks.
◉ However, since PESQ aims to predict the subjective quality of speech, human quality evaluation
can be considered the ground truth.
Quality evaluation - MOS
◉ Therefore, we conducted a psychophysics experiment in which we asked 40 normal hearing
subjects to listen and rate the quality of the separated speech sounds.
◉ We randomly chose 25 two-speaker mixture sounds from WSJ0-2mix. We avoided a possible
selection bias by ensuring that the average PESQ scores for the IRM and Conv-TasNet separated
sounds for the selected 25 samples were equal to the average PESQ scores over the entire test
set.
◉ The subjects were asked to rate the quality of the clean utterances, the IRM-separated
utterances, and the Conv-TasNet separated utterances on the scale of 1 to 5.
1: bad, 2: poor, 3: fair, 4: good, 5: excellent
Quality evaluation - MOS
◉ The MOS for Conv-TasNet is higher than the MOS for the IRM.
Quality evaluation - MOS
◉ The subjective ratings of almost all utterances for
Conv-TasNet are higher than their corresponding PESQ
scores.
◉ This observation shows that PESQ consistently underestimates
the MOS for Conv-TasNet-separated utterances, which may be
due to the dependence of PESQ on the magnitude
spectrogram of speech, which could produce lower scores
for time-domain approaches.
Speed
◉ Here we compare the processing speed of LSTM-TasNet and
causal Conv-TasNet. The speed is evaluated as the
average processing time for the systems to separate each
frame, i.e., the time per frame (TPF).
◉ The processing in LSTM-TasNet is done sequentially, meaning each time frame must wait for the
completion of the previous time frame.
◉ Since Conv-TasNet decouples the processing of consecutive frames, the processing of subsequent
frames does not have to wait, allowing the possibility of parallel computing.
◉ Since TPF << frame length (2 ms) in the CPU configuration, Conv-TasNet can perform real-time
separation.
Stability
◉ Unlike language processing tasks where sentences have determined starting words, it is difficult
to define a general starting sample or frame for speech separation and enhancement tasks.
◉ A robust audio processing system should therefore be insensitive to the starting point of the
mixture.
◉ We systematically examined the robustness of LSTM-TasNet and causal Conv-TasNet to the
starting point of the mixture by evaluating the separation accuracy for each mixture in the
WSJ0-2mix test set with different sample shifts of the input.
Stability
◉ Figure (A) shows how an example mixture performs in the different models.
◉ Figure (B) shows the standard deviation of SDRi across the whole test set.
◉ The causal Conv-TasNet performs consistently well for all shift values of the input mixture.
◉ One explanation for this inconsistency may be the sequential processing constraint in
LSTM-TasNet, which means that failures in previous frames can accumulate and affect the
separation performance in all following frames.
Dissection of autoencoder
◉ One of the motivations for replacing the STFT with
the convolutional encoder was to construct a
representation of the audio that is optimized for
speech separation.
◉ The majority of the filters are tuned to lower
frequencies. In addition, it shows that filters with
the same frequency tuning express various phase
values for that frequency.
◉ This result suggests an important role for
low-frequency features of speech such as pitch as
well as explicit encoding of the phase information
to achieve superior performance.
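A sketch of how such a dissection can be performed (assuming the TinyTasNet sketch from earlier, on CPU; the FFT size is arbitrary):

```python
import numpy as np

# Sort the learned encoder filters by their dominant frequency, mirroring the
# paper's basis-function analysis.
filters = model.encoder.weight.detach().squeeze(1).numpy()   # (N, L)
spectra = np.abs(np.fft.rfft(filters, n=256, axis=1))        # zero-padded FFT
order = np.argsort(spectra.argmax(axis=1))                   # peak-frequency order
sorted_spectra = spectra[order]                              # ready to plot
```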
Samples
Enough numbers, let's listen.
Samples
◉ Official samples: tasnet.html
◉ Inference samples (using a pre-trained model, not the official one, but the one from here):
(Audio table: mixture, spk1, spk2, clean1, clean2)
Discussions & Conclusions
Discussions
◉ Most of the filters are tuned to low acoustic frequencies. This pattern of frequency representation,
which we found using a data-driven method, roughly resembles the well-known mel-frequency scale.
◉ In addition, the overexpression of lower frequencies may indicate the importance of accurate pitch
tracking in speech separation, similar to what has been reported in human multitalker perception
studies.
◉ The filters with the same frequency tuning explicitly express various phase information. In contrast,
this information is implicit in the STFT operations.
◉ However, there are limitations:
○ The long-term tracking of an individual speaker may fail.
○ The generalization to noisy and reverberant conditions must be tested further.
Conclusion
◉ The improvements are accomplished by replacing the STFT with a convolutional
encoder-decoder architecture.
◉ Separation in Conv-TasNet is done using a TCN architecture together with depthwise
separable convolution to address the challenges of LSTM networks.
◉ Conv-TasNet significantly outperforms STFT-based speech separation systems, not to mention
its much smaller model size and shorter minimum latency.
◉ In summary, Conv-TasNet represents a significant step toward the realization of speech
separation algorithms and opens many future research directions that would further
improve it, which could eventually make automatic speech separation a common and
necessary feature of every speech processing technology designed for real-world
applications.
Architecture of WaveNet
There are four mechanisms in WaveNet:
causal dilated convolutions, gated activation functions, and residual and skip connections.
Architecture of Conv-TasNet
(Figure: the full Conv-TasNet architecture diagram.)
Gated activation units
◉ Gated activation units, the same as used in gated PixelCNN:
z_k = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)
∗: convolution; ⊙: element-wise multiplication;
σ: sigmoid function;
k: layer index; f: filter; g: gate;
W: convolution filter.
◉ Empirically found to outperform ReLUs. (A sketch follows.)
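A sketch of a gated activation unit (assuming PyTorch; left-only padding is one common way to keep the convolutions causal):

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """z = tanh(W_f * x) ⊙ sigmoid(W_g * x), as in gated PixelCNN / WaveNet."""
    def __init__(self, channels, kernel=3, dilation=1):
        super().__init__()
        self.pad = nn.ConstantPad1d((dilation * (kernel - 1), 0), 0.0)  # causal
        self.filter = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = self.pad(x)                      # pad on the left only
        return torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x))
```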
1-D Convolutional block
◉ Both residual and parameterised skip connections are
used throughout the network, to speed up convergence
and enable training of much deeper models.
◉ To further decrease the number of parameters, depthwise
separable convolution is used to replace standard
convolution in each convolutional block.
Any questions?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!