Pascual, Santiago, Antonio Bonafonte, and Joan Serrà. "SEGAN: Speech Enhancement Generative Adversarial Network." INTERSPEECH 2017.
Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.
1. SEGAN: Speech Enhancement Generative Adversarial Network
Santiago Pascual, Antonio Bonafonte, Joan Serrà
TALP-UPC, BarcelonaTech & Telefónica Research, Barcelona
santi.pascual@upc.edu
[arxiv] [github] [samples]
4. Introduction
● Speech Enhancement: improve the intelligibility and quality of speech contaminated
by noise.
● Classic methods: Spectral subtraction, Wiener filtering, Statistical-model based
methods, Sub-space algorithms...
● Neural networks have also been applied to speech enhancement since the 80s!
● Most of the current systems are based on the short-time Fourier
analysis/synthesis framework: spectral domain features, signal assumptions.
● Significant improvements of speech quality are possible, especially when a clean
phase spectrum is known.
● Deep learning makes us wonder → What about using waveforms?
5. Generative Adversarial Networks (GAN)
We have a pair of networks, Generator (G) and Discriminator (D):
● They “fight” against each other during training → Adversarial Training
● G's mission: make its pdf, Pmodel, as similar as possible to our training set
distribution Pdata → produce samples so realistic that D can't distinguish them
● D's mission: distinguish between G's samples and real samples
7. Adversarial Training (Goodfellow et al. 2014)
We have networks G and D, and training set with pdf Pdata. Notation:
● θ(G), θ(D) (Parameters of model G and D respectively)
● x ~ Pdata (M-dim sample from training data pdf)
● z ~ N(0, I) (sample from prior pdf, e.g. N-dim normal)
● G(z) = ẍ ~ Pg (M-dim sample from G network)
D network receives x or ẍ inputs → decides whether input is real or fake. It is optimized to learn: x is
real (1), ẍ is fake (0) (binary classifier).
G network maps sample z to G(z) = ẍ → it is optimized to maximize D mistakes.
NIPS 2016 Tutorial: Generative Adversarial Networks. Ian Goodfellow
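With this notation, the original minimax objective from Goodfellow et al. (2014) reads:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim P_{data}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```

D ascends on V (better classification), while G descends on it (more convincing fakes).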
8. Adversarial Training (batch update)
● Pick a sample x from the training set
● Show x to D and update its weights to output 1 (real)
9. Adversarial Training (batch update)
● Sample z and generate a fake sample ẍ = G(z)
● Show ẍ to D and update its weights to output 0 (fake)
10. Adversarial Training (batch update)
● Freeze D weights
● Update G weights to make D output 1 (just G weights!)
● Unfreeze D weights and repeat
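The alternating updates above can be sketched with a toy scalar GAN in NumPy (the scalar model, learning rates, and distributions here are illustrative choices, not from the paper): D is a logistic regressor, G an affine map, and the data pdf is N(3, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    # Clip pre-activations to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

# Toy scalar GAN: D(x) = sigmoid(a*x + b), G(z) = w*z + c, Pdata = N(3, 1)
a, b = 1.0, 0.0   # D parameters
w, c = 1.0, 0.0   # G parameters
lr, batch = 0.05, 64

for _ in range(2000):
    # D step: push D(x) -> 1 on real samples, D(G(z)) -> 0 on fakes
    x = rng.normal(3.0, 1.0, batch)
    fake = w * rng.normal(size=batch) + c
    gr = -(1.0 - sigmoid(a * x + b))     # grad of -log D(x) w.r.t. pre-activation
    gf = sigmoid(a * fake + b)           # grad of -log(1 - D(fake))
    a -= lr * np.mean(gr * x + gf * fake)
    b -= lr * np.mean(gr + gf)

    # G step: D is "frozen" (a, b untouched); push D(G(z)) -> 1
    z = rng.normal(size=batch)
    fake = w * z + c
    g = -(1.0 - sigmoid(a * fake + b))   # grad of -log D(fake)
    w -= lr * np.mean(g * a * z)
    c -= lr * np.mean(g * a)

# After training, c (the mean of G's outputs) should have drifted
# toward the data mean of 3
```

Note how the G step backpropagates *through* D's parameters (the factor `a`) without touching them, exactly the freeze/unfreeze scheme of the slides.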
11. Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
Key idea: as D is trained to detect fraud, its parameters learn discriminative
features of “what is real/fake”. As backprop flows through D into G, information
leaks about what bank notes need to look real. This lets G make small
corrections, step by step, getting closer and closer to what a real sample
would be.
● Caveat: this means GANs are not directly suitable for discrete token
prediction (e.g. words), because in a discrete space there is no “small change”
that reaches a neighbouring word (they can work in a word embedding space,
for example)
12.–15. Adversarial Training analogy (cont.)
[Figures: the counterfeiter's notes are rejected for one reason after another —
“It's not even green” → FAKE; “There is no watermark” → FAKE; “Watermark should
be rounded” → FAKE — until the police can no longer tell: “Real?”]
After enough iterations, and if the counterfeiter is good enough (in terms of
the G network, “has enough parameters”), the police should be confused.
16. Conditioned GANs
GANs can be conditioned on extra information besides z: text, labels, speech, etc.
z captures random characteristics of the data (the variability of plausible
futures), whilst c conditions the deterministic parts.
For details on ways to condition GANs:
Ways of Conditioning Generative Adversarial Networks (Kwak et al.)
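The simplest conditioning scheme is to concatenate c with z before feeding G, which is also what SEGAN does with its encoder output. A minimal sketch (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))   # latent codes, one row per example
c = rng.normal(size=(4, 8))    # conditioning info (label embedding, encoder output, ...)

# Concatenate condition and latent along the feature axis before feeding G
g_in = np.concatenate([z, c], axis=1)
print(g_in.shape)  # (4, 24)
```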
17. Least Squares GAN
Main idea: shift to a loss function that provides smooth, non-saturating
gradients in D
● Because of sigmoid saturation in the binary classification loss, G gets no
information once D confidently labels the true examples → vanishing gradients
prevent G from learning
● The least squares loss improves learning by introducing a notion of distance
from Pmodel to Pdata:
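With the 0/1 coding used in these slides, the least-squares objectives of Mao et al. replace the cross-entropy terms:

```latex
\min_D \; \tfrac{1}{2}\,\mathbb{E}_{x \sim P_{data}}\!\left[(D(x) - 1)^2\right]
        + \tfrac{1}{2}\,\mathbb{E}_{z}\!\left[D(G(z))^2\right]
\qquad
\min_G \; \tfrac{1}{2}\,\mathbb{E}_{z}\!\left[(D(G(z)) - 1)^2\right]
```

The squared terms penalize fakes in proportion to how far D's output is from the real label, so G keeps receiving gradient even when D is confident.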
Least Squares Generative Adversarial
Networks, Mao et al. 2016
18. Speech Enhancement GAN
Requirements:
● End-to-end: no assumptions about, and no discarding of, information in the data
○ wav2wav.
● One-shot generation: no slow recursive operations as in WaveNet
○ fully convolutional structure.
● Many noise types and speakers learned with same shared parameters
○ generalize in those dimensions.
19. Underlying conv structure in the system
● Conv1D filters
● Virtual Batch Normalization: normalize layer responses with statistics
from a fixed reference batch + the current batch → statistics less dependent
on the current batch, mitigating GAN instability
● LeakyReLU/ParametricReLU:
○ α fixed (0.3) or learnable
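As a quick reference, the two activations differ only in whether the negative-side slope α is fixed or learned; a NumPy sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    # Fixed negative slope (alpha = 0.3, as used in SEGAN's discriminator)
    return np.where(x >= 0, x, alpha * x)

def prelu(x, alpha):
    # Same formula, but alpha is a learnable parameter updated by backprop
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # values: -0.6, -0.15, 0.0, 1.5
```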
20. Speech Enhancement GAN
Two stages in the Generator:
1. Encoder (Downconv): project the noisy signal
into a deterministic representation c and
concatenate it with the latent variable z ~ N(0, I)
2. Decoder (Deconv): interpolate the
intermediate hidden features with learnable
params. until the clean speech is re-generated.
21. Speech Enhancement GAN
G architecture:
● 22 1D conv layers with kernel width = 31
● Increase feature maps in encoder from 32 to 1024, and all way back in the decoder to
one final waveform bounded [-1, 1].
● Use of skip connections to shuttle the low-level features to the decoder
layers (avoiding the bottleneck for them)
● Use of PReLU activations
D architecture:
● Same as G encoder except for having 2 input channels (Noisy, Real/Fake)
● LeakyReLUs with alpha = 0.3
● Virtual Batch Normalization to speed up convergence and improve stability
● Classification output unit
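Assuming the 22 layers split evenly into 11 encoder and 11 decoder layers, each encoder layer halving the time dimension (stride 2), the shapes can be walked through as follows (a sketch from the slide's numbers, not the released code):

```python
def encoder_shapes(input_len=16384, n_layers=11, stride=2):
    """Time-dimension length after each strided 1-D conv layer."""
    lens = [input_len]
    for _ in range(n_layers):
        lens.append(lens[-1] // stride)
    return lens

print(encoder_shapes())
# [16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8]
```

So a 16384-sample canvas is compressed to 8 time steps at the bottleneck, where z is concatenated, and the decoder mirrors the same walk back up.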
22. Experimental setup: database
● Publicly available Edinburgh dataset: http://datashare.is.ed.ac.uk/handle/10283/1942
● Amount of data and types of speakers and noises fit our purposes
● Data re-sampled at 16kHz
● Training set:
○ 40 noise conditions: 10 types of noise w/ 4 SNR conditions {15, 10, 5, 0} dB
○ 10 sentences in each condition for every speaker
○ 28 speakers
● Test set (nothing seen during training: neither the noise conditions nor the speakers):
○ 20 noise conditions: 5 types of noise w/ 4 SNR conditions each {17.5, 12.5, 7.5, 2.5} dB
○ 20 sentences for each condition per speaker
○ 2 speakers (male and female)
23. Experimental setup: training
● Show pairs of (noisy, clean) signals so the model “learns” a reconstruction loss.
● Use of L1 regularization to guide the GAN training.
24. Experimental setup: training
● The training set was chunked with 50% overlap → canvases of ~1 second each
(16384 samples).
● Batch size of 400 samples
● Training for 86 epochs
● Distributed training among up to 4 GPUs → ~ 18h total time
● Very low learning rate: 0.0002
● RMSprop optimizer
● L1 regularization weight: 100
● Pre-emphasis and de-emphasis with factor 0.95 helped get rid of high-freq. artifacts!
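The pre-/de-emphasis pair is a simple first-order filter and its inverse; a NumPy sketch of the round trip (factor 0.95 as in the slide):

```python
import numpy as np

def preemphasis(x, coef=0.95):
    # y[n] = x[n] - coef * x[n-1]  (boosts high frequencies before training)
    return np.concatenate(([x[0]], x[1:] - coef * x[:-1]))

def deemphasis(y, coef=0.95):
    # Inverse filter: x[n] = y[n] + coef * x[n-1]
    x = np.zeros_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coef * x[n - 1]
    return x

x = np.random.default_rng(0).normal(size=100)
assert np.allclose(deemphasis(preemphasis(x)), x)  # round trip recovers the signal
```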
Final G loss: LSGAN adversarial + weighted L1 regularization
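Putting the pieces together (x̃ the noisy input, x the clean target, λ the L1 weight), the final generator loss is the LSGAN term plus the weighted L1 term:

```latex
\mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\!\left[
    \left(D(G(z, \tilde{x}), \tilde{x}) - 1\right)^2 \right]
  + \lambda \, \big\lVert G(z, \tilde{x}) - x \big\rVert_1,
\qquad \lambda = 100
```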
25. Experimental setup: test
Objective evaluation:
● PESQ: Perceptual Evaluation of Speech Quality [-0.5, 4.5]: designed for
assessing speech quality over telephone networks and codecs
● COVL: MOS prediction of the overall effect [1, 5]
○ CSIG: Mean opinion score (MOS) prediction of the signal distortion attending only to the
speech signal [1, 5]
○ CBAK: MOS prediction of the intrusiveness of background noise [1, 5]
● SSNR: Segmental SNR [0, inf)
26. Experimental setup: test
Subjective evaluation (perceptual test):
● 20 sentences evaluated. Each utterance is shown in random order across the
different systems. Subjects can listen/compare as many times as desired.
● Cherry picking meaningful types of noise for both speakers (we had no tags,
so by hand).
● 16 subjects rated from 1 (bad) to 5 (excellent) a trade-off between:
○ How much noise is removed
○ How intact the speech signal remains after enhancement
27. Results: objective
Worse PESQ than the baseline, but better on all the other metrics. Moreover,
CSIG/CBAK/COVL correlate better with enhancement quality than PESQ. The much
higher SSNR → SEGAN removes much more noise.
29. Conclusions
● An end-to-end speech enhancement method has been implemented within
the GAN framework.
● It works as a fully convolutional encoder-decoder structure, making it faster than
recursive solutions.
● The results show its effectiveness, but there is margin for improvement.
● Further development of the architecture design is work in progress.
● Next stage: voice conversion by adaptation of this architecture.