Pascual, Santiago, Antonio Bonafonte, and Joan Serrà. "SEGAN: Speech Enhancement Generative Adversarial Network." INTERSPEECH 2017.
Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.
1. SEGAN: Speech Enhancement Generative Adversarial Network
Santiago Pascual, Antonio Bonafonte, Joan Serrà
TALP-UPC, BarcelonaTech & Telefónica Research, Barcelona
santi.pascual@upc.edu
[arxiv] [github] [samples]
4. Introduction
● Speech Enhancement: improve the intelligibility and quality of speech contaminated
by noise.
● Classic methods: Spectral subtraction, Wiener filtering, Statistical-model based
methods, Sub-space algorithms...
● Neural networks have also been applied to speech enhancement since the 80s!
● Most of the current systems are based on the short-time Fourier
analysis/synthesis framework: spectral domain features, signal assumptions.
● Significant improvements of speech quality are possible, especially when a clean
phase spectrum is known.
● Deep learning makes us wonder → What about using waveforms?
5. Generative Adversarial Networks (GAN)
We have a pair of networks, Generator (G) and Discriminator (D):
● They “fight” against each other during training → Adversarial Training
● G's mission: make its pdf, Pmodel, as similar as possible to our training set
distribution Pdata → produce samples so realistic that D can't distinguish them
● D's mission: distinguish between G's samples and real samples
7. Adversarial Training (Goodfellow et al. 2014)
We have networks G and D, and training set with pdf Pdata. Notation:
● θ(G), θ(D) (Parameters of model G and D respectively)
● x ~ Pdata (M-dim sample from training data pdf)
● z ~ N(0, I) (sample from prior pdf, e.g. N-dim normal)
● G(z) = ẍ ~ Pg (M-dim sample from G network)
D network receives x or ẍ inputs → decides whether input is real or fake. It is optimized to learn: x is
real (1), ẍ is fake (0) (binary classifier).
G network maps sample z to G(z) = ẍ → it is optimized to maximize D mistakes.
NIPS 2016 Tutorial: Generative Adversarial Networks. Ian Goodfellow
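With this notation, the original minimax objective from Goodfellow et al. (2014) reads:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim P_{data}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```

D ascends on V (better classification), while G descends on it (more convincing fakes).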
8. Adversarial Training (batch update)
● Pick a sample x from the training set
● Show x to D and update its weights to output 1 (real)
9. Adversarial Training (batch update)
● Sample z and generate a fake sample ẍ = G(z)
● Show ẍ to D and update its weights to output 0 (fake)
10. Adversarial Training (batch update)
● Freeze D weights
● Update G weights to make D output 1 (just G weights!)
● Unfreeze D weights and repeat
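The alternating updates above can be sketched with a toy scalar GAN in NumPy (the scalar model, learning rates, and distributions here are illustrative choices, not from the paper): D is a logistic regressor, G an affine map, and the data pdf is N(3, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    # Clip pre-activations to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

# Toy scalar GAN: D(x) = sigmoid(a*x + b), G(z) = w*z + c, Pdata = N(3, 1)
a, b = 1.0, 0.0   # D parameters
w, c = 1.0, 0.0   # G parameters
lr, batch = 0.05, 64

for _ in range(2000):
    # D step: push D(x) -> 1 on real samples, D(G(z)) -> 0 on fakes
    x = rng.normal(3.0, 1.0, batch)
    fake = w * rng.normal(size=batch) + c
    gr = -(1.0 - sigmoid(a * x + b))     # grad of -log D(x) w.r.t. pre-activation
    gf = sigmoid(a * fake + b)           # grad of -log(1 - D(fake))
    a -= lr * np.mean(gr * x + gf * fake)
    b -= lr * np.mean(gr + gf)

    # G step: D is "frozen" (a, b untouched); push D(G(z)) -> 1
    z = rng.normal(size=batch)
    fake = w * z + c
    g = -(1.0 - sigmoid(a * fake + b))   # grad of -log D(fake)
    w -= lr * np.mean(g * a * z)
    c -= lr * np.mean(g * a)

# After training, c (the mean of G's outputs) should have drifted
# toward the data mean of 3
```

Note how the G step backpropagates *through* D's parameters (the factor `a`) without touching them, exactly the freeze/unfreeze scheme of the slides.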
11. Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D)
has to detect whether money is real or fake.
Key idea: as D is trained to detect fraud, its parameters learn discriminative
features of “what is real/fake”. As backprop flows through D into G, information
leaks about what bank notes need to look real. This lets G make small
corrections, step by step, getting closer and closer to what a real sample
would be.
● Caveat: this means GANs are not directly suitable for discrete token
prediction (e.g. words), because in a discrete space there is no “small change”
that reaches a neighbouring word (they can work in a word embedding space,
for example)
12.–15. Adversarial Training analogy (cont.)
[Figures: the counterfeiter's notes are rejected for one reason after another —
“It's not even green” → FAKE; “There is no watermark” → FAKE; “Watermark should
be rounded” → FAKE — until the police can no longer tell: “Real?”]
After enough iterations, and if the counterfeiter is good enough (in terms of
the G network, “has enough parameters”), the police should be confused.
16. Conditioned GANs
GANs can be conditioned on extra information besides z: text, labels, speech, etc.
z captures random characteristics of the data (the variability of plausible
futures), whilst c conditions the deterministic parts.
For details on ways to condition GANs:
Ways of Conditioning Generative Adversarial Networks (Kwak et al.)
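The simplest conditioning scheme is to concatenate c with z before feeding G, which is also what SEGAN does with its encoder output. A minimal sketch (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))   # latent codes, one row per example
c = rng.normal(size=(4, 8))    # conditioning info (label embedding, encoder output, ...)

# Concatenate condition and latent along the feature axis before feeding G
g_in = np.concatenate([z, c], axis=1)
print(g_in.shape)  # (4, 24)
```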
17. Least Squares GAN
Main idea: shift to a loss function that provides smooth, non-saturating
gradients in D
● Because of sigmoid saturation in the binary classification loss, G gets no
information once D confidently labels the true examples → vanishing gradients
prevent G from learning
● The least squares loss improves learning by introducing a notion of distance
from Pmodel to Pdata:
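With the 0/1 coding used in these slides, the least-squares objectives of Mao et al. replace the cross-entropy terms:

```latex
\min_D \; \tfrac{1}{2}\,\mathbb{E}_{x \sim P_{data}}\!\left[(D(x) - 1)^2\right]
        + \tfrac{1}{2}\,\mathbb{E}_{z}\!\left[D(G(z))^2\right]
\qquad
\min_G \; \tfrac{1}{2}\,\mathbb{E}_{z}\!\left[(D(G(z)) - 1)^2\right]
```

The squared terms penalize fakes in proportion to how far D's output is from the real label, so G keeps receiving gradient even when D is confident.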
Least Squares Generative Adversarial
Networks, Mao et al. 2016
18. Speech Enhancement GAN
Requirements:
● End-to-end: no assumptions about, and no discarding of, information in the data
○ wav2wav.
● One-shot generation: no slow recursive operations as in WaveNet
○ fully convolutional structure.
● Many noise types and speakers learned with same shared parameters
○ generalize in those dimensions.
19. Underlying conv structure in the system
● Conv1D filters
● Virtual Batch Normalization: normalize layer responses with statistics
from a fixed reference batch + the current batch → statistics less dependent
on the current batch, mitigating GAN instability
● LeakyReLU/ParametricReLU:
○ α fixed (0.3) or learnable
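As a quick reference, the two activations differ only in whether the negative-side slope α is fixed or learned; a NumPy sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    # Fixed negative slope (alpha = 0.3, as used in SEGAN's discriminator)
    return np.where(x >= 0, x, alpha * x)

def prelu(x, alpha):
    # Same formula, but alpha is a learnable parameter updated by backprop
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # values: -0.6, -0.15, 0.0, 1.5
```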
20. Speech Enhancement GAN
Two stages in the Generator:
1. Encoder (Downconv): project the noisy signal
into a deterministic representation c and
concatenate it with the latent variable z ~ N(0, I)
2. Decoder (Deconv): interpolate the
intermediate hidden features with learnable
params. until the clean speech is re-generated.
21. Speech Enhancement GAN
G architecture:
● 22 1D conv layers with kernel width = 31
● Increase feature maps in encoder from 32 to 1024, and all way back in the decoder to
one final waveform bounded [-1, 1].
● Use of skip connections to shuttle the low-level features to the decoder
layers (avoiding the bottleneck for them)
● Use of PReLU activations
D architecture:
● Same as G encoder except for having 2 input channels (Noisy, Real/Fake)
● LeakyReLUs with alpha = 0.3
● Virtual Batch Normalization to speed up convergence and improve stability
● Classification output unit
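Assuming the 22 layers split evenly into 11 encoder and 11 decoder layers, each encoder layer halving the time dimension (stride 2), the shapes can be walked through as follows (a sketch from the slide's numbers, not the released code):

```python
def encoder_shapes(input_len=16384, n_layers=11, stride=2):
    """Time-dimension length after each strided 1-D conv layer."""
    lens = [input_len]
    for _ in range(n_layers):
        lens.append(lens[-1] // stride)
    return lens

print(encoder_shapes())
# [16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8]
```

So a 16384-sample canvas is compressed to 8 time steps at the bottleneck, where z is concatenated, and the decoder mirrors the same walk back up.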
22. Experimental setup: database
● Publicly available Edinburgh dataset: http://datashare.is.ed.ac.uk/handle/10283/1942
● Amount of data and types of speakers and noises fit our purposes
● Data re-sampled at 16kHz
● Training set:
○ 40 noise conditions: 10 types of noise w/ 4 SNR conditions {15, 10, 5, 0} dB
○ 10 sentences in each condition for every speaker
○ 28 speakers
● Test set (nothing seen during training: neither the noise conditions nor the speakers):
○ 20 noise conditions: 5 types of noise w/ 4 SNR conditions each {17.5, 12.5, 7.5, 2.5} dB
○ 20 sentences for each condition per speaker
○ 2 speakers (male and female)
23. Experimental setup: training
● Show pairs of (noisy, clean) signals so the model “learns” a reconstruction loss.
● Use of L1 regularization to guide the GAN training.
24. Experimental setup: training
● The training set was chunked with 50% overlap → canvases of ~1 second each
(16384 samples).
● Batch size of 400 samples
● Training for 86 epochs
● Distributed training among up to 4 GPUs → ~ 18h total time
● Very low learning rate: 0.0002
● RMSprop optimizer
● L1 regularization weight: 100
● Pre-emphasis and de-emphasis with factor 0.95 helped get rid of high-freq. artifacts!
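The pre-/de-emphasis pair is a simple first-order filter and its inverse; a NumPy sketch of the round trip (factor 0.95 as in the slide):

```python
import numpy as np

def preemphasis(x, coef=0.95):
    # y[n] = x[n] - coef * x[n-1]  (boosts high frequencies before training)
    return np.concatenate(([x[0]], x[1:] - coef * x[:-1]))

def deemphasis(y, coef=0.95):
    # Inverse filter: x[n] = y[n] + coef * x[n-1]
    x = np.zeros_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coef * x[n - 1]
    return x

x = np.random.default_rng(0).normal(size=100)
assert np.allclose(deemphasis(preemphasis(x)), x)  # round trip recovers the signal
```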
Final G loss: LSGAN adversarial + weighted L1 regularization
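Putting the pieces together (x̃ the noisy input, x the clean target, λ the L1 weight), the final generator loss is the LSGAN term plus the weighted L1 term:

```latex
\mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\!\left[
    \left(D(G(z, \tilde{x}), \tilde{x}) - 1\right)^2 \right]
  + \lambda \, \big\lVert G(z, \tilde{x}) - x \big\rVert_1,
\qquad \lambda = 100
```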
25. Experimental setup: test
Objective evaluation:
● PESQ: Perceptual Evaluation of Speech Quality [-0.5, 4.5]: designed for
assessing speech quality over telephone networks and codecs
● COVL: MOS prediction of the overall effect [1, 5]
○ CSIG: Mean opinion score (MOS) prediction of the signal distortion attending only to the
speech signal [1, 5]
○ CBAK: MOS prediction of the intrusiveness of background noise [1, 5]
● SSNR: Segmental SNR [0, inf)
26. Experimental setup: test
Subjective evaluation (perceptual test):
● 20 sentences evaluated. Each utterance is shown in random order across the
different systems. Subjects can listen/compare as many times as desired.
● Cherry picking meaningful types of noise for both speakers (we had no tags,
so by hand).
● 16 subjects rated from 1 (bad) to 5 (excellent) a trade-off between:
○ How much noise is removed
○ How intact the speech signal remains after enhancement
27. Results: objective
Worse PESQ than the baseline, but better on all the other metrics. Moreover,
CSIG/CBAK/COVL correlate better with enhancement quality than PESQ. The much
higher SSNR → SEGAN removes much more noise.
29. Conclusions
● An end-to-end speech enhancement method has been implemented within
the GAN framework.
● It works as a fully convolutional encoder-decoder structure, making it faster than
recursive solutions.
● The results show its effectiveness, but there is margin for improvement.
● Further development of the architecture design is work in progress.
● Next stage: voice conversion by adaptation of this architecture.