Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as well in terms of estimated speech quality.
https://ieeexplore.ieee.org/document/8682790
On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement
17th May, 2019
Daniel Michelsanti1, Zheng-Hua Tan1, Sigurdur Sigurdsson2, Jesper Jensen1,2
1Aalborg University, Department of Electronic Systems, Denmark
2Oticon A/S, Denmark
{danmi,zt,jje}@es.aau.dk {ssig,jesj}@oticon.com
2. (D. Michelsanti, 2019) CASPR - Aalborg University 2
Agenda
• Introduction
• Training Targets and Objective Functions
• Experiments
• Results
• Conclusion
3.
Introduction
Speech Enhancement
The cocktail party problem [1].
4.
Introduction
Speech Enhancement
x(n) + d(n) = y(n)
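The additive model above is easy to simulate; a minimal sketch (the sinusoidal "speech" proxy and the `mix_at_snr` helper are illustrative, not from the paper):

```python
import numpy as np

def mix_at_snr(x, d, snr_db):
    """Scale noise d and add it to clean speech x so that the mixture
    y = x + scaled noise has the requested signal-to-noise ratio (dB)."""
    p_x = np.mean(x ** 2)  # clean speech power
    p_d = np.mean(d ** 2)  # noise power
    scale = np.sqrt(p_x / (p_d * 10 ** (snr_db / 10)))
    return x + scale * d

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s "speech" proxy
d = rng.standard_normal(16000)                          # white noise
y = mix_at_snr(x, d, snr_db=0)  # at 0 dB, speech and noise powers match
```

The same helper covers the training and evaluation SNR ranges used later in the deck by varying `snr_db`.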
6.
Introduction
Audio-Visual Speech Enhancement
• Speech is generally not a unimodal process.
• Some articulatory organs that we move during speech production, like the lips, are visible to the
listener [3] and contribute to speech intelligibility in noisy environments [4].
• The influence that visual aspects have on speech perception has been studied [5, 6].
• McGurk effect [7]: an audio-visual mismatch causes the perception of a sound different from the
acoustic and the visual components of the stimulus.
7.
Introduction
Deep Learning for Audio-Visual Speech Enhancement
[Figure] A neural network model maps an input to an output. During training, the output is compared with a training target (the desired output) through an objective function (usually the mean squared error).
• Previous work focused on the design of training targets and objective functions for audio-only speech
enhancement [8-12].
• Two contributions:
• A new taxonomy for speech enhancement (previous work used heterogeneous terminology).
• A comparison of training targets and objective functions for audio-visual speech enhancement
(previous work analysed the audio-only case).
8.
Training Targets and Objective Functions
Spectrogram vs Mask
• Direct Mapping (DM): the network output Â_{k,l} directly estimates the clean amplitude A_{k,l}.
• Indirect Mapping (IM): the network outputs a mask M̂_{k,l}, which is applied to the noisy amplitude R_{k,l}, so that the product M̂_{k,l} · R_{k,l} estimates A_{k,l}.
• Mask Approximation (MA): the network output M̂_{k,l} estimates a target mask, e.g. M_{k,l} = A_{k,l} / R_{k,l}.
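The three approaches can be made concrete on toy spectrograms; a sketch assuming the ideal amplitude mask M = A/R as the mask target (array values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(4, 5))        # clean magnitudes A_{k,l} (bins x frames)
R = A + rng.uniform(0.1, 0.5, size=A.shape)   # noisy magnitudes R_{k,l} (additive noise)

# Mask target used by MA (and implicitly by IM): the ideal amplitude mask
M_iam = A / R

# A network trained with IM or MA outputs a mask M_hat; applying it to the
# noisy magnitude yields the enhanced magnitude, comparable to a DM output A_hat.
M_hat = M_iam       # pretend the network is perfect
A_hat = M_hat * R   # IM reconstruction of the clean magnitude
```

With a perfect mask the IM reconstruction recovers the clean magnitude exactly, which is what makes the three formulations comparable.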
9.
Training Targets and Objective Functions
Objective functions of the approaches used in the study, organised according to our taxonomy. Here, a = 1/(TF) and b = 1/(TQ).

STSA  DM: J = a Σ_{k,l} (A_{k,l} − Â_{k,l})²  (1)
      IM: J = a Σ_{k,l} (A_{k,l} − M̂_{k,l} R_{k,l})²  (6)
      MA: J = a Σ_{k,l} (M^{IAM}_{k,l} − M̂_{k,l})²  (11)

LSA   DM: J = a Σ_{k,l} (log(A_{k,l}) − log(Â_{k,l}))²  (2)
      IM: J = a Σ_{k,l} (log(A_{k,l}) − log(M̂_{k,l} R_{k,l}))²  (7)
      MA: –

MSA   DM: J = b Σ_{q,l} (A_{q,l} − Â_{q,l})²  (3)
      IM: J = b Σ_{q,l} (A_{q,l} − M̂_{q,l} R_{q,l})²  (8)
      MA: –

LMSA  DM: J = b Σ_{q,l} (log(A_{q,l}) − log(Â_{q,l}))²  (4)
      IM: J = b Σ_{q,l} (log(A_{q,l}) − log(M̂_{q,l} R_{q,l}))²  (9)
      MA: –

PSSA  DM: J = a Σ_{k,l} (A_{k,l} cos(θ_{k,l}) − Â_{k,l})²  (5)
      IM: J = a Σ_{k,l} (A_{k,l} cos(θ_{k,l}) − M̂_{k,l} R_{k,l})²  (10)
      MA: J = a Σ_{k,l} (M^{PSM}_{k,l} − M̂_{k,l})²  (12)

DM uses the spectrogram as a target; IM and MA use a mask as a target.
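Equations (1), (6) and (11) for the STSA target can be written down directly; a sketch with toy arrays (the "network outputs" `A_hat` and `M_hat` are stand-ins, not model predictions):

```python
import numpy as np

rng = np.random.default_rng(2)
T, F = 5, 4                                # frames, frequency bins
A = rng.uniform(0.1, 1.0, (F, T))          # clean magnitudes A_{k,l}
R = A + rng.uniform(0.1, 0.5, (F, T))      # noisy magnitudes R_{k,l}
a = 1.0 / (T * F)

A_hat = A + 0.05 * rng.standard_normal((F, T))  # DM output (stand-in)
M_hat = np.clip(A_hat / R, 0.0, 1.0)            # IM/MA output (stand-in)
M_iam = A / R                                   # ideal amplitude mask M^{IAM}

J_dm = a * np.sum((A - A_hat) ** 2)         # Eq. (1): Direct Mapping
J_im = a * np.sum((A - M_hat * R) ** 2)     # Eq. (6): Indirect Mapping
J_ma = a * np.sum((M_iam - M_hat) ** 2)     # Eq. (11): Mask Approximation
```

Since M^{IAM} − M̂ = (A − M̂R)/R, the MA cost is an R²-weighted version of the IM cost, a relation the deck returns to on slide 25.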
10.
Training Targets and Objective Functions
Short-Time Spectral Amplitude (STSA): the target is A_{k,l}.
11.
Training Targets and Objective Functions
Log Spectral Amplitude (LSA): the target is log(A_{k,l}). Introduced because a logarithmic law better reflects human loudness perception [15].
12.
Training Targets and Objective Functions
Mel-Scaled Spectral Amplitude (MSA): the target is A_{q,l}. Introduced because the human auditory system is more discriminative at low than at high frequencies [16].
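The mel-scaled amplitude A_{q,l} is obtained by applying a mel filterbank to the linear-frequency magnitudes A_{k,l}. A self-contained triangular-filterbank sketch in NumPy (the sample rate, FFT size, and number of bands are illustrative defaults, not the paper's settings):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=20):
    """Triangular filters mapping F = n_fft//2 + 1 linear bins to n_mels bands."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for q in range(n_mels):
        left, center, right = bins[q], bins[q + 1], bins[q + 2]
        for k in range(left, center):        # rising slope
            fb[q, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling slope
            fb[q, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
# A_mel = fb @ A maps a magnitude spectrogram A (F x T) to Q mel bands (Q x T)
```

The filters are narrow at low frequencies and wide at high frequencies, which is exactly the discriminability property motivating the MSA target.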
13.
Training Targets and Objective Functions
Log Mel-Scaled Spectral Amplitude (LMSA): the target is log(A_{q,l}). Introduced to combine the previous two considerations.
14.
Training Targets and Objective Functions
Phase-Sensitive Spectral Amplitude (PSSA): the target is A_{k,l} cos(θ_{k,l}), with θ_{k,l} = ∠X(k,l) − ∠Y(k,l). Introduced to compensate for the phase mismatch between the noisy and clean signals [9].
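The phase-sensitive target is the clean amplitude scaled by the cosine of the clean/noisy phase mismatch, i.e. the projection of the clean coefficient onto the direction of the noisy one. A sketch with toy complex STFT coefficients (the array values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy complex STFT coefficients: clean X, noise D, noisy Y = X + D
X = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
D = 0.5 * (rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5)))
Y = X + D

A = np.abs(X)                        # clean amplitude A_{k,l}
theta = np.angle(X) - np.angle(Y)    # phase mismatch θ_{k,l}
pssa = A * np.cos(theta)             # phase-sensitive target A_{k,l} cos(θ_{k,l})

# Equivalent view: rotate X by the negative noisy phase and take the real part,
# i.e. project X onto the direction of Y.
pssa_alt = np.real(X * np.exp(-1j * np.angle(Y)))
```

Because cos(θ) ≤ 1, the phase-sensitive target never exceeds the plain amplitude target, which shrinks the estimate wherever the noisy phase is unreliable.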
15.
Training Targets and Objective Functions
Taxonomy (summary of the table on slide 9).
16.
Experiments
Neural Network Architecture
Deep-Learning-Based Framework (figure):
• Visual front-end: face detection → face alignment → mouth region extraction.
• Audio front-end: STFT → magnitude computation.
• Video encoder: 6 × [Conv + Leaky-ReLU + BatchNorm + MaxPooling + Dropout].
• Audio encoder: 6 × [Conv + Leaky-ReLU + BatchNorm].
• Fusion sub-network: 3 × [FullyConnected + Leaky-ReLU].
• Audio decoder: 6 × [Deconv + Leaky-ReLU + BatchNorm], producing an estimated spectrogram OR an estimated mask.
• Reconstruction: ISTFT, with the phase computed from the noisy signal.
17.
Experiments
Setup
• Corpus: audio-visual GRID.
• Six kinds of additive noise: bus, cafeteria, street, pedestrian, babble and speech-shaped noise (unseen).
• SNRs: training from −20 to 20 dB in 5 dB steps; evaluation from −15 to 15 dB in 5 dB steps.
• 25 speakers for training (600 utterances each).
• 25 seen speakers for evaluation (25 utterances each).
• 6 unseen speakers for evaluation (100 utterances each).
• Evaluation metrics:
• PESQ [17] – Speech quality.
• ESTOI [18] – Speech intelligibility.
20.
Conclusion
• We proposed a new taxonomy to have a uniform terminology that links classical speech enhancement
methods with more recent techniques.
• We investigated several training targets and objective functions for audio-visual speech enhancement.
• We used a deep-learning-based framework to directly and indirectly learn the short-time spectral
amplitude of the target speech in different domains.
• The mask approximation approaches and the direct estimation of the log magnitude spectrum are the
methods that perform the best.
• In contrast to the results for audio-only speech enhancement, the use of a phase-aware mask is not as
effective in improving estimated intelligibility, especially at low SNRs.
21.
Thank You!
Any questions?
Daniel Michelsanti
danmi@es.aau.dk
22.
1. E. C. Cherry (1953). Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of
America, 25(5):975–979.
2. W. A. Sethares (2007). Rhythm and transforms. Springer Science & Business Media.
3. A. Abel and A. Hussain (2014). Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation, 6(2).
4. W. H. Sumby and I. Pollack (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America,
26(2):212–215.
5. N. P. Erber (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4).
6. Q. Summerfield (1979). Use of visual information for phonetic perception. Phonetica, 36(4-5).
7. H. McGurk and J. MacDonald (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748.
8. Y. Wang, A. Narayanan, and D. L. Wang (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio,
Speech and Language Processing (TASLP), 22(12):1849–1858.
9. H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015). Phase-sensitive and recognition-boosted speech separation using deep
recurrent neural networks. In ICASSP.
10. D. S. Williamson, Y. Wang, and D. L. Wang (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio,
Speech and Language Processing, 24(3):483–492.
11. D. L. Wang and J. Chen (2017). Supervised speech separation based on deep learning: an overview. arXiv preprint arXiv:1708.07524.
12. L. Sun, J. Du, L.-R. Dai, and C.-H. Lee (2017). Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.
13. T. Fingscheidt, S. Suhadi, and S. Stan (2008). Environment optimized speech enhancement. IEEE Transactions on Audio, Speech, and
Language Processing, vol. 16, no. 4, pp. 825–834.
BIBLIOGRAPHY
23.
14. P. C. Loizou (2005). Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE
Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869.
15. E. Zwicker and H. Fastl (2013). Psychoacoustics: Facts and models, vol. 22. Springer Science & Business Media.
16. S. S. Stevens, J. Volkmann, and E. B. Newman (1937). A scale for the measurement of the psychological magnitude pitch. The Journal of the
Acoustical Society of America, vol. 8, no. 3, pp. 185–190.
17. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001). Perceptual evaluation of speech quality (PESQ) - A new method for speech
quality assessment of telephone networks and codecs. in ICASSP.
18. J. Jensen and C. H. Taal (2016). An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022.
BIBLIOGRAPHY
24.
IMAGES
Slide 3 – Most of the icons made by OCHA or by Freepik from www.flaticon.com
25.
Training Targets and Objective Functions
Mask Approximation vs Indirect Mapping

IM:  J = 1/(TF) Σ_{k,l} (A_{k,l} − M̂_{k,l} R_{k,l})²

MA:  J = 1/(TF) Σ_{k,l} (A_{k,l}/R_{k,l} − M̂_{k,l})² = 1/(TF) Σ_{k,l} (A_{k,l} − M̂_{k,l} R_{k,l})² / R_{k,l}²

MA is nothing more than a spectrally weighted version of IM [13]: it reduces the cost of estimation errors in high-energy spectral regions of the noisy signal relative to low-energy regions, and it is related to a perceptually motivated cost function [14].
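The identity relating the two costs is easy to verify numerically; a sketch with toy arrays (values are illustrative, and an unnormalised mean replaces the 1/(TF) constant without changing the relation):

```python
import numpy as np

rng = np.random.default_rng(4)
T, F = 5, 4
A = rng.uniform(0.1, 1.0, (F, T))        # clean magnitudes A_{k,l}
R = A + rng.uniform(0.1, 0.5, (F, T))    # noisy magnitudes R_{k,l} (all > 0)
M_hat = rng.uniform(0.0, 1.0, (F, T))    # arbitrary estimated mask

J_im = np.mean((A - M_hat * R) ** 2)                    # Indirect Mapping cost
J_ma = np.mean((A / R - M_hat) ** 2)                    # Mask Approximation cost
J_ma_weighted = np.mean((A - M_hat * R) ** 2 / R ** 2)  # IM weighted by 1/R²
```

For any mask estimate, the MA cost equals the IM cost with each bin down-weighted by R², confirming the spectral-weighting interpretation above.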