Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as well in terms of estimated speech quality.
https://ieeexplore.ieee.org/document/8682790
On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement
17th May, 2019
Daniel Michelsanti1, Zheng-Hua Tan1, Sigurdur Sigurdsson2, Jesper Jensen1,2
1Aalborg University, Department of Electronic Systems, Denmark
2Oticon A/S, Denmark
{danmi,zt,jje}@es.aau.dk {ssig,jesj}@oticon.com
2. (D. Michelsanti, 2019) CASPR - Aalborg University 2
Agenda
• Introduction
• Training Targets and Objective Functions
• Experiments
• Results
• Conclusion
3.
Introduction
Speech Enhancement
The cocktail party problem [1].
4.
Introduction
Speech Enhancement
x(n) + d(n) = y(n)
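The additive model above is easy to simulate; a minimal sketch (the sinusoidal "speech" proxy and the `mix_at_snr` helper are illustrative, not from the paper):

```python
import numpy as np

def mix_at_snr(x, d, snr_db):
    """Scale noise d and add it to clean speech x so that the mixture
    y = x + scaled noise has the requested signal-to-noise ratio (dB)."""
    p_x = np.mean(x ** 2)  # clean speech power
    p_d = np.mean(d ** 2)  # noise power
    scale = np.sqrt(p_x / (p_d * 10 ** (snr_db / 10)))
    return x + scale * d

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s "speech" proxy
d = rng.standard_normal(16000)                          # white noise
y = mix_at_snr(x, d, snr_db=0)  # at 0 dB, speech and noise powers match
```

The same helper covers the training and evaluation SNR ranges used later in the deck by varying `snr_db`.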
6.
Introduction
Audio-Visual Speech Enhancement
• Speech is generally not a unimodal process.
• Some articulatory organs that we move during speech production, like the lips, are visible to the
listener [3] and contribute to speech intelligibility in noisy environments [4].
• The influence that visual aspects have on speech perception has been studied [5, 6].
• McGurk effect [7]: an audio-visual mismatch causes the perception of a sound different from the
acoustic and the visual components of the stimulus.
7.
Introduction
Deep Learning for Audio-Visual Speech Enhancement
[Figure] A neural network model maps an input to an output. During training, the output is compared with a training target (the desired output) through an objective function (usually the mean squared error).
• Previous work focused on the design of training targets and objective functions for audio-only speech
enhancement [8-12].
• Two contributions:
• A new taxonomy for speech enhancement (previous work used heterogeneous terminology).
• A comparison of training targets and objective functions for audio-visual speech enhancement
(previous work analysed the audio-only case).
8.
Training Targets and Objective Functions
Spectrogram vs Mask
• Direct Mapping (DM): the network output Â_{k,l} directly estimates the clean amplitude A_{k,l}.
• Indirect Mapping (IM): the network outputs a mask M̂_{k,l}, which is applied to the noisy amplitude R_{k,l}, so that the product M̂_{k,l} · R_{k,l} estimates A_{k,l}.
• Mask Approximation (MA): the network output M̂_{k,l} estimates a target mask, e.g. M_{k,l} = A_{k,l} / R_{k,l}.
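The three approaches can be made concrete on toy spectrograms; a sketch assuming the ideal amplitude mask M = A/R as the mask target (array values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(4, 5))        # clean magnitudes A_{k,l} (bins x frames)
R = A + rng.uniform(0.1, 0.5, size=A.shape)   # noisy magnitudes R_{k,l} (additive noise)

# Mask target used by MA (and implicitly by IM): the ideal amplitude mask
M_iam = A / R

# A network trained with IM or MA outputs a mask M_hat; applying it to the
# noisy magnitude yields the enhanced magnitude, comparable to a DM output A_hat.
M_hat = M_iam       # pretend the network is perfect
A_hat = M_hat * R   # IM reconstruction of the clean magnitude
```

With a perfect mask the IM reconstruction recovers the clean magnitude exactly, which is what makes the three formulations comparable.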
9.
Training Targets and Objective Functions
Objective functions of the approaches used in the study, organised according to our taxonomy. Here, a = 1/(TF) and b = 1/(TQ).

STSA  DM: J = a Σ_{k,l} (A_{k,l} − Â_{k,l})²  (1)
      IM: J = a Σ_{k,l} (A_{k,l} − M̂_{k,l} R_{k,l})²  (6)
      MA: J = a Σ_{k,l} (M^{IAM}_{k,l} − M̂_{k,l})²  (11)

LSA   DM: J = a Σ_{k,l} (log(A_{k,l}) − log(Â_{k,l}))²  (2)
      IM: J = a Σ_{k,l} (log(A_{k,l}) − log(M̂_{k,l} R_{k,l}))²  (7)
      MA: –

MSA   DM: J = b Σ_{q,l} (A_{q,l} − Â_{q,l})²  (3)
      IM: J = b Σ_{q,l} (A_{q,l} − M̂_{q,l} R_{q,l})²  (8)
      MA: –

LMSA  DM: J = b Σ_{q,l} (log(A_{q,l}) − log(Â_{q,l}))²  (4)
      IM: J = b Σ_{q,l} (log(A_{q,l}) − log(M̂_{q,l} R_{q,l}))²  (9)
      MA: –

PSSA  DM: J = a Σ_{k,l} (A_{k,l} cos(θ_{k,l}) − Â_{k,l})²  (5)
      IM: J = a Σ_{k,l} (A_{k,l} cos(θ_{k,l}) − M̂_{k,l} R_{k,l})²  (10)
      MA: J = a Σ_{k,l} (M^{PSM}_{k,l} − M̂_{k,l})²  (12)

DM uses the spectrogram as a target; IM and MA use a mask as a target.
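Equations (1), (6) and (11) for the STSA target can be written down directly; a sketch with toy arrays (the "network outputs" `A_hat` and `M_hat` are stand-ins, not model predictions):

```python
import numpy as np

rng = np.random.default_rng(2)
T, F = 5, 4                                # frames, frequency bins
A = rng.uniform(0.1, 1.0, (F, T))          # clean magnitudes A_{k,l}
R = A + rng.uniform(0.1, 0.5, (F, T))      # noisy magnitudes R_{k,l}
a = 1.0 / (T * F)

A_hat = A + 0.05 * rng.standard_normal((F, T))  # DM output (stand-in)
M_hat = np.clip(A_hat / R, 0.0, 1.0)            # IM/MA output (stand-in)
M_iam = A / R                                   # ideal amplitude mask M^{IAM}

J_dm = a * np.sum((A - A_hat) ** 2)         # Eq. (1): Direct Mapping
J_im = a * np.sum((A - M_hat * R) ** 2)     # Eq. (6): Indirect Mapping
J_ma = a * np.sum((M_iam - M_hat) ** 2)     # Eq. (11): Mask Approximation
```

Since M^{IAM} − M̂ = (A − M̂R)/R, the MA cost is an R²-weighted version of the IM cost, a relation the deck returns to on slide 25.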
10.
Training Targets and Objective Functions
Short-Time Spectral Amplitude (STSA): the target is A_{k,l}.
11.
Training Targets and Objective Functions
Log Spectral Amplitude (LSA): the target is log(A_{k,l}). Introduced because a logarithmic law better reflects human loudness perception [15].
12.
Training Targets and Objective Functions
Mel-Scaled Spectral Amplitude (MSA): the target is A_{q,l}. Introduced because the human auditory system is more discriminative at low than at high frequencies [16].
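The mel-scaled amplitude A_{q,l} is obtained by applying a mel filterbank to the linear-frequency magnitudes A_{k,l}. A self-contained triangular-filterbank sketch in NumPy (the sample rate, FFT size, and number of bands are illustrative defaults, not the paper's settings):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=20):
    """Triangular filters mapping F = n_fft//2 + 1 linear bins to n_mels bands."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for q in range(n_mels):
        left, center, right = bins[q], bins[q + 1], bins[q + 2]
        for k in range(left, center):        # rising slope
            fb[q, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling slope
            fb[q, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
# A_mel = fb @ A maps a magnitude spectrogram A (F x T) to Q mel bands (Q x T)
```

The filters are narrow at low frequencies and wide at high frequencies, which is exactly the discriminability property motivating the MSA target.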
13.
Training Targets and Objective Functions
Log Mel-Scaled Spectral Amplitude (LMSA): the target is log(A_{q,l}). Introduced to combine the previous two considerations.
14.
Training Targets and Objective Functions
Phase-Sensitive Spectral Amplitude (PSSA): the target is A_{k,l} cos(θ_{k,l}), with θ_{k,l} = ∠X(k,l) − ∠Y(k,l). Introduced to compensate for the phase mismatch between the noisy and clean signals [9].
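The phase-sensitive target is the clean amplitude scaled by the cosine of the clean/noisy phase mismatch, i.e. the projection of the clean coefficient onto the direction of the noisy one. A sketch with toy complex STFT coefficients (the array values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy complex STFT coefficients: clean X, noise D, noisy Y = X + D
X = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
D = 0.5 * (rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5)))
Y = X + D

A = np.abs(X)                        # clean amplitude A_{k,l}
theta = np.angle(X) - np.angle(Y)    # phase mismatch θ_{k,l}
pssa = A * np.cos(theta)             # phase-sensitive target A_{k,l} cos(θ_{k,l})

# Equivalent view: rotate X by the negative noisy phase and take the real part,
# i.e. project X onto the direction of Y.
pssa_alt = np.real(X * np.exp(-1j * np.angle(Y)))
```

Because cos(θ) ≤ 1, the phase-sensitive target never exceeds the plain amplitude target, which shrinks the estimate wherever the noisy phase is unreliable.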
15.
Training Targets and Objective Functions
Taxonomy (summary of the table on slide 9).
16.
Experiments
Neural Network Architecture
Deep-Learning-Based Framework (figure):
• Visual front-end: face detection → face alignment → mouth region extraction.
• Audio front-end: STFT → magnitude computation.
• Video encoder: 6 × [Conv + Leaky-ReLU + BatchNorm + MaxPooling + Dropout].
• Audio encoder: 6 × [Conv + Leaky-ReLU + BatchNorm].
• Fusion sub-network: 3 × [FullyConnected + Leaky-ReLU].
• Audio decoder: 6 × [Deconv + Leaky-ReLU + BatchNorm], producing an estimated spectrogram OR an estimated mask.
• Reconstruction: ISTFT, with the phase computed from the noisy signal.
17.
Experiments
Setup
• Corpus: audio-visual GRID.
• Six kinds of additive noise: bus, cafeteria, street, pedestrian, babble and speech-shaped noise (unseen).
• SNRs: training from −20 to 20 dB in 5 dB steps; evaluation from −15 to 15 dB in 5 dB steps.
• 25 speakers for training (600 utterances each).
• 25 seen speakers for evaluation (25 utterances each).
• 6 unseen speakers for evaluation (100 utterances each).
• Evaluation metrics:
• PESQ [17] – Speech quality.
• ESTOI [18] – Speech intelligibility.
20.
Conclusion
• We proposed a new taxonomy to have a uniform terminology that links classical speech enhancement
methods with more recent techniques.
• We investigated several training targets and objective functions for audio-visual speech enhancement.
• We used a deep-learning-based framework to directly and indirectly learn the short-time spectral
amplitude of the target speech in different domains.
• The mask approximation approaches and the direct estimation of the log magnitude spectrum are the
methods that perform the best.
• In contrast to the results for audio-only speech enhancement, the use of a phase-aware mask is not as
effective in improving estimated intelligibility, especially at low SNRs.
21.
Thank You!
Any questions?
Daniel Michelsanti
danmi@es.aau.dk
22.
1. E. C. Cherry (1953). Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of
America, 25(5):975–979.
2. W. A. Sethares (2007). Rhythm and transforms. Springer Science & Business Media.
3. A. Abel and A. Hussain (2014). Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation, 6(2).
4. W. H. Sumby and I. Pollack (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America,
26(2):212–215.
5. N. P. Erber (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4).
6. Q. Summerfield (1979). Use of visual information for phonetic perception. Phonetica, 36(4-5).
7. H. McGurk and J. MacDonald (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748.
8. Y. Wang, A. Narayanan, and D. L. Wang (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio,
Speech and Language Processing (TASLP), 22(12):1849–1858.
9. H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015). Phase-sensitive and recognition-boosted speech separation using deep
recurrent neural networks. In ICASSP.
10. D. S. Williamson, Y. Wang, and D. L. Wang (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio,
Speech and Language Processing, 24(3):483–492.
11. D. L. Wang and J. Chen (2017). Supervised speech separation based on deep learning: an overview. arXiv preprint arXiv:1708.07524.
12. L. Sun, J. Du, L.-R. Dai, and C.-H. Lee (2017). Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.
13. T. Fingscheidt, S. Suhadi, and S. Stan (2008). Environment optimized speech enhancement. IEEE Transactions on Audio, Speech, and
Language Processing, vol. 16, no. 4, pp. 825–834.
BIBLIOGRAPHY
23.
14. P. C. Loizou (2005). Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE
Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869.
15. E. Zwicker and H. Fastl (2013). Psychoacoustics: Facts and models, vol. 22. Springer Science & Business Media.
16. S. S. Stevens, J. Volkmann, and E. B. Newman (1937). A scale for the measurement of the psychological magnitude pitch. The Journal of the
Acoustical Society of America, vol. 8, no. 3, pp. 185–190.
17. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001). Perceptual evaluation of speech quality (PESQ) - A new method for speech
quality assessment of telephone networks and codecs. in ICASSP.
18. J. Jensen and C. H. Taal (2016). An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022.
BIBLIOGRAPHY
24.
IMAGES
Slide 3 – Most of the icons made by OCHA or by Freepik from www.flaticon.com
25.
Training Targets and Objective Functions
Mask Approximation vs Indirect Mapping

IM:  J = 1/(TF) Σ_{k,l} (A_{k,l} − M̂_{k,l} R_{k,l})²

MA:  J = 1/(TF) Σ_{k,l} (A_{k,l}/R_{k,l} − M̂_{k,l})² = 1/(TF) Σ_{k,l} (A_{k,l} − M̂_{k,l} R_{k,l})² / R_{k,l}²

MA is nothing more than a spectrally weighted version of IM [13]: it reduces the cost of estimation errors in high-energy spectral regions of the noisy signal relative to low-energy regions, and it is related to a perceptually motivated cost function [14].
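The identity relating the two costs is easy to verify numerically; a sketch with toy arrays (values are illustrative, and an unnormalised mean replaces the 1/(TF) constant without changing the relation):

```python
import numpy as np

rng = np.random.default_rng(4)
T, F = 5, 4
A = rng.uniform(0.1, 1.0, (F, T))        # clean magnitudes A_{k,l}
R = A + rng.uniform(0.1, 0.5, (F, T))    # noisy magnitudes R_{k,l} (all > 0)
M_hat = rng.uniform(0.0, 1.0, (F, T))    # arbitrary estimated mask

J_im = np.mean((A - M_hat * R) ** 2)                    # Indirect Mapping cost
J_ma = np.mean((A / R - M_hat) ** 2)                    # Mask Approximation cost
J_ma_weighted = np.mean((A - M_hat * R) ** 2 / R ** 2)  # IM weighted by 1/R²
```

For any mask estimate, the MA cost equals the IM cost with each bin down-weighted by R², confirming the spectral-weighting interpretation above.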