On Training Targets and Objective Functions for
Deep-Learning-Based Audio-Visual Speech Enhancement
17th May, 2019
Daniel Michelsanti (1), Zheng-Hua Tan (1), Sigurdur Sigurdsson (2), Jesper Jensen (1, 2)
(1) Aalborg University, Department of Electronic Systems, Denmark
(2) Oticon A/S, Denmark
{danmi,zt,jje}@es.aau.dk {ssig,jesj}@oticon.com
(D. Michelsanti, 2019) CASPR - Aalborg University 2
Agenda
• Introduction
• Training Targets and Objective Functions
• Experiments
• Results
• Conclusion
Introduction
Speech Enhancement
The cocktail party problem [1].
x(n) + d(n) = y(n)
R_{k,l} = |Y(k,l)| = |X(k,l) + D(k,l)|,   A_{k,l} = |X(k,l)|   [2]
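The additive model and the two magnitude spectrograms can be sketched numerically as follows. This is only an illustration: the frame length, hop size, sampling rate and the synthetic signals are assumptions made here, not values taken from the slides, and the small `stft` helper is a hypothetical stand-in for whatever STFT implementation is actually used.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (F, T)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)

# Hypothetical signals: a tone standing in for clean speech, white noise as interference.
fs = 16000
n = np.arange(fs)
x = np.sin(2 * np.pi * 440 * n / fs)                     # x(n)
d = 0.1 * np.random.default_rng(0).standard_normal(fs)   # d(n)
y = x + d                                                # y(n) = x(n) + d(n)

X, D, Y = stft(x), stft(d), stft(y)
A = np.abs(X)   # A_{k,l}: magnitude of the clean signal
R = np.abs(Y)   # R_{k,l}: magnitude of the noisy signal
```

Because the STFT is linear, Y(k,l) = X(k,l) + D(k,l) holds exactly, while the magnitudes R and A are not related by a simple sum.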
(D. Michelsanti, 2019) CASPR - Aalborg University 6
Introduction
Audio-Visual Speech Enhancement
• Speech is generally not a unimodal process.
• Some articulatory organs that we move during speech production, such as the lips, are visible to the
listener [3] and contribute to speech intelligibility in noisy environments [4].
• The influence that visual aspects have on speech perception has been studied [5, 6].
• McGurk effect [7]: an audio-visual mismatch causes the perception of a sound different from both the
acoustic and the visual components of the stimulus.
(D. Michelsanti, 2019) CASPR - Aalborg University 7
Introduction
Deep Learning for Audio-Visual Speech Enhancement
Neural Network
Model
Input Output
Training Target
(desired output)
Objective Function
(usually mean squared error)
• Previous work focused on the design of training targets and objective functions for audio-only speech
enhancement [8-12].
• Two contributions:
• A new taxonomy for speech enhancement (previous work uses heterogeneous terminology).
• A comparison of training targets and objective functions for audio-visual speech enhancement
(previous work analysed only the audio-only case).
(D. Michelsanti, 2019) CASPR - Aalborg University 8
Training Targets and Objective Functions
Spectrogram vs Mask
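• Direct Mapping (DM): Input → Neural Network Model → \hat{A}_{k,l}, which is compared with A_{k,l}.
• Indirect Mapping (IM): Input → Neural Network Model → \hat{M}_{k,l}; the product \hat{M}_{k,l} R_{k,l} is compared with A_{k,l}.
• Mask Approximation (MA): Input → Neural Network Model → \hat{M}_{k,l}, which is compared with the mask M_{k,l} = A_{k,l} / R_{k,l}.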
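The difference between the three approaches lies in what the network outputs and where the squared error is measured. A minimal NumPy sketch, assuming a mean squared error averaged over all time-frequency bins; the function names and the small `eps` guard against division by zero are illustrative choices, not part of the slides:

```python
import numpy as np

def dm_loss(A, A_hat):
    """Direct mapping: the network output A_hat is itself the spectrogram estimate."""
    return np.mean((A - A_hat) ** 2)

def im_loss(A, M_hat, R):
    """Indirect mapping: the network outputs a mask, the error is measured on M_hat * R."""
    return np.mean((A - M_hat * R) ** 2)

def ma_loss(A, M_hat, R, eps=1e-8):
    """Mask approximation: the error is measured on the mask M = A / R."""
    M = A / (R + eps)   # eps is an assumption added here for numerical safety
    return np.mean((M - M_hat) ** 2)

# Toy 2x2 "spectrograms" (frequency x time), chosen by hand for illustration.
A = np.array([[1.0, 2.0], [0.5, 0.0]])   # clean magnitude
R = np.array([[2.0, 2.0], [1.0, 1.0]])   # noisy magnitude
ideal_mask = A / R

print(im_loss(A, ideal_mask, R))         # 0.0: the ideal mask recovers A exactly
print(im_loss(A, np.ones_like(A), R))    # 0.5625: an all-pass mask leaves the noise in
```

In all three cases the enhanced magnitude used for reconstruction is either \hat{A}_{k,l} (DM) or \hat{M}_{k,l} R_{k,l} (IM and MA); only the quantity that enters the loss changes.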
Training Targets and Objective Functions
Objective functions of the approaches used in the study organised according to our taxonomy. Here, a = \frac{1}{TF} and b = \frac{1}{TQ}. Direct Mapping (DM) and Indirect Mapping (IM) use a spectrogram as a target; Mask Approximation (MA) uses a mask as a target.
STSA
• DM: J = a \sum_{k,l} (A_{k,l} - \hat{A}_{k,l})^2   (1)
• IM: J = a \sum_{k,l} (A_{k,l} - \hat{M}_{k,l} R_{k,l})^2   (6)
• MA: J = a \sum_{k,l} (M^{IAM}_{k,l} - \hat{M}_{k,l})^2   (11)
LSA
• DM: J = a \sum_{k,l} (\log(A_{k,l}) - \log(\hat{A}_{k,l}))^2   (2)
• IM: J = a \sum_{k,l} (\log(A_{k,l}) - \log(\hat{M}_{k,l} R_{k,l}))^2   (7)
• MA: -
MSA
• DM: J = b \sum_{q,l} (A_{q,l} - \hat{A}_{q,l})^2   (3)
• IM: J = b \sum_{q,l} (A_{q,l} - \hat{M}_{q,l} R_{q,l})^2   (8)
• MA: -
LMSA
• DM: J = b \sum_{q,l} (\log(A_{q,l}) - \log(\hat{A}_{q,l}))^2   (4)
• IM: J = b \sum_{q,l} (\log(A_{q,l}) - \log(\hat{M}_{q,l} R_{q,l}))^2   (9)
• MA: -
PSSA
• DM: J = a \sum_{k,l} (A_{k,l} \cos(\theta_{k,l}) - \hat{A}_{k,l})^2   (5)
• IM: J = a \sum_{k,l} (A_{k,l} \cos(\theta_{k,l}) - \hat{M}_{k,l} R_{k,l})^2   (10)
• MA: J = a \sum_{k,l} (M^{PSM}_{k,l} - \hat{M}_{k,l})^2   (12)
Short-Time Spectral Amplitude (STSA)
A_{k,l}
Log Spectral Amplitude (LSA)
Introduced because a logarithmic law better reflects human loudness perception [15].
\log(A_{k,l})
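One practical consequence of working in the log domain is that the error depends on the ratio between target and estimate rather than on their absolute difference. A small sketch of a log-spectral loss for direct mapping, assuming a mean over all bins; the `floor` that keeps the logarithm finite is an assumption added here, not something stated on the slide:

```python
import numpy as np

def lsa_dm_loss(A, A_hat, floor=1e-6):
    """Mean squared error between log magnitudes (direct mapping)."""
    log_A = np.log(np.maximum(A, floor))          # floor avoids log(0)
    log_A_hat = np.log(np.maximum(A_hat, floor))
    return np.mean((log_A - log_A_hat) ** 2)

# The same factor-of-two error costs the same at low and at high amplitude.
quiet = lsa_dm_loss(np.array([0.01]), np.array([0.02]))
loud = lsa_dm_loss(np.array([1.0]), np.array([2.0]))
```

With a plain squared error on amplitudes the second case would be penalised 10,000 times more than the first; in the log domain both give (log 2)^2.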
Mel-Scaled Spectral Amplitude (MSA)
Introduced because the human auditory system is more discriminative at low frequencies than at
high frequencies [16].
A_{q,l}
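A mel-scaled amplitude A_{q,l} is commonly obtained by applying a bank of triangular filters, equally spaced on the mel scale, to the linear-frequency magnitude A_{k,l}. The sketch below assumes that construction, the widely used 2595 log10(1 + f/700) mel formula, and hypothetical sizes (40 mel bands, 257 frequency bins, 16 kHz); the slides do not state how the mel scaling was implemented.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_freqs, fs):
    """Triangular filters of shape (Q, F) with edges equally spaced on the mel scale."""
    freqs = np.linspace(0.0, fs / 2.0, n_freqs)   # centre frequency of each STFT bin
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2))
    lower, centre, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs - lower) / (centre - lower)   # left slope of each triangle
    falling = (upper - freqs) / (upper - centre)  # right slope of each triangle
    return np.maximum(0.0, np.minimum(rising, falling))

fs = 16000
fb = mel_filterbank(n_mels=40, n_freqs=257, fs=fs)
A = np.abs(np.random.default_rng(0).standard_normal((257, 10)))  # stand-in for A_{k,l}
A_mel = fb @ A                                                   # A_{q,l}, shape (Q, T)
```

The filters are narrow at low frequencies and wide at high frequencies, which is how the representation gives more resolution where hearing is more discriminative.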
Log Mel-Scaled Spectral Amplitude (LMSA)
Introduced to combine the previous two considerations.
\log(A_{q,l})
Phase Sensitive Spectral Amplitude (PSSA)
Introduced to compensate for the phase mismatch between noisy and clean signals [9].
A_{k,l} \cos(\theta_{k,l})
with \theta_{k,l} = \angle X(k,l) - \angle Y(k,l)
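The phase-sensitive target can be computed directly from the complex STFT coefficients of the clean and noisy signals. A minimal sketch with hand-picked coefficients; the function name is illustrative:

```python
import numpy as np

def pssa_target(X, Y):
    """A_{k,l} * cos(theta_{k,l}), with theta the clean-minus-noisy phase difference."""
    theta = np.angle(X) - np.angle(Y)
    return np.abs(X) * np.cos(theta)

# Three bins with the same clean coefficient and a noisy phase that is
# aligned, in quadrature, and opposite, respectively.
X = np.array([1 + 0j, 1 + 0j, 1 + 0j])
Y = np.array([2 + 0j, 0 + 2j, -2 + 0j])
target = pssa_target(X, Y)   # approximately [1, 0, -1]
```

The target equals the projection of the clean coefficient onto the direction of the noisy one, Re(X Y*) / |Y|, so it shrinks as the phases drift apart and becomes negative once the phase difference exceeds 90 degrees.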
Experiments
Neural Network Architecture
Deep-Learning-Based Framework
• Video pipeline: Face Detection → Face Alignment → Mouth Region Extraction → Video Encoder
• Audio pipeline: STFT → Magnitude Computation and Phase Computation; the magnitude feeds the Audio Encoder
• Video Encoder: 6 × (Conv + Leaky-ReLU + BatchNorm + MaxPooling + Dropout)
• Audio Encoder: 6 × (Conv + Leaky-ReLU + BatchNorm)
• Fusion Sub-Network: 3 × (FullyConnected + Leaky-ReLU)
• Audio Decoder: 6 × (Deconv + Leaky-ReLU + BatchNorm)
• Output: Estimated Spectrogram OR Estimated Mask, followed by the ISTFT
Experiments
Setup
• Corpus: audio-visual GRID.
• Six kinds of additive noise: bus, cafeteria, street, pedestrian, babble and speech shaped noise (unseen).
• SNRs: training from -20 dB to 20 dB in steps of 5 dB; evaluation from -15 dB to 15 dB in steps of 5 dB.
• 25 speakers for training (600 utterances each).
• 25 seen speakers for evaluation (25 utterances each).
• 6 unseen speakers for evaluation (100 utterances each).
• Evaluation metrics:
• PESQ [17] – Speech quality.
• ESTOI [18] – Speech intelligibility.
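Noisy training and evaluation material at a prescribed SNR is typically produced by scaling the noise before adding it to the clean speech. The sketch below assumes the SNR is defined from the average power over the whole utterance; the slides do not specify the exact definition, and the function name and the random stand-in signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

train_snrs = np.arange(-20, 25, 5)   # -20, -15, ..., 20 dB
eval_snrs = np.arange(-15, 20, 5)    # -15, -10, ..., 15 dB

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a clean utterance
noise = 3.0 * rng.standard_normal(16000)   # stand-in for a noise recording
noisy = mix_at_snr(speech, noise, snr_db=-5)
measured_snr = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
```

Here `measured_snr` recovers the requested -5 dB, which is a quick sanity check worth keeping when generating a corpus.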
(D. Michelsanti, 2019) CASPR - Aalborg University 18
Results
PESQ
Results in terms of PESQ. The Unproc. row refers to the unprocessed signals.
            Seen Speakers                                    | Unseen Speakers
SNR (dB)    -15   -10   -5    0     5     10    15    Avg.   | -15   -10   -5    0     5     10    15    Avg.
Unproc.     1.09  1.08  1.08  1.11  1.20  1.39  1.71  1.24   | 1.10  1.09  1.08  1.11  1.20  1.39  1.70  1.24
STSA-DM     1.27  1.35  1.48  1.65  1.86  2.08  2.31  1.71   | 1.13  1.19  1.30  1.48  1.73  1.99  2.24  1.58
LSA-DM      1.24  1.37  1.57  1.84  2.14  2.45  2.74  1.91   | 1.15  1.23  1.37  1.59  1.91  2.25  2.57  1.72
MSA-DM      1.27  1.36  1.49  1.67  1.87  2.07  2.28  1.72   | 1.14  1.20  1.32  1.51  1.75  1.99  2.21  1.59
LMSA-DM     1.27  1.39  1.56  1.78  2.01  2.18  2.31  1.79   | 1.15  1.22  1.34  1.53  1.77  1.98  2.14  1.59
PSSA-DM     1.24  1.32  1.44  1.61  1.82  2.04  2.25  1.67   | 1.13  1.18  1.28  1.45  1.70  1.94  2.17  1.55
STSA-IM     1.24  1.33  1.45  1.61  1.77  1.95  2.19  1.65   | 1.13  1.18  1.28  1.44  1.65  1.87  2.11  1.52
LSA-IM      1.17  1.25  1.39  1.60  1.89  2.19  2.49  1.71   | 1.13  1.17  1.28  1.46  1.72  2.02  2.34  1.59
MSA-IM      1.26  1.34  1.47  1.64  1.85  2.07  2.30  1.70   | 1.13  1.19  1.29  1.47  1.71  1.98  2.24  1.57
LMSA-IM     1.21  1.32  1.48  1.72  1.99  2.26  2.53  1.79   | 1.13  1.19  1.30  1.49  1.76  2.06  2.35  1.61
PSSA-IM     1.29  1.37  1.50  1.68  1.87  2.05  2.22  1.71   | 1.16  1.22  1.33  1.51  1.74  1.96  2.15  1.58
STSA-MA     1.31  1.42  1.57  1.78  2.02  2.29  2.58  1.85   | 1.15  1.21  1.32  1.52  1.81  2.15  2.48  1.66
PSSA-MA     1.28  1.38  1.54  1.78  2.08  2.40  2.71  1.88   | 1.18  1.25  1.38  1.61  1.95  2.31  2.63  1.76
ESTOI
Results in terms of ESTOI. The Unproc. row refers to the unprocessed signals.
            Seen Speakers                                    | Unseen Speakers
SNR (dB)    -15   -10   -5    0     5     10    15    Avg.   | -15   -10   -5    0     5     10    15    Avg.
Unproc.     0.08  0.15  0.24  0.35  0.47  0.58  0.67  0.36   | 0.08  0.14  0.23  0.34  0.46  0.57  0.66  0.35
STSA-DM     0.35  0.41  0.49  0.57  0.64  0.70  0.74  0.56   | 0.23  0.29  0.39  0.49  0.59  0.67  0.72  0.48
LSA-DM      0.35  0.41  0.49  0.58  0.65  0.71  0.76  0.56   | 0.24  0.30  0.39  0.49  0.60  0.68  0.73  0.49
MSA-DM      0.36  0.42  0.49  0.57  0.64  0.70  0.74  0.56   | 0.24  0.31  0.40  0.51  0.61  0.68  0.73  0.50
LMSA-DM     0.37  0.44  0.51  0.60  0.66  0.71  0.75  0.58   | 0.25  0.31  0.40  0.51  0.61  0.68  0.72  0.50
PSSA-DM     0.29  0.36  0.46  0.56  0.64  0.70  0.74  0.53   | 0.19  0.27  0.37  0.49  0.60  0.68  0.72  0.48
STSA-IM     0.33  0.40  0.48  0.56  0.64  0.69  0.74  0.55   | 0.23  0.29  0.39  0.50  0.60  0.67  0.72  0.48
LSA-IM      0.33  0.38  0.46  0.55  0.63  0.70  0.75  0.54   | 0.22  0.28  0.36  0.46  0.57  0.66  0.73  0.47
MSA-IM      0.36  0.42  0.50  0.58  0.65  0.70  0.75  0.57   | 0.25  0.31  0.40  0.50  0.60  0.68  0.73  0.50
LMSA-IM     0.36  0.42  0.50  0.59  0.66  0.72  0.76  0.57   | 0.24  0.30  0.38  0.49  0.60  0.68  0.73  0.49
PSSA-IM     0.29  0.37  0.46  0.56  0.64  0.70  0.75  0.54   | 0.21  0.28  0.38  0.50  0.61  0.68  0.73  0.48
STSA-MA     0.39  0.45  0.52  0.60  0.67  0.72  0.77  0.59   | 0.26  0.32  0.41  0.51  0.62  0.70  0.75  0.51
PSSA-MA     0.29  0.36  0.46  0.57  0.66  0.72  0.77  0.55   | 0.22  0.29  0.40  0.52  0.63  0.70  0.75  0.50
Conclusion
• We proposed a new taxonomy to have a uniform terminology that links classical speech enhancement
methods with more recent techniques.
• We investigated several training targets and objective functions for audio-visual speech enhancement.
• We used a deep-learning-based framework to learn, directly and indirectly, the short-time spectral
amplitude of the target speech in different domains.
• The mask approximation approaches and the direct estimation of the log magnitude spectrum are the
methods that perform best.
• In contrast to the results for audio-only speech enhancement, the use of a phase-aware mask is not as
effective in improving estimated intelligibility, especially at low SNRs.
Thank You!
Any questions?
Daniel Michelsanti
danmi@es.aau.dk
BIBLIOGRAPHY
1. E. C. Cherry (1953). Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America, 25(5):975–979.
2. W. A. Sethares (2007). Rhythm and transforms. Springer Science & Business Media.
3. A. Abel and A. Hussain (2014). Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation, 6(2).
4. W. H. Sumby and I. Pollack (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2):212–215.
5. N. P. Erber (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4).
6. Q. Summerfield (1979). Use of visual information for phonetic perception. Phonetica, 36(4-5).
7. H. McGurk and J. MacDonald (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748.
8. Y. Wang, A. Narayanan, and D. L. Wang (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1849–1858.
9. H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015). Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In ICASSP.
10. D. S. Williamson, Y. Wang, and D. L. Wang (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(3):483–492.
11. D. L. Wang and J. Chen (2017). Supervised speech separation based on deep learning: an overview. arXiv preprint arXiv:1708.07524.
12. L. Sun, J. Du, L.-R. Dai, and C.-H. Lee (2017). Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.
13. T. Fingscheidt, S. Suhadi, and S. Stan (2008). Environment optimized speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(4):825–834.
14. P. C. Loizou (2005). Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE Transactions on Speech and Audio Processing, 13(5):857–869.
15. E. Zwicker and H. Fastl (2013). Psychoacoustics: Facts and models, vol. 22. Springer Science & Business Media.
16. S. S. Stevens, J. Volkmann, and E. B. Newman (1937). A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3):185–190.
17. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001). Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs. In ICASSP.
18. J. Jensen and C. H. Taal (2016). An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):2009–2022.
IMAGES
Slide 3 – Most of the icons made by OCHA or by Freepik from www.flaticon.com
Training Targets and Objective Functions
Mask Approximation vs Indirect Mapping
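Indirect Mapping (IM):
J = \frac{1}{TF} \sum_{k,l} (A_{k,l} - \hat{M}_{k,l} R_{k,l})^2
Mask Approximation (MA):
J = \frac{1}{TF} \sum_{k,l} \left( \frac{A_{k,l}}{R_{k,l}} - \hat{M}_{k,l} \right)^2 = \frac{1}{TF} \sum_{k,l} \frac{(A_{k,l} - \hat{M}_{k,l} R_{k,l})^2}{R_{k,l}^2}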
MA is nothing more than a spectrally weighted version of IM [13], which reduces the cost of estimation errors
at high-energy spectral regions of the noisy signal relative to low-energy spectral regions, and is related to a
perceptually motivated cost function [14].
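The identity between the MA cost and a weighted IM cost is easy to confirm numerically. A short check on random positive values, which are purely illustrative stand-ins for A_{k,l}, R_{k,l} and the estimated mask:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(4, 5))       # clean magnitudes
R = rng.uniform(0.5, 2.0, size=(4, 5))       # noisy magnitudes (strictly positive)
M_hat = rng.uniform(0.0, 1.0, size=(4, 5))   # some estimated mask

ma_cost = np.mean((A / R - M_hat) ** 2)
im_cost = np.mean((A - M_hat * R) ** 2)
weighted_im_cost = np.mean((A - M_hat * R) ** 2 / R ** 2)

assert np.isclose(ma_cost, weighted_im_cost)
```

Each bin's IM error is divided by R_{k,l}^2, so the same absolute error counts for less where the noisy signal has high energy and for more where it has low energy.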

A MEDIAN BASED DIRECTIONAL CASCADED WITH MASK FILTER FOR REMOVAL OF RVIN
 
A MEDIAN BASED DIRECTIONAL CASCADED WITH MASK FILTER FOR REMOVAL OF RVIN
A MEDIAN BASED DIRECTIONAL CASCADED WITH MASK FILTER FOR REMOVAL OF RVINA MEDIAN BASED DIRECTIONAL CASCADED WITH MASK FILTER FOR REMOVAL OF RVIN
A MEDIAN BASED DIRECTIONAL CASCADED WITH MASK FILTER FOR REMOVAL OF RVIN
 
Iberspeech2012
Iberspeech2012Iberspeech2012
Iberspeech2012
 
Image Denoising Based On Sparse Representation In A Probabilistic Framework
Image Denoising Based On Sparse Representation In A Probabilistic FrameworkImage Denoising Based On Sparse Representation In A Probabilistic Framework
Image Denoising Based On Sparse Representation In A Probabilistic Framework
 
STUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHOD
STUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHODSTUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHOD
STUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHOD
 
STUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHOD
STUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHODSTUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHOD
STUDY ANALYSIS ON TEETH SEGMENTATION USING LEVEL SET METHOD
 
Study Analysis on Teeth Segmentation Using Level Set Method
Study Analysis on Teeth Segmentation Using Level Set MethodStudy Analysis on Teeth Segmentation Using Level Set Method
Study Analysis on Teeth Segmentation Using Level Set Method
 
Study Analysis on Teeth Segmentation Using Level Set Method
Study Analysis on Teeth Segmentation Using Level Set MethodStudy Analysis on Teeth Segmentation Using Level Set Method
Study Analysis on Teeth Segmentation Using Level Set Method
 
Boston university; operations research presentation; 2013
Boston university; operations research presentation; 2013Boston university; operations research presentation; 2013
Boston university; operations research presentation; 2013
 
Higher-order graph clustering at AMS Spring Western Sectional
Higher-order graph clustering at AMS Spring Western SectionalHigher-order graph clustering at AMS Spring Western Sectional
Higher-order graph clustering at AMS Spring Western Sectional
 
An enhanced fletcher-reeves-like conjugate gradient methods for image restora...
An enhanced fletcher-reeves-like conjugate gradient methods for image restora...An enhanced fletcher-reeves-like conjugate gradient methods for image restora...
An enhanced fletcher-reeves-like conjugate gradient methods for image restora...
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
 
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
 
離散値ベクトル再構成手法とその通信応用
離散値ベクトル再構成手法とその通信応用離散値ベクトル再構成手法とその通信応用
離散値ベクトル再構成手法とその通信応用
 
A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral SubtractionA New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
A New Approach for Speech Enhancement Based On Eigenvalue Spectral Subtraction
 

On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

  • 1. On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. 17th May, 2019. Daniel Michelsanti (1), Zheng-Hua Tan (1), Sigurdur Sigurdsson (2), Jesper Jensen (1, 2). (1) Aalborg University, Department of Electronic Systems, Denmark; (2) Oticon A/S, Denmark. {danmi,zt,jje}@es.aau.dk, {ssig,jesj}@oticon.com
  • 2. (D. Michelsanti, 2019) CASPR - Aalborg University 2 Agenda • Introduction • Training Targets and Objective Functions • Experiments • Results • Conclusion
  • 3. Introduction: Speech Enhancement. The cocktail party problem [1].
  • 4. Introduction: Speech Enhancement. $x(n) + d(n) = y(n)$
  • 5. Introduction: Speech Enhancement [2]. $R_{k,l} = |Y(k,l)| = |X(k,l) + D(k,l)|$, $A_{k,l} = |X(k,l)|$
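The signal model of slides 4 and 5 can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' code: the frame length, hop size, Hann window and random stand-in signals are all chosen here for the example, and `stft` is a hypothetical helper written with NumPy only.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Hypothetical minimal STFT: Hann-windowed frames, one-sided FFT.

    Returns a complex array of shape (frames, bins), i.e. indexed (l, k).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)        # stand-in for the clean speech x(n)
d = 0.5 * rng.standard_normal(4000)  # stand-in for the noise d(n)
y = x + d                            # noisy mixture: y(n) = x(n) + d(n)

X, D, Y = stft(x), stft(d), stft(y)
A = np.abs(X)  # clean magnitude A_{k,l} = |X(k,l)|
R = np.abs(Y)  # noisy magnitude R_{k,l} = |Y(k,l)| = |X(k,l) + D(k,l)|
```

Because the STFT is linear, `Y` equals `X + D` up to rounding, while the magnitudes `R` and `A` do not add in the same way; this is why the phase relation between clean and noisy signals matters for the phase-sensitive targets later in the talk.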
  • 6. Introduction: Audio-Visual Speech Enhancement. • Speech is generally not a unimodal process. • Some articulatory organs that we move during speech production, like the lips, are visible to the listener [3] and contribute to speech intelligibility in noisy environments [4]. • The influence of visual aspects on speech perception has been studied [5, 6]. • McGurk effect [7]: an audio-visual mismatch causes the perception of a sound different from both the acoustic and the visual components of the stimulus.
  • 7. Introduction: Deep Learning for Audio-Visual Speech Enhancement. Diagram: Input, Neural Network Model, Output; the output is compared with the Training Target (desired output) through the Objective Function (usually mean squared error). • Previous work focused on the design of training targets and objective functions for audio-only speech enhancement [8-12]. • Two contributions: • New taxonomy for speech enhancement (heterogeneous terminology in previous work). • Comparison of training targets and objective functions for audio-visual speech enhancement (previous work analysed the audio-only case).
  • 8. Training Targets and Objective Functions: Spectrogram vs Mask. Direct Mapping (DM): the neural network model outputs $\hat{A}_{k,l}$, which is compared with $A_{k,l}$. Indirect Mapping (IM): the model outputs a mask $\hat{M}_{k,l}$, and the product $\hat{M}_{k,l} R_{k,l}$ is compared with $A_{k,l}$. Mask Approximation (MA): the model outputs $\hat{M}_{k,l}$, which is compared with $M_{k,l} = \frac{A_{k,l}}{R_{k,l}}$.
  • 9. Training Targets and Objective Functions. Objective functions of the approaches used in the study organised according to our taxonomy. Here, $a = \frac{1}{TF}$ and $b = \frac{1}{TQ}$. Direct Mapping (DM) and Indirect Mapping (IM) use a spectrogram as a target; Mask Approximation (MA) uses a mask as a target.
      STSA. DM: $J = a \sum_{k,l} (A_{k,l} - \hat{A}_{k,l})^2$ (1). IM: $J = a \sum_{k,l} (A_{k,l} - \hat{M}_{k,l} R_{k,l})^2$ (6). MA: $J = a \sum_{k,l} (M^{IAM}_{k,l} - \hat{M}_{k,l})^2$ (11).
      LSA. DM: $J = a \sum_{k,l} (\log(A_{k,l}) - \log(\hat{A}_{k,l}))^2$ (2). IM: $J = a \sum_{k,l} (\log(A_{k,l}) - \log(\hat{M}_{k,l} R_{k,l}))^2$ (7). MA: -
      MSA. DM: $J = b \sum_{q,l} (A_{q,l} - \hat{A}_{q,l})^2$ (3). IM: $J = b \sum_{q,l} (A_{q,l} - \hat{M}_{q,l} R_{q,l})^2$ (8). MA: -
      LMSA. DM: $J = b \sum_{q,l} (\log(A_{q,l}) - \log(\hat{A}_{q,l}))^2$ (4). IM: $J = b \sum_{q,l} (\log(A_{q,l}) - \log(\hat{M}_{q,l} R_{q,l}))^2$ (9). MA: -
      PSSA. DM: $J = a \sum_{k,l} (A_{k,l} \cos(\theta_{k,l}) - \hat{A}_{k,l})^2$ (5). IM: $J = a \sum_{k,l} (A_{k,l} \cos(\theta_{k,l}) - \hat{M}_{k,l} R_{k,l})^2$ (10). MA: $J = a \sum_{k,l} (M^{PSM}_{k,l} - \hat{M}_{k,l})^2$ (12).
  • 10. Training Targets and Objective Functions. Short-Time Spectral Amplitude (STSA): $A_{k,l}$.
  • 11. Training Targets and Objective Functions. Log Spectral Amplitude (LSA): $\log(A_{k,l})$. Introduced because a logarithmic law better reflects human loudness perception [15].
  • 12. Training Targets and Objective Functions. Mel-Scaled Spectral Amplitude (MSA): $A_{q,l}$. Introduced because the human auditory system is more discriminative at low than at high frequencies [16].
  • 13. Training Targets and Objective Functions. Log Mel-Scaled Spectral Amplitude (LMSA): $\log(A_{q,l})$. Introduced to combine the previous two considerations.
  • 14. Training Targets and Objective Functions. Phase Sensitive Spectral Amplitude (PSSA): $A_{k,l} \cos(\theta_{k,l})$ with $\theta_{k,l} = \angle X(k,l) - \angle Y(k,l)$. Introduced to compensate for the phase mismatch between noisy and clean signals [9].
  • 15. Training Targets and Objective Functions: Taxonomy. Objective functions of the approaches used in the study organised according to our taxonomy (same table as slide 9).
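A few of the objective functions in the taxonomy can be written directly in NumPy. The sketch below is illustrative only: the function names, the random stand-in spectrograms and the small `eps` guard inside the logarithm are assumptions made for this example, not part of the slides. `np.mean` over a T×F array plays the role of the factor $a = \frac{1}{TF}$ in front of the sum.

```python
import numpy as np

def stsa_dm(A, A_hat):
    """Eq. (1): direct mapping, error between clean and estimated magnitude."""
    return np.mean((A - A_hat) ** 2)

def stsa_im(A, M_hat, R):
    """Eq. (6): indirect mapping, the estimated mask is applied to R first."""
    return np.mean((A - M_hat * R) ** 2)

def stsa_ma(A, M_hat, R):
    """Eq. (11): mask approximation, the target is the mask A / R itself."""
    return np.mean((A / R - M_hat) ** 2)

def lsa_dm(A, A_hat, eps=1e-8):
    """Eq. (2): direct mapping in the log domain (eps is an assumed guard)."""
    return np.mean((np.log(A + eps) - np.log(A_hat + eps)) ** 2)

def pssa_im(A, theta, M_hat, R):
    """Eq. (10): indirect mapping towards the phase-sensitive amplitude."""
    return np.mean((A * np.cos(theta) - M_hat * R) ** 2)

# Random stand-ins for a T x F spectrogram (assumed shapes and ranges).
rng = np.random.default_rng(0)
T, F = 4, 5
A = rng.uniform(0.1, 1.0, (T, F))          # clean magnitude A_{k,l}
R = rng.uniform(0.5, 2.0, (T, F))          # noisy magnitude R_{k,l}
theta = rng.uniform(-np.pi, np.pi, (T, F)) # phase difference theta_{k,l}
```

Each loss is zero exactly when the network output matches its own target: `A` for DM, `A / R` for the STSA mask variants, and `A * np.cos(theta) / R` for the phase-sensitive IM case.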
  • 16. Experiments: Neural Network Architecture (Deep-Learning-Based Framework). Video path: Face Detection, Face Alignment, Mouth Region Extraction, Video Encoder (6 × Conv+Leaky-ReLU+BatchNorm+MaxPooling+Dropout). Audio path: STFT, Magnitude Computation, Audio Encoder (6 × Conv+Leaky-ReLU+BatchNorm); Phase Computation; ISTFT. Fusion Sub-Network (3 × FullyConnected+Leaky-ReLU). Audio Decoder (6 × Deconv+Leaky-ReLU+BatchNorm). Output: Estimated Spectrogram OR Estimated Mask.
  • 17. Experiments: Setup. • Corpus: audio-visual GRID. • Six kinds of additive noise: bus, cafeteria, street, pedestrian, babble and speech-shaped noise (unseen). • SNRs: training [-20:5:20]; evaluation [-15:5:15]. • 25 speakers for training (600 utterances each). • 25 seen speakers for evaluation (25 utterances each). • 6 unseen speakers for evaluation (100 utterances each). • Evaluation metrics: • PESQ [17]: speech quality. • ESTOI [18]: speech intelligibility.
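The SNR grids in the setup can be illustrated with a small mixing routine. This is a sketch under assumptions: the slides do not say how the SNR is measured or how noise segments are chosen, so `mix_at_snr` is a hypothetical helper that uses whole-signal average power, and the signals are random stand-ins. The ranges are read as start:step:stop in dB.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db.

    Returns the noisy mixture and the scaled noise. Power is the mean
    square over the whole signal (an assumption made for this sketch).
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = gain * noise
    return speech + scaled, scaled

train_snrs = np.arange(-20, 25, 5)  # [-20:5:20] -> -20, -15, ..., 20 dB
eval_snrs = np.arange(-15, 20, 5)   # [-15:5:15] -> -15, -10, ..., 15 dB

rng = np.random.default_rng(1)
speech = rng.standard_normal(1000)  # stand-in for a clean utterance
noise = rng.standard_normal(1000)   # stand-in for a noise segment
noisy, scaled = mix_at_snr(speech, noise, -5.0)
measured = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled ** 2))
```

Here `measured` recovers the requested -5 dB up to rounding, and the evaluation grid matches the seven SNR columns of the result tables.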
  • 18. Results: PESQ. Results in terms of PESQ. The Unproc. rows refer to the unprocessed signals.
      PESQ     Seen Speakers                                     Unseen Speakers
      SNR (dB)    -15   -10    -5     0     5    10    15  Avg.     -15   -10    -5     0     5    10    15  Avg.
      Unproc.    1.09  1.08  1.08  1.11  1.20  1.39  1.71  1.24    1.10  1.09  1.08  1.11  1.20  1.39  1.70  1.24
      STSA-DM    1.27  1.35  1.48  1.65  1.86  2.08  2.31  1.71    1.13  1.19  1.30  1.48  1.73  1.99  2.24  1.58
      LSA-DM     1.24  1.37  1.57  1.84  2.14  2.45  2.74  1.91    1.15  1.23  1.37  1.59  1.91  2.25  2.57  1.72
      MSA-DM     1.27  1.36  1.49  1.67  1.87  2.07  2.28  1.72    1.14  1.20  1.32  1.51  1.75  1.99  2.21  1.59
      LMSA-DM    1.27  1.39  1.56  1.78  2.01  2.18  2.31  1.79    1.15  1.22  1.34  1.53  1.77  1.98  2.14  1.59
      PSSA-DM    1.24  1.32  1.44  1.61  1.82  2.04  2.25  1.67    1.13  1.18  1.28  1.45  1.70  1.94  2.17  1.55
      STSA-IM    1.24  1.33  1.45  1.61  1.77  1.95  2.19  1.65    1.13  1.18  1.28  1.44  1.65  1.87  2.11  1.52
      LSA-IM     1.17  1.25  1.39  1.60  1.89  2.19  2.49  1.71    1.13  1.17  1.28  1.46  1.72  2.02  2.34  1.59
      MSA-IM     1.26  1.34  1.47  1.64  1.85  2.07  2.30  1.70    1.13  1.19  1.29  1.47  1.71  1.98  2.24  1.57
      LMSA-IM    1.21  1.32  1.48  1.72  1.99  2.26  2.53  1.79    1.13  1.19  1.30  1.49  1.76  2.06  2.35  1.61
      PSSA-IM    1.29  1.37  1.50  1.68  1.87  2.05  2.22  1.71    1.16  1.22  1.33  1.51  1.74  1.96  2.15  1.58
      STSA-MA    1.31  1.42  1.57  1.78  2.02  2.29  2.58  1.85    1.15  1.21  1.32  1.52  1.81  2.15  2.48  1.66
      PSSA-MA    1.28  1.38  1.54  1.78  2.08  2.40  2.71  1.88    1.18  1.25  1.38  1.61  1.95  2.31  2.63  1.76
  • 19. Results: ESTOI. Results in terms of ESTOI. The Unproc. rows refer to the unprocessed signals.
      ESTOI    Seen Speakers                                     Unseen Speakers
      SNR (dB)    -15   -10    -5     0     5    10    15  Avg.     -15   -10    -5     0     5    10    15  Avg.
      Unproc.    0.08  0.15  0.24  0.35  0.47  0.58  0.67  0.36    0.08  0.14  0.23  0.34  0.46  0.57  0.66  0.35
      STSA-DM    0.35  0.41  0.49  0.57  0.64  0.70  0.74  0.56    0.23  0.29  0.39  0.49  0.59  0.67  0.72  0.48
      LSA-DM     0.35  0.41  0.49  0.58  0.65  0.71  0.76  0.56    0.24  0.30  0.39  0.49  0.60  0.68  0.73  0.49
      MSA-DM     0.36  0.42  0.49  0.57  0.64  0.70  0.74  0.56    0.24  0.31  0.40  0.51  0.61  0.68  0.73  0.50
      LMSA-DM    0.37  0.44  0.51  0.60  0.66  0.71  0.75  0.58    0.25  0.31  0.40  0.51  0.61  0.68  0.72  0.50
      PSSA-DM    0.29  0.36  0.46  0.56  0.64  0.70  0.74  0.53    0.19  0.27  0.37  0.49  0.60  0.68  0.72  0.48
      STSA-IM    0.33  0.40  0.48  0.56  0.64  0.69  0.74  0.55    0.23  0.29  0.39  0.50  0.60  0.67  0.72  0.48
      LSA-IM     0.33  0.38  0.46  0.55  0.63  0.70  0.75  0.54    0.22  0.28  0.36  0.46  0.57  0.66  0.73  0.47
      MSA-IM     0.36  0.42  0.50  0.58  0.65  0.70  0.75  0.57    0.25  0.31  0.40  0.50  0.60  0.68  0.73  0.50
      LMSA-IM    0.36  0.42  0.50  0.59  0.66  0.72  0.76  0.57    0.24  0.30  0.38  0.49  0.60  0.68  0.73  0.49
      PSSA-IM    0.29  0.37  0.46  0.56  0.64  0.70  0.75  0.54    0.21  0.28  0.38  0.50  0.61  0.68  0.73  0.48
      STSA-MA    0.39  0.45  0.52  0.60  0.67  0.72  0.77  0.59    0.26  0.32  0.41  0.51  0.62  0.70  0.75  0.51
      PSSA-MA    0.29  0.36  0.46  0.57  0.66  0.72  0.77  0.55    0.22  0.29  0.40  0.52  0.63  0.70  0.75  0.50
  • 20. Conclusion. • We proposed a new taxonomy to have a uniform terminology that links classical speech enhancement methods with more recent techniques. • We investigated several training targets and objective functions for audio-visual speech enhancement. • We used a deep-learning-based framework to directly and indirectly learn the short-time spectral amplitude of the target speech in different domains. • The mask approximation approaches and the direct estimation of the log magnitude spectrum are the methods that perform best. • In contrast to the results for audio-only speech enhancement, the use of a phase-aware mask is not as effective in improving estimated intelligibility, especially at low SNRs.
  • 21. Thank You! Any questions? Daniel Michelsanti, danmi@es.aau.dk
  • 22. BIBLIOGRAPHY
      1. E. C. Cherry (1953). Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America, 25(5):975–979.
      2. W. A. Sethares (2007). Rhythm and transforms. Springer Science & Business Media.
      3. A. Abel and A. Hussain (2014). Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation, 6(2).
      4. W. H. Sumby and I. Pollack (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2):212–215.
      5. N. P. Erber (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4).
      6. Q. Summerfield (1979). Use of visual information for phonetic perception. Phonetica, 36(4-5).
      7. H. McGurk and J. MacDonald (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748.
      8. Y. Wang, A. Narayanan, and D. L. Wang (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1849–1858.
      9. H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015). Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In ICASSP.
      10. D. S. Williamson, Y. Wang, and D. L. Wang (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(3):483–492.
      11. D. L. Wang and J. Chen (2017). Supervised speech separation based on deep learning: an overview. arXiv preprint arXiv:1708.07524.
      12. L. Sun, J. Du, L.-R. Dai, and C.-H. Lee (2017). Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.
      13. T. Fingscheidt, S. Suhadi, and S. Stan (2008). Environment optimized speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(4):825–834.
  • 23. BIBLIOGRAPHY
      14. P. C. Loizou (2005). Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE Transactions on Speech and Audio Processing, 13(5):857–869.
      15. E. Zwicker and H. Fastl (2013). Psychoacoustics: Facts and models, vol. 22. Springer Science & Business Media.
      16. S. S. Stevens, J. Volkmann, and E. B. Newman (1937). A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3):185–190.
      17. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001). Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs. In ICASSP.
      18. J. Jensen and C. H. Taal (2016). An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):2009–2022.
  • 24. IMAGES. Slide 3: most of the icons made by OCHA or by Freepik from www.flaticon.com
  • 25. Training Targets and Objective Functions: Mask Approximation vs Indirect Mapping. Indirect Mapping (IM): $J = \frac{1}{TF} \sum_{k,l} (A_{k,l} - \hat{M}_{k,l} R_{k,l})^2$. Mask Approximation (MA): $J = \frac{1}{TF} \sum_{k,l} \left( \frac{A_{k,l}}{R_{k,l}} - \hat{M}_{k,l} \right)^2 = \frac{1}{TF} \sum_{k,l} \frac{(A_{k,l} - \hat{M}_{k,l} R_{k,l})^2}{R_{k,l}^2}$. MA is nothing more than a spectrally weighted version of IM [13], which reduces the cost of estimation errors at high-energy spectral regions of the noisy signal relative to low-energy spectral regions, and is related to a perceptually motivated cost function [14].
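The identity on slide 25 can be checked numerically. The snippet below is a small sanity check on random stand-in values (the shapes, ranges and variable names are assumptions for illustration); it computes the MA loss and the IM loss with each term divided by $R_{k,l}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 6, 8
A = rng.uniform(0.1, 1.0, (T, F))      # clean magnitude A_{k,l}
R = rng.uniform(0.5, 2.0, (T, F))      # noisy magnitude R_{k,l}, kept away from 0
M_hat = rng.uniform(0.0, 1.0, (T, F))  # some estimated mask

# Mask approximation: error measured in the mask domain.
j_ma = np.mean((A / R - M_hat) ** 2)

# Indirect mapping with each time-frequency term weighted by 1 / R^2.
j_im_weighted = np.mean((A - M_hat * R) ** 2 / R ** 2)

# Plain indirect mapping, for comparison.
j_im = np.mean((A - M_hat * R) ** 2)
```

`j_ma` and `j_im_weighted` agree up to rounding, whereas `j_im` generally differs: the $1/R_{k,l}^2$ weight is what down-weights errors in high-energy regions of the noisy spectrogram.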