The individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems are examined in distant-talking scenarios. The contents were published in:
Takuya Yoshioka and Mark J. F. Gales, "Environmentally robust ASR front-end for deep neural network acoustic models," Computer Speech and Language, vol. 31, no. 1, pp. 65-86, May 2015.
1. Environmentally robust ASR front end for DNN-based acoustic models
2. • Do not compare results across different tables!
– Configurations may differ
• Most results shown here can be found in:
Takuya Yoshioka and Mark J. F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” Computer Speech and Language, vol. 31, no. 1, pp. 65-86, May 2015.
3. 1. Motivation
2. Corpus
• AMI meeting corpus
3. Baseline systems
• SI and SAT set-ups
4. Assessment of environmental robustness of
DNN acoustic models
5. Front-end techniques
6. Combined effects
13. State output distributions modelled with
– GMM or
– DNN
p(X | q) = P(q_0) \prod_{t=1}^{T} P(q_t | q_{t-1}) \, p(x_t | q_t)
GMM: p(x_t | j) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(x_t; \mu_{jm}, \Sigma_{jm})
Hybrid DNN: p(x_t | j) \propto P(j | x_t) / P(j)
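To make the hybrid conversion concrete, here is a minimal numpy sketch of turning DNN posteriors into the scaled likelihoods used for decoding; the function and array names are illustrative, not from the paper.

import numpy as np

def dnn_pseudo_likelihoods(posteriors, state_priors, floor=1e-10):
    # posteriors: (T, J) softmax outputs P(j | x_t)
    # state_priors: (J,) state frequencies counted from the training alignment
    # Returns scaled likelihoods p(x_t | j) ∝ P(j | x_t) / P(j).
    return posteriors / np.maximum(state_priors, floor)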
18.
  Data set | Parameterisation | Dev  | Eval | Avg (%WER)
  SDM      | FBANK            | 43.5 | 42.6 | 43.1
  IHM      | FBANK            | 28.2 | 24.6 | 26.4
• 39.2% of the errors are attributable to acoustic distortion (the relative WER gap between the SDM and IHM results)
• DNN-HMM systems are therefore not inherently robust to distant-talking conditions
22.
  Alignment | DNN input | Dev  | Eval | Avg (%WER)
  SDM       | IHM       | 30.6 | 27.0 | 28.8
  IHM       | SDM       | 41.8 | 40.8 | 41.3
  IHM       | SDM*      | 41.7 | 40.6 | 41.2
  * using a 648-2,000^5-4,000 DNN (648 inputs, five 2,000-unit hidden layers, 4,000 outputs)
• DNN training is more sensitive to noise than the state alignment
27. • Based on linear time (almost) invariant filters
• Applied to complex-valued STFT coefficients
• The filters automatically adjusted using observations
– WPE for 1ch dereverberation (NTT’s work)
– BeamformIt for denoising (ICSI’s work)
• 8 microphones used, dedicated to meetings
• Unlikely to produce irregular transitions
y_{t,f} = x_{t,f} - \sum_{k=T_0}^{T_1} g_{f,k} \, x_{t-k,f}
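As a rough single-channel illustration of this kind of linear filtering (a sketch only: the filter estimation, which WPE performs automatically from the observations, is omitted, and all names are assumed):

import numpy as np

def apply_linear_filter(stft, g, t0):
    # stft: (T, F) complex STFT coefficients; g: (K, F) filter taps;
    # t0: prediction delay. Subtracts the linearly predicted component:
    # y[t, f] = x[t, f] - sum_k g[k, f] * x[t - t0 - k, f]
    T, _ = stft.shape
    y = stft.copy()
    for k in range(g.shape[0]):
        d = t0 + k
        if d < T:
            y[d:, :] -= g[k, :] * stft[:T - d, :]
    return y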
28.
  Alignment | Dev: SDM | +Derev | +BFIt (8 mics) | Eval: SDM | +Derev | +BFIt (8 mics)
  MPE       | 43.8     | 41.8   | 38.6           | 43.0      | 41.3   | 36.6
  Hybrid    | 43.5     | 41.7   | 38.8           | 43.3      | 41.4   | 36.7
  (all figures %WER)
• Dereverberation helps even with a single microphone
• Multi-microphone beamforming works well
32. • Applied to magnitude spectra
• Cross terms (often) ignored
• Frame-by-frame modification
– Harmful for DNN?
• Noise estimated using long-term statistics
– IMCRA (used here), minimum statistics, etc
• Deltas from un-enhanced speech
– Essential for obtaining gains
|y_{t,f}|^2 = |x_{t,f}|^2 + |n_{t,f}|^2
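A minimal sketch of enhancement under this model (power-domain subtraction with flooring; the floor value is an assumption, and the IMCRA noise tracker used here is not reproduced):

import numpy as np

def enhance_power_spectrum(noisy_power, noise_power, floor=0.01):
    # Estimate |x|^2 = |y|^2 - |n|^2, flooring to keep estimates positive.
    clean_power = noisy_power - noise_power
    return np.maximum(clean_power, floor * noisy_power)

Per the slides, the delta features would still be computed from the un-enhanced spectra.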
34. • Applied to FBANK features
• The following mismatch function used
• Frame-by-frame modification
• Noise model estimated with EM
• Deltas from un-enhanced speech
y_t = x_t + h + \log(1 + \exp(n_t - x_t - h))
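The mismatch function itself is straightforward to evaluate; a numpy sketch (x, n, h are per-channel log-mel values; names assumed):

import numpy as np

def fbank_mismatch(x, n, h):
    # Noisy FBANK y from clean x, additive noise n and channel h:
    # y = x + h + log(1 + exp(n - x - h)); logaddexp avoids overflow.
    return x + h + np.logaddexp(0.0, n - x - h)

Feature enhancement inverts this relation given the noise model (n, h) estimated with EM.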
35.
  Spectrum enh. | Feature enh. | Dev  | Eval | Avg (%WER)
  N             | N            | 42.0 | 41.1 | 41.6
  Y             | N            | 41.3 | 40.9 | 41.1
  N             | Y            | 41.4 | 40.5 | 41.0
  Y             | Y            | 42.0 | 41.0 | 41.5
• Small but consistent gains
• Different enhancement methods should not be cascaded (the Y/Y combination loses the individual gains)
36.
  Spectrum enh. | Feature enh. | Dev  | Eval | Avg (%WER)
  N             | N            | 42.0 | 41.1 | 41.6
  Y             | N            | 41.3 | 40.9 | 41.1
  N             | Y            | 41.4 | 40.5 | 41.0
  Y             | Y            | 42.0 | 41.0 | 41.5
  Y             | Y            | 41.4 | 40.4 | 40.9  (multi-stream)
• Last row: the two enhancement results are presented to the DNN as parallel streams rather than cascaded
38. • Frame level
– FMPE, RDT, FE-CMLLR
– Seems to be subsumed by the DNN
• Speaker (or environment) level
– Global CMLLR, LIN, fDLR, VTLN
– Multiple decoding passes required → SAT
• Utterance level
– Single-pass decoding → SI
39. • Seems robust against supervision errors
• STC transform used to deal with correlations:
y_t^{(s)} = A^{(s)} L \, x_t + b^{(s)}
(L: global STC transform; A^{(s)}, b^{(s)}: speaker-specific CMLLR transform; x_t: the input feature vector)
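Applying the combined transform to a feature matrix is a one-liner; a sketch assuming row-major features of shape (T, D):

import numpy as np

def apply_speaker_transform(feats, A_s, b_s, L):
    # y_t = A^(s) L x_t + b^(s): global STC transform L followed by the
    # speaker-specific CMLLR transform (A^(s), b^(s)).
    return feats @ (A_s @ L).T + b_s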
42.
  Form of speaker transform | Dev  | Eval | Avg (%WER)
  None (SI)                 | 42.6 | 40.2 | 41.4
  Full                      | 37.4 | 37.4 | 37.4
  Block diagonal            | 37.3 | 36.6 | 37.0
• ~10% relative gains obtained
• “Block diagonal” outperforms “full”
43.
  Form of speaker transform | Data set | Dev  | Eval | Avg (%WER)
  None (SI)                 | SDM      | 42.6 | 40.2 | 41.4
  Full                      | SDM      | 37.4 | 37.4 | 37.4
  Block diagonal            | SDM      | 37.3 | 36.6 | 37.0
  None (SI)                 | IHM      | 27.8 | 24.2 | 26.0
  Full                      | IHM      | 23.8 | 21.6 | 22.7
44. Utterance-level (cluster-based) transforms:
y_t^{(c(u))} = A^{(c(u))} L \, x_t + b^{(c(u))},  where c(u) denotes the cluster assigned to utterance u
Clustering performed using:
– utterance-specific iVectors
– K-means (GMM clustering yielded similar performance figures)
A cluster-assignment sketch follows below.
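A minimal sketch of the assignment step, assuming K-means centroids have already been trained on the utterance iVectors (all names assumed):

import numpy as np

def assign_utterance_clusters(ivectors, centroids):
    # ivectors: (U, D) utterance-specific iVectors; centroids: (C, D).
    # Returns c(u) for each utterance, which selects the transform
    # (A^(c(u)), b^(c(u))) to apply to that utterance's features.
    dists = np.linalg.norm(ivectors[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)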
52. • Originally proposed by RWTH Aachen for shallow MLP tandem configurations
• Exploits the DNN’s insensitivity to increases in input dimensionality
• (Hopefully) complements features masked by noise
• Allows multiple enhancement results to be combined
53. • Four types of auxiliary features investigated (see the stacking sketch after this list):
– MFCC (Δ/Δ²)
– PLP
– Gammatone cepstra
  • Different frequency warping
  • STFT not used
– Intra-frame delta
  • Emphasises spectral peaks/dips
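Combining the streams amounts to frame-synchronous concatenation before the DNN input layer; a sketch (stream names are illustrative):

import numpy as np

def stack_dnn_input(fbank, *aux_streams):
    # Concatenate FBANK with auxiliary streams (e.g. MFCC, PLP,
    # gammatone cepstra) along the feature axis; all streams are (T, D_i).
    return np.concatenate((fbank,) + aux_streams, axis=-1)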
59.
  System                   | Parameterisation | Dev  | Eval | Avg (%WER)
  SAT GMM-HMM, MPE trained | HLDA             | 48.8 | 50.2 | 49.5
  SAT tandem, MPE trained  | FBANK            | 40.7 | 40.9 | 40.8
  SI hybrid                | FBANK            | 43.5 | 42.6 | 43.1
• The SAT tandem system outperforms the SAT GMM-HMM
• The SAT tandem system also outperforms the SI hybrid