Speech enhancement for distant talking speech recognition
1. 24 Feb 2014
Takuya Yoshioka
NTT CS Labs, Cambridge University
Thanks to: T. Nakatani, K. Kinoshita, M. Delcroix (NTT)
M. Gales, X. Chen (Cambridge)
2. Speech Enhancement for ASR
• Effectiveness measured by WER
– use of a sensible ASR system essential
• Huge computational resources available
• Offline processing allowed
• The acoustic model (AM) can also do part of the job
4. Different Approaches for Different Situations
• 1ch vs. Mch (M > 1)
• background noise;
• reverberant noise; or
• interfering talkers
6. 1ch Dereverberation (Offline)
• Reverberation usually modelled with FIR
$x[t] = \sum_{\tau=0}^{T} h[\tau]\, s[t-\tau]$
• Given $(x[t])_{t=1,\dots,N}$, recover $(s[t])_{t=1,\dots,N}$
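As a small illustration of the FIR model above (not part of the original slides), a reverberant signal can be simulated by convolving a clean signal with an impulse response. The function name `reverberate` and the toy values of `h` and `s` are hypothetical, chosen only to make the sum easy to follow by hand.

```python
import numpy as np

def reverberate(s, h):
    """Apply x[t] = sum_{tau=0}^{T} h[tau] * s[t - tau], truncated to len(s)."""
    s = np.asarray(s, dtype=float)
    h = np.asarray(h, dtype=float)
    # np.convolve returns len(s) + len(h) - 1 samples; keep the first len(s).
    return np.convolve(s, h)[: len(s)]

# Toy impulse response (hypothetical values): direct path plus two echoes.
h = np.array([1.0, 0.0, 0.5, 0.25])
s = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = reverberate(s, h)
```

Dereverberation is the inverse problem: only `x` is observed, and both `h` and `s` are unknown.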
7. Approaches
• Time domain
– subspace, Trinicon, Long-term LP
– accurate
– can account for phase distortion
• Power spectral domain
– WF, NMF
– robust against speaker movement
• Feature domain
– front-end VTS, direct CMLLR
– can leverage the AM
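12. $y_k(t) \mid (y_k(t'))_{t'=1,\dots,t-1} \sim N\big(\sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, y_k(t-\tau),\ \lambda_{k,t}\big)$
$\log p\big((y_k(t))_{t=1,\dots,N}\big) = \sum_{t=1}^{N} \log f_{\mathrm{Normal}}\big(y_k(t);\ \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, y_k(t-\tau),\ \lambda_{k,t}\big)$
$s_k(t) \sim N(0, \lambda_{k,t})$
$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, x_k(t-\tau) + s_k(t)$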
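Written out, the log-likelihood on this slide becomes a weighted least-squares objective. The sketch below assumes that $f_{\mathrm{Normal}}$ is the circularly symmetric complex Gaussian density, $f(y; \mu, \lambda) = (\pi\lambda)^{-1} \exp(-|y-\mu|^2/\lambda)$. The slide does not state the exact form, so the constant term in particular is an assumption.

```latex
\begin{aligned}
\log p\big((y_k(t))_{t=1,\dots,N}\big)
  &= -\sum_{t=1}^{N}\left[
       \frac{\big|\,y_k(t)-\sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\,y_k(t-\tau)\big|^{2}}{\lambda_{k,t}}
       + \log \lambda_{k,t}\right] - N\log\pi
\end{aligned}
```

Under this assumption, maximising over the LP coefficients with the variances fixed is a least-squares problem with weights $1/\lambda_{k,t}$. Maximising over each $\lambda_{k,t}$ with the coefficients fixed gives $\lambda_{k,t} = |y_k(t)-\sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\,y_k(t-\tau)|^2$, since $r/\lambda + \log\lambda$ is minimised at $\lambda = r$.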
13. Interleaved Estimation of:
- LP coeffs A = (a_k(τ))_{τ=∆,...,U} + speech variances Λ = (λ_{k,t})_{t=1,...,N}
- clean speech samples
Initialise A → Calculate s_k(t) → Estimate speech vars Λ → Estimate LP coeffs A → Convergent? (if not, return to Calculate s_k(t))
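The interleaved estimation above can be sketched for a single frequency bin as follows. This is a minimal sketch under assumptions that the slides do not spell out: the variance estimate is taken as the floored squared magnitude of the current clean-speech estimate, the LP coefficients are re-estimated by weighted least squares with weights 1/λ, and the stopping rule is a relative change in the coefficients. The names `delayed_matrix` and `dereverb_1ch`, and the defaults for `n_iter`, `var_floor` and `tol`, are hypothetical.

```python
import numpy as np

def delayed_matrix(x, delay, order):
    """Return an (N, K) matrix whose columns are x(t - tau), tau = delay..order."""
    n = len(x)
    cols = []
    for tau in range(delay, order + 1):
        col = np.zeros(n, dtype=complex)
        col[tau:] = x[: max(n - tau, 0)]  # zeros stand in for samples before t = 0
        cols.append(col)
    return np.stack(cols, axis=1)

def dereverb_1ch(x, delay=3, order=10, n_iter=20, var_floor=1e-3, tol=1e-4):
    """Interleaved estimation for one frequency bin (illustrative sketch)."""
    x = np.asarray(x, dtype=complex)
    X = delayed_matrix(x, delay, order)
    # Initialise A. g stores conj(a_k(tau)), so the prediction is X @ g.
    g = np.zeros(X.shape[1], dtype=complex)
    for _ in range(n_iter):
        s = x - X @ g                                # Calculate s_k(t)
        lam = np.maximum(np.abs(s) ** 2, var_floor)  # Estimate speech vars (floored)
        Xw = X / lam[:, None]                        # rows weighted by 1 / lambda
        R = X.conj().T @ Xw                          # X^H W X
        p = Xw.conj().T @ x                          # X^H W x
        # Estimate LP coeffs; the tiny diagonal loading is an added safeguard.
        g_new = np.linalg.solve(R + 1e-10 * np.eye(len(g)), p)
        step = np.linalg.norm(g_new - g)
        g = g_new
        if step <= tol * max(np.linalg.norm(g), 1e-12):  # Convergent?
            break
    return x - X @ g, g
```

Here `x` would hold the STFT coefficients of one frequency bin over time, and the function would be called once per bin. The variance floor depends on the scale of the signal and would need tuning in practice.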
14. Eval on REVERB Challenge Data Sets
System                                   %WER
DNN AM + RNN LM + AM adapt               20.0
Dereverb + DNN AM + RNN LM + AM adapt    16.5
• prompts from 5K WSJ
• trained on multi-condition data
• tested on real recordings from dev set
• small amount of background noise
15. Eval on AMI Corpus (Meeting Transcription)
System                          %WER (Dev)   %WER (Eval)
DNN AM + 3gram LM               43.5         42.6
Dereverb + DNN AM + 3gram LM    42.0         41.1
• 4 participants in each meeting
• table-top microphone used
• single-speaker segments used
• severe reverberation and background noise
16. 1ch Algorithm Summary
• very robust against modelling errors
• keys in development
– modelling the reverberation with LP
– using a reasonable clean speech pdf
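18. • LP → MIMO LP
$\mathbf{x}_k(t) = \sum_{\tau=\Delta}^{U} \mathbf{A}_k^{*}(\tau)\, \mathbf{x}_k(t-\tau) + \mathbf{e}_k(t)$, where $\mathbf{e}_k(t) = \mathbf{h}\, s_k(t)$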
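19. • LP → MIMO LP
• single speech model → vector speech model
$\mathbf{x}_k(t) = \sum_{\tau=\Delta}^{U} \mathbf{A}_k^{*}(\tau)\, \mathbf{x}_k(t-\tau) + \mathbf{e}_k(t)$, where $\mathbf{e}_k(t) = \mathbf{h}\, s_k(t)$
$s_k(t) \sim N(0, \lambda_{k,t}) \;\Leftrightarrow\; \mathbf{h}\, s_k(t) \sim N(\mathbf{0}, \lambda_{k,t}\, \mathbf{h}\mathbf{h}^{*}) \approx N(\mathbf{0}, \lambda_{k,t}\, \mathbf{I})$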
20. Interleaved Estimation of:
- LP matrices A = (A_k(τ))_{τ=∆,...,U} + speech variances Λ = (λ_{k,t})_{t=1,...,N}
- clean speech samples
Initialise A → Calculate s_k(t) → Estimate speech vars Λ → Estimate LP matrices A → Convergent? (if not, return to Calculate s_k(t))
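A multi-channel version of the loop above can be sketched as follows, for one frequency bin and M microphones. This is a minimal sketch under assumptions not stated on the slides: the per-frame variance is estimated as the channel-averaged power of the current residual (matching the approximate N(0, λI) vector speech model), the LP matrices are re-estimated by weighted least squares, and a fixed number of iterations replaces the convergence check. The names `stack_delayed` and `dereverb_mch` and all default values are hypothetical.

```python
import numpy as np

def stack_delayed(X, delay, order):
    """X has shape (N, M). Returns (N, M*K) blocks X(t - tau), tau = delay..order."""
    n, m = X.shape
    blocks = []
    for tau in range(delay, order + 1):
        block = np.zeros((n, m), dtype=complex)
        block[tau:] = X[: max(n - tau, 0)]  # zeros stand in for frames before t = 0
        blocks.append(block)
    return np.concatenate(blocks, axis=1)

def dereverb_mch(X, delay=3, order=10, n_iter=10, var_floor=1e-3):
    """Interleaved estimation with MIMO LP for one frequency bin (sketch)."""
    X = np.asarray(X, dtype=complex)
    Xd = stack_delayed(X, delay, order)
    # Initialise A. With frames stored as rows, the blocks of G correspond to
    # conj(A_k(tau)), so the prediction for all channels is Xd @ G.
    G = np.zeros((Xd.shape[1], X.shape[1]), dtype=complex)
    for _ in range(n_iter):
        E = X - Xd @ G                                    # current residual e_k(t)
        lam = np.maximum(np.mean(np.abs(E) ** 2, axis=1), var_floor)  # speech vars
        Xw = Xd / lam[:, None]                            # rows weighted by 1 / lambda
        R = Xd.conj().T @ Xw                              # Xd^H W Xd
        P = Xw.conj().T @ X                               # Xd^H W X
        # Estimate LP matrices; the tiny diagonal loading is an added safeguard.
        G = np.linalg.solve(R + 1e-10 * np.eye(R.shape[0]), P)
    return X - Xd @ G, G
```

The output keeps all M channels, which is what allows a beamformer such as MVDR to be applied afterwards.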
21. Eval on REVERB Challenge Data Sets
#Mics   System                                    %WER
1       Baseline (DNN AM + RNN LM + AM adapt)     20.0
1       Dereverb + Baseline                       16.5
2       Dereverb + Baseline                       14.8
2       Dereverb + MVDR + Baseline                13.6
8       Dereverb + Baseline                       14.0
8       Dereverb + MVDR + Baseline                11.3
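To put the table in relative terms, the short script below computes the relative WER reduction of each system against the single-microphone baseline. The WER values are copied from the table; the function name `relative_wer_reduction` and the dictionary layout are only illustrative.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent, with respect to the baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

BASELINE_WER = 20.0  # 1 mic, no dereverberation (from the table)

# (number of mics, system) -> %WER, copied from the table
systems = {
    (1, "Dereverb + Baseline"): 16.5,
    (2, "Dereverb + Baseline"): 14.8,
    (2, "Dereverb + MVDR + Baseline"): 13.6,
    (8, "Dereverb + Baseline"): 14.0,
    (8, "Dereverb + MVDR + Baseline"): 11.3,
}

reductions = {
    key: relative_wer_reduction(BASELINE_WER, wer) for key, wer in systems.items()
}

for (mics, name), reduction in reductions.items():
    print(f"{mics} mic(s), {name}: {reduction:.1f}% relative WER reduction")
```

For instance, single-microphone dereverberation alone (20.0 to 16.5) corresponds to a 17.5% relative reduction.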
22. Long-Term LP Summary
• very robust against modelling errors
• can cover both 1ch and Mch set-ups
• keys in development
– modelling the reverberation with LP
– using a reasonable clean speech pdf
25. [Figure: SIR (dB) for T60 = 0.3 s and T60 = 0.5 s; legend: dereverberation + separation, separation, w/o separation]
26. Conclusion
• Dereverberation based on long-term LP
– represents reverberation with LP
– consistent framework covering both 1ch and Mch set-ups
– provides gains over well-optimised DNN AMs in realistic conditions
– extensions to several directions described