Speech enhancement for distant talking speech recognition

24 Feb 2014
Takuya Yoshioka
NTT CS Labs, Cambridge University
Thanks to: T. Nakatani, K. Kinoshita, M. Delcroix (NTT)
M. Gales, X. Chen (Cambridge)

Speech Enhancement for ASR
• Effectiveness measured by WER
– use of a sensible ASR system essential
• Huge computational resources available
• Offline processing allowed
• AM can also do part of the job

Typical ASR System
[Diagram: Signal → Speech Enh → Front-End → Recog Engine → Sentence; the Recog Engine draws on the AM, LM and Pron Dict]

Different Approaches for Different Situations
• 1ch vs. Mch (M > 1)
• background noise;
• reverberant noise; or
• interfering talkers

1ch Dereverberation (Offline)
• Reverberation usually modelled with FIR
$$x[t] = \sum_{\tau=0}^{T} h[\tau]\, s[t-\tau]$$
• Given $(x[t])_{t=1,\dots,N}$, recover $(s[t])_{t=1,\dots,N}$
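A minimal sketch of the FIR model above, assuming a synthetic impulse response. The helper names and the exponentially decaying noise used for h are hypothetical stand-ins, not a measured room response.

```python
import numpy as np

def simulate_reverb(s, h):
    """Return x[t] = sum_{tau=0}^{T} h[tau] * s[t - tau], truncated to len(s)."""
    return np.convolve(s, h)[: len(s)]

def toy_impulse_response(length, decay, seed=0):
    """Hypothetical room response: decaying white noise with a unit direct path."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(length) * np.exp(-decay * np.arange(length))
    h[0] = 1.0  # direct path
    return h

rng = np.random.default_rng(1)
s = rng.standard_normal(1000)                      # stand-in for clean speech
h = toy_impulse_response(length=200, decay=0.02)   # T = 199 in the slide's notation
x = simulate_reverb(s, h)                          # reverberant observation
```

Dereverberation is the inverse task: only x is observed, and both h and s are unknown.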

Approaches
• Time domain
– subspace, Trinicon, Long-term LP
– accurate
– can account for phase distortion
• Power spectral domain
– WF, NMF
– robust against speaker movement
• Feature domain
– front-end VTS, direct CMLLR
– can leverage the AM

[Diagram: x[t] → Analysis → x_k(t) → Dereverb (one per sub-band) → s_k(t) → Synthesis → s[t]]
Assume in each sub-band
$$x_k(t) = \sum_{\tau=0}^{T} h_k^{*}(\tau)\, s_k(t-\tau)$$

Inverse Filtering (in Each Sub-band)
$$s_k(t) = \sum_{\tau=0}^{U} g_k^{*}(\tau)\, x_k(t-\tau)$$

Long-Term Linear Prediction
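$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, x_k(t-\tau) + e_k(t)$$
(the residual $e_k(t)$ plays the role of $s_k(t)$)
$$s_k(t) = x_k(t) - \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, x_k(t-\tau)$$
we don't minimise $e_k(t)$!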
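The subtraction step can be sketched as follows for one sub-band, assuming the LP coefficients are already known. The function name and the zero-padding before the first sample are assumptions made for this example.

```python
import numpy as np

def delayed_lp_residual(x, a, delay):
    """s(t) = x(t) - sum_{tau=delay}^{U} conj(a(tau)) * x(t - tau).

    `a` holds the taps for tau = delay, delay + 1, ..., U in that order.
    Samples before t = 0 are treated as zero (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=complex)
    s = x.copy()
    for i, coeff in enumerate(a):
        tau = delay + i
        if tau >= len(x):
            break  # this tap never overlaps the signal
        s[tau:] -= np.conj(coeff) * x[: len(x) - tau]
    return s
```

Because the prediction starts at lag Δ rather than lag 1, the most recent Δ − 1 samples are never used, so the short-term correlation of the speech itself is left in the residual.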

Why LP?
LP vs. FIR
LP: $$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, x_k(t-\tau) + s_k(t)$$
FIR: $$x_k(t) = \sum_{\tau=0}^{T} h_k^{*}(\tau)\, s_k(t-\tau)$$

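$$s_k(t) \sim N(0, \lambda_{k,t})$$
$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, x_k(t-\tau) + s_k(t)$$
Combined, with $y_k(t)$ denoting the observed signal:
$$y_k(t) \mid (y_k(t'))_{t' < t} \sim N\Big(\sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, y_k(t-\tau),\ \lambda_{k,t}\Big)$$
$$\log p\big((y_k(t))_{t=1,\dots,N}\big) = \sum_{t=1}^{N} \log f_{\mathrm{Normal}}\Big(y_k(t);\ \sum_{\tau=\Delta}^{U} a_k^{*}(\tau)\, y_k(t-\tau),\ \lambda_{k,t}\Big)$$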
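A short worked step may help here: under the Gaussian model, maximising the log-likelihood alternately over the variances and over the LP coefficients gives closed-form updates. The stacked vector $\bar{\mathbf{x}}_k(t)$ and the objective name $J$ are notation introduced for this sketch, not taken from the slides; the observed signal is written $x_k(t)$.

```latex
% Residual for a given set of LP coefficients
s_k(t) = x_k(t) - \mathbf{a}_k^{H}\,\bar{\mathbf{x}}_k(t), \qquad
\bar{\mathbf{x}}_k(t) = [\,x_k(t-\Delta), \dots, x_k(t-U)\,]^{T}

% Negative log-likelihood, up to an additive constant and a positive scale
J(\mathbf{a}_k, \Lambda) = \sum_{t=1}^{N} \left[ \log \lambda_{k,t}
    + \frac{|s_k(t)|^{2}}{\lambda_{k,t}} \right]

% Fix a_k, minimise over each variance
\hat{\lambda}_{k,t} = |s_k(t)|^{2}

% Fix the variances, minimise over a_k (weighted least squares)
\hat{\mathbf{a}}_k = \Big( \sum_{t=1}^{N}
    \frac{\bar{\mathbf{x}}_k(t)\,\bar{\mathbf{x}}_k^{H}(t)}{\lambda_{k,t}} \Big)^{-1}
    \sum_{t=1}^{N} \frac{\bar{\mathbf{x}}_k(t)\, x_k^{*}(t)}{\lambda_{k,t}}
```

The quantity being minimised over the coefficients is the variance-normalised residual, not the plain residual energy, so frames where speech is strong carry less weight in the fit.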

Interleaved Estimation of:
- LP coeffs A = $(a_k(\tau))_{\tau=\Delta,\dots,U}$ + speech variances Λ = $(\lambda_{k,t})_{t=1,\dots,N}$
- clean speech samples
[Flowchart: Initialise A → Calculate s_k(t) → Estimate speech vars Λ → Estimate LP coeffs A → Convergent? If not, return to Calculate s_k(t)]
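The interleaved estimation can be sketched for a single sub-band as below. This is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, a fixed iteration count stands in for the "Convergent?" test, the variance is smoothed over a few neighbouring frames (`context`), and a small diagonal loading term keeps the linear solve well conditioned.

```python
import numpy as np

def stack_delayed(x, delay, order):
    """Columns hold x(t - tau) for tau = delay..order (zeros before t = 0)."""
    n = len(x)
    if n <= order:
        raise ValueError("signal must be longer than the prediction order")
    X = np.zeros((n, order - delay + 1), dtype=complex)
    for j, tau in enumerate(range(delay, order + 1)):
        X[tau:, j] = x[: n - tau]
    return X

def long_term_lp_dereverb(x, delay=3, order=10, iterations=5,
                          context=2, floor=1e-8):
    """Alternate between variance and LP-coefficient estimates (one sub-band)."""
    x = np.asarray(x, dtype=complex)
    X = stack_delayed(x, delay, order)
    g = np.zeros(X.shape[1], dtype=complex)       # Initialise A; g holds conj(a)
    window = np.ones(2 * context + 1) / (2 * context + 1)
    for _ in range(iterations):
        s = x - X @ g                             # Calculate s_k(t)
        # Estimate speech variances; the moving average is an added assumption,
        # context=0 gives the per-sample estimate |s_k(t)|^2.
        power = np.convolve(np.abs(s) ** 2, window, mode="same")
        lam = np.maximum(power, floor)
        Xw = X.conj().T / lam                     # X^H W with W = diag(1 / lam)
        R = Xw @ X
        # Hypothetical diagonal loading for numerical stability.
        R = R + 1e-10 * np.trace(R).real / len(g) * np.eye(len(g))
        g = np.linalg.solve(R, Xw @ x)            # Estimate LP coeffs (weighted LS)
    return x - X @ g, g
```

Called as `s_hat, g_hat = long_term_lp_dereverb(x_k)` on the complex samples of one sub-band, it returns the dereverberated samples and the conjugated LP coefficients for lags Δ to U.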

Eval on REVERB Challenge Data Sets
System                                   %WER
DNN AM + RNN LM + AM adapt               20.0
Dereverb + DNN AM + RNN LM + AM adapt    16.5
• prompts from 5K WSJ
• trained on multi-condition data
• tested on real recordings from dev set
• small amount of background noise
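The gain can also be read as a relative WER reduction. A small sketch of the arithmetic, using only the two figures reported above (the function name is hypothetical):

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# 20.0 -> 16.5 %WER on the REVERB real-recording dev set
print(f"{relative_wer_reduction(20.0, 16.5):.1f}% relative")  # 17.5% relative
```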

Eval on AMI Corpus (Meeting Transcription)
System                          %WER (Dev)   %WER (Eval)
DNN AM + 3gram LM               43.5         42.6
Dereverb + DNN AM + 3gram LM    42.0         41.1
• 4 participants in each meeting
• table-top microphone used
• single-speaker segments used
• severe reverberation and background noise

1ch Algorithm Summary
• very robust against modelling errors
• keys in development
– modelling the reverberation with LP
– using a reasonable clean speech pdf

Multi-Channel Extension
[Diagram: Dereverb → BF → To recogniser]

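• LP → MIMO LP
$$\mathbf{x}_k(t) = \sum_{\tau=\Delta}^{U} \mathbf{A}_k^{*}(\tau)\, \mathbf{x}_k(t-\tau) + \mathbf{e}_k(t)$$
(the residual $\mathbf{e}_k(t)$ plays the role of $\mathbf{h}\, s_k(t)$)
• single speech model → vector speech model
$$s_k(t) \sim N(0, \lambda_{k,t}) \;\Leftrightarrow\; \mathbf{h}\, s_k(t) \sim N(0, \lambda_{k,t}\, \mathbf{h}\mathbf{h}^{*}) \approx N(0, \lambda_{k,t}\, \mathbf{I})$$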

Interleaved Estimation of:
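- LP matrices A = $(\mathbf{A}_k(\tau))_{\tau=\Delta,\dots,U}$ + speech variances Λ = $(\lambda_{k,t})_{t=1,\dots,N}$
- clean speech samples
[Flowchart: Initialise A → Calculate s_k(t) → Estimate speech vars Λ → Estimate LP matrices A → Convergent? If not, return to Calculate s_k(t)]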

Eval on REVERB Challenge Data Sets
#Mics   System                                   %WER
1       Baseline (DNN AM + RNN LM + AM adapt)    20.0
1       Dereverb + Baseline                      16.5
2       Dereverb + MVDR + Baseline               13.6
8       Dereverb + MVDR + Baseline               11.3
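The slides do not say how the MVDR beamformer is computed. As a reminder of what that stage does, here is a minimal sketch of the textbook MVDR weights for one sub-band, assuming the noise covariance matrix and the steering vector are already available; both inputs and the function names are hypothetical.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """w = R^{-1} d / (d^H R^{-1} d).

    Minimises output noise power subject to the distortionless
    constraint w^H d = 1.
    """
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

def apply_beamformer(weights, frames):
    """frames: (channels, time) sub-band samples; returns w^H x(t) for each t."""
    return weights.conj() @ frames
```

In the pipeline shown earlier, `frames` would be the multi-channel output of the dereverberation stage for one sub-band.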

Long-Term LP Summary
• very robust against modelling errors
• can cover both 1ch and Mch set-ups
• keys in development
– modelling the reverberation with LP
– using a reasonable clean speech pdf

Extensions Explored
• dereverberation+BSS
• adaptive long-term LP
• NMF-based dereverberation
– works in the power spectrum domain
• FE-VTS dereverberation

Dereverberation+BSS
[Diagram: Dereverb → BSS]

[Bar chart: SIR (dB) at T60 = 0.3 s and T60 = 0.5 s for three conditions: dereverberation+separation, separation, w/o separation]

Conclusion
• Dereverberation based on long-term LP
– represents reverberation with LP
– consistent framework covering both 1ch and Mch set-ups
– provides gains over well-optimised DNN AMs in realistic conditions
– extensions in several directions described
