A literature review on improving speech intelligibility in noisy environments
Tuan Anh Dinh
Oregon Health & Science University
November 20, 2018
Introduction
Speech intelligibility is degraded in noisy environments
Reducing background noise benefits hearing-impaired (HI) people and hearing-aid (HA) users (Kochkin 2000; Nabeleck 2006)
Use single-channel speech-enhancement algorithms for noise suppression
Speech-enhancement algorithms
Assume an additive model of noise: $y(n) = s(n) + d(n)$, where $y(n)$ is the noisy speech, $s(n)$ is the clean speech, and $d(n)$ is uncorrelated noise
Modify/filter the short-time spectral amplitude (STSA) of the degraded speech $y(n)$
Keep the degraded speech's phase
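For concreteness, here is a minimal Python sketch of this analysis-modification-synthesis framework (my own illustration, not code from the reviewed papers): the noisy signal is transformed with an STFT, the magnitude is modified by a gain supplied by one of the algorithms below, and the noisy phase is reused at synthesis. The parameter `compute_gain` is a hypothetical placeholder.

```python
# Minimal sketch (assumption, not from the reviewed papers): modify the
# short-time spectral amplitude of noisy speech and keep the noisy phase.
import numpy as np
from scipy.signal import stft, istft

def enhance(y, fs, compute_gain, nperseg=512):
    """y: noisy waveform; compute_gain: hypothetical function mapping the
    magnitude spectrogram to per-bin gains in [0, 1]."""
    _, _, Y = stft(y, fs, nperseg=nperseg)                 # complex STFT of y(n)
    mag, phase = np.abs(Y), np.angle(Y)                    # STSA and phase of degraded speech
    S_hat = compute_gain(mag) * mag * np.exp(1j * phase)   # modified magnitude, original phase
    _, s_hat = istft(S_hat, fs, nperseg=nperseg)
    return s_hat
```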
Speech-enhancement algorithms
Traditional techniques:
Direct estimation of the STSA: Power Spectrum Subtraction (Boll, 1979)
Filtering of the STSA: Wiener Filtering (Lim 1979), Ideal Binary Masking (IBM) (Wang, 2008)
Little improvement with stationary noise (Monaghan 2017)
Machine learning techniques:
Sparse Coding (Monaghan 2017)
GMMs (Hu and Loizou 2010) or DNNs (Healy 2013, 2015) for IBM
DNN (Monaghan 2017) for Wiener Filtering
Machine learning techniques such as deep neural networks (DNNs) can generalize well across noise conditions (May 2014, Chen 2016)
I will talk about
Traditional techniques:
Power Spectrum Subtraction
Wiener Filtering
Ideal Binary Masking (IBM)
Machine learning techniques:
Sparse Coding
DNN for Wiener Filtering.
Power Spectrum Subtraction
Calculate the power spectrum of the enhanced speech $|\hat{S}(f)|^2$:
$$|\hat{S}(f)|^2 = |Y(f)|^2 - E[|D(f)|^2]$$
where $|Y(f)|^2$ is the power spectrum of the degraded speech and $|D(f)|^2$ is the power spectrum of the noise
Obtain $E[|D(f)|^2]$ either from assumed properties of the noise or from actual measurements of the background noise in non-speech intervals
When the right-hand side is negative, set it to zero
Little improvement with stationary noise.
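A minimal sketch of the subtraction rule above, expressed as a gain that could be plugged into the generic `enhance` scaffold sketched earlier (an illustration only; `noise_psd` is assumed to be an estimate of $E[|D(f)|^2]$ obtained from non-speech frames):

```python
# Minimal sketch (assumption) of power spectrum subtraction with half-wave
# rectification: negative power estimates are floored at zero.
import numpy as np

def spectral_subtraction_gain(noisy_mag, noise_psd):
    """noisy_mag: |Y(f)| per (freq, frame); noise_psd: E[|D(f)|^2] per freq bin."""
    clean_psd = np.maximum(noisy_mag ** 2 - noise_psd[:, None], 0.0)
    return np.sqrt(clean_psd) / np.maximum(noisy_mag, 1e-12)
```

It could be used, for example, as `enhance(y, fs, lambda mag: spectral_subtraction_gain(mag, noise_psd))`.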
Workflow of Power Spectrum Subtraction
Figure: A typical speech enhancement system based on power spectrum
subtraction
Power Spectrum Subtraction
Figure: A speech signal degraded by narrowband noise (top), and the result of spectral subtraction (bottom). Source: the Internet
Wiener Filtering
Minimizing the error $E[|s(n) - \hat{s}(n)|^2]$, we obtain the Wiener filter's gain $H(f)$:
$$H(f) = \frac{E[|S(f)|^2]}{E[|S(f)|^2] + E[|D(f)|^2]}$$
where $|S(f)|^2$ is the power spectrum of the clean speech, $|D(f)|^2$ is the power spectrum of the noise, $s(n)$ is the clean signal, and $\hat{s}(n)$ is the enhanced signal.
The mean squared error (MSE) criterion is not strongly correlated with perception (Lim 1979)
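A minimal sketch of this gain (illustration only); in practice $E[|S(f)|^2]$ is unknown and must itself be estimated, for example by the DNN described later:

```python
# Minimal sketch (assumption) of the Wiener gain H(f) from estimated
# clean-speech and noise power spectra.
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    return speech_psd / np.maximum(speech_psd + noise_psd, 1e-12)  # values in [0, 1]
```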
Wiener Filtering
Figure: Estimate of the clean signal after spectral subtraction (top) and the signal filtered by the Wiener filter (bottom)
Ideal Binary Masking
Idea: find speech-dominated time-frequency bins (1) and noise-dominated time-frequency bins (0)
Given the clean speech and the noise (hence "ideal"), construct the IBM from a time-frequency representation of the speech
Set each value of the IBM to 0 or 1 by comparing the SNR in each time-frequency bin against a preset threshold.
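A minimal sketch of the mask construction (illustration only; the mask is "ideal" because it needs the clean speech and the noise separately, which is only possible when the mixture is constructed artificially):

```python
# Minimal sketch (assumption) of the ideal binary mask over time-frequency bins.
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, threshold_db=0.0):
    local_snr_db = 20.0 * np.log10(np.maximum(speech_mag, 1e-12) /
                                   np.maximum(noise_mag, 1e-12))
    return (local_snr_db > threshold_db).astype(float)  # 1 = speech-dominated, 0 = noise-dominated
```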
Ideal Binary Masking
Figure: (A) 32-channel cochleagram of clean speech, (B) 32-channel cochleagram of speech-shaped noise, (C) IBM with 32 channels, (D) 32-channel cochleagram of the noise gated by the IBM
Sparse coding
Assume the clean speech's feature vector $s$ satisfies
$$s = D\alpha$$
where $D$ is a dictionary and $\alpha$ is a sparse coefficient vector
The enhanced speech is
$$\hat{s} = D\hat{\alpha}$$
such that
$$\|y - D\hat{\alpha}\|_2^2 < \epsilon$$
where $\epsilon$ is the desired error.
De-noising the degraded feature vector $y$ means making its $\hat{\alpha}$ sparse.
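A minimal sketch of the de-noising step using scikit-learn's OMP-based sparse coder (an illustration under the assumption that a dictionary `D` has already been learned from clean-speech features, e.g. by dictionary learning; it is not the exact solver used in the reviewed paper):

```python
# Minimal sketch (assumption) of sparse-coding enhancement: code each noisy
# feature vector with few dictionary atoms, then reconstruct s_hat = D * alpha_hat.
import numpy as np
from sklearn.decomposition import SparseCoder

def sparse_denoise(Y, D, n_nonzero=10):
    """Y: noisy feature vectors, one per row; D: dictionary with atoms in rows."""
    coder = SparseCoder(dictionary=D, transform_algorithm='omp',
                        transform_n_nonzero_coefs=n_nonzero)
    alpha_hat = coder.transform(Y)   # sparse coefficient vectors
    return alpha_hat @ D             # enhanced feature vectors
```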
Some vectors in dictionary D
Figure: Basis vectors in the dictionary obtained from clean speech
Sparse coding
Figure: Spectrograms of clean speech (top), noisy speech with babble
noise (middle), and enhanced speech using sparse coding (bottom)
Deep neural network
DNNs can represent complex mapping functions between input data and output data
Figure: ANN and DNN structures
Deep neural network—Input
Traditional features vs. auditory image model features
Traditional feature set:
Amplitude modulation spectrum (AMS) (Tchorz, 2003)
Perceptual linear prediction (PLP) (Hermansky, 1994)
Mel-frequency cepstral coefficients (MFCC)
Auditory image model (AIM); in our paper, AIM has 2 steps:
1 basilar membrane motion (BMM)
2 size-shape transformed auditory image (SSI)
The gammatone filter bank uses the equivalent rectangular bandwidth (ERB) scale: $\mathrm{ERB} = 24.7 \times (4.37 \times 10^{-3} f + 1)$
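For illustration, here is the ERB formula together with the ERB-rate spacing a gammatone filter bank typically uses for its centre frequencies (the ERB-rate mapping is the standard Glasberg and Moore form, added here for context, not quoted from the slides):

```python
# Minimal sketch (assumption): ERB bandwidth and ERB-rate-spaced centre
# frequencies for a gammatone filter bank.
import numpy as np

def erb(f):
    return 24.7 * (4.37e-3 * f + 1.0)            # ERB in Hz at centre frequency f (Hz)

def erb_spaced_centres(f_low, f_high, n_channels):
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv_rate(np.linspace(erb_rate(f_low), erb_rate(f_high), n_channels))
```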
Amplitude modulation spectrum (AMS)
Log amplitude of the Fourier transform of the time-frequency representation (e.g. log filter banks) of speech.
Figure: AMS patterns generated from a voiced speech segment (top) and from speech-shaped noise (bottom). Bright and dark areas indicate high and low energies, respectively.
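A minimal sketch of how such AMS features might be computed from subband envelopes (my own illustration; segment length and the number of modulation bins are assumptions, not the values used by Tchorz, 2003):

```python
# Minimal sketch (assumption): AMS features as the log magnitude of the FFT of
# each filter-bank channel's envelope within a short segment.
import numpy as np

def ams_features(envelopes, n_mod_bins=15):
    """envelopes: (n_channels, n_frames) subband envelopes of one segment."""
    mod_spectrum = np.abs(np.fft.rfft(envelopes, axis=1))
    return np.log(mod_spectrum[:, :n_mod_bins] + 1e-12)
```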
PLP and MFCC
Relative spectral transform and perceptual linear prediction (RASTA-PLP) (Hermansky, 1994)
Estimate a smooth spectral envelope using linear predictive coding
Integrate auditory knowledge through the Bark scale:
$$\mathrm{Bark} = 13\arctan(0.00076\,f) + 3.5\arctan\left(\left(\frac{f}{7500}\right)^2\right)$$
Mel-frequency cepstral coefficients (MFCC)
Estimate a smooth spectral envelope using cepstral coefficients
Integrate auditory knowledge through the Mel scale:
$$\mathrm{Mel} = 2595\log_{10}\left(1 + \frac{f}{700}\right)$$
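Both warpings are simple closed-form functions of frequency; a minimal sketch for computing them:

```python
# Minimal sketch: the Bark and Mel warpings quoted above.
import numpy as np

def hz_to_bark(f):
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

# e.g. hz_to_mel(1000.0) is about 1000 mel, hz_to_bark(1000.0) is about 8.5 Bark.
```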
An example of PLP
Figure: Spectrogram (top), PLP approximation (bottom)
An example of MFCC
Figure: Spectrogram (left), MFCC approximation (right)
Auditory image model
Basilar membrane motion (BMM)
simulates how sound waves cause basilar membrane motion in the human cochlea
Size-shape transformed auditory image (SSI)
produces the same pattern for vowels spoken by speakers with different glottal pulse rates or vocal tract lengths
Basilar membrane motion
Figure: Basilar membrane response to vowel /ae/
Deep Neural Network—Output
Output: Wiener filter’s gain (Monaghan 2017) / IBM (Healy
2013, 2015)
Loss functions:
Use Short Time Objective Intelligibility (STOI) (Monaghan
2017)
Use Normalized Covariance Metric (NCM) (Monaghan 2017)
Use hit rates (HIT): percentage of speech-dominated
time-frequency bins correctly classified by binary mask (Healy
2013, 2015), (Monaghan 2017)
Use false alarms (FA): percentage of noise-dominated
time-frequency bins incorrectly classified as speech-dominated
(Healy 2013, 2015), (Monaghan 2017)
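For the mask-based measures, here is a minimal sketch of how HIT and FA rates can be computed from an estimated binary mask and the corresponding IBM (my own illustration):

```python
# Minimal sketch (assumption): HIT and FA rates for a binary mask, given the IBM.
import numpy as np

def hit_fa(estimated_mask, ideal_mask):
    speech_bins = ideal_mask == 1
    noise_bins = ideal_mask == 0
    hit = np.mean(estimated_mask[speech_bins] == 1)  # speech-dominated bins kept
    fa = np.mean(estimated_mask[noise_bins] == 1)    # noise-dominated bins wrongly kept
    return hit, fa                                   # often summarized as HIT - FA
```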
Deep Neural Network—Framework
Figure: DNN-based speech enhancement. NN_COMP is a DNN trained on traditional features. NN_AIM is a DNN trained on AIM features
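For concreteness, a minimal PyTorch sketch of a feed-forward network that maps a frame of acoustic features to per-channel gains; the layer sizes, activation, and training loss are illustrative assumptions, not the architecture of NN_COMP or NN_AIM:

```python
# Minimal sketch (assumption): a DNN that estimates Wiener-style gains in [0, 1]
# from a frame of input features; trained e.g. with MSE against oracle gains.
import torch
import torch.nn as nn

class GainEstimator(nn.Module):
    def __init__(self, n_features, n_channels, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_channels), nn.Sigmoid(),  # gains in [0, 1]
        )

    def forward(self, x):
        return self.net(x)
```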
HIT-FA
Figure: HIT-FA scores for neural network based algorithms. To calculate
HIT-FA scores, the ratio masks (estimated and ideal) were converted to
binary masks
Normalized Covariance Metric
Figure: Mean values of NCM for the sentences used in each condition.
Short Time Objective Intelligibility
Figure: Mean values of STOI for the sentences used in each condition.
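As an aside, STOI scores like these can be computed with the third-party pystoi package, assuming it is installed and that the clean and enhanced signals are time-aligned at the same sampling rate (illustration only, not the implementation used in the reviewed papers):

```python
# Minimal sketch (assumption): objective intelligibility with the pystoi package.
from pystoi import stoi

def stoi_score(clean, enhanced, fs):
    return stoi(clean, enhanced, fs, extended=False)  # value roughly in [0, 1]
```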
Speech intelligibility
Use speech recognition scores as a subjective measure.
Assessed as the percentage of keywords identified correctly in sentences spoken by a British male speaker
17 participants, all native speakers of British English
2 kinds of noise: speech-shaped noise (SSN) and babble noise
2 SNR conditions: 0 dB and +4 dB
Subjective Intelligibility
Figure: Percentage of keywords correctly recognized for each of the 5 systems in speech-shaped noise (SSN) and babble noise
Speech quality
Same participants as in the speech intelligibility test.
Participants were asked to rate the perceived quality of the speech on a scale from 0 to 7 (0 is bad, and 7 is excellent)
Speech quality
Figure: Speech quality ratings
Speech intelligibility gain vs speech quality gain
Figure: Group-mean improvement in speech quality versus improvement
in speech intelligibility for the four algorithms in each noise condition.
Correlation between objective and subjective measurements
Figure: Average values of the objective measures NCM and STOI plotted
as a function of the mean intelligibility scores
Conclusion
Significant improvement in speech recognition scores and quality for the 3 machine learning algorithms in at least one of the four conditions
No significant improvement for Wiener filtering
For DNNs, auditory features perform better than traditional features, but the difference is not significant
DNNs perform better than sparse coding
