Improving Speech Intelligibility through
Spectral Style Conversion
Tuan Dinh
Oregon Health & Science University
Sep 2021
Table of Contents
1 Introduction
Motivation
Approach
Thesis Problem and Statement
Specific Aims
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Unintelligible Speech
Speech is important for human communication
The typical way of speaking is referred to as habitual speech
Habitual speech becomes less intelligible in noise
Habitual speech is also hard to understand for people with
hearing impairments and non-native speakers
Unintelligible Speech
Figure: Synthetic speech of speaking devices is degraded by noise
Figure: Atypical speech is hard to understand, especially in noise
Listener Side Solution
Use noise suppression and cancellation methods
Require noise-cancellation devices, which take as input a noisy
speech signal and output an enhanced signal with higher
intelligibility and quality
There are many cases where listeners don’t have noise-cancellation devices (e.g., transit announcements)
Lessons from Real Speakers: Habitual vs Clear
Speakers adjust their voice to make it more intelligible
Adopt special clear speaking style to make habitual speech
more resilient to noisy environments and listener deficits
Researchers showed that:
Clear speech features extended phoneme durations and longer, more frequent pauses [Picheny86, Bradlow03, Krause04]
Clear speech is more intelligible than habitual speech [Picheny85,
Krause02]
Spectral and duration factors probably contribute significantly to the improved intelligibility of clear speech [Kain08, Tjaden14]
Speaker Side Solution
Convert habitual speech directly from speakers into clear
speech prior to its distortion due to background noise
Figure: Make habitual speech (generated by speech synthesizer) more resilient to noise
Figure: Make atypical speech (spoken by people with dysarthria) more resilient to noise
Previous Work on Speaker Side Solution
Applied filters to habitual speech to create spectral
characteristics of clear speech [Koutsogannaki14]
improved intelligibility for typical speakers
had a trade-off between intelligibility and naturalness
did not model the conversion from habitual to clear speech
Utilized HAB-to-CLR spectral style conversion on vowels using
a Gaussian Mixture Model [Mohammadi12]
Converted dysarthric speech into typical speech using a
Gaussian Mixture Model [Kain07]
Converted alaryngeal speech into typical speech using deep
neural networks [Kazuhiro18, Othmane19]
These machine learning-based methods (e.g., deep neural networks) showed the most promising results, but there is still room for improvement
Thesis Problem and Statement
Problem
Modifying the habitual speech of typical and atypical speakers on
the speaker side to increase intelligibility in noise is a challenging
problem
Statement
Speech intelligibility of typical and atypical speakers can be improved automatically by learning how they modify their voice to make it more intelligible
Specific Aims
1 Determine effective spectral features for spectral voice and
style conversion for typical and dysarthric speakers
2 Develop effective HAB-to-CLR spectral mappings using
machine learning algorithms for typical and dysarthric speakers
3 Develop effective methods for converting alaryngeal speech
into intelligible speech, using machine learning algorithms
4 Investigate the performance of duration style conversion on
speech intelligibility (Only in dissertation)
Table of Contents
1 Introduction
2 Background
Acoustic Features and Speech Intelligibility: Hybridization
Voice and Style Conversion
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Acoustic Features and Speech Intelligibility: Hybridization
Determine the acoustic causes of improved intelligibility in
clear speech
1 Insert clear components (e.g., clear spectrum) into habitual
speech to create hybrid speech
Find acoustic components that make hybrid speech more intelligible than habitual speech
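Hybridization can be sketched with an off-the-shelf vocoder. The snippet below is a minimal illustration, not the thesis pipeline: the file names are hypothetical, it naively truncates to the shorter analysis instead of performing proper time alignment, and only the WORLD analysis/synthesis calls (via the pyworld package) are concrete.

```python
import numpy as np
import soundfile as sf
import pyworld as pw  # WORLD vocoder bindings

# Hypothetical file names: the same sentence in habitual and clear style.
hab, fs = sf.read("habitual.wav")
clr, _ = sf.read("clear.wav")

def analyze(x, fs):
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs)         # F0 contour
    sp = pw.cheaptrick(x, f0, t, fs)  # spectral envelope
    ap = pw.d4c(x, f0, t, fs)         # aperiodicity
    return f0, sp, ap

f0_h, sp_h, ap_h = analyze(hab, fs)
_, sp_c, _ = analyze(clr, fs)

# Hybrid = habitual F0, aperiodicity, and duration + clear spectral envelope.
# A real hybridization would time-align the clear envelope to the habitual
# utterance; truncating to the shorter length is a crude stand-in.
n = min(len(f0_h), sp_c.shape[0])
hybrid = pw.synthesize(f0_h[:n], sp_c[:n], ap_h[:n], fs)
sf.write("hybrid.wav", hybrid, fs)
```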
Hybridization Findings
For typical speakers, inserting clear spectrum and duration
obtained 24% improvement in sentence transcription accuracy
[Kain08]
For dysarthric speakers, Tjaden found that
Inserting clear energy obtained 8.7% improvement
Inserting clear spectrum obtained 18% improvement
Inserting clear spectrum and duration obtained 13.4%
improvement in scaled intelligibility test [Tjaden14]
Voice Conversion
Voice Conversion (VC) is the process of transforming a source speaker’s speech so that it sounds like a target speaker’s speech
Figure: Voice Conversion framework
During the training phase, prepare parallel utterances: pairs of utterances from the source and target speakers containing the same words
Voice Conversion: Training Phase
Figure: Voice Conversion framework
1 Speech Analysis:
1 extract speech features
using Vocoder
2 analyze speech features
into mapping features
(Aim 1)
2 Time Alignment: align
mapping features between
source and target speakers
3 Train mapping function:
produces a mapping
function from aligned
mapping features (Aim 2)
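For step 2, dynamic time warping (DTW) is the standard alignment tool. A minimal sketch using librosa’s DTW on toy 12-dimensional feature sequences (the random values stand in for real mapping features):

```python
import numpy as np
import librosa

# Toy mapping features shaped (dims, frames), as librosa expects; random
# placeholders stand in for real source/target feature sequences.
src = np.random.randn(12, 120)   # source-speaker mapping features
tgt = np.random.randn(12, 100)   # target-speaker mapping features

# Dynamic time warping on the frame-wise Euclidean cost.
D, wp = librosa.sequence.dtw(X=src, Y=tgt, metric="euclidean")
wp = wp[::-1]                    # the path is returned end-to-start; reverse

# Aligned frame pairs (source frame, target frame) for training the mapping.
src_aligned = src[:, wp[:, 0]].T
tgt_aligned = tgt[:, wp[:, 1]].T
print(src_aligned.shape, tgt_aligned.shape)   # equal numbers of rows
```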
Voice Conversion: Conversion Phase
Figure: Voice Conversion framework
1 Speech Analysis: analyze
mapping features of input
utterance from source
speaker
2 Map the features: apply
mapping function
3 Speech Synthesis: synthesize
speech signal using Vocoder
Style Conversion
Learn how to map one speaking style to another, such as
habitual to clear, of the same speaker
Use VC mapping techniques in this task
Gaussian mixture models were used to map habitual to clear vowels, yielding only modest results [Mohammadi12]
These mappings are probably limited by:
inappropriate mapping features (Aim 1)
the over-smoothing problem of the mapping techniques (Aim 2) [Toda05]
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
Probabilistic Peak Tracking Features
Manifold Features
Experiment: Reconstruction Quality
Experiment: Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Spectral Features for Style Conversion
Determine effective spectral representations for spectral style
conversion
Contrast two new sets of features:
1 Probabilistic peak tracking (PPT) features
2 Manifold features
Evaluate the two sets in
speech reconstruction
style conversion
The dissertation also includes a voice conversion evaluation
Probabilistic Peak Tracking Features
Represent the spectrum by the frequencies of nine peaks in the magnitude (energy) spectrum and their corresponding bandwidths
Similar spectra have similar peak frequencies
Assume that peak frequencies change slowly and continuously over time
This assumption sometimes causes the peak frequency contours not to pass through actual spectral peaks
Peak bandwidths represent the presence or absence of magnitude peaks:
a wide bandwidth represents the absence of a peak
a narrow bandwidth represents its presence
Probabilistic Peak Tracking
Constrain 4 peak frequencies to be the first 4 formant
frequencies (F1–4) that are important for speech intelligibility
Track 4 more peak frequencies in the high-frequency region, initialized at 5000, 6000, 7000, and 8000 Hz
Also calculate the glottal formant frequency, which is correlated with F0
Finally, calculate corresponding peak bandwidths in an
iterative process to best reconstruct the original spectrum
from computed peak frequencies and peak bandwidths
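The full probabilistic tracker is beyond a slide, but the per-frame starting point, picking prominent peaks in a magnitude spectrum and estimating their bandwidths, can be sketched with scipy; the synthetic spectrum and thresholds below are illustrative assumptions, and the real method additionally tracks peaks across frames:

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Toy magnitude spectrum (dB) over 0-8 kHz with formant-like bumps.
fs, n_bins = 16000, 513
freqs = np.linspace(0, fs / 2, n_bins)
spectrum_db = sum(20 * np.exp(-0.5 * ((freqs - f) / 120) ** 2)
                  for f in (500, 1500, 2500, 3500))

# Per-frame peak picking; the real method tracks peaks probabilistically
# across frames so that contours stay slow and continuous.
peaks, _ = find_peaks(spectrum_db, distance=10, prominence=3.0)
widths = peak_widths(spectrum_db, peaks, rel_height=0.5)[0]

bin_hz = freqs[1] - freqs[0]
for p, w in zip(peaks, widths):
    print(f"peak at {freqs[p]:6.0f} Hz, ~{w * bin_hz:5.0f} Hz wide")
```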
Manifold Features
The features are purely machine-learned
The representation is realized through projection of
high-dimensional acoustic features onto a lower-dimensional
manifold
Learn the manifold from a large multi-speaker speech database using a Variational Autoencoder
Variational Autoencoder (VAE)
A spectrogram is encoded frame by frame
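A minimal frame-wise VAE sketch in PyTorch; the 513-bin input, 256-unit encoder, and toy batch are illustrative assumptions, with only the 12-dim latent matching the VAE-12 features used later:

```python
import torch
import torch.nn as nn

class FrameVAE(nn.Module):
    """Frame-wise VAE: each spectral frame (here 513 bins) is compressed
    to a 12-dim latent (the manifold features). Sizes are illustrative."""
    def __init__(self, n_bins=513, z_dim=12):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()          # reconstruction
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
    return recon + kl

model = FrameVAE()
frames = torch.randn(32, 513)          # a toy batch of spectral frames
x_hat, mu, logvar = model(frames)
loss = vae_loss(frames, x_hat, mu, logvar)
loss.backward()
```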
Using PPT and manifold features for reconstruction
Figure: Speech reconstruction with PPT features
Figure: Speech reconstruction with manifold features
Experiment: Reconstruction Quality
Evaluate the speech reconstruction quality of PPT-20 and manifold features (VAE-12) in comparison to 3 baselines:
20th-order Line Spectral Frequencies (LSF-20)
12th-order Mel-cepstral coefficients (MCEP-12)
Natural speech
Select data from 4 random speakers (2 male, 2 female) in the Voice Conversion Challenge (VCC) dataset
Conduct a comparative mean opinion score (CMOS) test
Participants listen to sentences A and B, and specify whether
A is more natural than B
Answers are on a 5-point scale: “definitely better” (+2), “better” (+1), “same” (0), “worse” (−1), and “definitely worse” (−2)
CMOS Results
A \ B     LSF-20   MCEP-12   VAE-12   PPT-20
NAT       +0.77*   +1.34*    +1.02*   +1.28*
LSF-20             +1.08*    -0.04    +0.26*
MCEP-12                      -0.44*   -0.31*
VAE-12                                +0.45*

Table: Relative quality between original and vocoded stimuli. Positive values show A is better than B. Results marked with an asterisk are significantly different.
CMOS Results
Show the ordering of the systems by projecting the table above onto a single dimension using Multidimensional Scaling (MDS), using all pairwise comparisons
Natural speech (NAT) is better than all synthetic systems
There is still a lot of room for improving synthetic speech
VAE-12 is significantly better than MCEP-12
VAE-12 is significantly better than PPT-20, and more compact
Although LSF-20 is better than VAE-12 here, VAE-12 is better for voice conversion (shown in the dissertation)
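The projection can be reproduced in spirit with scikit-learn’s MDS on a symmetrized dissimilarity matrix built from the table above; treating the absolute CMOS scores as distances is a simplifying assumption:

```python
import numpy as np
from sklearn.manifold import MDS

systems = ["NAT", "LSF-20", "MCEP-12", "VAE-12", "PPT-20"]
# Symmetrized absolute CMOS scores from the table above; 0 on the diagonal.
d = np.array([
    [0.00, 0.77, 1.34, 1.02, 1.28],
    [0.77, 0.00, 1.08, 0.04, 0.26],
    [1.34, 1.08, 0.00, 0.44, 0.31],
    [1.02, 0.04, 0.44, 0.00, 0.45],
    [1.28, 0.26, 0.31, 0.45, 0.00],
])

# Project the precomputed dissimilarities onto one dimension.
mds = MDS(n_components=1, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(d).ravel()
for name, c in sorted(zip(systems, coords), key=lambda p: p[1]):
    print(f"{name:8s} {c:+.2f}")
```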
Experiment: Style Conversion
Evaluate the efficacy of manifold features for mapping habitual style to clear style to improve intelligibility
We only look at manifold features here
Database of 78 speakers: 32 typical speakers (CS), 30 with
multiple sclerosis (MS), and 16 with Parkinson’s disease (PD)
Each read 25 Harvard sentences in habitual and clear style
Establish which speakers benefit from inserting clear spectrum
into habitual speech via Hybridization
Evaluate the intelligibility of hybrid speech (habitual speech plus clear spectrum) using a keyword recall test
66 participants listen to and type 25 Harvard sentences
Hybrid speech improved the intelligibility of habitual speech for 3 speakers: PDF7, PDM6, and CSM7
Variational Autoencoder (VAE)
VAE with Style conversion mapping
Examine two different DNN architectures
1 Feedforward network (called DNN-mapping VAE)
2 Feedforward network with skip connections (called
skip-mapping VAE)
Output is habitual speech plus modified spectrum
Feedforward network with skip connection
Figure: Skip-mapping network. The current HAB VAE-12 frame and 60-dimensional left and right context vectors are concatenated, passed through two 512-unit dense layers, re-concatenated with the input (the skip connection), passed through two more 512-unit dense layers and a 12-dimensional linear layer, and finally added to the current HAB frame to produce the current CLR VAE-12 frame.
The use of skip-connections is motivated by the fact that the
spectral difference in style conversion can be small
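A PyTorch sketch of this architecture; the exact wiring of the second concatenation is my reading of the diagram, and the toy batch is a placeholder:

```python
import torch
import torch.nn as nn

class SkipMappingVAE(nn.Module):
    """Sketch of the skip-mapping network: the 12-dim linear output is
    *added* to the current HAB frame, so the network only has to learn
    the (often small) HAB-to-CLR spectral difference."""
    def __init__(self, frame_dim=12, ctx_dim=60):
        super().__init__()
        in_dim = frame_dim + 2 * ctx_dim          # current + left + right context
        self.block1 = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(512 + in_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.out = nn.Linear(512, frame_dim)

    def forward(self, cur, left, right):
        x = torch.cat([cur, left, right], dim=1)  # Concat
        h = self.block1(x)
        h = self.block2(torch.cat([h, x], dim=1)) # skip connection: Concat again
        return cur + self.out(h)                  # Add: residual output

net = SkipMappingVAE()
cur, left, right = torch.randn(8, 12), torch.randn(8, 60), torch.randn(8, 60)
clr_pred = net(cur, left, right)                  # predicted CLR VAE-12 frames
```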
Speech Intelligibility Evaluation
                      CSM7   PDF7   PDM6
Reconstructed HAB      38     13     24
DNN-mapping VAE        32     13     35
Skip-mapping VAE       38     11     46*
CLR spectrum-hybrid    56*    27*    50*
Reconstructed CLR      69*    23*    41*

Table: Average keyword accuracy (%). Results marked with an asterisk are significantly different.
CLR spectrum-hybrid is HAB speech plus CLR spectrum
It is the gold standard of spectrum mapping
Conduct a keyword recall test with 30 participants
Skip-mapping VAE increased the intelligibility of HAB speech from 24% to 46% for PDM6 (a male with Parkinson’s disease)
This shows the potential of manifold features, but the DNN mapping might be too simplistic
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
Conditional Generative Adversarial Nets: Background
One-to-One Mapping
Many-to-One Mappings
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Spectral Mapping for Style Conversion of Typical and
Dysarthric Speech
Improve HAB-to-CLR spectral mapping for style conversion
Utilize conditional Generative Adversarial Nets (cGANs) to
map the spectral features of habitual speech to those of clear
speech
Investigate the cGANs in three spectral style conversion
mappings:
1 one-to-one mappings
2 many-to-one mappings
3 many-to-many mappings (only in dissertation)
Generative Adversarial Nets
A GAN has a Generator (G) and a Discriminator (D) [Goodfellow14]
G generates images and D decides whether they are generated or real
As one network improves, it forces the other to improve
D is only used during training
Applications: Data Augmentation, face aging, super resolution
cGANs for Style Conversion
Figure: cGAN for style conversion. G takes the HAB VAE-12 frame with left and right context and generates a CLR VAE-12 frame; D decides whether a CLR VAE-12 frame is real or generated.
cGAN is a GAN conditioned on auxiliary data
G takes as input HAB spectrum and generates CLR spectrum
D discriminates between generated and real CLR spectrum
The real CLR and HAB spectra come from the same sentence and speaker
The real CLR spectrum is time-warped to the HAB spectrum
D is conditioned on the HAB spectrum to learn whether a generated CLR spectrum is a good transformation of a HAB spectrum
By including D, we effectively learn a better loss function for G
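A minimal conditional-GAN training step in PyTorch. The two networks here are tiny stand-ins (the actual G is the skip-mapping generator shown next), but the conditioning of D on the HAB input and the alternating updates follow the scheme above:

```python
import torch
import torch.nn as nn

# Tiny stand-in networks with the right shapes: the HAB condition is a
# 132-dim vector (12-dim frame + two 60-dim contexts), CLR frames are 12-dim.
G = nn.Sequential(nn.Linear(132, 512), nn.ReLU(), nn.Linear(512, 12))
D = nn.Sequential(nn.Linear(132 + 12, 512), nn.ReLU(), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

hab = torch.randn(16, 132)   # HAB frames with context (toy batch)
clr = torch.randn(16, 12)    # time-warped real CLR frames (toy batch)

# Discriminator step: D sees (HAB condition, CLR frame) pairs.
fake = G(hab).detach()
d_loss = bce(D(torch.cat([hab, clr], dim=1)), torch.ones(16, 1)) \
       + bce(D(torch.cat([hab, fake], dim=1)), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool D -- this is the "learned loss function" for G.
g_loss = bce(D(torch.cat([hab, G(hab)], dim=1)), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```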
Structure of Generator
Figure: Generator structure, identical to the skip-mapping network shown earlier: the current HAB VAE-12 frame plus 60-dimensional left and right contexts pass through two pairs of 512-unit dense layers with a concatenation skip connection, and a 12-dimensional linear output is added to the input frame to yield the current CLR VAE-12 frame.
One-to-One Mapping
The goal is to improve the performance of style conversion from the previous section
Train a cGAN for each speaker, mapping HAB to CLR spectrum
At conversion time, apply the speaker-specific mapping to the same speaker
The output is habitual speech plus the modified spectrum
Objective Evaluation: Log Spectral Distortion (dB)
Log spectral distortion is the root-mean-square difference between the converted spectrum and the target CLR spectrum

                          speaker
mapping                   PDF7    PDM6    CSM7
DNN (previous section)    16.80   16.67   16.44
GAN                       12.85   12.58   12.67
GAN has lower log spectral distortion than DNN
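A sketch of the metric on toy spectra; averaging the per-frame RMS distortion over frames is an assumed convention:

```python
import numpy as np

def log_spectral_distortion(converted, target, eps=1e-10):
    """RMS difference between log-magnitude spectra (dB), per frame,
    averaged over frames. Inputs: (frames, bins) linear-magnitude arrays."""
    diff = 20 * np.log10(np.maximum(converted, eps)) \
         - 20 * np.log10(np.maximum(target, eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))

conv = np.abs(np.random.randn(100, 513)) + 1e-3   # converted spectra (toy)
targ = np.abs(np.random.randn(100, 513)) + 1e-3   # target CLR spectra (toy)
print(f"LSD = {log_spectral_distortion(conv, targ):.2f} dB")
```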
Examples of spectrograms
Note the difference in formants between 2–4 kHz in the red
box
Subjective Evaluation
Log spectral distortion is only a rough predictor of human perception
Conduct a keyword recall test with 60 participants, who listen to and type 25 Harvard sentences (same as the previous experiments)
Figure: Average keyword accuracy (%) for vocoded HAB, DNN, GAN, hybrid, and vocoded CLR conditions, for speakers CSM7, PDF7, and PDM6.
cGAN outperforms DNN
cGAN significantly increases intelligibility for two speakers (one typical speaker and one with Parkinson’s disease)
Many-to-One Mappings
One-to-one mappings have the disadvantage of requiring speaker-specific training data
This makes them difficult to apply to new speakers in real-life applications
Method
Pick the two target speakers with the best sentence-level intelligibility
one male and one female
both happen to be typical speakers
Map the habitual speech of multiple source speakers to the targets
Train on all speakers except the two targets and the three test speakers (PDM6, PDF7, CSM7)
i.e., 29 typical speakers, 30 with MS, and 14 with Parkinson’s disease
In conversion, apply the mapping on unseen speakers
Subjective Evaluation
Conduct a keyword recall test with 44 participants
Figure: Keyword recall accuracy (%) of the three test speakers (CSM7, PDF7, PDM6) for vocoded HAB, GAN, hybrid, and vocoded CLR conditions. The dashed lines show statistically significant differences.
Many-to-one mapping increases intelligibility for one speaker (a person with Parkinson’s disease)
It is promising, but not as good as one-to-one mapping
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
Data
Predicting Voicing or Degree of Voicing
Predicting Spectrum
Synthesizing Pitch
Subjective Evaluation
6 Conclusion
Alaryngeal Speech
People who undergo total laryngectomy lose their ability to
produce speech sounds normally
Their speech options (esophageal speech, tracheo-esophageal puncture (TEP), and electrolarynx (ELX)) are difficult to understand due to:
poor voice quality
no voiced/unvoiced differentiation
lack of articulatory precision
no F0
Alaryngeal speech is more distorted than mild Parkinson’s speech
There is no clear speech style available for LAR speakers
Flowchart of proposed method
Figure: Flowchart of the proposed method. LAR speech is analyzed by the WORLD vocoder; LAR spectra are converted to MCEPs, which feed the MCEP, AP, and VUV models to produce INT MCEPs (converted back to INT spectra), INT aperiodicity, and INT voicing; pitch accent curve synthesis driven by LAR energy produces the INT F0; the WORLD vocoder then synthesizes INT speech.
Propose an approach for transforming alaryngeal speech
(LAR) to intelligible speech (INT):
1 Predict INT binary voicing/unvoicing and degree of voicing
(aperiodicity) from LAR spectrum using DNNs (VUV model
and AP model)
2 Predict INT spectrum from LAR spectrum using cGANs
(MCEP model)
3 Create synthetic F0 from a simple intonation model (Pitch
accent curve synthesis)
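The analysis/synthesis skeleton around these three components can be sketched with the pyworld bindings for the WORLD vocoder. The model calls below are placeholders (identity mappings and a constant F0) and the file names are hypothetical; only the vocoder calls are concrete:

```python
import numpy as np
import soundfile as sf
import pyworld as pw  # WORLD vocoder bindings

x, fs = sf.read("lar_utterance.wav")                # hypothetical file name
x = np.ascontiguousarray(x, dtype=np.float64)

f0_lar, t = pw.harvest(x, fs)                       # LAR F0 (largely unusable)
sp_lar = pw.cheaptrick(x, f0_lar, t, fs)            # LAR spectral envelope
ap_lar = pw.d4c(x, f0_lar, t, fs)                   # LAR aperiodicity

# Placeholders for the three learned components described above:
sp_int = sp_lar                                     # <- MCEP model (cGAN) output
ap_int = ap_lar                                     # <- VUV/AP model output
f0_int = np.full_like(f0_lar, 120.0)                # <- pitch accent curve synthesis

y = pw.synthesize(f0_int, sp_int, ap_int, fs)       # INT speech
sf.write("int_utterance.wav", y, fs)
```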
Data
For source LAR speech: a database of 4 male speakers, 3 LAR-TEP speakers (L001, L002, L006) and 1 LAR-ELX speaker (L004)
For target INT speech, the ideal option would be a natural voice, such as habitual or clear speech; I use a synthetic male voice due to:
expediency
the capability of creating a lot of data and arbitrary voices
Each speaker (LAR and INT) has 132 sentences
Use random split of 100/16/16 sentences for training,
validation, and testing
Pre-training Data
Due to the limited amount of LAR training data, we use pre-training to leverage general knowledge of speech
Use the multi-speaker TIMIT database for pre-training
Can we make a pre-training set that better matches LAR
speech?
Simulate LAR-TEP speech by creating a fully unvoiced version
of TIMIT (FU-TIMIT)
Simulate LAR-ELX speech by creating a fully voiced version of
TIMIT (FV-TIMIT)
Use standard TIMIT split of 462/144/24 speakers for training,
validation, and testing
Predicting Voicing and Degree of Voicing
Propose a method for predicting when speech should be voiced, and the degree of voicing, from a spectrogram
Predict a binary voicing value (VUV) and continuous 2-band aperiodicity (AP) values from mel-cepstral coefficients (MCEP), using deep neural networks (DNNs)
Pre-train three kinds of speaker-independent DNNs using
either TIMIT, FU-TIMIT, or FV-TIMIT as training data
For each utterance in training data, use VUV and AP from
corresponding utterances in TIMIT as target
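A sketch of such a predictor in PyTorch. The thesis describes separate VUV and AP models; for brevity this sketch shares one trunk with two heads, and all layer sizes and the random training batch are illustrative:

```python
import torch
import torch.nn as nn

class VoicingNet(nn.Module):
    """Frame-wise DNN predicting a binary voicing value (VUV) and 2-band
    aperiodicity (AP) from MCEPs; sizes are illustrative."""
    def __init__(self, mcep_dim=31, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(mcep_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.vuv_head = nn.Linear(hidden, 1)   # logit for voiced/unvoiced
        self.ap_head = nn.Linear(hidden, 2)    # 2-band aperiodicity

    def forward(self, mcep):
        h = self.trunk(mcep)
        return self.vuv_head(h), self.ap_head(h)

model = VoicingNet()
mcep = torch.randn(64, 31)                       # a toy batch of MCEP frames
vuv_target = torch.randint(0, 2, (64, 1)).float()
ap_target = torch.rand(64, 2)
vuv_logit, ap_pred = model(mcep)
loss = nn.functional.binary_cross_entropy_with_logits(vuv_logit, vuv_target) \
     + nn.functional.mse_loss(ap_pred, ap_target)
loss.backward()
```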
Evaluating Pre-trained models on their Test Data
For testing, apply the three pre-trained models (TIMIT, FU-TIMIT, and FV-TIMIT) to their corresponding test data
Use balanced accuracy (BAC, defined as average recall) for VUV classification (since the classes were imbalanced), and r2 for AP regression; see the sketch below
                     Pre-training set
Mapping              TIMIT         FU-TIMIT      FV-TIMIT
TIMIT → TIMIT        0.99 (0.87)
FU-TIMIT → TIMIT                   0.89 (0.72)
FV-TIMIT → TIMIT                                 0.93 (0.84)

Table: BAC, with r2 in brackets; closer to 1 is better.
As expected, TIMIT model works best because training data
contains voicing that we want to predict
FU-TIMIT and FV-TIMIT also work well
It’s possible to predict voicing from spectral shape alone
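Both metrics are available in scikit-learn; a small worked example with toy, imbalanced voicing labels:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, r2_score

# BAC = average of per-class recall, robust to voiced/unvoiced imbalance.
vuv_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])   # mostly voiced frames
vuv_pred = np.array([1, 1, 1, 1, 1, 1, 1, 0])
print(balanced_accuracy_score(vuv_true, vuv_pred))  # (6/6 + 1/2) / 2 = 0.75

ap_true = np.random.rand(100, 2)                # 2-band aperiodicity targets
ap_pred = ap_true + 0.1 * np.random.randn(100, 2)
print(r2_score(ap_true, ap_pred))               # closer to 1 is better
```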
Evaluating Pre-trained models on LAR data
Test the pre-trained models, without adaptation, predicting target INT VUV or AP from LAR-TEP and LAR-ELX speech
                     Pre-training set
Mapping              TIMIT          FU-TIMIT       FV-TIMIT
L001 (TEP) → INT     0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
L002 (TEP) → INT     0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
L004 (ELX) → INT     0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
L006 (TEP) → INT     0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)

Table: BAC, with r2 in brackets.
Our expectation was that matching the pre-training set to the source speaker (FU-TIMIT for TEP, FV-TIMIT for ELX) would work best
Although the results do not entirely match this expectation, we still need to adapt our models with LAR speech
Adapting Pre-trained models on LAR data
Adapt the pre-trained models with LAR-TEP and LAR-ELX speech
Use speaker-specific adaptation due to the limited number of speakers (similar to one-to-one mapping)
Adapt all weights in the DNN models
Evaluating Adapted models
                     Pre-training set
Mapping              TIMIT          FU-TIMIT       FV-TIMIT
Before adaptation
L001 (TEP) → INT     0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
L002 (TEP) → INT     0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
L004 (ELX) → INT     0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
L006 (TEP) → INT     0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
After adaptation
L001 (TEP) → INT     0.70 (0.22)    0.67 (0.21)    0.72 (0.23)
L002 (TEP) → INT     0.73 (0.43)    0.75 (0.43)    0.73 (0.43)
L004 (ELX) → INT     0.72 (0.29)    0.71 (0.27)    0.70 (0.29)
L006 (TEP) → INT     0.65 (0.04)    0.67 (0.05)    0.64 (0.05)

Table: BAC, with r2 in brackets; higher is better.
Adaptation always increases performance
Pre-training with FU- or FV-TIMIT, as opposed to TIMIT, did not work as expected
cGANs for Predicting Spectrum
Figure: cGAN for spectrum prediction. G takes the current LAR MCEP frame with left and right context and generates an INT MCEP frame; D discriminates between generated and real INT MCEP frames.
We use the same cGAN structure to generate the INT spectrum from the LAR spectrum
Structure of Generator
Figure: Generator structure for spectrum prediction: the current LAR MCEP frame (order 31) plus 155-dimensional left and right contexts pass through two pairs of 512-unit dense layers with a concatenation skip connection, and a 31-dimensional linear output yields the current INT MCEP frame.
Evaluating Pre-trained models
Pre-train models due to the limited amount of LAR data
                     pre-training set
mapping              Before   FU-TIMIT   FV-TIMIT
FU-TIMIT → TIMIT     11.3     7.64
FV-TIMIT → TIMIT     11.0                6.46
L001 (TEP) → INT     60.6     60.0       61.9
L002 (TEP) → INT     46.0     45.0       46.5
L004 (ELX) → INT     51.5     51.1       52.8
L006 (TEP) → INT     61.2     61.6       63.0

Table: Log spectral distortion (dB).
Predicting the TIMIT spectrum from FU- and FV-TIMIT spectra results in 7.64 dB for FU-TIMIT and 6.46 dB for FV-TIMIT, reducing log spectral distortion from 11.3 and 11.0 dB, respectively
Apply pre-trained models to predict INT spectrum from LAR
spectrum
No noticeable reduction of distortion
The lack of improvement is disappointing but not unexpected, as FU-TIMIT and FV-TIMIT do not know about LAR speech
Adapting Pre-trained models on LAR speech
Adapt pre-trained models on LAR speech
                     pre-trained set
mapping              FU-TIMIT      FV-TIMIT
L001 (TEP) → INT     32 (60.0)     32 (61.9)
L002 (TEP) → INT     33 (45.0)     33 (46.5)
L004 (ELX) → INT     31.5 (51.1)   32 (52.8)
L006 (TEP) → INT     37.8 (61.6)   37 (63.0)

Table: Log spectral distortion (dB) after adaptation, with before-adaptation values in brackets.
As expected, the adaptation always improved performance
Pre-training with FU-TIMIT versus FV-TIMIT does not have a noticeable effect on adaptation
Synthesizing Pitch
F0 is not present in LAR speech
Use a phrase curve and a single accent curve to model the intonation of each utterance
The phrase curve falls logarithmically from 140 to 60 Hz
The accent curve is linearly proportional to LAR energy
Figure: Example synthetic F0 contour (Hz over frames).
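A numpy sketch of this intonation model; the 20 Hz accent excursion and the toy energy contour are illustrative assumptions:

```python
import numpy as np

n_frames = 800

# Phrase curve: logarithmically falling from 140 to 60 Hz
# (linear interpolation in the log-F0 domain).
phrase = np.exp(np.linspace(np.log(140.0), np.log(60.0), n_frames))

# Accent curve: linearly proportional to per-frame LAR energy.
# The sinusoidal energy contour stands in for a real LAR analysis.
t = np.arange(n_frames)
energy = np.abs(np.sin(2 * np.pi * t / 200))
accent = 20.0 * energy / energy.max()

f0 = phrase + accent            # synthetic F0 contour in Hz
print(f0.min(), f0.max())       # inspect the resulting range
```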
Overall Results
Conduct CMOS tests of perceptual naturalness and intelligibility
Each participant listened to pairs of sentences A & B, pitting modified speech against the LAR speech
and answered “Is A more natural/intelligible than B?” on a 5-point scale: “definitely worse” (−2), “worse” (−1), “same” (0), “better” (+1), “definitely better” (+2)
There were 48 participants in each CMOS test
For a fair comparison, the LAR speech was analyzed and re-synthesized using WORLD
Intelligibility
INT-spectrum: LAR speech plus predicted spectrum
INT-intonation: LAR speech plus predicted voicing, F0
INT-all: LAR speech plus predicted spectrum, voicing, and F0
                     Systems
Speakers         INT-spectrum   INT-intonation   INT-all
L001 (TEP)       −0.1           −0.1              0.1
L002 (TEP)        0.1            0.2             −0.3*
L004 (ELX)       −0.34*          0.34*           −0.2
L006 (TEP)        0.2           −0.1             −0.0
INT-intonation significantly increased intelligibility for L004
INT-all did not increase intelligibility
We did not observe an increase in overall intelligibility
Naturalness
                     Systems
Speakers         INT-spectrum   INT-intonation   INT-all
L001 (TEP)       −0.0           −0.3*             0.4*
L002 (TEP)       −0.1           −0.0              0.1
L004 (ELX)       −0.56*         −0.25             0.22
L006 (TEP)       −0.3*          −0.2*             0.7*
INT-all increased naturalness for all 4 speakers
but the increase is only significant for L001 and L006
However, when testing the individual components (e.g., spectrum alone), there is no improvement
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Conclusion
Aim 1: Determine effective spectral features for style
conversion
Proposed two sets of features: PPT and manifold features
(VAE-12)
VAE-12 is better than MCEP-12 and PPT in speech
reconstruction
VAE-12 in combination with DNNs significantly increases intelligibility for one speaker with Parkinson’s disease, from 24% to 46%
Conclusion
Aim 2: Develop effective HAB-to-CLR style mapping
Proposed a spectral style mapping using cGANs for improving
speech intelligibility
For one-to-one mapping, cGANs outperform the DNN and significantly increase intelligibility for 2 speakers (a typical speaker and one with Parkinson’s disease)
For many-to-one mapping, cGANs significantly increase intelligibility for a speaker with Parkinson’s disease
Conclusion
Aim 3: Develop effective methods for LAR-to-INT conversion
Proposed a method to predict binary voicing/unvoicing and
degree of voicing (aperiodicity) from LAR MCEP using DNNs
Proposed a method to predict INT spectrum from LAR
spectrum using cGANs
Proposed a method to create a synthetic fundamental
frequency trajectory from a simple intonation model
INT-intonation significantly increases intelligibility for 1
speaker
INT-all significantly increases naturalness for 2 speakers
Thanks for your attention
67/67

More Related Content

Similar to Final defense

Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
Universidad Nacional de San Martin
 
LPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A ReviewLPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A Review
ijiert bestjournal
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
inscit2006
 
B110512
B110512B110512
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speechNikolay Karpov
 
Direct Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete UnitsDirect Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete Units
IJCI JOURNAL
 
Powerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptxPowerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptx
VinodkumarGaniger1
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translationbehzad66
 
voice morphing.pptx
voice morphing.pptxvoice morphing.pptx
voice morphing.pptx
yashisolanki02
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
simonp16
 
Visual speech to text conversion applicable to telephone communication
Visual speech to text conversion  applicable  to telephone communicationVisual speech to text conversion  applicable  to telephone communication
Visual speech to text conversion applicable to telephone communication
Swathi Venugopal
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Kotaro Hara
 
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
sipij
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
IJERA Editor
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshi
Hiroyuki Miyoshi
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
Bilgin Aksoy
 
Survey On Speech Synthesis
Survey On Speech SynthesisSurvey On Speech Synthesis
Survey On Speech Synthesis
CSCJournals
 
Voice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from LaryngectomyVoice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from Laryngectomy
International Journal of Science and Research (IJSR)
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speechNikolay Karpov
 

Similar to Final defense (20)

Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
LPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A ReviewLPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A Review
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
B110512
B110512B110512
B110512
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speech
 
Direct Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete UnitsDirect Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete Units
 
Powerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptxPowerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptx
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
 
voice morphing.pptx
voice morphing.pptxvoice morphing.pptx
voice morphing.pptx
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
 
Visual speech to text conversion applicable to telephone communication
Visual speech to text conversion  applicable  to telephone communicationVisual speech to text conversion  applicable  to telephone communication
Visual speech to text conversion applicable to telephone communication
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
 
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshi
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
 
Survey On Speech Synthesis
Survey On Speech SynthesisSurvey On Speech Synthesis
Survey On Speech Synthesis
 
Voice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from LaryngectomyVoice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from Laryngectomy
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speech
 

Recently uploaded

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
SciAstra
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 

Recently uploaded (20)

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 

Final defense

  • 1. Improving Speech Intelligibility through Spectral Style Conversion Tuan Dinh Oregon Health & Science University Sep 2021
  • 2. Table of Contents 1 Introduction Motivation Approach Thesis Problem and Statement Specific Aims 2 Background 3 Spectral Features for Style Conversion 4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech 5 Voice Conversion and F0 Synthesis of Alaryngeal Speech 6 Conclusion
  • 3. 3/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Unintelligible Speech Speech is important for human communication Typical way of speaking is referred as habitual speech Habitual speech becomes less intelligible in noise Habitual speech is also hard to understand for people with hearing impairments and non-native speakers Tuan Dinh Improving Speech Intelligibility
  • 4. Unintelligible Speech Figure: Synthetic speech of speaking devices is degraded by noise Figure: Atypical speech is hard to understand, especially in noise 4/67
  • 5. 5/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Listener Side Solution Use noise suppression and cancellation methods Require noise-cancellation devices, which take as input a noisy speech signal and output an enhanced signal with higher intelligibility and quality There are many cases where listeners don’t have noise-cancellation devices transit announcements Tuan Dinh Improving Speech Intelligibility
  • 6. Lessons from Real Speakers: Habitual vs Clear Speakers adjust their voice to make it more intelligible Adopt special clear speaking style to make habitual speech more resilient to noisy environments and listener deficits Researchers showed that: Clear speech features extended phoneme duration, longer and more frequent pauses [Picheny86, Bradlow03, Krause04] Clear speech is more intelligible than habitual speech [Picheny85, Krause02] Spectral and duration factors are probably significant to the improved intelligibility of clear speech [Kain08, Tjaden14] 6/67
  • 7. Speaker Side Solution Convert habitual speech directly from speakers into clear speech prior to its distortion due to background noise Figure: Make habitual speech (generated by speech synthesizer) more resilient to noise Figure: Make atypical speech (spoken by people with dysarthria) more resilient to noise 7/67
  • 8. Previous Work on Speaker Side Solution Applied filters to habitual speech to create spectral characteristics of clear speech [Koutsogannaki14] improved intelligibility for typical speakers had a trade-off between intelligibility and naturalness did not model the conversion from habitual to clear speech Utilized HAB-to-CLR spectral style conversion on vowels using a Gaussian Mixture Model [Mohammadi12] Converted dysarthric speech into typical speech using a Gaussian Mixture Model [Kain07] Converted alaryngeal speech into typical speech using deep neural networks [Kazuhiro18, Othmane19] These machine learning-based methods (e.g., deep neural networks) showed the most promising results; but there is still room for improvement 8/67
  • 9. 9/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Thesis Problem and Statement Problem Modifying the habitual speech of typical and atypical speakers on the speaker side to increase intelligibility in noise is a challenging problem Statement Speech intelligibility of typical and atypical speakers can be improved automatically by learning how they map their voice and make it more intelligible Tuan Dinh Improving Speech Intelligibility
  • 10. 10/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Specific Aims 1 Determine effective spectral features for spectral voice and style conversion for typical and dysarthric speakers 2 Develop effective HAB-to-CLR spectral mappings using machine learning algorithms for typical and dysarthric speakers 3 Develop effective methods for converting alaryngeal speech into intelligible speech, using machine learning algorithms 4 Investigate the performance of duration style conversion on speech intelligibility (Only in dissertation) Tuan Dinh Improving Speech Intelligibility
  • 11. Table of Contents 1 Introduction 2 Background Acoustic Features and Speech Intelligibility: Hybridization Voice and Style Conversion 3 Spectral Features for Style Conversion 4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech 5 Voice Conversion and F0 Synthesis of Alaryngeal Speech 6 Conclusion 11/67
  • 12. Acoustic Features and Speech Intelligibility: Hybridization Determine the acoustic causes of improved intelligibility in clear speech 1 Insert clear components (e.g., clear spectrum) into habitual speech to create hybrid speech 2 Find acoustic components that make hybrid speech more intelligible than habitual speech 12/67
  • 13. Hybridization Findings For typical speakers, inserting clear spectrum and duration obtained 24% improvement in sentence transcription accuracy [Kain08] For dysarthric speakers, Tjaden found that Inserting clear energy obtained 8.7% improvement Inserting clear spectrum obtained 18% improvement Inserting clear spectrum and duration obtained 13.4% improvement in scaled intelligibility test [Tjaden14] 13/67
  • 14. 14/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Acoustic Features and Speech Intelligibility: Hybridization Voice and Style Conversion Voice Conversion Voice Conversion (VC) is a process of transforming a source speaker’s speech so it sounds like a target speaker’s speech Figure: Voice Conversion framework During Training Phase, prepare parallel utterances, which contain pairs of utterances from source and target speakers with the same words Tuan Dinh Improving Speech Intelligibility
  • 15. Voice Conversion: Training Phase Figure: Voice Conversion framework 1 Speech Analysis: 1 extract speech features using Vocoder 2 analyze speech features into mapping features (Aim 1) 2 Time Alignment: align mapping features between source and target speakers 3 Train mapping function: produces a mapping function from aligned mapping features (Aim 2) 15/67
  • 16. Voice Conversion: Conversion Phase
    Figure: Voice Conversion framework
    1 Speech analysis: analyze the mapping features of the input utterance from the source speaker
    2 Feature mapping: apply the mapping function
    3 Speech synthesis: synthesize the speech signal using the vocoder
  • 17. Style Conversion
    Learn how to map one speaking style of the same speaker to another, such as habitual to clear
    Use VC mapping techniques for this task
    Gaussian mixture models were used to map habitual to clear vowels, with only modest results [Mohammadi12]
    These mappings are probably limited by:
      inappropriate mapping features (Aim 1)
      the over-smoothing problem of the mapping techniques (Aim 2) [Toda05]
  • 18. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
      Probabilistic Peak Tracking Features
      Manifold Features
      Experiment: Reconstruction Quality
      Experiment: Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
    6 Conclusion
  • 19. Spectral Features for Style Conversion
    Determine effective spectral representations for spectral style conversion
    Contrast two new sets of features:
      1 Probabilistic peak tracking (PPT) features
      2 Manifold features
    Evaluate the two sets on speech reconstruction and style conversion
    The dissertation also includes a voice conversion evaluation
  • 20. Probabilistic Peak Tracking Features
    Represent the spectrum by the frequencies of nine peaks in the magnitude (energy) spectrum and their corresponding peak bandwidths
    Similar spectra have similar peak frequencies
    Assume that peak frequencies change slowly and continuously over time; this assumption occasionally causes the peak-frequency contours to miss actual spectral peaks
    Peak bandwidths represent the presence or absence of magnitude peaks:
      a wide bandwidth indicates the absence of a peak
      a narrower bandwidth indicates its presence
  • 21. Probabilistic Peak Tracking
    Constrain 4 peak frequencies to be the first 4 formant frequencies (F1–F4), which are important for speech intelligibility
    Track 4 peak frequencies in the high-frequency region, initialized at 5000, 6000, 7000, and 8000 Hz
    Also calculate the glottal formant frequency, which is correlated with F0
    Finally, calculate the corresponding peak bandwidths in an iterative process that best reconstructs the original spectrum from the computed peak frequencies and bandwidths
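    As a toy illustration of the peak-plus-bandwidth representation (not the thesis's iterative estimator), each peak can be drawn as a resonance-shaped bump whose width follows its bandwidth, so a very wide bandwidth flattens the bump and the peak effectively disappears; the Lorentzian bump shape and all numbers below are assumptions:

    import numpy as np

    def peak_envelope(freqs_hz, bandwidths_hz, n_bins=513, fs=16000):
        """Toy spectral envelope: each peak adds a resonance-shaped bump;
        a wider bandwidth gives a flatter bump (peak effectively absent)."""
        f = np.linspace(0, fs / 2, n_bins)
        env = np.zeros(n_bins)
        for fc, bw in zip(freqs_hz, bandwidths_hz):
            env += 1.0 / (1.0 + ((f - fc) / bw) ** 2)   # Lorentzian bump at fc
        return 20 * np.log10(env + 1e-9)                # log magnitude in dB

    # Nine peaks: a glottal peak, F1-F4, and four fixed high-frequency tracks
    peaks = [120, 500, 1500, 2500, 3500, 5000, 6000, 7000, 8000]
    bws = [100, 120, 150, 200, 250, 400, 450, 500, 550]
    envelope_db = peak_envelope(peaks, bws)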
  • 22. Manifold Features
    The features are purely machine-learned
    The representation is realized by projecting high-dimensional acoustic features onto a lower-dimensional manifold
    Learn the manifold from a large multi-speaker speech database using a Variational Autoencoder
  • 23. Variational Autoencoder (VAE)
    A spectrogram is encoded frame by frame
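    A minimal sketch of such a frame-wise VAE, assuming PyTorch, a 513-bin input spectrum, and illustrative layer sizes (the thesis's exact architecture may differ); the 12-dimensional latent vector plays the role of the VAE-12 manifold feature used later:

    import torch
    import torch.nn as nn

    class FrameVAE(nn.Module):
        """Frame-wise VAE: encodes one spectral frame into a 12-dim latent."""
        def __init__(self, in_dim=513, latent_dim=12, hidden=256):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent_dim)       # posterior mean
            self.logvar = nn.Linear(hidden, latent_dim)   # posterior log-variance
            self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            return self.dec(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        recon = ((x - x_hat) ** 2).sum(-1).mean()  # frame reconstruction error
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl                          # KL term pulls z toward N(0, I)

    model = FrameVAE()
    frames = torch.randn(32, 513)                  # a batch of log-spectral frames
    x_hat, mu, logvar = model(frames)
    loss = vae_loss(frames, x_hat, mu, logvar)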
  • 24. Using PPT and manifold features for reconstruction
    Figure: Speech reconstruction with PPT features
    Figure: Speech reconstruction with manifold features
  • 25. Experiment: Reconstruction Quality
    Evaluate the speech reconstruction quality of PPT-20 and manifold features (VAE-12) against 3 baselines:
      20th-order Line Spectral Frequencies (LSF-20)
      12th-order mel-cepstral coefficients (MCEP-12)
      Natural speech
    Select data from 4 random speakers (2 male, 2 female) in the Voice Conversion Challenge (VCC) dataset
    Conduct a comparative mean opinion score (CMOS) test
      Participants listen to sentences A and B and specify whether A is more natural than B
      Answers are on a 5-point scale: "definitely better" (+2), "better" (+1), "same" (0), "worse" (−1), and "definitely worse" (−2)
  • 26. CMOS Results
    A \ B     LSF-20   MCEP-12   VAE-12   PPT-20
    NAT       +0.77*   +1.34*    +1.02*   +1.28*
    LSF-20             +1.08*    −0.04    +0.26*
    MCEP-12                      −0.44*   −0.31*
    VAE-12                                +0.45*
    Table: Relative quality between original and vocoded stimuli. Positive values show A is better than B. Results marked with an asterisk are significantly different.
  • 27. CMOS Results
    Show an ordering of the systems by projecting the table above onto a single dimension using Multidimensional Scaling (MDS), computed from all pairwise comparisons
    Natural speech (NAT) is better than all synthetic systems; there is still considerable room for improving synthetic speech
    VAE-12 is significantly better than MCEP-12
    VAE-12 is significantly better than PPT-20 and more compact
    Although LSF-20 is better than VAE-12 here, VAE-12 is better for voice conversion (in the dissertation)
  • 28. Experiment: Style Conversion
    Evaluate the efficacy of manifold features for mapping the habitual style to the clear style to improve intelligibility (we only consider manifold features here)
    Database of 78 speakers: 32 typical speakers (CS), 30 with multiple sclerosis (MS), and 16 with Parkinson's disease (PD); each read 25 Harvard sentences in habitual and clear style
    Establish which speakers benefit from inserting the clear spectrum into habitual speech via hybridization
    Evaluate the intelligibility of hybrid speech (habitual speech plus clear spectrum) with a keyword recall test: 66 participants listened to and transcribed 25 Harvard sentences
    Hybrid speech improved the intelligibility of habitual speech for 3 speakers: PDF7, PDM6, and CSM7
  • 30. VAE with Style-Conversion Mapping
    Examine two different DNN architectures:
    1 Feedforward network (called DNN-mapping VAE)
    2 Feedforward network with skip connections (called skip-mapping VAE)
    The output is habitual speech with a modified spectrum
  • 31. Feedforward Network with Skip Connections
    Architecture: [Current HAB VAE-12, Left Context 60, Right Context 60] → Concat → Dense 512 → Dense 512 → Concat → Dense 512 → Dense 512 → Linear 12 → Add (with the current HAB frame) → Current CLR VAE-12
    The use of skip connections is motivated by the fact that the spectral difference in style conversion can be small (see the sketch below)
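    A sketch of this architecture, assuming PyTorch; the diagram does not state which nonlinearity the Dense layers use or exactly what the second Concat joins, so the ReLUs and the skip concatenation of the input are assumptions:

    import torch
    import torch.nn as nn

    class SkipMappingVAE(nn.Module):
        """Predicts a 12-dim residual that is added to the current habitual
        VAE frame, so a near-identity mapping is easy to learn."""
        def __init__(self, frame_dim=12, ctx_dim=60, hidden=512):
            super().__init__()
            in_dim = frame_dim + 2 * ctx_dim              # current frame + contexts
            self.block1 = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.block2 = nn.Sequential(
                nn.Linear(hidden + in_dim, hidden), nn.ReLU(),   # skip concat
                nn.Linear(hidden, hidden), nn.ReLU())
            self.out = nn.Linear(hidden, frame_dim)

        def forward(self, hab_frame, left_ctx, right_ctx):
            x = torch.cat([hab_frame, left_ctx, right_ctx], dim=-1)
            h = self.block2(torch.cat([self.block1(x), x], dim=-1))
            return hab_frame + self.out(h)                # residual add -> CLR frame

    net = SkipMappingVAE()
    hab, left, right = torch.randn(8, 12), torch.randn(8, 60), torch.randn(8, 60)
    clr_pred = net(hab, left, right)                      # predicted CLR VAE-12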
  • 32. Speech Intelligibility Evaluation
    System                  CSM7   PDF7   PDM6
    Reconstructed HAB       38     13     24
    DNN-mapping VAE         32     13     35
    Skip-mapping VAE        38     11     46*
    CLR spectrum-hybrid     56*    27*    50*
    Reconstructed CLR       69*    23*    41*
    Table: Average keyword accuracy (%). Results marked with an asterisk are significantly different.
    CLR spectrum-hybrid is HAB speech plus the CLR spectrum; it is the gold standard for spectrum mapping
    Conducted a keyword recall test with 30 participants
    The skip-mapping VAE increased the intelligibility of HAB speech from 24% to 46% for PDM6 (a male with Parkinson's disease)
    This shows the potential of manifold features, but the DNN mapping might be too simplistic
  • 33. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
      Conditional Generative Adversarial Nets: Background
      One-to-One Mapping
      Many-to-One Mappings
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
    6 Conclusion
  • 34. Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    Improve the HAB-to-CLR spectral mapping for style conversion
    Utilize conditional Generative Adversarial Nets (cGANs) to map the spectral features of habitual speech to those of clear speech
    Investigate cGANs in three spectral style-conversion settings:
    1 one-to-one mappings
    2 many-to-one mappings
    3 many-to-many mappings (only in the dissertation)
  • 35. Generative Adversarial Nets
    A GAN has a Generator (G) and a Discriminator (D) [Goodfellow14]
    G generates samples (e.g., images) and D decides whether they are generated or real
    As either gets better, so does the other
    D is only used during training
    Applications: data augmentation, face aging, super-resolution
  • 36. cGANs for Style Conversion
    Figure: cGAN framework. G takes the HAB VAE-12 frame with left and right context and generates a CLR VAE-12 frame; D judges "real or generated?"
    A cGAN is a GAN conditioned on auxiliary data
    G takes the HAB spectrum as input and generates a CLR spectrum
    D discriminates between generated and real CLR spectra
    The real CLR and HAB spectra come from the same sentence and speaker; the real CLR spectrum is time-warped to the HAB spectrum
    D is conditioned on the HAB spectrum, so it learns whether a generated CLR spectrum is a good transformation of that HAB spectrum
    By including D, we learn a better loss function for G (see the sketch below)
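    A minimal sketch of one cGAN training step under this conditioning, assuming PyTorch; the layer sizes, optimizer settings, and plain adversarial loss are illustrative (in practice the generator loss is often combined with a regression term):

    import torch
    import torch.nn as nn

    # Hypothetical G and D; G maps a HAB frame plus context (132 dims, as in
    # the skip-mapping input) to a CLR VAE-12 frame, D scores (HAB, CLR) pairs
    G = nn.Sequential(nn.Linear(132, 512), nn.ReLU(), nn.Linear(512, 12))
    D = nn.Sequential(nn.Linear(132 + 12, 512), nn.ReLU(), nn.Linear(512, 1))
    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

    def train_step(hab, clr):
        """One adversarial update; hab: (B, 132), clr: (B, 12)."""
        fake = G(hab)
        # Discriminator: real (HAB, CLR) pairs vs generated pairs
        d_real = D(torch.cat([hab, clr], -1))
        d_fake = D(torch.cat([hab, fake.detach()], -1))
        loss_d = bce(d_real, torch.ones_like(d_real)) + \
                 bce(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # Generator: fool D, conditioned on the same HAB input
        loss_g = bce(D(torch.cat([hab, fake], -1)), torch.ones_like(d_fake))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_d.item(), loss_g.item()

    loss_d, loss_g = train_step(torch.randn(16, 132), torch.randn(16, 12))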
  • 38. One-to-One Mapping
    The goal is to improve the style-conversion performance of the previous section
    Train a cGAN for each speaker to map the HAB spectrum to the CLR spectrum
    In conversion, apply the speaker-specific mapping to the same speaker
    The output is habitual speech with a modified spectrum
  • 39. Objective Evaluation: Log Spectral Distortion (dB)
    Log spectral distortion is the root-mean-square difference between the converted spectrum and the target CLR spectrum
    Mapping                  PDF7    PDM6    CSM7
    DNN (previous section)   16.80   16.67   16.44
    GAN                      12.85   12.58   12.67
    The GAN has lower log spectral distortion than the DNN
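    On log-magnitude spectra the metric can be computed as in the sketch below; the exact averaging convention (per-frame versus global RMS) is an assumption:

    import numpy as np

    def log_spectral_distortion(conv_spec, tgt_spec, eps=1e-10):
        """RMS difference between two log-magnitude spectra in dB.
        Inputs are magnitude spectrograms of shape (frames, bins)."""
        diff = 20 * np.log10(conv_spec + eps) - 20 * np.log10(tgt_spec + eps)
        per_frame = np.sqrt(np.mean(diff ** 2, axis=1))  # dB per frame
        return per_frame.mean()                          # average over frames

    conv = np.abs(np.random.randn(100, 513)) + 1e-3  # stand-in spectrograms
    tgt = np.abs(np.random.randn(100, 513)) + 1e-3
    print(f"LSD = {log_spectral_distortion(conv, tgt):.2f} dB")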
  • 40. Examples of spectrograms
    Note the difference in formants between 2 and 4 kHz in the red box
  • 41. Subjective Evaluation
    Log spectral distortion is only a rough predictor of human perception
    Conduct a keyword recall test with 60 participants, who listen to and transcribe 25 Harvard sentences (as in the previous experiments)
    Figure: Average keyword accuracy (%) of vocoded HAB, DNN, GAN, hybrid, and vocoded CLR for speakers CSM7, PDF7, and PDM6
    The cGAN outperforms the DNN
    The cGAN significantly increases intelligibility for two speakers (one typical and one with Parkinson's disease)
  • 42. Many-to-One Mappings
    One-to-one mappings have a disadvantage: they require speaker-specific training data
    This makes them difficult to apply to new speakers in real-life applications
  • 43. Method
    Pick the two target speakers with the best sentence-level intelligibility: one male and one female, both of whom happen to be typical speakers
    Map the habitual speech of multiple speakers to the targets
    For training, use all speakers except the two targets and the test speakers PDM6, PDF7, and CSM7: 29 typical speakers, 30 with MS, and 14 with Parkinson's disease
    In conversion, apply the mapping to unseen speakers
  • 44. Subjective Evaluation
    Conduct a keyword recall test with 44 participants
    Figure: Keyword recall accuracy (%) of vocoded HAB, GAN, hybrid, and vocoded CLR for speakers CSM7, PDF7, and PDM6. The dashed lines show statistically significant differences.
    Many-to-one mapping increases intelligibility for one speaker (a person with Parkinson's disease)
    Promising, but not as good as one-to-one mapping
  • 45. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
      Data
      Predicting Voicing or Degree of Voicing
      Predicting Spectrum
      Synthesizing Pitch
      Subjective Evaluation
    6 Conclusion
  • 46. Alaryngeal Speech
    People who undergo a total laryngectomy lose their ability to produce speech sounds normally
    Their speech options are esophageal speech, tracheo-esophageal puncture (TEP), and electrolarynx (ELX); all are difficult to understand due to:
      poor voice quality
      no voiced/unvoiced differentiation
      lack of articulatory precision
      no F0
    Alaryngeal speech is more distorted than the speech of someone with mild Parkinson's disease
    There is no clear-speech style for LAR speakers
  • 47. Flowchart of the Proposed Method
    Figure: LAR speech → WORLD vocoder analysis → LAR spectra → LAR MCEP; the VUV, AP, and MCEP models map LAR MCEP to INT VUV, INT AP, and INT MCEP; INT MCEP → INT spectra; LAR energy → pitch-accent-curve synthesis → INT F0; WORLD vocoder synthesis → INT speech
    Propose an approach for transforming alaryngeal speech (LAR) into intelligible speech (INT):
    1 Predict INT binary voicing/unvoicing and the degree of voicing (aperiodicity) from the LAR spectrum using DNNs (VUV model and AP model)
    2 Predict the INT spectrum from the LAR spectrum using cGANs (MCEP model)
    3 Create a synthetic F0 from a simple intonation model (pitch-accent-curve synthesis)
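    Read end to end, the flowchart corresponds to roughly the following pipeline. This is a sketch only: it assumes the pyworld and pysptk packages for WORLD analysis and synthesis, an assumed MCEP order and warping coefficient, and callables (vuv_model, ap_model, mcep_model) standing in for the trained networks; for simplicity it also assumes ap_model returns full-band aperiodicity, whereas the thesis predicts 2-band values:

    import numpy as np
    import pyworld   # assumed WORLD vocoder bindings
    import pysptk    # assumed MCEP <-> spectrum conversions

    def lar_to_int(x, fs, vuv_model, ap_model, mcep_model, alpha=0.42):
        """Hypothetical LAR-to-INT pipeline following the flowchart."""
        # 1. WORLD analysis of LAR speech (the LAR F0 is unusable; only
        #    the spectral envelope and energy are kept)
        f0, t = pyworld.harvest(x, fs)
        sp = pyworld.cheaptrick(x, f0, t, fs)
        lar_mcep = pysptk.sp2mc(sp, order=24, alpha=alpha)   # spectra -> MCEP
        # 2. Predict INT voicing, aperiodicity, and spectrum from LAR MCEP
        vuv = vuv_model(lar_mcep)          # binary voicing per frame
        ap = ap_model(lar_mcep)            # degree of voicing (aperiodicity)
        int_mcep = mcep_model(lar_mcep)    # cGAN spectral mapping
        int_sp = pysptk.mc2sp(int_mcep, alpha, (sp.shape[1] - 1) * 2)
        # 3. Synthetic F0: falling phrase curve plus energy-driven accent
        #    (detailed under "Synthesizing Pitch" below)
        energy = sp.sum(axis=1)
        phrase = np.exp(np.linspace(np.log(140.0), np.log(60.0), len(f0)))
        accent = 30.0 * energy / (energy.max() + 1e-9)
        int_f0 = (phrase + accent) * vuv   # F0 is zero in unvoiced frames
        # 4. WORLD synthesis of INT speech
        return pyworld.synthesize(int_f0, int_sp, ap, fs)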
  • 48. Data
    For the source LAR speech, a database of 4 male speakers: 3 LAR-TEP speakers (L001, L002, L006) and 1 LAR-ELX speaker (L004)
    For the target INT speech, the ideal option is a natural voice, such as habitual or clear speech; we use a synthetic male voice for expediency and for the ability to create large amounts of data and arbitrary voices
    Each speaker (LAR and INT) has 132 sentences
    Use a random split of 100/16/16 sentences for training, validation, and testing
  • 49. Pre-training Data
    Due to the limited amount of LAR training data, we use pre-training to leverage general knowledge of speech
    Use the multi-speaker TIMIT database for pre-training
    Can we build a pre-training set that better matches LAR speech?
      Simulate LAR-TEP speech by creating a fully unvoiced version of TIMIT (FU-TIMIT)
      Simulate LAR-ELX speech by creating a fully voiced version of TIMIT (FV-TIMIT)
    Use the standard TIMIT split of 462/144/24 speakers for training, validation, and testing
  • 50. Predicting Voicing and Degree of Voicing
    Propose a method for predicting when speech should be voiced, and the degree of voicing, from a spectrogram
    Predict a binary voicing value (VUV) and continuous 2-band aperiodicity (AP) values from mel-cepstral coefficients (MCEP) using deep neural networks (DNNs)
    Pre-train three kinds of speaker-independent DNNs using either TIMIT, FU-TIMIT, or FV-TIMIT as training data
    For each utterance in the training data, use the VUV and AP of the corresponding TIMIT utterance as the target
  • 51. Evaluating Pre-trained Models on Their Test Data
    For testing, apply the three pre-trained models (TIMIT, FU-TIMIT, and FV-TIMIT) to their corresponding test data
    Use balanced accuracy (BAC, defined as average recall) for VUV classification, since the classes are imbalanced, and r2 for AP regression (see the metric sketch below)
    Mapping             TIMIT         FU-TIMIT      FV-TIMIT
    TIMIT → TIMIT       0.99 (0.87)
    FU-TIMIT → TIMIT                  0.89 (0.72)
    FV-TIMIT → TIMIT                                0.93 (0.84)
    Table: BAC, with r2 in brackets; closer to 1 is better
    As expected, the TIMIT model works best because its training data contains the voicing we want to predict
    FU-TIMIT and FV-TIMIT also work well: it is possible to predict voicing from spectral shape alone
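    Both metrics are available in scikit-learn; a small sketch with made-up per-frame predictions (the arrays are illustrative, not thesis data):

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score, r2_score

    vuv_true = np.array([1, 1, 0, 0, 1, 0, 1, 1])   # hypothetical voicing labels
    vuv_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1])
    ap_true = np.random.rand(8, 2)                   # 2-band aperiodicity targets
    ap_pred = ap_true + 0.05 * np.random.randn(8, 2)

    # Balanced accuracy averages recall over the two classes, compensating
    # for voiced frames outnumbering unvoiced ones
    bac = balanced_accuracy_score(vuv_true, vuv_pred)
    r2 = r2_score(ap_true, ap_pred)                  # 1.0 is perfect regression
    print(f"BAC = {bac:.2f}, r2 = {r2:.2f}")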
  • 52. Evaluating Pre-trained Models on LAR Data
    Test the pre-trained models, without adaptation, on predicting the target INT VUV and AP from LAR-TEP and LAR-ELX speech
    Mapping             TIMIT          FU-TIMIT       FV-TIMIT
    L001 (TEP) → INT    0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
    L002 (TEP) → INT    0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
    L004 (ELX) → INT    0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
    L006 (TEP) → INT    0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
    Table: BAC, with r2 in brackets
    Our expectation was that matching the pre-training set to the source speaker (FU-TIMIT for TEP, FV-TIMIT for ELX) would work best
    Although the results do not entirely match that expectation, we still need to adapt the models with LAR speech
  • 53. Adapting Pre-trained Models on LAR Data
    Adapt the pre-trained models with LAR-TEP and LAR-ELX speech
    Use speaker-specific adaptation, similar to one-to-one mapping, due to the limited number of speakers
    Adapt all weights of the DNN models
  • 54. Evaluating Adapted Models
    Mapping             TIMIT          FU-TIMIT       FV-TIMIT
    Before adaptation
    L001 (TEP) → INT    0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
    L002 (TEP) → INT    0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
    L004 (ELX) → INT    0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
    L006 (TEP) → INT    0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
    After adaptation
    L001 (TEP) → INT    0.70 (0.22)    0.67 (0.21)    0.72 (0.23)
    L002 (TEP) → INT    0.73 (0.43)    0.75 (0.43)    0.73 (0.43)
    L004 (ELX) → INT    0.72 (0.29)    0.71 (0.27)    0.70 (0.29)
    L006 (TEP) → INT    0.65 (0.04)    0.67 (0.05)    0.64 (0.05)
    Table: BAC, with r2 in brackets; higher is better
    Adaptation always increases performance
    Pre-training with FU- and FV-TIMIT, as opposed to TIMIT, did not work as expected
  • 55. cGANs for Predicting Spectrum
    Figure: cGAN framework. G takes the LAR MCEP frame with left and right context and generates an INT MCEP frame; D judges "real or generated?"
    We use the same cGAN structure to generate the INT spectrum from the LAR spectrum
  • 57. Evaluating Pre-trained Models
    Pre-train the models due to the limited amount of LAR data
    Mapping             Before   FU-TIMIT   FV-TIMIT
    FU-TIMIT → TIMIT    11.3     7.64
    FV-TIMIT → TIMIT    11.0                6.46
    L001 (TEP) → INT    60.6     60.0       61.9
    L002 (TEP) → INT    46.0     45.0       46.5
    L004 (ELX) → INT    51.5     51.1       52.8
    L006 (TEP) → INT    61.2     61.6       63.0
    Table: Log spectral distortion (dB) before and after pre-training
    Predicting the TIMIT spectrum from the FU- and FV-TIMIT spectrum reduces log spectral distortion from 11.3 to 7.64 dB and from 11.0 to 6.46 dB, respectively
    Applying the pre-trained models to predict the INT spectrum from the LAR spectrum gives no noticeable reduction in distortion
    This lack of improvement is disappointing but not unexpected, as FU-TIMIT and FV-TIMIT know nothing about LAR speech
  • 58. Adapting Pre-trained Models on LAR Speech
    Adapt the pre-trained models on LAR speech
    Mapping             FU-TIMIT      FV-TIMIT
    L001 (TEP) → INT    32 (60.0)     32 (61.9)
    L002 (TEP) → INT    33 (45.0)     33 (46.5)
    L004 (ELX) → INT    31.5 (51.1)   32 (52.8)
    L006 (TEP) → INT    37.8 (61.6)   37 (63.0)
    Table: Log spectral distortion (dB) after adaptation, with pre-adaptation values in brackets
    As expected, adaptation always improved performance
    Pre-training with FU-TIMIT versus FV-TIMIT has no noticeable effect on adaptation
  • 59. Synthesizing Pitch
    F0 is not present in LAR speech
    Use a phrase curve and a single accent curve to model the intonation of each utterance:
      the phrase curve is a logarithmic falling curve from 140 to 60 Hz
      the accent curve is linearly proportional to the LAR energy
    Figure: Example synthetic F0 contour (Hz over frames)
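    A minimal sketch of this intonation model; the accent gain (the maximum excursion in Hz) is an assumed constant, since the slide only states proportionality:

    import numpy as np

    def synth_f0(energy, top_hz=140.0, bottom_hz=60.0, max_accent_hz=30.0):
        """Synthetic F0: log-domain falling phrase curve plus an accent
        curve proportional to per-frame energy."""
        n = len(energy)
        phrase = np.exp(np.linspace(np.log(top_hz), np.log(bottom_hz), n))
        accent = max_accent_hz * energy / (energy.max() + 1e-9)
        return phrase + accent  # Hz, one value per frame

    energy = np.abs(np.random.randn(800))  # stand-in for LAR frame energy
    f0 = synth_f0(energy)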
  • 60. Overall Results
    Conduct perceptual naturalness and intelligibility CMOS tests
    Each participant listened to a pair of sentences A and B, consisting of modified speech versus the LAR speech, and answered "Is A more natural/intelligible than B?" on a 5-point scale: "definitely worse" (−2), "worse" (−1), "same" (0), "better" (+1), "definitely better" (+2)
    There were 48 participants in each CMOS test
    For the LAR speech, we analyzed and re-synthesized it (using WORLD) to make a fair comparison
  • 61. Intelligibility
    INT-spectrum: LAR speech plus the predicted spectrum
    INT-intonation: LAR speech plus the predicted voicing and F0
    INT-all: LAR speech plus the predicted spectrum, voicing, and F0
    Speakers          INT-spectrum   INT-intonation   INT-all
    L001 (TEP)        −0.1           −0.1             0.1
    L002 (TEP)        0.1            0.2              −0.3*
    L004 (ELX)        −0.34*         0.34*            −0.2
    L006 (TEP)        0.2            −0.1             −0.0
    INT-intonation significantly increased intelligibility for L004
    INT-all did not increase intelligibility; we did not observe an increase in overall intelligibility
  • 62. Naturalness
    Speakers          INT-spectrum   INT-intonation   INT-all
    L001 (TEP)        −0.0           −0.3*            0.4*
    L002 (TEP)        −0.1           −0.0             0.1
    L004 (ELX)        −0.56*         −0.25            0.22
    L006 (TEP)        −0.3*          −0.2*            0.7*
    INT-all increased naturalness for all 4 speakers, but the increase was significant only for L001 and L006
    However, when testing the individual components (e.g., the spectrum alone), there was no improvement
  • 63. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
    6 Conclusion
  • 64. Conclusion
    Aim 1: Determine effective spectral features for style conversion
    Proposed two sets of features: PPT and manifold features (VAE-12)
    VAE-12 is better than MCEP-12 and PPT in speech reconstruction
    VAE-12 in combination with DNNs significantly increased intelligibility for one speaker with Parkinson's disease, from 24% to 46%
  • 65. Conclusion
    Aim 2: Develop an effective HAB-to-CLR style mapping
    Proposed a spectral style mapping using cGANs for improving speech intelligibility
    For one-to-one mapping, cGANs outperform the DNN and significantly increase intelligibility for 2 speakers (a typical speaker and one with Parkinson's disease)
    For many-to-one mapping, cGANs significantly increase intelligibility for a speaker with Parkinson's disease
  • 66. Conclusion
    Aim 3: Develop effective methods for LAR-to-INT conversion
    Proposed a method to predict binary voicing/unvoicing and the degree of voicing (aperiodicity) from LAR MCEP using DNNs
    Proposed a method to predict the INT spectrum from the LAR spectrum using cGANs
    Proposed a method to create a synthetic fundamental frequency trajectory from a simple intonation model
    INT-intonation significantly increased intelligibility for 1 speaker
    INT-all significantly increased naturalness for 2 speakers
  • 67. Thanks for your attention 67/67