Improving Speech Intelligibility through
Spectral Style Conversion
Tuan Dinh
Oregon Health & Science University
Sep 2021
Table of Contents
1 Introduction
Motivation
Approach
Thesis Problem and Statement
Specific Aims
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Unintelligible Speech
Speech is important for human communication
The typical way of speaking is referred to as habitual speech
Habitual speech becomes less intelligible in noise
Habitual speech is also hard to understand for people with
hearing impairments and non-native speakers
Unintelligible Speech
Figure: Synthetic speech of speaking devices is degraded by noise
Figure: Atypical speech is hard to understand, especially in noise
Listener Side Solution
Use noise suppression and cancellation methods
Require noise-cancellation devices, which take as input a noisy
speech signal and output an enhanced signal with higher
intelligibility and quality
There are many cases where listeners don’t have noise-cancellation devices (e.g., transit announcements)
Lessons from Real Speakers: Habitual vs Clear
Speakers adjust their voice to make it more intelligible
Adopt special clear speaking style to make habitual speech
more resilient to noisy environments and listener deficits
Researchers showed that:
Clear speech features extended phoneme durations and longer, more frequent pauses [Picheny86, Bradlow03, Krause04]
Clear speech is more intelligible than habitual speech [Picheny85,
Krause02]
Spectral and duration factors probably contribute significantly to the improved intelligibility of clear speech [Kain08, Tjaden14]
Speaker Side Solution
Convert habitual speech directly from speakers into clear
speech prior to its distortion due to background noise
Figure: Make habitual speech (generated by speech synthesizer) more resilient to noise
Figure: Make atypical speech (spoken by people with dysarthria) more resilient to noise
Previous Work on Speaker Side Solution
Applied filters to habitual speech to create spectral
characteristics of clear speech [Koutsogannaki14]
improved intelligibility for typical speakers
had a trade-off between intelligibility and naturalness
did not model the conversion from habitual to clear speech
Utilized HAB-to-CLR spectral style conversion on vowels using
a Gaussian Mixture Model [Mohammadi12]
Converted dysarthric speech into typical speech using a
Gaussian Mixture Model [Kain07]
Converted alaryngeal speech into typical speech using deep
neural networks [Kazuhiro18, Othmane19]
These machine learning-based methods (e.g., deep neural networks) showed the most promising results, but there is still room for improvement
Thesis Problem and Statement
Problem
Modifying the habitual speech of typical and atypical speakers on
the speaker side to increase intelligibility in noise is a challenging
problem
Statement
Speech intelligibility of typical and atypical speakers can be improved automatically by learning how they modify their voice to make it more intelligible
Specific Aims
1 Determine effective spectral features for spectral voice and
style conversion for typical and dysarthric speakers
2 Develop effective HAB-to-CLR spectral mappings using
machine learning algorithms for typical and dysarthric speakers
3 Develop effective methods for converting alaryngeal speech
into intelligible speech, using machine learning algorithms
4 Investigate the performance of duration style conversion on
speech intelligibility (Only in dissertation)
Table of Contents
1 Introduction
2 Background
Acoustic Features and Speech Intelligibility: Hybridization
Voice and Style Conversion
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Acoustic Features and Speech Intelligibility: Hybridization
Determine the acoustic causes of improved intelligibility in
clear speech
1 Insert clear components (e.g., clear spectrum) into habitual
speech to create hybrid speech
Find acoustic components that make hybrid speech more intelligible than habitual speech
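Hybridization can be sketched with an off-the-shelf vocoder. The snippet below is a minimal illustration, not the thesis pipeline: the file names are hypothetical, it naively truncates to the shorter analysis instead of performing proper time alignment, and only the WORLD analysis/synthesis calls (via the pyworld package) are concrete.

```python
import numpy as np
import soundfile as sf
import pyworld as pw  # WORLD vocoder bindings

# Hypothetical file names: the same sentence in habitual and clear style.
hab, fs = sf.read("habitual.wav")
clr, _ = sf.read("clear.wav")

def analyze(x, fs):
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs)         # F0 contour
    sp = pw.cheaptrick(x, f0, t, fs)  # spectral envelope
    ap = pw.d4c(x, f0, t, fs)         # aperiodicity
    return f0, sp, ap

f0_h, sp_h, ap_h = analyze(hab, fs)
_, sp_c, _ = analyze(clr, fs)

# Hybrid = habitual F0, aperiodicity, and duration + clear spectral envelope.
# A real hybridization would time-align the clear envelope to the habitual
# utterance; truncating to the shorter length is a crude stand-in.
n = min(len(f0_h), sp_c.shape[0])
hybrid = pw.synthesize(f0_h[:n], sp_c[:n], ap_h[:n], fs)
sf.write("hybrid.wav", hybrid, fs)
```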
Hybridization Findings
For typical speakers, inserting clear spectrum and duration
obtained 24% improvement in sentence transcription accuracy
[Kain08]
For dysarthric speakers, Tjaden found that
Inserting clear energy obtained 8.7% improvement
Inserting clear spectrum obtained 18% improvement
Inserting clear spectrum and duration obtained 13.4%
improvement in scaled intelligibility test [Tjaden14]
Voice Conversion
Voice Conversion (VC) is the process of transforming a source speaker’s speech so that it sounds like a target speaker’s speech
Figure: Voice Conversion framework
During the training phase, prepare parallel utterances: pairs of utterances from the source and target speakers containing the same words
Voice Conversion: Training Phase
Figure: Voice Conversion framework
1 Speech Analysis:
1 extract speech features
using Vocoder
2 analyze speech features
into mapping features
(Aim 1)
2 Time Alignment: align
mapping features between
source and target speakers
3 Train mapping function:
produces a mapping
function from aligned
mapping features (Aim 2)
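For step 2, dynamic time warping (DTW) is the standard alignment tool. A minimal sketch using librosa’s DTW on toy 12-dimensional feature sequences (the random values stand in for real mapping features):

```python
import numpy as np
import librosa

# Toy mapping features shaped (dims, frames), as librosa expects; random
# placeholders stand in for real source/target feature sequences.
src = np.random.randn(12, 120)   # source-speaker mapping features
tgt = np.random.randn(12, 100)   # target-speaker mapping features

# Dynamic time warping on the frame-wise Euclidean cost.
D, wp = librosa.sequence.dtw(X=src, Y=tgt, metric="euclidean")
wp = wp[::-1]                    # the path is returned end-to-start; reverse

# Aligned frame pairs (source frame, target frame) for training the mapping.
src_aligned = src[:, wp[:, 0]].T
tgt_aligned = tgt[:, wp[:, 1]].T
print(src_aligned.shape, tgt_aligned.shape)   # equal numbers of rows
```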
Voice Conversion: Conversion Phase
Figure: Voice Conversion framework
1 Speech Analysis: analyze
mapping features of input
utterance from source
speaker
2 Map the features: apply
mapping function
3 Speech Synthesis: synthesize
speech signal using Vocoder
Style Conversion
Learn how to map one speaking style to another, such as
habitual to clear, of the same speaker
Use VC mapping techniques in this task
Gaussian mixture models were used to map habitual to clear vowels, yielding only modest results [Mohammadi12]
These mappings are probably limited by:
inappropriate mapping features (Aim 1)
the over-smoothing problem of the mapping techniques (Aim 2) [Toda05]
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
Probabilistic Peak Tracking Features
Manifold Features
Experiment: Reconstruction Quality
Experiment: Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Spectral Features for Style Conversion
Determine effective spectral representations for spectral style
conversion
Contrast two new sets of features:
1 Probabilistic peak tracking (PPT) features
2 Manifold features
Evaluate the two sets in
speech reconstruction
style conversion
The dissertation also includes a voice conversion evaluation
Probabilistic Peak Tracking Features
Represent the spectrum by the frequencies of nine peaks in the magnitude (energy) spectrum and their corresponding bandwidths
Similar spectra have similar peak frequencies
Assume that peak frequencies change slowly and continuously over time
This assumption sometimes causes the peak frequency contours not to pass through actual spectral peaks
Peak bandwidths represent the presence or absence of magnitude peaks:
a wide bandwidth represents the absence of a peak
a narrow bandwidth represents its presence
Probabilistic Peak Tracking
Constrain 4 peak frequencies to be the first 4 formant
frequencies (F1–4) that are important for speech intelligibility
Track 4 more peak frequencies in the high-frequency region, initialized at 5000, 6000, 7000, and 8000 Hz
Also calculate the glottal formant frequency, which is correlated with F0
Finally, calculate corresponding peak bandwidths in an
iterative process to best reconstruct the original spectrum
from computed peak frequencies and peak bandwidths
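The full probabilistic tracker is beyond a slide, but the per-frame starting point, picking prominent peaks in a magnitude spectrum and estimating their bandwidths, can be sketched with scipy; the synthetic spectrum and thresholds below are illustrative assumptions, and the real method additionally tracks peaks across frames:

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Toy magnitude spectrum (dB) over 0-8 kHz with formant-like bumps.
fs, n_bins = 16000, 513
freqs = np.linspace(0, fs / 2, n_bins)
spectrum_db = sum(20 * np.exp(-0.5 * ((freqs - f) / 120) ** 2)
                  for f in (500, 1500, 2500, 3500))

# Per-frame peak picking; the real method tracks peaks probabilistically
# across frames so that contours stay slow and continuous.
peaks, _ = find_peaks(spectrum_db, distance=10, prominence=3.0)
widths = peak_widths(spectrum_db, peaks, rel_height=0.5)[0]

bin_hz = freqs[1] - freqs[0]
for p, w in zip(peaks, widths):
    print(f"peak at {freqs[p]:6.0f} Hz, ~{w * bin_hz:5.0f} Hz wide")
```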
Manifold Features
The features are purely machine-learned
The representation is realized through projection of
high-dimensional acoustic features onto a lower-dimensional
manifold
Learn the manifold from a large multi-speaker speech database using a Variational Autoencoder
Variational Autoencoder (VAE)
A spectrogram is encoded frame by frame
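A minimal frame-wise VAE sketch in PyTorch; the 513-bin input, 256-unit encoder, and toy batch are illustrative assumptions, with only the 12-dim latent matching the VAE-12 features used later:

```python
import torch
import torch.nn as nn

class FrameVAE(nn.Module):
    """Frame-wise VAE: each spectral frame (here 513 bins) is compressed
    to a 12-dim latent (the manifold features). Sizes are illustrative."""
    def __init__(self, n_bins=513, z_dim=12):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()          # reconstruction
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
    return recon + kl

model = FrameVAE()
frames = torch.randn(32, 513)          # a toy batch of spectral frames
x_hat, mu, logvar = model(frames)
loss = vae_loss(frames, x_hat, mu, logvar)
loss.backward()
```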
Using PPT and manifold features for reconstruction
Figure: Speech reconstruction with PPT features
Figure: Speech reconstruction with manifold features
Experiment: Reconstruction Quality
Evaluate the speech reconstruction quality of PPT-20 and manifold features (VAE-12) in comparison to 3 baselines:
20th-order Line Spectral Frequencies (LSF-20)
12th-order Mel-cepstral coefficients (MCEP-12)
Natural speech
Select data from 4 random speakers (2 male, 2 female) in the Voice Conversion Challenge (VCC) dataset
Conduct a comparative mean opinion score (CMOS) test
Participants listen to sentences A and B, and specify whether
A is more natural than B
Answers are on a 5-point scale: “definitely better” (+2), “better” (+1), “same” (0), “worse” (−1), and “definitely worse” (−2)
CMOS Results
A \ B     LSF-20   MCEP-12   VAE-12   PPT-20
NAT       +0.77*   +1.34*    +1.02*   +1.28*
LSF-20             +1.08*    -0.04    +0.26*
MCEP-12                      -0.44*   -0.31*
VAE-12                                +0.45*

Table: Relative quality between original and vocoded stimuli. Positive values show A is better than B. Results marked with an asterisk are significantly different.
CMOS Results
Show the ordering of the systems by projecting the table above onto a single dimension using Multidimensional Scaling (MDS), using all pairwise comparisons
Natural speech (NAT) is better than all synthetic systems
There is still a lot of room for improving synthetic speech
VAE-12 is significantly better than MCEP-12
VAE-12 is significantly better than PPT-20, and more compact
Although LSF-20 is better than VAE-12 here, VAE-12 is better for voice conversion (shown in the dissertation)
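The projection can be reproduced in spirit with scikit-learn’s MDS on a symmetrized dissimilarity matrix built from the table above; treating the absolute CMOS scores as distances is a simplifying assumption:

```python
import numpy as np
from sklearn.manifold import MDS

systems = ["NAT", "LSF-20", "MCEP-12", "VAE-12", "PPT-20"]
# Symmetrized absolute CMOS scores from the table above; 0 on the diagonal.
d = np.array([
    [0.00, 0.77, 1.34, 1.02, 1.28],
    [0.77, 0.00, 1.08, 0.04, 0.26],
    [1.34, 1.08, 0.00, 0.44, 0.31],
    [1.02, 0.04, 0.44, 0.00, 0.45],
    [1.28, 0.26, 0.31, 0.45, 0.00],
])

# Project the precomputed dissimilarities onto one dimension.
mds = MDS(n_components=1, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(d).ravel()
for name, c in sorted(zip(systems, coords), key=lambda p: p[1]):
    print(f"{name:8s} {c:+.2f}")
```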
Experiment: Style Conversion
Evaluate the efficacy of manifold features for mapping habitual style to clear style to improve intelligibility
We only look at manifold features here
Database of 78 speakers: 32 typical speakers (CS), 30 with
multiple sclerosis (MS), and 16 with Parkinson’s disease (PD)
Each read 25 Harvard sentences in habitual and clear style
Establish which speakers benefit from inserting clear spectrum
into habitual speech via Hybridization
Evaluate the intelligibility of hybrid speech (habitual speech plus clear spectrum) using a keyword recall test
66 participants listen to and type 25 Harvard sentences
Hybrid speech improved the intelligibility of habitual speech for 3 speakers: PDF7, PDM6, and CSM7
Variational Autoencoder (VAE)
VAE with Style conversion mapping
Examine two different DNN architectures
1 Feedforward network (called DNN-mapping VAE)
2 Feedforward network with skip connections (called
skip-mapping VAE)
Output is habitual speech plus modified spectrum
Feedforward network with skip connection
Figure: Skip-mapping network. The current HAB VAE-12 frame and 60-dimensional left and right context vectors are concatenated, passed through two 512-unit dense layers, re-concatenated with the input (the skip connection), passed through two more 512-unit dense layers and a 12-dimensional linear layer, and finally added to the current HAB frame to produce the current CLR VAE-12 frame.
The use of skip-connections is motivated by the fact that the
spectral difference in style conversion can be small
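A PyTorch sketch of this architecture; the exact wiring of the second concatenation is my reading of the diagram, and the toy batch is a placeholder:

```python
import torch
import torch.nn as nn

class SkipMappingVAE(nn.Module):
    """Sketch of the skip-mapping network: the 12-dim linear output is
    *added* to the current HAB frame, so the network only has to learn
    the (often small) HAB-to-CLR spectral difference."""
    def __init__(self, frame_dim=12, ctx_dim=60):
        super().__init__()
        in_dim = frame_dim + 2 * ctx_dim          # current + left + right context
        self.block1 = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(512 + in_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.out = nn.Linear(512, frame_dim)

    def forward(self, cur, left, right):
        x = torch.cat([cur, left, right], dim=1)  # Concat
        h = self.block1(x)
        h = self.block2(torch.cat([h, x], dim=1)) # skip connection: Concat again
        return cur + self.out(h)                  # Add: residual output

net = SkipMappingVAE()
cur, left, right = torch.randn(8, 12), torch.randn(8, 60), torch.randn(8, 60)
clr_pred = net(cur, left, right)                  # predicted CLR VAE-12 frames
```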
Speech Intelligibility Evaluation
                      CSM7   PDF7   PDM6
Reconstructed HAB      38     13     24
DNN-mapping VAE        32     13     35
Skip-mapping VAE       38     11     46*
CLR spectrum-hybrid    56*    27*    50*
Reconstructed CLR      69*    23*    41*

Table: Average keyword accuracy (%). Results marked with an asterisk are significantly different.
CLR spectrum-hybrid is HAB speech plus CLR spectrum
It is the gold standard of spectrum mapping
Conduct a keyword recall test with 30 participants
Skip-mapping VAE increased the intelligibility of HAB speech from 24% to 46% for PDM6 (a male with Parkinson’s disease)
This shows the potential of manifold features, but the DNN mapping might be too simplistic
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
Conditional Generative Adversarial Nets: Background
One-to-One Mapping
Many-to-One Mappings
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Spectral Mapping for Style Conversion of Typical and
Dysarthric Speech
Improve HAB-to-CLR spectral mapping for style conversion
Utilize conditional Generative Adversarial Nets (cGANs) to
map the spectral features of habitual speech to those of clear
speech
Investigate the cGANs in three spectral style conversion
mappings:
1 one-to-one mappings
2 many-to-one mappings
3 many-to-many mappings (only in dissertation)
Generative Adversarial Nets
A GAN has a Generator (G) and a Discriminator (D) [Goodfellow14]
G generates images and D decides whether they are generated or real
As one network improves, it forces the other to improve
D is only used during training
Applications: Data Augmentation, face aging, super resolution
cGANs for Style Conversion
Figure: cGAN for style conversion. G takes the HAB VAE-12 frame with left and right context and generates a CLR VAE-12 frame; D decides whether a CLR VAE-12 frame is real or generated.
cGAN is a GAN conditioned on auxiliary data
G takes as input HAB spectrum and generates CLR spectrum
D discriminates between generated and real CLR spectrum
The real CLR and HAB spectra come from the same sentence and speaker
The real CLR spectrum is time-warped to the HAB spectrum
D is conditioned on the HAB spectrum to learn whether a generated CLR spectrum is a good transformation of a HAB spectrum
By including D, we effectively learn a better loss function for G
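A minimal conditional-GAN training step in PyTorch. The two networks here are tiny stand-ins (the actual G is the skip-mapping generator shown next), but the conditioning of D on the HAB input and the alternating updates follow the scheme above:

```python
import torch
import torch.nn as nn

# Tiny stand-in networks with the right shapes: the HAB condition is a
# 132-dim vector (12-dim frame + two 60-dim contexts), CLR frames are 12-dim.
G = nn.Sequential(nn.Linear(132, 512), nn.ReLU(), nn.Linear(512, 12))
D = nn.Sequential(nn.Linear(132 + 12, 512), nn.ReLU(), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

hab = torch.randn(16, 132)   # HAB frames with context (toy batch)
clr = torch.randn(16, 12)    # time-warped real CLR frames (toy batch)

# Discriminator step: D sees (HAB condition, CLR frame) pairs.
fake = G(hab).detach()
d_loss = bce(D(torch.cat([hab, clr], dim=1)), torch.ones(16, 1)) \
       + bce(D(torch.cat([hab, fake], dim=1)), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool D -- this is the "learned loss function" for G.
g_loss = bce(D(torch.cat([hab, G(hab)], dim=1)), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```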
Structure of Generator
Figure: Generator structure, identical to the skip-mapping network shown earlier: the current HAB VAE-12 frame plus 60-dimensional left and right contexts pass through two pairs of 512-unit dense layers with a concatenation skip connection, and a 12-dimensional linear output is added to the input frame to yield the current CLR VAE-12 frame.
One-to-One Mapping
The goal is to improve the performance of style conversion from the previous section
Train a cGAN for each speaker, mapping HAB to CLR spectrum
At conversion time, apply the speaker-specific mapping to the same speaker
The output is habitual speech plus the modified spectrum
Objective Evaluation: Log Spectral Distortion (dB)
Log spectral distortion is the root-mean-square difference between the converted spectrum and the target CLR spectrum

                          speaker
mapping                   PDF7    PDM6    CSM7
DNN (previous section)    16.80   16.67   16.44
GAN                       12.85   12.58   12.67
GAN has lower log spectral distortion than DNN
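A sketch of the metric on toy spectra; averaging the per-frame RMS distortion over frames is an assumed convention:

```python
import numpy as np

def log_spectral_distortion(converted, target, eps=1e-10):
    """RMS difference between log-magnitude spectra (dB), per frame,
    averaged over frames. Inputs: (frames, bins) linear-magnitude arrays."""
    diff = 20 * np.log10(np.maximum(converted, eps)) \
         - 20 * np.log10(np.maximum(target, eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))

conv = np.abs(np.random.randn(100, 513)) + 1e-3   # converted spectra (toy)
targ = np.abs(np.random.randn(100, 513)) + 1e-3   # target CLR spectra (toy)
print(f"LSD = {log_spectral_distortion(conv, targ):.2f} dB")
```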
Examples of spectrograms
Note the difference in formants between 2–4 kHz in the red
box
Subjective Evaluation
Log spectral distortion is only a rough predictor of human perception
Conduct a keyword recall test with 60 participants, who listen to and type 25 Harvard sentences (same as the previous experiments)
Figure: Average keyword accuracy (%) for vocoded HAB, DNN, GAN, hybrid, and vocoded CLR conditions, for speakers CSM7, PDF7, and PDM6.
cGAN outperforms DNN
cGAN significantly increases intelligibility for two speakers (one typical speaker and one with Parkinson’s disease)
Many-to-One Mappings
One-to-one mappings have the disadvantage of requiring speaker-specific training data
This makes them difficult to apply to new speakers in real-life applications
Method
Pick the two target speakers with the best sentence-level intelligibility
one male and one female
both happen to be typical speakers
Map the habitual speech of multiple source speakers to the targets
Train on all speakers except the two targets and the three test speakers (PDM6, PDF7, CSM7)
i.e., 29 typical speakers, 30 with MS, and 14 with Parkinson’s disease
In conversion, apply the mapping on unseen speakers
Subjective Evaluation
Conduct a keyword recall test with 44 participants
Figure: Keyword recall accuracy (%) of the three test speakers (CSM7, PDF7, PDM6) for vocoded HAB, GAN, hybrid, and vocoded CLR conditions. The dashed lines show statistically significant differences.
Many-to-one mapping increases intelligibility for one speaker (a person with Parkinson’s disease)
It is promising, but not as good as one-to-one mapping
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
Data
Predicting Voicing or Degree of Voicing
Predicting Spectrum
Synthesizing Pitch
Subjective Evaluation
6 Conclusion
Alaryngeal Speech
People who undergo total laryngectomy lose their ability to
produce speech sounds normally
Their speech options (esophageal speech, tracheo-esophageal puncture (TEP), and electrolarynx (ELX)) are difficult to understand due to:
poor voice quality
no voiced/unvoiced differentiation
lack of articulatory precision
no F0
Alaryngeal speech is more distorted than mild Parkinson’s speech
There is no clear speech style available for LAR speakers
Flowchart of proposed method
Figure: Flowchart of the proposed method. LAR speech is analyzed by the WORLD vocoder; LAR spectra are converted to MCEPs, which feed the MCEP, AP, and VUV models to produce INT MCEPs (converted back to INT spectra), INT aperiodicity, and INT voicing; pitch accent curve synthesis driven by LAR energy produces the INT F0; the WORLD vocoder then synthesizes INT speech.
Propose an approach for transforming alaryngeal speech
(LAR) to intelligible speech (INT):
1 Predict INT binary voicing/unvoicing and degree of voicing
(aperiodicity) from LAR spectrum using DNNs (VUV model
and AP model)
2 Predict INT spectrum from LAR spectrum using cGANs
(MCEP model)
3 Create synthetic F0 from a simple intonation model (Pitch
accent curve synthesis)
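The analysis/synthesis skeleton around these three components can be sketched with the pyworld bindings for the WORLD vocoder. The model calls below are placeholders (identity mappings and a constant F0) and the file names are hypothetical; only the vocoder calls are concrete:

```python
import numpy as np
import soundfile as sf
import pyworld as pw  # WORLD vocoder bindings

x, fs = sf.read("lar_utterance.wav")                # hypothetical file name
x = np.ascontiguousarray(x, dtype=np.float64)

f0_lar, t = pw.harvest(x, fs)                       # LAR F0 (largely unusable)
sp_lar = pw.cheaptrick(x, f0_lar, t, fs)            # LAR spectral envelope
ap_lar = pw.d4c(x, f0_lar, t, fs)                   # LAR aperiodicity

# Placeholders for the three learned components described above:
sp_int = sp_lar                                     # <- MCEP model (cGAN) output
ap_int = ap_lar                                     # <- VUV/AP model output
f0_int = np.full_like(f0_lar, 120.0)                # <- pitch accent curve synthesis

y = pw.synthesize(f0_int, sp_int, ap_int, fs)       # INT speech
sf.write("int_utterance.wav", y, fs)
```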
Data
For source LAR speech: a database of 4 male speakers, 3 LAR-TEP speakers (L001, L002, L006) and 1 LAR-ELX speaker (L004)
For target INT speech, the ideal option would be a natural voice, such as habitual or clear speech; I use a synthetic male voice due to:
expediency
the capability of creating a lot of data and arbitrary voices
Each speaker (LAR and INT) has 132 sentences
Use random split of 100/16/16 sentences for training,
validation, and testing
Pre-training Data
Due to the limited amount of LAR training data, we use pre-training to leverage general knowledge of speech
Use the multi-speaker TIMIT database for pre-training
Can we make a pre-training set that better matches LAR
speech?
Simulate LAR-TEP speech by creating a fully unvoiced version
of TIMIT (FU-TIMIT)
Simulate LAR-ELX speech by creating a fully voiced version of
TIMIT (FV-TIMIT)
Use standard TIMIT split of 462/144/24 speakers for training,
validation, and testing
Predicting Voicing and Degree of Voicing
Propose a method for predicting when speech should be voiced, and the degree of voicing, from a spectrogram
Predict a binary voicing value (VUV) and continuous 2-band aperiodicity (AP) values from mel-cepstral coefficients (MCEP), using deep neural networks (DNNs)
Pre-train three kinds of speaker-independent DNNs using
either TIMIT, FU-TIMIT, or FV-TIMIT as training data
For each utterance in training data, use VUV and AP from
corresponding utterances in TIMIT as target
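A sketch of such a predictor in PyTorch. The thesis describes separate VUV and AP models; for brevity this sketch shares one trunk with two heads, and all layer sizes and the random training batch are illustrative:

```python
import torch
import torch.nn as nn

class VoicingNet(nn.Module):
    """Frame-wise DNN predicting a binary voicing value (VUV) and 2-band
    aperiodicity (AP) from MCEPs; sizes are illustrative."""
    def __init__(self, mcep_dim=31, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(mcep_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.vuv_head = nn.Linear(hidden, 1)   # logit for voiced/unvoiced
        self.ap_head = nn.Linear(hidden, 2)    # 2-band aperiodicity

    def forward(self, mcep):
        h = self.trunk(mcep)
        return self.vuv_head(h), self.ap_head(h)

model = VoicingNet()
mcep = torch.randn(64, 31)                       # a toy batch of MCEP frames
vuv_target = torch.randint(0, 2, (64, 1)).float()
ap_target = torch.rand(64, 2)
vuv_logit, ap_pred = model(mcep)
loss = nn.functional.binary_cross_entropy_with_logits(vuv_logit, vuv_target) \
     + nn.functional.mse_loss(ap_pred, ap_target)
loss.backward()
```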
Evaluating Pre-trained models on their Test Data
For testing, apply the three pre-trained models (TIMIT, FU-TIMIT, and FV-TIMIT) to their corresponding test data
Use balanced accuracy (BAC, defined as average recall) for VUV classification (since the classes were imbalanced), and r2 for AP regression; see the sketch below
                     Pre-training set
Mapping              TIMIT         FU-TIMIT      FV-TIMIT
TIMIT → TIMIT        0.99 (0.87)
FU-TIMIT → TIMIT                   0.89 (0.72)
FV-TIMIT → TIMIT                                 0.93 (0.84)

Table: BAC, with r2 in brackets; closer to 1 is better.
As expected, TIMIT model works best because training data
contains voicing that we want to predict
FU-TIMIT and FV-TIMIT also work well
It’s possible to predict voicing from spectral shape alone
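Both metrics are available in scikit-learn; a small worked example with toy, imbalanced voicing labels:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, r2_score

# BAC = average of per-class recall, robust to voiced/unvoiced imbalance.
vuv_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])   # mostly voiced frames
vuv_pred = np.array([1, 1, 1, 1, 1, 1, 1, 0])
print(balanced_accuracy_score(vuv_true, vuv_pred))  # (6/6 + 1/2) / 2 = 0.75

ap_true = np.random.rand(100, 2)                # 2-band aperiodicity targets
ap_pred = ap_true + 0.1 * np.random.randn(100, 2)
print(r2_score(ap_true, ap_pred))               # closer to 1 is better
```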
Evaluating Pre-trained models on LAR data
Test the pre-trained models, without adaptation, predicting target INT VUV or AP from LAR-TEP and LAR-ELX speech
                     Pre-training set
Mapping              TIMIT          FU-TIMIT       FV-TIMIT
L001 (TEP) → INT     0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
L002 (TEP) → INT     0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
L004 (ELX) → INT     0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
L006 (TEP) → INT     0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)

Table: BAC, with r2 in brackets.
Our expectation was that matching the pre-training set to the source speaker (FU-TIMIT for TEP, FV-TIMIT for ELX) would work best
Although the results do not entirely match this expectation, we still need to adapt our models with LAR speech
Adapting Pre-trained models on LAR data
Adapt the pre-trained models with LAR-TEP and LAR-ELX speech
Use speaker-specific adaptation due to the limited number of speakers (similar to one-to-one mapping)
Adapt all weights in the DNN models
Evaluating Adapted models
                     Pre-training set
Mapping              TIMIT          FU-TIMIT       FV-TIMIT
Before adaptation
L001 (TEP) → INT     0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
L002 (TEP) → INT     0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
L004 (ELX) → INT     0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
L006 (TEP) → INT     0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
After adaptation
L001 (TEP) → INT     0.70 (0.22)    0.67 (0.21)    0.72 (0.23)
L002 (TEP) → INT     0.73 (0.43)    0.75 (0.43)    0.73 (0.43)
L004 (ELX) → INT     0.72 (0.29)    0.71 (0.27)    0.70 (0.29)
L006 (TEP) → INT     0.65 (0.04)    0.67 (0.05)    0.64 (0.05)

Table: BAC, with r2 in brackets; higher is better.
Adaptation always increases performance
Pre-training with FU- or FV-TIMIT, as opposed to TIMIT, did not work as expected
cGANs for Predicting Spectrum
Figure: cGAN for spectrum prediction. G takes the current LAR MCEP frame with left and right context and generates an INT MCEP frame; D discriminates between generated and real INT MCEP frames.
We use the same cGAN structure to generate the INT spectrum from the LAR spectrum
Structure of Generator
Figure: Generator structure for spectrum prediction: the current LAR MCEP frame (order 31) plus 155-dimensional left and right contexts pass through two pairs of 512-unit dense layers with a concatenation skip connection, and a 31-dimensional linear output yields the current INT MCEP frame.
Evaluating Pre-trained models
Pre-train models due to the limited amount of LAR data
                     pre-training set
mapping              Before   FU-TIMIT   FV-TIMIT
FU-TIMIT → TIMIT     11.3     7.64
FV-TIMIT → TIMIT     11.0                6.46
L001 (TEP) → INT     60.6     60.0       61.9
L002 (TEP) → INT     46.0     45.0       46.5
L004 (ELX) → INT     51.5     51.1       52.8
L006 (TEP) → INT     61.2     61.6       63.0

Table: Log spectral distortion (dB).
Predicting the TIMIT spectrum from FU- and FV-TIMIT spectra results in 7.64 dB for FU-TIMIT and 6.46 dB for FV-TIMIT, reducing log spectral distortion from 11.3 and 11.0 dB, respectively
Apply pre-trained models to predict INT spectrum from LAR
spectrum
No noticeable reduction of distortion
The lack of improvement is disappointing but not unexpected, as FU-TIMIT and FV-TIMIT do not know about LAR speech
Adapting Pre-trained models on LAR speech
Adapt pre-trained models on LAR speech
                     pre-trained set
mapping              FU-TIMIT      FV-TIMIT
L001 (TEP) → INT     32 (60.0)     32 (61.9)
L002 (TEP) → INT     33 (45.0)     33 (46.5)
L004 (ELX) → INT     31.5 (51.1)   32 (52.8)
L006 (TEP) → INT     37.8 (61.6)   37 (63.0)

Table: Log spectral distortion (dB) after adaptation, with before-adaptation values in brackets.
As expected, the adaptation always improved performance
Pre-training with FU-TIMIT versus FV-TIMIT does not have a noticeable effect on adaptation
Synthesizing Pitch
F0 is not present in LAR speech
Use a phrase curve and a single accent curve to model the intonation of each utterance
The phrase curve falls logarithmically from 140 to 60 Hz
The accent curve is linearly proportional to LAR energy
Figure: Example synthetic F0 contour (Hz over frames).
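A numpy sketch of this intonation model; the 20 Hz accent excursion and the toy energy contour are illustrative assumptions:

```python
import numpy as np

n_frames = 800

# Phrase curve: logarithmically falling from 140 to 60 Hz
# (linear interpolation in the log-F0 domain).
phrase = np.exp(np.linspace(np.log(140.0), np.log(60.0), n_frames))

# Accent curve: linearly proportional to per-frame LAR energy.
# The sinusoidal energy contour stands in for a real LAR analysis.
t = np.arange(n_frames)
energy = np.abs(np.sin(2 * np.pi * t / 200))
accent = 20.0 * energy / energy.max()

f0 = phrase + accent            # synthetic F0 contour in Hz
print(f0.min(), f0.max())       # inspect the resulting range
```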
Overall Results
Conduct CMOS tests of perceptual naturalness and intelligibility
Each participant listened to pairs of sentences A & B, pitting modified speech against the LAR speech
and answered “Is A more natural/intelligible than B?” on a 5-point scale: “definitely worse” (−2), “worse” (−1), “same” (0), “better” (+1), “definitely better” (+2)
There were 48 participants in each CMOS test
For a fair comparison, the LAR speech was analyzed and re-synthesized using WORLD
Intelligibility
INT-spectrum: LAR speech plus predicted spectrum
INT-intonation: LAR speech plus predicted voicing, F0
INT-all: LAR speech plus predicted spectrum, voicing, and F0
                     Systems
Speakers         INT-spectrum   INT-intonation   INT-all
L001 (TEP)       −0.1           −0.1              0.1
L002 (TEP)        0.1            0.2             −0.3*
L004 (ELX)       −0.34*          0.34*           −0.2
L006 (TEP)        0.2           −0.1             −0.0
INT-intonation significantly increased intelligibility for L004
INT-all did not increase intelligibility
We did not observe an increase in overall intelligibility
Naturalness
                     Systems
Speakers         INT-spectrum   INT-intonation   INT-all
L001 (TEP)       −0.0           −0.3*             0.4*
L002 (TEP)       −0.1           −0.0              0.1
L004 (ELX)       −0.56*         −0.25             0.22
L006 (TEP)       −0.3*          −0.2*             0.7*
INT-all increased naturalness for all 4 speakers
but the increase is only significant for L001 and L006
However, when testing the individual components (e.g., spectrum alone), there is no improvement
Table of Contents
1 Introduction
2 Background
3 Spectral Features for Style Conversion
4 Spectral Mapping for Style Conversion of Typical and Dysarthric
Speech
5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
6 Conclusion
Conclusion
Aim 1: Determine effective spectral features for style
conversion
Proposed two sets of features: PPT and manifold features
(VAE-12)
VAE-12 is better than MCEP-12 and PPT in speech
reconstruction
VAE-12 in combination with DNNs significantly increases intelligibility for one speaker with Parkinson’s disease, from 24% to 46%
Conclusion
Aim 2: Develop effective HAB-to-CLR style mapping
Proposed a spectral style mapping using cGANs for improving
speech intelligibility
For one-to-one mapping, cGANs outperform the DNN and significantly increase intelligibility for 2 speakers (a typical speaker and one with Parkinson’s disease)
For many-to-one mapping, cGANs significantly increase intelligibility for a speaker with Parkinson’s disease
Conclusion
Aim 3: Develop effective methods for LAR-to-INT conversion
Proposed a method to predict binary voicing/unvoicing and
degree of voicing (aperiodicity) from LAR MCEP using DNNs
Proposed a method to predict INT spectrum from LAR
spectrum using cGANs
Proposed a method to create a synthetic fundamental
frequency trajectory from a simple intonation model
INT-intonation significantly increases intelligibility for 1
speaker
INT-all significantly increases naturalness for 2 speakers
Thanks for your attention
67/67

More Related Content

Similar to Final defense

Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
Universidad Nacional de San Martin
 
LPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A ReviewLPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A Review
ijiert bestjournal
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
inscit2006
 
B110512
B110512B110512
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speechNikolay Karpov
 
Direct Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete UnitsDirect Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete Units
IJCI JOURNAL
 
Powerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptxPowerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptx
VinodkumarGaniger1
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translationbehzad66
 
voice morphing.pptx
voice morphing.pptxvoice morphing.pptx
voice morphing.pptx
yashisolanki02
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
simonp16
 
Visual speech to text conversion applicable to telephone communication
Visual speech to text conversion  applicable  to telephone communicationVisual speech to text conversion  applicable  to telephone communication
Visual speech to text conversion applicable to telephone communication
Swathi Venugopal
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Kotaro Hara
 
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
sipij
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
IJERA Editor
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshi
Hiroyuki Miyoshi
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
Bilgin Aksoy
 
Survey On Speech Synthesis
Survey On Speech SynthesisSurvey On Speech Synthesis
Survey On Speech Synthesis
CSCJournals
 
Voice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from LaryngectomyVoice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from Laryngectomy
International Journal of Science and Research (IJSR)
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speechNikolay Karpov
 

Similar to Final defense (20)

Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
LPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A ReviewLPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A Review
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
B110512
B110512B110512
B110512
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speech
 
Direct Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete UnitsDirect Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete Units
 
Powerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptxPowerpoint on Linear Predictive coding.pptx
Powerpoint on Linear Predictive coding.pptx
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
 
voice morphing.pptx
voice morphing.pptxvoice morphing.pptx
voice morphing.pptx
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
 
Visual speech to text conversion applicable to telephone communication
Visual speech to text conversion  applicable  to telephone communicationVisual speech to text conversion  applicable  to telephone communication
Visual speech to text conversion applicable to telephone communication
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
 
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBE...
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshi
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
 
Survey On Speech Synthesis
Survey On Speech SynthesisSurvey On Speech Synthesis
Survey On Speech Synthesis
 
Voice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from LaryngectomyVoice Morphing System for People Suffering from Laryngectomy
Voice Morphing System for People Suffering from Laryngectomy
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speech
 

Recently uploaded

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
SciAstra
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 

Recently uploaded (20)

如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 

Final defense

  • 1. Improving Speech Intelligibility through Spectral Style Conversion Tuan Dinh Oregon Health & Science University Sep 2021
  • 2. Table of Contents 1 Introduction Motivation Approach Thesis Problem and Statement Specific Aims 2 Background 3 Spectral Features for Style Conversion 4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech 5 Voice Conversion and F0 Synthesis of Alaryngeal Speech 6 Conclusion
  • 3. 3/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Unintelligible Speech Speech is important for human communication Typical way of speaking is referred as habitual speech Habitual speech becomes less intelligible in noise Habitual speech is also hard to understand for people with hearing impairments and non-native speakers Tuan Dinh Improving Speech Intelligibility
  • 4. Unintelligible Speech Figure: Synthetic speech of speaking devices is degraded by noise Figure: Atypical speech is hard to understand, especially in noise 4/67
  • 5. 5/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Listener Side Solution Use noise suppression and cancellation methods Require noise-cancellation devices, which take as input a noisy speech signal and output an enhanced signal with higher intelligibility and quality There are many cases where listeners don’t have noise-cancellation devices transit announcements Tuan Dinh Improving Speech Intelligibility
  • 6. Lessons from Real Speakers: Habitual vs Clear Speakers adjust their voice to make it more intelligible Adopt special clear speaking style to make habitual speech more resilient to noisy environments and listener deficits Researchers showed that: Clear speech features extended phoneme duration, longer and more frequent pauses [Picheny86, Bradlow03, Krause04] Clear speech is more intelligible than habitual speech [Picheny85, Krause02] Spectral and duration factors are probably significant to the improved intelligibility of clear speech [Kain08, Tjaden14] 6/67
  • 7. Speaker Side Solution Convert habitual speech directly from speakers into clear speech prior to its distortion due to background noise Figure: Make habitual speech (generated by speech synthesizer) more resilient to noise Figure: Make atypical speech (spoken by people with dysarthria) more resilient to noise 7/67
  • 8. Previous Work on Speaker Side Solution Applied filters to habitual speech to create spectral characteristics of clear speech [Koutsogannaki14] improved intelligibility for typical speakers had a trade-off between intelligibility and naturalness did not model the conversion from habitual to clear speech Utilized HAB-to-CLR spectral style conversion on vowels using a Gaussian Mixture Model [Mohammadi12] Converted dysarthric speech into typical speech using a Gaussian Mixture Model [Kain07] Converted alaryngeal speech into typical speech using deep neural networks [Kazuhiro18, Othmane19] These machine learning-based methods (e.g., deep neural networks) showed the most promising results; but there is still room for improvement 8/67
  • 9. 9/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Thesis Problem and Statement Problem Modifying the habitual speech of typical and atypical speakers on the speaker side to increase intelligibility in noise is a challenging problem Statement Speech intelligibility of typical and atypical speakers can be improved automatically by learning how they map their voice and make it more intelligible Tuan Dinh Improving Speech Intelligibility
  • 10. 10/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Motivation Approach Thesis Problem and Statement Specific Aims Specific Aims 1 Determine effective spectral features for spectral voice and style conversion for typical and dysarthric speakers 2 Develop effective HAB-to-CLR spectral mappings using machine learning algorithms for typical and dysarthric speakers 3 Develop effective methods for converting alaryngeal speech into intelligible speech, using machine learning algorithms 4 Investigate the performance of duration style conversion on speech intelligibility (Only in dissertation) Tuan Dinh Improving Speech Intelligibility
  • 11. Table of Contents 1 Introduction 2 Background Acoustic Features and Speech Intelligibility: Hybridization Voice and Style Conversion 3 Spectral Features for Style Conversion 4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech 5 Voice Conversion and F0 Synthesis of Alaryngeal Speech 6 Conclusion 11/67
  • 12. Acoustic Features and Speech Intelligibility: Hybridization Determine the acoustic causes of improved intelligibility in clear speech 1 Insert clear components (e.g., clear spectrum) into habitual speech to create hybrid speech 2 Find acoustic components that make hybrid speech more intelligible than habitual speech 12/67
  • 13. Hybridization Findings For typical speakers, inserting clear spectrum and duration obtained 24% improvement in sentence transcription accuracy [Kain08] For dysarthric speakers, Tjaden found that Inserting clear energy obtained 8.7% improvement Inserting clear spectrum obtained 18% improvement Inserting clear spectrum and duration obtained 13.4% improvement in scaled intelligibility test [Tjaden14] 13/67
  • 14. 14/67 Introduction Background Spectral Features for Style Conversion Spectral Mapping for Style Conversion of Typical and Dysarthric Speech Voice Conversion and F0 Synthesis of Alaryngeal Speech Conclusion Acoustic Features and Speech Intelligibility: Hybridization Voice and Style Conversion Voice Conversion Voice Conversion (VC) is a process of transforming a source speaker’s speech so it sounds like a target speaker’s speech Figure: Voice Conversion framework During Training Phase, prepare parallel utterances, which contain pairs of utterances from source and target speakers with the same words Tuan Dinh Improving Speech Intelligibility
  • 15. Voice Conversion: Training Phase Figure: Voice Conversion framework 1 Speech Analysis: 1 extract speech features using Vocoder 2 analyze speech features into mapping features (Aim 1) 2 Time Alignment: align mapping features between source and target speakers 3 Train mapping function: produces a mapping function from aligned mapping features (Aim 2) 15/67
  • 16. Voice Conversion: Conversion Phase
    Figure: Voice Conversion framework
    1 Speech analysis: analyze the mapping features of the input utterance from the source speaker
    2 Feature mapping: apply the mapping function
    3 Speech synthesis: synthesize the speech signal using the vocoder
  • 17. Style Conversion
    Learn how to map one speaking style of the same speaker to another, such as habitual to clear
    Use VC mapping techniques for this task
    Gaussian mixture models were used to map habitual to clear vowels, with only modest results [Mohammadi12]
    These mappings are probably limited by:
      inappropriate mapping features (Aim 1)
      the over-smoothing problem of the mapping techniques (Aim 2) [Toda05]
  • 18. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
      Probabilistic Peak Tracking Features
      Manifold Features
      Experiment: Reconstruction Quality
      Experiment: Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
    6 Conclusion
  • 19. Spectral Features for Style Conversion
    Determine effective spectral representations for spectral style conversion
    Contrast two new sets of features:
      1 Probabilistic peak tracking (PPT) features
      2 Manifold features
    Evaluate the two sets on speech reconstruction and style conversion
    The dissertation also includes a voice conversion evaluation
  • 20. Probabilistic Peak Tracking Features
    Represent the spectrum by the frequencies of nine peaks in the magnitude (energy) spectrum and their corresponding peak bandwidths
    Similar spectra have similar peak frequencies
    Assume that peak frequencies change slowly and continuously over time; this assumption occasionally causes the peak-frequency contours to miss actual spectral peaks
    Peak bandwidths represent the presence or absence of magnitude peaks:
      a wide bandwidth indicates the absence of a peak
      a narrower bandwidth indicates its presence
  • 21. Probabilistic Peak Tracking
    Constrain 4 peak frequencies to be the first 4 formant frequencies (F1–F4), which are important for speech intelligibility
    Track 4 peak frequencies in the high-frequency region, initialized at 5000, 6000, 7000, and 8000 Hz
    Also calculate the glottal formant frequency, which is correlated with F0
    Finally, calculate the corresponding peak bandwidths in an iterative process that best reconstructs the original spectrum from the computed peak frequencies and bandwidths
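    As a toy illustration of the peak-plus-bandwidth representation (not the thesis's iterative estimator), each peak can be drawn as a resonance-shaped bump whose width follows its bandwidth, so a very wide bandwidth flattens the bump and the peak effectively disappears; the Lorentzian bump shape and all numbers below are assumptions:

    import numpy as np

    def peak_envelope(freqs_hz, bandwidths_hz, n_bins=513, fs=16000):
        """Toy spectral envelope: each peak adds a resonance-shaped bump;
        a wider bandwidth gives a flatter bump (peak effectively absent)."""
        f = np.linspace(0, fs / 2, n_bins)
        env = np.zeros(n_bins)
        for fc, bw in zip(freqs_hz, bandwidths_hz):
            env += 1.0 / (1.0 + ((f - fc) / bw) ** 2)   # Lorentzian bump at fc
        return 20 * np.log10(env + 1e-9)                # log magnitude in dB

    # Nine peaks: a glottal peak, F1-F4, and four fixed high-frequency tracks
    peaks = [120, 500, 1500, 2500, 3500, 5000, 6000, 7000, 8000]
    bws = [100, 120, 150, 200, 250, 400, 450, 500, 550]
    envelope_db = peak_envelope(peaks, bws)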
  • 22. Manifold Features
    The features are purely machine-learned
    The representation is realized by projecting high-dimensional acoustic features onto a lower-dimensional manifold
    Learn the manifold from a large multi-speaker speech database using a Variational Autoencoder
  • 23. Variational Autoencoder (VAE)
    A spectrogram is encoded frame by frame
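    A minimal sketch of such a frame-wise VAE, assuming PyTorch, a 513-bin input spectrum, and illustrative layer sizes (the thesis's exact architecture may differ); the 12-dimensional latent vector plays the role of the VAE-12 manifold feature used later:

    import torch
    import torch.nn as nn

    class FrameVAE(nn.Module):
        """Frame-wise VAE: encodes one spectral frame into a 12-dim latent."""
        def __init__(self, in_dim=513, latent_dim=12, hidden=256):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent_dim)       # posterior mean
            self.logvar = nn.Linear(hidden, latent_dim)   # posterior log-variance
            self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            return self.dec(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        recon = ((x - x_hat) ** 2).sum(-1).mean()  # frame reconstruction error
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl                          # KL term pulls z toward N(0, I)

    model = FrameVAE()
    frames = torch.randn(32, 513)                  # a batch of log-spectral frames
    x_hat, mu, logvar = model(frames)
    loss = vae_loss(frames, x_hat, mu, logvar)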
  • 24. Using PPT and manifold features for reconstruction
    Figure: Speech reconstruction with PPT features
    Figure: Speech reconstruction with manifold features
  • 25. Experiment: Reconstruction Quality
    Evaluate the speech reconstruction quality of PPT-20 and manifold features (VAE-12) against 3 baselines:
      20th-order Line Spectral Frequencies (LSF-20)
      12th-order mel-cepstral coefficients (MCEP-12)
      Natural speech
    Select data from 4 random speakers (2 male, 2 female) in the Voice Conversion Challenge (VCC) dataset
    Conduct a comparative mean opinion score (CMOS) test
      Participants listen to sentences A and B and specify whether A is more natural than B
      Answers are on a 5-point scale: "definitely better" (+2), "better" (+1), "same" (0), "worse" (−1), and "definitely worse" (−2)
  • 26. CMOS Results
    A \ B     LSF-20   MCEP-12   VAE-12   PPT-20
    NAT       +0.77*   +1.34*    +1.02*   +1.28*
    LSF-20             +1.08*    −0.04    +0.26*
    MCEP-12                      −0.44*   −0.31*
    VAE-12                                +0.45*
    Table: Relative quality between original and vocoded stimuli. Positive values show A is better than B. Results marked with an asterisk are significantly different.
  • 27. CMOS Results
    Show an ordering of the systems by projecting the table above onto a single dimension using Multidimensional Scaling (MDS), computed from all pairwise comparisons
    Natural speech (NAT) is better than all synthetic systems; there is still considerable room for improving synthetic speech
    VAE-12 is significantly better than MCEP-12
    VAE-12 is significantly better than PPT-20 and more compact
    Although LSF-20 is better than VAE-12 here, VAE-12 is better for voice conversion (in the dissertation)
  • 28. Experiment: Style Conversion
    Evaluate the efficacy of manifold features for mapping the habitual style to the clear style to improve intelligibility (we only consider manifold features here)
    Database of 78 speakers: 32 typical speakers (CS), 30 with multiple sclerosis (MS), and 16 with Parkinson's disease (PD); each read 25 Harvard sentences in habitual and clear style
    Establish which speakers benefit from inserting the clear spectrum into habitual speech via hybridization
    Evaluate the intelligibility of hybrid speech (habitual speech plus clear spectrum) with a keyword recall test: 66 participants listened to and transcribed 25 Harvard sentences
    Hybrid speech improved the intelligibility of habitual speech for 3 speakers: PDF7, PDM6, and CSM7
  • 30. VAE with Style-Conversion Mapping
    Examine two different DNN architectures:
    1 Feedforward network (called DNN-mapping VAE)
    2 Feedforward network with skip connections (called skip-mapping VAE)
    The output is habitual speech with a modified spectrum
  • 31. Feedforward Network with Skip Connections
    Architecture: [Current HAB VAE-12, Left Context 60, Right Context 60] → Concat → Dense 512 → Dense 512 → Concat → Dense 512 → Dense 512 → Linear 12 → Add (with the current HAB frame) → Current CLR VAE-12
    The use of skip connections is motivated by the fact that the spectral difference in style conversion can be small (see the sketch below)
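    A sketch of this architecture, assuming PyTorch; the diagram does not state which nonlinearity the Dense layers use or exactly what the second Concat joins, so the ReLUs and the skip concatenation of the input are assumptions:

    import torch
    import torch.nn as nn

    class SkipMappingVAE(nn.Module):
        """Predicts a 12-dim residual that is added to the current habitual
        VAE frame, so a near-identity mapping is easy to learn."""
        def __init__(self, frame_dim=12, ctx_dim=60, hidden=512):
            super().__init__()
            in_dim = frame_dim + 2 * ctx_dim              # current frame + contexts
            self.block1 = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.block2 = nn.Sequential(
                nn.Linear(hidden + in_dim, hidden), nn.ReLU(),   # skip concat
                nn.Linear(hidden, hidden), nn.ReLU())
            self.out = nn.Linear(hidden, frame_dim)

        def forward(self, hab_frame, left_ctx, right_ctx):
            x = torch.cat([hab_frame, left_ctx, right_ctx], dim=-1)
            h = self.block2(torch.cat([self.block1(x), x], dim=-1))
            return hab_frame + self.out(h)                # residual add -> CLR frame

    net = SkipMappingVAE()
    hab, left, right = torch.randn(8, 12), torch.randn(8, 60), torch.randn(8, 60)
    clr_pred = net(hab, left, right)                      # predicted CLR VAE-12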
  • 32. Speech Intelligibility Evaluation
    System                  CSM7   PDF7   PDM6
    Reconstructed HAB       38     13     24
    DNN-mapping VAE         32     13     35
    Skip-mapping VAE        38     11     46*
    CLR spectrum-hybrid     56*    27*    50*
    Reconstructed CLR       69*    23*    41*
    Table: Average keyword accuracy (%). Results marked with an asterisk are significantly different.
    CLR spectrum-hybrid is HAB speech plus the CLR spectrum; it is the gold standard for spectrum mapping
    Conducted a keyword recall test with 30 participants
    The skip-mapping VAE increased the intelligibility of HAB speech from 24% to 46% for PDM6 (a male with Parkinson's disease)
    This shows the potential of manifold features, but the DNN mapping might be too simplistic
  • 33. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
      Conditional Generative Adversarial Nets: Background
      One-to-One Mapping
      Many-to-One Mappings
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
    6 Conclusion
  • 34. Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    Improve the HAB-to-CLR spectral mapping for style conversion
    Utilize conditional Generative Adversarial Nets (cGANs) to map the spectral features of habitual speech to those of clear speech
    Investigate cGANs in three spectral style-conversion settings:
    1 one-to-one mappings
    2 many-to-one mappings
    3 many-to-many mappings (only in the dissertation)
  • 35. Generative Adversarial Nets
    A GAN has a Generator (G) and a Discriminator (D) [Goodfellow14]
    G generates samples (e.g., images) and D decides whether they are generated or real
    As either gets better, so does the other
    D is only used during training
    Applications: data augmentation, face aging, super-resolution
  • 36. cGANs for Style Conversion
    Figure: cGAN framework. G takes the HAB VAE-12 frame with left and right context and generates a CLR VAE-12 frame; D judges "real or generated?"
    A cGAN is a GAN conditioned on auxiliary data
    G takes the HAB spectrum as input and generates a CLR spectrum
    D discriminates between generated and real CLR spectra
    The real CLR and HAB spectra come from the same sentence and speaker; the real CLR spectrum is time-warped to the HAB spectrum
    D is conditioned on the HAB spectrum, so it learns whether a generated CLR spectrum is a good transformation of that HAB spectrum
    By including D, we learn a better loss function for G (see the sketch below)
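    A minimal sketch of one cGAN training step under this conditioning, assuming PyTorch; the layer sizes, optimizer settings, and plain adversarial loss are illustrative (in practice the generator loss is often combined with a regression term):

    import torch
    import torch.nn as nn

    # Hypothetical G and D; G maps a HAB frame plus context (132 dims, as in
    # the skip-mapping input) to a CLR VAE-12 frame, D scores (HAB, CLR) pairs
    G = nn.Sequential(nn.Linear(132, 512), nn.ReLU(), nn.Linear(512, 12))
    D = nn.Sequential(nn.Linear(132 + 12, 512), nn.ReLU(), nn.Linear(512, 1))
    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

    def train_step(hab, clr):
        """One adversarial update; hab: (B, 132), clr: (B, 12)."""
        fake = G(hab)
        # Discriminator: real (HAB, CLR) pairs vs generated pairs
        d_real = D(torch.cat([hab, clr], -1))
        d_fake = D(torch.cat([hab, fake.detach()], -1))
        loss_d = bce(d_real, torch.ones_like(d_real)) + \
                 bce(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # Generator: fool D, conditioned on the same HAB input
        loss_g = bce(D(torch.cat([hab, fake], -1)), torch.ones_like(d_fake))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_d.item(), loss_g.item()

    loss_d, loss_g = train_step(torch.randn(16, 132), torch.randn(16, 12))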
  • 38. One-to-One Mapping
    The goal is to improve the style-conversion performance of the previous section
    Train a cGAN for each speaker to map the HAB spectrum to the CLR spectrum
    In conversion, apply the speaker-specific mapping to the same speaker
    The output is habitual speech with a modified spectrum
  • 39. Objective Evaluation: Log Spectral Distortion (dB)
    Log spectral distortion is the root-mean-square difference between the converted spectrum and the target CLR spectrum
    Mapping                  PDF7    PDM6    CSM7
    DNN (previous section)   16.80   16.67   16.44
    GAN                      12.85   12.58   12.67
    The GAN has lower log spectral distortion than the DNN
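    On log-magnitude spectra the metric can be computed as in the sketch below; the exact averaging convention (per-frame versus global RMS) is an assumption:

    import numpy as np

    def log_spectral_distortion(conv_spec, tgt_spec, eps=1e-10):
        """RMS difference between two log-magnitude spectra in dB.
        Inputs are magnitude spectrograms of shape (frames, bins)."""
        diff = 20 * np.log10(conv_spec + eps) - 20 * np.log10(tgt_spec + eps)
        per_frame = np.sqrt(np.mean(diff ** 2, axis=1))  # dB per frame
        return per_frame.mean()                          # average over frames

    conv = np.abs(np.random.randn(100, 513)) + 1e-3  # stand-in spectrograms
    tgt = np.abs(np.random.randn(100, 513)) + 1e-3
    print(f"LSD = {log_spectral_distortion(conv, tgt):.2f} dB")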
  • 40. Examples of spectrograms
    Note the difference in formants between 2 and 4 kHz in the red box
  • 41. Subjective Evaluation
    Log spectral distortion is only a rough predictor of human perception
    Conduct a keyword recall test with 60 participants, who listen to and transcribe 25 Harvard sentences (as in the previous experiments)
    Figure: Average keyword accuracy (%) of vocoded HAB, DNN, GAN, hybrid, and vocoded CLR for speakers CSM7, PDF7, and PDM6
    The cGAN outperforms the DNN
    The cGAN significantly increases intelligibility for two speakers (one typical and one with Parkinson's disease)
  • 42. Many-to-One Mappings
    One-to-one mappings have a disadvantage: they require speaker-specific training data
    This makes them difficult to apply to new speakers in real-life applications
  • 43. Method
    Pick the two target speakers with the best sentence-level intelligibility: one male and one female, both of whom happen to be typical speakers
    Map the habitual speech of multiple speakers to the targets
    For training, use all speakers except the two targets and the test speakers PDM6, PDF7, and CSM7: 29 typical speakers, 30 with MS, and 14 with Parkinson's disease
    In conversion, apply the mapping to unseen speakers
  • 44. Subjective Evaluation
    Conduct a keyword recall test with 44 participants
    Figure: Keyword recall accuracy (%) of vocoded HAB, GAN, hybrid, and vocoded CLR for speakers CSM7, PDF7, and PDM6. The dashed lines show statistically significant differences.
    Many-to-one mapping increases intelligibility for one speaker (a person with Parkinson's disease)
    Promising, but not as good as one-to-one mapping
  • 45. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
      Data
      Predicting Voicing or Degree of Voicing
      Predicting Spectrum
      Synthesizing Pitch
      Subjective Evaluation
    6 Conclusion
  • 46. Alaryngeal Speech
    People who undergo a total laryngectomy lose their ability to produce speech sounds normally
    Their speech options are esophageal speech, tracheo-esophageal puncture (TEP), and electrolarynx (ELX); all are difficult to understand due to:
      poor voice quality
      no voiced/unvoiced differentiation
      lack of articulatory precision
      no F0
    Alaryngeal speech is more distorted than the speech of someone with mild Parkinson's disease
    There is no clear-speech style for LAR speakers
  • 47. Flowchart of the Proposed Method
    Figure: LAR speech → WORLD vocoder analysis → LAR spectra → LAR MCEP; the VUV, AP, and MCEP models map LAR MCEP to INT VUV, INT AP, and INT MCEP; INT MCEP → INT spectra; LAR energy → pitch-accent-curve synthesis → INT F0; WORLD vocoder synthesis → INT speech
    Propose an approach for transforming alaryngeal speech (LAR) into intelligible speech (INT):
    1 Predict INT binary voicing/unvoicing and the degree of voicing (aperiodicity) from the LAR spectrum using DNNs (VUV model and AP model)
    2 Predict the INT spectrum from the LAR spectrum using cGANs (MCEP model)
    3 Create a synthetic F0 from a simple intonation model (pitch-accent-curve synthesis)
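    Read end to end, the flowchart corresponds to roughly the following pipeline. This is a sketch only: it assumes the pyworld and pysptk packages for WORLD analysis and synthesis, an assumed MCEP order and warping coefficient, and callables (vuv_model, ap_model, mcep_model) standing in for the trained networks; for simplicity it also assumes ap_model returns full-band aperiodicity, whereas the thesis predicts 2-band values:

    import numpy as np
    import pyworld   # assumed WORLD vocoder bindings
    import pysptk    # assumed MCEP <-> spectrum conversions

    def lar_to_int(x, fs, vuv_model, ap_model, mcep_model, alpha=0.42):
        """Hypothetical LAR-to-INT pipeline following the flowchart."""
        # 1. WORLD analysis of LAR speech (the LAR F0 is unusable; only
        #    the spectral envelope and energy are kept)
        f0, t = pyworld.harvest(x, fs)
        sp = pyworld.cheaptrick(x, f0, t, fs)
        lar_mcep = pysptk.sp2mc(sp, order=24, alpha=alpha)   # spectra -> MCEP
        # 2. Predict INT voicing, aperiodicity, and spectrum from LAR MCEP
        vuv = vuv_model(lar_mcep)          # binary voicing per frame
        ap = ap_model(lar_mcep)            # degree of voicing (aperiodicity)
        int_mcep = mcep_model(lar_mcep)    # cGAN spectral mapping
        int_sp = pysptk.mc2sp(int_mcep, alpha, (sp.shape[1] - 1) * 2)
        # 3. Synthetic F0: falling phrase curve plus energy-driven accent
        #    (detailed under "Synthesizing Pitch" below)
        energy = sp.sum(axis=1)
        phrase = np.exp(np.linspace(np.log(140.0), np.log(60.0), len(f0)))
        accent = 30.0 * energy / (energy.max() + 1e-9)
        int_f0 = (phrase + accent) * vuv   # F0 is zero in unvoiced frames
        # 4. WORLD synthesis of INT speech
        return pyworld.synthesize(int_f0, int_sp, ap, fs)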
  • 48. Data
    For the source LAR speech, a database of 4 male speakers: 3 LAR-TEP speakers (L001, L002, L006) and 1 LAR-ELX speaker (L004)
    For the target INT speech, the ideal option is a natural voice, such as habitual or clear speech; we use a synthetic male voice for expediency and for the ability to create large amounts of data and arbitrary voices
    Each speaker (LAR and INT) has 132 sentences
    Use a random split of 100/16/16 sentences for training, validation, and testing
  • 49. Pre-training Data
    Due to the limited amount of LAR training data, we use pre-training to leverage general knowledge of speech
    Use the multi-speaker TIMIT database for pre-training
    Can we build a pre-training set that better matches LAR speech?
      Simulate LAR-TEP speech by creating a fully unvoiced version of TIMIT (FU-TIMIT)
      Simulate LAR-ELX speech by creating a fully voiced version of TIMIT (FV-TIMIT)
    Use the standard TIMIT split of 462/144/24 speakers for training, validation, and testing
  • 50. Predicting Voicing and Degree of Voicing
    Propose a method for predicting when speech should be voiced, and the degree of voicing, from a spectrogram
    Predict a binary voicing value (VUV) and continuous 2-band aperiodicity (AP) values from mel-cepstral coefficients (MCEP) using deep neural networks (DNNs)
    Pre-train three kinds of speaker-independent DNNs using either TIMIT, FU-TIMIT, or FV-TIMIT as training data
    For each utterance in the training data, use the VUV and AP of the corresponding TIMIT utterance as the target
  • 51. Evaluating Pre-trained Models on Their Test Data
    For testing, apply the three pre-trained models (TIMIT, FU-TIMIT, and FV-TIMIT) to their corresponding test data
    Use balanced accuracy (BAC, defined as average recall) for VUV classification, since the classes are imbalanced, and r2 for AP regression (see the metric sketch below)
    Mapping             TIMIT         FU-TIMIT      FV-TIMIT
    TIMIT → TIMIT       0.99 (0.87)
    FU-TIMIT → TIMIT                  0.89 (0.72)
    FV-TIMIT → TIMIT                                0.93 (0.84)
    Table: BAC, with r2 in brackets; closer to 1 is better
    As expected, the TIMIT model works best because its training data contains the voicing we want to predict
    FU-TIMIT and FV-TIMIT also work well: it is possible to predict voicing from spectral shape alone
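    Both metrics are available in scikit-learn; a small sketch with made-up per-frame predictions (the arrays are illustrative, not thesis data):

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score, r2_score

    vuv_true = np.array([1, 1, 0, 0, 1, 0, 1, 1])   # hypothetical voicing labels
    vuv_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1])
    ap_true = np.random.rand(8, 2)                   # 2-band aperiodicity targets
    ap_pred = ap_true + 0.05 * np.random.randn(8, 2)

    # Balanced accuracy averages recall over the two classes, compensating
    # for voiced frames outnumbering unvoiced ones
    bac = balanced_accuracy_score(vuv_true, vuv_pred)
    r2 = r2_score(ap_true, ap_pred)                  # 1.0 is perfect regression
    print(f"BAC = {bac:.2f}, r2 = {r2:.2f}")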
  • 52. Evaluating Pre-trained Models on LAR Data
    Test the pre-trained models, without adaptation, on predicting the target INT VUV and AP from LAR-TEP and LAR-ELX speech
    Mapping             TIMIT          FU-TIMIT       FV-TIMIT
    L001 (TEP) → INT    0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
    L002 (TEP) → INT    0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
    L004 (ELX) → INT    0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
    L006 (TEP) → INT    0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
    Table: BAC, with r2 in brackets
    Our expectation was that matching the pre-training set to the source speaker (FU-TIMIT for TEP, FV-TIMIT for ELX) would work best
    Although the results do not entirely match that expectation, we still need to adapt the models with LAR speech
  • 53. Adapting Pre-trained Models on LAR Data
    Adapt the pre-trained models with LAR-TEP and LAR-ELX speech
    Use speaker-specific adaptation, similar to one-to-one mapping, due to the limited number of speakers
    Adapt all weights of the DNN models
  • 54. Evaluating Adapted Models
    Mapping             TIMIT          FU-TIMIT       FV-TIMIT
    Before adaptation
    L001 (TEP) → INT    0.64 (−0.51)   0.60 (−0.17)   0.58 (−0.58)
    L002 (TEP) → INT    0.56 (−0.70)   0.67 (0.02)    0.55 (−0.70)
    L004 (ELX) → INT    0.63 (−0.44)   0.49 (−1.00)   0.48 (−0.28)
    L006 (TEP) → INT    0.53 (−0.84)   0.48 (−0.50)   0.55 (−0.84)
    After adaptation
    L001 (TEP) → INT    0.70 (0.22)    0.67 (0.21)    0.72 (0.23)
    L002 (TEP) → INT    0.73 (0.43)    0.75 (0.43)    0.73 (0.43)
    L004 (ELX) → INT    0.72 (0.29)    0.71 (0.27)    0.70 (0.29)
    L006 (TEP) → INT    0.65 (0.04)    0.67 (0.05)    0.64 (0.05)
    Table: BAC, with r2 in brackets; higher is better
    Adaptation always increases performance
    Pre-training with FU- and FV-TIMIT, as opposed to TIMIT, did not work as expected
  • 55. cGANs for Predicting Spectrum
    Figure: cGAN framework. G takes the LAR MCEP frame with left and right context and generates an INT MCEP frame; D judges "real or generated?"
    We use the same cGAN structure to generate the INT spectrum from the LAR spectrum
  • 57. Evaluating Pre-trained Models
    Pre-train the models due to the limited amount of LAR data
    Mapping             Before   FU-TIMIT   FV-TIMIT
    FU-TIMIT → TIMIT    11.3     7.64
    FV-TIMIT → TIMIT    11.0                6.46
    L001 (TEP) → INT    60.6     60.0       61.9
    L002 (TEP) → INT    46.0     45.0       46.5
    L004 (ELX) → INT    51.5     51.1       52.8
    L006 (TEP) → INT    61.2     61.6       63.0
    Table: Log spectral distortion (dB) before and after pre-training
    Predicting the TIMIT spectrum from the FU- and FV-TIMIT spectrum reduces log spectral distortion from 11.3 to 7.64 dB and from 11.0 to 6.46 dB, respectively
    Applying the pre-trained models to predict the INT spectrum from the LAR spectrum gives no noticeable reduction in distortion
    This lack of improvement is disappointing but not unexpected, as FU-TIMIT and FV-TIMIT know nothing about LAR speech
  • 58. Adapting Pre-trained Models on LAR Speech
    Adapt the pre-trained models on LAR speech
    Mapping             FU-TIMIT      FV-TIMIT
    L001 (TEP) → INT    32 (60.0)     32 (61.9)
    L002 (TEP) → INT    33 (45.0)     33 (46.5)
    L004 (ELX) → INT    31.5 (51.1)   32 (52.8)
    L006 (TEP) → INT    37.8 (61.6)   37 (63.0)
    Table: Log spectral distortion (dB) after adaptation, with pre-adaptation values in brackets
    As expected, adaptation always improved performance
    Pre-training with FU-TIMIT versus FV-TIMIT has no noticeable effect on adaptation
  • 59. Synthesizing Pitch
    F0 is not present in LAR speech
    Use a phrase curve and a single accent curve to model the intonation of each utterance:
      the phrase curve is a logarithmic falling curve from 140 to 60 Hz
      the accent curve is linearly proportional to the LAR energy
    Figure: Example synthetic F0 contour (Hz over frames)
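    A minimal sketch of this intonation model; the accent gain (the maximum excursion in Hz) is an assumed constant, since the slide only states proportionality:

    import numpy as np

    def synth_f0(energy, top_hz=140.0, bottom_hz=60.0, max_accent_hz=30.0):
        """Synthetic F0: log-domain falling phrase curve plus an accent
        curve proportional to per-frame energy."""
        n = len(energy)
        phrase = np.exp(np.linspace(np.log(top_hz), np.log(bottom_hz), n))
        accent = max_accent_hz * energy / (energy.max() + 1e-9)
        return phrase + accent  # Hz, one value per frame

    energy = np.abs(np.random.randn(800))  # stand-in for LAR frame energy
    f0 = synth_f0(energy)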
  • 60. Overall Results
    Conduct perceptual naturalness and intelligibility CMOS tests
    Each participant listened to a pair of sentences A and B, consisting of modified speech versus the LAR speech, and answered "Is A more natural/intelligible than B?" on a 5-point scale: "definitely worse" (−2), "worse" (−1), "same" (0), "better" (+1), "definitely better" (+2)
    There were 48 participants in each CMOS test
    For the LAR speech, we analyzed and re-synthesized it (using WORLD) to make a fair comparison
  • 61. Intelligibility
    INT-spectrum: LAR speech plus the predicted spectrum
    INT-intonation: LAR speech plus the predicted voicing and F0
    INT-all: LAR speech plus the predicted spectrum, voicing, and F0
    Speakers          INT-spectrum   INT-intonation   INT-all
    L001 (TEP)        −0.1           −0.1             0.1
    L002 (TEP)        0.1            0.2              −0.3*
    L004 (ELX)        −0.34*         0.34*            −0.2
    L006 (TEP)        0.2            −0.1             −0.0
    INT-intonation significantly increased intelligibility for L004
    INT-all did not increase intelligibility; we did not observe an increase in overall intelligibility
  • 62. Naturalness
    Speakers          INT-spectrum   INT-intonation   INT-all
    L001 (TEP)        −0.0           −0.3*            0.4*
    L002 (TEP)        −0.1           −0.0             0.1
    L004 (ELX)        −0.56*         −0.25            0.22
    L006 (TEP)        −0.3*          −0.2*            0.7*
    INT-all increased naturalness for all 4 speakers, but the increase was significant only for L001 and L006
    However, when testing the individual components (e.g., the spectrum alone), there was no improvement
  • 63. Table of Contents
    1 Introduction
    2 Background
    3 Spectral Features for Style Conversion
    4 Spectral Mapping for Style Conversion of Typical and Dysarthric Speech
    5 Voice Conversion and F0 Synthesis of Alaryngeal Speech
    6 Conclusion
  • 64. Conclusion
    Aim 1: Determine effective spectral features for style conversion
    Proposed two sets of features: PPT and manifold features (VAE-12)
    VAE-12 is better than MCEP-12 and PPT in speech reconstruction
    VAE-12 in combination with DNNs significantly increased intelligibility for one speaker with Parkinson's disease, from 24% to 46%
  • 65. Conclusion
    Aim 2: Develop an effective HAB-to-CLR style mapping
    Proposed a spectral style mapping using cGANs for improving speech intelligibility
    For one-to-one mapping, cGANs outperform the DNN and significantly increase intelligibility for 2 speakers (a typical speaker and one with Parkinson's disease)
    For many-to-one mapping, cGANs significantly increase intelligibility for a speaker with Parkinson's disease
  • 66. Conclusion
    Aim 3: Develop effective methods for LAR-to-INT conversion
    Proposed a method to predict binary voicing/unvoicing and the degree of voicing (aperiodicity) from LAR MCEP using DNNs
    Proposed a method to predict the INT spectrum from the LAR spectrum using cGANs
    Proposed a method to create a synthetic fundamental frequency trajectory from a simple intonation model
    INT-intonation significantly increased intelligibility for 1 speaker
    INT-all significantly increased naturalness for 2 speakers
  • 67. Thanks for your attention 67/67