Introduction
Experiment: One-to-one mapping
Experiment: Many-to-one mapping
Experiment: Many-to-many mapping
Conclusion
Improving Speech Intelligibility through Speaker
Dependent and Independent Spectral Style
Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
Oregon Health & Science University, University at Buffalo
October 23, 2020
Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion
Background
Hybridization
Style Conversion
Background
Approximately 28 × 10^6 people in the United States have some
degree of hearing loss
Speakers naturally adopt a special clear speaking style when
talking to
listeners with hearing loss
normal-hearing listeners in adverse environments
Clear speech features
high degree of articulation
slower speaking rate
more frequent and longer pauses
exact strategy varies from speaker to speaker
Clear speech is more intelligible than habitual speech
14 to 24% improvement in keyword recall in noise [Kain08]
Hybridization
Figure: Hybridization Algorithm Flowchart
Hybridization
Replacing certain acoustic features of habitual speech with
those from clear speech improves intelligibility
for typical speakers [Kain08], incorporating
clear spectrum and duration yielded 24% improvement
for dysarthric speakers [Tjaden14], incorporating
clear energy yielded 8.7% improvement
clear spectrum yielded 18% improvement
clear spectrum and duration yielded 13.4% improvement
Style Conversion
Style conversion converts one speaking style into another
Previously
mapping habitual (HAB) to clear (CLR) VAE-12 features improved
intelligibility for one speaker from 24% to 46%
[Dinh19]
Parameters generated by DNN mapping can be
over-smoothed
Generative adversarial nets (GANs) are a promising
approach to address over-smoothing
Style Conversion
Aim
To further increase intelligibility automatically by style conversion,
through the use of conditional GANs (cGANs)
Experiments showing the efficacy of cGANs in terms of speech
intelligibility when performing
1 speaker-dependent one-to-one mapping
2 speaker-independent many-to-one mapping
3 speaker-independent many-to-many mapping
GANs
Traditional GAN has 2 components: a Generator (G) and a
Discriminator (D) that play a min-max game [Goodfellow14]
Figure: GANs
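For reference, the min-max game mentioned above is usually written as the value function below, following [Goodfellow14]. Here x is a real sample, z is the generator input, D(·) is the discriminator's probability that its input is real, and G(z) is the generated sample. This is the standard unconditional form, not the conditional variant proposed in these slides.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```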
Proposed cGANs for style conversion
Figure: cGAN framework for style conversion (G takes the HAB VAE frame with its left and right context and outputs a mapped VAE frame; D judges whether a (HAB VAE, mapped VAE) pair or a (HAB VAE, CLR VAE) pair is real)
Proposed Generator
Figure: Generator architecture (inputs: current HAB VAE, 12 dims; left context, 60 dims; right context, 60 dims; layers in order: Concat, Dense 512, Dense 512, Concat, Dense 512, Dense 512, Linear 12, Add; output: current CLR VAE, 12 dims)
No random noise z
The component G learns the differences between HAB VAE-12
and CLR VAE-12
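The generator described above can be sketched as a plain-Python forward pass. The layer sizes, the leaky ReLU slope of 0.2 and the weight standard deviation of 0.02 come from the slides. Two things are assumptions, because the figure's wiring is not recoverable from the text: that the second Concat re-joins the raw input with the hidden activations, and that Add sums the linear output with the current HAB frame. All function names are hypothetical.

```python
import random

def leaky_relu(v, slope=0.2):
    # Leaky ReLU with the negative slope given in the slides.
    return [x if x > 0 else slope * x for x in v]

def init_dense(n_in, n_out, rng, std=0.02):
    # Zero-centered normal weights (std 0.02), zero biases.
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(v, layer):
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_generator(rng, hidden=512):
    in_dim = 60 + 12 + 60  # left context + current frame + right context
    return {
        "h1": init_dense(in_dim, hidden, rng),
        "h2": init_dense(hidden, hidden, rng),
        # Assumption: the second Concat joins hidden activations with the input.
        "h3": init_dense(hidden + in_dim, hidden, rng),
        "h4": init_dense(hidden, hidden, rng),
        "out": init_dense(hidden, 12, rng),
    }

def generator_forward(params, left, current, right):
    x = left + current + right                      # first Concat
    h = leaky_relu(dense(x, params["h1"]))
    h = leaky_relu(dense(h, params["h2"]))
    h = leaky_relu(dense(h + x, params["h3"]))      # second Concat (assumed wiring)
    h = leaky_relu(dense(h, params["h4"]))
    delta = dense(h, params["out"])                 # Linear 12: predicted difference
    # Assumption: Add sums the predicted difference with the current HAB frame.
    return [c + d for c, d in zip(current, delta)]
```

With this residual wiring, a generator whose last layer outputs zeros returns the HAB frame unchanged, which is one way to read "G learns the differences".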
Proposed Discriminator
The discriminator has 2 hidden layers of 256 nodes and an output layer
of 1 node with a sigmoid function
In addition to the adversarial loss, we use a mean-absolute difference
loss between G(z) and aligned real data x
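The combined objective above can be sketched as follows. Binary cross-entropy is used here as the adversarial term, which is the usual choice with a sigmoid output but is an assumption about this work. The weight `l1_weight` on the mean-absolute difference term is hypothetical, since the slides do not give one.

```python
import math

def bce(prob, label, eps=1e-7):
    # Binary cross-entropy for one sigmoid output, clipped for stability.
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def mean_abs_diff(mapped, target):
    # Mean-absolute difference between a mapped frame and the aligned real frame.
    return sum(abs(m - t) for m, t in zip(mapped, target)) / len(mapped)

def generator_loss(d_prob_fake, mapped, target, l1_weight=1.0):
    # G wants D to label the mapped pair as real (label 1), plus the L1 term.
    return bce(d_prob_fake, 1.0) + l1_weight * mean_abs_diff(mapped, target)

def discriminator_loss(d_prob_real, d_prob_fake):
    # D wants real pairs labelled 1 and mapped pairs labelled 0.
    return bce(d_prob_real, 1.0) + bce(d_prob_fake, 0.0)
```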
Tips and Tricks to Train cGANs
a leaky ReLU activation function with a negative slope of 0.2
for both G and D
a dropout layer following each hidden layer of D with a
dropout rate of 0.5
the Adam optimizer:
for D: learning rate 0.0001, momentum β1 0.5, learning rate decay 0.00001
for G: learning rate 0.0002, momentum β1 0.5, learning rate decay 0.00001
weights initialized from a zero-centered Normal distribution
with standard deviation 0.02
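The optimizer settings above can be collected into a small configuration, sketched here. The slides do not say which decay rule was used. The inverse-time rule below, lr / (1 + decay × step), is one common convention (it is what legacy Keras optimizers apply) and is an assumption. The names `ADAM_CONFIG` and `decayed_lr` are hypothetical.

```python
# Values taken from the slide; the dictionary layout is illustrative only.
ADAM_CONFIG = {
    "D": {"lr": 1e-4, "beta1": 0.5, "decay": 1e-5},
    "G": {"lr": 2e-4, "beta1": 0.5, "decay": 1e-5},
}

def decayed_lr(base_lr, decay, step):
    # Assumed inverse-time decay: the rate shrinks as update steps accumulate.
    return base_lr / (1.0 + decay * step)

# Learning rate of each network after 100,000 updates under this assumption.
lr_after = {
    name: decayed_lr(cfg["lr"], cfg["decay"], 100_000)
    for name, cfg in ADAM_CONFIG.items()
}
```

Under this rule both rates would halve after 100,000 updates, so G would keep twice the learning rate of D throughout training.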
Method
Objective Evaluation
Subjective Evaluation
Experiment: One-to-one mapping
Train a speaker-dependent HAB-to-CLR mapping:
Requires parallel data of HAB and CLR speech
Database: a 78-speaker database consisting of
control speakers (CS, N = 32)
speakers with multiple sclerosis (MS, N = 30)
speakers with Parkinson's disease (PD, N = 16)
Each speaker read 25 Harvard sentences in 2 speaking styles
(HAB, CLR)
Selected three speakers (PDM6, CSM7, PDF7) that showed the
most benefit from the CLR spectrum
Method
Figure: cGANs-based mapping (HAB VAE-12 → style mapping → CLR VAE-12)
We aligned each HAB utterance to its parallel CLR utterance
of the same speaker using DTW on 32nd-order log filter-bank
features.
Then, we pre-trained the generator that maps HAB VAE-12 to
CLR VAE-12 by minimizing a mean-squared-error loss
Then, we trained our proposed cGAN structure
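The alignment step above can be illustrated with a textbook dynamic time warping (DTW) routine. This is a minimal sketch that assumes Euclidean frame distance and the basic three-way step pattern. The slides do not state the distance or step constraints actually used, and `dtw_path` is a hypothetical name.

```python
def dtw_path(a, b):
    """Return the list of (i, j) frame pairs aligning sequence a to sequence b."""
    def dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + best_prev

    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path
```

Each returned pair (i, j) says that HAB frame i is matched to CLR frame j, so the HAB and CLR VAE-12 frames at those indices can serve as one training pair.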
Objective Evaluation: Log Spectral Distortion
Mapping   PD_F7   PD_M6   C_M7
DNN       16.8    16.67   16.44
GAN       12.85   12.58   12.67
Table: Average LSD (in dB) per speaker
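The slides do not define log spectral distortion (LSD). A commonly used definition is the root-mean-square difference between two log-magnitude spectra in dB, averaged over frames, and the sketch below assumes that definition. The function names and the `eps` floor are hypothetical.

```python
import math

def frame_lsd(mag_ref, mag_est, eps=1e-12):
    # RMS difference of the two log-magnitude spectra (in dB) for one frame.
    diffs = [
        20.0 * math.log10(max(r, eps)) - 20.0 * math.log10(max(e, eps))
        for r, e in zip(mag_ref, mag_est)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def average_lsd(ref_frames, est_frames):
    # Mean of the per-frame LSD over time-aligned frames.
    values = [frame_lsd(r, e) for r, e in zip(ref_frames, est_frames)]
    return sum(values) / len(values)
```

Lower values mean the mapped spectrum is closer to the reference CLR spectrum.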
Objective Evaluation: LSD
Figure: LSD of 25 test sentences for 3 speakers; GAN vs DNN (panels: speakers PD_F7, PD_M6, C_M7; x-axis: sentence ID; y-axis: LSD in dB)
Objective Evaluation: Variance ratio
Figure: Variance ratio σ²_CLR / σ²_MAP between CLR VAE-12 (CLR) and mapped
VAE-12 (MAP), per VAE-12 component, for GAN and DNN (panels: speakers PD_F7, PD_M6, C_M7). Smaller is better.
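The variance ratio in the figure above can be computed per feature dimension as sketched below. The sketch assumes population variance over all frames of an utterance set, and `variance_ratio` is a hypothetical name. An over-smoothed mapping has less variance than real CLR speech, which pushes the ratio above 1.

```python
from statistics import pvariance

def variance_ratio(clr_frames, mapped_frames):
    # One ratio per feature dimension: var(CLR) / var(mapped).
    n_dims = len(clr_frames[0])
    ratios = []
    for d in range(n_dims):
        var_clr = pvariance([frame[d] for frame in clr_frames])
        var_map = pvariance([frame[d] for frame in mapped_frames])
        ratios.append(var_clr / var_map)
    return ratios
```

For VAE-12 features each frame has 12 components, so the function returns 12 ratios, one per point on the figure's x-axis.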
Objective Evaluation: Example
Figure: Sentence: Four hours of steady work faced us.
Subjective Evaluation
Loudness differences were minimized using the RMSA measure
Stimuli were mixed with babble noise at 0 dB SNR
The test consisted of 25 sentences × 3 speakers × 5 conditions
(2 purely vocoded, 1 hybrid, 2 mappings) = 375 unique trials
60 participants on Amazon Mechanical Turk (AMT) each listened to 25 sentences and
typed what they heard
We manually counted the correct keywords in each sentence
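Mixing a stimulus with noise at a target SNR, as in the 0 dB babble condition above, amounts to scaling the noise so that the speech-to-noise power ratio matches the target. The sketch below assumes plain mean-square power over the whole signal. The slides do not describe the exact procedure, and the function names are hypothetical.

```python
import math

def power(signal):
    # Mean-square power of a sampled signal.
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db=0.0):
    # Choose the noise gain so that 10*log10(P_speech / P_scaled_noise) == snr_db.
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain
```

At 0 dB SNR the scaled noise carries the same power as the speech.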
Subjective Evaluation
Figure: Keyword recall accuracy (y-axis: average keyword accuracy, 0 to 100; conditions: vocoded HAB, DNN, GAN, hybrid, vocoded CLR; speakers: CSM7, PDF7, PDM6)
Experiment: Many-to-one mapping
Figure: cGANs-based mapping (HAB VAE-12 of speakers SPK-1, ..., SPK-i, ..., SPK-N → mapping → Best CLR VAE-12)
Maps HAB speech from many speakers to CLR speech of a
target speaker
Subjective Evaluation
Loudness differences were minimized using the RMSA measure.
Stimuli were mixed with babble noise at 0 dB SNR
The test consisted of 25 sentences × 3 source speakers (CSM7,
PDF7, PDM6) × 3 conditions (vocoded HAB, cGAN mapping,
hybrid) + 25 sentences × 2 target speakers (CSM10, CSF15)
× 1 condition (vocoded CLR) = 275 unique trials
44 participants on AMT
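The trial count above is a sum of two blocks, each a product of sentences, speakers and conditions. A short check of that arithmetic, with a hypothetical helper name:

```python
def trial_count(blocks):
    # Each block is (sentences, speakers, conditions).
    return sum(sentences * speakers * conditions
               for sentences, speakers, conditions in blocks)

# Source-speaker block plus target-speaker block of the many-to-one test.
many_to_one = trial_count([(25, 3, 3), (25, 2, 1)])
```

The first block contributes 225 trials and the second 50, which gives the 275 stated on the slide.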
Subjective Evaluation
Figure: Keyword recall accuracy (y-axis: average keyword accuracy, 0 to 100; conditions: vocoded HAB, GAN, hybrid, vocoded CLR; speakers: CSM7, PDF7, PDM6)
Experiment: Many-to-many mapping
Figure: cGANs-based mapping (for each speaker SPK-1, ..., SPK-i, ..., SPK-N: HAB VAE-12 → style mapping → CLR VAE-12 of the same speaker)
Learn the style differences, preserve speaker identities
Subjective Evaluation
The test consisted of 25 sentences × 3 speakers × 4 conditions
(vocoded HAB, GAN, hybrid, vocoded CLR) = 300 unique
trials
24 listeners participated
Loudness differences were minimized using the RMSA measure.
Stimuli were mixed with babble noise at 0 dB SNR
Subjective Evaluation
Condition     CSM7   PDF7   PDM6
vocoded HAB   36.8   10     28.8
GAN           39.6   15.6   26.8
hybrid        62     22.8   57.6
vocoded CLR   66.8   22.4   48
Table: Average keyword accuracy (%)
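To read the table above as gains, the GAN row can be compared with the vocoded HAB baseline for each speaker. The snippet below is plain arithmetic on the table values, with illustrative variable names.

```python
# Values copied from the "vocoded HAB" and "GAN" rows of the table.
accuracy = {
    "vocoded HAB": {"CSM7": 36.8, "PDF7": 10.0, "PDM6": 28.8},
    "GAN": {"CSM7": 39.6, "PDF7": 15.6, "PDM6": 26.8},
}

# GAN gain over vocoded HAB, rounded to one decimal place.
gains = {
    speaker: round(accuracy["GAN"][speaker] - accuracy["vocoded HAB"][speaker], 1)
    for speaker in accuracy["GAN"]
}
```

The differences are positive for CSM7 and PDF7 and negative for PDM6.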
Conclusion
Applied cGANs to HAB-to-CLR style conversion
1 In speaker-dependent one-to-one mapping, cGANs outperformed
the DNN in terms of keyword recall accuracy and improved the
intelligibility of two of three speakers
2 In speaker-independent many-to-one mapping, cGANs improved
the speech intelligibility of one of three speakers
3 In speaker-independent many-to-many mapping, cGANs improved
the keyword recall accuracy of two speakers, but the
results are not significant
The modest results of speaker-independent style conversion are
due to the small dataset and to the fact that we did not attempt to
transform additional acoustic features, such as phoneme
durations
Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion

More Related Content

Similar to Improving speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion

Iterative usability evaluation of DSLs
Iterative usability evaluation of DSLsIterative usability evaluation of DSLs
Iterative usability evaluation of DSLs
Ankica Barisic
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
Universitat Politècnica de Catalunya
 
Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...
Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...
Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...
Association for Computational Linguistics
 
SEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial NetworkSEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial Network
Universitat Politècnica de Catalunya
 
Classics 2011
Classics 2011Classics 2011
Classics 2011
goodbeem
 
About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...
About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...
About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...
Giovanni Murru
 
PR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention
PR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal AttentionPR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention
PR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention
Taesu Kim
 
Ihdels presentation
Ihdels presentationIhdels presentation
Ihdels presentation
Daniel Molina Cabrera
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
Arnab Bhadury
 
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Jinho Choi
 
Evolution of specialist vs. generalist strategies in a continuous environment
Evolution of specialist vs. generalist strategies in a continuous environmentEvolution of specialist vs. generalist strategies in a continuous environment
Evolution of specialist vs. generalist strategies in a continuous environment
Florence (Flo) Debarre
 
[KCC 2020] 군집 기반 색상 팔레트 비교
[KCC 2020] 군집 기반 색상 팔레트 비교[KCC 2020] 군집 기반 색상 팔레트 비교
[KCC 2020] 군집 기반 색상 팔레트 비교
Suzi Kim
 
Deep Learning | Speaker Indentification
Deep Learning | Speaker IndentificationDeep Learning | Speaker Indentification
Deep Learning | Speaker Indentification
Sai Kiran Kadam
 

Similar to Improving speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion (13)

Iterative usability evaluation of DSLs
Iterative usability evaluation of DSLsIterative usability evaluation of DSLs
Iterative usability evaluation of DSLs
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
 
Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...
Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...
Florian Boudin - 2015 - Reducing Over-generation Errors for Automatic Keyphra...
 
SEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial NetworkSEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial Network
 
Classics 2011
Classics 2011Classics 2011
Classics 2011
 
About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...
About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...
About the paper: Graph Connectivity Measures for Unsupervised Word Sense Disa...
 
PR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention
PR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal AttentionPR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention
PR12-179 M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention
 
Ihdels presentation
Ihdels presentationIhdels presentation
Ihdels presentation
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
 
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
 
Evolution of specialist vs. generalist strategies in a continuous environment
Evolution of specialist vs. generalist strategies in a continuous environmentEvolution of specialist vs. generalist strategies in a continuous environment
Evolution of specialist vs. generalist strategies in a continuous environment
 
[KCC 2020] 군집 기반 색상 팔레트 비교
[KCC 2020] 군집 기반 색상 팔레트 비교[KCC 2020] 군집 기반 색상 팔레트 비교
[KCC 2020] 군집 기반 색상 팔레트 비교
 
Deep Learning | Speaker Indentification
Deep Learning | Speaker IndentificationDeep Learning | Speaker Indentification
Deep Learning | Speaker Indentification
 

Recently uploaded

Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
ScyllaDB
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
Fwdays
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 

Recently uploaded (20)

Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 

Improving speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion

  • 1. 1/26 Introduction Experiment: One-to-one mapping Experiment: Many-to-one mapping Experiment: Many-to-many mapping Conclusion Improving Speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion Tuan Dinh, Alexander Kain, Kris Tjaden Oregon Health & Science University, University at Bualo October 23, 2020 Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion
  • 2. 2/26 Introduction Experiment: One-to-one mapping Experiment: Many-to-one mapping Experiment: Many-to-many mapping Conclusion Background Hybridization Style Conversion Background Approximately 28 × 106 people in the United States have some degree of hearing loss Speakers naturally adopt a special clear speaking style when talking to listeners with hearing loss normal-hearing listeners in adverse environments Clear speech features high degree of articulation slower speaking rate more frequent and longer pauses exact strategy varies from speaker to speaker Clear speech is more intelligible than habitual speech 1424% improvement in keyword recall in noise [Kain08] Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion
  • 4. 4/26 Introduction Experiment: One-to-one mapping Experiment: Many-to-one mapping Experiment: Many-to-many mapping Conclusion Background Hybridization Style Conversion Hybridization Replacing certain acoustic features of habitual speech with those from clear speech cause improved intelligibility for typical speakers, incorporating [Kain08] clear spectrum and duration yielded 24% improvement for dysarthric speakers, incorporating [Tjaden14] clear energy yielded 8.7% improvement clear spectrum yielded 18% improvement clear spectrum and duration yielded 13.4% improvement Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion
  • 5. 5/26 Introduction Experiment: One-to-one mapping Experiment: Many-to-one mapping Experiment: Many-to-many mapping Conclusion Background Hybridization Style Conversion Style Conversion Style conversion converts speaking style Previously mapping habitual (HAB) to clear (CLR) VAE-12 resulted in improvement of intelligibility for one speaker from 24% to 46% [Dinh19] Generated parameters from DNN-mapping can be over-smoothing Generative adversarial nets (GANs) can be a promising approach to address over-smoothness Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion
  • 6. 6/26 Introduction Experiment: One-to-one mapping Experiment: Many-to-one mapping Experiment: Many-to-many mapping Conclusion Background Hybridization Style Conversion Style Conversion Aim To further increase intelligibility automatically by style conversion, through the use of a conditional GANs (cGANs) Experiments showing ecacy of cGANs in terms of speech intelligibility when performing 1 speaker dependent one-to-one mapping 2 speaker independent many-to-one mapping 3 speaker independent many-to-many mapping Tuan Dinh, Alexander Kain, Kris Tjaden cGANs for Voice and Style Conversion
  • 7. GANs: A traditional GAN has two components, a Generator (G) and a Discriminator (D), that play a min-max game [Goodfellow14]. Figure: GANs
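For reference, the min-max game named on this slide is usually written as the value function from [Goodfellow14]. Here x is a real sample, z is the generator's noise input, and D outputs the probability that its input is real; the symbols p_data and p_z are the standard notation and do not appear on the slide itself.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```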
  • 8. Proposed cGANs for style conversion. Figure: cGAN framework for style conversion (G takes the HAB VAE features with left and right context and outputs mapped VAE features; D is asked "Real pairs?" for HAB VAE paired with CLR VAE and for HAB VAE paired with mapped VAE).
  • 9. Proposed Generator. Figure: Generator architecture (blocks: current HAB VAE, 12; left context, 60; right context, 60; Concat; Dense 512; Dense 512; Concat; Dense 512; Dense 512; Linear 12; Add; current CLR VAE, 12). No random noise z is used. The component G learns the differences between HAB VAE-12 and CLR VAE-12.
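A minimal sketch of one possible reading of the generator figure, written in plain Python to stay dependency-free (a real implementation would use a deep-learning framework). The wiring is an assumption: the figure only lists the blocks, so the choice to feed the two context vectors through the first pair of dense layers and to concatenate the current frame before the second pair is a guess. The class names `Dense` and `Generator` are hypothetical. The leaky ReLU slope (0.2) and the weight initialization (standard deviation 0.02) are taken from the training-tips slide.

```python
import random


def leaky_relu(vec, slope=0.2):
    """Element-wise leaky ReLU with the negative slope quoted in the deck."""
    return [x if x > 0 else slope * x for x in vec]


class Dense:
    """Fully connected layer: N(0, 0.02) weights, zero biases."""

    def __init__(self, n_in, n_out, rng):
        self.w = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, vec):
        return [sum(w * x for w, x in zip(row, vec)) + b
                for row, b in zip(self.w, self.b)]


class Generator:
    """ASSUMED wiring: context -> 2 dense; concat current -> 2 dense -> linear -> add."""

    def __init__(self, hidden=512, seed=0):
        rng = random.Random(seed)
        self.ctx1 = Dense(60 + 60, hidden, rng)   # left + right context
        self.ctx2 = Dense(hidden, hidden, rng)
        self.mix1 = Dense(hidden + 12, hidden, rng)  # context code + current frame
        self.mix2 = Dense(hidden, hidden, rng)
        self.out = Dense(hidden, 12, rng)         # linear output, no activation

    def __call__(self, current, left, right):
        h = leaky_relu(self.ctx1(left + right))
        h = leaky_relu(self.ctx2(h))
        h = leaky_relu(self.mix1(h + current))
        h = leaky_relu(self.mix2(h))
        delta = self.out(h)
        # Residual "Add": the network predicts the HAB-to-CLR difference.
        return [c + d for c, d in zip(current, delta)]


# Small hidden size here only to keep the demo fast; the deck uses 512.
generator = Generator(hidden=32)
mapped = generator([0.1] * 12, [0.0] * 60, [0.0] * 60)
```

The final addition is what lets G learn only the differences between HAB VAE-12 and CLR VAE-12: with all-zero output weights the sketch would return the HAB frame unchanged.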
  • 10. Proposed Discriminator: The discriminator has 2 hidden layers of 256 nodes and an output layer of 1 node with a sigmoid function. In addition to the adversarial loss, we use a mean-absolute-difference loss between G(z) and the aligned real data x.
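The combined generator objective described on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' code: the slide does not say how the two terms are weighted, so `l1_weight` is a hypothetical parameter, and the adversarial term is written as the usual binary cross-entropy against the "real" label.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy between a sigmoid output and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def mean_absolute_difference(mapped, target):
    """Mean absolute difference between mapped and aligned real features."""
    return sum(abs(m - t) for m, t in zip(mapped, target)) / len(target)


def generator_loss(d_prob_on_mapped, mapped, target, l1_weight=1.0):
    """Adversarial term (fool D) plus weighted mean-absolute-difference term.

    l1_weight is an ASSUMED hyperparameter; the deck does not state it.
    """
    adversarial = binary_cross_entropy(d_prob_on_mapped, 1.0)
    return adversarial + l1_weight * mean_absolute_difference(mapped, target)


# D is undecided (0.5) and the mapped frame is off by 1 and 3 in two dimensions.
loss = generator_loss(0.5, [0.0, 0.0], [1.0, 3.0])
```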
  • 11. Tips and Tricks to Train cGANs: Use a leaky ReLU activation function with a negative slope of 0.2 for both G and D; a dropout layer following each hidden layer of D with a dropout rate of 0.5; the Adam optimizer, with learning rate 0.0001, momentum β1 0.5, and learning rate decay 0.00001 for D, and learning rate 0.0002, momentum β1 0.5, and learning rate decay 0.00001 for G; weights initialized from a zero-centered Normal distribution with standard deviation 0.02.
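The settings on this slide can be collected into a small configuration, sketched here in plain Python. The numeric values come from the slide; the names (`AdamSettings`, `decayed_learning_rate`, and so on) are hypothetical. The slide does not say how the decay value is applied, so the inverse-time schedule below is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AdamSettings:
    learning_rate: float
    beta_1: float
    decay: float


D_OPTIMIZER = AdamSettings(learning_rate=1e-4, beta_1=0.5, decay=1e-5)
G_OPTIMIZER = AdamSettings(learning_rate=2e-4, beta_1=0.5, decay=1e-5)
LEAKY_SLOPE = 0.2      # used in both G and D
D_DROPOUT_RATE = 0.5   # after each hidden layer of D
INIT_STD = 0.02        # zero-centered Normal weight initialization


def leaky_relu(x, slope=LEAKY_SLOPE):
    return x if x > 0 else slope * x


def init_weights(n_in, n_out, rng):
    """Draw an n_out x n_in weight matrix from N(0, INIT_STD)."""
    return [[rng.gauss(0.0, INIT_STD) for _ in range(n_in)] for _ in range(n_out)]


def decayed_learning_rate(settings, step):
    """ASSUMPTION: inverse-time decay, lr / (1 + decay * step)."""
    return settings.learning_rate / (1.0 + settings.decay * step)


weights = init_weights(12, 256, random.Random(0))
lr_after_100k_steps = decayed_learning_rate(G_OPTIMIZER, 100_000)
```

Under the assumed schedule, the generator's learning rate would halve after 100,000 updates, since 1 + 0.00001 × 100000 = 2.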
  • 12. Experiment: One-to-one mapping. Training a speaker-dependent HAB-to-CLR mapping requires parallel data of HAB and CLR speech. Database: a 78-speaker database consisting of control speakers (CS, N = 32), speakers with multiple sclerosis (MS, N = 30), and speakers with Parkinson's disease (PD, N = 16). Each speaker read 25 Harvard sentences in 2 speaking styles (HAB, CLR). We selected three speakers (PDM6, CSM7, PDF7) that showed the most benefit from the CLR spectrum.
  • 13. Method. Figure: cGANs-based mapping (HAB VAE-12 → style mapping → CLR VAE-12). We aligned each HAB utterance to its parallel CLR utterance of the same speaker using DTW on 32nd-order log filter-bank features. Then, we pre-trained the generator that maps HAB VAE-12 to CLR VAE-12 to minimize a mean-squared-error loss function. Finally, we trained our proposed cGAN structure.
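The alignment step can be illustrated with a textbook dynamic time warping (DTW) routine. This is a minimal sketch, not the authors' implementation: the Euclidean frame distance and the three-way step pattern are assumptions, and the toy one-dimensional frames stand in for the 32nd-order log filter-bank features.

```python
def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_path(src, tgt):
    """Return the lowest-cost alignment as a list of (src_index, tgt_index)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Toy example: the CLR utterance holds the middle sound for one extra frame.
hab = [[0.0], [1.0], [2.0]]
clr = [[0.0], [1.0], [1.0], [2.0]]
path = dtw_path(hab, clr)
aligned_pairs = [(hab[i], clr[j]) for i, j in path]
```

The resulting frame pairs are what a parallel HAB-to-CLR mapping would be trained on; here the second HAB frame is paired with both repeated CLR frames.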
  • 14. Objective Evaluation: Log Spectral Distortion. Table: Average LSD (in dB) by mapping and speaker. DNN: PD_F7 16.8, PD_M6 16.67, C_M7 16.44. GAN: PD_F7 12.85, PD_M6 12.58, C_M7 12.67.
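The deck does not spell out its LSD formula. The sketch below uses one common definition, assumed here: the root-mean-square difference between two log-magnitude spectra in dB, averaged over frames. The function names are hypothetical.

```python
import math


def frame_lsd(ref_mag, test_mag, eps=1e-10):
    """RMS difference in dB between two magnitude spectra of one frame."""
    diffs = [
        20.0 * math.log10(max(r, eps)) - 20.0 * math.log10(max(t, eps))
        for r, t in zip(ref_mag, test_mag)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def average_lsd(ref_frames, test_frames):
    """Mean of the per-frame LSD over time-aligned frames."""
    values = [frame_lsd(r, t) for r, t in zip(ref_frames, test_frames)]
    return sum(values) / len(values)


# A spectrum that is 10x too large in every bin is 20 dB away from the reference.
example = frame_lsd([10.0, 10.0], [1.0, 1.0])
```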
  • 15. Objective Evaluation: LSD. Figure: LSD (dB) of 25 test sentences for 3 speakers (panels: PD_F7, PD_M6, C_M7; x-axis: sentence ID); GAN vs DNN.
  • 16. Objective Evaluation: Variance ratio. Figure: Variance ratio σ²_CLR / σ²_MAP between CLR VAE-12 (CLR) and mapped VAE-12 (MAP) for each VAE-12 component, GAN vs DNN (panels: PD_F7, PD_M6, C_M7). Smaller is better.
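The per-component variance ratio in this figure can be computed as sketched below. It assumes each utterance is a list of frames and each frame a list of feature values; the choice of population variance is also an assumption, since the slide does not specify the estimator.

```python
from statistics import pvariance


def variance_ratio(clr_frames, mapped_frames):
    """Return sigma^2_CLR / sigma^2_MAP for every feature dimension."""
    clr_dims = zip(*clr_frames)      # transpose: one tuple per dimension
    map_dims = zip(*mapped_frames)
    return [pvariance(c) / pvariance(m) for c, m in zip(clr_dims, map_dims)]


# Toy 2-dimensional features: the first mapped dimension is over-smoothed.
clr = [[0.0, 1.0], [2.0, 3.0]]
mapped = [[0.0, 1.0], [1.0, 3.0]]
ratios = variance_ratio(clr, mapped)
```

A ratio above 1 means the mapped features vary less than real CLR features, which is the over-smoothing the GAN is meant to reduce.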
  • 17. Objective Evaluation: Example. Figure: Sentence: "Four hours of steady work faced us."
  • 18. Subjective Evaluation: Loudness difference was minimized using the RMSA measure. Stimuli were mixed with babble noise at 0 dB SNR. The test consisted of 25 sentences × 3 speakers × 5 conditions (2 purely vocoded, 1 hybrid, 2 mappings) = 375 unique trials. 60 participants on AMT each listened to 25 sentences and then typed the sentences. We manually counted the correct keywords in each sentence.
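Mixing a stimulus with noise at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the target. The sketch below shows that calculation on toy sample lists; it assumes plain mean-square power (no perceptual weighting) and a noise segment at least as long as the speech, and the function names are hypothetical.

```python
import math


def mean_power(signal):
    """Average squared amplitude of a sample sequence."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale the noise to the requested SNR and add it to the speech."""
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed, gain = mix_at_snr(speech, noise, snr_db=0.0)
```

At 0 dB SNR the scaled noise carries the same power as the speech, so the noise in this toy example is doubled in amplitude.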
  • 19. Subjective Evaluation. Figure: Keyword recall accuracy (y-axis: average keyword accuracy, 0 to 100; conditions: vocoded HAB, DNN, GAN, hybrid, vocoded CLR; speakers: CSM7, PDF7, PDM6).
  • 20. Experiment: Many-to-one mapping. Figure: cGANs-based mapping (SPK-1 … SPK-i … SPK-N HAB VAE-12 → mapping → best CLR VAE-12). Maps HAB speech from many speakers to CLR speech of a target speaker.
  • 21. Subjective Evaluation: Loudness difference was minimized using the RMSA measure. Stimuli were mixed with babble noise at 0 dB SNR. The test consisted of 25 sentences × 3 source speakers (CSM7, PDF7, PDM6) × 3 conditions (vocoded HAB, cGAN mapping, hybrid) + 25 sentences × 2 target speakers (CSM10, CSF15) × 1 condition (vocoded CLR) = 275 unique trials. 44 participants on AMT.
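The trial count on this slide can be checked by enumerating the design. The sketch below uses placeholder labels for the three source speakers and takes the target speakers and condition names from the slide; `enumerate_trials` is a hypothetical helper.

```python
from itertools import product


def enumerate_trials(sentences, speakers, conditions):
    """Every (sentence, speaker, condition) combination is one unique trial."""
    return list(product(sentences, speakers, conditions))


sentences = range(1, 26)  # 25 Harvard sentences

source_trials = enumerate_trials(
    sentences,
    ["source_1", "source_2", "source_3"],  # placeholder labels
    ["vocoded HAB", "cGAN mapping", "hybrid"],
)
target_trials = enumerate_trials(sentences, ["CSM10", "CSF15"], ["vocoded CLR"])

total_trials = len(source_trials) + len(target_trials)  # 225 + 50
```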
  • 22. Subjective Evaluation. Figure: Keyword recall accuracy (y-axis: average keyword accuracy, 0 to 100; conditions: vocoded HAB, GAN, hybrid, vocoded CLR; speakers: CSM7, PDF7, PDM6).
  • 23. Experiment: Many-to-many mapping. Figure: cGANs-based mapping (each speaker's HAB VAE-12, SPK-1 … SPK-i … SPK-N, → style mapping → the same speaker's CLR VAE-12). Learn the style differences, preserve speaker identities.
  • 24. Subjective Evaluation: The test consisted of 25 sentences × 3 speakers × 4 conditions (vocoded HAB, GAN, hybrid, vocoded CLR) = 300 unique trials. 24 listeners participated. Loudness difference was minimized using the RMSA measure. Stimuli were mixed with babble noise at 0 dB SNR.
  • 25. Subjective Evaluation. Table: Average keyword accuracy (CSM7 / PDF7 / PDM6). Vocoded HAB: 36.8 / 10 / 28.8. GAN: 39.6 / 15.6 / 26.8. Hybrid: 62 / 22.8 / 57.6. Vocoded CLR: 66.8 / 22.4 / 48.
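To read the table as gains over the habitual baseline, the values can be subtracted per speaker. The numbers below are copied from the table; the helper name `gain_over_hab` is hypothetical.

```python
ACCURACY = {
    "vocoded HAB": {"CSM7": 36.8, "PDF7": 10.0, "PDM6": 28.8},
    "GAN":         {"CSM7": 39.6, "PDF7": 15.6, "PDM6": 26.8},
    "hybrid":      {"CSM7": 62.0, "PDF7": 22.8, "PDM6": 57.6},
    "vocoded CLR": {"CSM7": 66.8, "PDF7": 22.4, "PDM6": 48.0},
}


def gain_over_hab(condition):
    """Difference in average keyword accuracy relative to vocoded HAB."""
    baseline = ACCURACY["vocoded HAB"]
    return {
        speaker: round(ACCURACY[condition][speaker] - baseline[speaker], 1)
        for speaker in baseline
    }


gan_gain = gain_over_hab("GAN")
improved_speakers = [spk for spk, g in gan_gain.items() if g > 0]
```

The GAN condition comes out ahead of vocoded HAB for CSM7 and PDF7 and behind it for PDM6.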
  • 26. Conclusion: We applied cGANs to HAB-to-CLR style conversion. (1) In speaker-dependent one-to-one mapping, cGANs outperform the DNN in terms of keyword recall accuracy; cGANs improved the intelligibility of two of three speakers. (2) In speaker-independent many-to-one mapping, cGANs can improve the speech intelligibility of one of three speakers. (3) In speaker-independent many-to-many mapping, cGANs can improve the keyword recall accuracy of two speakers, but the results are not significant. The modest results of speaker-independent style conversion are due to the small dataset and the fact that we did not attempt to transform additional acoustic features, such as phoneme durations.