©Yuki Saito, Aug. 18, 2022.
Towards Human-In-The-Loop
Speech Synthesis Technology
The University of Tokyo (UTokyo), Japan.
Online Research Talk Hosted by NUS, Singapore @ Zoom
Yuki Saito
Self Introduction
➢ SAITO Yuki (齋藤 佑樹)
– Born in
• Kushiro-shi, Hokkaido, Japan
– Educational Background
• Apr. 2016 ~ Mar. 2018: UTokyo (MS)
• Apr. 2018 ~ Mar. 2021: UTokyo (PhD)
– Research interests
• Text-To-Speech (TTS) & Voice Conversion (VC) based on deep learning
– Selected publications (from 11 journal papers & 25 conf. papers)
• Saito et al., "Statistical parametric speech synthesis incorporating
generative adversarial networks," IEEE/ACM TASLP, 2018.
• Saito et al., "Perceptual-similarity-aware deep speaker
representation learning for multi-speaker generative modeling,"
IEEE/ACM TASLP, 2021.
Our Lab. in UTokyo, Japan
➢ 3 groups organized by Prof. Saruwatari & Lect. Koyama
– Source separation, sound field analysis & synthesis, and TTS & VC
TTS/VC Research Group in Our Lab.
➢ Organized by Dr. Takamichi & me (since Apr. 2016)
– Current students: 4 PhD & 8 MS
– Past students: 1 PhD (me) & 10 MS
(Figure: TTS and VC working toward universal speech communication based on TTS/VC technologies)
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Research Field: Speech Synthesis
➢ Speech synthesis
– Technology for synthesizing speech using a computer
➢ Applications
– Speech communication assistance (e.g., speech translation)
– Entertainment (e.g., singing voice synthesis/conversion)
➢ DNN-based speech synthesis [Zen+13][Oord+16]
– Using a DNN to learn the statistical relation between input and speech
(Figure: Text-To-Speech (TTS) [Sagisaka+88] maps text to speech; Voice Conversion (VC) [Stylianou+88] maps input speech to output speech. DNN: Deep Neural Network)
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc...
(Figure: an acoustic model (generator) maps input feats. $\boldsymbol{x}$ to synthetic speech $\hat{\boldsymbol{y}}$, trained with a reconstruction loss against natural speech $\boldsymbol{y}$ plus an adversarial loss from a discriminator that outputs 1 for natural speech. GAN: Generative Adversarial Network. A minimal training sketch follows below.)
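To make the training scheme concrete, here is a minimal PyTorch sketch of GAN-based acoustic-model training. The `generator`/`discriminator` modules, the BCE-based adversarial loss, and the loss weighting are illustrative assumptions, not the exact recipe of [Saito+18] or the cited vocoders.

```python
# A minimal sketch of adversarial + reconstruction training for an
# acoustic model (generator); module names and loss choices are assumptions.
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, x, y, w_adv=1.0):
    """x: input feats.; y: natural speech (params. or waveform)."""
    y_hat = generator(x)                          # synthetic speech
    # Discriminator loss: natural -> 1, synthetic -> 0
    logits_real = discriminator(y)
    logits_fake = discriminator(y_hat.detach())   # stop gradients into G
    d_loss = (
        F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
        + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    )
    # Generator loss: reconstruction + adversarial (try to look natural)
    logits_gen = discriminator(y_hat)
    g_loss = F.l1_loss(y_hat, y) + w_adv * F.binary_cross_entropy_with_logits(
        logits_gen, torch.ones_like(logits_gen)
    )
    return g_loss, d_loss
```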
➢ The same training loop, with the discriminator swapped out
(Figure: the acoustic model (generator) is unchanged, but the feedback on synthetic speech $\hat{\boldsymbol{y}}$ now comes from a human listener's perception instead of the discriminator's adversarial loss)
Can we replace the GAN discriminator with a human listener?
Motivation of Human-In-The-Loop
Speech Synthesis Technologies
➢ Speech communication: intrinsically imperfect
– Humans often make mistakes, but we can communicate!
• Mispronunciations, wrong accents, unnatural pausing, etc...
– Mistakes can be corrected thru interaction betw. speaker & listener.
• c.f., Machine speech chain (speech synth. & recog.) [Tjandra+20]
– Intervention of human listeners will cultivate an advanced research field!
➢ Possible applications
– Human-machine interaction
• e.g., spoken dialogue systems
– Media creation
• e.g., singing voice synthesis & dubbing
The image was automatically generated by craiyon
Overview: Deep Speaker Representation Learning
➢ Deep Speaker Representation Learning (DSRL)
– DNN-based technology for learning Speaker Embeddings (SEs)
• Feature extraction for discriminative tasks (e.g., [Variani+14])
• Control of speaker ID in generative tasks (e.g., [Jia+18])
➢ This talk: method to learn SEs suitable for generative tasks
– Purpose: improving quality & controllability of synthetic speech
– Core idea: introducing human listeners for learning SEs that are highly
correlated with perceptual similarity among speakers
(Figure: DNN-based SEs used for a discriminative task (e.g., automatic speaker verification: ASV) and for a generative task (e.g., TTS and VC). DNN: Deep Neural Network)
Conventional Method:
Speaker-Classification-Based DSRL
➢ Learning to predict speaker ID from input speech parameters
– SEs suitable for speaker classification → also suitable for TTS/VC?
– One reason: low interpretability of SEs
(Figure: a spkr. encoder maps speech params. to d-vectors [Variani+14] by minimizing the cross-entropy of spkr. classification over spkr. IDs; the distance metric in the SE space ≠ the perceptual metric, i.e., speaker similarity)
Our Method:
Perceptual-Similarity-Aware DSRL
➢ 1. Large-scale scoring of perceptual speaker similarity
➢ 2. SE learning considering the similarity scores
(Figure: listeners score the perceptual similarity of spkr. pairs; a DNN (spkr. encoder) maps speech params. to SEs and is trained with a loss $L_\mathrm{SIM}^{(*)}$ to predict the similarity scores, where $*$ corresponds to the Vector / Matrix / Graph algorithms below)
Large Scale Scoring of
Perceptual Speaker Similarity
➢ Crowdsourcing of perceptual speaker similarity scores
– Dataset we used: 153 females in the JNAS corpus [Itou+99]
– More than 4,000 listeners scored the similarity of two speakers' voices.
➢ Histogram of the collected scores (figure)
(Scoring instruction: "To what degree do these two speakers' voices sound similar?", rated from −3: dissimilar to +3: similar)
Perceptual Speaker Similarity Matrix
➢ Similarity matrix $\mathbf{S} = [\boldsymbol{s}_1, \cdots, \boldsymbol{s}_i, \cdots, \boldsymbol{s}_{N_\mathrm{s}}]$
– $N_\mathrm{s}$: # of pre-stored (i.e., closed) speakers
– $\boldsymbol{s}_i = [s_{i,1}, \cdots, s_{i,j}, \cdots, s_{i,N_\mathrm{s}}]^\top$: the $i$th similarity score vector
• $s_{i,j}$: similarity of the $i$th & $j$th speakers ($-v \le s_{i,j} \le v$)
(Figure: (a) the full score matrix (153 females) and (b) a sub-matrix of (a) (13 females), with a color scale from −3 to +3)
I'll present three algorithms to learn the similarity.
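As a concrete illustration of how such a matrix can be assembled from crowdsourced pair scores, here is a minimal numpy sketch; the pair format, the averaging of repeated ratings, and the diagonal convention are our assumptions.

```python
# Build a symmetric perceptual-similarity matrix S from crowdsourced
# pair scores in [-v, +v] (a sketch, not the authors' code).
import numpy as np

def build_similarity_matrix(pair_scores, n_speakers, v=3.0):
    """pair_scores: iterable of (i, j, score) with -v <= score <= +v."""
    total = np.zeros((n_speakers, n_speakers))
    count = np.zeros((n_speakers, n_speakers))
    for i, j, s in pair_scores:
        for a, b in ((i, j), (j, i)):           # S is symmetric
            total[a, b] += s
            count[a, b] += 1
    S = np.divide(total, np.maximum(count, 1))  # mean score per pair
    np.fill_diagonal(S, v)                      # a speaker is maximally similar to herself
    return S
```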
Algorithm 1: Similarity Vector Embedding
➢ Predict a vector of the matrix 𝐒 from speech parameters
$$L_\mathrm{SIM}^\mathrm{(vec)}(\boldsymbol{s}, \hat{\boldsymbol{s}}) = \frac{1}{N_\mathrm{s}} (\hat{\boldsymbol{s}} - \boldsymbol{s})^\top (\hat{\boldsymbol{s}} - \boldsymbol{s})$$
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; a similarity-prediction branch outputs $\hat{\boldsymbol{s}}$, which $L_\mathrm{SIM}^\mathrm{(vec)}$ compares with the score vector $\boldsymbol{s}$ from the sim. matrix $\mathbf{S}$. A minimal loss sketch follows below.)
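A minimal PyTorch sketch of this loss, directly following the equation above; the similarity-prediction branch producing `s_hat` is omitted.

```python
import torch

def l_sim_vec(s_hat: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """(1 / N_s) * (s_hat - s)^T (s_hat - s) for 1-D score vectors."""
    diff = s_hat - s
    return diff.dot(diff) / s.numel()
```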
Algorithm 2: Similarity Matrix Embedding
➢ Associate the Gram matrix of SEs with the matrix 𝐒
$$L_\mathrm{SIM}^\mathrm{(mat)}(\mathbf{D}, \mathbf{S}) = \frac{1}{Z_\mathrm{s}} \left\| \tilde{\mathbf{K}}_\mathbf{D} - \tilde{\mathbf{S}} \right\|_F^2$$
– $\mathbf{K}_\mathbf{D}$: Gram matrix of the SEs $\boldsymbol{d}$, computed with a kernel $k(\cdot,\cdot)$
– $Z_\mathrm{s}$: normalization coefficient ($\tilde{\mathbf{S}}$ represents the off-diagonal matrix of $\mathbf{S}$)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$, whose Gram matrix is matched to the sim. matrix $\mathbf{S}$. A minimal loss sketch follows below.)
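A minimal PyTorch sketch of this loss, assuming a Gaussian kernel for $k(\cdot,\cdot)$ and an off-diagonal count as $Z_\mathrm{s}$; the paper's actual kernel and normalization may differ.

```python
import torch

def l_sim_mat(D: torch.Tensor, S: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """D: (N_s, dim) speaker embeddings; S: (N_s, N_s) similarity matrix.
    Frobenius distance between the off-diagonal parts of the SE Gram
    matrix and the similarity matrix."""
    sq_dist = torch.cdist(D, D) ** 2
    K = torch.exp(-gamma * sq_dist)                      # Gram matrix K_D
    mask = 1.0 - torch.eye(D.size(0), device=D.device)   # drop the diagonal
    z = mask.sum()                                       # normalization coefficient Z_s
    return ((K - S) * mask).pow(2).sum() / z
```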
Algorithm 3: Similarity Graph Embedding
➢ Learn the structure of speaker similarity graph from SE pairs
$$L_\mathrm{SIM}^\mathrm{(graph)}(\boldsymbol{d}_i, \boldsymbol{d}_j) = -a_{i,j} \log p_{i,j} - (1 - a_{i,j}) \log (1 - p_{i,j})$$
– $p_{i,j} = \exp(-\|\boldsymbol{d}_i - \boldsymbol{d}_j\|_2^2)$: edge probability (referring to [Li+18])
– $a_{i,j}$: edge label from the spkr. sim. graph derived from $\mathbf{S}$ (0: no edge, 1: edge exists)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$, and edge prediction is trained against the graph built from the sim. matrix. A minimal loss sketch follows below.)
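A minimal PyTorch sketch of this loss; how the edge labels $a_{i,j}$ are derived from $\mathbf{S}$ (e.g., by thresholding) is our assumption.

```python
import torch

def l_sim_graph(d_i: torch.Tensor, d_j: torch.Tensor, a_ij: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on edge existence with
    p_ij = exp(-||d_i - d_j||_2^2), clamped for numerical stability."""
    p = torch.exp(-((d_i - d_j) ** 2).sum(-1)).clamp(1e-6, 1.0 - 1e-6)
    return -(a_ij * torch.log(p) + (1.0 - a_ij) * torch.log(1.0 - p))
```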
Human-In-The-Loop Active Learning (AL) for Perceptual-Similarity-Aware SEs
➢ Overall framework: iterate similarity scoring & SE learning
– Obtaining better SEs while reducing the costs of scoring & learning
– Using partially observed similarity scores
(Figure: a loop of spkr. encoder training → score prediction → query selection → score annotation by listeners, over scored and unscored spkr. pairs)
➢ AL step 1: train the spkr. encoder using partially observed scores
– Any of the Vector / Matrix / Graph losses can be used.
➢ AL step 2: predict similarity scores for unscored spkr. pairs
➢ AL step 3: select unscored pairs to be scored next
– Query strategy: criterion to determine the priority of scoring
• {Higher, Middle, Lower}-Similarity First (HSF / MSF / LSF)
➢ AL step 4: annotate similarity scores to the selected spkr. pairs
– → return to AL step 1 (a loop sketch follows below)
(Figures: the AL loop diagram with each stage highlighted in turn)
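A high-level sketch of the loop; `train_encoder`, `predict_score`, and `ask_listeners` are stand-ins for the components described above, and scores are assumed normalized so that MSF targets predictions near 0.

```python
def active_learning(all_pairs, scores, n_iters=10, batch=43):
    """all_pairs: every speaker pair; scores: dict {pair: observed score}."""
    encoder = None
    for _ in range(n_iters):
        encoder = train_encoder(scores)                          # AL step 1
        unscored = [p for p in all_pairs if p not in scores]
        pred = {p: predict_score(encoder, p) for p in unscored}  # AL step 2
        # AL step 3, Middle-Similarity First: predicted scores closest to 0
        queries = sorted(unscored, key=lambda p: abs(pred[p]))[:batch]
        scores.update(ask_listeners(queries))                    # AL step 4
    return encoder
```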
➢ Experimental Evaluations
Experimental Conditions
Dataset (16 kHz sampling): JNAS [Itou+99], 153 female speakers; 5 utterances per speaker for scoring; about 130 / 15 utterances for DSRL & evaluation (F001 ~ F013: unseen speakers for evaluation)
Similarity score: −3 (dissimilar) ~ +3 (similar), normalized to [−1, +1] or [0, 1] in DSRL
Speech parameters: 40-dimensional mel-cepstra, F0, aperiodicity (extracted by STRAIGHT analysis [Kawahara+99])
DNNs: fully-connected (for details, please see our paper)
Dimensionality of SEs: 8
AL setting: pool-based simulation (using binary masking to exclude unobserved scores)
DSRL methods: Conventional: d-vectors [Variani+14]; Ours: Prop. (vec), Prop. (mat), or Prop. (graph)
Evaluation 1: SE Interpretability
➢ Scatter plots of human-/SE-derived similarity scores
– Prop. (*) highly correlated with the human-derived sim. scores.
• → Our DSRL can learn interpretable SEs better than d-vec!
(Figure: scatter plots of human-derived vs. SE-derived similarity scores (both in [0, 1]) for d-vec., Prop. (vec), Prop. (mat), and Prop. (graph), over seen–seen and seen–unseen speaker pairs)
Evaluation 2: Speaker Interpolation Controllability
➢ Task: generate new speaker identity by mixing two SEs
– We evaluated spkr. sim. between speech interpolated with $\alpha \in \{0.0, 0.25, 0.5, 0.75, 1.0\}$ and the original speaker's speech ($\alpha = 0$ or $1$).
– The score curves of Prop. (*) were closer to the red line.
• → Our SEs achieve higher controllability than d-vec.!
(20 answers/listener, total 30 × 2 listeners, method-wise preference XAB test; figure: preference score vs. mixing coefficient $\alpha$, where A is speech mixed with $\alpha = 0$ and B with $\alpha = 1$)
Evaluation 3: AL Cost Efficacy
➢ AL setting: starting DSRL from the Partially Scored (PS) situation and aiming to reach the Fully Scored (FS) one
– MSF was the best query strategy for all proposed methods.
– Prop. (vec / graph) reduced the cost, but Prop. (mat) didn't work.
In each AL iteration, sim. scores of 43 speaker pairs were newly annotated.
(Figure: AUC of similar speaker-pair detection vs. AL iteration)
Summary
➢ Purpose
– Learning SEs highly correlated with perceptual speaker similarity
➢ Proposed methods
– 1) Perceptual-similarity-aware learning of SEs
– 2) Human-in-the-loop AL for DSRL
➢ Results of our methods
– 1) learned SEs having high correlation with human perception
– 2) achieved better controllability in speaker interpolation
– 3) reduced costs of scoring/training by introducing AL
➢ For detailed discussion...
– Please read our TASLP paper (open access)!
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
– Speech emotion recognition for nonverbal vocalizations
➢ Q&A (until 4 p.m. in SGT & 5 p.m. in JST)
Overview: Speaker Adaptation for
Multi-Speaker TTS
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ DNN-based multi-speaker TTS [Fan+15][Hojo+18]
– Single DNN to generate multiple speakers' voices
• SE: conditional input to control the speaker ID of synthetic speech
➢ Speaker adaptation for multi-speaker TTS (e.g., [Jia+18])
– TTS of an unseen speaker's voice with a small amount of data
(Figure: a multi-speaker TTS model maps text to speech, conditioned on an SE)
Conventional Speaker Adaptation Method
➢ Transfer Learning (TL) from speaker verification [Jia+18]
– Speaker encoder for extracting SE from reference speech
• Pretrained on speaker verification (e.g., GE2E loss [Wan+18])
– Multi-speaker TTS model for synthesizing speech from (text, SE) pairs
• Training: generate voices of seen speakers (∈ training data)
• Inference: extract SE of unseen speaker & input to TTS model
– Issue: cannot be used w/o the reference speech
• e.g., deceased person w/o any speech recordings
(Figure: a FROZEN speaker encoder extracts the SE from ref. speech and feeds it to the multi-speaker TTS model)
Can we find the target speaker's SE w/o using ref. speech?
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ Core algorithm: Sequential Line Search (SLS) [Koyama+17] on the SE space
(Figure: candidate SEs sampled along a line segment in the SE space are fed with text to the multi-speaker TTS system to synthesize waveforms; the user selects one SE/waveform, and Bayesian optimization updates the line segment)
➢ SLS step 1: define a line segment in the SE space
➢ SLS step 2: synthesize waveforms using the candidate SEs on the segment
➢ SLS step 3: select one SE based on the user's speech perception
➢ SLS step 4: update the line segment using Bayesian optimization
➢ ... and loop the SLS steps until the user gets the desired outcome (a loop sketch follows below)
– Ref. speech & spkr. encoder are no longer needed in adaptation!
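A high-level sketch of the loop; `fit_preference_gp`, `next_segment`, `synthesize`, and `user_pick` are hypothetical stand-ins, and the actual update rule follows Sequential Line Search [Koyama+17] rather than this simplification.

```python
import numpy as np

def sls_adaptation(tts, text, lo, hi, n_iters=30, n_candidates=8):
    """lo, hi: endpoints of the initial line segment in the SE space."""
    best, history = lo, []
    for _ in range(n_iters):
        ts = np.linspace(0.0, 1.0, n_candidates)
        cands = [lo + t * (hi - lo) for t in ts]             # SLS step 1
        waves = [synthesize(tts, text, se) for se in cands]  # SLS step 2
        k = user_pick(waves)                                 # SLS step 3
        best = cands[k]
        history.append((cands, k))
        gp = fit_preference_gp(history)                      # SLS step 4:
        lo, hi = next_segment(gp, best)                      # Bayesian optimization
    return best
```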
Two Strategies for Improving Search Efficacy
➢ Performing SLS in the original SE space is inefficient because...
– It assumes the search space to be the $D$-dimensional hypercube $[0, 1]^D$. However, actual SEs are NOT distributed uniformly, leaving a "dead space" (figure).
– SEs in the dead space can degrade the naturalness of synthetic speech...
➢ Our strategies for SLS-based speaker adaptation (strategy 2 is sketched below)
– 1) Use the mean {male, female} speakers' SEs as initial line endpoints
• → Start the search from more natural voices
– 2) Set the search space to a quantile of the SEs in the training data
• → Search for more natural voices (but limit the search space)
– We empirically confirmed that these strategies significantly improved the naturalness of synthetic speech during the search.
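A minimal numpy sketch of strategy 2; the exact quantile level and per-dimension treatment are our assumptions.

```python
import numpy as np

def quantile_search_space(train_ses: np.ndarray, q: float = 0.05):
    """train_ses: (n_speakers, D) training SEs. Returns per-dimension
    (lower, upper) bounds covering the central 1 - 2q quantile mass."""
    lower = np.quantile(train_ses, q, axis=0)
    upper = np.quantile(train_ses, 1.0 - q, axis=0)
    return lower, upper
```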
➢ Experimental Evaluations
Experimental Conditions
Corpus for training speaker encoder: Corpus of Spontaneous Japanese (CSJ) [Maekawa03] (947 males and 470 females, 660 h)
TTS model: FastSpeech 2 [Ren+21]
Corpus for TTS model: "parallel100" subset of the Japanese Versatile Speech (JVS) corpus [Takamichi+20] (49 males and 51 females, 22 h, 100 sentences / speaker)
Data split: Train 90 speakers (44 males, 46 females); Test 4 speakers (2 males, 2 females); Validation 6 speakers (3 males, 3 females)
Vocoder: pretrained "universal_v1" model of HiFi-GAN [Kong+20] (published in ming024's GitHub repository)
Demonstration
➢ Interface for the SLS experiment
– Button to play the reference speaker's voice
• Simulating the situation where users have their desired voice in mind
– Slider to change multiple speakers' IDs smoothly
Human-In-The-Loop Experiment
➢ Conditions
– 8 participants searched for 4 target speakers w/ SLS (30 iterations).
– We computed the mel-spectrogram MAE betw. natural & synthetic speech for each searched SE and selected SEs based on the MAE values (a sketch follows below):
• SLS-best: lowest MAE; SLS-mean: closest to the mean MAE; SLS-worst: highest MAE
(Figure: each participant's searched SEs are compared against the ref. waveform)
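A minimal sketch of this selection criterion; `mel_spectrogram` is a hypothetical helper, and equal-length (same-text, time-aligned) spectrograms are assumed.

```python
import numpy as np

def mel_mae(natural_wave, synthetic_wave):
    """MAE between natural and synthetic mel-spectrograms."""
    ref = mel_spectrogram(natural_wave)
    syn = mel_spectrogram(synthetic_wave)
    return float(np.mean(np.abs(ref - syn)))

def pick_best_mean_worst(maes):
    """maes: one MAE value per searched SE; returns indices for
    SLS-best (lowest), SLS-mean (closest to the mean), SLS-worst (highest)."""
    maes = np.asarray(maes)
    return int(maes.argmin()), int(np.abs(maes - maes.mean()).argmin()), int(maes.argmax())
```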
Subjective Evaluation (Naturalness MOS)
(24 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test; figure omitted)
– Our methods achieve MOSs comparable to the TL-based method!
– SLS-worst tends to degrade the naturalness significantly.
Subjective Evaluation (Similarity MOS)
(20 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test; figure omitted)
– We observe a similar tendency to the naturalness MOS results.
Speech Samples
(Table of audio samples: Ground-Truth, TL, Mean-Speaker, SLS-worst, SLS-mean, and SLS-best for jvs078 (male), jvs005 (male), jvs060 (female), and jvs010 (female))
Other samples are available online!
Summary
➢ Purpose
– Speaker adaptation for multi-speaker TTS w/o ref. speech
➢ Proposed method
– SLS-based human-in-the-loop speaker adaptation algorithm
➢ Results of our method
– 1) achieved comparable performance to TL-based adaptation method
– 2) showed the difficulty in finding desirable SEs (less interpretability?)
➢ For detailed discussion...
– Please read our INTERSPEECH2022 paper (ACCEPTED)!
• Mr. Kenta Udagawa will talk about this work in the poster session.
Conclusions (Part 1)
➢ Main topic: human-in-the-loop speech synthesis
– Intervening human listeners in SOTA DNN-based TTS/VC methods
➢ Presented work
– 1) Human-in-the-loop deep speaker representation learning
– 2) Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Future prospects
– Continually trainable TTS/VC technology with the aid of humans
• As we grow, so do speech synthesis technologies!
➢ I'll physically attend INTERSPEECH2022 w/ 8 lab members!
– Really looking forward to meeting you in Incheon, South Korea :)
Thank you for your attention!