©Yuki Saito, Aug. 18, 2022.
Towards Human-In-The-Loop
Speech Synthesis Technology
The University of Tokyo (UTokyo), Japan.
Online Research Talk Hosted by NUS, Singapore @ Zoom
Yuki Saito
Self Introduction
➢ SAITO Yuki (齋藤 佑樹)
– Born in
• Kushiro-shi, Hokkaido, Japan
– Educational Background
• Apr. 2016 ~ Mar. 2018: UTokyo (MS)
• Apr. 2018 ~ Mar. 2021: UTokyo (PhD)
– Research interests
• Text-To-Speech (TTS) & Voice Conversion (VC) based on deep learning
– Selected publications (from 11 journal papers & 25 conf. papers)
• Saito et al., "Statistical parametric speech synthesis incorporating
generative adversarial networks," IEEE/ACM TASLP, 2018.
• Saito et al., "Perceptual-similarity-aware deep speaker
representation learning for multi-speaker generative modeling,"
IEEE/ACM TASLP, 2021.
Our Lab. in UTokyo, Japan
➢ 3 groups organized by Prof. Saruwatari & Lect. Koyama
– Source separation, sound field analysis & synthesis, and TTS & VC
TTS/VC Research Group in Our Lab.
➢ Organized by Dr. Takamichi & me (since Apr. 2016)
– Current students: 4 PhD & 8 MS
– Past students: 1 PhD (me) & 10 MS
(Figure: TTS and VC, toward universal speech communication based on TTS/VC technologies)
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Research Field: Speech Synthesis
➢ Speech synthesis
– Technology for synthesizing speech using a computer
➢ Applications
– Speech communication assistance (e.g., speech translation)
– Entertainment (e.g., singing voice synthesis/conversion)
➢ DNN-based speech synthesis [Zen+13][Oord+16]
– Using a DNN to learn the statistical mapping from input to speech
(Figure: Text-To-Speech (TTS) [Sagisaka+88] synthesizes speech from text; Voice Conversion (VC) [Stylianou+88] converts input speech to output speech while keeping the content, e.g., "Hello". DNN: Deep Neural Network)
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc.
(Figure: an acoustic model (generator) maps input feats. x to synthetic speech ŷ; it is trained with a reconstruction loss against natural speech y and an adversarial loss from a discriminator that outputs 1 for natural speech. GAN: Generative Adversarial Network)
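To make the figure above concrete, here is a minimal PyTorch-style sketch of a combined reconstruction + adversarial objective. The module names, loss weight, and the L1/BCE choices are illustrative assumptions, not the exact formulation of [Saito+18] or the cited vocoders.

```python
import torch
import torch.nn.functional as F

def generator_loss(y_hat, y, discriminator, adv_weight=1.0):
    """Acoustic-model (generator) loss: reconstruction + adversarial term."""
    recon = F.l1_loss(y_hat, y)           # reconstruction loss vs. natural speech y
    d_fake = discriminator(y_hat)         # discriminator output for synthetic speech
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))  # try to be judged "1: natural"
    return recon + adv_weight * adv

def discriminator_loss(y_hat, y, discriminator):
    """Discriminator loss: output 1 for natural speech, 0 for synthetic speech."""
    d_real = discriminator(y)
    d_fake = discriminator(y_hat.detach())
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
```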
(Figure: the same setup, but with the GAN discriminator replaced by a human listener whose perception judges the synthetic speech ŷ against the natural speech y.)
Can we replace the GAN discriminator with a human listener?
Motivation of Human-In-The-Loop
Speech Synthesis Technologies
➢ Speech communication: intrinsically imperfect
– Humans often make mistakes, but we can communicate!
• Mispronunciations, wrong accents, unnatural pausing, etc...
– Mistakes can be corrected thru interaction betw. speaker & listener.
• c.f., Machine speech chain (speech synth. & recog.) [Tjandra+20]
– Intervention of human listeners will cultivate an advanced research field!
➢ Possible applications
– Human-machine interaction
• e.g., spoken dialogue systems
– Media creation
• e.g., singing voice synthesis & dubbing
The image was automatically generated by craiyon
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Overview: Deep Speaker Representation Learning
➢ Deep Speaker Representation Learning (DSRL)
– DNN-based technology for learning Speaker Embeddings (SEs)
• Feature extraction for discriminative tasks (e.g., [Variani+14])
• Control of speaker ID in generative tasks (e.g., [Jia+18])
➢ This talk: method to learn SEs suitable for generative tasks
– Purpose: improving quality & controllability of synthetic speech
– Core idea: introducing human listeners for learning SEs that are highly
correlated with perceptual similarity among speakers
(Figure: a DNN spkr. encoder provides SEs both for a discriminative task, e.g., automatic speaker verification (ASV), and for a generative task, e.g., TTS and VC. DNN: Deep Neural Network)
Conventional Method:
Speaker-Classification-Based DSRL
➢ Learning to predict speaker ID from input speech parameters
– SEs suitable for speaker classification → also suitable for TTS/VC?
– One reason: low interpretability of SEs
• Distance metric in SE space ≠ perceptual metric (i.e., speaker similarity)
(Figure: d-vectors [Variani+14]: a spkr. encoder maps speech params. to SEs and is trained by minimizing the cross-entropy of spkr.-ID classification.)
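For concreteness, a minimal PyTorch-style sketch of this speaker-classification-based DSRL is given below; the layer sizes and architecture are placeholders, not the configuration of [Variani+14].

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Toy fully-connected speaker encoder: speech params. -> SE (d-vector style)."""
    def __init__(self, feat_dim, se_dim, n_speakers):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, se_dim))
        self.classifier = nn.Linear(se_dim, n_speakers)

    def forward(self, feats):
        se = self.encoder(feats)       # speaker embedding
        logits = self.classifier(se)   # speaker-ID prediction
        return se, logits

# Training minimizes cross-entropy against the speaker-ID labels:
# loss = nn.CrossEntropyLoss()(logits, speaker_ids)
```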
Our Method:
Perceptual-Similarity-Aware DSRL
➢ 1. Large-scale scoring of perceptual speaker similarity
➢ 2. SE learning considering the similarity scores
(Figure: 1) listeners score the perceptual similarity of spkr. pairs; 2) a DNN (spkr. encoder) maps speech params. to SEs and is trained with a loss $L_\mathrm{SIM}^{(*)}$ to predict the learned similarity, where $(*)$ is vector, matrix, or graph.)
Large Scale Scoring of
Perceptual Speaker Similarity
➢ Crowdsourcing of perceptual speaker similarity scores
– Dataset we used: 153 females in JNAS corpus [Itou+99]
– More than 4,000 listeners scored the similarity of two speakers' voices.
➢ Instruction for the scoring: "To what degree do these two speakers' voices sound similar?" (−3: dissimilar ~ +3: similar)
(Figure: histogram of the collected scores, with example speaker pairs scored +2, −3, and −2.)
Perceptual Speaker Similarity Matrix
➢ Similarity matrix $\mathbf{S} = [\boldsymbol{s}_1, \cdots, \boldsymbol{s}_i, \cdots, \boldsymbol{s}_{N_\mathrm{s}}]$
– $N_\mathrm{s}$: # of pre-stored (i.e., closed) speakers
– $\boldsymbol{s}_i = [s_{i,1}, \cdots, s_{i,j}, \cdots, s_{i,N_\mathrm{s}}]^\top$: the $i$th similarity score vector
• $s_{i,j}$: similarity of the $i$th & $j$th speakers ($-v \le s_{i,j} \le v$)
(Figure: (a) full score matrix (153 females) and (b) a sub-matrix of (a) (13 females), with scores color-coded from −3 to +3.)
I'll present three algorithms to learn the similarity.
Algorithm 1: Similarity Vector Embedding
➢ Predict a vector of the matrix 𝐒 from speech parameters
$L_\mathrm{SIM}^{(\mathrm{vec})}(\boldsymbol{s}, \hat{\boldsymbol{s}}) = \frac{1}{N_\mathrm{s}} (\hat{\boldsymbol{s}} - \boldsymbol{s})^\top (\hat{\boldsymbol{s}} - \boldsymbol{s})$
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; a similarity-prediction module outputs $\hat{\boldsymbol{s}}$, which is compared with the similarity score vector $\boldsymbol{s}$ taken from the matrix $\mathbf{S}$.)
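A minimal sketch of this loss in PyTorch-style code (tensor shapes and the surrounding similarity-prediction module are assumed):

```python
import torch

def sim_vec_loss(s_hat, s):
    """L_SIM^(vec): squared error between the predicted (s_hat) and ground-truth (s)
    similarity score vectors, normalized by the number of speakers N_s."""
    diff = s_hat - s
    return diff.dot(diff) / s.numel()
```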
Algorithm 2: Similarity Matrix Embedding
➢ Associate the Gram matrix of SEs with the matrix 𝐒
$L_\mathrm{SIM}^{(\mathrm{mat})}(\mathbf{D}, \mathbf{S}) = \frac{1}{Z_\mathrm{s}} \| \tilde{\mathbf{K}}_\mathbf{D} - \tilde{\mathbf{S}} \|_F^2$
– $\mathbf{K}_\mathbf{D}$: Gram matrix of the SEs, computed with a kernel $k(\cdot,\cdot)$
– $Z_\mathrm{s}$: normalization coefficient ($\tilde{\mathbf{S}}$ denotes the off-diagonal part of $\mathbf{S}$)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; their Gram matrix $\mathbf{K}_\mathbf{D}$ is matched to the similarity matrix $\mathbf{S}$.)
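A minimal sketch of this loss; the Gaussian kernel used to build the Gram matrix and the choice of Z_s as the number of off-diagonal entries are illustrative assumptions (the paper defines k(.,.)).

```python
import torch

def sim_mat_loss(D, S):
    """L_SIM^(mat): Frobenius-norm distance between the Gram matrix of the SEs
    and the similarity matrix, ignoring diagonal entries.
    D: (N_s, dim) speaker embeddings, S: (N_s, N_s) similarity matrix."""
    dist2 = torch.cdist(D, D) ** 2
    K = torch.exp(-dist2)                   # Gram matrix K_D (assumed Gaussian kernel)
    mask = 1.0 - torch.eye(S.shape[0])      # keep off-diagonal entries only
    Z = mask.sum()                          # normalization coefficient Z_s (assumed)
    return (((K - S) * mask) ** 2).sum() / Z
```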
Algorithm 3: Similarity Graph Embedding
➢ Learn the structure of speaker similarity graph from SE pairs
$L_\mathrm{SIM}^{(\mathrm{graph})}(\boldsymbol{d}_i, \boldsymbol{d}_j) = -a_{i,j} \log p_{i,j} - (1 - a_{i,j}) \log (1 - p_{i,j})$
– $p_{i,j} = \exp(-\|\boldsymbol{d}_i - \boldsymbol{d}_j\|_2^2)$: edge probability (referring to [Li+18])
– $a_{i,j}$: edge label from the spkr. similarity graph built from $\mathbf{S}$ (0: no edge, 1: edge exists)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$, and edge prediction on the speaker similarity graph is learned from SE pairs.)
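A minimal sketch of this loss for one speaker pair (how the edge label a_ij is derived from S is assumed):

```python
import torch

def sim_graph_loss(d_i, d_j, a_ij, eps=1e-7):
    """L_SIM^(graph): binary cross-entropy on the edge probability
    p_ij = exp(-||d_i - d_j||_2^2) of the speaker similarity graph.
    a_ij = 1 if the two speakers are connected (perceptually similar), else 0."""
    p_ij = torch.exp(-((d_i - d_j) ** 2).sum()).clamp(eps, 1 - eps)
    return -(a_ij * torch.log(p_ij) + (1 - a_ij) * torch.log(1 - p_ij))
```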
Human-In-The-Loop Active Learning (AL) for Perceptual-Similarity-Aware SEs
➢ Overall framework: iterate similarity scoring & SE learning
– Obtaining better SEs while reducing the costs of scoring & learning
– Using partially observed similarity scores
(Figure: the AL cycle consists of spkr. encoder training on scored spkr. pairs, score prediction for unscored spkr. pairs, query selection, and score annotation by listeners.)
➢ AL step 1: train spkr. encoder using partially observed scores
➢ AL step 2: predict similarity scores for unscored spkr. pairs
➢ AL step 3: select unscored pairs to be scored next
– Query strategy: criterion to determine priority of scoring
– Strategies considered: { Higher, Middle, Lower }-Similarity First (HSF / MSF / LSF)
➢ AL step 4: annotate similarity scores to selected spkr. pairs
– → return to AL step 1
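Putting AL steps 1-4 together, the cycle could be sketched as the pseudo-Python below. train_encoder, predict_scores, and collect_scores_from_listeners are hypothetical helpers, and ranking MSF queries by distance to the neutral score is one plausible reading of "Middle-Similarity First"; the 43 pairs per iteration follows the evaluation setting reported later.

```python
def active_learning(scored_pairs, unscored_pairs, n_iterations, n_queries=43,
                    strategy="MSF"):
    """Human-in-the-loop AL for perceptual-similarity-aware SEs (sketch)."""
    for _ in range(n_iterations):
        encoder = train_encoder(scored_pairs)                 # step 1: partially observed scores
        predicted = predict_scores(encoder, unscored_pairs)   # step 2: predict unscored pairs
        # step 3: query strategy -- Higher / Middle / Lower-Similarity First
        if strategy == "HSF":
            ranked = sorted(predicted, key=lambda p: -p.score)
        elif strategy == "LSF":
            ranked = sorted(predicted, key=lambda p: p.score)
        else:  # MSF: pairs closest to the neutral score first (assumed interpretation)
            ranked = sorted(predicted, key=lambda p: abs(p.score))
        queries = ranked[:n_queries]
        # step 4: human listeners annotate the selected pairs, then return to step 1
        newly_scored = collect_scores_from_listeners(queries)
        scored_pairs += newly_scored
        unscored_pairs = [p for p in unscored_pairs if p not in queries]
    return encoder
```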
➢ Experimental Evaluations
Experimental Conditions
– Dataset (16 kHz sampling): JNAS [Itou+99], 153 female speakers; 5 utterances per speaker for scoring; about 130 / 15 utterances for DSRL & evaluation (F001 ~ F013: unseen speakers for evaluation)
– Similarity score: −3 (dissimilar) ~ +3 (similar), normalized to [−1, +1] or [0, 1] in DSRL
– Speech parameters: 40-dimensional mel-cepstra, F0, and aperiodicity (extracted by STRAIGHT analysis [Kawahara+99])
– DNNs: fully connected (for details, please see our paper)
– Dimensionality of SEs: 8
– AL setting: pool-based simulation (using binary masking to exclude unobserved scores)
– DSRL methods: conventional d-vectors [Variani+14] vs. ours: Prop. (vec), Prop. (mat), or Prop. (graph)
Evaluation 1: SE Interpretability
➢ Scatter plots of human-/SE-derived similarity scores
– Prop. (*) highly correlated with the human-derived sim. scores.
• → Our DSRL can learn interpretable SEs better than d-vec!
(Figure: scatter plots of human-derived vs. SE-derived similarity scores (both in [0, 1]) for d-vec. and Prop. (vec / mat / graph), shown for seen-seen and seen-unseen speaker pairs.)
Evaluation 2: Speaker Interpolation Controllability
➢ Task: generate new speaker identity by mixing two SEs
– We evaluated spkr. similarity between speech interpolated with $\alpha \in \{0.0, 0.25, 0.5, 0.75, 1.0\}$ and the original speaker's speech ($\alpha = 0$ or $1$).
– The score curves of Prop. (*) were closer to the red line.
• → Our SEs achieve higher controllability than d-vec.!
(20 answers/listener, total 30 × 2 listeners, method-wise preference XAB test)
(Figure: preference score (0.0 to 1.0) vs. mixing coefficient $\alpha$, where A is the speech mixed w/ $\alpha = 0$ and B is the speech mixed w/ $\alpha = 1$.)
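The SE mixing used here can be sketched as a simple linear interpolation; the exact mixing scheme and the TTS call are assumptions for illustration.

```python
import numpy as np

def interpolate_speakers(se_a, se_b, alpha):
    """Mix two speaker embeddings; alpha = 0 gives speaker A, alpha = 1 gives speaker B."""
    return (1.0 - alpha) * se_a + alpha * se_b

# Hypothetical usage with a multi-speaker TTS model:
# for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     wav = tts_model.synthesize(text, speaker_embedding=interpolate_speakers(se_a, se_b, alpha))
```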
Evaluation 3: AL Cost Efficacy
➢ AL setting: starting DSRL from the Partially Scored (PS) situation and proceeding toward the Fully Scored (FS) situation
– MSF was the best query strategy for all proposed methods.
– Prop. (vec / graph) reduced the cost, but Prop. (mat) didn't work.
In each AL iteration, sim. scores of 43 speaker-pairs were newly annotated.
(Figure: AUC of similar speaker-pair detection as scoring proceeds from the Partially Scored (PS) to the Fully Scored (FS) situation.)
Summary
➢ Purpose
– Learning SEs highly correlated with perceptual speaker similarity
➢ Proposed methods
– 1) Perceptual-similarity-aware learning of SEs
– 2) Human-in-the-loop AL for DSRL
➢ Results of our methods
– 1) learned SEs having high correlation with human perception
– 2) achieved better controllability in speaker interpolation
– 3) reduced costs of scoring/training by introducing AL
➢ For detailed discussion...
– Please read our TASLP paper (open access)!
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
– Speech emotion recognition for nonverbal vocalizations
➢ Q&A (until 4 p.m. in SGT & 5 p.m. in JST)
Overview: Speaker Adaptation for
Multi-Speaker TTS
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ DNN-based multi-speaker TTS [Fan+15][Hojo+18]
– Single DNN to generate multiple speakers' voices
• SE: conditional input to control the speaker ID of synthetic speech
➢ Speaker adaptation for multi-speaker TTS (e.g., [Jia+18])
– TTS of an unseen speaker's voice with a small amount of data
(Figure: a multi-speaker TTS model synthesizes speech from text, conditioned on an SE.)
Conventional Speaker Adaptation Method
➢ Transfer Learning (TL) from speaker verification [Jia+18]
– Speaker encoder for extracting SE from reference speech
• Pretrained on speaker verification (e.g., GE2E loss [Wan+18])
– Multi-speaker TTS model for synthesizing speech from (text, SE) pairs
• Training: generate voices of seen speakers (∈ training data)
• Inference: extract SE of unseen speaker & input to TTS model
– Issue: cannot be used w/o the reference speech
• e.g., deceased person w/o any speech recordings
(Figure: a speaker encoder, FROZEN after pretraining, extracts the SE from ref. speech and feeds it to the multi-speaker TTS model.)
Can we find the target speaker's SE w/o using ref. speech?
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ Core algorithm: Sequential Line Search (SLS) [Koyama+17] on SE space
(Figure: the SLS loop. Candidate SEs on a line segment in the SE space are fed, together with text, to the multi-speaker TTS system to synthesize waveforms; the user selects one SE by listening, and Bayesian optimization updates the line segment.)
➢ SLS step 1: define line segment in SE space
➢ SLS step 2: synthesize waveforms using candidate SEs
➢ SLS step 3: select one SE based on user's speech perception
➢ SLS step 4: update line segment using Bayesian Optimization
➢ ...and the SLS steps are looped until the user obtains the desired voice
– Ref. speech & spkr. encoder are no longer needed in adaptation!
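The whole loop could be sketched as follows; initial_line_segment, ask_user_to_pick, and update_line_segment are hypothetical placeholders for the SLS/Bayesian-optimization machinery of [Koyama+17] and the listening interface, and the candidate spacing along the line is an assumption.

```python
import numpy as np

def sls_speaker_adaptation(tts_model, text, n_iterations=30, n_candidates=10):
    """Human-in-the-loop speaker adaptation by sequential line search over the SE space."""
    endpoints = initial_line_segment()                       # step 1: line segment in SE space
    best_se = None
    for _ in range(n_iterations):
        alphas = np.linspace(0.0, 1.0, n_candidates)
        candidates = [(1 - a) * endpoints[0] + a * endpoints[1] for a in alphas]
        waveforms = [tts_model.synthesize(text, se) for se in candidates]  # step 2: synthesize
        best_idx = ask_user_to_pick(waveforms)                             # step 3: user listens
        best_se = candidates[best_idx]
        endpoints = update_line_segment(best_se, endpoints)                # step 4: Bayesian opt.
    return best_se
```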
Two Strategies for Improving Search Efficacy
➢ Performing SLS in the original SE space is inefficient because...
– It assumes the search space to be the $D$-dimensional hypercube $[0, 1]^D$; however, actual SEs are NOT distributed uniformly.
– SEs in the dead space can degrade the naturalness of synthetic speech...
➢ Our strategies for SLS-based speaker adaptation (see the sketch below)
– 1) Use the mean {male, female} speakers' SEs as the initial line endpoints
• → Start the search from more natural voices
– 2) Set the search space to a quantile of the SEs in the training data
• → Search for more natural voices (but limit the search space)
– We empirically confirmed that these strategies significantly improved the naturalness of synthetic speech during the search.
(Figure: SEs of the training speakers occupy only part of the hypercube; the rest is dead space.)
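The two strategies could be sketched as follows; train_ses and train_genders are hypothetical arrays of training-speaker SEs and gender labels, and the quantile value is an assumption.

```python
import numpy as np

def initial_line_segment(train_ses, train_genders):
    """Strategy 1: start from the mean male SE and the mean female SE."""
    male_mean = train_ses[train_genders == "M"].mean(axis=0)
    female_mean = train_ses[train_genders == "F"].mean(axis=0)
    return male_mean, female_mean

def clip_to_quantile(se, train_ses, q=0.05):
    """Strategy 2: restrict each SE dimension to a quantile range of the training SEs."""
    lo = np.quantile(train_ses, q, axis=0)
    hi = np.quantile(train_ses, 1.0 - q, axis=0)
    return np.clip(se, lo, hi)
```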
➢ Experimental Evaluations
Experimental Conditions
– Corpus for training the speaker encoder: Corpus of Spontaneous Japanese (CSJ) [Maekawa03] (947 males and 470 females, 660 h)
– TTS model: FastSpeech 2 [Ren+21]
– Corpus for the TTS model: "parallel100" subset of the Japanese Versatile Speech (JVS) corpus [Takamichi+20] (49 males and 51 females, 22 h, 100 sentences / speaker)
– Data split: train 90 speakers (44 males, 46 females); validation 6 speakers (3 males, 3 females); test 4 speakers (2 males, 2 females)
– Vocoder: pretrained "universal_v1" model of HiFi-GAN [Kong+20] (published in ming024's GitHub repository)
Demonstration
➢ Interface for the SLS experiment
– Button to play the reference speaker's voice
• Simulating the situation where users have their desired voice in mind
– Slider to smoothly change the speaker identity across multiple speakers
Human-In-The-Loop Experiment
➢ Conditions
– 8 participants searched for 4 target speakers w/ SLS (30 iterations).
– We computed the mel-spectrogram MAE betw. natural & synthetic speech for each searched SE and selected SEs based on the MAE values:
• SLS-best: lowest MAE
• SLS-mean: closest to the mean MAE
• SLS-worst: highest MAE
(Figure: for each participant and reference waveform, the SEs found by SLS are compared by MAE to pick SLS-best / SLS-mean / SLS-worst.)
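The selection criterion can be sketched as follows (librosa-based; the sampling rate, mel settings, and the naive length alignment are assumptions, not the exact configuration of the experiment).

```python
import numpy as np
import librosa

def mel_spectrogram_mae(wav_natural, wav_synthetic, sr=22050, n_mels=80):
    """Mean absolute error between log-mel spectrograms of natural and synthetic speech."""
    def logmel(wav):
        m = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
        return np.log(np.maximum(m, 1e-10))
    m_nat, m_syn = logmel(wav_natural), logmel(wav_synthetic)
    frames = min(m_nat.shape[1], m_syn.shape[1])   # naive length alignment (assumption)
    return np.abs(m_nat[:, :frames] - m_syn[:, :frames]).mean()
```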
Subjective Evaluation (Naturalness MOS)
(24 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
– Our methods achieve MOSs comparable to the TL-based method!
– SLS-worst tends to degrade the naturalness significantly.
Subjective Evaluation (Similarity MOS)
(20 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
– We observe a similar tendency to the naturalness MOS results.
Speech Samples
(Table of audio samples: Ground-Truth, TL, Mean-Speaker, SLS-worst, SLS-mean, and SLS-best for jvs078 (male), jvs005 (male), jvs060 (female), and jvs010 (female).)
Other samples are available online!
Summary
➢ Purpose
– Speaker adaptation for multi-speaker TTS w/o ref. speech
➢ Proposed method
– SLS-based human-in-the-loop speaker adaptation algorithm
➢ Results of our method
– 1) achieved comparable performance to TL-based adaptation method
– 2) showed the difficulty in finding desirable SEs (less interpretability?)
➢ For detailed discussion...
– Please read our INTERSPEECH2022 paper (ACCEPTED)!
• Mr. Kenta Udagawa will present this work in the poster session.
Conclusions (Part 1)
➢ Main topic: human-in-the-loop speech synthesis
– Involving human listeners in SOTA DNN-based TTS/VC methods
➢ Presented work
– 1) Human-in-the-loop deep speaker representation learning
– 2) Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Future prospects
– Continually trainable TTS/VC technology with the aid of humans
• As we grow, so do speech synthesis technologies!
➢ I'll physically attend INTERSPEECH 2022 w/ 8 lab members!
– Really looking forward to meeting you in Incheon, South Korea :)
Thank you for your attention!
