©Yuki Saito, Aug. 18, 2022.
Towards Human-In-The-Loop
Speech Synthesis Technology
The University of Tokyo (UTokyo), Japan.
Online Research Talk Hosted by NUS, Singapore @ Zoom
Yuki Saito
Self Introduction
➢ SAITO Yuki (齋藤 佑樹)
– Born in
• Kushiro-shi, Hokkaido, Japan
– Educational Background
• Apr. 2016 ~ Mar. 2018: UTokyo (MS)
• Apr. 2018 ~ Mar. 2021: UTokyo (PhD)
– Research interests
• Text-To-Speech (TTS) & Voice Conversion (VC) based on deep learning
– Selected publications (from 11 journal papers & 25 conf. papers)
• Saito et al., "Statistical parametric speech synthesis incorporating
generative adversarial networks," IEEE/ACM TASLP, 2018.
• Saito et al., "Perceptual-similarity-aware deep speaker
representation learning for multi-speaker generative modeling,"
IEEE/ACM TASLP, 2021.
Our Lab. in UTokyo, Japan
➢ 3 groups organized by Prof. Saruwatari & Lect. Koyama
– Source separation, sound field analysis & synthesis, and TTS & VC
TTS/VC Research Group in Our Lab.
➢ Organized by Dr. Takamichi & me (since Apr. 2016)
– Current students: 4 PhD & 8 MS
– Past students: 1 PhD (me) & 10 MS
(Figure: TTS and VC, toward universal speech communication based on TTS/VC technologies)
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Research Field: Speech Synthesis
➢ Speech synthesis
– Technology for synthesizing speech using a computer
➢ Applications
– Speech communication assistance (e.g., speech translation)
– Entertainment (e.g., singing voice synthesis/conversion)
➢ DNN-based speech synthesis [Zen+13][Oord+16]
– Using a DNN to learn the statistical mapping from input to speech
(Figure: Text-To-Speech (TTS) [Sagisaka+88] synthesizes speech from text; Voice Conversion (VC) [Stylianou+88] converts input speech to output speech while keeping the content, e.g., "Hello". DNN: Deep Neural Network)
General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc.
(Figure: an acoustic model (generator) maps input feats. x to synthetic speech ŷ; it is trained with a reconstruction loss against natural speech y and an adversarial loss from a discriminator that outputs 1 for natural speech. GAN: Generative Adversarial Network)
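To make the figure above concrete, here is a minimal PyTorch-style sketch of a combined reconstruction + adversarial objective. The module names, loss weight, and the L1/BCE choices are illustrative assumptions, not the exact formulation of [Saito+18] or the cited vocoders.

```python
import torch
import torch.nn.functional as F

def generator_loss(y_hat, y, discriminator, adv_weight=1.0):
    """Acoustic-model (generator) loss: reconstruction + adversarial term."""
    recon = F.l1_loss(y_hat, y)           # reconstruction loss vs. natural speech y
    d_fake = discriminator(y_hat)         # discriminator output for synthetic speech
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))  # try to be judged "1: natural"
    return recon + adv_weight * adv

def discriminator_loss(y_hat, y, discriminator):
    """Discriminator loss: output 1 for natural speech, 0 for synthetic speech."""
    d_real = discriminator(y)
    d_fake = discriminator(y_hat.detach())
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
```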
(Figure: the same setup, but with the GAN discriminator replaced by a human listener whose perception judges the synthetic speech ŷ against the natural speech y.)
Can we replace the GAN discriminator with a human listener?
Motivation of Human-In-The-Loop
Speech Synthesis Technologies
➢ Speech communication: intrinsically imperfect
– Humans often make mistakes, but we can communicate!
• Mispronunciations, wrong accents, unnatural pausing, etc...
– Mistakes can be corrected thru interaction betw. speaker & listener.
• c.f., Machine speech chain (speech synth. & recog.) [Tjandra+20]
– Intervention of human listeners will cultivate an advanced research field!
➢ Possible applications
– Human-machine interaction
• e.g., spoken dialogue systems
– Media creation
• e.g., singing voice synthesis & dubbing
The image was automatically generated by craiyon
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
Overview: Deep Speaker Representation Learning
➢ Deep Speaker Representation Learning (DSRL)
– DNN-based technology for learning Speaker Embeddings (SEs)
• Feature extraction for discriminative tasks (e.g., [Variani+14])
• Control of speaker ID in generative tasks (e.g., [Jia+18])
➢ This talk: method to learn SEs suitable for generative tasks
– Purpose: improving quality & controllability of synthetic speech
– Core idea: introducing human listeners for learning SEs that are highly
correlated with perceptual similarity among speakers
(Figure: a DNN spkr. encoder provides SEs both for a discriminative task, e.g., automatic speaker verification (ASV), and for a generative task, e.g., TTS and VC. DNN: Deep Neural Network)
Conventional Method:
Speaker-Classification-Based DSRL
➢ Learning to predict speaker ID from input speech parameters
– SEs suitable for speaker classification → also suitable for TTS/VC?
– One reason: low interpretability of SEs
• Distance metric in SE space ≠ perceptual metric (i.e., speaker similarity)
(Figure: d-vectors [Variani+14]: a spkr. encoder maps speech params. to SEs and is trained by minimizing the cross-entropy of spkr.-ID classification.)
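For concreteness, a minimal PyTorch-style sketch of this speaker-classification-based DSRL is given below; the layer sizes and architecture are placeholders, not the configuration of [Variani+14].

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Toy fully-connected speaker encoder: speech params. -> SE (d-vector style)."""
    def __init__(self, feat_dim, se_dim, n_speakers):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, se_dim))
        self.classifier = nn.Linear(se_dim, n_speakers)

    def forward(self, feats):
        se = self.encoder(feats)       # speaker embedding
        logits = self.classifier(se)   # speaker-ID prediction
        return se, logits

# Training minimizes cross-entropy against the speaker-ID labels:
# loss = nn.CrossEntropyLoss()(logits, speaker_ids)
```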
Our Method:
Perceptual-Similarity-Aware DSRL
➢ 1. Large-scale scoring of perceptual speaker similarity
➢ 2. SE learning considering the similarity scores
(Figure: 1) listeners score the perceptual similarity of spkr. pairs; 2) a DNN (spkr. encoder) maps speech params. to SEs and is trained with a loss $L_\mathrm{SIM}^{(*)}$ to predict the learned similarity, where $(*)$ is vector, matrix, or graph.)
Large Scale Scoring of
Perceptual Speaker Similarity
➢ Crowdsourcing of perceptual speaker similarity scores
– Dataset we used: 153 females in JNAS corpus [Itou+99]
– More than 4,000 listeners scored the similarity of two speakers' voices.
➢ Instruction for the scoring: "To what degree do these two speakers' voices sound similar?" (−3: dissimilar ~ +3: similar)
(Figure: histogram of the collected scores, with example speaker pairs scored +2, −3, and −2.)
Perceptual Speaker Similarity Matrix
➢ Similarity matrix $\mathbf{S} = [\boldsymbol{s}_1, \cdots, \boldsymbol{s}_i, \cdots, \boldsymbol{s}_{N_\mathrm{s}}]$
– $N_\mathrm{s}$: # of pre-stored (i.e., closed) speakers
– $\boldsymbol{s}_i = [s_{i,1}, \cdots, s_{i,j}, \cdots, s_{i,N_\mathrm{s}}]^\top$: the $i$th similarity score vector
• $s_{i,j}$: similarity of the $i$th & $j$th speakers ($-v \le s_{i,j} \le v$)
(Figure: (a) full score matrix (153 females) and (b) a sub-matrix of (a) (13 females), with scores color-coded from −3 to +3.)
I'll present three algorithms to learn the similarity.
Algorithm 1: Similarity Vector Embedding
➢ Predict a vector of the matrix 𝐒 from speech parameters
$L_\mathrm{SIM}^{(\mathrm{vec})}(\boldsymbol{s}, \hat{\boldsymbol{s}}) = \frac{1}{N_\mathrm{s}} (\hat{\boldsymbol{s}} - \boldsymbol{s})^\top (\hat{\boldsymbol{s}} - \boldsymbol{s})$
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; a similarity-prediction module outputs $\hat{\boldsymbol{s}}$, which is compared with the similarity score vector $\boldsymbol{s}$ taken from the matrix $\mathbf{S}$.)
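A minimal sketch of this loss in PyTorch-style code (tensor shapes and the surrounding similarity-prediction module are assumed):

```python
import torch

def sim_vec_loss(s_hat, s):
    """L_SIM^(vec): squared error between the predicted (s_hat) and ground-truth (s)
    similarity score vectors, normalized by the number of speakers N_s."""
    diff = s_hat - s
    return diff.dot(diff) / s.numel()
```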
Algorithm 2: Similarity Matrix Embedding
➢ Associate the Gram matrix of SEs with the matrix 𝐒
$L_\mathrm{SIM}^{(\mathrm{mat})}(\mathbf{D}, \mathbf{S}) = \frac{1}{Z_\mathrm{s}} \| \tilde{\mathbf{K}}_\mathbf{D} - \tilde{\mathbf{S}} \|_F^2$
– $\mathbf{K}_\mathbf{D}$: Gram matrix of the SEs, computed with a kernel $k(\cdot,\cdot)$
– $Z_\mathrm{s}$: normalization coefficient ($\tilde{\mathbf{S}}$ denotes the off-diagonal part of $\mathbf{S}$)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$; their Gram matrix $\mathbf{K}_\mathbf{D}$ is matched to the similarity matrix $\mathbf{S}$.)
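A minimal sketch of this loss; the Gaussian kernel used to build the Gram matrix and the choice of Z_s as the number of off-diagonal entries are illustrative assumptions (the paper defines k(.,.)).

```python
import torch

def sim_mat_loss(D, S):
    """L_SIM^(mat): Frobenius-norm distance between the Gram matrix of the SEs
    and the similarity matrix, ignoring diagonal entries.
    D: (N_s, dim) speaker embeddings, S: (N_s, N_s) similarity matrix."""
    dist2 = torch.cdist(D, D) ** 2
    K = torch.exp(-dist2)                   # Gram matrix K_D (assumed Gaussian kernel)
    mask = 1.0 - torch.eye(S.shape[0])      # keep off-diagonal entries only
    Z = mask.sum()                          # normalization coefficient Z_s (assumed)
    return (((K - S) * mask) ** 2).sum() / Z
```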
Algorithm 3: Similarity Graph Embedding
➢ Learn the structure of speaker similarity graph from SE pairs
$L_\mathrm{SIM}^{(\mathrm{graph})}(\boldsymbol{d}_i, \boldsymbol{d}_j) = -a_{i,j} \log p_{i,j} - (1 - a_{i,j}) \log (1 - p_{i,j})$
– $p_{i,j} = \exp(-\|\boldsymbol{d}_i - \boldsymbol{d}_j\|_2^2)$: edge probability (referring to [Li+18])
– $a_{i,j}$: edge label from the spkr. similarity graph built from $\mathbf{S}$ (0: no edge, 1: edge exists)
(Figure: the spkr. encoder maps speech params. to SEs $\boldsymbol{d}$, and edge prediction on the speaker similarity graph is learned from SE pairs.)
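A minimal sketch of this loss for one speaker pair (how the edge label a_ij is derived from S is assumed):

```python
import torch

def sim_graph_loss(d_i, d_j, a_ij, eps=1e-7):
    """L_SIM^(graph): binary cross-entropy on the edge probability
    p_ij = exp(-||d_i - d_j||_2^2) of the speaker similarity graph.
    a_ij = 1 if the two speakers are connected (perceptually similar), else 0."""
    p_ij = torch.exp(-((d_i - d_j) ** 2).sum()).clamp(eps, 1 - eps)
    return -(a_ij * torch.log(p_ij) + (1 - a_ij) * torch.log(1 - p_ij))
```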
Human-In-The-Loop Active Learning (AL) for Perceptual-Similarity-Aware SEs
➢ Overall framework: iterate similarity scoring & SE learning
– Obtaining better SEs while reducing the costs of scoring & learning
– Using partially observed similarity scores
(Figure: the AL cycle consists of spkr. encoder training on scored spkr. pairs, score prediction for unscored spkr. pairs, query selection, and score annotation by listeners.)
➢ AL step 1: train spkr. encoder using partially observed scores
➢ AL step 2: predict similarity scores for unscored spkr. pairs
➢ AL step 3: select unscored pairs to be scored next
– Query strategy: criterion to determine priority of scoring
– Strategies considered: { Higher, Middle, Lower }-Similarity First (HSF / MSF / LSF)
➢ AL step 4: annotate similarity scores to selected spkr. pairs
– → return to AL step 1
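Putting AL steps 1-4 together, the cycle could be sketched as the pseudo-Python below. train_encoder, predict_scores, and collect_scores_from_listeners are hypothetical helpers, and ranking MSF queries by distance to the neutral score is one plausible reading of "Middle-Similarity First"; the 43 pairs per iteration follows the evaluation setting reported later.

```python
def active_learning(scored_pairs, unscored_pairs, n_iterations, n_queries=43,
                    strategy="MSF"):
    """Human-in-the-loop AL for perceptual-similarity-aware SEs (sketch)."""
    for _ in range(n_iterations):
        encoder = train_encoder(scored_pairs)                 # step 1: partially observed scores
        predicted = predict_scores(encoder, unscored_pairs)   # step 2: predict unscored pairs
        # step 3: query strategy -- Higher / Middle / Lower-Similarity First
        if strategy == "HSF":
            ranked = sorted(predicted, key=lambda p: -p.score)
        elif strategy == "LSF":
            ranked = sorted(predicted, key=lambda p: p.score)
        else:  # MSF: pairs closest to the neutral score first (assumed interpretation)
            ranked = sorted(predicted, key=lambda p: abs(p.score))
        queries = ranked[:n_queries]
        # step 4: human listeners annotate the selected pairs, then return to step 1
        newly_scored = collect_scores_from_listeners(queries)
        scored_pairs += newly_scored
        unscored_pairs = [p for p in unscored_pairs if p not in queries]
    return encoder
```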
➢ Experimental Evaluations
Experimental Conditions
– Dataset (16 kHz sampling): JNAS [Itou+99], 153 female speakers; 5 utterances per speaker for scoring; about 130 / 15 utterances for DSRL & evaluation (F001 ~ F013: unseen speakers for evaluation)
– Similarity score: −3 (dissimilar) ~ +3 (similar), normalized to [−1, +1] or [0, 1] in DSRL
– Speech parameters: 40-dimensional mel-cepstra, F0, and aperiodicity (extracted by STRAIGHT analysis [Kawahara+99])
– DNNs: fully connected (for details, please see our paper)
– Dimensionality of SEs: 8
– AL setting: pool-based simulation (using binary masking to exclude unobserved scores)
– DSRL methods: conventional d-vectors [Variani+14] vs. ours: Prop. (vec), Prop. (mat), or Prop. (graph)
Evaluation 1: SE Interpretability
➢ Scatter plots of human-/SE-derived similarity scores
– Prop. (*) highly correlated with the human-derived sim. scores.
• → Our DSRL can learn interpretable SEs better than d-vec!
(Figure: scatter plots of human-derived vs. SE-derived similarity scores (both in [0, 1]) for d-vec. and Prop. (vec / mat / graph), shown for seen-seen and seen-unseen speaker pairs.)
Evaluation 2: Speaker Interpolation Controllability
➢ Task: generate new speaker identity by mixing two SEs
– We evaluated spkr. similarity between speech interpolated with $\alpha \in \{0.0, 0.25, 0.5, 0.75, 1.0\}$ and the original speaker's speech ($\alpha = 0$ or $1$).
– The score curves of Prop. (*) were closer to the red line.
• → Our SEs achieve higher controllability than d-vec.!
(20 answers/listener, total 30 × 2 listeners, method-wise preference XAB test)
(Figure: preference score (0.0 to 1.0) vs. mixing coefficient $\alpha$, where A is the speech mixed w/ $\alpha = 0$ and B is the speech mixed w/ $\alpha = 1$.)
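The SE mixing used here can be sketched as a simple linear interpolation; the exact mixing scheme and the TTS call are assumptions for illustration.

```python
import numpy as np

def interpolate_speakers(se_a, se_b, alpha):
    """Mix two speaker embeddings; alpha = 0 gives speaker A, alpha = 1 gives speaker B."""
    return (1.0 - alpha) * se_a + alpha * se_b

# Hypothetical usage with a multi-speaker TTS model:
# for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
#     wav = tts_model.synthesize(text, speaker_embedding=interpolate_speakers(se_a, se_b, alpha))
```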
Evaluation 3: AL Cost Efficacy
➢ AL setting: starting DSRL from the Partially Scored (PS) situation and proceeding toward the Fully Scored (FS) situation
– MSF was the best query strategy for all proposed methods.
– Prop. (vec / graph) reduced the cost, but Prop. (mat) didn't work.
In each AL iteration, sim. scores of 43 speaker-pairs were newly annotated.
(Figure: AUC of similar speaker-pair detection as scoring proceeds from the Partially Scored (PS) to the Fully Scored (FS) situation.)
Summary
➢ Purpose
– Learning SEs highly correlated with perceptual speaker similarity
➢ Proposed methods
– 1) Perceptual-similarity-aware learning of SEs
– 2) Human-in-the-loop AL for DSRL
➢ Results of our methods
– 1) learned SEs having high correlation with human perception
– 2) achieved better controllability in speaker interpolation
– 3) reduced costs of scoring/training by introducing AL
➢ For detailed discussion...
– Please read our TASLP paper (open access)!
Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
– Speech emotion recognition for nonverbal vocalizations
➢ Q&A (until 4 p.m. in SGT & 5 p.m. in JST)
Overview: Speaker Adaptation for
Multi-Speaker TTS
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ DNN-based multi-speaker TTS [Fan+15][Hojo+18]
– Single DNN to generate multiple speakers' voices
• SE: conditional input to control the speaker ID of synthetic speech
➢ Speaker adaptation for multi-speaker TTS (e.g., [Jia+18])
– TTS of an unseen speaker's voice with a small amount of data
(Figure: a multi-speaker TTS model synthesizes speech from text, conditioned on an SE.)
Conventional Speaker Adaptation Method
➢ Transfer Learning (TL) from speaker verification [Jia+18]
– Speaker encoder for extracting SE from reference speech
• Pretrained on speaker verification (e.g., GE2E loss [Wan+18])
– Multi-speaker TTS model for synthesizing speech from (text, SE) pairs
• Training: generate voices of seen speakers (∈ training data)
• Inference: extract SE of unseen speaker & input to TTS model
– Issue: cannot be used w/o the reference speech
• e.g., deceased person w/o any speech recordings
(Figure: a speaker encoder, FROZEN after pretraining, extracts the SE from ref. speech and feeds it to the multi-speaker TTS model.)
Can we find the target speaker's SE w/o using ref. speech?
Proposed Method:
Human-In-The-Loop Speaker Adaptation
➢ Core algorithm: Sequential Line Search (SLS) [Koyama+17] on SE space
(Figure: the SLS loop. Candidate SEs on a line segment in the SE space are fed, together with text, to the multi-speaker TTS system to synthesize waveforms; the user selects one SE by listening, and Bayesian optimization updates the line segment.)
➢ SLS step 1: define line segment in SE space
➢ SLS step 2: synthesize waveforms using candidate SEs
➢ SLS step 3: select one SE based on user's speech perception
➢ SLS step 4: update line segment using Bayesian Optimization
➢ ...and the SLS steps are looped until the user obtains the desired voice
– Ref. speech & spkr. encoder are no longer needed in adaptation!
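The whole loop could be sketched as follows; initial_line_segment, ask_user_to_pick, and update_line_segment are hypothetical placeholders for the SLS/Bayesian-optimization machinery of [Koyama+17] and the listening interface, and the candidate spacing along the line is an assumption.

```python
import numpy as np

def sls_speaker_adaptation(tts_model, text, n_iterations=30, n_candidates=10):
    """Human-in-the-loop speaker adaptation by sequential line search over the SE space."""
    endpoints = initial_line_segment()                       # step 1: line segment in SE space
    best_se = None
    for _ in range(n_iterations):
        alphas = np.linspace(0.0, 1.0, n_candidates)
        candidates = [(1 - a) * endpoints[0] + a * endpoints[1] for a in alphas]
        waveforms = [tts_model.synthesize(text, se) for se in candidates]  # step 2: synthesize
        best_idx = ask_user_to_pick(waveforms)                             # step 3: user listens
        best_se = candidates[best_idx]
        endpoints = update_line_segment(best_se, endpoints)                # step 4: Bayesian opt.
    return best_se
```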
Two Strategies for Improving Search Efficacy
➢ Performing SLS in the original SE space is inefficient because...
– It assumes the search space to be the $D$-dimensional hypercube $[0, 1]^D$; however, actual SEs are NOT distributed uniformly.
– SEs in the dead space can degrade the naturalness of synthetic speech...
➢ Our strategies for SLS-based speaker adaptation (see the sketch below)
– 1) Use the mean {male, female} speakers' SEs as the initial line endpoints
• → Start the search from more natural voices
– 2) Set the search space to a quantile of the SEs in the training data
• → Search for more natural voices (but limit the search space)
– We empirically confirmed that these strategies significantly improved the naturalness of synthetic speech during the search.
(Figure: SEs of the training speakers occupy only part of the hypercube; the rest is dead space.)
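The two strategies could be sketched as follows; train_ses and train_genders are hypothetical arrays of training-speaker SEs and gender labels, and the quantile value is an assumption.

```python
import numpy as np

def initial_line_segment(train_ses, train_genders):
    """Strategy 1: start from the mean male SE and the mean female SE."""
    male_mean = train_ses[train_genders == "M"].mean(axis=0)
    female_mean = train_ses[train_genders == "F"].mean(axis=0)
    return male_mean, female_mean

def clip_to_quantile(se, train_ses, q=0.05):
    """Strategy 2: restrict each SE dimension to a quantile range of the training SEs."""
    lo = np.quantile(train_ses, q, axis=0)
    hi = np.quantile(train_ses, 1.0 - q, axis=0)
    return np.clip(se, lo, hi)
```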
➢ Experimental Evaluations
Experimental Conditions
– Corpus for training the speaker encoder: Corpus of Spontaneous Japanese (CSJ) [Maekawa03] (947 males and 470 females, 660 h)
– TTS model: FastSpeech 2 [Ren+21]
– Corpus for the TTS model: "parallel100" subset of the Japanese Versatile Speech (JVS) corpus [Takamichi+20] (49 males and 51 females, 22 h, 100 sentences / speaker)
– Data split: train 90 speakers (44 males, 46 females); validation 6 speakers (3 males, 3 females); test 4 speakers (2 males, 2 females)
– Vocoder: pretrained "universal_v1" model of HiFi-GAN [Kong+20] (published in ming024's GitHub repository)
Demonstration
➢ Interface for the SLS experiment
– Button to play the reference speaker's voice
• Simulating the situation where users have their desired voice in mind
– Slider to smoothly change the speaker identity across multiple speakers
Human-In-The-Loop Experiment
➢ Conditions
– 8 participants searched for 4 target speakers w/ SLS (30 iterations).
– We computed the mel-spectrogram MAE betw. natural & synthetic speech for each searched SE and selected SEs based on the MAE values:
• SLS-best: lowest MAE
• SLS-mean: closest to the mean MAE
• SLS-worst: highest MAE
(Figure: for each participant and reference waveform, the SEs found by SLS are compared by MAE to pick SLS-best / SLS-mean / SLS-worst.)
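The selection criterion can be sketched as follows (librosa-based; the sampling rate, mel settings, and the naive length alignment are assumptions, not the exact configuration of the experiment).

```python
import numpy as np
import librosa

def mel_spectrogram_mae(wav_natural, wav_synthetic, sr=22050, n_mels=80):
    """Mean absolute error between log-mel spectrograms of natural and synthetic speech."""
    def logmel(wav):
        m = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
        return np.log(np.maximum(m, 1e-10))
    m_nat, m_syn = logmel(wav_natural), logmel(wav_synthetic)
    frames = min(m_nat.shape[1], m_syn.shape[1])   # naive length alignment (assumption)
    return np.abs(m_nat[:, :frames] - m_syn[:, :frames]).mean()
```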
Subjective Evaluation (Naturalness MOS)
(24 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
– Our methods achieve MOSs comparable to the TL-based method!
– SLS-worst tends to degrade the naturalness significantly.
Subjective Evaluation (Similarity MOS)
(20 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
– We observe a similar tendency to the naturalness MOS results.
Speech Samples
(Table of audio samples: Ground-Truth, TL, Mean-Speaker, SLS-worst, SLS-mean, and SLS-best for jvs078 (male), jvs005 (male), jvs060 (female), and jvs010 (female).)
Other samples are available online!
Summary
➢ Purpose
– Speaker adaptation for multi-speaker TTS w/o ref. speech
➢ Proposed method
– SLS-based human-in-the-loop speaker adaptation algorithm
➢ Results of our method
– 1) achieved comparable performance to TL-based adaptation method
– 2) showed the difficulty in finding desirable SEs (less interpretability?)
➢ For detailed discussion...
– Please read our INTERSPEECH2022 paper (ACCEPTED)!
• Mr. Kenta Udagawa will present this work in the poster session.
Conclusions (Part 1)
➢ Main topic: human-in-the-loop speech synthesis
– Involving human listeners in SOTA DNN-based TTS/VC methods
➢ Presented work
– 1) Human-in-the-loop deep speaker representation learning
– 2) Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Future prospects
– Continually trainable TTS/VC technology with the aid of humans
• As we grow, so do speech synthesis technologies!
➢ I'll physically attend INTERSPEECH 2022 w/ 8 lab members!
– Really looking forward to meeting you in Incheon, South Korea :)
Thank you for your attention!
