2. Self Introduction
➢ SAITO Yuki (齋藤 佑樹)
– Born in
• Kushiro-shi, Hokkaido, Japan
– Educational Background
• Apr. 2016 ~ Mar. 2018: UTokyo (MS)
• Apr. 2018 ~ Mar. 2021: UTokyo (PhD)
– Research interests
• Text-To-Speech (TTS) & Voice Conversion (VC) based on deep learning
– Selected publications (from 11 journal papers & 25 conf. papers)
• Saito et al., "Statistical parametric speech synthesis incorporating
generative adversarial networks," IEEE/ACM TASLP, 2018.
• Saito et al., "Perceptual-similarity-aware deep speaker
representation learning for multi-speaker generative modeling,"
IEEE/ACM TASLP, 2021.
3. Our Lab. in UTokyo, Japan
➢ 3 groups organized by Prof. Saruwatari & Lect. Koyama
– Source separation, sound field analysis & synthesis, and TTS & VC
4. TTS/VC Research Group in Our Lab.
➢ Organized by Dr. Takamichi & me (since Apr. 2016)
– Current students: 4 PhD & 8 MS
– Past students: 1 PhD (me) & 10 MS
[Figure: TTS & VC — toward universal speech communication based on TTS/VC technologies]
5. Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
6. Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
7. Research Field: Speech Synthesis
➢ Speech synthesis
– Technology for synthesizing speech using a computer
➢ Applications
– Speech communication assistance (e.g., speech translation)
– Entertainment (e.g., singing voice synthesis/conversion)
➢ DNN-based speech synthesis [Zen+13][Oord+16]
– Using a DNN for learning the statistical relation betw. input and speech
[Figure: Text-To-Speech (TTS) maps text to speech [Sagisaka+88]; Voice Conversion (VC) maps input speech ("Hello") to output speech ("Hello") in another voice [Stylianou+88]. DNN: Deep Neural Network]
8. General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc.
[Figure: the acoustic model (generator) maps input feats. x to synthetic speech ŷ; it is trained with a reconstruction loss against natural speech y and an adversarial loss from a discriminator trained to output 1 for natural speech. GAN: Generative Adversarial Network]
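A minimal sketch of the training picture above, assuming PyTorch-style acoustic-model and discriminator modules (the interfaces, the probability-output discriminator, and the weight lambda_adv are illustrative assumptions, not the exact formulation of [Saito+18]):

```python
import torch
import torch.nn.functional as F

def generator_loss(acoustic_model, discriminator, x, y, lambda_adv=1.0):
    """Sketch: reconstruction loss + adversarial loss for the acoustic model.

    x: input feats., y: natural speech params.; the discriminator is assumed
    to output a probability in [0, 1], and lambda_adv is an illustrative weight.
    """
    y_hat = acoustic_model(x)                    # synthetic speech params.
    recon = F.l1_loss(y_hat, y)                  # reconstruction loss against natural speech
    d_fake = discriminator(y_hat)                # pushed toward 1 ("natural") by the generator
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return recon + lambda_adv * adv

def discriminator_loss(discriminator, y, y_hat):
    """Sketch: the discriminator labels natural speech as 1 and synthetic speech as 0."""
    d_real = discriminator(y)
    d_fake = discriminator(y_hat.detach())
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
```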
9. General Background
➢ SOTA DNN-based speech synthesis methods
– Quality of synthetic speech: as natural as human speech
– Adversarial training betw. two DNNs (i.e., GANs [Goodfellow+14])
• Ours [Saito+18], MelGAN [Kumar+19], Parallel WaveGAN [Yamamoto+20], HiFi-GAN [Kong+20], VITS [Kim+21], JETS [Lim+22], etc.
[Figure: the same pipeline, but with the discriminator replaced by a human listener whose perception judges the synthetic speech ŷ against natural speech y. GAN: Generative Adversarial Network]
Can we replace the GAN discriminator with a human listener?
10. Motivation of Human-In-The-Loop Speech Synthesis Technologies
➢ Speech communication: intrinsically imperfect
– Humans often make mistakes, but we can communicate!
• Mispronunciations, wrong accents, unnatural pausing, etc...
– Mistakes can be corrected thru interaction betw. speaker & listener.
• c.f., Machine speech chain (speech synth. & recog.) [Tjandra+20]
– Intervention of human listeners will cultivate advanced research field!
➢ Possible applications
– Human-machine interaction
• e.g., spoken dialogue systems
– Media creation
• e.g., singing voice synthesis & dubbing
[The image on the slide was automatically generated by craiyon.]
11. Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
• (The UTMOS system at The VoiceMOS Challenge 2022)
– Speech emotion recognition for nonverbal vocalizations
• (The 1st place at The ICML ExVo Competition 2022)
➢ Q&A (until 4pm in SGT / 5pm in JST)
12. Overview: Deep Speaker Representation Learning
➢ Deep Speaker Representation Learning (DSRL)
– DNN-based technology for learning Speaker Embeddings (SEs)
• Feature extraction for discriminative tasks (e.g., [Variani+14])
• Control of speaker ID in generative tasks (e.g., [Jia+18])
➢ This talk: method to learn SEs suitable for generative tasks
– Purpose: improving quality & controllability of synthetic speech
– Core idea: introducing human listeners for learning SEs that are highly
correlated with perceptual similarity among speakers
[Figure: a DNN speaker encoder provides SEs for a discriminative task (e.g., automatic speaker verification: ASV) and for a generative task (e.g., TTS and VC). DNN: Deep Neural Network]
13. Conventional Method: Speaker-Classification-Based DSRL
➢ Learning to predict speaker ID from input speech parameters
– SEs suitable for speaker classification → also suitable for TTS/VC?
– One reason: low interpretability of SEs
[Figure: a speaker encoder maps speech params. to d-vectors [Variani+14] by minimizing the cross-entropy of speaker classification against speaker IDs; the distance metric in the SE space ≠ the perceptual metric (i.e., speaker similarity).]
14. Our Method: Perceptual-Similarity-Aware DSRL
➢ 1. Large-scale scoring of perceptual speaker similarity
➢ 2. SE learning considering the similarity scores
[Figure: similarity scores of speaker pairs are collected by perceptual similarity scoring; a DNN speaker encoder maps speech params. to SEs, and a loss L_SIM^(*) trains the learned similarity between SEs to predict the scores, which can be organized as a vector, a matrix, or a graph.]
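To make the "loss to predict sim." idea concrete, here is a minimal sketch under my own simplifications (the paper's L_SIM^(*) has vector-, matrix-, and graph-based variants; the encoder interface and the cosine-similarity parameterization below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def similarity_loss(encoder, speech_i, speech_j, human_score):
    """Sketch: make the similarity between two SEs predict a human score in [0, 1].

    encoder, speech_i, speech_j, and human_score are placeholders; this is a
    simplification of L_SIM^(*), not the paper's exact formulation.
    """
    e_i = encoder(speech_i)                      # speaker embedding of speaker i
    e_j = encoder(speech_j)                      # speaker embedding of speaker j
    learned_sim = 0.5 * (F.cosine_similarity(e_i, e_j, dim=-1) + 1.0)  # map to [0, 1]
    return F.mse_loss(learned_sim, human_score)  # regress toward the perceptual score
```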
15. Large-Scale Scoring of Perceptual Speaker Similarity
➢ Crowdsourcing of perceptual speaker similarity scores
– Dataset we used: 153 females in the JNAS corpus [Itou+99]
– Over 4,000 listeners scored the similarity of two speakers' voices.
➢ Histogram of the collected scores (shown on the slide)
[Instruction of the scoring: "To what degree do these two speakers' voices sound similar?" (−3: dissimilar ~ +3: similar); example pairs on the slide are scored +2, −3, and −2.]
16. Perceptual Speaker Similarity Matrix
➢ Similarity matrix S = [s_1, ..., s_i, ..., s_{N_s}]
– N_s: # of pre-stored (i.e., closed) speakers
– s_i = [s_{i,1}, ..., s_{i,j}, ..., s_{i,N_s}]^T: the i-th similarity score vector
• s_{i,j}: similarity of the i-th & j-th speakers (−v ≤ s_{i,j} ≤ v)
[Figure: (a) full score matrix (153 females) and (b) sub-matrix of (a) (13 females), with a color scale from −3 to +3.]
I'll present three algorithms to learn the similarity.
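A minimal sketch of how such a matrix could be assembled from crowdsourced pair scores (the averaging of repeated ratings and the symmetrization are my assumptions; the slide only defines S and the range of its entries):

```python
import numpy as np

def build_similarity_matrix(pair_scores, n_speakers, v=3.0):
    """Sketch: aggregate crowdsourced (i, j, score) tuples into a matrix S.

    Averaging repeated ratings and symmetrizing are assumptions made here;
    the slide only states that entries satisfy -v <= s_ij <= v.
    """
    total = np.zeros((n_speakers, n_speakers))
    count = np.zeros((n_speakers, n_speakers))
    for i, j, score in pair_scores:
        total[i, j] += score
        total[j, i] += score
        count[i, j] += 1
        count[j, i] += 1
    S = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    return np.clip(S, -v, v)
```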
26. Experimental Conditions
– Dataset (16 kHz sampling): JNAS [Itou+99], 153 female speakers; 5 utterances per speaker for scoring; about 130 / 15 utterances for DSRL & evaluation (F001 ~ F013: unseen speakers for evaluation)
– Similarity score: −3 (dissimilar) ~ +3 (similar), normalized to [−1, +1] or [0, 1] in DSRL
– Speech parameters: 40-dimensional mel-cepstra, F0, aperiodicity (extracted by STRAIGHT analysis [Kawahara+99])
– DNNs: fully-connected (for details, please see our paper)
– Dimensionality of SEs: 8
– AL setting: pool-based simulation (using binary masking to exclude unobserved scores)
– DSRL methods: conventional: d-vectors [Variani+14]; ours: Prop. (vec), Prop. (mat), or Prop. (graph)
27. Evaluation 1: SE Interpretability
➢ Scatter plots of human-/SE-derived similarity scores
– Prop. (*) highly correlated with the human-derived sim. scores.
• → Our DSRL can learn interpretable SEs better than d-vec.!
[Figure: scatter plots of SE-derived vs. human-derived similarity (both in [0, 1]) for d-vec., Prop. (vec), Prop. (mat), and Prop. (graph), shown separately for seen–seen and seen–unseen speaker pairs.]
28. Evaluation 2: Speaker Interpolation Controllability
➢ Task: generate a new speaker identity by mixing two SEs
– We evaluated spkr. sim. between speech interpolated with α ∈ {0.0, 0.25, 0.5, 0.75, 1.0} and the original speaker's speech (α = 0 or 1).
– The score curves of Prop. (*) were closer to the red line.
• → Our SEs achieve higher controllability than d-vec.!
(20 answers/listener, total 30 × 2 listeners, method-wise preference XAB test)
[Figure: preference score (0.0–1.0) vs. mixing coefficient α (0.0–1.0); A is mixed w/ α = 0 and B is mixed w/ α = 1.]
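A minimal sketch of the interpolation itself (my illustration; the variable names and the toy 8-dimensional SEs are assumptions, with 8 matching the SE dimensionality in the conditions slide):

```python
import numpy as np

def interpolate_speaker_embeddings(se_a, se_b, alpha):
    """Linear mix of two speaker embeddings: alpha=0 -> speaker A, alpha=1 -> speaker B."""
    return (1.0 - alpha) * np.asarray(se_a) + alpha * np.asarray(se_b)

# Toy usage with 8-dimensional SEs; the mixed SE would condition the TTS/VC model.
se_a, se_b = np.random.randn(8), np.random.randn(8)
mixed = [interpolate_speaker_embeddings(se_a, se_b, a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```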
29. Evaluation 3: AL Cost Efficacy
➢ AL setting: starting DSRL from the Partially Scored (PS) situation to reach the Fully Scored (FS) situation
– MSF was the best query strategy for all proposed methods.
– Prop. (vec / graph) reduced the cost, but Prop. (mat) didn't work.
In each AL iteration, sim. scores of 43 speaker-pairs were newly annotated.
[Figure: AUC of similar speaker-pair detection over AL iterations, from PS to FS.]
30. Summary
➢ Purpose
– Learning SEs highly correlated with perceptual speaker similarity
➢ Proposed methods
– 1) Perceptual-similarity-aware learning of SEs
– 2) Human-in-the-loop AL for DSRL
➢ Results of our methods
– 1) learned SEs having high correlation with human perception
– 2) achieved better controllability in speaker interpolation
– 3) reduced costs of scoring/training by introducing AL
➢ For detailed discussion...
– Please read our TASLP paper (open access)!
31. Table of Contents
➢ Part 1: approx. 45 min (presented by me)
– Human-in-the-loop deep speaker representation learning
– Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Part 2: approx. 45 min (presented by Mr. Xin)
– Automatic quality assessment of synthetic speech
– Speech emotion recognition for nonverbal vocalizations
➢ Q&A (until 4 p.m. in SGT & 5 p.m. in JST)
32. Overview: Speaker Adaptation for Multi-Speaker TTS
➢ Text-To-Speech (TTS) [Sagisaka+88]
– Technology to artificially synthesize speech from given text
➢ DNN-based multi-speaker TTS [Fan+15][Hojo+18]
– Single DNN to generate multiple speakers' voices
• SE: conditional input to control the speaker ID of synthetic speech
➢ Speaker adaptation for multi-speaker TTS (e.g., [Jia+18])
– TTS of an unseen speaker's voice with a small amount of data
[Figure: a multi-speaker TTS model maps text to speech, conditioned on an SE.]
33. Conventional Speaker Adaptation Method
➢ Transfer Learning (TL) from speaker verification [Jia+18]
– Speaker encoder for extracting the SE from reference speech
• Pretrained on speaker verification (e.g., GE2E loss [Wan+18])
– Multi-speaker TTS model for synthesizing speech from (text, SE) pairs
• Training: generate voices of seen speakers (∈ training data)
• Inference: extract the SE of an unseen speaker & input it to the TTS model
– Issue: cannot be used w/o the reference speech
• e.g., a deceased person w/o any speech recordings
[Figure: a FROZEN speaker encoder extracts the SE from ref. speech and feeds it to the multi-speaker TTS model.]
Can we find the target speaker's SE w/o using ref. speech?
34. Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ Core algorithm: Sequential Line Search (SLS) [Koyama+17] on the SE space
[Figure: candidate SEs sampled along a line segment in the SE space are synthesized into waveforms from text by the multi-speaker TTS system; the user selects one SE (and its waveform), and Bayesian optimization updates the line segment.]
37. Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ SLS step 3: select one SE based on the user's speech perception
[Figure: the user listens to the waveforms synthesized from the candidate SEs and selects one SE / waveform.]
38. Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ SLS step 4: update the line segment using Bayesian optimization
[Figure: based on the user's selection, Bayesian optimization updates the line segment of candidate SEs in the SE space.]
39. Proposed Method: Human-In-The-Loop Speaker Adaptation
➢ ... and the SLS steps loop until the user gets the desired outcome
– Ref. speech & spkr. encoder are no longer needed in adaptation!
[Figure: the select-and-update loop repeats; the multi-speaker TTS system synthesizes waveforms from candidate SEs, the user selects one, and Bayesian optimization updates the line segment until the selected SE / waveform is satisfactory.]
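To summarize the loop in code, a highly simplified sketch (my illustration, not the actual SLS implementation of [Koyama+17]); `tts`, `ask_user`, and `update_line_segment` are placeholders for the synthesis system, the listening/selection interface, and the Gaussian-process-based Bayesian-optimization update, and the candidate count is an assumption:

```python
import numpy as np

def human_in_the_loop_adaptation(tts, text, endpoint_a, endpoint_b,
                                 ask_user, update_line_segment, n_iterations=30):
    """Simplified sketch of SLS-based speaker adaptation without reference speech.

    endpoint_a / endpoint_b: current endpoints of the line segment in SE space;
    ask_user returns the index of the waveform the user prefers.
    """
    for _ in range(n_iterations):
        # Sample candidate SEs along the current line segment in SE space.
        ts = np.linspace(0.0, 1.0, num=9)
        candidates = [(1.0 - t) * endpoint_a + t * endpoint_b for t in ts]
        # Synthesize a waveform for each candidate SE from the same text.
        waveforms = [tts(text, se) for se in candidates]
        # The user listens and picks the waveform closest to the voice in mind.
        best_idx = ask_user(waveforms)
        selected_se = candidates[best_idx]
        # Bayesian optimization proposes a new line segment from the feedback.
        endpoint_a, endpoint_b = update_line_segment(candidates, best_idx)
    return selected_se
```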
40. Two Strategies for Improving Search Efficacy
➢ Performing SLS in the original SE space is inefficient because...
– It assumes the search space to be the D-dimensional hypercube [0, 1]^D. However, actual SEs are NOT distributed uniformly (e.g., the figure on the slide).
– SEs in the dead space can degrade the naturalness of synthetic speech...
➢ Our strategies for SLS-based speaker adaptation (see the sketch after this list)
– 1) Use the mean {male, female} speakers' SEs as the initial line endpoints
• → Start the search from more natural voices
– 2) Set the search space to a quantile of the SEs in the training data
• → Search for more natural voices (but limit the search space)
– We empirically confirmed that these strategies significantly improved the naturalness of synthetic speech during the search.
[Figure: SE space with a "dead space" containing no training SEs.]
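A minimal sketch of strategy 2 under my own assumptions (the quantile levels and function names are illustrative; the talk does not specify them): bound each SE dimension by quantiles of the training-set SEs so the search stays in a region populated by natural voices.

```python
import numpy as np

def quantile_search_bounds(training_ses, lower_q=0.05, upper_q=0.95):
    """Per-dimension search bounds from quantiles of the training SEs.

    training_ses: array of shape (n_speakers, se_dim); the quantile levels
    here are illustrative, not values from the paper.
    """
    lower = np.quantile(training_ses, lower_q, axis=0)
    upper = np.quantile(training_ses, upper_q, axis=0)
    return lower, upper

def clip_to_bounds(se, lower, upper):
    """Keep a candidate SE inside the bounded search space."""
    return np.clip(se, lower, upper)
```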
42. Experimental Conditions
– Corpus for training the speaker encoder: Corpus of Spontaneous Japanese (CSJ) [Maekawa03] (947 males and 470 females, 660 h)
– TTS model: FastSpeech 2 [Ren+21]
– Corpus for the TTS model: "parallel100" subset of the Japanese Versatile Speech (JVS) corpus [Takamichi+20] (49 males and 51 females, 22 h, 100 sentences / speaker)
– Data split: train 90 speakers (44 males, 46 females), test 4 speakers (2 males, 2 females), validation 6 speakers (3 males, 3 females)
– Vocoder: pretrained "universal_v1" model of HiFi-GAN [Kong+20] (published in ming024's GitHub repository)
43. Demonstration
➢ Interface for the SLS experiment
– Button to play the reference speaker's voice
• Simulating a situation where users have their desired voice in mind
– Slider to change multiple speakers' IDs smoothly
44. Human-In-The-Loop Experiment
➢ Conditions
– 8 participants searched for 4 target speakers w/ SLS (30 iterations).
– We computed the mel-spectrogram MAE betw. natural & synthetic speech for each searched SE and selected SEs based on the MAE values:
• SLS-best: lowest MAE, SLS-mean: closest to the mean MAE, SLS-worst: highest MAE
[Figure: for each ref. waveform, participants 1–8 run SLS for our method; from the searched SEs, SLS-best / SLS-mean / SLS-worst are picked by MAE.]
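As a concrete reading of the selection criterion, a minimal sketch of a mel-spectrogram MAE (a librosa-based illustration; the sampling rate, mel settings, and length handling are my assumptions, not the talk's exact setup):

```python
import librosa
import numpy as np

def mel_spectrogram_mae(natural_wav, synthetic_wav, sr=22050, n_mels=80):
    """Mean absolute error between log-mel spectrograms of two waveforms.

    sr and n_mels are illustrative defaults; the waveforms are assumed
    to be roughly time-aligned.
    """
    def log_mel(wav):
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
        return np.log(mel + 1e-10)
    m_nat, m_syn = log_mel(natural_wav), log_mel(synthetic_wav)
    n_frames = min(m_nat.shape[1], m_syn.shape[1])   # crude length matching
    return float(np.mean(np.abs(m_nat[:, :n_frames] - m_syn[:, :n_frames])))
```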
49. Subjective Evaluation (Similarity MOS)
(20 answers/listener, total 50 × 4 listeners, target-speaker-wise MOS test)
We observe a similar tendency to the naturalness MOS results.
51. Summary
➢ Purpose
– Speaker adaptation for multi-speaker TTS w/o ref. speech
➢ Proposed method
– SLS-based human-in-the-loop speaker adaptation algorithm
➢ Results of our method
– 1) achieved performance comparable to the TL-based adaptation method
– 2) showed the difficulty in finding desirable SEs (less interpretability?)
➢ For detailed discussion...
– Please read our INTERSPEECH2022 paper (ACCEPTED)!
• Mr. Kenta Udagawa will talk about this work in the poster session.
52. Conclusions (Part 1)
➢ Main topic: human-in-the-loop speech synthesis
– Intervening human listeners in SOTA DNN-based TTS/VC methods
➢ Presented work
– 1) Human-in-the-loop deep speaker representation learning
– 2) Human-in-the-loop speaker adaptation for multi-speaker TTS
➢ Future prospects
– Continually trainable TTS/VC technology with the aid of humans
• As we grow, so do speech synthesis technologies!
➢ I'll physically attend INTERSPEECH2022 w/ 8 lab members!
– Really looking forward to meeting you in Incheon, South Korea :)
Thank you for your attention!