2018 Speech Processing Courses in Crete (SPCC2018)
"Toawrds flexible and intelligible end-to-end speech synthesis systems"
Hands-on slides
Tomoki Toda: Hands-on Voice Conversion, July 26, 2018
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
12. Example of Speech Dataset
• In this session, let's use part of the VCC2018 database [Lorenzo‐Trueba; '18b].
• Sampling frequency: 22.05 kHz
• Only the speakers for the parallel training task are used.
• Only 60 out of 81 training utterances and 20 out of 35 evaluation utterances are used.
• Note: The following wav files have already been set up on your computer.
data/wav/{SF1, SF2, SM1, SM2}/ # 4 source speakers
data/wav/{TF1, TF2, TM1, TM2}/ # 4 target speakers
10001.wav – 10060.wav # 60 utterances for training
30001.wav – 30020.wav # 20 utterances for evaluation
• Let’s check each speaker’s voice by listening to some wav files, e.g.,
data/wav/SF1/10001.wav & data/wav/TF1/10001.wav
data/wav/SM1/30001.wav & data/wav/TM1/30001.wav
:
You can download the VCC2018 database by executing
$ python3 download_speech_corpus.py downloader_conf/vcc2018.yml
and use it. Note that the speaker names differ slightly from those shown in these slides.
13. Procedure: Initialization Step
[Block diagram: source and target voices enter step 0 (parallel data preparation & parameter configurations), which yields F0 & power histograms for each speaker and the parameter configurations. Step 1 (speech analysis) extracts the source and target speech parameter sequences. Step 2 (statistics calculation) computes speaker-dependent statistics. Step 3 (joint feature development) aligns the source and target feature sequences with a time warping function to build joint features. Step 4 (model training) trains the conversion models. Step 5 (conversion) produces the converted feature sequence, e.g., converted mel-cepstrum.]
14. Initialization 1: List File Generation
• Select a source & target speaker pair for the same‐gender conversion
• Source speakers: females = SF1 & SF2, males = SM1 & SM2
• Target speakers: females = TF1 & TF2, males = TM1 & TM2
• Generate list files for your selected speaker‐pair by executing
$ python3 initialize.py -1 SourceSpeaker TargetSpeaker SamplingRate
e.g., if setting source = SF1, target = TF1, sampling = 22.05 kHz,
$ python3 initialize.py -1 SF1 TF1 22050
• 4 list files will be generated under the list/ directory.
list/SF1_train.list # training data list for the source speaker, SF1
list/TF1_train.list # training data list for the target speaker, TF1
list/SF1_eval.list # evaluation data list for SF1
list/TF1_eval.list # evaluation data list for TF1
• Modify each list file to define the training & evaluation utterance pairs (a scripted version is sketched below).
list/{SF1,TF1}_train.list # keep only utterances 10001 – 10060
list/{SF1,TF1}_eval.list # keep only utterances 30001 – 30020
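If you prefer to script this edit, here is a minimal sketch. It assumes each list file holds one utterance ID per line, as generated above; the helper trim_list is hypothetical, not part of sprocket.

# trim_lists.py -- hypothetical helper, not part of sprocket
from pathlib import Path

def trim_list(path, keep_ids):
    # Rewrite a list file, keeping only lines that mention one of keep_ids.
    lines = Path(path).read_text().splitlines()
    kept = [ln for ln in lines if any(uid in ln for uid in keep_ids)]
    Path(path).write_text("\n".join(kept) + "\n")

train_ids = ["{:05d}".format(n) for n in range(10001, 10061)]  # 10001 - 10060
eval_ids = ["{:05d}".format(n) for n in range(30001, 30021)]   # 30001 - 30020
for spk in ("SF1", "TF1"):
    trim_list("list/{}_train.list".format(spk), train_ids)
    trim_list("list/{}_eval.list".format(spk), eval_ids)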
19. Initialization 3: Manual Settings
• Perform speech analysis for your selected speaker‐pair by executing
$ python3 initialize.py -3 SourceSpeaker TargetSpeaker SamplingRate
e.g.,
$ python3 initialize.py -3 SF1 TF1 22050
• The following message will be printed out.
### 3. create figures to define parameters ###
Extract: data/wav/SF1/10001.wav
Extract: data/wav/SF1/10002.wav
:
• Finally, 4 PNG files will be generated under the conf/figure/ directory.
conf/figure/SF1_f0histogram.png # F0 histogram of SF1
conf/figure/SF1_npowhistogram.png # normalized power histogram of SF1
conf/figure/TF1_f0histogram.png # F0 histogram of TF1
conf/figure/TF1_npowhistogram.png # normalized power histogram of TF1
20. Example of F0 Histogram
• Adjusting the F0 search range to each speaker is very effective for reducing
F0 extraction errors, such as F0 halving and doubling errors.
• NOTE: This is a very important process, as explained later.
[Figure: F0 histograms of the source speaker SF1 (conf/figure/SF1_f0histogram.png) and the target speaker TF1 (conf/figure/TF1_f0histogram.png). Each histogram shows a small low-frequency peak that is supposed to be a half-F0 error; a proper F0 search range might be 140 – 400 Hz for SF1 and 140 – 340 Hz for TF1.]
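To see where such a histogram comes from, here is a minimal sketch that recomputes one with pyworld and matplotlib (treat these packages as assumptions; sprocket generates the PNGs itself via initialize.py -3):

# f0_hist.py -- illustrative only; not needed for the hands-on itself
import glob
import numpy as np
import pyworld
import soundfile as sf
import matplotlib.pyplot as plt

f0s = []
for path in sorted(glob.glob("data/wav/SF1/1*.wav")):  # training utterances
    x, fs = sf.read(path)
    # harvest returns one F0 value per frame; 0 marks unvoiced frames
    f0, _ = pyworld.harvest(x.astype(np.float64), fs)
    f0s.append(f0[f0 > 0])  # keep voiced frames only

plt.hist(np.concatenate(f0s), bins=100)
plt.xlabel("F0 [Hz]")
plt.ylabel("# of voiced frames")
plt.savefig("SF1_f0histogram_check.png")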
35. Feature Extraction for DTW
• It is very important to align source frames to target frames so that they
share the same linguistic content.
• There are several tips to robustly perform time alignment!
• Joint static and dynamic mcep features are used.
• Power differences between source & target voices are ignored.
• Silence frames are discarded (automatically, based on normalized power, npow).
This effectively deals with mismatches in short-pause positions.
[Diagram: mcep seq → remove the 0th coefficient → append dynamic features → mcep feature seq; in parallel, the npow seq is used to remove silence frames.]
*NOTE: power is NOT converted in sprocket. Because the conversion model is developed without silence frames, conversion accuracy at those frames degrades significantly. However, this causes no serious issues, as the power of those frames is too small to be perceived.
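The two preprocessing steps are simple enough to sketch in a few lines (the array names and the npow threshold below are hypothetical placeholders; sprocket's own implementation differs in detail):

import numpy as np

def delta(x):
    # simple dynamic features: d[t] = 0.5 * (x[t+1] - x[t-1])
    d = np.zeros_like(x)
    d[1:-1] = 0.5 * (x[2:] - x[:-2])
    return d

def dtw_features(mcep, npow, npow_threshold=-20.0):
    # mcep: (T, D) mel-cepstra; npow: (T,) normalized power in dB
    static = mcep[:, 1:]                        # drop the 0th (power) coefficient
    joint = np.hstack([static, delta(static)])  # append dynamic features
    return joint[npow > npow_threshold]         # discard silence frames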
39. GMM Training & Conversion Process
• Joint GMM training [Kain; ’98]
• Joint probability density function (p.d.f.) of the source & target mcep features
is modeled by a joint GMM.
• Trajectory‐based conversion [Toda; ’07]
• The source mcep seq is converted into the target one by maximum likelihood
parameter generation using a conditional p.d.f. derived from the joint GMM.
[Diagram, training: joint mcep feature seqs → joint GMM, estimated by maximum likelihood using the EM algorithm. Conversion: source mcep seq → remove the 0th coefficient → append dynamic features → conditional p.d.f. derived from the joint GMM → converted mcep seq w/o the 0th coefficient.]
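To make the joint-density idea concrete, here is an illustrative sketch with scikit-learn (an assumption; sprocket has its own trainer, and the feature names below are placeholders):

import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(x_feats, y_feats, n_mix=32):
    # x_feats, y_feats: time-aligned (T, 2D) static+delta mcep features after DTW
    z = np.hstack([x_feats, y_feats])  # joint vectors z_t = [x_t; y_t]
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full",
                          max_iter=100, random_state=0)
    return gmm.fit(z)  # maximum likelihood estimation via the EM algorithm

At conversion time, the conditional p.d.f. p(y_t | x_t) is derived from this joint model, and maximum likelihood parameter generation then produces a smooth converted trajectory that respects the dynamic features.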
48. Converted Speech Generation by DIFFVC
• Source code: src/convert.py
• The MLSA filter [Tokuda; '94] is used to directly filter the source waveform.
[Diagram: the source wav file is analyzed into an f0 seq and an mcep seq. The mcep seq goes through trajectory-based conversion with the DIFFGMM for mcep (GMM PKL file) to give a differential mcep (diffmcep) seq, which is GV-postfiltered w/o power conversion (using the GV statistics HDF5 file (gv) and the converted GV statistics HDF5 file (cvgv)) and then power-adjusted. The MLSA filter driven by the power-adjusted diffmcep seq directly filters the source waveform into the converted waveform, written as the converted wav file (DIFFVC).]
*NOTE: DIFFVC can generate much higher-quality converted speech than VC, but F0 is NOT converted. Thus, DIFFVC is very effective for a speaker-pair with a similar F0 range, e.g., in same-gender conversion.
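The filtering core could be sketched with pysptk as follows (pysptk itself is an assumption, though its synthesis API is commonly used for differential filtering; the alpha, hop size, and the diff_mcep.npy input file are placeholders):

# Minimal sketch of differential MLSA filtering; sprocket's src/convert.py
# additionally handles power adjustment and GV postfiltering.
import numpy as np
import soundfile as sf
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

ALPHA = 0.455  # all-pass constant for 22.05 kHz (assumed value)
HOP = 110      # hop size in samples (about 5 ms at 22.05 kHz; assumed)

x, fs = sf.read("data/wav/SF1/30001.wav")
# diff_mcep: (T, order+1) differential mel-cepstrum (converted minus source);
# assumed to be saved by the conversion step, placeholder file name.
diff_mcep = np.load("diff_mcep.npy")

b = pysptk.mc2b(diff_mcep, alpha=ALPHA)  # mel-cepstrum -> MLSA filter coefficients
synthesizer = Synthesizer(MLSADF(order=diff_mcep.shape[1] - 1, alpha=ALPHA), HOP)
y = synthesizer.synthesis(x.astype(np.float64), b)  # filter the source waveform
sf.write("converted_diffvc.wav", y, fs)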
51. Overall Procedure
[Block diagram: step 0 (parallel data preparation & parameter configurations) derives F0 & power histograms and the parameter configurations from the source and target voices. Step 1 generates F0 transformed source voices, which are paired with the target voices and used as a new parallel dataset. The standard parallel VC procedure (0. parallel data preparation & parameter configurations, 1. speech analysis, 2. statistics calculation, 3. joint feature development, 4. model training, 5. conversion) is then applied to this new dataset.]
52. Procedure: Dataset Generation Step
[Same block diagram as in slide 51, here focusing on step 0 (parallel data preparation & parameter configurations) and step 1 (F0 transformed source voice generation).]
53. Initialization Steps
• Select source and target speakers for the cross‐gender conversion!
• Source speakers: females = SF1 & SF2, males = SM1 & SM2
• Target speakers: females = TF1 & TF2, males = TM1 & TM2
• Generate list files for your selected speaker‐pair,
e.g., if setting source = SF2, target = TM2, & sampling = 22.05 kHz,
$ python3 initialize.py -1 SF2 TM2 22050
• Modify list files to select training and evaluation utterance pairs
list/{SF2,TM2}_train.list # keep only utterances 10001 – 10060
list/{SF2,TM2}_eval.list # keep only utterances 30001 – 30020
• Generate configure files for your selected speaker‐pair, e.g.,
$ python3 initialize.py -2 SF2 TM2 22050
• Perform speech analysis for your selected speaker‐pair, e.g.,
$ python3 initialize.py -3 SF2 TM2 22050
• Modify the speaker‐dependent YML files based on the histograms (a scripted example follows below).
conf/speaker/{SF2,TM2}.yml # revise the minf0, maxf0, and threshold values
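If you want to script the YML revision, a sketch with PyYAML might look like this. The key layout inside the generated YML is an assumption, so check the file before running; the numbers are placeholders to be read off your own histograms.

# set_f0_range.py -- hypothetical helper; editing the YML by hand works too.
import yaml

def set_speaker_conf(path, minf0, maxf0, threshold):
    with open(path) as f:
        conf = yaml.safe_load(f)
    conf["f0"]["minf0"] = minf0            # assumed key layout -- verify it
    conf["f0"]["maxf0"] = maxf0
    conf["power"]["threshold"] = threshold
    with open(path, "w") as f:
        yaml.safe_dump(conf, f)

set_speaker_conf("conf/speaker/SF2.yml", 110, 350, -31)  # placeholder values
set_speaker_conf("conf/speaker/TM2.yml", 50, 280, -31)   # placeholder values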
55. Procedure: Dataset Generation Step
[Same block diagram as in slide 52, shown again before the F0 transformed waveform generation step.]
56. F0 Transformed Waveform Generation
• Perform F0 transformation based on waveform modification by executing
$ python3 run_f0_transformation.py SourceSpeaker TargetSpeaker
e.g.,
$ python3 run_f0_transformation.py SF2 TM2
• The following message will be printed out.
### 1. F0 transformation of original waveform ###
Extract F0: data/wav/SF2/10001.wav
:
• Finally, the F0 transformed wav files will be generated under the data/wav/
directory for a new source speaker named SourceSpeaker_F0TransformationRatio.
data/wav/SF2_0.73/*.wav # F0 transformation ratio = 0.73
• Let's check the F0 transformed source speaker's voice by listening to some
wav files. Note that not only F0 but also the voice quality will have changed.
57. F0 Transformation Process
• Source code: src/f0_transformation.py
• A constant F0 transformation ratio, calculated from the source and target mean
F0 values over the training data, is applied to all source speech wav files.
[Diagram: from the training data, the source f0 seqs and target f0 seqs determine the F0 transformation ratio; this ratio is then applied to both the training and evaluation source wav files to produce the F0 transformed source wav files.]
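Conceptually, the ratio is just the target-to-source ratio of mean voiced F0, e.g. (a sketch; whether sprocket averages F0 linearly or in the log domain is an implementation detail of src/f0_transformation.py):

import numpy as np

def f0_transformation_ratio(src_f0s, tgt_f0s):
    # src_f0s, tgt_f0s: lists of per-utterance F0 arrays (0 = unvoiced frame)
    src = np.concatenate(src_f0s)
    tgt = np.concatenate(tgt_f0s)
    return tgt[tgt > 0].mean() / src[src > 0].mean()  # e.g., 0.73 for SF2 -> TM2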
58. F0 Transformed Waveform Generation
• Duration conversion w/ WSOLA [Verhelst; '93] followed by waveform resampling
is used to generate the F0 transformed waveform,
e.g., if setting the F0 transformation ratio to 2 (i.e., 100 Hz to 200 Hz):
1. Double the duration of the input waveform w/ WSOLA while keeping the F0 values
2. Resample the modified waveform to halve its duration, which doubles F0
[Diagram: input waveform → WSOLA (1.1 extract frames by windowing, 1.2 find the best concatenation point, 1.3 overlap and add) → duration modified waveform → deletion or downsampling → F0 modified waveform.]
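For intuition, the same two-step trick can be reproduced with librosa (an assumption, and note that librosa's time_stretch is a phase vocoder, not the WSOLA that sprocket uses):

import librosa
import soundfile as sf

RATIO = 2.0  # F0 transformation ratio, e.g., 100 Hz -> 200 Hz

x, fs = sf.read("data/wav/SF2/10001.wav")
# 1. Stretch the duration by RATIO while keeping F0
#    (phase vocoder here, WSOLA in sprocket).
stretched = librosa.effects.time_stretch(x, rate=1.0 / RATIO)
# 2. Resample so that playback at fs shortens the duration back to the
#    original, scaling F0 by RATIO.
y = librosa.resample(stretched, orig_sr=fs, target_sr=int(fs / RATIO))
sf.write("f0_x2.wav", y, fs)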
59. Procedure: Parallel VC Steps
[Same block diagram as in slide 51, here focusing on the parallel VC steps applied to the new parallel dataset: 0. initialization (parameter configurations), 1. training (speech analysis), 2. training (statistics calculation), 3. training (joint feature development), 4. training (model training), 5. conversion.]
60. Initialization for F0 Transformed Speaker
• Perform initialization steps by setting the F0 transformed source speaker to
a new source speaker
• Generate list files for a new speaker‐pair, e.g.,
$ python3 initialize.py -1 SF2_0.73 TM2 22050
• Modify list files to select training and evaluation utterance pairs
list/SF2_0.73_train.list # keep only utterances 10001 – 10060
list/SF2_0.73_eval.list # keep only utterances 30001 – 30020
• Generate configure files for the new speaker‐pair, e.g.,
$ python3 initialize.py -2 SF2_0.73 TM2 22050
• Perform speech analysis for the new speaker‐pair, e.g.,
$ python3 initialize.py -3 SF2_0.73 TM2 22050
• Modify the speaker‐dependent and pair‐dependent YML files based on the histograms.
conf/speaker/SF2_0.73.yml # revise the minf0, maxf0, and threshold values
conf/pair/SF2_0.73-TM2.yml # set n_mix for mcep to 16
63. Reproduce Baseline Results of VCC2018!
• You can develop a baseline system for the VCC2018 Hub task [Kobayashi; '18a] by
using sprocket!
[Scatter plot of the VCC2018 results [Lorenzo‐Trueba; '18a]: similarity score [%] (0 – 100) vs. MOS on naturalness (1 – 5) for all submitted systems, with the sprocket baseline marked.]
• Hub task: parallel training task
• Source: 2 female & 2 male speakers
• Target: 2 female & 2 male speakers
• 81 utterances for training
• 35 utterances for evaluation
• Baseline system development
• DIFFVC w/o F0 transformation
for the same‐gender pairs
MOS ≅ 4.0, similarity ≅ 70%
• VC for the cross‐gender pairs
MOS ≅ 3.0, similarity ≅ 70%
• In total, MOS ≅ 3.5, similarity ≅ 70%
65. Develop Baseline System
• Just execute initialize.py & run_sprocket.py for each speaker‐pair.
• Use 81 training utterances and 35 evaluation utterances
• Use default settings of the pair‐dependent YML file (32 mixture components)
• You may use the following configurations in the speaker‐dependent YML files:
• Use DIFFVC.wav for same‐gender pairs, and VC.wav for cross‐gender pairs
Speaker     Minimum F0 (minf0)  Maximum F0 (maxf0)  Power threshold
Source speakers:
VCC2SF1     100                 450                 -31
VCC2SF2     110                 350                 -31
VCC2SM1     50                  200                 -31
VCC2SM2     70                  300                 -40
Target speakers:
VCC2TF1     140                 350                 -45
VCC2TF2     100                 400                 -30
VCC2TM1     60                  200                 -23
VCC2TM2     50                  280                 -31
[Kobayashi; ’18a]
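A simple driver for all source/target pairs might look like this (a sketch: the loop is ours, run_sprocket.py's argument list is assumed to mirror initialize.py's speaker arguments, and the manual YML edits above still have to happen between the initialization and training runs):

# run_all_pairs.py -- hypothetical driver script, not part of sprocket
import subprocess

SOURCES = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]
TARGETS = ["VCC2TF1", "VCC2TF2", "VCC2TM1", "VCC2TM2"]
FS = "22050"

for src in SOURCES:
    for tgt in TARGETS:
        for step in ("-1", "-2", "-3"):  # list files, configs, speech analysis
            subprocess.run(["python3", "initialize.py", step, src, tgt, FS],
                           check=True)
        # NOTE: revise the list files and speaker YML files (table above)
        # before this call; its arguments are assumed, check the hands-on
        # materials for run_sprocket.py's actual usage.
        subprocess.run(["python3", "run_sprocket.py", src, tgt], check=True)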
67. Tips 1: Target Voice Recording
• If you want to develop a good VC system, use a good parallel dataset!
How can we develop such a parallel dataset?
• Target voices should be high‐quality!
• Desirable to record them in a high‐quality sound environment
• Quality of target waveforms directly affects that of converted voices.
• The use of noisy target waveforms will generate noisy converted voices.
• Target voices should have desired voice characteristics!
• Not only the speaker identity but also the speaking style strongly affects the
voice characteristics.
• If the target speaker's speaking style is special (e.g., a specific character's
voice), it is often more useful to record utterances suited to that style than
to record controlled ones (e.g., phonetically balanced sentences).
68. Tips 2: Source Voice Recording
• How about source voices?
• Minimize acoustic mismatches between training and conversion!
• Acoustic mismatches easily cause quality degradation of converted voices.
• It would be better to record source voices in the same environment as used in
the VC system (e.g., if using the VC system in your room, it would be better to
record source voices there).
• Ask a source speaker to imitate the target speaker’s speaking style!
• Only some of the speech parameters are converted by some VC techniques.
• It is better to ask the source speaker to imitate the target speaker's voice by
controlling prosody, such as duration and the F0 pattern. It is fine to use
a special way of speaking (e.g., falsetto) to do so. If F0 transformation then
becomes unnecessary, DIFFVC can be used!
• During recording, it helps if the source speaker listens to the target
voice sample just before uttering the corresponding sentence.
69. Tips 3: Parameter Adjustment
• Which parameter can be adjusted?
• Training step
• The number of mixture components for mcep conversion needs to be adjusted
to the amount of training data: a larger number of mixture components can
improve the converted speech quality, but also makes over-fitting more likely.
• Try values such as 8, 16, 32, 64, or 128.
• Conversion step
• The GV postfilter significantly improves the converted speech quality, but it
also tends to cause audible artifacts. These artifacts can be alleviated by
setting the GV postfilter parameter in the pair‐dependent YML file to a
smaller value (between 0 and 1) as follows:
GV:
morph_coeff: 0.7 # from 0 (no effect) to 1 (full effect)
• There is a tradeoff between the converted voice quality and the artifacts.