Hands on Voice Conversion

Nagoya University, Japan
tomoki@icts.nagoya‐u.ac.jp
July 26th, 2018
Tomoki TODA
[Figure: similarity score [%] (0–100) plotted against MOS on naturalness (1–5)]
Result of Voice Conversion Challenge 2018
(VCC2018) [Lorenzo‐Trueba; ’18a]
Let’s develop this
baseline system!
Baseline system
 Naturalness score = 3.5
 Speaker similarity score = 70%

Let’s Start VC Research & Development!
• Purpose: Understand overall procedure of statistical VC and a current
baseline level of statistical VC techniques so that you will be able to start
VC research or develop VC systems.
• Goal: Learn to use open‐source VC software to develop a basic VC system
for speaker conversion using a parallel speech dataset.
• Contents
• Let’s use open‐source VC software, sprocket!
• Let’s develop a traditional GMM‐based VC system!
• Let’s develop a vocoder‐free GMM‐based VC system!
• Let’s learn tips on VC system development!
Outline

Let’s use sprocket!
[Kobayashi; ’18a]
K. Kobayashi, T. Toda,
“sprocket: open‐source voice conversion software,”
Proc. Odyssey 2018, pp. 203–210, June 2018.
https://www.isca-speech.org/archive/Odyssey_2018/pdfs/47.pdf
sprocket

Open‐Source VC Software: sprocket
• Developed by Dr. Kazuhiro Kobayashi of Nagoya University, JAPAN
• Motivation: provide an environment for both expert and
non‐expert users to easily use statistical VC framework
• Simply developed using existing libraries
• Implemented know‐how accumulated through our VC research (> 15 years)
• Freely available for both research and industrial purposes (MIT license)
• Used as a baseline system for Voice Conversion Challenge 2018 (VCC2018)
• Features:
• Traditional VC method based on GMM
• Vocoder‐free VC method based on DIFFGMM
• Supplies a Python3 VC library
• What can we do using sprocket?
• Can easily reproduce converted voices using VCC2016 & VCC2018 datasets
[Toda; ’16][Lorenzo‐Trueba; ’18b]
• Can develop VC system using other parallel speech datasets
[Lorenzo‐Trueba; ’18a]
sprocket: 1

Download
• Freely available from GitHub
• Directory structure of sprocket
sprocket‐master/
docs/ # documentation for running an example
example/ # framework (for running an example)
sprocket/ # sprocket libraries
README.md # README file
LICENSE.txt # license file
requirements.txt
setup.py # setup script
Other files
https://github.com/k2kobayashi/sprocket
sprocket: 2

Install Procedure
Note: You need to use Python3 instead of Python2!
• First, install the required libraries by executing the following commands:
$ pip3 install numpy
$ pip3 install -r requirements.txt
• Then, install sprocket by executing the following command:
$ python3 setup.py install
NOTE: This installation procedure has already been done on your computer.
You may execute these commands to confirm it.
sprocket: 3

Let’s Run an Example Script!
• Instructions for running an example script are described in
docs/vc_example.md.
• An example script can be run under the working directory, example/.
$ cd example/
example/
conf/ # directory for configure files
data/ # data directory
initialize.py # command script
list/ # directory for list files
run_f0_transformation.py # command script
run_sprocket.py # command script
src/ # source codes
Others
sprocket: 4

Let’s develop traditional
GMM‐based VC system!
[Toda; ’07]
T. Toda, A.W. Black, K. Tokuda,
“Voice conversion based on maximum likelihood estimation of spectral parameter trajectory,”
IEEE Transactions on Audio, Speech, and Language Processing,
Vol. 15, No. 8, pp. 2222–2235, 2007.
GMM-based VC

Overall Procedure of Statistical VC
[Figure: flow diagram of the overall procedure]
Steps:
0. Parallel data preparation & parameter configurations
1. Speech analysis
2. Statistics calculation
3. Joint feature development
4. Model training
5. Conversion
Data: source voices, target voices, F0 & power histograms, parameter configurations, source and target speech parameter sequences, speaker-dependent statistics, source and target feature sequences, time warping function, joint features, conversion models, converted feature sequence (converted mel-cepstrum), converted voices
GMM-based VC: 1

Procedure: Preparation Step
[Figure: overall procedure diagram, repeated for this step]
GMM-based VC: 2

Preparation of Speech Dataset
• Speech waveform files of a parallel dataset between source and target
speakers need to be prepared.
• Only the following wav file format is supported.
• Sampling rate: 16, 22.05, 44.1, or 48 kHz
• Quantization bit: 16 (signed‐integer)
• Number of channels: 1
• Each utterance (e.g., around 5 seconds) needs to be stored in one wav file.
• Wav files need to be put in data/wav/ directory, e.g.,
data/wav/speakerA/*.wav
data/wav/speakerB/*.wav
• Download script for VCC datasets (download_speech_corpus.py) is also
available for automatically setting wav files, e.g.,
$ python3 download_speech_corpus.py downloader_conf/vcc2016.yml
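The format requirements listed above can be checked before running any analysis. The following is a minimal sketch using only Python's standard wave module; the function name check_wav_format and the message texts are illustrative assumptions and are not part of sprocket.

```python
import wave

# Sampling rates listed on this slide as supported, in Hz.
SUPPORTED_RATES = {16000, 22050, 44100, 48000}


def check_wav_format(path):
    """Return a list of format problems found in a wav file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() not in SUPPORTED_RATES:
            problems.append("unsupported sampling rate: %d Hz" % wf.getframerate())
        if wf.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append("unsupported bit depth: %d bit" % (8 * wf.getsampwidth()))
        if wf.getnchannels() != 1:
            problems.append("unsupported number of channels: %d" % wf.getnchannels())
    return problems
```

For example, calling check_wav_format("data/wav/speakerA/sample.wav") on a hypothetical file returns an empty list when the file is 16-bit mono at one of the listed sampling rates.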
GMM-based VC: 3

Example of Speech Dataset
• In this session, let’s use a part of VCC2018 database [Lorenzo‐Trueba; ’18b].
• Sampling frequency: 22.05 kHz
• Only speakers for parallel training
• Only 60 out of 81 training utterances and 20 out of 35 evaluation utterances
• Note: The following wav files have already been set in your computer.
data/wav/{SF1, SF2, SM1, SM2}/ # 4 source speakers
data/wav/{TF1, TF2, TM1, TM2}/ # 4 target speakers
10001.wav – 10060.wav # 60 utterances for training
30001.wav – 30020.wav # 20 utterances for evaluation
• Let’s check each speaker’s voice by listening to some wav files, e.g.,
data/wav/SF1/10001.wav & data/wav/TF1/10001.wav
data/wav/SM1/30001.wav & data/wav/TM1/30001.wav
: :
GMM-based VC: 4
You can also download the full VCC2018 database with the download script (download_speech_corpus.py) and use it. Note that the speaker names differ slightly from those shown in these slides.

Procedure: Initialization Step
[Figure: overall procedure diagram, repeated for this step]
GMM-based VC: 5

Initialization 1: List File Generation
• Select a source & target speaker pair for the same‐gender conversion
• Source speakers: females = SF1 & SF2,  males = SM1 & SM2
• Target speakers: females = TF1 & TF2,  males = TM1 & TM2
• Generate list files for your selected speaker-pair by executing
$ python3 initialize.py -1 SourceSpeaker TargetSpeaker SamplingRate
e.g., if setting source = SF1, target = TF1, sampling rate = 22.05 kHz,
$ python3 initialize.py -1 SF1 TF1 22050
• 4 list files will be generated under list/ directory.
list/SF1_train.list # training data list for the source speaker, SF1
list/TF1_train.list # training data list for the target speaker, TF1
list/SF1_eval.list # evaluation data list for SF1
list/TF1_eval.list # evaluation data list for TF1
• Modify each list file to define training & evaluation utterance pairs.
list/{SF1,TF1}_train.list # keep only utterances 10001 – 10060
list/{SF1,TF1}_eval.list # keep only utterances 30001 – 30020
GMM-based VC: 6

Example of Modified List Files
• Contents of list/SF1_train.list
SF1/10001
SF1/10002
SF1/10003
:
SF1/10060
• Contents of list/SF1_eval.list
SF1/30001
SF1/30002
SF1/30003
:
SF1/30020
• Contents of list/TF1_train.list
TF1/10001
TF1/10002
TF1/10003
:
TF1/10060
• Contents of list/TF1_eval.list
TF1/30001
TF1/30002
TF1/30003
:
TF1/30020
• The source and target list files should be parallel: their length and order
should be consistent between the source & target speakers. The listed data
are regarded as a parallel dataset.
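Editing the list files by hand is easy to get wrong, so the two requirements above (keep only a range of utterances, and keep source and target parallel) can be sketched in a few lines of Python. The helper names keep_range and is_parallel are hypothetical, and the sketch assumes every entry has the form Speaker/Number with a numeric utterance name, as in this dataset.

```python
def keep_range(entries, first, last):
    """Keep entries such as 'SF1/10001' whose utterance number lies in [first, last]."""
    kept = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        number = int(entry.split("/")[-1])
        if first <= number <= last:
            kept.append(entry)
    return kept


def is_parallel(source_entries, target_entries):
    """True if both lists contain the same utterance names in the same order."""
    source_ids = [e.strip().split("/")[-1] for e in source_entries]
    target_ids = [e.strip().split("/")[-1] for e in target_entries]
    return source_ids == target_ids
```

For example, keep_range(lines, 10001, 10060) applied to the lines of list/SF1_train.list keeps only the 60 training utterances used in this session.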
GMM-based VC: 7

Initialization 2: Configure File Generation
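• Generate configure files for your selected speaker-pair by executing
$ python3 initialize.py -2 SourceSpeaker TargetSpeaker SamplingRate
e.g.,
$ python3 initialize.py -2 SF1 TF1 22050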
• 3 configure files will be generated under conf/ directory.
Speaker‐dependent settings are shown in the following YML files:
conf/speaker/SF1.yml # configure for SF1
conf/speaker/TF1.yml # configure for TF1
Speaker‐pair‐dependent setting is shown in the following YML file:
conf/pair/SF1‐TF1.yml # configure for the speaker‐pair, SF1‐TF1
GMM-based VC: 8

Example of Speaker‐Dependent YML File
• Contents of the speaker‐dependent YML file: conf/speaker/SF1.yml
wav:
  fs: 22050 # sampling frequency [Hz]
  bit: 16 # quantization bit [bit]
  fftl: 1024 # FFT length [points]
  shiftms: 5 # shift length [msec]
f0:
  minf0: 40 # minimum F0 [Hz]
  maxf0: 700 # maximum F0 [Hz]
mcep:
  dim: 34 # order of mel-cepstrum
  alpha: 0.455 # all-pass filter parameter for mel-frequency warping
power:
  threshold: -15 # power threshold to remove silence frames
analyzer: world # speech analysis method
GMM-based VC: 9

Example of Pair‐Dependent YML File
• Contents of the speaker‐pair‐dependent YML file: conf/pair/SF1‐TF1.yml
jnt:
  n_iter: 3 # number of iterative time alignments
GMM:
  mcep: # GMM settings for mel-cepstrum conversion
    n_mix: 32 # number of mixture components
    n_iter: 100 # number of iterations of GMM training
    covtype: full # covariance type of GMM
    cvtype: mlpg # conversion method
  codeap: # GMM settings for aperiodicity conversion
    n_mix: 16 # number of mixture components
    : # (these lines are the same as the mcep part)
GV:
  morph_coeff: 1.0 # GV postfilter parameter
GMM-based VC: 10

Initialization 3: Manual Settings
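• Perform speech analysis for your selected speaker-pair by executing
$ python3 initialize.py -3 SourceSpeaker TargetSpeaker SamplingRate
e.g.,
$ python3 initialize.py -3 SF1 TF1 22050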
• The following message will be printed out.
### 3. create figures to define parameters ###
Extract: data/wav/SF1/10001.wav
Extract: data/wav/SF1/10002.wav
:
• Finally, 4 PNG files will be generated under conf/figure/ directory.
conf/figure/SF1_f0histogram.png # F0 histogram of SF1
conf/figure/SF1_npowhistogram.png # normalized power histogram of SF1
conf/figure/TF1_f0histogram.png # F0 histogram of TF1
conf/figure/TF1_npowhistogram.png # normalized power histogram of TF1
GMM-based VC: 11

Example of F0 Histogram
• It is very effective to adjust an F0 search range to each speaker for reducing
F0 extraction errors, such as half F0 and double F0 errors.
• NOTE: This is a very important process, as I will explain later.
[Figure: F0 histograms of the source and target speakers]
Source speaker: SF1 (conf/figure/SF1_f0histogram.png): proper F0 search range might be 140 – 400 Hz; part of the histogram is supposed to be half F0 errors.
Target speaker: TF1 (conf/figure/TF1_f0histogram.png): proper F0 search range might be 140 – 340 Hz; part of the histogram is supposed to be half F0 errors.
GMM-based VC: 12

Example of Normalized Power Histogram
• It is also effective to adjust a normalized power threshold to each speaker
for improving time alignment accuracy by removing silence frames.
• NOTE: This is also an important process, as I will explain later.
[Figure: normalized power histograms of the source and target speakers, each showing silence frames and speech frames]
Source speaker: SF1 (conf/figure/SF1_npowhistogram.png): proper threshold might be -30 dB
Target speaker: TF1 (conf/figure/TF1_npowhistogram.png): proper threshold might be -40 dB
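To make the role of the threshold concrete, here is a minimal sketch of thresholding on normalized power. It assumes that normalized power is the frame power in dB relative to the mean power of the utterance; the slides do not state the exact definition used in sprocket, so treat that definition and both function names as assumptions.

```python
import math


def normalized_power_db(frame_powers):
    """Power of each frame in dB relative to the mean power of the utterance.

    Assumes all frame powers are strictly positive.
    """
    mean_power = sum(frame_powers) / len(frame_powers)
    return [10.0 * math.log10(p / mean_power) for p in frame_powers]


def speech_frame_indices(npow, threshold_db):
    """Indices of frames whose normalized power exceeds the threshold."""
    return [i for i, value in enumerate(npow) if value > threshold_db]
```

With a threshold of -30 dB, a frame whose power is a millionth of its neighbours' is treated as silence and dropped, while the louder frames are kept.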
GMM-based VC: 13

Let’s Modify Speaker‐Dependent YML Files
• Contents of conf/speaker/SF1.yml
wav:
  fs: 22050
  bit: 16
  fftl: 1024
  shiftms: 5
f0:
  minf0: 140
  maxf0: 400
mcep:
  dim: 34
  alpha: 0.455
power:
  threshold: -30
analyzer: world
• Contents of conf/speaker/TF1.yml
wav:
  fs: 22050
  bit: 16
  fftl: 1024
  shiftms: 5
f0:
  minf0: 140
  maxf0: 340
mcep:
  dim: 34
  alpha: 0.455
power:
  threshold: -40
analyzer: world
Revise the 3 values (minf0, maxf0, and threshold) in each file based on your observation!
GMM-based VC: 14

Also Modify Pair‐Dependent YML File
• Contents of conf/pair/SF1-TF1.yml
jnt:
  n_iter: 3
GMM:
  mcep:
    n_mix: 16
    n_iter: 100
    covtype: full
    cvtype: mlpg
  codeap:
    n_mix: 16
    :
GV:
  morph_coeff: 1.0
Let’s set the number of mixture components for mel-cepstrum conversion (n_mix under mcep) to 16 to reduce training time!
GMM-based VC: 15

Procedure: Training Step
[Figure: overall procedure diagram, repeated for this step]
GMM-based VC: 16

Training 1: Speech Analysis
• Let’s perform speech analysis by executing
$ python3 run_sprocket.py -1 SourceSpeaker TargetSpeaker
e.g.,
$ python3 run_sprocket.py -1 SF1 TF1
• The following message will be printed out,
### 1. Extract acoustic features ###
Extract acoustic features: data/wav/SF1/10001.wav
Extract acoustic features: data/wav/SF1/10002.wav
:
• Speech parameter files and analysis‐synthesized wav files will be generated
utterance by utterance under data/pair/ directory.
data/pair/SF1‐TF1/h5/{SF1,TF1}/100*.h5 # parameter HDF5 files
data/pair/SF1‐TF1/anasyn/{SF1,TF1}/100*.wav # analysis‐synthesized wav files
GMM-based VC: 17

Speech Analysis Processing
• Source code: src/extract_features.py
• WORLD [Morise; ’16] is used as a speech analysis-synthesis method.
• Spectral envelope is parameterized into mel‐cepstrum [Tokuda; ’94].
[Figure: data flow of speech analysis]
wav file (speech waveform) → F0 sequence (f0 seq), spectral envelope sequence (spc seq), aperiodicity sequence (ap seq)
→ mel-cepstrum sequence (mcep seq), normalized power sequence (npow seq), coded aperiodicity sequence (codeap seq)
→ parameter HDF5 file (f0, mcep, npow, codeap)
*NOTE: F0 is used to accurately estimate
spectral envelope by removing the effects
of periodicity of excitation. Therefore, F0
estimation errors cause adverse effects in
other parameter estimation.
GMM-based VC: 18

Check If Speech Analysis Works Well
• Let’s check quality of analysis‐synthesized speech (e.g., 1st utterance) of
the source and the target speakers by listening to them.
data/wav/SF1/10001.wav # original source wav files
data/pair/SF1‐TF1/anasyn/SF1/10001.wav # its analysis‐synthesis wav file
Similar to each other?
data/wav/TF1/10001.wav # original target wav files
data/pair/SF1‐TF1/anasyn/TF1/10001.wav # its analysis‐synthesis wav file
Similar to each other?
• If they are similar to original natural speech, speech analysis works well.
If F0 of analysis‐synthesized speech sounds quite different from that of
original natural speech, F0 search range needs to be revised.
GMM-based VC: 19


Training 2: Statistics Calculation
• Calculate speaker-dependent statistics by executing
$ python3 run_sprocket.py -2 SourceSpeaker TargetSpeaker
e.g.,
$ python3 run_sprocket.py -2 SF1 TF1
• Speaker‐dependent statistics files will be generated under data/pair/
directory.
data/pair/SF1-TF1/stats/SF1.h5 # statistics HDF5 file for SF1
data/pair/SF1-TF1/stats/TF1.h5 # statistics HDF5 file for TF1
GMM-based VC: 21

Statistics Calculation Processing
• Source code: src/estimate_feature_statistics.py
• Speaker‐dependent statistics to be used for F0 conversion & GV postfilter
[Toda; ’12] are extracted.
[Figure: data flow of statistics calculation]
Parameter HDF5 files → f0 seqs → log F0 sequence → mean & variance values
Parameter HDF5 files → mcep seqs → mean & variance vectors
→ statistics HDF5 file (f0stats, gv)
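The F0 part of these statistics can be sketched as follows. This is an illustrative example rather than sprocket's own code: it assumes unvoiced frames are stored as F0 = 0 and excluded, and that the population variance is used. The function name is hypothetical.

```python
import math


def log_f0_statistics(f0_sequences):
    """Mean and variance of log F0 over all voiced frames (F0 > 0)."""
    log_f0 = [math.log(f0) for seq in f0_sequences for f0 in seq if f0 > 0]
    mean = sum(log_f0) / len(log_f0)
    variance = sum((value - mean) ** 2 for value in log_f0) / len(log_f0)
    return mean, variance
```

The statistics are pooled over all training utterances of one speaker, which is why the function takes a list of sequences rather than a single sequence.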
GMM-based VC: 22


Training 3: Joint Feature Development
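• Develop joint features by executing
$ python3 run_sprocket.py -3 SourceSpeaker TargetSpeaker
e.g.,
$ python3 run_sprocket.py -3 SF1 TF1
• The following message will be printed out.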
### 3. Estimate time warping function and jnt ###
## Alignment mcep w/o 0‐th and silence ##
1‐th joint feature extraction starts.
distortion [dB] for 1‐th file: …..
:
• Finally, a joint feature file and time warping function files will be generated
under data/pair/ directory.
data/pair/SF1‐TF1/jnt/it3_jnt.h5 # joint feature HDF5 file
data/pair/SF1‐TF1/twf/it3_100*.h5 # time warping function HDF5 files
GMM-based VC: 24

Joint Feature Development Processing
• Source code: src/estimate_twf_and_jnt.py
[Figure: data flow of joint feature development]
Inputs: source and target parameter HDF5 files (mcep seqs, codeap seqs)
Intermediate data: mcep feature seqs, codeap feature seqs, time warping functions, joint mcep feature seqs, joint codeap feature seqs, GMM for mcep conversion, converted mcep seqs, converted mcep feature seqs
Processing blocks: iterative processing, utterance-by-utterance processing
Outputs: joint feature HDF5 file (mcep, codeap), time warping function HDF5 files (twf)
GMM-based VC: 25

GMM-based VC: 26
Dynamic Time Warping (DTW)
[Figure: joint feature development diagram, repeated]

Feature Extraction for DTW
• It is very important to align source frames to target frames so that they
share the same linguistic contents.
• There are several tips to robustly perform time alignment!
• Joint static and dynamic mcep features are used.
• Power differences between source & target voices are ignored.
• Silence frames are discarded (automatically by normalized power, npow).
This process is effective to deal with mismatches of short pause positions.
[Figure: feature extraction for DTW]
mcep seq → remove the 0th coefficients → append dynamic features → mcep feature seq
npow seq → remove silence frames
*NOTE: power is NOT converted in
sprocket.  As a conversion model is
developed without using silence
frames, conversion accuracy at
those frames significantly degrades.
However, it  will not cause significant
issues as power of those frames is
too small to be perceived.
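The feature extraction described on this slide can be sketched in plain Python. This is a simplified illustration, not sprocket's implementation: the delta window 0.5 * (x[t+1] - x[t-1]), the repetition of edge frames, and computing deltas before silence removal are all assumptions, and the function names are hypothetical.

```python
def append_delta(static_seq):
    """Append a dynamic (delta) feature, 0.5 * (x[t+1] - x[t-1]), to every frame."""
    n = len(static_seq)
    features = []
    for t in range(n):
        previous = static_seq[max(t - 1, 0)]    # edge frames are repeated
        following = static_seq[min(t + 1, n - 1)]
        delta = [0.5 * (b - a) for a, b in zip(previous, following)]
        features.append(list(static_seq[t]) + delta)
    return features


def mcep_features_for_dtw(mcep_seq, npow_seq, threshold_db):
    """Drop the 0th coefficient, append deltas, then discard silence frames."""
    without_power = [frame[1:] for frame in mcep_seq]  # 0th coefficient carries power
    features = append_delta(without_power)
    return [f for f, p in zip(features, npow_seq) if p > threshold_db]
```

Dropping the 0th coefficient is what makes the alignment ignore power differences between the two speakers.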
GMM-based VC: 27

DTW Process
• Time warping function is determined by minimizing a distance measure
between aligned feature sequences.
• A joint feature sequence is generated by concatenating the source mcep feature
and the target mcep feature at each aligned frame.
[Figure: source mcep feature seq and target mcep feature seq aligned and concatenated into the joint feature seq (source part, target part)]
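A textbook DTW with a squared Euclidean distance illustrates how a time warping function and the joint features are obtained. This is a minimal sketch, not sprocket's implementation: the distance measure and the three allowed steps are assumptions, and the function names are hypothetical.

```python
def dtw_path(source, target):
    """Return the minimum-cost alignment as a list of (source index, target index) pairs."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = cost of the best alignment of the first i and first j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = distance(source[i - 1], target[j - 1]) + best

    # Backtrack from the last frame pair to the first one.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return path


def joint_features(source, target, path):
    """Concatenate source and target feature vectors at each aligned frame pair."""
    return [list(source[i]) + list(target[j]) for i, j in path]
```

In the small example source = [[0.0], [1.0], [2.0]] and target = [[0.0], [1.0], [1.0], [2.0]], the repeated target frame is aligned to the same source frame, so the joint sequence has four frames.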
GMM-based VC: 28

GMM-based VC: 29
Iterative Time‐Alignment Refinement
[Figure: joint feature development diagram, repeated]

Iterative DTW Process
• Time alignment determined by using source & target feature seqs suffers
from acoustic differences between source & target voices.
• To improve accuracy of time alignment, iterative DTW process is usually
used for refining the time warping functions [Abe; ’90].
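[Figure: iterative DTW process]
Loop: mcep feature seqs → time warping functions → joint mcep feature seqs → GMM for mcep conversion → converted mcep seqs → converted mcep feature seqs → time warping functions
• The converted mcep feature seqs have the same time structure as the source and are acoustically more similar to the target; they are used for determining the time warping function.
• The source mcep feature seqs are used for developing joint features.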
GMM-based VC: 30

GMM Training & Conversion Process
• Joint GMM training [Kain; ’98]
• Joint probability density function (p.d.f.) of the source & target mcep features
is modeled by a joint GMM.
• Trajectory‐based conversion [Toda; ’07]
• The source mcep seq is converted into the target one by maximum likelihood
parameter generation using a conditional p.d.f. derived from the joint GMM
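[Figure: joint GMM training]
Joint mcep feature seqs → maximum likelihood estimation using EM algorithm → joint GMM
[Figure: trajectory-based conversion]
Source mcep seq → remove the 0th coefficients → append dynamic features → conversion with the conditional p.d.f. derived from the joint GMM → converted mcep seq w/o the 0th coefficients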
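For reference, the conditional p.d.f. mentioned above can be written out explicitly. This is the standard result for a joint GMM; the notation is assumed here and does not come from the slides: X_t and Y_t are the source and target mcep feature vectors at frame t, and alpha_m, mu_m, Sigma_m are the weight, mean vector, and covariance matrix of the m-th of M mixture components.

```latex
\begin{align}
P(X_t, Y_t) &= \sum_{m=1}^{M} \alpha_m \,
  \mathcal{N}\!\left(
    \begin{bmatrix} X_t \\ Y_t \end{bmatrix};
    \begin{bmatrix} \mu_m^{(X)} \\ \mu_m^{(Y)} \end{bmatrix},
    \begin{bmatrix} \Sigma_m^{(XX)} & \Sigma_m^{(XY)} \\
                    \Sigma_m^{(YX)} & \Sigma_m^{(YY)} \end{bmatrix}
  \right) \\
P(Y_t \mid X_t) &= \sum_{m=1}^{M} P(m \mid X_t)\,
  \mathcal{N}\!\left(Y_t;\, E_{m,t},\, D_m\right) \\
P(m \mid X_t) &= \frac{\alpha_m\, \mathcal{N}\!\left(X_t;\, \mu_m^{(X)}, \Sigma_m^{(XX)}\right)}
  {\sum_{n=1}^{M} \alpha_n\, \mathcal{N}\!\left(X_t;\, \mu_n^{(X)}, \Sigma_n^{(XX)}\right)} \\
E_{m,t} &= \mu_m^{(Y)} + \Sigma_m^{(YX)} \left(\Sigma_m^{(XX)}\right)^{-1}
  \left(X_t - \mu_m^{(X)}\right) \\
D_m &= \Sigma_m^{(YY)} - \Sigma_m^{(YX)} \left(\Sigma_m^{(XX)}\right)^{-1} \Sigma_m^{(XY)}
\end{align}
```

Trajectory-based conversion then generates the static mcep sequence that maximizes this conditional likelihood under the explicit relationship between static and dynamic features [Toda; ’07].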
GMM-based VC: 31

GMM-based VC: 32
DTW for codeap
[Figure: joint feature development diagram, repeated]


Training 4: Model Training
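• Develop conversion models by executing
$ python3 run_sprocket.py -4 SourceSpeaker TargetSpeaker
e.g.,
$ python3 run_sprocket.py -4 SF1 TF1
• The following message will be printed out.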
### 4. Train GMM and converted GV ###
:
• Finally, conversion model files will be generated under data/pair/ directory.
data/pair/SF1‐TF1/model/GMM_mcep.pkl # GMM PKL file for mcep
data/pair/SF1‐TF1/model/GMM_codeap.pkl # GMM PKL file for codeap
data/pair/SF1‐TF1/model/cvgv.h5 # GV postfilter HDF5 file
GMM-based VC: 34

GMMs Training & GV Postfilter Calculation
• Source code: src/train_GMM.py
• Joint GMMs training
• Joint GMMs for mcep features and for codeap features are separately trained.
• GV calculation [Toda; ’07][Toda; ’12]
• Statistics of converted mcep seqs are calculated for GV postfilter
[Figure: data flow of GMM training & GV calculation]
Joint feature HDF5 file (mcep, codeap) → joint mcep feature seqs → joint GMM for mcep → GMM PKL file (mcep)
Joint feature HDF5 file (mcep, codeap) → joint codeap feature seqs → joint GMM for codeap → GMM PKL file (codeap)
Parameter HDF5 files → mcep seqs → joint GMM for mcep → converted mcep seqs → mean & variance vectors → GV statistics HDF5 file (cvgv)
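To show why statistics of converted mcep seqs are needed, here is a minimal sketch of a variance-scaling GV postfilter. It is an illustration under stated assumptions rather than sprocket's code: target_gv and converted_gv stand for per-dimension variances taken from the target statistics and the cvgv statistics, and treating morph_coeff as a linear interpolation weight between the unfiltered and filtered sequences is an assumption.

```python
import math


def gv_postfilter(cvmcep_seq, target_gv, converted_gv, morph_coeff=1.0):
    """Scale each dimension around its utterance mean to restore its variance."""
    n = len(cvmcep_seq)
    dims = len(cvmcep_seq[0])
    means = [sum(frame[d] for frame in cvmcep_seq) / n for d in range(dims)]
    # Converted trajectories are over-smoothed, so this ratio is typically above 1.
    scales = [math.sqrt(target_gv[d] / converted_gv[d]) for d in range(dims)]
    output = []
    for frame in cvmcep_seq:
        filtered = [scales[d] * (frame[d] - means[d]) + means[d] for d in range(dims)]
        output.append([morph_coeff * f + (1.0 - morph_coeff) * x
                       for f, x in zip(filtered, frame)])
    return output
```

With morph_coeff = 1.0 the filter is fully applied, and with 0.0 the converted sequence is returned unchanged.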
GMM-based VC: 35

Procedure: Conversion Step
[Figure: overall procedure diagram, repeated for this step]
GMM-based VC: 36

Conversion: Converted Speech Generation
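• Perform voice conversion by executing
$ python3 run_sprocket.py -5 SourceSpeaker TargetSpeaker
e.g.,
$ python3 run_sprocket.py -5 SF1 TF1
• The following message will be printed out.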
### 5. Conversion based on the trained models ###
GMM for mcep conversion mode: None
data/pair/SF1‐TF1/test/SF1/30001_VC.wav
:
• Converted wav files will be generated utterance by utterance under
data/pair/ directory.
data/pair/SF1‐TF1/test/SF1/300*_VC.wav # converted wav files by VC
data/pair/SF1‐TF1/test/SF1/300*_DIFFVC.wav # converted wav files by DIFFVC
GMM-based VC: 37

Converted Speech Generation by VC
• Source code: src/convert.py
• WORLD [Morise; ’16] is used as a speech analysis-synthesis method.
[Figure: data flow of converted speech generation by VC]
Source wav file → speech waveform → f0 seq, mcep seq, ap seq
f0 seq → linear transformation of log-scaled F0 seq, using the statistics HDF5 file (f0stats, gv) → converted F0 seq
mcep seq → trajectory-based conversion w/o power conversion, using the GMM PKL file (mcep) → converted mcep (cvmcep) seq
cvmcep seq → GV postfiltering, using the GV statistics HDF5 file (cvgv) → GV postfiltered cvmcep seq, power adjusted cvmcep seq
ap seq: used w/o ap conversion
→ converted waveform → converted wav file (VC)
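The linear transformation of log-scaled F0 can be sketched as below. This assumes the usual mean-variance normalization in the log domain and that unvoiced frames are stored as F0 = 0; the function name and argument names are hypothetical, with the means and standard deviations taken from the log F0 statistics of the two speakers.

```python
import math


def convert_f0(f0_seq, src_mean, src_std, tgt_mean, tgt_std):
    """Linearly transform log F0 of voiced frames; unvoiced frames stay at 0."""
    converted = []
    for f0 in f0_seq:
        if f0 > 0:
            log_f0 = (math.log(f0) - src_mean) / src_std * tgt_std + tgt_mean
            converted.append(math.exp(log_f0))
        else:
            converted.append(0.0)  # keep the unvoiced flag
    return converted
```

A source frame at the source speaker's mean log F0 is mapped exactly to the target speaker's mean log F0, and deviations from the mean are rescaled by the ratio of standard deviations.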
GMM-based VC: 38

Already Developed
Vocoder‐Free VC based on
DIFFGMM as well!
[Kobayashi; ’18b]
K. Kobayashi, T. Toda, S. Nakamura,
“Intra‐gender statistical singing voice conversion with direct waveform modification
using log‐spectral differential,”
Speech Communication, Vol. 99, pp. 211–220, 2018.
https://doi.org/10.1016/j.specom.2018.03.011
DIFFGMM-based VC

Converted Speech Generation by DIFFVC
• Source code: src/convert.py
• MLSA filter [Tokuda; ’94] is used to directly convert the source waveform.
[Figure: data flow of converted speech generation by DIFFVC]
Source wav file → speech waveform → f0 seq, mcep seq
mcep seq → trajectory-based conversion w/o power conversion, using the DIFFGMM for mcep and the GMM PKL file (mcep) → differential mcep (diffmcep) seq
diffmcep seq → GV postfiltering, using the statistics HDF5 file (gv) and the GV statistics HDF5 file (cvgv) → GV postfiltered diffmcep seq, power adjusted diffmcep seq
→ converted waveform → converted wav file (DIFFVC)
*NOTE: DIFFVC can generate much higher-quality converted speech than VC,
but F0 is NOT converted. Thus, DIFFVC is very effective for a speaker-pair
with a similar F0 range, e.g., in same-gender conversion.
DIFFGMM-based VC: 1

Let’s Listen to Converted Speech Samples!
• Converted wav files by VC
data/pair/SF1-TF1/test/SF1/30001_VC.wav
data/pair/SF1-TF1/test/SF1/30002_VC.wav
:
• Converted wav files by DIFFVC
data/pair/SF1‐TF1/test/SF1/30001_DIFFVC.wav
data/pair/SF1‐TF1/test/SF1/30002_DIFFVC.wav
:
• Original source wav files
data/wav/SF1/30001.wav
data/wav/SF1/30002.wav
:
• Original target wav files
data/wav/TF1/30001.wav
data/wav/TF1/30002.wav
:
DIFFGMM-based VC: 2

Let’s Develop Vocoder‐Free
VC based on DIFFGMM
with F0 Modification!
[Kobayashi; ’16]
K. Kobayashi, T. Toda, S. Nakamura,
“F0 transformation techniques for statistical voice conversion with direct waveform
modification with spectral differential,”
Proc. IEEE SLT, pp. 693–700, Dec. 2016.
DIFFGMM-based VC w/ F0 transformation

DIFFGMM-based VC w/ F0 transformation: 1
Overall Procedure
[Figure: overall procedure of DIFFGMM-based VC w/ F0 transformation]
0. Parallel data preparation & parameter configurations (source voices, target voices, F0 & power histograms, parameter configurations)
1. F0 transformed source voice generation (the F0 transformed source voices and the target voices are used as a new parallel dataset)
The same procedure is then applied to the new parallel dataset:
1. Speech analysis
3. Joint feature development
4. Model training
5. Conversion

Procedure: Dataset Generation Step
[Figure: overall procedure diagram w/ F0 transformation, repeated for this step]

Initialization Steps
• Select source and target speakers for the cross‐gender conversion!
• Source speakers: females = SF1 & SF2,  males = SM1 & SM2
• Target speakers: females = TF1 & TF2,  males = TM1 & TM2
• Generate list files for your selected speaker-pair,
e.g., if setting source = SF2, target = TM2, & sampling = 22.05 kHz,
$ python3 initialize.py -1 SF2 TM2 22050
• Modify list files to select training and evaluation utterance pairs
list/{SF2,TM2}_train.list # keep only utterances 10001 – 10060
list/{SF2,TM2}_eval.list # keep only utterances 30001 – 30020
• Generate configure files for your selected speaker-pair, e.g.,
$ python3 initialize.py -2 SF2 TM2 22050
• Perform speech analysis for your selected speaker-pair, e.g.,
$ python3 initialize.py -3 SF2 TM2 22050
• Modify speaker-dependent YML files based on histograms.
conf/speaker/{SF2,TM2}.yml # revise minf0, maxf0, threshold values

Example of Histograms
[Figure: F0 and normalized power histograms of the source and target speakers]
Source speaker: SF2 (conf/figure/SF2_f0histogram.png, conf/figure/SF2_npowhistogram.png): min: 120 Hz, max: 340 Hz, threshold: -30 dB
Target speaker: TM2 (conf/figure/TM2_f0histogram.png, conf/figure/TM2_npowhistogram.png): min: 60 Hz, max: 270 Hz, threshold: -30 dB


F0 Transformed Waveform Generation
• Perform F0 transformation based on waveform modification by executing
$  python3  run_f0_transformation.py  SourceSpeaker TargetSpeaker
e.g.,
$  python3  run_f0_transformation.py  SF2  TM2
### 1. F0 transformation of original waveform ###
Extract F0: data/wav/SF2/10001.wav
:
• Finally, F0 transformed wav files will be generated under data/wav/
directory as a new source speaker, SourceSpeaker_F0TransformationRatio.
data/wav/SF2_0.73/*.wav # F0 transformation ratio = 0.73
• Let’s check the F0 transformed source speaker’s voices by listening to some
wav files. Note that you should hear changes not only in F0 but also in voice quality.

F0 Transformation Process
• Source code: src/f0_transformation.py
• Constant F0 transformation ratio calculated from source & target F0 mean
values by using training data is applied to all source speech wav files.
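[Figure: data flow of F0 transformation]
Training data: source wav files → source f0 seqs; target wav files → target f0 seqs; both → F0 transformation ratio
Training data: source wav files + F0 transformation ratio → F0 transformed source wav files
Evaluation data: source wav files + F0 transformation ratio → F0 transformed source wav files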
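The ratio calculation can be sketched as follows. The slides only say that the ratio is calculated from the source and target F0 mean values; taking the mean over voiced frames (F0 > 0) in the linear domain, and formatting the ratio with two decimals in the new speaker name (as in SF2_0.73), are assumptions. Both function names are hypothetical.

```python
def f0_transformation_ratio(source_f0_seqs, target_f0_seqs):
    """Ratio of the target mean F0 to the source mean F0 over voiced frames."""
    def voiced_mean(sequences):
        voiced = [f0 for seq in sequences for f0 in seq if f0 > 0]
        return sum(voiced) / len(voiced)

    return voiced_mean(target_f0_seqs) / voiced_mean(source_f0_seqs)


def transformed_speaker_name(source_speaker, ratio):
    """Name of the new source speaker, SourceSpeaker_F0TransformationRatio."""
    return "%s_%.2f" % (source_speaker, ratio)
```

A female-to-male pair gives a ratio below 1, which matches the example directory name data/wav/SF2_0.73/.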

F0 Transformed Waveform Generation
• Duration conversion w/ WSOLA [Verhelst; ’93] and waveform resampling
is used to generate F0 transformed waveform.
e.g., if setting F0 transformation ratio to 2 (i.e., 100 Hz to 200 Hz),
1.  Make duration of input waveform double w/ WSOLA while keeping F0 values
2.  Resample the modified waveform to make its duration half
[Figure: waveform modification steps]
1. WSOLA: input waveform → duration modified waveform
1.1 Extract frames by windowing
1.2 Find the best concatenation point
1.3 Overlap and add
2. Resampling (deletion or down sampling): duration modified waveform → F0 modified waveform
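The arithmetic behind the two steps can be checked with a small sketch. It only computes the waveform lengths after each step and does not implement WSOLA or resampling; the function name is hypothetical.

```python
def plan_f0_transformation(n_samples, f0_ratio):
    """Waveform lengths after WSOLA and after resampling for a given F0 ratio."""
    # Step 1: WSOLA changes the duration by the F0 ratio while keeping F0 values.
    stretched = int(round(n_samples * f0_ratio))
    # Step 2: resampling restores the original duration, which multiplies F0 by the ratio.
    final = int(round(stretched / f0_ratio))
    return stretched, final
```

For a ratio of 2, a 1000-sample input becomes 2000 samples after WSOLA and 1000 samples after resampling, with F0 doubled; for a ratio below 1, the waveform is first shortened and then stretched back.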

Procedure: Parallel VC Steps
[Figure: overall procedure diagram w/ F0 transformation, repeated for these steps]
0. Initialization: parameter configurations
1. Training: speech analysis
2. Training: statistics calculation
3. Training: joint feature development
4. Training: model training
5. Conversion

Initialization for F0 Transformed Speaker
• Perform initialization steps by setting the F0 transformed source speaker to
a new source speaker
• Generate list files for a new speaker-pair, e.g.,
$ python3 initialize.py -1 SF2_0.73 TM2 22050
• Modify list files to select training and evaluation utterance pairs
list/SF2_0.73_train.list # keep only utterances 10001 – 10060
list/SF2_0.73_eval.list # keep only utterances 30001 – 30020
• Generate configure files for the new speaker-pair, e.g.,
$ python3 initialize.py -2 SF2_0.73 TM2 22050
• Perform speech analysis for the new speaker-pair, e.g.,
$ python3 initialize.py -3 SF2_0.73 TM2 22050
• Modify the speaker-dependent YML file based on histograms, and the
pair-dependent YML file.
conf/speaker/SF2_0.73.yml # revise minf0, maxf0, threshold values
conf/pair/SF2_0.73-TM2.yml # set n_mix for mcep to 16

Training & Conversion Steps
• Perform training and conversion steps for converting the F0 transformed
source speaker into the target speaker, e.g., by executing
$ python3 run_sprocket.py -1 -2 -3 -4 -5 SF2_0.73 TM2
• Finally, the converted voices will be generated under data/pair/ directory.
data/pair/SF2_0.73-TM2/test/SF2_0.73/300*_DIFFVC.wav # converted wav files
NOTE: converted wav files by VC (*_VC.wav) will also be generated, but they can
be ignored.
• NOTE: the conversion step alone can also be performed, e.g.,
• List data to be converted in list/{SF2,TM2}_eval.list
• Generate F0 transformed source wav files given the F0 transformation ratio:
$ python3 run_f0_transformation.py --ev --f0rate 0.73 SF2 TM2
• Generate converted wav files:
$ python3 run_sprocket.py -5 SF2_0.73 TM2
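When the same steps are repeated for several speaker-pairs, the command line can be assembled programmatically. The sketch below only builds the argument list from the step flags (1 to 5) shown on this slide; the helper name is hypothetical and nothing is executed.

```python
def run_sprocket_command(steps, source, target):
    """Build the argument list for run_sprocket.py with the given step flags."""
    for step in steps:
        if step not in (1, 2, 3, 4, 5):
            raise ValueError("unknown step: %r" % (step,))
    flags = ["-%d" % step for step in steps]
    return ["python3", "run_sprocket.py"] + flags + [source, target]
```

The returned list can be passed to subprocess.run(command, check=True) from the example/ working directory, e.g. run_sprocket_command([5], "SF2_0.73", "TM2") for the conversion step alone.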

How to Develop
VCC2018 Baseline System?
VCC2018 Baseline

Reproduce Baseline Results of VCC2018!
• You can develop a baseline system of VCC2018 Hub task [Kobayashi; ’18a] by
using sprocket!
[Figure: similarity score [%] (0–100) plotted against MOS on naturalness (1–5), with the sprocket baseline marked]
• Hub task: parallel training task
• Source: 2 female & 2 male speakers
• Target: 2 female & 2 male speakers
• 81 utterances for training
• 35 utterances for evaluation
• Baseline system development
• DIFFVC w/o F0 transformation
for the same‐gender pairs
MOS ≅ 4.0, similarity ≅ 70%
• VC for the cross‐gender pairs
MOS ≅ 3.0, similarity ≅ 70%
• In total, MOS ≅ 3.5, similarity ≅ 70%
Results of VCC2018 [Lorenzo‐Trueba; ’18a]
VCC2018 Baseline: 1

Download VCC2018 Dataset
• Automatically set wav files of VCC2018 datasets [Lorenzo-Trueba; ’18b] by
executing a download script (download_speech_corpus.py) as follows:
• The following files will be generated.
data/wav/VCC2{SF1, SF2, SM1, SM2}/ # 4 source speakers for parallel data
data/wav/VCC2{TF1, TF2, TM1, TM2}/ # 4 target speakers for parallel data
10001.wav – 10081.wav # 81 utterances for training
30001.wav – 30035.wav # 35 utterances for evaluation
These files will be used in the baseline system development.
On the other hand, the following files will NOT be used.
data/wav/VCC2{SF3, SF4, SM3, SM4}/ # 4 source speakers for SPOKE task
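
As a sanity check after the download, the following Python sketch verifies that the 8 Hub-task speaker directories each contain the 81 training and 35 evaluation files listed above. The function names are hypothetical, and the sketch assumes only the data/wav/ layout and file numbering shown on this slide.

```python
from pathlib import Path

# The 4 source and 4 target speakers of the Hub task, as listed above.
SPEAKERS = [f"VCC2{role}{gender}{idx}"
            for role in "ST" for gender in "FM" for idx in (1, 2)]


def expected_wavs():
    """File names expected in every speaker directory."""
    train = [f"{n}.wav" for n in range(10001, 10082)]  # 81 training utterances
    evals = [f"{n}.wav" for n in range(30001, 30036)]  # 35 evaluation utterances
    return train + evals


def missing_wavs(root="data/wav"):
    """Return the expected wav paths that do not exist under root."""
    return [str(Path(root) / spk / name)
            for spk in SPEAKERS
            for name in expected_wavs()
            if not (Path(root) / spk / name).is_file()]


if __name__ == "__main__":
    missing = missing_wavs()
    print(f"{len(missing)} expected files are missing")
```

An empty result from missing_wavs() means that all 8 x 116 files needed for the baseline development are in place.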

Develop Baseline System
• Just execute initialize.py & run_sprocket.py for each speaker‐pair.
• Use 81 training utterances and 35 evaluation utterances
• Use default settings of the pair‐dependent YML file (32 mixture components)
• May use the following configurations in speaker‐dependent YML files [Kobayashi; ’18a]:
Role    Speaker  Minimum F0 (minf0) [Hz]  Maximum F0 (maxf0) [Hz]  Power threshold [dB]
Source  VCC2SF1  100                      450                      -31
Source  VCC2SF2  110                      350                      -31
Source  VCC2SM1   50                      200                      -31
Source  VCC2SM2   70                      300                      -40
Target  VCC2TF1  140                      350                      -45
Target  VCC2TF2  100                      400                      -30
Target  VCC2TM1   60                      200                      -23
Target  VCC2TM2   50                      280                      -31
• Use DIFFVC.wav for same‐gender pairs, and VC.wav for cross‐gender pairs
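
The per-pair procedure can be organized with a short Python sketch that enumerates the 16 source-target pairs, builds the run_sprocket.py command for each, and records which output to use according to the same-gender / cross-gender rule above. The helper names are hypothetical. The initialize.py call is left out because its arguments are not shown on this slide; it has to be run for each pair beforehand.

```python
from itertools import product

SOURCES = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]
TARGETS = ["VCC2TF1", "VCC2TF2", "VCC2TM1", "VCC2TM2"]


def gender(speaker):
    """Return "F" or "M" from a label such as "VCC2SF1"."""
    return speaker[5]


def plan_pair(src, tgt):
    """Command and preferred output type for one speaker pair."""
    same_gender = gender(src) == gender(tgt)
    return {
        "pair": f"{src}-{tgt}",
        "command": f"python3 run_sprocket.py -1 -2 -3 -4 -5 {src} {tgt}",
        # DIFFVC for same-gender pairs, VC for cross-gender pairs
        "output": "DIFFVC.wav" if same_gender else "VC.wav",
    }


PLANS = [plan_pair(s, t) for s, t in product(SOURCES, TARGETS)]

if __name__ == "__main__":
    for plan in PLANS:
        print(plan["command"], "# use", plan["output"])
```

Printing the commands rather than executing them keeps the sketch side-effect free; they can be pasted into a shell once the YML files have been edited.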

Let’s Learn Tips on VC System Development!
Tips

Tips 1: Target Voice Recording
• If you want to develop a good VC system, use a good parallel dataset!
How can we develop such a parallel dataset?
• Target voices should be high‐quality!
• Desirable to record them in a high‐quality sound environment
• Quality of target waveforms directly affects that of converted voices.
• The use of noisy target waveforms will generate noisy converted voices.
• Target voices should have desired voice characteristics!
• Not only speaker identity but also speaking style strongly affects voice
characteristics.
• If the target speaker’s speaking style is distinctive (e.g., a specific character’s
voice), it is often useful to record utterances suited to that style rather
than controlled ones (e.g., phonetically balanced sentences).

Tips 2: Source Voice Recording
• How about source voices?
• Minimize acoustic mismatches between training and conversion!
• Acoustic mismatches easily cause quality degradation of converted voices.
• It is better to record source voices in the same environment where the VC
system will be used (e.g., if using the VC system in your room, record the
source voices there).
• Ask a source speaker to imitate the target speaker’s speaking style!
• Some VC techniques convert only a part of the speech parameters.
• It is better to ask the source speaker to imitate the target speaker’s voice by
controlling prosody, such as duration and F0 pattern. It is fine to use
an unusual way of speaking (e.g., falsetto) to do so. If F0 transformation is
not necessary, DIFFVC can be used!
• In recording, it will be helpful for the source speaker to listen to the target
voice sample just before uttering the corresponding utterance.

Tips 3: Parameter Adjustment
• Which parameter can be adjusted?
• Training step
• The number of mixture components for mcep conversion should be adjusted
to the amount of training data: a larger number of mixture components can
improve the converted speech quality, but it also over‐fits easily.
• You may try values such as 8, 16, 32, 64, and 128.
• Conversion step
• The GV postfilter significantly improves the converted speech quality, but it
also tends to cause artifact sounds. These sounds can be alleviated by setting
the GV postfilter parameter (from 0 to 1) in the pair‐dependent YML file to a
smaller value, e.g.:
GV:
  morph_coeff: 0.7 # from 0 (no effect) to 1 (full effect)
• There is a tradeoff between the converted voice quality and the artifacts.
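
To make the role of this coefficient concrete, here is a minimal Python sketch of a GV-style postfilter on a single feature trajectory. It is an illustration under stated assumptions, not sprocket's implementation: the function name gv_postfilter is hypothetical, the filter simply rescales the trajectory around its mean so that its variance matches a target global variance, and the morphing is assumed to be a linear interpolation between the unfiltered and the fully filtered trajectory.

```python
from statistics import fmean, pvariance


def gv_postfilter(track, target_gv, morph_coeff=1.0):
    """Rescale one feature trajectory toward a target global variance.

    morph_coeff = 0 returns the input unchanged; morph_coeff = 1 applies
    the full variance compensation (assumed linear morphing in between).
    """
    if not 0.0 <= morph_coeff <= 1.0:
        raise ValueError("morph_coeff must be between 0 and 1")
    mean = fmean(track)
    current_gv = pvariance(track, mu=mean)
    if current_gv == 0.0:  # flat trajectory: nothing to rescale
        return list(track)
    scale = (target_gv / current_gv) ** 0.5
    filtered = [mean + scale * (x - mean) for x in track]
    return [morph_coeff * f + (1.0 - morph_coeff) * x
            for f, x in zip(filtered, track)]


if __name__ == "__main__":
    over_smoothed = [0.0, 1.0, 2.0, 3.0]  # toy trajectory, variance 1.25
    for coeff in (0.0, 0.7, 1.0):
        out = gv_postfilter(over_smoothed, target_gv=5.0, morph_coeff=coeff)
        print(coeff, pvariance(out))
```

In this toy setting a smaller coefficient leaves the trajectory closer to the over-smoothed input, which corresponds to fewer artifacts but less quality improvement.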

That’s all!
Acknowledgement:
I am grateful to Dr. Kazuhiro Kobayashi of Nagoya
University, Japan, for the development of sprocket.
I hope now you can start
your own VC research and
VC system development!

References
[Abe; ’90] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara. Voice conversion through vector quantization.
J. Acoust. Soc. Jpn (E), Vol. 11, No. 2, pp. 71–76, 1990.
[Kain; ’98] A. Kain, M.W. Macon. Spectral voice conversion for text‐to‐speech synthesis. Proc. IEEE ICASSP,
pp. 285–288, 1998.
[Kobayashi; ’16] K. Kobayashi, T. Toda, S. Nakamura. F0 transformation techniques for statistical voice
conversion with direct waveform modification with spectral differential. Proc. IEEE SLT, pp. 693–700, 2016.
[Kobayashi; ’18a] K. Kobayashi, T. Toda. sprocket: open‐source voice conversion software. Proc. Odyssey,
pp. 203–210, 2018.
[Kobayashi; ’18b] K. Kobayashi, T. Toda, S. Nakamura. Intra‐gender statistical singing voice conversion with
direct waveform modification using log‐spectral differential. Speech Commun., Vol. 99, pp. 211–220, 2018.
[Lorenzo‐Trueba; ’18a] J. Lorenzo‐Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling.
The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. Proc.
Odyssey, pp. 195–202, 2018.
[Lorenzo‐Trueba; ’18b] J. Lorenzo‐Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling.
The Voice Conversion Challenge 2018: database and results. The Centre for Speech Technology Research,
The University of Edinburgh, UK, 2018. < http://dx.doi.org/10.7488/ds/2337 >
[Morise; ’16] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder‐based high‐quality speech synthesis
system for real‐time applications. IEICE Trans. Inf. & Syst., Vol. E99‐D, No. 7, pp. 1877–1884, 2016.
[Toda; ’07] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of
spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222–2235,
2007.
[Toda; ’12] T. Toda, T. Muramatsu, H. Banno. Implementation of computationally efficient real‐time voice
conversion. Proc. INTERSPEECH, 4 pages, 2012.
[Toda; ’16] T. Toda, L.‐H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. University of Edinburgh, School of Informatics, Centre for Speech Technology
Research, 2016. < http://dx.doi.org/10.7488/ds/1430 >
[Tokuda; ’94] K. Tokuda, T. Kobayashi, T. Masuko, S. Imai. Mel‐generalized cepstral analysis – a unified
approach to speech spectral estimation. Proc. ICSLP, pp. 1043–1045, 1994.
[Verhelst; ’93] W. Verhelst, M. Roelands. An overlap‐add technique based on waveform similarity (WSOLA)
for high quality time‐scale modification of speech. Proc. IEEE ICASSP, Vol. 2, pp. 554–557, 1993.
