APSIPA2018: A revisit to feature handling for high-quality voice conversion

A Revisit to Feature Handling 
for High-quality Voice Conversion 
Based on Gaussian Mixture Model
Hitoshi Suda, Gaku Kotani, 
Shinnosuke Takamichi, and Daisuke Saito
(The University of Tokyo)
APSIPA ASC 2018 / Nov. 14, 2018 No. 1370 / WE-A2-8.3

Voice Conversion
Modifying personalities in the utterance
[Figure: source speaker → target speaker]

Voice Conversion Framework
• Training
• Conversion
[Figure: block diagram. Training blocks: analysis of source and target, DTW and Joint, Conversion Model. Conversion blocks: Source, Analysis, Conversion, diff, Filtering, Output.]
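As a rough illustration of what the "Conversion Model" block might compute, the sketch below applies the usual conditional-mean mapping of a joint-density GMM to a single scalar feature. The slides do not give the formula or the model parameters, so the function names, the dictionary keys, and the numbers are assumptions made only for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_convert(x, components):
    """Map a scalar source feature x to the target side.

    Each component is a dict (hypothetical layout) with:
    weight, mu_x, mu_y, var_xx (source variance), cov_yx (cross covariance).
    The result is the posterior-weighted sum of per-component
    conditional means E[y | x, m].
    """
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mu_x"], c["var_xx"]) for c in components
    ]
    total = sum(likelihoods)
    converted = 0.0
    for likelihood, c in zip(likelihoods, components):
        posterior = likelihood / total  # P(m | x)
        cond_mean = c["mu_y"] + c["cov_yx"] / c["var_xx"] * (x - c["mu_x"])
        converted += posterior * cond_mean
    return converted


# Illustrative parameters only; a real model is trained on DTW-aligned joint vectors.
toy_model = [
    {"weight": 0.5, "mu_x": -1.0, "mu_y": -2.0, "var_xx": 1.0, "cov_yx": 0.0},
    {"weight": 0.5, "mu_x": 1.0, "mu_y": 2.0, "var_xx": 1.0, "cov_yx": 0.0},
]
print(gmm_convert(0.3, toy_model))
```

In a real system x and y are vectors (for example mel-cepstra), so the variances become covariance matrices and the division becomes a matrix inverse.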

Aim of This Study
• To improve conversion quality of existing voice conversion frameworks
• To experimentally reveal influences of feature handling via subjective experiments

Experimental Setups
• 50 sentences from ATR Japanese phonetically balanced sentence sets [Kurematsu+, 1990]
• 40 for training, 10 for evaluation
• Sampling frequency: 22050 Hz
• Analysis and synthesis: WORLD [Morise+, 2016]
• Speakers: 2 males and 2 females
• Only intra-gender conversion / No F0 conversion
• 23 listeners answered questions in each preference test via our crowdsourcing system
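The result slides report a p-value for each preference test. The slides do not say which statistical test was used, so the sketch below assumes a two-sided exact binomial test against the 50% chance level, which is one common choice for pairwise preference data. The function name and the vote counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(wins, trials):
    """Two-sided exact binomial p-value for H0: preference rate = 0.5.

    Under H0 the distribution is symmetric, so the two-sided p-value is
    twice the upper tail starting at the larger of the two vote counts,
    capped at 1.
    """
    larger = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(larger, trials + 1)) / 2 ** trials
    return min(1.0, 2.0 * tail)


# Hypothetical example: 150 of 230 answers prefer system A.
print(binomial_two_sided_p(150, 230))
```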

3 Experiments
1. Analysis conditions
2. Conversion system without statistical mapping
3. Total conversion system
[Figure: block diagrams of the three systems. No conversion: Source, Analysis, Synthesis, Output. Without statistical mapping: analysis of source and target, DTW, diff. Total system: source analysis, Conversion, diff.]

Experiment 1: Analysis Conditions
• To reveal the effects of conditions of analysis
• Frame periods (or frame shifts): how precisely the waveforms are analyzed in the time domain
• Order of mel-cepstral coefficients (mcep): how precisely the spectral envelopes are represented
• System under test: no conversion
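To give a feel for the frame periods compared in this experiment, the sketch below converts each one into a hop size in samples and a frame rate at the 22050 Hz sampling frequency from the setup slide. The helper names are made up for this example.

```python
SAMPLING_RATE_HZ = 22050  # from the experimental setup


def hop_in_samples(frame_period_ms, fs=SAMPLING_RATE_HZ):
    """Number of waveform samples between two consecutive analysis frames."""
    return fs * frame_period_ms / 1000.0


def frames_per_second(frame_period_ms):
    """Number of analysis frames produced per second of speech."""
    return 1000.0 / frame_period_ms


# Frame periods compared in the experiment: 5 ms, 1 ms, 500 us, 50 us.
for period_ms in (5.0, 1.0, 0.5, 0.05):
    print(
        f"{period_ms:5.2f} ms: "
        f"{hop_in_samples(period_ms):8.3f} samples/frame, "
        f"{frames_per_second(period_ms):8.0f} frames/s"
    )
```

At 50 μs the hop is only slightly more than one sample, so the waveform is analyzed at almost every sample.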

• Frame periods:
5 ms < 1 ms (p < 0.01)
1 ms ≈ 500 μs (p < 0.5)
500 μs ≤ 50 μs (p < 0.1)
[Figure: preference scores (0% to 100%) for each pair of frame periods]

• Order of mcep:
with 1 ms analysis
24 < 39 (p < 10^-10)
39 ≈ 59 (p < 0.5)
59 < 79 (p < 0.01)
79 ≈ 99 (p = 1)
with 50 μs analysis
24 < 39 (p < 10^-7)
39 ≈ 59 (p < 1.0)
59 ≈ 79 (p < 1.0)
79 ≈ 99 (p < 1.0)
[Figure: preference scores (0% to 100%) for each pair of mel-cepstral orders, one panel per frame period]

• Frame periods: 5 ms < 1 ms ≈ 500 μs ≤ 50 μs
• Order of mcep:
24 < 39 ≈ 59 < 79 ≈ 99 (with 1 ms frames)
24 < 39 ≈ 59 ≈ 79 ≈ 99 (with 50 μs frames)

Differential-spectrum Compensation (diffspec) [Kobayashi+, 2014]
• Well-known implementation: Mel log spectrum approximation (MLSA) filtering [Imai+, 1983]
• We introduce a diffspec method “SP-WORLD” inspired by the WORLD vocoder
[Figure: source waveforms and difference of spectra → filtering (MLSA or SP-WORLD) → output waveforms]
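The "difference of spectra" that drives the filter can be sketched as a per-frame subtraction of mel-cepstra. Because cepstral coefficients live in the log-spectral domain, subtracting them corresponds to a ratio of spectral envelopes. The sketch assumes the two sequences are already time-aligned and have equal length. The function name is hypothetical, and the filtering step itself (MLSA or SP-WORLD) is not implemented here.

```python
def differential_mcep(source_frames, target_frames):
    """Per-frame difference (target minus source) of mel-cepstral vectors.

    Both arguments are lists of equal-length coefficient lists and are
    assumed to be time-aligned frame by frame.
    """
    if len(source_frames) != len(target_frames):
        raise ValueError("sequences must be aligned to the same length")
    return [
        [t - s for s, t in zip(src, tgt)]
        for src, tgt in zip(source_frames, target_frames)
    ]


# Toy two-frame, two-coefficient example (values are illustrative only).
source = [[1.0, 2.0], [0.5, 0.0]]
target = [[1.5, 1.0], [0.5, 1.0]]
print(differential_mcep(source, target))
```

The resulting differential features would then be passed to the filter together with the source waveform.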

Dynamic Time Warping (DTW)
• Alignment of features
• Sensitive to difference of individuality
• We introduce “Affine-DTW”: iteration of alignment and coarse conversion
[Figure: DTW alignment paths, source speaker time [s] vs. target speaker time [s], for non-affine, 1-affine, 2-affine, and 3-affine]
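The idea of alternating alignment and coarse conversion can be sketched as below. The slides do not specify the form of the coarse conversion, so this example assumes an independent least-squares affine map per feature dimension. The function names and the toy data are hypothetical, and the authors' Affine-DTW may differ in its details.

```python
def frame_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_path(source, target):
    """Plain DTW; returns (list of (source_idx, target_idx), total cost)."""
    n, m = len(source), len(target)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path, cost[n][m]


def fit_affine_1d(xs, ys):
    """Least-squares (a, b) such that ys is approximately a * xs + b."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        return 1.0, mean_y - mean_x  # fall back to a pure shift
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def affine_dtw(source, target, iterations=3):
    """Alternate DTW alignment and a coarse per-dimension affine conversion."""
    warped = [list(frame) for frame in source]
    path, _ = dtw_path(warped, target)
    for _ in range(iterations):
        for d in range(len(source[0])):
            xs = [source[i][d] for i, j in path]
            ys = [target[j][d] for i, j in path]
            a, b = fit_affine_1d(xs, ys)
            for t in range(len(source)):
                warped[t][d] = a * source[t][d] + b
        path, _ = dtw_path(warped, target)
    return path


# Toy data: the target is a scaled and shifted source with its first frame repeated.
src = [[0.0], [1.0], [2.0]]
tgt = [[10.0], [10.0], [12.0], [14.0]]
print(dtw_path(src, tgt)[0])
print(affine_dtw(src, tgt))
```

The toy data mimic a speaker-dependent offset and scale, which is the kind of mismatch that makes plain DTW sensitive to differences of individuality.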

Experiment 2: Ideal Conversion
• To reveal how conditions other than the mapping model affect conversion quality
• Diffspec method: MLSA or SP-WORLD
• Analysis conditions: frame period and order of mcep
[Figure: analysis of source and target → Affine-DTW → diff]

• Diffspec method (by order of mcep / frame period):
24 / 5 ms: MLSA ≤ SP-WORLD (p < 0.05)
24 / 1 ms: MLSA < SP-WORLD (p < 0.01)
79 / 5 ms: MLSA < SP-WORLD (p < 10^-3)
79 / 1 ms: MLSA < SP-WORLD (p < 10^-4)
[Figure: preference scores (0% to 100%), SP-WORLD vs. MLSA]

• Order of mcep (by frame period / diffspec method):
5 ms / MLSA: 24 ≈ 79 (p < 0.5)
1 ms / MLSA: 24 ≈ 79 (p < 1.0)
5 ms / SP-WORLD: 24 > 79 (p < 0.05)
1 ms / SP-WORLD: 24 > 79 (p < 10^-3)
[Figure: preference scores (0% to 100%), 79th order vs. 24th order]

• Frame periods (by order of mcep / diffspec method):
24 / MLSA: 5 ms ≈ 1 ms (p < 0.5)
79 / MLSA: 5 ms ≈ 1 ms (p < 0.5)
24 / SP-WORLD: 5 ms ≈ 1 ms (p < 0.5)
79 / SP-WORLD: 5 ms ≥ 1 ms (p < 0.05)
[Figure: preference scores (0% to 100%), 1 ms vs. 5 ms]

• Diffspec method: MLSA < SP-WORLD
• Order of mcep:
24 > 79 (with SP-WORLD)
24 ≈ 79 (with MLSA)
• Frame periods:
5 ms ≥ 1 ms (with SP-WORLD / 79-order)
5 ms ≈ 1 ms (with other conditions)

Experiment 3: Statistical Conversion
• To reveal the influences of the components below
• Diffspec method: MLSA or SP-WORLD
• Sequence features [Toda+, 2007]: dynamic features and global variances
• 1 ms period / 24-order of mcep
[Figure: source analysis, Conversion, and diff blocks]
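The two sequence features compared here can be sketched as follows: dynamic (delta) features appended to the static ones, and the global variance (GV) of each dimension over an utterance. The delta window 0.5 * (x[t+1] - x[t-1]) and the edge handling by frame repetition are common choices assumed for this example. The slides do not state the exact settings, and the function names are hypothetical.

```python
def append_delta(frames):
    """Append first-order dynamic features to each static frame.

    delta[t] = 0.5 * (x[t+1] - x[t-1]); the first and last frames are
    repeated at the edges (an assumption of this sketch).
    """
    last = len(frames) - 1
    joined = []
    for t, frame in enumerate(frames):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        delta = [0.5 * (nxt - prv) for prv, nxt in zip(prev_frame, next_frame)]
        joined.append(list(frame) + delta)
    return joined


def global_variance(frames):
    """Variance of each feature dimension over the whole utterance."""
    count = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dims)]
    return [
        sum((f[d] - means[d]) ** 2 for f in frames) / count for d in range(dims)
    ]


# Toy one-dimensional trajectory (values are illustrative only).
static = [[0.0], [1.0], [2.0], [3.0]]
print(append_delta(static))
print(global_variance(static))
```

In the result labels, S stands for static features only, S+D adds the dynamic features, and S+D+GV additionally considers the global variance.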

• Diffspec method: MLSA ≈ SP-WORLD
• Sequence features: static < static+dynamic < static+dynamic+GV
[Figure: preference scores (0% to 100%) for speaker similarity and naturalness, comparing S+D+GV vs. S+D and S+D vs. S with MLSA and with SP-WORLD (S: static, D: dynamic, GV: global variance). Reported p-values: p < 10^-18, p < 10^-11, p < 10^-10, p < 10^-4 in one panel; p < 10^-71, p < 10^-42, p < 0.01, p < 0.01 in the other.]
[Figure: preference scores (0% to 100%) for speaker similarity and naturalness, MLSA vs. SP-WORLD]

Conclusion
• In GMM-based statistical voice conversion:
• Dynamic features and GV: definitely effective
• SP-WORLD: comparable to MLSA, and superior in ideal conversion
• Higher order of features: not always effective
• High time-resolution analysis is effective in analysis-synthesis, and potentially also in conversion
Future Work
• F0 conversion
• Break the 1 ms barrier in WORLD analysis
• Other mapping models such as neural networks
