How do Voices from Past Speech Synthesis Challenges Compare Today?

How do Voices from Past Speech Synthesis Challenges
Compare Today?
Erica Cooper, Junichi Yamagishi
August 28, 2021
National Institute of Informatics, Japan

Table of contents
1. Introduction
2. Listening Test
3. Results
4. Analysis of Natural Speech
5. Discussion and Future Work
1

Project
• A very large-scale listening test combining samples from many past Blizzard
Challenges and Voice Conversion Challenges
• MOS results from separate past listening tests cannot be meaningfully combined and
compared because the set of systems and therefore the context of the test are completely
different.
• Conducting a new test with these samples combined will enable more direct comparisons.
• The data gathered can be used to train MOS prediction models.
2

Motivation
• How reliable and reproducible are MOS scores?
• How do past listening test results compare to ratings gathered in the present day?
• Will the results correlate even when the listening test context has changed?
• What observations can we make about speech synthesis and voice conversion systems
over the years?
• What are the effects of the speaker of the dataset on synthesized speech quality?
• Collect a very large-scale database of a variety of synthesizer outputs and their
MOS ratings for the purpose of training MOSnet-like systems for automatic MOS
prediction
3

Large-scale listening test
• 187 different systems from past BC, VCC, and ESPnet TTS
• BC 2008, 2009, 2010, 2011, 2013, 2016 (English EH1 tasks only)
• VCC 2016, 2018, 2020 (same-language task only)
• ESPnet TTS: samples from ICASSP 2020 trained on LJSpeech
• 38 utterances per system
• Samples are balanced over genre where relevant
• Samples are balanced over source and target speakers for VCC
• Only genres included in the original naturalness tests are included
• Test design
• One sample from each system per set
• One listener rates one set containing 187 samples
• Coverage of 8 Japanese listeners per set; 304 total listeners
• MOS rating for naturalness on a scale of 1-5
• Significant differences: Mann-Whitney U test (p<0.05) with Bonferroni correction
4

Listening test results
VCC2018-N06
VCC2018-N16
VCC2020-T14
VCC2016-C
VCC2016-baseline
VCC2016-I
VCC2020-T26
VCC2018-N19
VCC2018-N18
BC2016-C
VCC2020-T21
VCC2018-N07
VCC2018-N09
BC2013-P
VCC2016-M
VCC2018-D01
VCC2016-H
VCC2016-E
VCC2018-N20
VCC2020-T17
VCC2020-T18
VCC2018-N14
VCC2018-D03
VCC2018-D05
VCC2018-N11
BC2016-J
BC2010-Q
VCC2018-N03
VCC2016-A
BC2016-K
BC2016-N
VCC2020-T09
VCC2016-D
VCC2018-D04
VCC2016-G
BC2016-P
VCC2018-N13
VCC2018-N05
BC2016-E
VCC2016-Q
VCC2018-D02
BC2016-I
VCC2020-T02
BC2016-G
VCC2016-B
BC2009-M
BC2011-I
BC2008-I
VCC2018-N15
BC2013-F
VCC2020-T19
BC2009-P
VCC2018-N12
VCC2020-T06
BC2013-B
BC2013-H
BC2009-R
BC2016-D
BC2016-O
VCC2020-T01
BC2008-R
VCC2020-T31
VCC2018-N17
BC2016-F
VCC2016-F
VCC2020-T16
VCC2016-N
VCC2020-T08
VCC2016-J
BC2009-E
BC2009-Q
VCC2016-P
VCC2018-N08
VCC2016-K
VCC2018-N04
BC2009-T
BC2008-F
BC2008-D
VCC2020-T12
VCC2020-T28
BC2016-H
BC2008-H
VCC2016-L
BC2008-N
BC2010-O
VCC2020-T03
BC2008-T
BC2009-W
BC2008-L
BC2008-P
VCC2016-O
BC2008-Q
VCC2020-T24
BC2009-J
ESPnet-merlin
VCC2020-T04
BC2016-B
BC2008-G
BC2009-D
BC2011-J
BC2008-E
VCC2018-B01
BC2009-O
BC2008-M
BC2009-B
BC2008-C
VCC2020-T20
VCC2020-T23
BC2008-V
VCC2020-T22
BC2010-C
BC2008-O
BC2010-L
BC2009-C
ESPnet-fastspeechv2
BC2008-B
BC2009-I
BC2016-M
BC2008-K
VCC2020-T32
BC2009-L
BC2009-H
BC2011-B
BC2016-Q
VCC2020-T07
BC2013-N
BC2008-S
ESPnet-fastspeechv3
BC2010-R
BC2010-H
BC2010-G
BC2010-N
BC2013-L
VCC2020-T33
BC2010-U
BC2013-C
BC2013-I
VCC2018-N10
BC2010-B
BC2011-M
BC2011-D
VCC2020-T25
BC2010-S
VCC2020-T27
BC2016-L
BC2008-J
BC2011-F
BC2013-K
VCC2020-T30
BC2010-P
BC2011-K
BC2011-L
BC2011-C
VCC2020-T13
BC2009-K
VCC2020-T29
VCC2020-T11
BC2010-F
ESPnet-mozilla
VCC2020-T35
BC2011-H
BC2016-A
BC2011-E
BC2009-S
VCC2018-T00
VCC2020-T10
BC2010-V
VCC2016-target
BC2010-T
VCC2016-source
BC2008-A
VCC2018-S00
VCC2020-T34
BC2013-M
BC2010-J
ESPnet-tacotron2v2
BC2011-G
ESPnet-nvidia
ESPnet-tacotron2v3
BC2009-A
ESPnet-transformerv1
BC2010-M
BC2013-A
ESPnet-transformerv3
BC2010-A
ESPnet-raw
BC2011-A
1
2
3
4
5
0
50
100
150
200
5

Best systems
• ESPnet-transformerv3
• BC2010-M
• ESPnet-tacotron2v3
• ESPnet-nvidia
Worst systems
• VCC2018-N06
• VCC2018-N16
• VCC2020-T14
• VCC2016-C
• VCC2016-baseline
VCC typically has a smaller
amount of data
6

Best systems
• BC2010-M
• ESPnet-tacotron2v3
• ESPnet-nvidia
Worst systems
• VCC2018-N06
• VCC2018-N16
• VCC2020-T14
• VCC2016-C
• VCC2016-baseline
VCC typically has a smaller
amount of data
7

Listening test: Agreement
Agreement = 0.5 (Krippendorff’s
Alpha and Intra-Class Correlation)
Merlin has the largest variation in
scores 8

Do new MOS results correlate with original ones?
System-level and utterance-level Pearson’s correlation coefficient, Spearman Rank correlation
coefficient, and root mean squared error between original and new listening test results by challenge or
set of systems
System-level Utterance-level
Challenge PCC SRCC RMSE PCC SRCC RMSE
BC2008 0.93 0.89 0.33 0.70 0.67 0.62
BC2009 0.97 0.95 0.48 0.76 0.72 0.64
BC2010 0.93 0.98 0.66 0.74 0.73 0.85
BC2011 0.91 0.90 0.76 0.76 0.67 0.87
BC2013 0.97 0.98 0.49 - - -
BC2016 0.97 0.93 0.40 - - -
VCC2016 0.97 0.92 0.42 0.56 0.53 1.12
VCC2018 0.96 0.91 0.77 0.55 0.53 1.10
VCC2020 0.98 0.96 0.23 0.87 0.87 0.48
ESPnet 0.99 0.98 0.09 0.73 0.61 0.59
9

set of systems
BC2008 0.93 0.89 0.33 0.70 0.67 0.62
BC2009 0.97 0.95 0.48 0.76 0.72 0.64
BC2010 0.93 0.98 0.66 0.74 0.73 0.85
BC2011 0.91 0.90 0.76 0.76 0.67 0.87
BC2013 0.97 0.98 0.49 - - -
BC2016 0.97 0.93 0.40 - - -
VCC2016 0.97 0.92 0.42 0.56 0.53 1.12
VCC2018 0.96 0.91 0.77 0.55 0.53 1.10
VCC2020 0.98 0.96 0.23 0.87 0.87 0.48
ESPnet 0.99 0.98 0.09 0.73 0.61 0.59
10

set of systems
BC2008 0.93 0.89 0.33 0.70 0.67 0.62
BC2009 0.97 0.95 0.48 0.76 0.72 0.64
BC2010 0.93 0.98 0.66 0.74 0.73 0.85
BC2011 0.91 0.90 0.76 0.76 0.67 0.87
BC2013 0.97 0.98 0.49 - - -
BC2016 0.97 0.93 0.40 - - -
VCC2016 0.97 0.92 0.42 0.56 0.53 1.12
VCC2018 0.96 0.91 0.77 0.55 0.53 1.10
VCC2020 0.98 0.96 0.23 0.87 0.87 0.48
ESPnet 0.99 0.98 0.09 0.73 0.61 0.59
11

set of systems
BC2008 0.93 0.89 0.33 0.70 0.67 0.62
BC2009 0.97 0.95 0.48 0.76 0.72 0.64
BC2010 0.93 0.98 0.66 0.74 0.73 0.85
BC2011 0.91 0.90 0.76 0.76 0.67 0.87
BC2013 0.97 0.98 0.49 - - -
BC2016 0.97 0.93 0.40 - - -
VCC2016 0.97 0.92 0.42 0.56 0.53 1.12
VCC2018 0.96 0.91 0.77 0.55 0.53 1.10
VCC2020 0.98 0.96 0.23 0.87 0.87 0.48
ESPnet 0.99 0.98 0.09 0.73 0.61 0.59
12

Do MOS scores improve year by year?
Best system in each challenge compared to the previous challenge’s best system
Year : Best system MOS Improved? Significant?
BC2008 : J 3.63
BC2009 : S 3.87 X x
BC2010 : M 4.27 X X
BC2011 : G 4.12 x x
BC2013 : M 4.01 x x
BC2016 : L 3.63 x X
VCC2016 : O 2.86
VCC2018 : N10 3.55 X X
VCC2020 : T10 3.88 X x
ESPnet : transformerv3 4.33
13

Do MOS scores improve year by year?
Best system in each challenge compared to the previous challenge’s best system
Year : Best system MOS Improved? Significant?
BC2008 : J 3.63
BC2009 : S 3.87 X x
BC2010 : M 4.27 X X
BC2011 : G 4.12 x x
BC2013 : M 4.01 x x
BC2016 : L 3.63 x X
VCC2016 : O 2.86
VCC2018 : N10 3.55 X X
VCC2020 : T10 3.88 X x
ESPnet : transformerv3 4.33
14

At what point did TTS quality reach that of natural speech?
Is the year’s best system significantly different from that year’s natural speech?
Year : Best system Significant difference from natural speech?
BC2008 : J X
BC2009 : S X
BC2010 : M x
BC2011 : G X
BC2013 : M x
BC2016 : L x
VCC2016 : O X
VCC2018 : N10 x
VCC2020 : T10 x
ESPnet : transformerv3 x
15

At what point did TTS quality reach that of natural speech?
Difference of each system from natural speech, computed from averaged
z-score-normalized ratings by listener for each challenge
2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
2.5
2.0
1.5
1.0
0.5
0.0 BC2008
BC2009
BC2010
BC2011
BC2013
BC2016
ESPnet
16

Correlation with objective measures
• SNR: r=0.17
• P.563: r=0.05
• MOSnet trained on ASVspoof: r=0.03
Room for improvement of
objective measures
17

Correlation with objective measures
• SNR: r=0.17
• P.563: r=0.05
• MOSnet trained on ASVspoof: r=0.03
Room for improvement of
objective measures
18

Natural speech preferences and effects of corpus on TTS
• The effect of speech corpus on perceived TTS quality is well-documented:
• J. Williams, J. Rownicka, P. Oplustil, and S. King, “Comparison of speech representations
for automatic quality estimation in multi-speaker text-to- speech synthesis,” 2020
• F. Hinterleitner, C. Manolaina, and S. Moller, “Influence of a voice on the quality of
synthesized speech,” 2014
• Since every challenge uses a different corpus, this is a confounding factor to making
meaningful direct comparisons across challenges, but it is still important to capture
preferences regarding these factors for training a MOS prediction model.
19

Natural speech metadata and genre
• Professional speakers were rated as significantly more natural than non-professional
speakers
• Female speakers had a marginally-significantly (p=0.05) higher MOS than male speakers
• No significant differences between British and American speakers
• Genres: news, book, conversational
• News rated as most natural (MOS=4.36)
• Conversational: MOS=4.14
• Book: MOS=4.09 (significantly lower)
20

Speaker characteristics
• Standard Praat features:
• Minimum, maximum, mean, and standard deviation of f0 and energy
• NHR, jitter, shimmer
• Moderate negative correlations with MOS for shimmer (r=-0.46), NHR (r=-0.41), and
mean energy (r=-0.37)
• Moderate positive correlations with MOS for standard deviation of energy (r=0.41)
21

Effect of corpus on benchmark systems
• Moderate correlations for Festival
• Pearson r=0.33
• Spearman r=0.54
• Strong correlations for HTS
• Pearson r=0.87
• Spearman r=0.90
HTS output quality more
closely matches the quality
of the training data.
22

Effect of corpus on benchmark systems
• Moderate correlations for Festival
• Pearson r=0.33
• Spearman r=0.54
• Strong correlations for HTS
• Pearson r=0.87
• Spearman r=0.90
HTS output quality more
closely matches the quality
of the training data.
23

Discussion and future work
• We have a large dataset for training MOSnet-type systems
• Strong correlations with past listening tests
• Choice of speaker for training data is very important
• Will repeating the test with English listeners reveal language-dependent or cultural factors?
• Some systems have clear agreements whereas others have a wider distribution of scores.
• What makes certain systems so “controversial”?
• Are certain types of artifacts or unnaturalness more salient to some listeners than to others?
• Analysis of listener differences
• Incorporate variance of scores into MOSnet
24

How do Voices from Past Speech Synthesis Challenges Compare Today?

Recommended

Recommended

More Related Content

Similar to How do Voices from Past Speech Synthesis Challenges Compare Today?

Similar to How do Voices from Past Speech Synthesis Challenges Compare Today? (20)

More from Yamagishi Laboratory, National Institute of Informatics, Japan

More from Yamagishi Laboratory, National Institute of Informatics, Japan (17)

Recently uploaded

Recently uploaded (20)

How do Voices from Past Speech Synthesis Challenges Compare Today?