SlideShare a Scribd company logo
The VoiceMOS Challenge 2022
Wen-Chin Huang¹, Erica Cooper², Yu Tsao³, Hsin-Min Wang³, Tomoki Toda¹, Junichi Yamagishi²
¹Nagoya University, Japan
²National Institute of Informatics, Japan
³Academia Sinica, Taiwan
第141回音声言語情報処理
研究発表会 招待講演
Outline
I. Introduction
II. Challenge description
A. Tracks and datasets
B. Rules and timeline
C. Evaluation metrics
D. Baseline systems
III. Challenge results
A. Participants demographics
B. Results, analysis and discussion
1. Comparison of baseline systems
2. Analysis of top systems
3. Sources of difficulty
4. Analysis of metrics
IV. Conclusions
Introduction
Important to evaluate speech synthesis systems, ex. text-to-speech (TTS), voice conversion (VC).
Mean opinion score (MOS) test:
Rate quality of individual samples.
Speech quality assessment
Subjective
test
Drawbacks:
1. Expensive: Costs too much time and money.
2. Context-dependent: numbers cannot be meaningfully
compared across different listening tests.
Important to evaluate speech synthesis systems, ex. text-to-speech (TTS), voice conversion (VC).
Mean opinion score (MOS) test:
Rate quality of individual samples.
Speech quality assessment
Subjective
test
Objective
assessment
Data-driven MOS prediction
(mostly based on deep learning)
Drawback: bad
generalization ability
due to limited training
data.
Goals of the VoiceMOS challenge
*Accepted as an Interspeech 2022 special session!
Encourage research in
automatic data-driven
MOS prediction
Compare different
approaches using
shared datasets and
evaluation
Focus on the
challenging case of
generalizing to a
separate listening test
Promote discussion
about the future of
this research field
Challenge
description
I. Tracks and datasets
II. Rules and timeline
III. Evaluation metrics
IV. Baseline systems
Challenge platform: CodaLab
Open-source web-based platform for reproducible machine learning research.
Tracks and dataset: Main track
The BVCC Dataset
● Samples from 187 different systems all rated together in one listening test
○ Past Blizzard Challenges (text-to-speech synthesis) since 2008
○ Past Voice Conversion Challenges (voice conversion) since 2016
○ ESPnet-TTS (implementations of modern TTS systems), 2020
● Test set contains some unseen systems, unseen listeners, and unseen speakers and is
balanced to match the distribution of scores in the training set
Listening test data from the Blizzard Challenge 2019
● “Out-of-domain” (OOD): Data from a completely separate listening test
● Chinese-language synthesis from systems submitted to the 2019 Blizzard Challenge
● Test set has some unseen systems and unseen listeners
● Designed to reflect a real-world setting where a small amount of labeled data is available
● Study generalization ability to a different listening test context
● Encourage unsupervised and semi-supervised approaches using unlabeled data
Tracks and dataset: OOD track
10% 40% 10% 40%
Labeled train set Unlabeled train set Dev set Test set
Dataset summary
Rules and timeline
Training phase
✔Main track train/dev audio+rating
✔OOD track train/dev audio+rating
Leaderboard available
Need to submit at least once
2021/12/5 2022/2/21 2022/2/28 2022/3/7
Break phase
No submission available.
Conduct result analysis.
Evaluation phase
✔Main track test
audio
✔OOD track test audio
Leaderboard available
Max 3 submissions
Post-evaluation phase
✔Main track test rating
✔OOD track test rating
✔Evaluation results
Submission & leaderboard reopen
General rules
External data is permitted.
All participants need to submit system description.
Evaluation metrics
System-level and Utterance-level
● Mean Squared Error (MSE): difference between predicted and actual MOS
● Linear Correlation Coefficient (LCC): a basic correlation measure
● Spearman Rank Correlation Coefficient (SRCC): non-parametric; measures ranking order
● Kendall Tau Rank Correlation (KTAU): more robust to errors
import numpy as np
import scipy.stats
# `true_mean_scores` and `predict_mean_scores` are both 1-d numpy arrays.
MSE = np.mean((true_mean_scores - predict_mean_scores)**2)
LCC = np.corrcoef(true_mean_scores, predict_mean_scores)[0][1]
SRCC = scipy.stats.spearmanr(true_mean_scores, predict_mean_scores)[0]
KTAU = scipy.stats.kendalltau(true_mean_scores, predict_mean_scores)[0]
Following prior work, we
picked system-level SRCC as
the main evaluation metric.
Baseline system: SSL-MOS
Fine-tune a self-supervised learning based (SSL) speech model for the MOS prediction task
● Pretrained wav2vec2
● Simple mean pooling and a linear fine-tuning layer
● Wav2vec2 model parameters are updated during fine-tuning
E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. ICASSP, 2022
Baseline system: MOSANet
● Originally developed for noisy speech
assessment
● Cross-domain input features:
○ Spectral information
○ Complex features
○ Raw waveform
○ Features extracted from SSL models
R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep Learning-based Non-Intrusive
Multi-Objective Speech Assessment Model with Cross-Domain Features,” arXiv preprint
arXiv:2111.02363, 2021.
Baseline system: LDNet
Listener-dependent modeling
● Specialized model structure and inference
method allows making use of multiple ratings
per audio sample.
● No external data is used!
W.-C. Huang, E. Cooper, J. Yamagishi, and T. Toda, “LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech,” in Proc. ICASSP, 2022
Challenge results
I. Participants demographics
II. Results, analysis and discussion
A. Comparison of baseline systems
B. Analysis of top systems
C. Sources of difficulty
D. Analysis of metrics
Participants demographics
Number of teams: 22 teams + 3 baselines
14 teams are from academia, 5 teams are
from industry, 3 teams are personal
Main track: 21 teams + 3 baselines
OOD track: 15 teams + 3 baselines
Baseline systems:
● B01: SSL-MOS
● B02: MOSANet
● B03: LDNet
Overall evaluation results: main track, OOD track
Comparison of baseline systems: main track
In terms of system-level SRCC,
11 teams outperformed the best
baseline, B01!
However, the gap between the
best baseline and the top
system is not large…
Comparison of baseline systems: OOD track
In terms of system-level SRCC,
only 2 teams outperformed or
on par with B01.
The gap is even smaller…
Participant feedback:
“The baseline was too strong!
Hard to get improvement!”
Analysis of approaches used
● Main track: Finetuning SSL > using SSL features > not using SSL
● OOD track: finetuned SSL models were both the best and worst systems
● Popular approaches:
○ Ensembling (top team in main track; top 2 teams in OOD track)
○ Multi-task learning
○ Use of speech recognizers (top team in OOD track)
Finetuned SSL
SSL
features
No
SSL
Analysis of approaches used
● 7 teams used per-listener ratings
● No teams used listener demographics
○ One team used “listener group”
● OOD track: only 3 teams used the unlabeled data:
Conducted their
own listening test
(top team)
Task-adaptive
pretraining
“Pseudo-label” the
unlabeled data using
trained model
Sources of difficulty
Are unseen categories more difficult?
Unseen systems
Unseen speakers
Unseen listeners
Main track OOD track
no
yes (7 teams)
Category
yes (17 teams)
yes (6 teams)
N/A
no
Sources of difficulty
Systems with large differences between their training and test set distributions are harder to predict.
Sources of difficulty
Low-quality systems
are easy to predict.
Middle and high
quality systems are
harder to predict.
Analysis of metrics
Why did you use system-level
SRCC as main metric?
Why are there 4 metrics?
There are two main
usage of MOS
prediction models:
Compare a lot of systems
Ex. evaluation in scientific challenges.
→ Ranking-related metrics are preferred
(LCC, SRCC, KTAU)
Evaluate absolute goodness of a system
Ex. use as objective function in training.
→ Numerical metrics are preferred
(MSE)
Analysis of metrics
We calculated the linear correlation coefficients between all metrics using main track results.
Correlation coefficients between
ranking based metrics are close to 1.
MSE is different from the other
three metrics.
⇒ Future researchers can consider just reporting 2 metrics: MSE + {LCC, SRCC, KTAU}.
⇒ It is still of significant importance to develop a general metric that reflects both aspects.
Conclusions
Conclusions
⇒ Attracted more than
20 participant teams.
⇒ SSL is very
powerful in this task.
⇒ Generalizing to a
different listening test
is still very hard.
⇒ There will be a 2nd,
3rd, 4th,... version!!
The goals of the VoiceMOS challenge:
Useful materials
The CodaLab challenge page is still open! Datasets are free to download. VoiceMOS Challenge
The baseline systems are open-sourced!
● https://github.com/nii-yamagishilab/mos-finetune-ssl
● https://github.com/dhimasryan/MOSA-Net-Cross-Domain
● https://github.com/unilight/LDNet
The arXiv paper is available!
● https://arxiv.org/abs/2203.11389

More Related Content

What's hot

実環境音響信号処理における収音技術
実環境音響信号処理における収音技術実環境音響信号処理における収音技術
実環境音響信号処理における収音技術
Yuma Koizumi
 
異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例
NU_I_TODALAB
 
Interactive voice conversion for augmented speech production
Interactive voice conversion for augmented speech productionInteractive voice conversion for augmented speech production
Interactive voice conversion for augmented speech production
NU_I_TODALAB
 
敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク
NU_I_TODALAB
 
End-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークEnd-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head Decoderネットワーク
NU_I_TODALAB
 
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
Junya Koguchi
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
Deep Learning JP
 
[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio
Deep Learning JP
 
論文紹介 wav2vec: Unsupervised Pre-training for Speech Recognition
論文紹介  wav2vec: Unsupervised Pre-training for Speech Recognition論文紹介  wav2vec: Unsupervised Pre-training for Speech Recognition
論文紹介 wav2vec: Unsupervised Pre-training for Speech Recognition
YosukeKashiwagi1
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
Yuki Saito
 
深層学習を利用した音声強調
深層学習を利用した音声強調深層学習を利用した音声強調
深層学習を利用した音声強調
Yuma Koizumi
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換
NU_I_TODALAB
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
NU_I_TODALAB
 
複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査
Tomoki Hayashi
 
Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討
Shinnosuke Takamichi
 
音声合成のコーパスをつくろう
音声合成のコーパスをつくろう音声合成のコーパスをつくろう
音声合成のコーパスをつくろう
Shinnosuke Takamichi
 
Asj2017 3invited
Asj2017 3invitedAsj2017 3invited
Asj2017 3invited
SaruwatariLabUTokyo
 
[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP
[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP
[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP
Deep Learning JP
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020
Yuki Saito
 
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパスJTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
Shinnosuke Takamichi
 

What's hot (20)

実環境音響信号処理における収音技術
実環境音響信号処理における収音技術実環境音響信号処理における収音技術
実環境音響信号処理における収音技術
 
異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例
 
Interactive voice conversion for augmented speech production
Interactive voice conversion for augmented speech productionInteractive voice conversion for augmented speech production
Interactive voice conversion for augmented speech production
 
敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク
 
End-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークEnd-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head Decoderネットワーク
 
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
 
[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio
 
論文紹介 wav2vec: Unsupervised Pre-training for Speech Recognition
論文紹介  wav2vec: Unsupervised Pre-training for Speech Recognition論文紹介  wav2vec: Unsupervised Pre-training for Speech Recognition
論文紹介 wav2vec: Unsupervised Pre-training for Speech Recognition
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 
深層学習を利用した音声強調
深層学習を利用した音声強調深層学習を利用した音声強調
深層学習を利用した音声強調
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換
 
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
 
複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査
 
Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討
 
音声合成のコーパスをつくろう
音声合成のコーパスをつくろう音声合成のコーパスをつくろう
音声合成のコーパスをつくろう
 
Asj2017 3invited
Asj2017 3invitedAsj2017 3invited
Asj2017 3invited
 
[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP
[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP
[DL輪読会]Wav2CLIP: Learning Robust Audio Representations From CLIP
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020
 
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパスJTubeSpeech:  音声認識と話者照合のために YouTube から構築される日本語音声コーパス
JTubeSpeech: 音声認識と話者照合のために YouTube から構築される日本語音声コーパス
 

Similar to The VoiceMOS Challenge 2022

The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
Human-centered AI: how can we support lay users to understand AI?
Human-centered AI: how can we support lay users to understand AI?Human-centered AI: how can we support lay users to understand AI?
Human-centered AI: how can we support lay users to understand AI?
Katrien Verbert
 
Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...
Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...
Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...
Jekaterina Novikova, PhD
 
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Daniel Roggen
 
Explainable AI for non-expert users
Explainable AI for non-expert usersExplainable AI for non-expert users
Explainable AI for non-expert users
Katrien Verbert
 
Improving evaluations and utilization with statistical edge nested data desi...
Improving evaluations and utilization with statistical edge  nested data desi...Improving evaluations and utilization with statistical edge  nested data desi...
Improving evaluations and utilization with statistical edge nested data desi...CesToronto
 
HCI 3e - Ch 9: Evaluation techniques
HCI 3e - Ch 9:  Evaluation techniquesHCI 3e - Ch 9:  Evaluation techniques
HCI 3e - Ch 9: Evaluation techniques
Alan Dix
 
SLR.pdf
SLR.pdfSLR.pdf
SLR.pdf
SLR.pdfSLR.pdf
Mixed-initiative recommender systems: towards a next generation of recommende...
Mixed-initiative recommender systems: towards a next generation of recommende...Mixed-initiative recommender systems: towards a next generation of recommende...
Mixed-initiative recommender systems: towards a next generation of recommende...
Katrien Verbert
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
ISSEL
 
My experiment
My experimentMy experiment
My experiment
Boshra Albayaty
 
Leveraging Technology in a Music Classroom
Leveraging Technology in a Music ClassroomLeveraging Technology in a Music Classroom
Leveraging Technology in a Music Classroom
Dan Massoth
 
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Quinsulon Israel
 
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness TaskMediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
multimediaeval
 
G2.suntasig.guallichico.maicol.alexander.english.project.design.docx
G2.suntasig.guallichico.maicol.alexander.english.project.design.docxG2.suntasig.guallichico.maicol.alexander.english.project.design.docx
G2.suntasig.guallichico.maicol.alexander.english.project.design.docx
Maicol Suntasig
 
Dstc6 an introduction
Dstc6 an introductionDstc6 an introduction
Dstc6 an introduction
hkh
 
Explaining recommendations: design implications and lessons learned
Explaining recommendations: design implications and lessons learnedExplaining recommendations: design implications and lessons learned
Explaining recommendations: design implications and lessons learned
Katrien Verbert
 
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
SAIL_QU
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System Challenges
Alan Said
 

Similar to The VoiceMOS Challenge 2022 (20)

The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Human-centered AI: how can we support lay users to understand AI?
Human-centered AI: how can we support lay users to understand AI?Human-centered AI: how can we support lay users to understand AI?
Human-centered AI: how can we support lay users to understand AI?
 
Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...
Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...
Natural Language Processing: From Human-Robot Interaction to Alzheimer’s Dete...
 
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
 
Explainable AI for non-expert users
Explainable AI for non-expert usersExplainable AI for non-expert users
Explainable AI for non-expert users
 
Improving evaluations and utilization with statistical edge nested data desi...
Improving evaluations and utilization with statistical edge  nested data desi...Improving evaluations and utilization with statistical edge  nested data desi...
Improving evaluations and utilization with statistical edge nested data desi...
 
HCI 3e - Ch 9: Evaluation techniques
HCI 3e - Ch 9:  Evaluation techniquesHCI 3e - Ch 9:  Evaluation techniques
HCI 3e - Ch 9: Evaluation techniques
 
SLR.pdf
SLR.pdfSLR.pdf
SLR.pdf
 
SLR.pdf
SLR.pdfSLR.pdf
SLR.pdf
 
Mixed-initiative recommender systems: towards a next generation of recommende...
Mixed-initiative recommender systems: towards a next generation of recommende...Mixed-initiative recommender systems: towards a next generation of recommende...
Mixed-initiative recommender systems: towards a next generation of recommende...
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
My experiment
My experimentMy experiment
My experiment
 
Leveraging Technology in a Music Classroom
Leveraging Technology in a Music ClassroomLeveraging Technology in a Music Classroom
Leveraging Technology in a Music Classroom
 
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
 
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness TaskMediaEval 2016 - UNIFESP Predicting Media Interestingness Task
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task
 
G2.suntasig.guallichico.maicol.alexander.english.project.design.docx
G2.suntasig.guallichico.maicol.alexander.english.project.design.docxG2.suntasig.guallichico.maicol.alexander.english.project.design.docx
G2.suntasig.guallichico.maicol.alexander.english.project.design.docx
 
Dstc6 an introduction
Dstc6 an introductionDstc6 an introduction
Dstc6 an introduction
 
Explaining recommendations: design implications and lessons learned
Explaining recommendations: design implications and lessons learnedExplaining recommendations: design implications and lessons learned
Explaining recommendations: design implications and lessons learned
 
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System Challenges
 

More from NU_I_TODALAB

信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離
NU_I_TODALAB
 
距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知
NU_I_TODALAB
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
NU_I_TODALAB
 
音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識
NU_I_TODALAB
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
NU_I_TODALAB
 
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
NU_I_TODALAB
 
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
NU_I_TODALAB
 
Hands on Voice Conversion
Hands on Voice ConversionHands on Voice Conversion
Hands on Voice Conversion
NU_I_TODALAB
 
Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice Conversion
NU_I_TODALAB
 
Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法
NU_I_TODALAB
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
NU_I_TODALAB
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法
NU_I_TODALAB
 
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
NU_I_TODALAB
 
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
NU_I_TODALAB
 
音声信号の分析と加工 - 音声を自在に変換するには?
音声信号の分析と加工 - 音声を自在に変換するには?音声信号の分析と加工 - 音声を自在に変換するには?
音声信号の分析と加工 - 音声を自在に変換するには?
NU_I_TODALAB
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現
NU_I_TODALAB
 

More from NU_I_TODALAB (16)

信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離
 
距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
 
音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
 
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
 
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
 
Hands on Voice Conversion
Hands on Voice ConversionHands on Voice Conversion
Hands on Voice Conversion
 
Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice Conversion
 
Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法
 
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
 
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
 
音声信号の分析と加工 - 音声を自在に変換するには?
音声信号の分析と加工 - 音声を自在に変換するには?音声信号の分析と加工 - 音声を自在に変換するには?
音声信号の分析と加工 - 音声を自在に変換するには?
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現
 

Recently uploaded

Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
MuhammadTufail242431
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Kamal Acharya
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
Kamal Acharya
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
abh.arya
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 

Recently uploaded (20)

Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 

The VoiceMOS Challenge 2022

  • 1. The VoiceMOS Challenge 2022 Wen-Chin Huang¹, Erica Cooper², Yu Tsao³, Hsin-Min Wang³, Tomoki Toda¹, Junichi Yamagishi² ¹Nagoya University, Japan ²National Institute of Informatics, Japan ³Academia Sinica, Taiwan 第141回音声言語情報処理 研究発表会 招待講演
  • 2. Outline I. Introduction II. Challenge description A. Tracks and datasets B. Rules and timeline C. Evaluation metrics D. Baseline systems III. Challenge results A. Participants demographics B. Results, analysis and discussion 1. Comparison of baseline systems 2. Analysis of top systems 3. Sources of difficulty 4. Analysis of metrics IV. Conclusions
  • 4. Important to evaluate speech synthesis systems, ex. text-to-speech (TTS), voice conversion (VC). Mean opinion score (MOS) test: Rate quality of individual samples. Speech quality assessment Subjective test Drawbacks: 1. Expensive: Costs too much time and money. 2. Context-dependent: numbers cannot be meaningfully compared across different listening tests.
  • 5. Important to evaluate speech synthesis systems, ex. text-to-speech (TTS), voice conversion (VC). Mean opinion score (MOS) test: Rate quality of individual samples. Speech quality assessment Subjective test Objective assessment Data-driven MOS prediction (mostly based on deep learning) Drawback: bad generalization ability due to limited training data.
  • 6. Goals of the VoiceMOS challenge *Accepted as an Interspeech 2022 special session! Encourage research in automatic data-driven MOS prediction Compare different approaches using shared datasets and evaluation Focus on the challenging case of generalizing to a separate listening test Promote discussion about the future of this research field
  • 7. Challenge description I. Tracks and datasets II. Rules and timeline III. Evaluation metrics IV. Baseline systems
  • 8. Challenge platform: CodaLab Open-source web-based platform for reproducible machine learning research.
  • 9. Tracks and dataset: Main track The BVCC Dataset ● Samples from 187 different systems all rated together in one listening test ○ Past Blizzard Challenges (text-to-speech synthesis) since 2008 ○ Past Voice Conversion Challenges (voice conversion) since 2016 ○ ESPnet-TTS (implementations of modern TTS systems), 2020 ● Test set contains some unseen systems, unseen listeners, and unseen speakers and is balanced to match the distribution of scores in the training set
  • 10. Listening test data from the Blizzard Challenge 2019 ● “Out-of-domain” (OOD): Data from a completely separate listening test ● Chinese-language synthesis from systems submitted to the 2019 Blizzard Challenge ● Test set has some unseen systems and unseen listeners ● Designed to reflect a real-world setting where a small amount of labeled data is available ● Study generalization ability to a different listening test context ● Encourage unsupervised and semi-supervised approaches using unlabeled data Tracks and dataset: OOD track 10% 40% 10% 40% Labeled train set Unlabeled train set Dev set Test set
  • 12. Rules and timeline Training phase ✔Main track train/dev audio+rating ✔OOD track train/dev audio+rating Leaderboard available Need to submit at least once 2021/12/5 2022/2/21 2022/2/28 2022/3/7 Break phase No submission available. Conduct result analysis. Evaluation phase ✔Main track test audio ✔OOD track test audio Leaderboard available Max 3 submissions Post-evaluation phase ✔Main track test rating ✔OOD track test rating ✔Evaluation results Submission & leaderboard reopen General rules External data is permitted. All participants need to submit system description.
  • 13. Evaluation metrics System-level and Utterance-level ● Mean Squared Error (MSE): difference between predicted and actual MOS ● Linear Correlation Coefficient (LCC): a basic correlation measure ● Spearman Rank Correlation Coefficient (SRCC): non-parametric; measures ranking order ● Kendall Tau Rank Correlation (KTAU): more robust to errors import numpy as np import scipy.stats # `true_mean_scores` and `predict_mean_scores` are both 1-d numpy arrays. MSE = np.mean((true_mean_scores - predict_mean_scores)**2) LCC = np.corrcoef(true_mean_scores, predict_mean_scores)[0][1] SRCC = scipy.stats.spearmanr(true_mean_scores, predict_mean_scores)[0] KTAU = scipy.stats.kendalltau(true_mean_scores, predict_mean_scores)[0] Following prior work, we picked system-level SRCC as the main evaluation metric.
  • 14. Baseline system: SSL-MOS Fine-tune a self-supervised learning based (SSL) speech model for the MOS prediction task ● Pretrained wav2vec2 ● Simple mean pooling and a linear fine-tuning layer ● Wav2vec2 model parameters are updated during fine-tuning E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. ICASSP, 2022
  • 15. Baseline system: MOSANet ● Originally developed for noisy speech assessment ● Cross-domain input features: ○ Spectral information ○ Complex features ○ Raw waveform ○ Features extracted from SSL models R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features,” arXiv preprint arXiv:2111.02363, 2021.
  • 16. Baseline system: LDNet Listener-dependent modeling ● Specialized model structure and inference method allows making use of multiple ratings per audio sample. ● No external data is used! W.-C. Huang, E. Cooper, J. Yamagishi, and T. Toda, “LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech,” in Proc. ICASSP, 2022
  • 17. Challenge results I. Participants demographics II. Results, analysis and discussion A. Comparison of baseline systems B. Analysis of top systems C. Sources of difficulty D. Analysis of metrics
  • 18. Participants demographics Number of teams: 22 teams + 3 baselines 14 teams are from academia, 5 teams are from industry, 3 teams are personal Main track: 21 teams + 3 baselines OOD track: 15 teams + 3 baselines Baseline systems: ● B01: SSL-MOS ● B02: MOSANet ● B03: LDNet
  • 19. Overall evaluation results: main track, OOD track
  • 20. Comparison of baseline systems: main track In terms of system-level SRCC, 11 teams outperformed the best baseline, B01! However, the gap between the best baseline and the top system is not large…
  • 21. Comparison of baseline systems: OOD track In terms of system-level SRCC, only 2 teams outperformed or on par with B01. The gap is even smaller… Participant feedback: “The baseline was too strong! Hard to get improvement!”
  • 22. Analysis of approaches used ● Main track: Finetuning SSL > using SSL features > not using SSL ● OOD track: finetuned SSL models were both the best and worst systems ● Popular approaches: ○ Ensembling (top team in main track; top 2 teams in OOD track) ○ Multi-task learning ○ Use of speech recognizers (top team in OOD track) Finetuned SSL SSL features No SSL
  • 23. Analysis of approaches used ● 7 teams used per-listener ratings ● No teams used listener demographics ○ One team used “listener group” ● OOD track: only 3 teams used the unlabeled data: Conducted their own listening test (top team) Task-adaptive pretraining “Pseudo-label” the unlabeled data using trained model
  • 24. Sources of difficulty Are unseen categories more difficult? Unseen systems Unseen speakers Unseen listeners Main track OOD track no yes (7 teams) Category yes (17 teams) yes (6 teams) N/A no
  • 25. Sources of difficulty Systems with large differences between their training and test set distributions are harder to predict.
  • 26. Sources of difficulty Low-quality systems are easy to predict. Middle and high quality systems are harder to predict.
  • 27. Analysis of metrics Why did you use system-level SRCC as main metric? Why are there 4 metrics? There are two main usage of MOS prediction models: Compare a lot of systems Ex. evaluation in scientific challenges. → Ranking-related metrics are preferred (LCC, SRCC, KTAU) Evaluate absolute goodness of a system Ex. use as objective function in training. → Numerical metrics are preferred (MSE)
  • 28. Analysis of metrics We calculated the linear correlation coefficients between all metrics using main track results. Correlation coefficients between ranking based metrics are close to 1. MSE is different from the other three metrics. ⇒ Future researchers can consider just reporting 2 metrics: MSE + {LCC, SRCC, KTAU}. ⇒ It is still of significant importance to develop a general metric that reflects both aspects.
  • 30. Conclusions ⇒ Attracted more than 20 participant teams. ⇒ SSL is very powerful in this task. ⇒ Generalizing to a different listening test is still very hard. ⇒ There will be a 2nd, 3rd, 4th,... version!! The goals of the VoiceMOS challenge:
  • 31. Useful materials The CodaLab challenge page is still open! Datasets are free to download. VoiceMOS Challenge The baseline systems are open-sourced! ● https://github.com/nii-yamagishilab/mos-finetune-ssl ● https://github.com/dhimasryan/MOSA-Net-Cross-Domain ● https://github.com/unilight/LDNet The arXiv paper is available! ● https://arxiv.org/abs/2203.11389