Annotation Errors Detection in TTS Corpora
Jindřich Matoušek, Daniel Tihelka
University of West Bohemia
Faculty of Applied Sciences
Department of Cybernetics
Plzeň, Czech Republic
August 27th, 2013
Outline
1 Introduction
2 Experimental Data
3 Annotation Error Detection Framework
4 Conclusions
Introduction
Unit Selection Speech Synthesis
Unit selection is still a very popular approach to speech synthesis
Nearly natural-sounding synthetic speech when enough data in the given style is available
(Some) Disadvantages of Unit Selection
Very large speech corpora (>10 hours of speech)
High-quality, consistent recordings (studio, voice quality, style, . . . )
Very precise annotation of the speech recordings (text, phonetics, prosody, . . . )
Wrong word-level annotation causes gross synthesis errors!
The speech signal does not correspond to the annotation
⇒ can result in unintelligible speech
Introduction (cont.)
Example of a Misannotated Word
[Audio examples: source recording vs. synthetic speech, each shown with the wrong and the correct annotation]
Misannotated words = missing or extra words, swapped words, mispronounced words
Introduction (cont.)
Aim of this study
Could automatic word-level error detection reveal annotation errors?
If the error detection is good enough:
detected errors could be fixed (manually) in the speech corpus
or, the detected words could be removed from the speech corpus
Ultimate Goal
Annotate only the words/utterances detected as erroneous when building a new TTS voice
In all other cases, rely on the original text prompts
Experimental Data
Source Speech Corpus
Czech read-speech single-speaker voice (ARTIC TTS system [1])
"News-broadcasting" style, no spontaneous speech
≈12k utterances (≈18 hours of speech; 110k running words)
Segmented to phones using HTK HMM-based forced alignment [2]
Data for Annotation Error Detection
Based on a human expert's analysis of the phonetic alignment
Number of misannotated words: 267
Number of correctly annotated words: 1,068
Total number of (running) words: 1,335 (8.5 minutes)
[1] Tihelka, D., et al.: Enhancements of Viterbi Search for Fast Unit Selection Synthesis. Interspeech 2010.
[2] Young, S., et al.: The HTK Book; Matoušek, J.: Automatic Pitch-Synchronous Phonetic Segmentation. Interspeech 2008.
Experimental Data (cont.)
Features
Based on the intuitions of a human expert confronting the speech signal with its forced alignment
Computed in a word-by-word manner (a sketch of the BAS features follows this list)

BAS (7): Basic features: mean/min/max phone duration; mean/min/max phone HMM acoustic likelihood; number of phones in the word
HIST (12): Histogram-related features: distribution of phone durations and acoustic likelihoods
PHON (28): Phonetic features: "voicedness" ratio, sonority ratio, word-boundary voicedness, manner/place of articulation, syllabic consonants
POS (6): Positional features: position of the word/phrase within the phrase/utterance (forward and reverse order); number of words/phrases in the phrase/utterance
DEV (3): Deviation from a CART duration model: mean/min/max deviation from the CART-predicted duration
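To make the BAS group concrete, here is a minimal sketch of the word-level computation, assuming per-phone durations and HMM acoustic log-likelihoods from the forced alignment are already available (the function and field names are hypothetical):

```python
import numpy as np

def bas_features(phone_durations, phone_loglikelihoods):
    """Compute the 7 basic (BAS) word-level features from per-phone
    durations (seconds) and HMM acoustic log-likelihoods.
    A sketch only; names and input format are assumptions."""
    d = np.asarray(phone_durations, dtype=float)
    ll = np.asarray(phone_loglikelihoods, dtype=float)
    return {
        "dur_mean": d.mean(), "dur_min": d.min(), "dur_max": d.max(),
        "ll_mean": ll.mean(), "ll_min": ll.min(), "ll_max": ll.max(),
        "n_phones": len(d),
    }

# Example: a three-phone word with its aligned durations/likelihoods.
print(bas_features([0.062, 0.110, 0.085], [-58.1, -73.4, -61.9]))
```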
Classification
Problem Definition
Two-class classification problem:
a word is annotated correctly
or, a word is misannotated
Classifiers
Support vector machine (SVM): linear kernel (SVM-LIN), Gaussian radial basis function kernel (SVM-RBF)
Extremely randomized trees (EXTREES)
k-nearest neighbor (KNN)
Classification (cont.)
Classification Procedure [3]
1 Feature extraction from the annotation data
2 Training/evaluation data split: 10 stratified splits, 80% training / 20% evaluation each
3 Standardize the data to zero mean and unit variance
4 Classifier training: grid search with 5-fold cross-validation on the standardized training data to find the best classifier setting
5 Classifier evaluation on the held-out evaluation data: accuracy, precision, recall, F1
(Steps 2-5 are repeated for splits 1 . . . 10; a scikit-learn sketch follows)
[3] Scikit-learn: Machine Learning in Python, http://scikit-learn.org
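A minimal scikit-learn sketch of this procedure, assuming a feature matrix X and word-level labels y; the hyperparameter grid and the placeholder data are assumptions, not the paper's actual settings:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# X: (n_words, n_features) feature matrix; y: 1 = misannotated, 0 = correct.
rng = np.random.RandomState(0)
X = rng.randn(1335, 22)                    # placeholder data
y = (rng.rand(1335) < 0.2).astype(int)     # placeholder labels

splits = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = []
for train_idx, eval_idx in splits.split(X, y):
    # Standardization is fitted on the training part of each split only.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = GridSearchCV(pipe, {"svc__C": [1, 10, 100],
                               "svc__gamma": ["scale", 0.01, 0.1]}, cv=5)
    grid.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[eval_idx], grid.predict(X[eval_idx])))
print("mean F1 over 10 splits:", np.mean(scores))
```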
Classification (cont.)
Word-Level Classification Results (F1 measure)

Features            EXTREES  KNN    SVM-LIN  SVM-RBF
BAS                 0.824    0.758  0.744    0.826
BAS+HIST            0.807    0.748  0.840    0.846
BAS+HIST+PHON       0.809    0.605  0.838    0.837
BAS+HIST+POS        0.814    0.724  0.822    0.830
BAS+HIST+DEV        0.872    0.811  0.876    0.876
All features        0.865    0.713  0.868    0.869
No likelihoods      0.827    0.621  0.831    0.843
No lklhd., no dur.  0.182    0.288  0.406    0.406

Bold values in each row denote statistically significant results (McNemar, α = 0.05)
⇒ SVMs and EXTREES performed comparably well
⇒ BAS+HIST+DEV and "all features" achieved the best results
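The significance comparison can be reproduced with McNemar's test on the paired per-word decisions of two classifiers. A sketch, assuming boolean correctness vectors for each classifier; statsmodels is one common implementation, not necessarily the one used in the paper:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# correct_a[i] / correct_b[i]: did classifier A / B classify word i correctly?
correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_b = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)

# 2x2 table of agreements/disagreements between the two classifiers.
table = [[np.sum( correct_a &  correct_b), np.sum( correct_a & ~correct_b)],
         [np.sum(~correct_a &  correct_b), np.sum(~correct_a & ~correct_b)]]
result = mcnemar(table, exact=True)  # exact binomial test on the off-diagonal
print("p-value:", result.pvalue)     # significant at alpha = 0.05 if p < 0.05
```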
Two-Phase Classification
Problem Description
Phase 1: a probabilistic decision for each word on whether it is misannotated
Phase 2: contextual features denoting the "misannotation probability" of the previous/current/next words (with context n = 1, 2, 3); see the sketch after the results
Results

Configuration        F1
CTX (n = 1)          0.897
CTX (n = 2)          0.897
CTX (n = 3)          0.875
SVM-RBF (n = 0)      0.876
Random classif.      0.200
Rule-based classif.  0.725
[Precision-recall curve for SVM-RBF with CTX (n = 1); AUC = 0.92]
Better results, but the improvement is not statistically significant (McNemar, α = 0.05)
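A sketch of the phase-2 contextual feature construction, assuming phase-1 per-word misannotation probabilities p and context width n; padding at utterance boundaries is an assumption:

```python
import numpy as np

def contextual_features(p, n=1, pad=0.0):
    """For each word, stack the phase-1 misannotation probabilities of the
    n previous words, the word itself, and the n following words.
    A sketch; boundary padding with 0.0 is an assumption."""
    p = np.asarray(p, dtype=float)
    padded = np.concatenate([np.full(n, pad), p, np.full(n, pad)])
    return np.stack([padded[i:i + 2 * n + 1] for i in range(len(p))])

# Phase-1 probabilities for a 5-word utterance; n = 1 gives 3 features/word.
print(contextual_features([0.05, 0.10, 0.92, 0.15, 0.03], n=1))
```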
Novelty Detection
Problem Definition
Novelty detection: does a new observation belong to the same distribution as the existing observations, or should it be considered different?
Trained on correctly annotated words only!
One-class SVM with an RBF kernel used
Similar procedure as for the classification utilized
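A minimal one-class SVM sketch under these assumptions (training on correct words only; the nu and gamma values and the placeholder data are assumptions to be tuned by grid search):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_correct = rng.randn(1068, 22)      # placeholder: correctly annotated words
X_eval = rng.randn(267, 22) + 1.5    # placeholder: evaluation words

# Fit on correct words only; standardization statistics come from them too.
detector = make_pipeline(StandardScaler(),
                         OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"))
detector.fit(X_correct)

# OneClassSVM predicts +1 for inliers (correct) and -1 for novelties.
is_misannotated = detector.predict(X_eval) == -1
print("flagged as misannotated:", is_misannotated.sum(), "of", len(X_eval))
```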
Novelty Detection (cont.)
Novelty Detection Procedure
1 Feature extraction; the annotation data are split into correctly annotated words and misannotated words
2 Training/evaluation data split: 10 stratified splits, 80% training / 20% evaluation; only correctly annotated words are used for training, while the misannotated words go into the evaluation data
3 Standardize the data to zero mean and unit variance
4 Novelty detector training: grid search with 5-fold cross-validation on the standardized training data to find the best setting
5 Novelty detector evaluation: accuracy, precision, recall, F1
(Steps 2-5 are repeated for splits 1 . . . 10)
Detection Results

Configuration    Accuracy  Precision  Recall  F1
NOVELTY          0.889     0.924      0.872   0.897
SVM-RBF (n = 0)  0.948     0.831      0.925   0.875
CTX (n = 1)      0.959     0.889      0.906   0.897
Utterance-Level Detection
Problem Definition
Detection of whether an utterance contains any annotation error
Could be used to filter out utterances with annotation errors
Still trained on word-level features
Evaluation at the utterance level
"One takes all" evaluation strategy: an utterance is marked as misannotated if it contains at least one misannotated word (sketched below)
Experimental Data
158 utterances
88 utterances containing an annotation error
70 utterances containing correctly annotated words only
20% used for evaluation in each training/evaluation split
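A sketch of the "one takes all" aggregation from word-level predictions to utterance-level decisions; the parallel-list input format is an assumption:

```python
from collections import defaultdict

def utterance_decisions(utt_ids, word_is_misannotated):
    """Mark an utterance as misannotated if at least one of its words
    was detected as misannotated. Input format is an assumption:
    parallel lists of utterance IDs and word-level boolean predictions."""
    flagged = defaultdict(bool)
    for utt, bad in zip(utt_ids, word_is_misannotated):
        flagged[utt] = flagged[utt] or bad
    return dict(flagged)

# Three words in utt1 (one flagged), two words in utt2 (none flagged).
print(utterance_decisions(["utt1", "utt1", "utt1", "utt2", "utt2"],
                          [False, True, False, False, False]))
# -> {'utt1': True, 'utt2': False}
```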
Utterance-Level Detection (cont.)
Detection Results

Classifier           Accuracy  Precision  Recall  F1
EXTREES              0.963     0.958      0.978   0.967
KNN                  0.875     0.936      0.839   0.882
SVM-LIN              0.856     0.801      1.000   0.888
SVM-RBF              0.909     0.865      1.000   0.926
NOVELTY              0.902     0.898      1.000   0.946
EXTREES CTX (n = 1)  0.969     0.947      1.000   0.973

BAS+HIST+DEV features used
Results averaged over 10 training/evaluation splits
EXTREES achieved the best performance (statistically significant; McNemar, α = 0.05)
Conclusions
Conclusion
Automatic detection of annotation errors for read-speech TTS corpora
Good results:
word level: F1 ≈ 90%
utterance level: F1 ≈ 97%
Utterances often contain multiple word-level errors
Future Work
More data from more speakers/languages
Spontaneous (expressive) data
⇒ Annotate only the words/utterances detected as erroneous when building a new TTS voice
Thank you for your attention!