Annotation Errors Detection in TTS Corpora
Jindřich Matoušek, Daniel Tihelka
University of West Bohemia
Faculty of Applied Sciences
Department of Cybernetics
Plzeň, Czech Republic
August 27th, 2013
Outline
1 Introduction
2 Experimental Data
3 Annotation Error Detection Framework
4 Conclusions
Introduction
Unit Selection Speech Synthesis
Unit selection is still a very popular approach to speech synthesis
Nearly natural-sounding synthetic speech when enough data in the given style is available
(Some) Disadvantages of Unit Selection
Very large speech corpora (>10 hours of speech)
High-quality, consistent recordings (studio, voice quality, style, . . . )
Very precise annotation of the speech recordings (text, phonetics, prosody, . . . )
Wrong word-level annotation causes gross synthesis errors!
The speech signal does not correspond to the annotation
⇒ can result in unintelligible speech
Introduction (cont.)
Example of a Misannotated Word
[Audio examples: source recording vs. synthetic speech, each shown with the wrong and the correct annotation]
Misannotated words = missing or extra words, swapped words, mispronounced words
Introduction (cont.)
Aim of this study
Could automatic word-level error detection reveal annotation errors?
If the error detection is good enough:
detected errors could be fixed (manually) in the speech corpus
or, the detected words could be removed from the speech corpus
Ultimate Goal
Annotate only the words/utterances detected as erroneous when building a new TTS voice
In all other cases, rely on the original text prompts
Experimental Data
Source Speech Corpus
Czech read-speech single-speaker voice (ARTIC TTS system [1])
"News-broadcasting" style, no spontaneous speech
≈12k utterances (≈18 hours of speech; 110k running words)
Segmented to phones using HTK HMM-based forced alignment [2]
Data for Annotation Error Detection
Based on a human expert's analysis of the phonetic alignment
Number of misannotated words: 267
Number of correctly annotated words: 1,068
Total number of (running) words: 1,335 (8.5 minutes)
[1] Tihelka, D., et al.: Enhancements of Viterbi Search for Fast Unit Selection Synthesis. Interspeech 2010.
[2] Young, S., et al.: The HTK Book; Matoušek, J.: Automatic Pitch-Synchronous Phonetic Segmentation. Interspeech 2008.
Experimental Data (cont.)
Features
Based on the intuitions of a human expert confronting the speech signal with its forced alignment
Computed in a word-by-word manner (a sketch of the BAS features follows this list)

BAS (7): Basic features: mean/min/max phone duration; mean/min/max phone HMM acoustic likelihood; number of phones in the word
HIST (12): Histogram-related features: distribution of phone durations and acoustic likelihoods
PHON (28): Phonetic features: "voicedness" ratio, sonority ratio, word-boundary voicedness, manner/place of articulation, syllabic consonants
POS (6): Positional features: position of the word/phrase within the phrase/utterance (forward and reverse order); number of words/phrases in the phrase/utterance
DEV (3): Deviation from a CART duration model: mean/min/max deviation from the CART-predicted duration
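To make the BAS group concrete, here is a minimal sketch of the word-level computation, assuming per-phone durations and HMM acoustic log-likelihoods from the forced alignment are already available (the function and field names are hypothetical):

```python
import numpy as np

def bas_features(phone_durations, phone_loglikelihoods):
    """Compute the 7 basic (BAS) word-level features from per-phone
    durations (seconds) and HMM acoustic log-likelihoods.
    A sketch only; names and input format are assumptions."""
    d = np.asarray(phone_durations, dtype=float)
    ll = np.asarray(phone_loglikelihoods, dtype=float)
    return {
        "dur_mean": d.mean(), "dur_min": d.min(), "dur_max": d.max(),
        "ll_mean": ll.mean(), "ll_min": ll.min(), "ll_max": ll.max(),
        "n_phones": len(d),
    }

# Example: a three-phone word with its aligned durations/likelihoods.
print(bas_features([0.062, 0.110, 0.085], [-58.1, -73.4, -61.9]))
```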
Classification
Problem Definition
Two-class classification problem:
a word is annotated correctly
or, a word is misannotated
Classifiers
Support vector machine (SVM): linear kernel (SVM-LIN), Gaussian radial basis function kernel (SVM-RBF)
Extremely randomized trees (EXTREES)
k-nearest neighbor (KNN)
Classification (cont.)
Classification Procedure [3]
1 Feature extraction from the annotation data
2 Training/evaluation data split: 10 stratified splits, 80% training / 20% evaluation each
3 Standardize the data to zero mean and unit variance
4 Classifier training: grid search with 5-fold cross-validation on the standardized training data to find the best classifier setting
5 Classifier evaluation on the held-out evaluation data: accuracy, precision, recall, F1
(Steps 2-5 are repeated for splits 1 . . . 10; a scikit-learn sketch follows)
[3] Scikit-learn: Machine Learning in Python, http://scikit-learn.org
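A minimal scikit-learn sketch of this procedure, assuming a feature matrix X and word-level labels y; the hyperparameter grid and the placeholder data are assumptions, not the paper's actual settings:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# X: (n_words, n_features) feature matrix; y: 1 = misannotated, 0 = correct.
rng = np.random.RandomState(0)
X = rng.randn(1335, 22)                    # placeholder data
y = (rng.rand(1335) < 0.2).astype(int)     # placeholder labels

splits = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = []
for train_idx, eval_idx in splits.split(X, y):
    # Standardization is fitted on the training part of each split only.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = GridSearchCV(pipe, {"svc__C": [1, 10, 100],
                               "svc__gamma": ["scale", 0.01, 0.1]}, cv=5)
    grid.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[eval_idx], grid.predict(X[eval_idx])))
print("mean F1 over 10 splits:", np.mean(scores))
```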
Classification (cont.)
Word-Level Classification Results (F1 measure)

Features            EXTREES  KNN    SVM-LIN  SVM-RBF
BAS                 0.824    0.758  0.744    0.826
BAS+HIST            0.807    0.748  0.840    0.846
BAS+HIST+PHON       0.809    0.605  0.838    0.837
BAS+HIST+POS        0.814    0.724  0.822    0.830
BAS+HIST+DEV        0.872    0.811  0.876    0.876
All features        0.865    0.713  0.868    0.869
No likelihoods      0.827    0.621  0.831    0.843
No lklhd., no dur.  0.182    0.288  0.406    0.406

Bold values in each row denote statistically significant results (McNemar, α = 0.05)
⇒ SVMs and EXTREES performed comparably well
⇒ BAS+HIST+DEV and "all features" achieved the best results
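The significance comparison can be reproduced with McNemar's test on the paired per-word decisions of two classifiers. A sketch, assuming boolean correctness vectors for each classifier; statsmodels is one common implementation, not necessarily the one used in the paper:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# correct_a[i] / correct_b[i]: did classifier A / B classify word i correctly?
correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_b = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)

# 2x2 table of agreements/disagreements between the two classifiers.
table = [[np.sum( correct_a &  correct_b), np.sum( correct_a & ~correct_b)],
         [np.sum(~correct_a &  correct_b), np.sum(~correct_a & ~correct_b)]]
result = mcnemar(table, exact=True)  # exact binomial test on the off-diagonal
print("p-value:", result.pvalue)     # significant at alpha = 0.05 if p < 0.05
```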
Two-Phase Classification
Problem Description
Phase 1: a probabilistic decision for each word on whether it is misannotated
Phase 2: contextual features denoting the "misannotation probability" of the previous/current/next words (with context n = 1, 2, 3); see the sketch after the results
Results

Configuration        F1
CTX (n = 1)          0.897
CTX (n = 2)          0.897
CTX (n = 3)          0.875
SVM-RBF (n = 0)      0.876
Random classif.      0.200
Rule-based classif.  0.725
[Precision-recall curve for SVM-RBF with CTX (n = 1); AUC = 0.92]
Better results, but the improvement is not statistically significant (McNemar, α = 0.05)
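A sketch of the phase-2 contextual feature construction, assuming phase-1 per-word misannotation probabilities p and context width n; padding at utterance boundaries is an assumption:

```python
import numpy as np

def contextual_features(p, n=1, pad=0.0):
    """For each word, stack the phase-1 misannotation probabilities of the
    n previous words, the word itself, and the n following words.
    A sketch; boundary padding with 0.0 is an assumption."""
    p = np.asarray(p, dtype=float)
    padded = np.concatenate([np.full(n, pad), p, np.full(n, pad)])
    return np.stack([padded[i:i + 2 * n + 1] for i in range(len(p))])

# Phase-1 probabilities for a 5-word utterance; n = 1 gives 3 features/word.
print(contextual_features([0.05, 0.10, 0.92, 0.15, 0.03], n=1))
```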
Novelty Detection
Problem Definition
Novelty detection: does a new observation belong to the same distribution as the existing observations, or should it be considered different?
Trained on correctly annotated words only!
One-class SVM with an RBF kernel used
Similar procedure as for the classification utilized
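A minimal one-class SVM sketch under these assumptions (training on correct words only; the nu and gamma values and the placeholder data are assumptions to be tuned by grid search):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_correct = rng.randn(1068, 22)      # placeholder: correctly annotated words
X_eval = rng.randn(267, 22) + 1.5    # placeholder: evaluation words

# Fit on correct words only; standardization statistics come from them too.
detector = make_pipeline(StandardScaler(),
                         OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"))
detector.fit(X_correct)

# OneClassSVM predicts +1 for inliers (correct) and -1 for novelties.
is_misannotated = detector.predict(X_eval) == -1
print("flagged as misannotated:", is_misannotated.sum(), "of", len(X_eval))
```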
Novelty Detection (cont.)
Novelty Detection Procedure
1 Feature extraction; the annotation data are split into correctly annotated words and misannotated words
2 Training/evaluation data split: 10 stratified splits, 80% training / 20% evaluation; only correctly annotated words are used for training, while the misannotated words go into the evaluation data
3 Standardize the data to zero mean and unit variance
4 Novelty detector training: grid search with 5-fold cross-validation on the standardized training data to find the best setting
5 Novelty detector evaluation: accuracy, precision, recall, F1
(Steps 2-5 are repeated for splits 1 . . . 10)
Detection Results

Configuration    Accuracy  Precision  Recall  F1
NOVELTY          0.889     0.924      0.872   0.897
SVM-RBF (n = 0)  0.948     0.831      0.925   0.875
CTX (n = 1)      0.959     0.889      0.906   0.897
Utterance-Level Detection
Problem Definition
Detection of whether an utterance contains any annotation error
Could be used to filter out utterances with annotation errors
Still trained on word-level features
Evaluation at the utterance level
"One takes all" evaluation strategy: an utterance is marked as misannotated if it contains at least one misannotated word (sketched below)
Experimental Data
158 utterances
88 utterances containing an annotation error
70 utterances containing correctly annotated words only
20% used for evaluation in each training/evaluation split
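A sketch of the "one takes all" aggregation from word-level predictions to utterance-level decisions; the parallel-list input format is an assumption:

```python
from collections import defaultdict

def utterance_decisions(utt_ids, word_is_misannotated):
    """Mark an utterance as misannotated if at least one of its words
    was detected as misannotated. Input format is an assumption:
    parallel lists of utterance IDs and word-level boolean predictions."""
    flagged = defaultdict(bool)
    for utt, bad in zip(utt_ids, word_is_misannotated):
        flagged[utt] = flagged[utt] or bad
    return dict(flagged)

# Three words in utt1 (one flagged), two words in utt2 (none flagged).
print(utterance_decisions(["utt1", "utt1", "utt1", "utt2", "utt2"],
                          [False, True, False, False, False]))
# -> {'utt1': True, 'utt2': False}
```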
Utterance-Level Detection (cont.)
Detection Results

Classifier           Accuracy  Precision  Recall  F1
EXTREES              0.963     0.958      0.978   0.967
KNN                  0.875     0.936      0.839   0.882
SVM-LIN              0.856     0.801      1.000   0.888
SVM-RBF              0.909     0.865      1.000   0.926
NOVELTY              0.902     0.898      1.000   0.946
EXTREES CTX (n = 1)  0.969     0.947      1.000   0.973

BAS+HIST+DEV features used
Results averaged over 10 training/evaluation splits
EXTREES achieved the best performance (statistically significant; McNemar, α = 0.05)
Conclusions
Conclusion
Automatic detection of annotation errors for read-speech TTS corpora
Good results:
word level: F1 ≈ 90%
utterance level: F1 ≈ 97%
Utterances often contain multiple word-level errors
Future Work
More data from more speakers/languages
Spontaneous (expressive) data
⇒ Annotate only the words/utterances detected as erroneous when building a new TTS voice
Thank you for your attention!