Anomaly-Based Annotation Errors Detection in TTS Corpora
Jindřich Matoušek and Daniel Tihelka
Dept. of Cybernetics, NTIS – New Technologies for the Information Society
Faculty of Applied Sciences, University of West Bohemia, Czech Republic
jmatouse@kky.zcu.cz, dtihelka@ntis.zcu.cz
Introduction
Concatenative Speech Synthesis
• unit selection still very popular approach
to speech synthesis
• nearly natural-sounding speech when
enough data in given style available
Unit Selection Disadvantages
• very large speech corpora (>10 hours of
speech)
• high-quality, consistent recordings (studio, voice quality, style, . . . )
• need for very precise annotation
• wrong word-level annotation causes gross
synthesis errors [1]!
• speech signal does not correspond to the annotation
« could result in unintelligible speech!
Misannotation Example
[Figure: source recording vs. synthetic speech, comparing the wrong and the correct annotation]
Aim of this Study
• use anomaly detection to detect word-level annotation errors
« detected errors could be fixed/removed
from speech corpus
Anomaly Detection Methods
Problem Definition
• anomaly (novelty/outlier) detection
= identification of items which do not conform to an expected pattern
• misannotated words considered as anomalous examples
• correctly annotated words taken as normal examples
• unsupervised technique under the assumption that examples
are not polluted by anomalies
$N_n$ … number of normal examples
$N_f$ … number of features
$x^{(1)}, \ldots, x^{(N_n)}$ … training set of normal examples
$x^{(i)} \in \mathbb{R}^{N_f}$ … $i$-th example
Univariate Gaussian Distribution (UGD)
• each feature modeled separately with mean $\mu_j \in \mathbb{R}$ and variance $\sigma_j^2 \in \mathbb{R}$: $x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$
• assumption of feature independence
• training – fitting parameters $\mu_j$, $\sigma_j^2$:
$$\mu_j = \frac{1}{N_n} \sum_{i=1}^{N_n} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{N_n} \sum_{i=1}^{N_n} \left(x_j^{(i)} - \mu_j\right)^2$$
• probability of a new example $x$:
$$p(x) = \prod_{j=1}^{N_f} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{N_f} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
• anomaly detection: if $p(x) < \varepsilon$ ⇒ $x$ is anomalous
• model parameter:
  $\varepsilon$ … threshold probability value to distinguish between normal and anomalous examples
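As a sketch (not the authors' implementation), the UGD detector can be written in a few lines of NumPy; the synthetic data and the threshold ε are purely illustrative:

```python
import numpy as np

def fit_ugd(X):
    """Fit per-feature mean and variance on the normal training examples."""
    return X.mean(axis=0), X.var(axis=0)

def ugd_density(x, mu, var):
    """p(x) as a product of independent univariate Gaussian densities."""
    p = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.prod(p))

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 3))   # synthetic "normal" examples
mu, var = fit_ugd(X_train)

eps = 1e-6                                       # illustrative threshold
x_normal = np.array([0.1, -0.2, 0.3])
x_odd = np.array([6.0, -6.0, 6.0])
print(ugd_density(x_normal, mu, var) < eps)      # False -> normal
print(ugd_density(x_odd, mu, var) < eps)         # True  -> anomalous
```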
Multivariate Gaussian Distribution (MGD)
• $p(x)$ modeled in one go using mean vector $\mu \in \mathbb{R}^{N_f}$ and covariance matrix $\Sigma \in \mathbb{R}^{N_f \times N_f}$: $x \sim \mathcal{N}_{N_f}(\mu, \Sigma)$
• training – fitting parameters $\mu$, $\Sigma$:
$$\mu = \frac{1}{N_n} \sum_{i=1}^{N_n} x^{(i)}, \qquad \Sigma = \frac{1}{N_n} \sum_{i=1}^{N_n} \left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^\top$$
• probability of a new example $x$:
$$p(x) = \frac{1}{\sqrt{(2\pi)^{N_f} |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
• anomaly detection: if $p(x) < \varepsilon$ ⇒ $x$ is anomalous
• model parameter:
  $\varepsilon$ … threshold probability value to distinguish between normal and anomalous examples
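The MGD detector can likewise be sketched in NumPy; the synthetic data and the threshold ε are illustrative only:

```python
import numpy as np

def fit_mgd(X):
    """ML estimates of the mean vector and covariance matrix from normal examples."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)
    return mu, Sigma

def mgd_density(x, mu, Sigma):
    """Multivariate Gaussian density p(x)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(2000, 2))   # synthetic "normal" examples
mu, Sigma = fit_mgd(X_train)

eps = 1e-6                                       # illustrative threshold
print(mgd_density(np.array([0.2, -0.1]), mu, Sigma) < eps)   # False -> normal
print(mgd_density(np.array([6.0, -6.0]), mu, Sigma) < eps)   # True  -> anomalous
```

Unlike UGD, the covariance matrix lets the model capture correlated features, at the cost of estimating $N_f \times N_f$ parameters.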
One-Class SVM (OCSVM)
• input data mapped into a high-dimensional feature space via a kernel function
• a maximal-margin hyperplane is found which best separates the training data from the origin
• training – determining hyperplane parameters $w$, $\rho$ (a quadratic programming problem):
$$\min_{w,\,\xi,\,\rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N_n} \sum_{i=1}^{N_n} \xi_i - \rho \quad \text{s.t.} \quad w \cdot \Phi(x^{(i)}) \ge \rho - \xi_i, \; i = 1, 2, \ldots, N_n, \; \xi_i \ge 0$$
• Gaussian radial basis function used as the kernel function:
$$K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$
• binary decision function ($\alpha_i$ … Lagrange multipliers):
$$f(x) = \operatorname{sgn}\left(w \cdot \Phi(x) - \rho\right) = \operatorname{sgn}\left(\sum_{i=1}^{N_n} \alpha_i K(x^{(i)}, x) - \rho\right)$$
• anomaly detection: $f(x) = +1$ ⇒ $x$ is normal; $f(x) = -1$ ⇒ $x$ is anomalous
• model parameters:
  $\nu$ … upper bound on the fraction of possibly anomalous examples
  $\gamma$ … kernel parameter
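Since the experiments use scikit-learn [4], the OCSVM detector maps directly onto `sklearn.svm.OneClassSVM`. The sketch below reuses the ν and γ values from the OCSVM∗ row of the parameter table, but the data are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))    # normal examples only

# nu and gamma taken from the poster's OCSVM* configuration; data are synthetic
model = OneClassSVM(kernel="rbf", nu=0.005, gamma=0.03125).fit(X_train)

# predict() returns +1 for normal and -1 for anomalous examples
pred = model.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
print(pred)   # +1 for the inlier, -1 for the far-away point
```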
References
[1] J. Matoušek, D. Tihelka, and L. Šmídl, “On the impact of annotation errors on unit-selection speech synthesis,”
Text, Speech and Dialogue, ser. Lecture Notes in Computer Science, vol. 7499, 2012.
[2] C.-Y. Lin and R. Jang, “Automatic phonetic segmentation by score predictive model for the corpora of Mandarin
singing voices,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 7, 2007.
[3] J. Matoušek and D. Tihelka, “Annotation errors detection in TTS Corpora,” in INTERSPEECH 2013.
[4] F. Pedregosa et. al, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol 12, 2011.
[5] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification algorithms,” Neural
Comput., vol. 10, 1998.
Experimental Data & Features
Speech Data
• Czech read-speech single-speaker corpus
• “news-broadcasting style”, no emotions, . . .
• recordings forced-aligned to phones (HTK)
• normal examples: 1124 correctly annot. words
• anomalous examples: 273 misannotated words
• misannotated words collected by human experts during TTS system evaluation
Feature Extraction & Collection
[Diagram: phone-level features (basic, acoustic, spectral, other) extracted for the phones p1–p4 of each word are converted into word-level statistics, word-level histograms, and deviations from CART-predicted values; phonetic and positional features are added at the word, phrase, and utterance level]
Features
• based on intuitions when checking forced alignment
• phone- and word-level features
• z-score deviations from CART-predicted phone-level values used to emphasize phone-level anomalies
• phone-level to word-level features conversion
• mean/median, min., max. phone-level feature value
• range of feature values
• within-word anomalies emphasized by histograms
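A minimal sketch of the phone-level to word-level conversion described above; the function name, bin count, and value range are illustrative assumptions:

```python
import numpy as np

def word_features(phone_values, n_bins=5, value_range=(-3.0, 3.0)):
    """Convert one word's phone-level feature values (e.g. z-score deviations
    from CART-predicted durations) into a fixed-size word-level vector:
    mean, median, min, max, range, plus a normalized histogram that keeps
    a single anomalous phone visible inside an otherwise normal word."""
    v = np.asarray(phone_values, dtype=float)
    stats = np.array([v.mean(), np.median(v), v.min(), v.max(), v.max() - v.min()])
    hist, _ = np.histogram(v, bins=n_bins, range=value_range)
    return np.concatenate([stats, hist / len(v)])

# a word whose third phone deviates strongly from its predicted duration:
# the max, range, and top histogram bin all expose the anomaly
f = word_features([0.1, -0.4, 2.8, 0.2])
print(f)
```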
Summary of Features
Features Description
Phone-level features
Basic duration, forced-aligned acoustic likelihood
Acoustic energy, formants (F1, F2, F3, F2/F1), fundamental frequency (F0), zero
crossing, voiced/unvoiced ratio
Spectral spectral crest factor, rolloff, flatness, centroid, spread, kurtosis, skewness,
harmonic-to-noise ratio
Other score predictive model (SPM) [2], energy/duration ratio, spectral
centroid/duration ratio
Word-level features
Phonetic phonetic voicedness ratio, sonority ratio, syllabic consonants ratio,
articulation manner distribution, articulation place distribution, word
boundary voicedness match [3]
Positional forward/backward position of word/phrase in phrase/utterance, the
position of the phrase in an utterance
Experiments
Model Training and Selection
• Normal examples
• 60% (674) used for training
• 20% (225) used for validation
• 20% (225) used for evaluation
• Anomalous examples
• 50% (136) used for validation
• 50% (137) used for evaluation
• none used for training
• Model training and selection
• features standardized to have zero mean and unit variance
• models' parameters optimized with respect to F1 score using a grid search with 10-fold cross validation
• various feature set combinations were also part of
model selection
• scikit-learn toolkit [4] used
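A hedged sketch of this selection step for UGD: standardize the features on the normal training data, then grid-search the threshold ε for the best F1 on a labeled validation set. The data here are synthetic and the split sizes are arbitrary, not the corpus described above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, (600, 4))               # normal words only
X_val = np.vstack([rng.normal(0.0, 1.0, (200, 4)),     # correctly annotated
                   rng.normal(4.0, 1.0, (100, 4))])    # misannotated
y_val = np.array([0] * 200 + [1] * 100)                # 1 = misannotated

scaler = StandardScaler().fit(X_train)                 # zero mean, unit variance
Zt = scaler.transform(X_train)
mu, var = Zt.mean(axis=0), Zt.var(axis=0)

def log_density(X):
    """Log of the UGD density, summed over the independent features."""
    Z = scaler.transform(X)
    return (-0.5 * np.log(2 * np.pi * var) - (Z - mu) ** 2 / (2 * var)).sum(axis=1)

# grid search the threshold: every observed score is a candidate log(eps)
scores = log_density(X_val)
best_t = max(np.unique(scores),
             key=lambda t: f1_score(y_val, (scores < t).astype(int)))
print("validation F1:", round(f1_score(y_val, (scores < best_t).astype(int)), 3))
```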
UGD, MGD, OCSVM – anomaly detection models
model∗ … optimally selected models
model0 … models trained on basic features
modelall … models trained on all features
modeldim … models with best reduced features
Model ID    Parameters              Features (#)
UGD∗        ε = 0.005               duration: stats + histogram + z-score;
MGD∗        ε = 2.5e-14             acoust. likelihood: stats + histogram;
                                    energy: z-score (28)
OCSVM∗      ν = 0.005, γ = 0.03125  duration: stats + histogram + z-score;
                                    acoust. likelihood: stats + histogram;
                                    energy/duration: stats (28)
UGDdim      ε = 5.0e-24             PCA (20)
MGDdim      ε = 5.0e-24             PCA (20)
OCSVMdim    ν = 0.125, γ = 0.125    ICA (30)
UGD0        ε = 2.0e-7              duration: stats;
MGD0        ε = 7.9e-4              acoust. likelihood: stats (8)
OCSVM0      ν = 0.05, γ = 0.25
UGDall      —                       all features (359)
MGDall      —
OCSVMall    ν = 0.075, γ = 2.4e-4
(— means no optimal parameter values were found)
Dimensionality Reduction
• automatic selection of the best feature combination using dimensionality reduction techniques
• number of features seen as another parameter of the model selection process
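For illustration, treating the number of retained PCA components as one more model-selection parameter can be sketched as follows (synthetic stand-in data, illustrative component counts):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, (600, 50))   # stand-in for the 359-dim word feature vectors

# sweep the number of retained components as part of model selection;
# each reduced matrix would then be fed to UGD/MGD/OCSVM training
for n_components in (10, 20, 30):
    reduced = PCA(n_components=n_components).fit_transform(X)
    print(n_components, reduced.shape)
```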
Comparison on validation set
PCA … principal component analysis
ICA … independent component analysis
FAG … feature agglomeration
CVF … features selected by cross validation (i.e., UGD∗, MGD∗, OCSVM∗)
[Bar chart: F1 [%] on the validation set (range 72–90 %) for UGD, MGD, and OCSVM using PCA, ICA, FAG, and CVF features]
Evaluation & Results
Detection Metrics
• Precision: $P = \frac{tp}{pp}$
• Recall: $R = \frac{tp}{ap}$
• F1 score: $F_1 = \frac{2 P R}{P + R}$
$tp$ … number of words correctly detected as misannotated
$pp$ … number of all words detected as misannotated
$ap$ … number of actually misannotated words
Highlighted results are statistically significant (McNemar's test, α = 0.05) [5]
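The metric definitions can be verified directly from counts; the counts below are hypothetical, chosen only because they reproduce the MGD∗ row of the results table:

```python
def detection_metrics(tp, pp, ap):
    """Precision, recall, and F1 from detection counts:
    tp = words correctly detected as misannotated,
    pp = all words detected as misannotated,
    ap = actually misannotated words."""
    precision = tp / pp
    recall = tp / ap
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts reproducing the MGD* row (P = 87.32, R = 90.51, F1 = 88.89)
p, r, f1 = detection_metrics(tp=124, pp=142, ap=137)
print(round(100 * p, 2), round(100 * r, 2), round(100 * f1, 2))  # 87.32 90.51 88.89
```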
Discussion
• dimensionality reduction techniques achieve results similar to feature combinations carefully selected by cross validation
• OCSVM with all features achieves comparable results
• features emphasizing anomalies (z-score deviations from CART-predicted values, histograms) are important
• spectral, phonetic, and positional features not so important
Results
Model ID       P [%]    R [%]    F1 [%]
UGD∗           84.83    89.78    87.23
MGD∗           87.32    90.51    88.89
OCSVM∗         85.71    87.59    86.64
UGDdim         88.15    86.86    87.50
MGDdim         88.15    86.86    87.50
OCSVMdim       85.40    85.40    85.40
UGD0           84.26    66.42    74.29
MGD0           76.03    81.02    78.45
OCSVM0         82.95    78.10    80.45
UGDall         37.85   100.00    54.91
MGDall         47.06    99.27    63.85
OCSVMall       87.97    85.40    86.67
random det.    23.70    25.50    24.60
Conclusion & Future Work
• all three anomaly detection techniques performed similarly well with F1 ≈ 89% when carefully configured using grid search and cross validation
• results comparable with classification-based detection [3] but no misannotated words needed to train
anomaly-based detector ⇒ training data collection should be easier
• Future work:
• error analysis to spot any potential systematic trend in the misdetected words
• performance on data from more speakers and/or more languages
• performance on spontaneous (expressive) data
« annotate only words detected as erroneous when building a new TTS voice
This research was supported by the grant TAČR TA01030476. The access to the MetaCentrum clusters provided under the programme LM2010005 is highly appreciated.
