Anomaly-Based Annotation Errors Detection in TTS Corpora
Jindřich Matoušek and Daniel Tihelka
Dept. of Cybernetics, NTIS – New Technologies for the Information Society
Faculty of Applied Sciences, University of West Bohemia, Czech Republic
jmatouse@kky.zcu.cz, dtihelka@ntis.zcu.cz
Introduction
Concatenative Speech Synthesis
• unit selection still very popular approach
to speech synthesis
• nearly natural-sounding speech when
enough data in given style available
Unit Selection Disadvantages
• very large speech corpora (>10 hours of
speech)
• high-quality, consistent recordings (studio, voice quality, style, . . . )
• need for very precise annotation
• wrong word-level annotation causes gross
synthesis errors [1]!
• speech signal does not correspond to the annotation
« could result in unintelligible speech!
Misannotation Example
[Figure: source recording vs. synthetic speech, comparing the wrong and the correct annotation]
Aim of this Study
• use anomaly detection to detect word-level annotation errors
« detected errors could be fixed/removed
from speech corpus
Anomaly Detection Methods
Problem Definition
• anomaly (novelty/outlier) detection
= identification of items which do not conform to an expected pattern
• misannotated words considered as anomalous examples
• correctly annotated words taken as normal examples
• unsupervised technique under the assumption that examples
are not polluted by anomalies
$N_n$ … number of normal examples
$N_f$ … number of features
$x^{(1)}, \ldots, x^{(N_n)}$ … training set of normal examples
$x^{(i)} \in \mathbb{R}^{N_f}$ … $i$-th example
Univariate Gaussian Distribution (UGD)
• each feature modeled separately with mean $\mu_j \in \mathbb{R}$ and variance $\sigma_j^2 \in \mathbb{R}$: $x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$
• assumption of feature independence
• training – fitting parameters $\mu_j$, $\sigma_j^2$:
$$\mu_j = \frac{1}{N_n} \sum_{i=1}^{N_n} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{N_n} \sum_{i=1}^{N_n} \left(x_j^{(i)} - \mu_j\right)^2$$
• probability of a new example $x$:
$$p(x) = \prod_{j=1}^{N_f} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{N_f} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
• anomaly detection: if $p(x) < \varepsilon$ ⇒ $x$ is anomalous
• model parameter:
  $\varepsilon$ … threshold probability value to distinguish between normal and anomalous examples
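As a sketch (not the authors' implementation), the UGD detector can be written in a few lines of NumPy; the synthetic data and the threshold ε are purely illustrative:

```python
import numpy as np

def fit_ugd(X):
    """Fit per-feature mean and variance on the normal training examples."""
    return X.mean(axis=0), X.var(axis=0)

def ugd_density(x, mu, var):
    """p(x) as a product of independent univariate Gaussian densities."""
    p = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.prod(p))

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 3))   # synthetic "normal" examples
mu, var = fit_ugd(X_train)

eps = 1e-6                                       # illustrative threshold
x_normal = np.array([0.1, -0.2, 0.3])
x_odd = np.array([6.0, -6.0, 6.0])
print(ugd_density(x_normal, mu, var) < eps)      # False -> normal
print(ugd_density(x_odd, mu, var) < eps)         # True  -> anomalous
```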
Multivariate Gaussian Distribution (MGD)
• $p(x)$ modeled in one go using mean vector $\mu \in \mathbb{R}^{N_f}$ and covariance matrix $\Sigma \in \mathbb{R}^{N_f \times N_f}$: $x \sim \mathcal{N}_{N_f}(\mu, \Sigma)$
• training – fitting parameters $\mu$, $\Sigma$:
$$\mu = \frac{1}{N_n} \sum_{i=1}^{N_n} x^{(i)}, \qquad \Sigma = \frac{1}{N_n} \sum_{i=1}^{N_n} \left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^\top$$
• probability of a new example $x$:
$$p(x) = \frac{1}{\sqrt{(2\pi)^{N_f} |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
• anomaly detection: if $p(x) < \varepsilon$ ⇒ $x$ is anomalous
• model parameter:
  $\varepsilon$ … threshold probability value to distinguish between normal and anomalous examples
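The MGD detector can likewise be sketched in NumPy; the synthetic data and the threshold ε are illustrative only:

```python
import numpy as np

def fit_mgd(X):
    """ML estimates of the mean vector and covariance matrix from normal examples."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)
    return mu, Sigma

def mgd_density(x, mu, Sigma):
    """Multivariate Gaussian density p(x)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(2000, 2))   # synthetic "normal" examples
mu, Sigma = fit_mgd(X_train)

eps = 1e-6                                       # illustrative threshold
print(mgd_density(np.array([0.2, -0.1]), mu, Sigma) < eps)   # False -> normal
print(mgd_density(np.array([6.0, -6.0]), mu, Sigma) < eps)   # True  -> anomalous
```

Unlike UGD, the covariance matrix lets the model capture correlated features, at the cost of estimating $N_f \times N_f$ parameters.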
One-Class SVM (OCSVM)
• input data mapped into a high-dimensional feature space via a kernel function
• a maximal-margin hyperplane is found which best separates the training data from the origin
• training – determining hyperplane parameters $w$, $\rho$ (a quadratic programming problem):
$$\min_{w,\,\xi,\,\rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N_n} \sum_{i=1}^{N_n} \xi_i - \rho \quad \text{s.t.} \quad w \cdot \Phi(x^{(i)}) \ge \rho - \xi_i, \; i = 1, 2, \ldots, N_n, \; \xi_i \ge 0$$
• Gaussian radial basis function used as the kernel function:
$$K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$
• binary decision function ($\alpha_i$ … Lagrange multipliers):
$$f(x) = \operatorname{sgn}\left(w \cdot \Phi(x) - \rho\right) = \operatorname{sgn}\left(\sum_{i=1}^{N_n} \alpha_i K(x^{(i)}, x) - \rho\right)$$
• anomaly detection: $f(x) = +1$ ⇒ $x$ is normal; $f(x) = -1$ ⇒ $x$ is anomalous
• model parameters:
  $\nu$ … upper bound on the fraction of possibly anomalous examples
  $\gamma$ … kernel parameter
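Since the experiments use scikit-learn [4], the OCSVM detector maps directly onto `sklearn.svm.OneClassSVM`. The sketch below reuses the ν and γ values from the OCSVM∗ row of the parameter table, but the data are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))    # normal examples only

# nu and gamma taken from the poster's OCSVM* configuration; data are synthetic
model = OneClassSVM(kernel="rbf", nu=0.005, gamma=0.03125).fit(X_train)

# predict() returns +1 for normal and -1 for anomalous examples
pred = model.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
print(pred)   # +1 for the inlier, -1 for the far-away point
```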
References
[1] J. Matoušek, D. Tihelka, and L. Šmídl, “On the impact of annotation errors on unit-selection speech synthesis,”
Text, Speech and Dialogue, ser. Lecture Notes in Computer Science, vol. 7499, 2012.
[2] C.-Y. Lin and R. Jang, “Automatic phonetic segmentation by score predictive model for the corpora of Mandarin
singing voices,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 7, 2007.
[3] J. Matoušek and D. Tihelka, “Annotation errors detection in TTS Corpora,” in INTERSPEECH 2013.
[4] F. Pedregosa et. al, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol 12, 2011.
[5] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification algorithms,” Neural
Comput., vol. 10, 1998.
Experimental Data & Features
Speech Data
• Czech read-speech single-speaker corpus
• “news-broadcasting style”, no emotions, . . .
• recordings forced-aligned to phones (HTK)
• normal examples: 1124 correctly annot. words
• anomalous examples: 273 misannotated words
• misannotated words collected by human experts during TTS system evaluation
Feature Extraction & Collection
[Diagram: phone-level features (basic, acoustic, spectral, other) extracted for the phones p1–p4 of each word are converted into word-level statistics, word-level histograms, and deviations from CART-predicted values; phonetic and positional features are added at the word, phrase, and utterance level]
Features
• based on intuitions when checking forced alignment
• phone- and word-level features
• z-score deviations from CART-predicted phone-level values used to emphasize phone-level anomalies
• phone-level to word-level features conversion
• mean/median, min., max. phone-level feature value
• range of feature values
• within-word anomalies emphasized by histograms
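A minimal sketch of the phone-level to word-level conversion described above; the function name, bin count, and value range are illustrative assumptions:

```python
import numpy as np

def word_features(phone_values, n_bins=5, value_range=(-3.0, 3.0)):
    """Convert one word's phone-level feature values (e.g. z-score deviations
    from CART-predicted durations) into a fixed-size word-level vector:
    mean, median, min, max, range, plus a normalized histogram that keeps
    a single anomalous phone visible inside an otherwise normal word."""
    v = np.asarray(phone_values, dtype=float)
    stats = np.array([v.mean(), np.median(v), v.min(), v.max(), v.max() - v.min()])
    hist, _ = np.histogram(v, bins=n_bins, range=value_range)
    return np.concatenate([stats, hist / len(v)])

# a word whose third phone deviates strongly from its predicted duration:
# the max, range, and top histogram bin all expose the anomaly
f = word_features([0.1, -0.4, 2.8, 0.2])
print(f)
```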
Summary of Features
Features Description
Phone-level features
Basic duration, forced-aligned acoustic likelihood
Acoustic energy, formants (F1, F2, F3, F2/F1), fundamental frequency (F0), zero
crossing, voiced/unvoiced ratio
Spectral spectral crest factor, rolloff, flatness, centroid, spread, kurtosis, skewness,
harmonic-to-noise ratio
Other score predictive model (SPM) [2], energy/duration ratio, spectral
centroid/duration ratio
Word-level features
Phonetic phonetic voicedness ratio, sonority ratio, syllabic consonants ratio,
articulation manner distribution, articulation place distribution, word
boundary voicedness match [3]
Positional forward/backward position of word/phrase in phrase/utterance, the
position of the phrase in an utterance
Experiments
Model Training and Selection
• Normal examples
• 60% (674) used for training
• 20% (225) used for validation
• 20% (225) used for evaluation
• Anomalous examples
• 50% (136) used for validation
• 50% (137) used for evaluation
• none used for training
• Model training and selection
• features standardized to have zero mean and unit variance
• models' parameters optimized with respect to F1 score using a grid search with 10-fold cross validation
• various feature set combinations were also part of
model selection
• scikit-learn toolkit [4] used
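A hedged sketch of this selection step for UGD: standardize the features on the normal training data, then grid-search the threshold ε for the best F1 on a labeled validation set. The data here are synthetic and the split sizes are arbitrary, not the corpus described above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, (600, 4))               # normal words only
X_val = np.vstack([rng.normal(0.0, 1.0, (200, 4)),     # correctly annotated
                   rng.normal(4.0, 1.0, (100, 4))])    # misannotated
y_val = np.array([0] * 200 + [1] * 100)                # 1 = misannotated

scaler = StandardScaler().fit(X_train)                 # zero mean, unit variance
Zt = scaler.transform(X_train)
mu, var = Zt.mean(axis=0), Zt.var(axis=0)

def log_density(X):
    """Log of the UGD density, summed over the independent features."""
    Z = scaler.transform(X)
    return (-0.5 * np.log(2 * np.pi * var) - (Z - mu) ** 2 / (2 * var)).sum(axis=1)

# grid search the threshold: every observed score is a candidate log(eps)
scores = log_density(X_val)
best_t = max(np.unique(scores),
             key=lambda t: f1_score(y_val, (scores < t).astype(int)))
print("validation F1:", round(f1_score(y_val, (scores < best_t).astype(int)), 3))
```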
UGD, MGD, OCSVM – anomaly detection models
model∗ … optimally selected models
model0 … models trained on basic features
modelall … models trained on all features
modeldim … models with best reduced features
Model ID    Parameters              Features (#)
UGD∗        ε = 0.005               duration: stats + histogram + z-score;
MGD∗        ε = 2.5e-14             acoust. likelihood: stats + histogram;
                                    energy: z-score (28)
OCSVM∗      ν = 0.005, γ = 0.03125  duration: stats + histogram + z-score;
                                    acoust. likelihood: stats + histogram;
                                    energy/duration: stats (28)
UGDdim      ε = 5.0e-24             PCA (20)
MGDdim      ε = 5.0e-24             PCA (20)
OCSVMdim    ν = 0.125, γ = 0.125    ICA (30)
UGD0        ε = 2.0e-7              duration: stats;
MGD0        ε = 7.9e-4              acoust. likelihood: stats (8)
OCSVM0      ν = 0.05, γ = 0.25
UGDall      —                       all features (359)
MGDall      —
OCSVMall    ν = 0.075, γ = 2.4e-4
(— means no optimal parameter values were found)
Dimensionality Reduction
• automatic selection of the best feature combination using dimensionality reduction techniques
• number of features seen as another parameter of the model selection process
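For illustration, treating the number of retained PCA components as one more model-selection parameter can be sketched as follows (synthetic stand-in data, illustrative component counts):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, (600, 50))   # stand-in for the 359-dim word feature vectors

# sweep the number of retained components as part of model selection;
# each reduced matrix would then be fed to UGD/MGD/OCSVM training
for n_components in (10, 20, 30):
    reduced = PCA(n_components=n_components).fit_transform(X)
    print(n_components, reduced.shape)
```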
Comparison on validation set
PCA … principal component analysis
ICA … independent component analysis
FAG … feature agglomeration
CVF … features selected by cross validation (i.e., UGD∗, MGD∗, OCSVM∗)
[Bar chart: F1 [%] on the validation set (range 72–90 %) for UGD, MGD, and OCSVM using PCA, ICA, FAG, and CVF features]
Evaluation & Results
Detection Metrics
• Precision: $P = \frac{tp}{pp}$
• Recall: $R = \frac{tp}{ap}$
• F1 score: $F_1 = \frac{2 P R}{P + R}$
$tp$ … number of words correctly detected as misannotated
$pp$ … number of all words detected as misannotated
$ap$ … number of actually misannotated words
Highlighted results are statistically significant (McNemar's test, α = 0.05) [5]
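The metric definitions can be verified directly from counts; the counts below are hypothetical, chosen only because they reproduce the MGD∗ row of the results table:

```python
def detection_metrics(tp, pp, ap):
    """Precision, recall, and F1 from detection counts:
    tp = words correctly detected as misannotated,
    pp = all words detected as misannotated,
    ap = actually misannotated words."""
    precision = tp / pp
    recall = tp / ap
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts reproducing the MGD* row (P = 87.32, R = 90.51, F1 = 88.89)
p, r, f1 = detection_metrics(tp=124, pp=142, ap=137)
print(round(100 * p, 2), round(100 * r, 2), round(100 * f1, 2))  # 87.32 90.51 88.89
```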
Discussion
• dimensionality reduction techniques achieve results similar to feature combinations carefully selected by cross validation
• OCSVM with all features achieves comparable results
• features emphasizing anomalies (z-score deviations from CART-predicted values, histograms) are important
• spectral, phonetic, and positional features not so important
Results
Model ID       P [%]    R [%]    F1 [%]
UGD∗           84.83    89.78    87.23
MGD∗           87.32    90.51    88.89
OCSVM∗         85.71    87.59    86.64
UGDdim         88.15    86.86    87.50
MGDdim         88.15    86.86    87.50
OCSVMdim       85.40    85.40    85.40
UGD0           84.26    66.42    74.29
MGD0           76.03    81.02    78.45
OCSVM0         82.95    78.10    80.45
UGDall         37.85   100.00    54.91
MGDall         47.06    99.27    63.85
OCSVMall       87.97    85.40    86.67
random det.    23.70    25.50    24.60
Conclusion & Future Work
• all three anomaly detection techniques performed similarly well with F1 ≈ 89% when carefully configured using grid search and cross validation
• results comparable with classification-based detection [3] but no misannotated words needed to train
anomaly-based detector ⇒ training data collection should be easier
• Future work:
• error analysis to spot any potential systematic trend in the misdetected words
• performance on data from more speakers and/or more languages
• performance on spontaneous (expressive) data
« annotate only words detected as erroneous when building a new TTS voice
This research was supported by the grant TAČR TA01030476. The access to the MetaCentrum clusters provided under the programme LM2010005 is highly appreciated.
