CUHK System for the Spoken Web Search task at MediaEval 2012

The CUHK Systems for Spoken Web Search task at
MediaEval 2012

Haipeng Wang and Tan Lee

Department of Electronic Engineering
The Chinese University of Hong Kong

September 30, 2012


Outline

1 Overview

2 System Description
PTDTW framework
Tokenizers
DTW detection
Pseudo-relevance Feedback and Score Normalization

3 System configuration and performance

4 Conclusion

5 Acknowledgement


Overview

2012 Spoken Web Search task [Metze et al., 2012]
QbyE STD: query-by-example (QbyE) spoken term detection (STD), i.e. audio search using audio queries.
Multilingual: four South African languages.
Low-resource: less than 4 hours of DEV audio data in total.
Extreme case: only one example for each query term.
Overview of our systems
Aim: a language-independent QbyE STD system.
Multiple resources:
1) the DEV audio data; 2) rich-resource languages.
Combining different resources: the PTDTW framework.
Pseudo-relevance feedback (PRF).
Score normalization.


Posteriorgram-based template matching

Figure: Posteriorgram-based template matching [Hazen et al., 2009]. A tokenizer built from the training resources converts the query example and the test utterance into posteriorgrams; DTW detection compares the two posteriorgrams and outputs a detection score.
Training resources: audio data with or without transcriptions.
Tokenizer: if trained without transcriptions, unsupervised;
otherwise, supervised.
Posteriorgrams: more robust than spectral features.
How to effectively combine different resources?


PTDTW framework

Figure: PTDTW framework. The query example and the test utterance pass through N parallel tokenizers. Tokenizer n produces query and test posteriorgrams, from which a DTW distance matrix Dn is computed. The matrices D1, ..., DN are combined into a single DTW distance matrix D, on which DTW detection produces the raw detection score.
Parallel tokenizers followed by DTW detection (PTDTW).
Modified from the posteriorgram-based template matching
approach.
Key idea: Combining DTW distance matrices.
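
The combination step can be illustrated with a minimal sketch. The slides do not state how the matrices are combined, so the weighted sum below (equal weights by default) is an assumption, and the function name `combine_distance_matrices` is hypothetical.

```python
import numpy as np

def combine_distance_matrices(matrices, weights=None):
    """Fuse per-tokenizer distance matrices D1..DN into one matrix D.

    matrices: list of N arrays, each of shape (query_frames, test_frames).
    weights:  optional list of N fusion weights (assumption: equal weights
              when omitted; the slides do not give the fusion rule).
    """
    stacked = np.stack(matrices, axis=0)  # (N, query_frames, test_frames)
    if weights is None:
        weights = np.full(len(matrices), 1.0 / len(matrices))
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the tokenizer axis.
    return np.tensordot(weights, stacked, axes=1)
```

Because every tokenizer sees the same query and test frames, all matrices share one shape, so the fused matrix D can be passed to a single DTW detector.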


Unsupervised tokenizers

MFCC-GMM tokenizer [Zhang and Glass, 2009]
Unsupervised training from the DEV data without transcription.
1024 Gaussian components.
39-dim MFCC + MVN + VTLN
MFCC-ASM tokenizer [Lee et al., 1988, Wang et al., 2012]
Acoustic segment model (ASM), also known as self-organized unit
(SOU) [Siu et al., 2010].
Unsupervised training from the DEV data without transcription.
256 ASM units. Each unit has 3 states, with 16 Gaussian
components per state.
39-dim MFCC + MVN + VTLN
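
As an illustration of what a GMM tokenizer outputs, the sketch below computes a Gaussian posteriorgram from feature frames. It assumes diagonal covariances and an already trained GMM; the function name and argument layout are hypothetical and not taken from the slides.

```python
import numpy as np

def gaussian_posteriorgram(frames, weights, means, variances):
    """Posterior probability of each Gaussian component for each frame.

    frames:    (T, D) feature vectors, e.g. 39-dim MFCCs.
    weights:   (M,) mixture weights.
    means:     (M, D) component means.
    variances: (M, D) diagonal covariances (assumption).
    Returns a (T, M) matrix whose rows sum to 1.
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, M, D)
    log_lik = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_lik                      # unnormalized
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

With the 1024-component MFCC-GMM tokenizer, each frame would map to a 1024-dimensional posterior vector.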


Phoneme recognizers

Czech, Hungarian, Russian phoneme recognizers
Developed by BUT [Schwarz, 2009].
Trained on the SpeechDat-E corpora.
Mandarin phoneme recognizer
179 tonal phonemes.
About 15 hours of training data from the CallHome and
CallFriend corpora.
English phoneme recognizer
40 phonemes.
About 15 hours of training data from the Fisher and Switchboard
Cellular corpora.


Phoneme recognizers

Figure: Tandem structure. Input data → phoneme recognizers → taking logarithm → PCA transform → GMM → Gaussian posteriorgrams.

256 Gaussian components trained on the DEV data.
Using tandem structure, we have 5 tokenizers:
CZ-GMM, HU-GMM, RU-GMM, MA-GMM and EN-GMM.
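
The front half of the tandem structure (logarithm followed by a PCA transform) might look like the sketch below. It assumes the phoneme recognizer already outputs frame-level phoneme posteriors; the flooring constant, the number of retained dimensions, and all function names are hypothetical. The resulting features would then be modeled by the 256-component GMM.

```python
import numpy as np

def log_posteriors(phone_post, floor=1e-10):
    """Take the logarithm of phoneme posteriors, flooring to avoid log(0)."""
    return np.log(np.maximum(phone_post, floor))

def fit_pca(features, n_components):
    """Estimate a PCA transform (mean and principal axes) via SVD."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def tandem_features(phone_post, mean, components):
    """Log-posteriors projected onto the PCA axes; input to GMM training."""
    return (log_posteriors(phone_post) - mean) @ components.T
```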


DTW detection

DTW detection is performed with a sliding window.
Find the path minimizing the normalized distance:
$$\hat{d} = \min_{K,\, i(k),\, j(k)} \frac{1}{Z(w)} \sum_{k=1}^{K} d(i(k), j(k))\, w_k$$
where $d(i(k), j(k))$ is set to the inner-product distance, $w_k = 1$,
and $Z(w) = K$.
Additional constraint: $|i(k) - j(k)| \le R$.
Because the query length varies widely, R is not set to a
fixed number but is proportional to the query length I:
$R = \alpha \times I$ ($\alpha = 1/3$ in our systems).
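
A minimal sketch of band-constrained DTW detection with a sliding window follows. Several points are assumptions not stated in the slides: the inner-product distance is taken as the negative log of the inner product of two posterior vectors, the window length equals the query length, and the path-length normalization is approximated by greedily keeping the predecessor with the lowest average cost. The band ratio `alpha` is passed in explicitly, and all function names are hypothetical.

```python
import numpy as np

def inner_product_distance(query_post, test_post, eps=1e-10):
    """Frame-pair distances, assumed here to be -log(q . t)."""
    return -np.log(np.maximum(query_post @ test_post.T, eps))

def banded_dtw(dist, alpha):
    """Average path cost from (0, 0) to the last cell with |i - j| <= R."""
    I, J = dist.shape
    R = max(1, int(round(alpha * I)))       # band width grows with query length
    cost = np.full((I, J), np.inf)
    length = np.zeros((I, J), dtype=int)
    cost[0, 0], length[0, 0] = dist[0, 0], 1
    for i in range(I):
        for j in range(J):
            if (i == 0 and j == 0) or abs(i - j) > R:
                continue
            best = None
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi < 0 or pj < 0 or not np.isfinite(cost[pi, pj]):
                    continue
                cand = (cost[pi, pj] + dist[i, j], length[pi, pj] + 1)
                # Greedy approximation: keep the lowest average cost so far.
                if best is None or cand[0] / cand[1] < best[0] / best[1]:
                    best = cand
            if best is not None:
                cost[i, j], length[i, j] = best
    if length[-1, -1] == 0:
        return np.inf                        # end cell unreachable inside the band
    return cost[-1, -1] / length[-1, -1]

def sliding_window_detect(query_post, test_post, alpha, shift=1):
    """Slide a query-length window over the test utterance; return best hit."""
    I = query_post.shape[0]
    dist = inner_product_distance(query_post, test_post)
    best_d, best_start = np.inf, -1
    for start in range(0, test_post.shape[0] - I + 1, shift):
        d = banded_dtw(dist[:, start:start + I], alpha)
        if d < best_d:
            best_d, best_start = d, start
    return best_d, best_start
```

Computing the full query-by-test distance matrix once and slicing it per window avoids recomputing frame distances for overlapping windows.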


Pseudo-relevance Feedback and Score Normalization

Pseudo-relevance feedback for each query:
1) The top H hits from all the test utterances were selected as the
relevance examples. Selection criteria: a) H ≤ 3; b) the raw
detection score must be larger than a pre-set threshold.
2) The relevance examples were used to score the top $\hat{H}$ ($\hat{H} = 2$
for this task) hits from each test utterance.
3) The scores obtained by the relevance examples were linearly
fused with the scores of the original query examples.
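
The selection and fusion steps above can be sketched as follows. The hit tuple layout, the averaging of the relevance-example scores, and the fusion weight of 0.5 are all assumptions made for illustration; the slides only state that the fusion is linear. Higher scores are assumed to mean better matches.

```python
def select_relevance_examples(hits, threshold, max_examples=3):
    """Pick the top-scoring hits above a threshold as relevance examples.

    hits: list of (utterance_id, start_frame, end_frame, raw_score) tuples
          (hypothetical layout).
    """
    confident = [h for h in hits if h[3] > threshold]
    confident.sort(key=lambda h: h[3], reverse=True)
    return confident[:max_examples]          # enforces H <= 3 by default

def fuse_scores(query_score, relevance_scores, weight=0.5):
    """Linearly fuse the original query score with relevance-example scores."""
    if not relevance_scores:
        return query_score                   # no feedback available
    feedback = sum(relevance_scores) / len(relevance_scores)
    return (1.0 - weight) * query_score + weight * feedback
```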
Score normalization for each query:
$$\hat{s}_{q,t} = (s_{q,t} - \mu_q) / \delta_q$$
$s_{q,t}$ is the score of the $q$th query on the $t$th hit region.
$\mu_q$ and $\delta_q^2$ are the mean and variance of the scores for the $q$th
query estimated from the development data.
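
A minimal sketch of the per-query normalization, assuming the divisor is the standard deviation (the square root of the variance) and that the population variance is used. Function names are hypothetical.

```python
import statistics

def fit_query_stats(dev_scores):
    """Mean and standard deviation of one query's scores on development data."""
    mean = statistics.fmean(dev_scores)
    std = statistics.pstdev(dev_scores)      # assumption: population estimate
    return mean, std

def normalize_score(score, mean, std):
    """Shift and scale a raw detection score with the query's statistics."""
    return (score - mean) / std
```

Normalizing per query puts scores from different queries on a comparable scale, so one global decision threshold can be applied.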


System Configuration and Performance
Table: System configurations and ATWV performance.

System No.            1     2     3     4     5
MFCC-GMM              √     √           √     √
MFCC-ASM              √     √           √     √
PHNREC-GMM (1)                    √     √     √
PRF                   √                 √
Score Normalization   √     √     √     √     √
devQ - devC           0.68  0.63  0.73  0.78  0.74
devQ - evlC           0.60  0.55  0.70  0.75  0.70
evlQ - devC           0.68  0.65  0.73  0.77  0.75
evlQ - evlC           0.64  0.59  0.72  0.74  0.74

Systems 1 and 2 belong to the required run condition.
Systems 3, 4 and 5 belong to the general run condition.
The best performance (system 4) is achieved when all the tokenizers, PRF and
score normalization are used.
(1) PHNREC-GMM denotes the combination of the five tandem tokenizers: CZ-GMM,
HU-GMM, RU-GMM, MA-GMM, and EN-GMM.



Supervised tokenizers perform better than unsupervised tokenizers.
Training data for the unsupervised tokenizers is limited to the DEV audio in this task,
whereas the supervised tokenizers draw on corpora from rich-resource languages.
The PTDTW framework provides a flexible way to combine all these resources.



Combination of supervised tokenizers and unsupervised tokenizers leads to
consistent improvement.
Pseudo-relevance Feedback provides consistent improvement.


Conclusion

A PTDTW framework was proposed for the query-by-example
STD task in this evaluation.
Supervised tokenizers performed better than unsupervised
tokenizers for this task. The combination of supervised and
unsupervised tokenizers provided consistent gain.
Pseudo-relevance feedback and score normalization were used.


Acknowledgement

We thank Cheung-Chi Leung from IIR for helpful discussions.
We thank the organizers for organizing this evaluation.
We thank BUT for sharing the phoneme recognizers and scripts.
This research is partially supported by the General Research
Funds (Ref: 414010 and 413811) from the Hong Kong Research
Grants Council.


Thank you!


References

Hazen, T., Shen, W., and White, C. (2009).
Query-by-example spoken term detection using phonetic posteriorgram templates.
In ASRU.
Lee, C., Soong, F., and Juang, B. (1988).
A segment model based approach to speech recognition.
In ICASSP.
Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., and Rajput, N. (2012).
The spoken web search task.
In MediaEval 2012 Workshop.
Schwarz, P. (2009).
Phoneme recognition based on long temporal context.
PhD thesis.
Siu, M., Gish, H., Chan, A., and Belfield, W. (2010).
Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without
supervision.
In INTERSPEECH.
Wang, H., Leung, C.-C., Lee, T., Li, H., and Ma, B. (2012).
An acoustic segment modeling approach to query-by-example spoken term detection.
In ICASSP.
Zhang, Y. and Glass, J. (2009).
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams.
In ASRU.
