CUHK System for the Spoken Web Search task at MediaEval 2012


  1. The CUHK Systems for the Spoken Web Search Task at MediaEval 2012. Haipeng Wang and Tan Lee, Department of Electronic Engineering, The Chinese University of Hong Kong. September 30, 2012.
  2. Outline
     1 Overview
     2 System Description: PTDTW framework; Tokenizers; DTW detection; Pseudo-relevance Feedback and Score Normalization
     3 System configuration and performance
     4 Conclusion
     5 Acknowledgement
  3. Overview
     2012 Spoken Web Search task [Metze et al., 2012]:
     - QbyE STD: audio search using audio queries.
     - Multilingual: four South African languages.
     - Low-resource: less than 4 hours of DEV audio data in total.
     - Extreme case: one example for each query term.
     Overview of our systems:
     - Aiming at a language-independent QbyE STD system.
     - Multiple resources: 1) the DEV audio data; 2) rich-resource languages.
     - Combining different resources: the PTDTW framework.
     - Pseudo-relevance feedback (PRF).
     - Score normalization.
  4. Posteriorgram-based template matching
     Figure: Posteriorgram-based template matching [Hazen et al., 2009]. (Block diagram: training resources train a tokenizer; the tokenizer converts the query example and the test utterance into posteriorgrams, which are matched by DTW to produce a detection score.)
     - Training resources: audio data with or without transcriptions.
     - Tokenizer: unsupervised if trained without transcriptions; otherwise supervised.
     - Posteriorgrams: more robust than spectral features.
     - How can different resources be combined effectively?
  5. PTDTW framework
     Figure: PTDTW framework. (Block diagram: N parallel tokenizers each convert the query example and the test utterance into posteriorgrams and yield a DTW distance matrix D1, ..., DN; these are combined into a raw distance matrix D, on which DTW detection produces the score.)
     - Parallel tokenizers followed by DTW detection (PTDTW).
     - Modified from the posteriorgram-based template matching approach.
     - Key idea: combining DTW distance matrices.
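The key idea above, combining the per-tokenizer DTW distance matrices D1..DN into one matrix D, can be sketched as follows. The negated-log inner-product distance is a common choice for posteriorgrams and matches the "inner-product distance" named on the DTW detection slide; treating the combination as an unweighted average is an assumption, since the slides do not specify fusion weights.

```python
import numpy as np

def inner_product_distance(query_post, test_post):
    """Frame-level distance matrix between two posteriorgram sequences.

    Uses -log of the inner product (an illustrative assumption; the
    slides only say "inner-product distance").
    """
    sim = query_post @ test_post.T           # (I, J) inner products
    return -np.log(np.maximum(sim, 1e-10))   # guard against log(0)

def combined_distance_matrix(query_posts, test_posts):
    """Average the per-tokenizer distance matrices D1..DN into D."""
    mats = [inner_product_distance(q, t) for q, t in zip(query_posts, test_posts)]
    return np.mean(mats, axis=0)
```

DTW detection is then run once on the combined matrix D rather than once per tokenizer.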
  6. Unsupervised tokenizers
     MFCC-GMM tokenizer [Zhang and Glass, 2009]:
     - Unsupervised training on the DEV data without transcriptions.
     - 1024 Gaussian components.
     - 39-dim MFCC + MVN + VTLN.
     MFCC-ASM tokenizer [Lee et al., 1988, Wang et al., 2012]:
     - Acoustic segment model, also known as self-organized unit (SOU) [Siu et al., 2010].
     - Unsupervised training on the DEV data without transcriptions.
     - 256 ASM units; each unit has 3 states, with 16 Gaussian components per state.
     - 39-dim MFCC + MVN + VTLN.
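A minimal numpy sketch of how a GMM tokenizer such as MFCC-GMM turns MFCC frames into a posteriorgram, assuming diagonal covariances and already-trained parameters (the actual system trains 1024 components on the DEV data; training is omitted here):

```python
import numpy as np

def gmm_posteriorgram(frames, means, variances, weights):
    """Per-frame posterior over GMM components (diagonal covariances).

    frames: (T, D) features (e.g. 39-dim MFCC after MVN/VTLN)
    means, variances: (M, D); weights: (M,)
    Returns a (T, M) posteriorgram whose rows sum to 1.
    """
    # log N(x | mu_m, diag(var_m)) for every frame/component pair
    diff = frames[:, None, :] - means[None, :, :]               # (T, M, D)
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)  # (M,)
    log_lik = log_norm[None, :] - 0.5 * (diff**2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)              # stable softmax
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

The ASM tokenizer differs in that its posteriorgrams come from HMM state or unit posteriors rather than single Gaussian components.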
  7. Phoneme recognizers
     - Czech, Hungarian, and Russian phoneme recognizers developed by BUT [Schwarz, 2009]; trained on the SpeechDat-E corpora.
     - Mandarin phoneme recognizer: 179 tonal phonemes; about 15 hours of training data from the CallHome and CallFriend corpora.
     - English phoneme recognizer: 40 phonemes; about 15 hours of training data from the Fisher and Switchboard Cellular corpora.
  8. Phoneme recognizers (tandem structure)
     Figure: Tandem structure. (Pipeline: input data passes through the phoneme recognizers; the logarithm of the phoneme posteriors is taken, followed by a PCA transform and a GMM producing Gaussian posteriorgrams.)
     - 256 Gaussian components trained on the DEV data.
     - Using the tandem structure, we obtain 5 tokenizers: CZ-GMM, HU-GMM, RU-GMM, MA-GMM and EN-GMM.
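The tandem front-end above can be sketched compactly: take the logarithm of the phoneme posteriors and apply a PCA transform; the final 256-component GMM stage (which produces the actual posteriorgrams) is omitted. Fitting the PCA on the same data it transforms is a simplification for brevity; in practice the transform would be estimated on the training data.

```python
import numpy as np

def tandem_features(phoneme_posteriors, pca_dim, eps=1e-10):
    """Tandem front-end sketch: log of phoneme posteriors, then PCA.

    phoneme_posteriors: (T, P) per-frame posteriors from a phoneme
    recognizer. Returns (T, pca_dim) features, which a GMM trained on
    the DEV data would then convert into posteriorgrams.
    """
    logp = np.log(phoneme_posteriors + eps)      # take logarithm
    logp -= logp.mean(axis=0, keepdims=True)     # center before PCA
    # PCA via SVD of the centered data; rows of vt are principal axes
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ vt[:pca_dim].T                 # project to pca_dim
```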
  9. DTW detection
     - DTW detection is performed with a sliding window.
     - Find the path minimizing the normalized distance:
       \hat{d} = \min_{K, i(k), j(k)} \frac{1}{Z(w)} \sum_{k=1}^{K} d(i(k), j(k)) \, w_k
       where d(i(k), j(k)) is set to the inner-product distance, w_k = 1, and Z(w) = K.
     - Additional constraint: |i(k) - j(k)| <= R.
     - Due to the large variation of query lengths, R is not set to a fixed number but is proportional to the query length I: R = α × I (α = 1/3 in our systems).
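The normalized-distance DTW above can be sketched as follows, for one position of the sliding window. Dividing the accumulated cost by the length of the best-cost path is a simplification of the joint minimization over K, and the step pattern (horizontal, vertical, diagonal moves) is an assumption, since the slides do not specify it.

```python
import numpy as np

def band_dtw(dist, alpha=1/3):
    """Normalized DTW distance with band constraint |i(k) - j(k)| <= R.

    dist: (I, J) frame-level distance matrix for one window position.
    R is proportional to the query length I: R = alpha * I (alpha = 1/3).
    With w_k = 1 and Z(w) = K, the total cost is divided by the path
    length K along the best accumulated-cost path.
    """
    I, J = dist.shape
    R = max(1, int(round(alpha * I)))
    cost = np.full((I, J), np.inf)
    steps = np.zeros((I, J), dtype=int)
    for i in range(I):
        for j in range(J):
            if abs(i - j) > R:          # outside the band
                continue
            if i == 0 and j == 0:
                cost[i, j], steps[i, j] = dist[i, j], 1
                continue
            best, k = np.inf, 0
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi, pj] < best:
                    best, k = cost[pi, pj], steps[pi, pj]
            if best < np.inf:
                cost[i, j] = best + dist[i, j]
                steps[i, j] = k + 1
    return cost[I - 1, J - 1] / steps[I - 1, J - 1]
```

A zero-distance diagonal yields a normalized distance of 0, and a uniform matrix of ones yields 1, as expected from the 1/K normalization.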
  10. Pseudo-relevance Feedback and Score Normalization
      Pseudo-relevance feedback, for each query:
      1) The top H hits from all the test utterances were selected as relevance examples. Selection criteria: a) H <= 3; b) the raw detection score should be larger than a pre-set threshold.
      2) The relevance examples were used to score the top Ĥ hits (Ĥ = 2 for this task) from each test utterance.
      3) The scores obtained by the relevance examples were linearly fused with the scores of the original query examples.
      Score normalization, for each query:
      \hat{s}_{q,t} = (s_{q,t} - \mu_q) / \delta_q
      where s_{q,t} is the score of the q-th query on the t-th hit region, and \mu_q and \delta_q^2 are the mean and variance of the scores for the q-th query, estimated from the development data.
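Both post-processing steps reduce to a few lines. Here μ_q and δ_q are passed in directly (the system estimates them per query from the development data), and the fusion weight lam is a placeholder assumption, since the slides only state that the two score sets are linearly fused.

```python
import numpy as np

def normalize_scores(scores, mu_q, delta_q):
    """Query-dependent normalization: s_hat = (s - mu_q) / delta_q."""
    return (np.asarray(scores, dtype=float) - mu_q) / delta_q

def fuse_prf_scores(original, prf, lam=0.5):
    """Linear fusion of original-query and relevance-example scores.

    lam is an assumed weight; the slides do not give its value.
    """
    return (lam * np.asarray(original, dtype=float)
            + (1 - lam) * np.asarray(prf, dtype=float))
```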
  11. System Configuration and Performance
      Table: System configurations and ATWV performance.
      System No.            1      2      3      4      5
      MFCC-GMM              √             √      √      √
      MFCC-ASM                     √      √      √      √
      PHNREC-GMM*                         √      √      √
      PRF                                        √      √
      Score Normalization   √      √      √      √      √
      devQ - devC          0.68   0.63   0.73   0.78   0.74
      devQ - evlC          0.60   0.55   0.70   0.75   0.70
      evlQ - devC          0.68   0.65   0.73   0.77   0.75
      evlQ - evlC          0.64   0.59   0.72   0.74   0.74
      - Systems 1 and 2 belong to the required run condition; systems 3, 4 and 5 belong to the general run condition.
      - The best performance (system 4) is achieved when all the tokenizers, PRF and score normalization are used.
      * PHNREC-GMM denotes the combination of the five tandem tokenizers: CZ-GMM, HU-GMM, RU-GMM, MA-GMM, and EN-GMM.
  12. System Configuration and Performance (cont.)
      Table: System configurations and ATWV performance (same table as on the previous slide).
      - Supervised tokenizers perform better than the unsupervised tokenizers.
      - Training resources for the unsupervised tokenizers are limited in this task, but not for the supervised tokenizers.
      - The PTDTW framework provides a flexible way to combine all these resources.
  13. System Configuration and Performance (cont.)
      Table: System configurations and ATWV performance (same table as on the previous slide).
      - Combining supervised and unsupervised tokenizers leads to consistent improvement.
      - Pseudo-relevance feedback provides consistent improvement.
  14. Conclusion
      - A PTDTW framework was proposed for the query-by-example STD task in this evaluation.
      - Supervised tokenizers performed better than unsupervised tokenizers for this task.
      - The combination of supervised and unsupervised tokenizers provided consistent gains.
      - Pseudo-relevance feedback and score normalization were used.
  15. Acknowledgement
      - Thanks to Cheung-Chi Leung from IIR for helpful discussions.
      - Thanks to the organizers for organizing this evaluation.
      - Thanks to BUT for sharing the phoneme recognizers and scripts.
      - This research is partially supported by the General Research Funds (Ref: 414010 and 413811) from the Hong Kong Research Grants Council.
  16. Thank you!
  17. References
      - Hazen, T., Shen, W., and White, C. (2009). Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU.
      - Lee, C., Soong, F., and Juang, B. (1988). A segment model based approach to speech recognition. In ICASSP.
      - Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., and Rajput, N. (2012). The spoken web search task. In MediaEval 2012 Workshop.
      - Schwarz, P. (2009). Phoneme recognition based on long temporal context. PhD thesis.
      - Siu, M., Gish, H., Chan, A., and Belfield, W. (2010). Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without supervision. In INTERSPEECH.
      - Wang, H., Leung, C., Lee, T., Li, H., and Ma, B. (2012). An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP.
      - Zhang, Y. and Glass, J. (2009). Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In ASRU.
