Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The L2F Spoken Web Search system for Mediaeval 2012

764 views

Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

The L2F Spoken Web Search system for Mediaeval 2012

  1. 1. The  L 2F  Spoken  Web  Search  system   for  Mediaeval  2012   Alberto  Abad  and  Ramón  F.  Astudillo   L2F  -­‐  Spoken  Language  Systems  Lab     INESC-­‐ID  Lisboa,  Portugal   alberto@l2f.inesc-­‐id.pt   Mediaeval  2012  Workshop  Pisa,  October  4,  2012   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 1  
  2. 2. IntroducJon    In  this  first  L2F/INESC-­‐ID  parJcipaJon  on  SWS  our  main  objecJves  were:   •  To  learn   •  To  have  fun   •  To  build  a  reasonable  system  (in  an  unreasonable  reduced  amount  of  Jme)    The  submiVed  L2F  SWS  system  exploits  hybrid  ANN/HMM  connecJonist   methods  for  both  query  tokenizaJon  and  acousJc  keyword  search   •  Composed  by  the  fusion  of  4  phoneJc-­‐based  SWS  sub-­‐systems   o  Based  on  our  in-­‐house  ASR  system  named  AUDIMUS   •  Each  sub-­‐system  uses  different  language-­‐dependent  acousJc  models:   o  European  Portuguese  (pt),  Brazilian  Portuguese  (br),  European  Spanish  (es)  and   American  English  (en)   •  Different  detecJon  score  normalizaJon  and  fusion  methods  invesJgated   o  SubmiVed  system  applies  per-­‐query  score  normalizaJon  (Q-­‐norm)  and  majority  voJng   (MV)  fusion   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 2  
  3. 3. The  baseline  speech  recognizer    AUDIMUS  is  our  in-­‐house  hybrid  HMM/MLP  speech  recognizer   •  Feature  extrac@on  MulJ-­‐stream  26  PLP,  26  logRASTA-­‐PLP,  28  MSG  and  39  ETSI  (only  8kHz)   •  MLP  Several  context  input  frames  (13-­‐15),  2  hidden-­‐layers  (500  units)  and  1  output  layer.   •  Output  layer  size  39  for  pt,  40  for  br,  30  for  es,  41  for  en.   •  Data  pt  115  hours  (57  BN+58  telephone);    br  13  hours  of  BN  data;  es  57hours  (36  BN +21  telephone);  en  142  hours  of  BN  data  (HUB-­‐4  96  &  97)   •  HMM  topology  Single-­‐state  phonemes  (3  frames  minimum  duraJon)   •  Decoder  Uses  Weighted  Finite-­‐State  Transducer  (WFST)  approach   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 3  
  4. 4. Spoken  Query  TokenizaJon      For  each  sub-­‐system,  obtain  a  phoneJc   tokenizaJon  of  the  queries   •  Similar  to  LID  parallel  phonotacJc   approaches    Use  a  phone-­‐loop  grammar  with   phoneme  minimum  duraJon  of  3   frames  to  obtain  1-­‐best  phoneme  chain   •  AlternaJve  n-­‐best  hypothesis  for   charactering  each  query  were  explored  with   unsaJsfactory  results   o  Other  possibiliJes  not  explored  (yet):  lakce,  CN,  etc…     •  Influence  of  the  word  inserJon  penalty   (wip)  parameter  on  the  tokenizaJon  result     The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 4  
  5. 5. Spoken  Query  Search    Spoken  query  search  based  on  AKWS  with  our  hybrid  speech  recognizer.      Search  window  of  5  seconds  (2.5  seconds  Jme  shio)   •  Convert  the  problem  in  a  verificaJon  task  (originally  for  with  forced  alignment)   •  Convenient  for  fusion  purposes    Equally-­‐likely  1-­‐gram  LM  with  target  Query  and  Background  word:   •  Query  word     o  Described  by  the  phoneJc  units  obtained  in  the  previous  tokenizaJon  stage   •  Background  word     o  Described  by  the  special  phoneJc  class  background/filler   o  Minimum  duraJon  set  to  250  msec.      DetecJon  score  of  detected  candidates  computed  as  the  average   phoneJc  log-­‐likelihood  raJos  of  the  query  term     The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 5  
  6. 6. Spoken  Query  Search  BG  modelling  with  HMM/MLP    Possible  approaches   •  Re-­‐train  MLP  ✖   •  Compute  posterior  probability  of  the  background  class  depending  on  the  other   classes  ✔   o  Mean  probability  of  the  top-­‐N  most  likely  outputs    For  the  SWS  system,     •  We  compute  the  average  in  the  likelihood  domain  (top-­‐6)   o  The  decoder  operates  in  the  likelihood  domain,  so  there  is  not  need  for  and  add-­‐hoc   esJmaJon  of  the  BG  class  prior   •  We  use  a  background  scale  term  β  (exponenJal  in  the  likelihood  domain)  to   control  the  weight  of  the  BG  model  vs.  Query   •  This  β  scale  together  with  the  wip  term  strongly  affects  searching  results   o  Adjusted  following  a  non-­‐exhausJve  greedy  search   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 6  
  7. 7. Score  normalizaJon,  fusion  and  calibraJon    Score  normaliza@on  schemes  explored:   •  Q-­‐norm  Assume  that  the  scores  are  dependent  of  the  queries  and   apply  a  by-­‐query  normalizaJon   •  F-­‐norm  Assume  that  the  scores  are  dependent  of  the  data  file  (of   the  collecJon)  and  apply  a  by-­‐file  normalizaJon   •  CombinaJons  (QF-­‐norm,  FQ-­‐norm)    Fusion  schemes  explored:   •  Candidate  detecJons  from  the  4  parallel  sub-­‐systems  are  kept  (or   rejected)  according  to  simple  combinaJon  rules:   o  AND  (all),  OR  (at  least  1  sys)  and  MV  (at  least  2  sys)   •  Final  detecJon  score  is  the  mean  score      Decision  threshold  set  according  to  maxATWV  in  dev-­‐dev   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 7  
  8. 8. Development  experiments  Sub-­‐systems  performance   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 8  
  9. 9. Development  experiments  Score  normalizaJon  strategies   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 9  
  10. 10. Development  experiments  Fusion  strategies  (aoer  Q-­‐norm)   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 10  
  11. 11. Official  L2F  SWS2012  results   Combined DET Plot Combined DET Plot Combined DET Plot 98 98 98 Random Performance Random Performance Random Performance Term Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.486 Scr=-0.135 Term Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.633 Scr=-0.435 Term Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.523 Scr=-0.362 95 95 95 Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.486 Scr=-0.135 Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.633 Scr=-0.435 Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.523 Scr=-0.362 90 90 90 80 80 80Miss probability (in %) Miss probability (in %) Miss probability (in %) 60 60 60 40 40 40 20 20 20 10 10 10 5 5 5 .0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40 .0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40 .0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40 False Alarm probability (in %) False Alarm probability (in %) False Alarm probability (in %) dev-­‐eval   eval-­‐dev   eval-­‐eval   The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 11  
  12. 12. Conclusions    The  L2F  Spoken  Web  Search  system  fully  exploits  hybrid  ANN/HMM   speech  recogniJon:   •  Query  tokenizaJon  based  on  1-­‐best  phoneJc  decoding   •  Query  search  based  on  AKWS  (with  no  need  for  AM  re-­‐training)    The  submiVed  system  is  formed  by  the  fusion  of  four  language-­‐ dependent  sub-­‐systems:   •  Q-­‐norm  score  normalizaJon  is  applied  to  each  individual  sub-­‐system   •  Fusion  is  done  following  a  majority  voJng  strategy    The  system  achieves  an  actual  ATWV  score  of  0.5195  in  the  eval-­‐eval     •  Promising  given  the  simplicity  of  the  proposed  system   •  Robust  to  query  and  collecJon  sets   o  Best  performance  is  achieved  in  a  mismatched  condiJon!!   •  Reasonably  well-­‐calibrated    Future  work  Focused  in  improved  Query  tokenizaJon,  fusion  with   other  type  of  approaches  (DTW)     The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 12  
  13. 13. technology from seed technology from seed L2 F - Spoken Language Systems LaboratoryInstituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa 13L2 F - Spoken Language Systems Laboratory

×