Automatic Language Identification

My presentation about Automatic Language Identification at Barcamp London 9.


  1. mjav meong h miaou mja m ueam m ia m uw na meow u mi m v ao já ya ? mi
  2. Automatic Language Identification (LiD)
     Alexander Hosford (@awhosford)
  3. The LiD Problem
     • The identification of a given spoken language from a speech signal
     • 94% of the global population speaks only 6% of the world's languages
  4. Real-World Situations
     • Automation of LiD is desirable
     • Offers many benefits to international service industries
     • Hotels, airports, global call centres
  5. Language Differences
     • Languages contain information that makes one discernible from another:
       – Phonemes
       – Prosody
       – Phonotactics
       – Syntax
  6. Human Abilities
     • Bias towards the native language arises in infancy
     • Prosodic features are among the first cues to be recognised
     • Humans can make a reasonable estimate of the language heard within 3–4 seconds of audition
     • Even unfamiliar languages may be plausibly judged
  7. Previous Attempts
     • Attempts as early as 1974: USAF work, therefore classified
     • Methodological investigations as early as 1977 (House & Neuburg)
     • Studies for the most part centre on phonotactic constraints and phoneme modelling
     • Raw acoustic waveforms have also been explored (Kwasny, 1993)
  8. A Simpler Approach?
     • Phonotactic approaches require expert linguistic knowledge
     • Phoneme and phonotactic modelling are time-consuming
     • Given the speed of human LiD abilities, discrimination is most likely based on acoustic features
  9. System Overview
     [Diagram: pre-processed speech → SuperCollider (utterance onset detection, MFCC vectors, pitch contour, nPVI generation) → ARFF file → systemCmd → WEKA tools]
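The pipeline on this slide writes extracted feature vectors to an ARFF file for WEKA. A minimal sketch of that step in Python (the deck does this from SuperCollider; the helper name, attribute names, and class labels below are illustrative):

```python
def write_arff(path, relation, attr_names, classes, rows):
    """Write numeric feature vectors plus a nominal class attribute
    in WEKA's ARFF format. rows is a list of (features, label) pairs."""
    with open(path, "w") as f:
        f.write("@RELATION %s\n\n" % relation)
        for name in attr_names:
            f.write("@ATTRIBUTE %s NUMERIC\n" % name)
        # Nominal class attribute listing the candidate languages
        f.write("@ATTRIBUTE language {%s}\n\n" % ",".join(classes))
        f.write("@DATA\n")
        for features, label in rows:
            f.write(",".join(str(v) for v in features) + "," + label + "\n")
```

Each @DATA row is one utterance's feature vector followed by its language label, which is exactly the shape WEKA's classifiers expect.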
  10. Feature Extraction
      • Spectral information: MFCC vectors
      • Pitch contour information
      • Handled by the SCMIR library (Collins, 2010)
      • Speech rhythm: Normalised Pairwise Variability Index (nPVI)

        Feature Type        Implemented Measure
        Spectral content    MFCC vector uGen
        Pitch contour       Tartini uGen
        Speech rhythm       nPVI function
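The nPVI named on this slide can be computed directly from successive interval durations. A sketch of the standard Grabe & Low formulation (the function name and example values are illustrative, not the deck's SuperCollider implementation):

```python
def npvi(durations):
    """Normalised Pairwise Variability Index over successive
    (e.g. vocalic) interval durations: the mean of
    |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2), scaled by 100."""
    if len(durations) < 2:
        raise ValueError("need at least two interval durations")
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)
```

Perfectly isochronous intervals give 0; larger values mean more alternation between long and short intervals, the kind of rhythm cue that helps separate "stress-timed" from "syllable-timed" languages.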
  11. Classification
      • Handled by the WEKA toolkit
      • Built-in Multilayer Perceptron
      • Called from the command line through a SuperCollider system command
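A sketch of the kind of command line this slide describes, assembled in Python for illustration (the deck launches it via SuperCollider's systemCmd; the jar path and 10-fold cross-validation choice are assumptions, while `weka.classifiers.functions.MultilayerPerceptron`, `-t`, and `-x` are standard WEKA command-line usage):

```python
def weka_mlp_command(train_arff, weka_jar="weka.jar"):
    """Assemble the command line that runs WEKA's built-in
    Multilayer Perceptron on a training ARFF file."""
    return [
        "java", "-cp", weka_jar,
        "weka.classifiers.functions.MultilayerPerceptron",
        "-t", train_arff,  # training set in ARFF format
        "-x", "10",        # 10-fold cross-validation
    ]
```

Returning the command as a list keeps it ready to hand to a process launcher without shell-quoting issues.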
  12. Comparisons Made
      • 66 language pairs from 12 languages
      • A comparison within language families
      • A comparison of all 12 languages
      • Averaged and segmented data
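The "66 language pairs" figure is the number of unordered pairs drawn from 12 languages, C(12, 2) = 12 × 11 / 2 = 66 (the labels below are placeholders, not the deck's actual language set):

```python
from itertools import combinations

# 12 placeholder language labels; C(12, 2) = 66 unordered pairs
languages = ["lang%02d" % i for i in range(1, 13)]
pairs = list(combinations(languages, 2))
print(len(pairs))  # 66
```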
  13. Results Within Families
      [Chart: classification accuracy (%) for the Germanic, Romance, Slavic, and Sino-Altaic families plus their mean, across the feature sets 4 MFCC, 13 MFCC + Tartini, 41 MFCC + Tartini, and all features. Data averaged across files.]
  14. Results Within Families
      [Chart: classification accuracy (%) for the Germanic, Romance, Slavic, and Sino-Altaic families plus their mean, across the feature sets 4 MFCC, 13 MFCC + Tartini, and 41 MFCC + Tartini. Data in 5-second segments.]
  15. Results From All Languages
      [Chart: mean classification accuracy (%) for averaged data versus 5-second segments, across the feature sets 4, 13, and 41 MFCC, each alone, with Tartini, and with Tartini + nPVI.]
  16. Extensions
      • nPVI function to use vowel onsets
      • Phonemic segmentation
      • A larger dataset
      • Better efficiency
      • Real-time operation
  17. Robustness
      • Real-world signals are very different from processed 'clean' data
      • 'Ideal' LiD systems are independent and robust
      • A need to analyse only the part of the signal that matters
  18. CASA
      • A computational model of human 'Auditory Scene Analysis' (Bregman, 1990)
      • Separation of the signal into component parts and reconstitution into meaningful 'streams'
  19. Using Noisy Signals
      I'm open to suggestions!
  20. alexanderhosford@googlemail.com
      07814 692 549
      @awhosford
