Khmer ASR system
Sethserey SAM
sam.sethserey@itc.edu.kh
Part I: ASR in general
o    Definition
o    Type of ASR
o    ASR flow chart
o    Data requirement
o    Performance of ASR systems
o    Fundamental methods to create ASR system

What is an ASR system?
o  ASR: automatic speech recognition
o  An ASR system is a system or tool that
   converts an audio stream containing speech
   into text.

[Figure: audio input → ASR System → text output; candidate outputs: "Seven", "Seven days", "Zaven", ...]
ASR: what for?
o  ASR systems improve your life (work,
   business, communication, etc.)
Typology of ASR systems
o  Speaker-dependent vs. -independent

o  Language constraints:
  n    isolated word recognition
  n    connected word recognition
  n    keyword spotting
  n    continuous speech recognition
o  Vocabulary size: small (100 words), medium (5 000), large (50 000)


o  Robustness constraints
  n    laboratory (office) conditions: imposed
  n    microphone, channel noise …

Levels of complexity




ASR flow chart



[Figure: speech signal ("seven") → signal processing (digitizing & feature extraction) → decoding/searching → text hypotheses ("Seven", "Seven days", "Zaven", ...); both stages together form the ASR system]
ASR data requirement
o  To train the acoustic model (AM) and the language model (LM),
   huge amounts of data (text & audio) are needed.

[Figure: three resources: audio + transcription data, pronunciation dictionary, text data]
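As a reminder of how these three resources fit together, the usual statistical formulation of the decoding step can be written as below. This is the standard textbook decision rule, not something stated on the slides; the symbols X (sequence of acoustic feature vectors) and W (candidate word sequence) are introduced here for illustration.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \; p(X \mid W)\, P(W)
```

Here p(X | W) is the acoustic model, trained on audio + transcription data and linked to words through the pronunciation dictionary, and P(W) is the language model, trained on text data.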
ASR Performance
o    English ASR system evaluations at the National Institute of
     Standards and Technology (NIST)




Causes of ASR’s error rate
                         “seven”




o  Current ASR systems for continuous speech
   cannot reach a 0% word error rate (WER). Why?
  n  The acoustic model is affected by speaker characteristics and
      environment: gender, age, emotion, pitch, accent,
      physical state, channel noise, etc.
  n  The lexical model is affected by incorrect word
      pronunciations.
  n  The language model is affected by incorrect word usage
      and grammar mistakes.
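Since WER is the metric used throughout these slides, a minimal sketch of how it is commonly computed may help: the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; Khmer text would first need word segmentation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using word-level Levenshtein distance.

    Assumes both strings are already segmented into words by spaces.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("seven days a week", "seven day a week"))  # one substitution out of four words
```

Note that WER can exceed 100% when the hypothesis contains many inserted words.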
Three fundamental methods for
creating a new ASR system

o  Enough training data → bootstrapping
o  Small amount of data → adaptation
o  No data → cross-language transfer
Part II:
Khmer language & its processing
o  Khmer language
o  Why research on Khmer ASR?




Khmer Language
o    Official language of Cambodia
o    Spoken by more than 15 M people
o    An atonal language
o    Writing system
     n  33 consonants, 23 dependent vowels
     n  14 independent vowels, 13 diacritics and various signs
     n  No explicit word boundary
Why research on Khmer ASR?
o  An under-resourced language
    n  Lack of text and speech data in digital form
    n  Lack of linguistic documents (both soft and hard
        copies)
o  No explicit word segmentation
    n  Automatic word segmentation is needed
    n  State-of-the-art segmentation methods use
        –  hand-crafted lexicons, word frequencies,
        –  optimization criteria …
o  Other under-resourced, unsegmented
   languages in the region: Burmese, Lao, Thai,
   Vietnamese
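To make the combination of lexicon, word frequencies and optimization criterion concrete, here is a minimal sketch of one common approach: dynamic programming that picks the segmentation with the highest unigram probability. This is an illustration under stated assumptions, not the segmenter actually used for Khmer; the function `segment`, the toy Latin-script lexicon and its counts are all hypothetical.

```python
import math


def segment(text: str, word_freq: dict, max_word_len: int = 20) -> list:
    """Split unsegmented text into lexicon words, maximizing the sum of
    unigram log-probabilities (a Viterbi-style search)."""
    total = sum(word_freq.values())
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_freq and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_freq[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this lexicon")

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical lexicon with made-up counts, for illustration only.
toy_lexicon = {"seven": 50, "days": 30, "day": 40, "s": 1,
               "even": 20, "a": 60, "week": 25}
print(segment("sevendaysaweek", toy_lexicon))
```

A real system would also need a strategy for out-of-vocabulary strings, which this sketch simply rejects with an error.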
Part III:
    Khmer ASR at a glance
o  Corpus
  o  Speech corpus setup
  o  Text corpus setup
  o  General overview
o  Current ASR system
o  Future work


Corpus: Speech corpus setup
o  Two types of corpus:
  n  Small transcribed corpus (2007-2008)
     o  Transcribed manually by engineering students at ITC
     o  Only 6 hours of transcribed signal
     o  Nature: radio broadcasts (poor quality) downloaded from
        Radio Australia, Radio Free Asia and Voice of America

  n  Large transcribed corpus (2011)
     o    Text and corresponding speech were already available
     o    Students helped verify the transcription
     o    21 hours of transcribed signal
     o    Nature: read speech from newspapers
Corpus: Text corpus setup
o  Retrieving text from the Web is becoming a common approach
o  Well-selected, content-rich websites vs. crawling the Web
o  Adapting ClipsTextTk, an open-source tool for corpus creation, to the
   Khmer language
      n    Conversion from legacy character encodings to Unicode
      n    Automatic segmentation
      n    Conversion of special signs and numbers to text
      n    Normalization of word spelling
o  Text corpus obtained from 5 sites:
      n    2,5000 HTML pages retrieved
      n    After processing: 0.5 M sentences, 15 M words
      n    Duration: November 2007 – January 2008
Corpus: Overview
o  Description of the Khmer ASR corpus
 Type               Small corpus             Large corpus
 Signal             ~6 h of transcribed      ~20 h of transcribed
 (acoustic model)   signal (radio)           signal (read speech)
 Text               0.5 million sentences,   To be improved
 (language model)   ~15.5 million words
 Pronunciation      ~20 000 words            To be improved
 dictionary
 (lexical model)
Current ASR system
Continuous ASR   Training &           Word Error Rate (%)
system           testing corpus
                                      Context          Context
                                      dependent        dependent
                                      (8 Gaussians)    (16 Gaussians)
Khmer ASR v1     - LM: 15.5M words    42.5             40.3
                 - Training AM: 5h
                 - Testing: 172 p
Khmer ASR v2     - LM: 15M words      36.4             35
                 - Training AM: 20h
                 - Testing: 290 p
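To read the table above, it can help to express the v1-to-v2 improvement as a relative WER reduction. The short sketch below does that arithmetic on the four WER figures from the table; the helper name `relative_wer_reduction` is introduced here for illustration. Keep in mind that the two versions were tested on different sets (172 p vs. 290 p), so the comparison is only indicative.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, going from baseline to new system."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (v1 WER, v2 WER) for each acoustic-model configuration in the table
results = {"8 Gaussians": (42.5, 36.4), "16 Gaussians": (40.3, 35.0)}

for config, (v1, v2) in results.items():
    print(f"{config}: {v1 - v2:.1f} points absolute, "
          f"{relative_wer_reduction(v1, v2):.1f}% relative")
```

This works out to roughly a 14% relative reduction with 8 Gaussians and roughly 13% with 16 Gaussians.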
Future Work
o  Collect more text data for the language
   model
o  Next challenge: how to improve
   Khmer ASR for speaker-independent use
   and in different environments?
THANK YOU!!



