Nishimoto Interspeech 2010 v3

  1. The comparison between the Deletion-Based Method and the Mixing-Based Method for Audio CAPTCHAs
     Takuya NISHIMOTO (Univ. Tokyo, Japan), Takayuki WATANABE (TWCU, Japan)
     Interspeech 2010, Mon-Ses2-P3
  2. CAPTCHA
     - Completely Automated Public Turing test to tell Computers and Humans Apart
       - a popular security technique on the Web
       - prevents automated programs from abusing services
     - image-based CAPTCHAs
       - image containing distorted characters
       - prevents use by persons with visual disabilities
       - audio CAPTCHAs were created
     - create better audio CAPTCHA tasks
       - safeness: the difference in recognition performance between human and machine
       - usability: mental workload of the human in listening to speech
  3. Performance gap model
     - performance of machine should be lower than the intelligibility of human
       - gap: safeness
       - should be large
     - exposed ratio (ER)
       - 0%: random answer; chance-level; no gap
       - 100%: best guess; easy for both; no gap
     - practical condition
       - 0 < ER < 100
     - [Figure: intelligibility (%) of human and ASR versus Exposed Ratio (%) (provided information); both axes 0 to 100]
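The gap model above can be sketched as a small helper that, given measured human and machine scores at several exposed ratios, reports where the gap is widest. The function names and the example numbers are hypothetical and only illustrate the shape of the model (no gap at ER = 0% and ER = 100%); they are not results from the slides.

```python
def performance_gap(human_pct, machine_pct):
    """Gap in percentage points between human and machine scores."""
    return human_pct - machine_pct


def widest_gap(measurements):
    """Return (exposed_ratio, gap) for the condition with the largest gap.

    measurements maps an exposed ratio (%) to a (human_pct, machine_pct) pair.
    """
    best = max(measurements, key=lambda er: performance_gap(*measurements[er]))
    return best, performance_gap(*measurements[best])


# Hypothetical numbers for illustration only; not data from the slides.
example = {
    0: (10.0, 10.0),      # chance level for both: no gap
    50: (80.0, 30.0),     # practical condition: some gap
    100: (100.0, 100.0),  # easy for both: no gap
}
print(widest_gap(example))
```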
  4. Safeness: ER control
     - machine is becoming strong
       - statistical ASR method is the mainstream
       - supervised machine learning (Hidden Markov Models)
       - techniques to cope with the noise
     - CAPTCHA tasks should be created systematically
       - they should not be created by trial and error
       - controllability of Exposed Ratio is essential
     - Mixing-based method: best way to control ER?
       - mixing noises / distorting signals
       - can hide a portion of the information, however...
       - difficult to measure the ER, performance is not easy to predict
       - alternatives must be investigated
  5. Usability: Mental workload
     - CAPTCHAs should not increase mental workload
     - the workload may increase if the task is...
       - difficult to listen to / memorize
       - long (many characters)
         - difficult to remember
         - safer, but higher mental workload
     - requirements
       - information can be obtained in a short time, easily
     - investigation required
       - human auditory sensation
       - language cognition
  6. Top-down knowledge
     - incomplete stimulus
       - knowledge helps to guess the information
     - visual sensation
       - if part of the image is missing, or part of the word is hidden
       - common knowledge can complement the image
       - about the character and the vocabulary
     - speech perception
       - if "word familiarity" is high: easy to guess
       - phonemic restoration
       - may help the human listening
  7. Deletion-based method
     - delete some parts on the temporal axis little by little
       - if 30 msec out of every 100 msec period is replaced with silence, 30% of the information is deleted (D70)
       - if the ratio of remaining sections goes down, the degree of listening difficulty may increase
     - Exposed Ratio can be controlled easily
     - however, not easy to understand...
     - [Sample labels on slide: deletion (original); Festival engine KAL (HMM-based)]
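The deletion step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and placing the silent segment at the start of each period is an assumption, since the slide only gives the 30 msec per 100 msec example.

```python
def exposed_ratio(period_ms, deleted_ms):
    """Percentage of each period that remains audible."""
    return 100.0 * (period_ms - deleted_ms) / period_ms


def delete_periodically(samples, sample_rate, period_ms=100, deleted_ms=30):
    """Replace the first deleted_ms of every period_ms with silence.

    Assumption: the silent segment sits at the start of each period.
    """
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    if period <= 0 or not 0 <= deleted <= period:
        raise ValueError("need 0 <= deleted_ms <= period_ms and a non-empty period")
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            out[i] = 0.0
    return out


# One second of a constant dummy signal at 1 kHz, just to count silenced samples.
signal = [1.0] * 1000
processed = delete_periodically(signal, sample_rate=1000)
print(exposed_ratio(100, 30), processed.count(0.0))
```

With the default parameters this corresponds to the D70 condition on the slide: 70% of the signal stays exposed.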
  8. Phonemic restoration
     - interrupted speech and noise maskers combined
       - the fence effect
       - continuity of the speech signal is perceived
     - may help human listening
     - does not affect machine performance
     - expected to enlarge the gap
       - performance difference of human and machine
     - [Sample label on slide: deletion + phonemic restoration]
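Combining the interrupted speech with a noise masker can be sketched by filling each deleted segment with noise instead of silence. The slides do not specify the masker, so the uniform white noise, its amplitude, the seed, and the function name below are all assumptions made for illustration.

```python
import random


def fill_gaps_with_noise(samples, sample_rate, period_ms=100, deleted_ms=30,
                         noise_amplitude=0.1, seed=0):
    """Replace the first deleted_ms of every period_ms with uniform noise.

    Assumptions: gap position (start of each period), noise type and level.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    if period <= 0 or not 0 <= deleted <= period:
        raise ValueError("need 0 <= deleted_ms <= period_ms and a non-empty period")
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            out[i] = rng.uniform(-noise_amplitude, noise_amplitude)
    return out


# Constant dummy signal: gap samples become noise, the rest is untouched.
masked = fill_gaps_with_noise([1.0] * 1000, sample_rate=1000)
```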
  9. NASA-TLX evaluation
     - mental workload
     - rating 6 subscales
       - Mental, Physical, and Temporal Demands, Frustration, Effort, and Performance
       - range: 0-100
     - weights of subscales (6-1)
       - for each participant
       - ranking the 6 dimensions by how closely they relate to the personal definition of workload
     - weighted workload (WWL)
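A weighted workload score along these lines can be sketched as a weighted average of the six ratings, with weights 6 down to 1 taken from each participant's ranking. Dividing by the sum of the weights (21) is an assumption, since the slide does not state the normalization; the subscale keys and function name are hypothetical.

```python
SUBSCALES = ("mental", "physical", "temporal",
             "frustration", "effort", "performance")


def weighted_workload(ratings, ranking):
    """Weighted workload (WWL) on a 0-100 scale.

    ratings: dict mapping each subscale to a 0-100 rating.
    ranking: subscales ordered from most to least related to workload;
             they receive weights 6, 5, ..., 1.
    Assumption: the weighted sum is divided by the sum of the weights.
    """
    if sorted(ranking) != sorted(SUBSCALES):
        raise ValueError("ranking must list each subscale exactly once")
    weights = {name: 6 - position for position, name in enumerate(ranking)}
    total = sum(weights[name] * ratings[name] for name in SUBSCALES)
    return total / sum(weights.values())


# Hypothetical participant: every subscale rated 50.
flat = {name: 50 for name in SUBSCALES}
print(weighted_workload(flat, list(SUBSCALES)))
```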
  10. Deletion vs Mixing (Exp1)
     - objective: compare intelligibility and mental workload
       - Deletion-Based Method (DBM)
       - Mixing-Based Method (MBM)
       - effect of SNR (signal-to-noise ratio) in MBM
     - human intelligibility test
       - 75 utterances: 3-, 4-, and 5-digit numbers (3 x 25)
       - Japanese recorded speech
       - subjects: 15 (5 x 3) undergraduate students
     - mental workload (WWL) by NASA-TLX
       - normalized within every subject
       - so that the average and SD become 50 and 10 respectively
     - automatic speech recognition using HMM
       - task: numbers (1-7 digits) in Japanese
       - training: 8440 utterances, 18 states, 20 mixtures
       - evaluation: 1001 utterances, sentence recognition
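The within-subject normalization of WWL scores can be sketched as a linear rescaling to mean 50 and SD 10. Using the population standard deviation is an assumption (the slide does not say which estimator was used), and the function name and sample scores are hypothetical.

```python
from statistics import mean, pstdev


def normalize_within_subject(scores, target_mean=50.0, target_sd=10.0):
    """Rescale one subject's scores to the target mean and SD.

    Assumption: population SD (pstdev) is used for the rescaling.
    """
    m = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        # All scores identical: every value maps to the target mean.
        return [target_mean for _ in scores]
    return [target_mean + target_sd * (x - m) / sd for x in scores]


# Hypothetical raw WWL scores for one subject, for illustration only.
normalized = normalize_within_subject([20.0, 35.0, 60.0, 85.0])
print(normalized)
```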
  11. Setup (Exp1)
     - compare DBM and MBM within a person
     - acoustic presentation: given by headphone
       - at the subject's preferred reference loudness level
     - MBM disturbing signals
       - utterances of Japanese sentences fragmented into short periods, shuffled and combined

     Group   Trial 1: D30   Trial 2: M0, Mm10, Mm20
     G1      DBM 30%        MBM SNR 0dB
     G2      DBM 30%        MBM SNR -10dB
     G3      DBM 30%        MBM SNR -20dB

     [Figure: MBM (Exp1), sentence recognition using HTK (%) for M0, Mm10, Mm20; vertical axis 0 to 80]
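Mixing a disturbing signal at a target SNR, as in the M0, Mm10 and Mm20 conditions, can be sketched by scaling the noise so that the ratio of average powers matches the requested value. This is a generic illustration with hypothetical function names, not the procedure used in the experiment; how the authors measured signal power is not stated in the slides.

```python
import math


def average_power(samples):
    """Mean squared amplitude of a non-empty sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain applied to the noise so that speech/noise power equals snr_db."""
    ratio = 10 ** (snr_db / 10)
    return math.sqrt(average_power(speech) / (average_power(noise) * ratio))


def mix_at_snr(speech, noise, snr_db):
    """Add scaled noise to speech, truncating both to the shorter length."""
    n = min(len(speech), len(noise))
    gain = noise_gain_for_snr(speech[:n], noise[:n], snr_db)
    return [s + gain * d for s, d in zip(speech[:n], noise[:n])]


# Dummy equal-power signals: at 0 dB the noise is left unscaled.
speech = [1.0, -1.0] * 4
noise = [1.0, 1.0, -1.0, -1.0] * 2
mixed = mix_at_snr(speech, noise, snr_db=0)
```

Lower SNR values give the noise a larger gain, so Mm20 (-20 dB) masks the speech more heavily than M0 (0 dB).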
  12. Performance (Exp1)
     - DBM (T1): marginally significant (p<0.1) (G1>G2)
       - DBM 30% task is harder than MBM 0dB, -10dB, -20dB
     - MBM (T2): effect of SNR conditions is significant, however, only between 0dB & -10dB (p<0.05) (G1>G2)
     - [Figure: per-subject results for T1 and T2; panels "DBM 30% vs MBM 0dB" (s101-s105), "DBM 30% vs MBM -10dB" (s201-s205), "DBM 30% vs MBM -20dB" (s301-s305); vertical axis 30 to 100]
  13. Workload difference (Exp1)
     - WWL: individual difference cancelled
       - the DBM (D30) score was subtracted from the MBM (M0, Mm10 and Mm20) score
     - [Figure: per-subject WWL difference; panels "DBM 30% vs MBM 0dB" (s101-s105), "DBM 30% vs MBM -10dB" (s201-s205), "DBM 30% vs MBM -20dB" (s301-s305); vertical axis -60 to 20]
     - [Figure: average WWL difference for M0-D30, Mm10-D30, Mm20-D30; vertical axis -20 to 20; values labeled on the slide: (16.2), 0.7, 1.0]
     - WWL: MBM 0dB < DBM 30%? no significance (ANOVA)
     - MBM: task difficulty is not easy to control
  14. DBM exposed ratio (Exp2)
     - DBM: Exposed Ratio can control the gap size
     - [Figure: Human Ave. (%) and Machine (%) at DBM 30%, 50%, 70%; Workload at DBM 30%, 50%, 70%; significant difference (p<0.05)]
     - DBM 30%: gap is very large, however, workload is very high
  15. Discussion
     - D30 (DBM) & Mm10 (MBM) can be the benchmarks
       - for the purpose of comparison between MBM and DBM
       - performance differences are close (43.7pt & 44.8pt)
       - WWL values are also very close (W_Mm10 - W_D30 = 0.7)
     - [Figure: performance difference between human and machine (pt) for M0, Mm10, Mm20, D70, D50, D30; vertical axis 0 to 80]
  16. Conclusion
     - audio CAPTCHA task using phonemic restoration
       - deletion-based method (DBM)
     - evaluation of CAPTCHA task
       - performance + mental workload (NASA-TLX)
     - comparison between DBM and MBM
       - DBM: easier to control the task
     - future work
       - improve the noise
       - investigation of phonemic restoration
         - really improving performance? only decreasing workload?
       - word familiarity, speech rate, synthesized speech, ...