Nishimoto Interspeech 2010 v3

The comparison between
the Deletion-Based Method
and the Mixing-Based Method
for Audio CAPTCHAs

Takuya NISHIMOTO (Univ. Tokyo, Japan)
Takayuki WATANABE (TWCU, Japan)
Interspeech 2010 Mon-Ses2-P3

1

CAPTCHA
 Completely Automated Public Turing test
to tell Computers and Humans Apart
 a popular security technique on the Web
 prevents automated programs from abusing web services
 image-based CAPTCHAs
 image containing distorted characters
 cannot be used by people with visual disabilities
 audio CAPTCHAs were created
 goal: create better audio CAPTCHA tasks
 safeness: the difference in recognition performance between humans and machines
 usability: the mental workload of humans when listening to speech

2

Performance gap model
 performance of machine should be lower than the intelligibility of human
 gap: safeness
 should be large
 exposed ratio (ER)
 0%: random answer
 chance-level; no gap
 100%: best guess
 easy for both; no gap
 practical condition
 0 < ER < 100
[Figure: Intelligibility (%) versus Exposed Ratio (%) (Provided Information), both axes from 0 to 100, with curves labeled Human and ASR]
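
The gap model above can be sketched in a few lines of Python. The function names and the numbers in `example_curves` are hypothetical and only illustrate the idea that the gap vanishes at ER = 0% and ER = 100% and is largest somewhere in between; they are not results from these slides.

```python
def performance_gap(human_pct, machine_pct):
    """Gap in percentage points between human intelligibility and machine accuracy."""
    return human_pct - machine_pct


def best_exposed_ratio(curves):
    """Return the exposed ratio (%) with the largest gap.

    curves maps an exposed ratio (%) to a (human_pct, machine_pct) pair.
    """
    return max(curves, key=lambda er: performance_gap(*curves[er]))


# Hypothetical values for illustration only (not measured data).
example_curves = {
    0: (10.0, 10.0),     # chance level for both: no gap
    50: (80.0, 30.0),    # practical condition: large gap
    100: (100.0, 100.0), # easy for both: no gap
}

print(best_exposed_ratio(example_curves))
```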
3

Safeness: ER control
 machines are becoming stronger
 statistical ASR methods are the mainstream
 supervised machine learning (Hidden Markov Models)
 techniques to cope with noise
 CAPTCHA tasks should be created systematically
 they should not be created by trial and error
 controllability of the Exposed Ratio is essential
 Mixing-based method: best way to control ER?
 mixing noises / distorting signals
 can hide a portion of the information, however...
 ER is difficult to measure, and performance is not easy to predict
 alternatives must be investigated
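
As a rough illustration of the mixing-based idea, the sketch below scales a noise signal so that the mixture reaches a requested signal-to-noise ratio. It assumes equal-length lists of float samples and a power-based SNR definition; the function names are hypothetical and this is not the authors' implementation. Note that the SNR it controls is a global waveform measure and does not say what fraction of the speech information stays exposed, which is the point made above.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Choose gain so that power(speech) / power(gain * noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(power(speech) / power(residual))
```

For example, `mix_at_snr(speech, noise, -10)` would correspond to a -10 dB condition under these assumptions.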

4

Usability: Mental workload
 CAPTCHAs should not increase mental workload
 the workload may increase if the tasks are...
 difficult to listen to / memorize
 long (many characters)
 difficult to remember
 safer, but higher mental workload
 requirements
 information can be obtained easily and in a short time
 investigation required
 human auditory sensation
 language cognition

5

Top-down knowledge
 incomplete stimulus
 knowledge helps to guess the information
 visual sensation
 if part of an image is missing, or part of a word is hidden
 common knowledge can complete the image
 about the character and the vocabulary
 speech perception
 if "word familiarity" is high: easy to guess
 phonemic restoration
 may help human listening

6

Deletion-based method
 delete short parts along the time axis, little by little
 if 30 msec out of every 100 msec is replaced with silence, 30% of the information is deleted (D70)
 if the ratio of remaining sections goes down, the degree of listening difficulty may increase
 Exposed Ratio can be controlled easily
 however, the result is not easy to understand...
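
A minimal sketch of periodic deletion, assuming the signal is a list of float samples and that the deleted portion sits at the start of each period (the slides do not say where within the period the silence is placed). With the defaults it mirrors the D70 example: 30 msec of every 100 msec is silenced, so 70% remains exposed. The function names are hypothetical.

```python
def delete_periodically(samples, sample_rate, period_ms=100, deleted_ms=30):
    """Replace the first deleted_ms of every period_ms with silence (0.0)."""
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    out = list(samples)
    for start in range(0, len(out), period):
        # Silence the deleted part of this period, clipped at the end of the signal.
        for i in range(start, min(start + deleted, len(out))):
            out[i] = 0.0
    return out


def exposed_ratio(period_ms, deleted_ms):
    """Exposed ratio in percent: the share of each period that is left intact."""
    return 100.0 * (period_ms - deleted_ms) / period_ms
```

Because the exposed ratio follows directly from the two durations, a condition such as D50 or D30 can be requested by changing `deleted_ms` alone.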
[Audio demo: deletion (original); Festival engine, KAL (HMM-based)]
7

Phonemic restoration
 interrupted speech and noise maskers combined
 the fence effect
 continuity of the speech signal is perceived
 may help human listening
 does not affect machine performance
 expected to enlarge the gap
 performance difference between human and machine

[Audio demo: deletion + phonemic restoration]
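
One way to picture "deletion + phonemic restoration" is to fill the deleted sections with a noise masker instead of silence. The sketch below assumes Gaussian noise, a fixed noise level, and deletion at the start of each period; all three are illustrative assumptions, and the function name is hypothetical.

```python
import random


def fill_gaps_with_noise(samples, sample_rate, period_ms=100, deleted_ms=30,
                         noise_level=0.1, seed=0):
    """Replace the first deleted_ms of every period_ms with Gaussian noise."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            # The speech in this section is discarded; only the masker remains.
            out[i] = rng.gauss(0.0, noise_level)
    return out
```

The speech information removed is the same as with silent gaps, so the exposed ratio is unchanged; only the filler differs.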

8

NASA-TLX evaluation
 mental workload
 ratings on 6 subscales
 Mental, Physical, and Temporal Demands, Frustration, Effort, and Performance
 range: 0-100
 weights of subscales (6 to 1)
 set for each participant
 by rank-ordering how the 6 dimensions relate to their personal definition of workload
 weighted workload (WWL)
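
A sketch of how a weighted workload score could be computed from the rank-order weights described above. It assumes the WWL is the weighted mean of the six ratings, with weights 6 down to 1 summing to 21; the slides do not state the exact formula, and the standard NASA-TLX procedure instead derives weights of 0 to 5 from 15 pairwise comparisons. Names are hypothetical.

```python
SUBSCALES = ("mental", "physical", "temporal",
             "frustration", "effort", "performance")


def weighted_workload(ratings, ranking):
    """Weighted mean of six 0-100 ratings.

    ratings: dict mapping subscale name to a 0-100 rating.
    ranking: subscale names ordered from most to least relevant to workload;
             the first gets weight 6, the last gets weight 1.
    """
    if sorted(ranking) != sorted(SUBSCALES):
        raise ValueError("ranking must list each subscale exactly once")
    weights = {name: 6 - position for position, name in enumerate(ranking)}
    total = sum(weights.values())  # 6 + 5 + 4 + 3 + 2 + 1 = 21
    return sum(weights[name] * ratings[name] for name in SUBSCALES) / total
```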

9

Deletion vs Mixing (Exp1)
 objective: compare intelligibility and mental workload
 Deletion-Based Method (DBM)
 Mixing-Based Method (MBM)
 effect of SNR (signal-to-noise ratio) in MBM
 human intelligibility test
 75 utterances: 3-, 4-, and 5-digit numbers (3 x 25)
 Japanese recorded speech
 subjects: 15 (5 x 3) undergraduate students
 mental workload (WWL) by NASA-TLX
 normalized within each subject
 so that the mean and SD become 50 and 10, respectively
 automatic speech recognition using HMM
 task: numbers (1-7 digits) in Japanese
 training: 8440 utterances, 18 states, 20 mixtures
 evaluation: 1001 utterances, sentence recognition
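
The within-subject WWL normalization mentioned above (mean 50, SD 10) can be sketched as a linear rescaling of one subject's scores. The use of the population standard deviation is an assumption, since the slides do not say which estimator was used; the function name is hypothetical.

```python
import statistics


def normalize_scores(scores, target_mean=50.0, target_sd=10.0):
    """Rescale one subject's scores to the target mean and standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # population SD (assumption)
    if sd == 0:
        raise ValueError("scores are all identical; cannot normalize")
    return [target_mean + target_sd * (s - mean) / sd for s in scores]
```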
10

Setup (Exp1)
 compare DBM and MBM within each subject
 acoustic presentation: via headphones
 at the subject’s preferred reference loudness level
 MBM disturbing signals
 utterances of Japanese sentences, fragmented into short segments, shuffled, and recombined
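
The construction of the MBM disturbing signal (fragment, shuffle, combine) can be sketched as follows. The fragment length and the random seed are illustrative assumptions, as the slides give no values, and the function name is hypothetical.

```python
import random


def fragment_and_shuffle(samples, fragment_len, seed=0):
    """Cut a signal into fixed-length fragments, shuffle them, and rejoin them."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    fragments = [samples[i:i + fragment_len]
                 for i in range(0, len(samples), fragment_len)]
    rng.shuffle(fragments)
    return [x for fragment in fragments for x in fragment]
```

The result keeps the spectral character of speech within each fragment while destroying the sentence-level content, which is presumably why it was chosen as a masker.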
Group  Trial 1: D30  Trial 2: M0, Mm10, Mm20
G1     DBM 30%       MBM SNR 0dB
G2     DBM 30%       MBM SNR -10dB
G3     DBM 30%       MBM SNR -20dB

[Figure: MBM (Exp1): sentence recognition using HTK (%), y-axis 0 to 80, for conditions M0, Mm10, Mm20]
11

Performance (Exp1)
DBM (T1): marginally significant (p<0.1) (G1>G2)
DBM 30% task is harder than MBM 0dB, -10dB, -20dB
MBM (T2): the effect of SNR condition is significant, but only between 0dB and -10dB (p<0.05) (G1>G2)
[Figure: per-subject results for T1 and T2 (y-axis 30 to 100), subjects s101-s105, s201-s205, s301-s305, in three panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB]

12

Workload difference (Exp1)
 WWL: individual differences cancelled
 each subject's DBM (D30) score was subtracted from the MBM (M0, Mm10, or Mm20) score
[Figure: per-subject WWL difference (y-axis -60 to 20), subjects s101-s105, s201-s205, s301-s305, in three panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB]
[Inset: average WWL difference for M0-D30, Mm10-D30, Mm20-D30; value labels 0.7, 1.0, (16.2)]
WWL: MBM 0dB < DBM 30% ? no significance (ANOVA)

MBM: task difficulty is not easy to control
13

DBM exposed ratio (Exp2)
 DBM: Exposed Ratio can control the gap size
[Figure: two plots over DBM 30%, 50%, 70%: Human Ave. (%) and Machine (%), y-axis 30 to 100; Workload, y-axis 30 to 70. Annotation: "Significant difference (p<0.05)"]

DBM 30%: gap is very large, however, workload is very high.

14

Discussion
 D30 (DBM) and Mm10 (MBM) can serve as benchmarks
 for comparing MBM and DBM
 performance differences are close (43.7pt & 44.8pt)
 WWL values are also very close (W_Mm10 - W_D30 = 0.7)
[Figure: performance difference between human and machine (pt), y-axis 0 to 80, for M0, Mm10, Mm20, D70, D50, D30]

15

Conclusion
 audio CAPTCHA task using phonemic restoration
 deletion-based method (DBM)
 evaluation of CAPTCHA task
 performance + mental workload (NASA-TLX)
 comparison between DBM and MBM
 DBM: easier to control the task
 future work
 improve the noise
 investigation of phonemic restoration
 does it really improve performance, or only decrease workload?
 word familiarity, speech rate, synthesized speech, ...

16
