3. ⦿CAPTCHA stands for Completely Automated
Public Turing test to tell Computers and
Humans Apart
⦿A program that can tell whether its user is a
human or a computer.
⦿The challenge: develop a software program that
can create and grade challenges most humans
can pass but computers cannot
4. ⦿First used by AltaVista in 1997
• Reduced spam add-URL submissions by over 95%
⦿CMU/Yahoo!
• Automated the creating and grading of
challenges
⦿PARC
• Relies on document image degradation to
prevent successful OCR
• Conducted user-focused studies to assess the
effectiveness of CAPTCHAs
5. ⦿CAPTCHAs are based on open AI
problems
⦿Breaking CAPTCHAs helps advance AI by
solving these open problems
⦿Improving CAPTCHAs helps tell
computers and humans apart
⦿Win-win situation
6. ⦿Pessimal Print: A Reverse Turing Test
Allison L. Coates, Henry S. Baird, Richard J. Fateman
⦿Telling Humans and Computers Apart
Automatically
Luis von Ahn, Manuel Blum, and John Langford
⦿CAPTCHA: Using Hard AI Problems for
Security
Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford
⦿Using Machine Learning to Break Visual
Human Interaction Proofs (HIPs)
Kumar Chellapilla, Patrice Y. Simard
8. ⦿Text based
• Gimpy, EZ-Gimpy
• Gimpy-r, Google CAPTCHA
• Simard’s HIP (MSN)
⦿Graphic based
• Bongo
• PIX
⦿Audio based
9. ⦿Gimpy, EZ-Gimpy
• Pick a word or words from a small dictionary
• Distort them and add noise and background
⦿Gimpy-r, Google’s CAPTCHA
• Pick random letters
• Distort them, add noise and background
⦿Simard’s HIP
• Pick random letters and numbers
• Distort them and add arcs
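
The "pick" step of the two styles above (dictionary words versus random characters) can be sketched as follows. This is only a minimal illustration: the word list, the default length and the function names are assumptions, and the distortion step is left out because it depends on an imaging library.

```python
import secrets
import string

# Hypothetical small dictionary, standing in for the one a Gimpy-style test draws from.
SMALL_DICTIONARY = ["apple", "river", "stone", "cloud", "tiger"]

# Letters and digits, for the random-character style of challenge.
ALPHABET = string.ascii_uppercase + string.digits


def pick_dictionary_word():
    """Dictionary style: pick one word from a small dictionary."""
    return secrets.choice(SMALL_DICTIONARY)


def pick_random_characters(length=6):
    """Random-character style: pick `length` symbols (the length is an assumption)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def grade(expected, typed):
    """Accept the answer if it matches, ignoring case and surrounding spaces."""
    return typed.strip().upper() == expected.upper()
```

The dictionary style leaves an attacker only a handful of candidate answers, while the random-character style has one candidate per possible string.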
11. ⦿Bongo
• Display two series of blocks
• User must find the characteristic that sets the two
series apart
• User is asked to determine which series each of
four single blocks belongs to
• Example difference: thick vs. thin lines
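
A toy version of the Bongo flow, using the thick vs. thin example. It assumes each block can be reduced to a single number (line thickness in pixels); the real test shows drawn blocks, and all names and ranges here are illustrative.

```python
import random


def make_bongo_challenge(rng, series_size=4):
    """Build two series of blocks plus four single blocks to classify."""
    left = [rng.randint(6, 9) for _ in range(series_size)]   # thick lines
    right = [rng.randint(1, 3) for _ in range(series_size)]  # thin lines
    # Decide which series each of the four single blocks belongs to.
    answers = [rng.choice(["left", "right"]) for _ in range(4)]
    probes = [
        rng.randint(6, 9) if side == "left" else rng.randint(1, 3)
        for side in answers
    ]
    return {"left": left, "right": right, "probes": probes, "answers": answers}


def grade_bongo(challenge, user_answers):
    """Pass only if all four single blocks are assigned to the right series."""
    return list(user_answers) == challenge["answers"]
```

A seeded `random.Random` keeps the sketch reproducible; a deployed test would want an unpredictable source such as `random.SystemRandom()`.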
12. ⦿PIX
• Create a large database of labeled images
• Pick a concrete object
• Pick four images of the object from the images
database
• Distort the images
• Ask the user to pick the object from a list of words
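
The PIX steps can be sketched roughly as below. The tiny in-memory database, the file names and the function names are made up for illustration, and the image distortion step is omitted.

```python
import random

# Hypothetical labeled image database: object name -> image file names.
IMAGE_DB = {
    name: [f"{name}_{i}.jpg" for i in range(5)]
    for name in ("dog", "car", "tree", "chair")
}


def make_pix_challenge(rng, db=IMAGE_DB, images_per_challenge=4):
    """Pick a concrete object and four of its images; offer all labels as choices."""
    answer = rng.choice(sorted(db))
    images = rng.sample(db[answer], images_per_challenge)
    return {"answer": answer, "images": images, "choices": sorted(db)}


def grade_pix(challenge, picked_word):
    """Pass if the user picked the object that all four images show."""
    return picked_word == challenge["answer"]
```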
14. ⦿Pick a word or a sequence of numbers at
random
⦿Render them into an audio clip using
text-to-speech (TTS) software
⦿Distort the audio clip
⦿Ask the user to identify and type the
word or numbers
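
A minimal sketch of the audio flow for the numeric case. The TTS engine is not specified here, so it is passed in as a caller-supplied function `tts` that maps text to a list of float samples in [-1.0, 1.0]; that interface, the noise model and the default length are all assumptions.

```python
import random


def pick_digits(rng, length=5):
    """Pick a random sequence of digits (the length is an assumption)."""
    return "".join(rng.choice("0123456789") for _ in range(length))


def distort(samples, rng, noise_level=0.2):
    """Add uniform random noise to each sample and clip to [-1.0, 1.0]."""
    noisy = []
    for sample in samples:
        value = sample + rng.uniform(-noise_level, noise_level)
        noisy.append(max(-1.0, min(1.0, value)))
    return noisy


def make_audio_challenge(rng, tts, length=5):
    """Return the expected answer and the distorted clip to play to the user."""
    answer = pick_digits(rng, length)
    clip = distort(tts(answer), rng)
    return answer, clip


def grade_audio(answer, typed):
    """Pass if the typed digits match, ignoring surrounding spaces."""
    return typed.strip() == answer
```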
15. ⦿Most text-based CAPTCHAs have been
broken by software
• OCR
• Segmentation
⦿Other CAPTCHAs were broken by
streaming the tests to unsuspecting
users, who solved them.
16. ⦿Very similar to PIX
⦿Pick a concrete object
⦿Get 6 images at random from
images.google.com that match the object
⦿Distort the images
⦿Build a list of 100 words: 90 from a full
dictionary, 10 from the objects dictionary
⦿Prompt the user to pick the object from
the list of words
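
The 90/10 word list could be assembled along these lines. The sketch assumes the correct object counts as one of the 10 words taken from the objects dictionary, which the slide does not state; the function and parameter names are illustrative.

```python
import random


def build_word_list(rng, answer, full_dictionary, objects_dictionary):
    """Return 100 shuffled words: 10 object words (including the answer) and 90 fillers."""
    # Assumption: the answer is one of the 10 words drawn from the objects dictionary.
    other_objects = [w for w in objects_dictionary if w != answer]
    object_words = [answer] + rng.sample(other_objects, 9)

    # Fill the rest from the full dictionary, avoiding duplicates of the object words.
    taken = set(object_words)
    filler_pool = [w for w in full_dictionary if w not in taken]
    fillers = rng.sample(filler_pool, 90)

    words = object_words + fillers
    rng.shuffle(words)
    return words
```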
17. ⦿Make an HTTP call to images.google.com
and search for the object
⦿Screen scrape the result of 2-3 pages to
get the list of images
⦿Pick 6 images at random
⦿Randomly distort both the images and
their URLs before displaying them
⦿Expire the CAPTCHA in 30-45 seconds
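
A sketch of the last three steps, starting from a list of image URLs that has already been scraped (the HTTP call and scraping are not shown). Serving each image under a one-off random token is only one possible reading of "distort the URLs"; the token scheme, function names and data layout are assumptions.

```python
import random
import secrets
import time


def make_challenge(image_urls, answer, now=None, rng=None):
    """Pick 6 images, hide their URLs behind tokens, and set a 30-45 second lifetime."""
    rng = rng or random.SystemRandom()
    now = time.time() if now is None else now
    picked = rng.sample(image_urls, 6)
    # Each image is served under a random token, so the real URL is never shown
    # and cannot be used as a cache key.
    tokens = {secrets.token_urlsafe(8): url for url in picked}
    return {
        "answer": answer,
        "tokens": tokens,
        "expires_at": now + rng.uniform(30, 45),
    }


def check(challenge, typed, now=None):
    """Reject expired challenges, then compare the typed word with the answer."""
    now = time.time() if now is None else now
    if now > challenge["expires_at"]:
        return False
    return typed.strip().lower() == challenge["answer"].lower()
```

The short lifetime is what limits relay attacks: an answer obtained by streaming the test to someone else must come back before `expires_at`.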
18. ⦿The database already exists and is public
⦿The database is constantly being
updated and maintained
⦿Adding “concrete objects” to the
dictionary is virtually instantaneous
⦿Distortion prevents caching hacks
⦿Quick expiration limits streaming hacks
19. ⦿Not accessible to people with disabilities
(as is the case for most CAPTCHAs)
⦿Relies on Google’s infrastructure
⦿Unlike CAPTCHAs using random letters
and numbers, the number of challenge
words is limited
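
To put the last point in numbers, here is a back-of-the-envelope comparison of blind guessing. It assumes a 100-word answer list for the image test and, for the text test, 6 characters over a 36-symbol alphabet; the length and alphabet size are illustrative assumptions.

```python
def guess_probability_word_list(list_size=100):
    """Chance that a blind guess picks the right word from the list."""
    return 1 / list_size


def guess_probability_random_text(length=6, alphabet_size=36):
    """Chance that a blind guess matches a random string of letters and digits."""
    return 1 / alphabet_size ** length
```

Under these assumptions a blind guess succeeds 1 time in 100 against the word list, versus 1 time in 36^6 = 2,176,782,336 against six random characters.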