HOW DOES IT WORK?
CAPTCHA works on a simple principle: it should be
solvable only by humans. The assumption is that a
computer cannot reliably read the characters in a
distorted image, while a human can read the CAPTCHA
text easily. This made it a successful scheme, in
which a user has to enter the characters in order
to proceed on a website.
While many types of CAPTCHA exist, the most common
is the text-based CAPTCHA, in which a random
combination of characters of varying length is
rendered as a distorted image. The image is presumed
to be unreadable by a computer script but readable
by a human.
Once the user enters the CAPTCHA characters, the
input is matched at the backend against the known
solution, and only if it matches exactly can the
user proceed. Cracking CAPTCHAs has been a challenge
for the AI research community, and to date no system
has been developed that achieves 100% accuracy and
efficiency.
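The generate-then-match flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `new_challenge` and `check_answer`, the alphabet, and the length range are hypothetical, and rendering the text into a distorted image is left out entirely.

```python
import hmac
import secrets
import string

# Assumed alphabet; a real system might drop look-alike characters such as O and 0.
ALPHABET = string.ascii_uppercase + string.digits


def new_challenge(min_len=4, max_len=7):
    """Return a random solution string of varying length (hypothetical helper)."""
    length = min_len + secrets.randbelow(max_len - min_len + 1)
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def check_answer(expected, typed):
    """Accept only an exact match with the stored solution."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), typed.strip().encode())


solution = new_challenge()                     # kept at the backend
print(check_answer(solution, solution))        # a correct answer is accepted
print(check_answer(solution, solution[:-1]))   # a partial answer is rejected
```

The solution string stays on the server; only the rendered image would be sent to the user.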
CAPTCHAs have applications in
practical security, such as:
• Preventing Comment Spam in Blogs: Bots post spam comments
stuffed with keywords in order to raise a site's ranking in
search engines. A CAPTCHA ensures that this does not happen.
• Protecting Website Registration: Everyone uses email, and several
websites offer sign-ups. Only humans are supposed to sign up, but
because of registration bots, several email services and sign-up
websites found that they had gained millions of accounts overnight,
all fake and generated by bots.
• Protecting Email Addresses From Scrapers: Spammers crawl the
Web in search of email addresses posted in clear text. CAPTCHAs
provide an effective mechanism to hide your email address from
Web scrapers. The idea is to require users to solve a CAPTCHA
before showing your email address.
• Preventing Dictionary Attacks: One way to break into someone's
email or registration account is to try millions of password
combinations together with the right user ID. A CAPTCHA prevents
this by appearing after a number of failed login attempts. Since a
bot cannot solve the CAPTCHA, no further attempts are possible, and
the account itself is not affected in any way.
• Search Engine Bots: It is sometimes desirable to keep web pages
unindexed to prevent others from finding them easily. There is an
HTML tag to prevent search engine bots from reading web pages.
The tag, however, doesn't guarantee that bots won't read a web
page; it only serves to say "no bots, please." Search engine bots,
since they usually belong to large companies, respect web pages
that don't want to allow them in. However, in order to truly
guarantee that bots won't enter a web site, CAPTCHAs are needed.
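For the dictionary-attack item in the list above, the idea of demanding a CAPTCHA after a number of failed logins can be sketched as follows. The class name `LoginGate`, the threshold of three attempts, and the plain-text password table are all assumptions made for the demo, not details from any real system.

```python
from collections import defaultdict

MAX_FREE_ATTEMPTS = 3  # assumed threshold; the text only says "a number of" misses


class LoginGate:
    """Tracks failed logins per user and demands a CAPTCHA after too many."""

    def __init__(self, passwords):
        self._passwords = passwords        # user id -> password (plain text only for the demo)
        self._failures = defaultdict(int)  # user id -> consecutive failed attempts

    def captcha_required(self, user):
        return self._failures[user] >= MAX_FREE_ATTEMPTS

    def try_login(self, user, password, captcha_ok=False):
        """Return True on success; refuse outright if a needed CAPTCHA is unsolved."""
        if self.captcha_required(user) and not captcha_ok:
            return False                   # the password is not even checked
        if self._passwords.get(user) == password:
            self._failures[user] = 0
            return True
        self._failures[user] += 1
        return False


gate = LoginGate({"alice": "s3cret"})
for guess in ["aaa", "bbb", "ccc"]:
    gate.try_login("alice", guess)
print(gate.captcha_required("alice"))
```

Once the threshold is reached, a bot that cannot solve the CAPTCHA gets no further password guesses, while the account itself stays usable for its owner.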
GOALS TO ACHIEVE
• Web interface for the CAPTCHA system: Given a web page, we
construct a plug-in so that, when you click a button, the CAPTCHA
is captured and passed to a recognizer, and the result is returned
and filled into the CAPTCHA text box. The result is then checked to
see whether the CAPTCHA was filled in correctly. If so, we record
the CAPTCHA and the answer in a database for future research. The
recognition rate is also calculated for analysis.
• Segmentation Engine: The JCAPTCHA image is segmented here,
using different modes of segmentation. The segmentation
algorithms are based on invariants observed across hundreds of
JCAPTCHA images.
• Recognition Engine: Build a recognition engine for the JCAPTCHA
segmented characters to identify the best answer possible.
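The record-keeping and recognition-rate goal above could look roughly like the sketch below. The table name, column names, and helper functions are hypothetical. The sketch also assumes that every attempt is stored with an accepted/rejected flag, rather than only the correct ones, so that a rate can be computed.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the results table if needed (table and column names are assumed)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS attempts ("
        "image_name TEXT, answer TEXT, accepted INTEGER)"
    )
    return db


def record_attempt(db, image_name, answer, accepted):
    """Store one recognizer answer and whether the website accepted it."""
    db.execute(
        "INSERT INTO attempts VALUES (?, ?, ?)",
        (image_name, answer, int(accepted)),
    )
    db.commit()


def recognition_rate(db):
    """Fraction of attempts the website accepted, or 0.0 if there are none."""
    total, correct = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(accepted), 0) FROM attempts"
    ).fetchone()
    return correct / total if total else 0.0


db = open_store()
record_attempt(db, "captcha_001.png", "XK4P", True)
record_attempt(db, "captcha_002.png", "QQ7", False)
print(recognition_rate(db))
```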
A BRIEF FLOW:
• A CAPTCHA recognition framework consists of
three main components:
• The front end plug-in that is used to detect
the CAPTCHA on the webpage.
• The segmentation engine which segments the
characters of the CAPTCHA.
• The recognizer, which is responsible for
identifying the segmented characters.
The diagram below demonstrates the
framework for CAPTCHA recognition:
JCAPTCHA Recognizer Engine
• The Recognizer Engine forms the core of the JCAP
1. Collecting files and removing artifacts
We observed that the JCAPTCHA image file saved by
the plugin had a 2-pixel blue border. This border
was not in the original image and was an artifact
created when the plugin software iMacros selected
the image to take a screenshot. This border is
cropped off the image, and the new image is saved
in the Recognizer folder.
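The border-removal step can be illustrated with a small sketch that works on a plain row-major grid of pixels. The function name `crop_border` and the list-of-rows image representation are assumptions chosen to keep the example dependency-free; they are not taken from the actual plugin code.

```python
BORDER = 2  # border width in pixels, as observed in the saved images


def crop_border(pixels, border=BORDER):
    """Return a copy of a row-major pixel grid with `border` pixels cut from each side."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    if height <= 2 * border or width <= 2 * border:
        raise ValueError("image is too small to crop")
    return [row[border:width - border] for row in pixels[border:height - border]]


# "B" marks the blue border; the digits stand for the real image content.
framed = (
    [["B"] * 7 for _ in range(2)]
    + [["B", "B", 1, 2, 3, "B", "B"],
       ["B", "B", 4, 5, 6, "B", "B"]]
    + [["B"] * 7 for _ in range(2)]
)
print(crop_border(framed))
```

With an imaging library, the same step would typically be a single crop over the box (2, 2, width - 2, height - 2).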
• There are three modes of segmentation, which are
configurable by the user:
1. Fast Pixel Array mode
2. Slow Pixel Array mode
3. Connected Components mode
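To give a feel for the third mode, here is a generic connected-components sketch on a binary image (1 = ink, 0 = background). It is a textbook flood-fill labeling written under assumptions (4-connectivity, a list-of-rows image, the function name `connected_components`) and is not the engine's actual implementation; the two pixel-array modes are not covered.

```python
from collections import deque


def connected_components(image):
    """Group foreground pixels (value 1) into 4-connected components.

    Returns one bounding box (left, top, right, bottom) per component,
    sorted left to right so the boxes follow reading order.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


sample = [
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1],
]
print(connected_components(sample))
```

A known limitation of this kind of segmentation is that characters which touch or overlap merge into one component, while a character made of separate strokes (such as a dotted "i") splits into several.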
• As introduced in the theory, our approach to
character recognition is based on template
matching. Although the implementation of the OCR
closely follows the explanation given in the
theory, I'd like to walk you through the flow of the
code and talk about some of the challenges I
experienced building each function.
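As a rough picture of what template matching means here, the sketch below scores a segmented glyph against each stored template by counting agreeing pixels and picks the best one. The scoring rule, the tiny 3x3 templates, and the names `match_score` and `recognize` are illustrative assumptions rather than the engine's actual code.

```python
def match_score(glyph, template):
    """Fraction of pixel positions where two equally sized binary grids agree."""
    if len(glyph) != len(template) or any(
        len(g) != len(t) for g, t in zip(glyph, template)
    ):
        raise ValueError("glyph and template must have the same size")
    total = sum(len(row) for row in glyph)
    if total == 0:
        raise ValueError("glyph is empty")
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def recognize(glyph, templates):
    """Return (label, score) for the template that agrees with the glyph best."""
    scored = ((label, match_score(glyph, t)) for label, t in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 3x3 templates; real templates would be full-size character images.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
noisy_i = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]  # an "I" with one stray pixel
label, score = recognize(noisy_i, TEMPLATES)
print(label, round(score, 2))
```

In practice, a segmented character would first have to be scaled to the template size before a comparison like this makes sense.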