Generic Solving of Text-based captchas
A seminar report submitted in partial fulfillment of the requirements
for the award of the degree of
Bachelor of Technology
in
Computer Science & Engineering
Eighth Semester 2011 Admission
ABSTRACT
Over the last decade, it has become well established that a captcha's ability to
withstand automated solving lies in the difficulty of segmenting the image into individual
characters. The standard approach to solve captchas automatically has been a se-
quential process wherein a segmentation algorithm splits the image into segments
that contain individual characters, followed by a character recognition step that uses
machine learning. While this approach has been effective against particular captcha
schemes, its generality is limited by the segmentation step, which is hand-crafted to
defeat the distortion at hand.
No general algorithm is known for the character collapsing anti-segmentation tech-
nique used by most prominent real world captcha schemes. Here a novel approach
to solve captchas in a single step that uses machine learning to attack the segmen-
tation and the recognition problems simultaneously is formulated. Performing both
operations jointly allows the algorithm to exploit information and context that is
not available when they are done sequentially. At the same time, it removes the need
for any hand-crafted component, making the approach generalize to new captcha
schemes where the previous approach cannot.
Many websites use captchas, or Completely Automated Public Turing tests to tell
Computers and Humans Apart, to block automated interaction with their sites. For
example, Gmail uses captchas to block access by automated spammers, eBay [12]
uses captchas to improve its marketplace by blocking bots from flooding the site
with scams, and Facebook uses captchas to limit creation of fraudulent profiles used
to spam honest users or cheat at games. The most widely used captcha schemes use
combinations of distorted characters and obfuscation techniques that humans can
recognize but that may be difficult for automated scripts. Captchas are sometimes
called reverse Turing tests, because they are intended to allow a computer to deter-
mine whether a remote client is human or machine.
Contents
List of Figures
1 Introduction
2 Outline
3 Motivation
4 Approaches and Data Set
4.1 Approach
4.2 Dataset
5 Algorithm
5.1 Algorithm Overview
5.1.1 Cut - Point Detector
5.1.2 Slicer
5.1.3 Scorer
5.1.4 Arbiter
5.2 Reinforcement Learning
5.3 Dealing with Occluding Lines
5.4 Sequential Recognition
6 Areas of Improvement
6.1 Learn the KNN weights
6.2 Improve cut-point elimination
6.3 Additional Occlusion
6.4 Explore deep neural networks
7 Future Works of captcha system
8 Conclusion
Bibliography
List of Figures
4.1 Segmentation then Recognition
4.2 Various Distortion in Negatively Kerned captcha
4.3 15 Best Captchas over which test was conducted
4.4 Captchas over which test was conducted
5.1 Overview of the algorithm’s four components
5.2 How Algorithm Works
5.3 Example of the algorithm successfully applied to a Yahoo captcha
5.4 Cut Optimization
5.5 Graph creation in Slicer
5.6 Reinforced Learning
5.7 Occluding Lines
5.8 Sequential Recognition
7.1 Captcha Future
Chapter 1
Introduction
A captcha, which stands for Completely Automated Public Turing test to tell Computers
and Humans Apart [8], is a type of challenge-response test used in computing to
determine whether or not the user is human. As the expansion of the acronym makes clear,
a captcha is a kind of Turing test. A Turing test measures a machine's intelligence, to
find out whether it reaches the level of a human's. The captcha was
introduced in 2000 by Luis von Ahn, Manuel Blum and Nicholas J. Hopper of Carnegie
Mellon University and John Langford of IBM [14]. Captchas are used to
determine whether a user is a machine or a human, but this concept is now dated. With the
captcha breaking system described in the reference paper,
most captchas can be decoded by the algorithm designed by its authors. As a reverse
Turing test, the text captcha is therefore becoming obsolete, and a stronger method
has to be designed. This captcha breaking system can break most of the captchas used in
various web applications.
The standard approach to solving captchas automatically (i.e. by a computational
device) is sequential processing. Sequential processing consists of two major
steps: segmentation [32] and recognition. The segmentation algorithm
splits the image into segments that each contain an individual character. The recognition
algorithm uses machine learning to recognize a single character. After recognizing
all the characters, the machine can output the decoded text of the
captcha. Recognition is performed only after segmentation, which is why the process
is called sequential. This approach is effective only against particular
captcha schemes, because the segmentation step is hand-crafted to defeat one specific
distortion. For some captcha schemes, sequential processing fails at the segmentation step.
No general segmentation algorithm is known for the character collapsing technique
used by real world captcha schemes. This drawback limits the traditional
sequential approach.
The algorithm discussed here is not sequential: it performs the two major steps,
segmentation and recognition, simultaneously over the captcha. Performing both
operations jointly allows the algorithm to exploit information and context that are
not available when they are done sequentially. It also removes the need for any
hand-crafted component, so the approach generalizes to new captcha schemes
where the previous approach cannot.
Many websites use captchas. Gmail uses captchas [7] to block access by spammers, eBay uses
captchas to prevent bots from flooding the site with scams, and Facebook uses captchas
to limit the creation of fraudulent profiles used to spam honest users or cheat at games.
The most widely used captcha schemes use combinations of distorted characters and
obfuscation techniques that humans can recognize but that may be difficult for
automated scripts. Captchas are sometimes called reverse Turing tests, because they
are intended to allow a computer to determine whether a remote client is human or
machine [15]. The effectiveness and universality of the results suggest that combining
segmentation and recognition is the next evolution of automated captcha solving,
and that it can supersede the sequential approach used in earlier works. Comparing
the accuracy of the algorithm with the accuracy of humans suggests that purely
text based captchas [16] may be nearing their end, and the work provides early steps
toward rethinking how reverse Turing tests can be performed securely.
Dept. of Computer Science & Engg. 2 SIMAT, Vavanoor
Chapter 2
Outline
This report presents a new approach to breaking text based captchas.
Breaking a captcha with an algorithm is not, in itself, a commendable goal. The
main aim of this report is to help researchers and analysts
understand the approach used here to break captchas, so that stronger
and more complex reverse Turing tests can be designed. The algorithm concentrates only
on solving text based captchas.
The first chapter introduces the proposed
algorithm for solving text based captchas. It also
gives a brief account of the inventors of the captcha and the basic information needed
to understand captchas. The third chapter describes the relevance of and motivation
behind the study of this algorithm. The chapter Approaches and Data Set
describes the main approaches to solving a captcha and how they relate to this
algorithm. Its Data Set section lists the most
widely used text based captchas. In the fifth chapter the algorithm is
defined and illustrated in detail, together with all the sub-processes used in it.
The sixth chapter covers the future work being done to improve
the reverse Turing test process. It also describes various new
captcha schemes that could replace the traditional text based captcha system. The
final chapter is the conclusion, which summarizes the entire report.
It is followed by the Bibliography.
Chapter 3
Motivation
The main motivation of this method is to bring forward a new concept for solving
captchas. Captchas are believed not to be solvable by machines,
but the authors of the main reference paper have proposed a new algorithm that
solves them. The authors state that they are publishing
this algorithm to advance the technology of reverse Turing tests
and for academic research purposes. The algorithm is complex and more costly
to reproduce than employing cheap manual labor to solve captchas. The
high accuracy and effectiveness of this algorithm in solving captchas
should lead designers and researchers to invent new, more complex
captcha systems, so that the level of security can be further enhanced in
the field of Computer Science.
Chapter 4
Approaches and Data Set
4.1 Approach
As mentioned in the Introduction, this algorithm is applicable only
to text based captchas, so the discussion here is restricted to
text based captcha systems. A text based captcha [13] is treated as an image,
so various image processing techniques can be applied to it.
This section discusses the approaches taken in the past to implement automated
captcha solving systems, and their limitations.
Automated captcha solving
consists of two main processes:
• Segmentation - Segmentation is the process of partitioning a digital image
into multiple segments (sets of pixels, also known as superpixels). The goal
of segmentation is to simplify and/or change the representation of an image
into something that is more meaningful and easier to analyze. Image
segmentation is typically used to locate objects and boundaries (lines, curves, etc.)
in images. Here, segmentation refers to dividing the text of the captcha
into individual characters. This process is
the most difficult and complex to design. It also uses concepts from image
processing, since the captcha is an image. In practice, no general
algorithm is known that performs this segmentation accurately.
• Recognition - Recognition is a field that includes methods for acquiring,
processing, analyzing, and understanding images and, in general, high-dimensional
data from the real world in order to produce numerical or symbolic
information. Here, recognition refers to identifying each distorted individual
character (segmented character) with the help of machine learning.
Machine learning is used for recognition because
machine learning algorithms have been found to consistently outperform humans at
single character recognition. The recognition step is what gives the
algorithm its intelligence.
Figure 4.1: Segmentation then Recognition.
There are two main methods of implementing an automated captcha solving system:
• Segment then Recognize Method
• Segment and Recognize Together Method
Figure 4.1 shows the steps involved in decoding, or
breaking, a text based captcha scheme. The process of solving a captcha automatically
can be divided into five generic steps: pre-processing, segmentation,
post-segmentation, recognition, and post-processing. While segmentation, the separation
of a sequence of characters into individual characters, and recognition, identifying
those characters, are intuitive and generally understood, there are good reasons for
considering the additional pre-processing and post-processing steps as part of a
standard process. For example, pre-processing can remove background patterns or
eliminate other additions to the image that could interfere with segmentation, while
post-segmentation steps can clean up the segmentation output by normalizing the
size of each image or otherwise performing steps distinct from segmentation. After
recognition, post-processing can improve accuracy by, for example, applying spell
checking to any captcha that is based on actual words (such as Slashdot's). Based on
this generic captcha solving architecture, various specific
algorithms were tested on the captchas of various popular websites. From this analysis,
a set of techniques was identified that make captchas more difficult to solve
automatically. By varying these techniques, a larger set of variations was created,
which made it possible to study the effect of each of these features in detail and to
refine the automated attack methods.
The Segment then Recognize Method is the traditional method, in which
segmentation is done first and the segmented characters are then passed
to a recognition algorithm that uses machine learning to
recognize each character. While this approach has been effective against particular
captcha schemes, its generality is limited by the segmentation step, which
is hand-crafted to defeat the distortion at hand. No general algorithm is known for
the character collapsing anti-segmentation technique used by most prominent real
world captcha schemes. This technique is called negative kerning, and it is a variant
of the object occlusion problem.
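The limitation can be illustrated with a minimal sketch of a naive segmentation step. The function name split_on_blank_columns and the two toy images are assumptions made for illustration and are not taken from the reference paper. The function splits a binarized image wherever a column contains no ink, which works for spaced characters but yields a single segment once the characters are collapsed together.

```python
def split_on_blank_columns(image):
    """Return (start, end) column ranges of ink separated by blank columns.

    `image` is a list of rows; 1 means ink (black), 0 means background.
    The end index is exclusive.
    """
    width = len(image[0])
    has_ink = [any(row[x] for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new run of ink begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column closes the run
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right border
    return segments


# Two "characters" separated by one blank column.
spaced = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# The same two "characters" after negative kerning removed the gap.
collapsed = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

spaced_segments = split_on_blank_columns(spaced)
collapsed_segments = split_on_blank_columns(collapsed)
```

For the spaced image the function returns two column ranges, while for the collapsed image it returns only one, so a recognizer would receive both characters as a single segment.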
(a) captcha with no noise and distortion
(b) captcha with cut through lines
(c) captcha with color and distortion
Figure 4.2: Various Distortion in Negatively Kerned captcha.
Negative kerning (Figure 4.2) is a character collapsing technique in which the
space between characters is removed so that each character is occluded by
its neighbor. Occluding here means joining or overlapping the
characters. In addition to occluding characters, negative kerning schemes
add extra noise, distortion and randomization to prevent
side channel attacks. The noise takes the form of added colors and
distorted lines cutting through the text, which make the captcha more complex. A side
channel attack is the process of recognizing the captcha content by
continuously learning each character in the captcha and predicting the result. When
noise is added to a negatively kerned captcha, side channel attacks become
difficult [19].
Figure 4.2a shows a captcha that has undergone negative kerning but has
no noise [11], no occluding lines and no other distortions added.
Figure 4.2b shows a captcha in which negative kerning is combined with occluding lines
to add distortion and cause confusion. Figure 4.2c shows a
simple captcha with distortions in the form of color, occluding lines and dots, but
without negative kerning. Negative kerning is considered the most secure
method for preventing segmentation because it has successfully withstood years of
attacks. Almost all of the most prominently used captcha schemes rely on it. The
other method of choice to prevent segmentation, which seems to have fallen out
of fashion after a successful wave of attacks, is to use occluding lines. A captcha's
ability to withstand automated solving lies in the difficulty of segmenting the image
into individual characters rather than recognizing the characters themselves. This
means that segmentation is the difficult part of automated captcha
solving. Until now, two kinds of attack have been
formulated for automated captcha solving:
The first type of attack applies a side channel attack to all types of captcha.
In this method the segmentation algorithm divides the captcha
into characters, each together with the distortion affecting that
character. The machine learning part then tries to remove the distortion or
predict the character and generate the output. This approach is not very favorable,
because the defender can easily protect the captcha by making it difficult
to segment, and even if segmentation is carried out, the output will
not be reliable. As a result, this attack cannot be applied to all
captchas.
The second type of attack focuses on finding weaknesses in the distortion
algorithms of particular captcha schemes. The attacker designs a specialized
segmentation algorithm that works on the principles of image processing and
morphological segmentation in order to remove the distortion. The image
processing algorithm is given features for recognizing the twists and turns in
the captcha text, and based on these it
converts the text into an understandable form. The
morphological segmentation fills in missing data based on the
relevant information obtained from the image processing part. This attack
also failed to generalize: it was applicable only to the reCaptcha 2011 scheme.
Later, in 2013, a group of researchers examined hollow captchas specifically and were
able to solve all of the captcha schemes by extending the segment then
recognize approach with a process that involves nine consecutive steps.
Until now, research in captcha solving has followed the exploit - patch cycle.
The exploit - patch method was tried on the 15 best captcha schemes, shown in
Figure 4.3. In the exploit - patch cycle the attacker finds a flaw in a particular
anti - segmentation technique, and the defender then either patches it, that is,
removes the flaw in the anti - segmentation technique, or moves on to a new
one. The limitation of the segment then recognize approach has been the attacker’s
ability to find new flaws. The proposed algorithm overcomes this limitation by
segmenting and recognizing the captcha simultaneously, thus removing the need for
manually discovered heuristics to segment captchas [1].
Figure 4.3: 15 Best Captchas over which test was conducted.
4.2 Dataset
This section describes the widely used captcha schemes.
The authors evaluated the efficiency and complexity
of their algorithm on the dataset (Figure 4.4) given here.
Figure 4.4: Captchas over which test was conducted.
Most of the six captcha schemes in the dataset
depend on negative kerning to prevent segmentation. Since 2011, some of
those schemes, namely Baidu and ReCaptcha, have evolved. To keep the algorithm
evaluation relevant to the state of the art in captcha design, the authors extended
their corpus to include the 2013 versions of Baidu and ReCaptcha. A corpus is a large and
structured set of texts used for statistical analysis and hypothesis testing, checking
occurrences or validating linguistic rules within a specific language territory. Here
the language is the captcha. As visible in the figure, Baidu and ReCaptcha evolved (the
updated version is the 2013 version) in two radically different ways: Baidu decided
to use hollow letters whereas ReCaptcha introduced more aggressive distortions.
The algorithm's success rate decreased on the new version of ReCaptcha compared to the
previous version. On the other hand, surprisingly, its accuracy significantly increased
on the newer version of Baidu.
Chapter 5
Algorithm
This chapter gives an overview of the algorithm along with a
description of its major components. As mentioned earlier, learning for
recognition is a very important part of this algorithm:
reinforcement learning is the main reason for the accuracy of the
algorithm. The previous chapter described that occluding lines, as well as negative
kerning, are used to create captchas, so this chapter also covers solving
occluding lines in a generic manner, since this is a natural extension of the
algorithm. Optimizations and trade-offs that can be applied to the
algorithm are discussed in the next chapter.
5.1 Algorithm Overview
As mentioned earlier, this algorithm performs
segmentation and recognition together. The first step is to find
all possible ways to segment the given captcha, that is, all candidate ways
to split it into individual
characters. An unstructured captcha image can be segmented in many
different ways. Given this set of candidate segmentations, the algorithm decides which
combination is most likely to be the correct one: after analyzing all the possible
segmentation paths, it picks the path (set of segments) that maximizes the recognition
rate. This contrasts with the segment then recognize approach, where an uninformed
segmentation algorithm passes at most a small number of possible segmentations to
an independent recognition algorithm; as a result that approach is more time consuming
and may or may not generate the correct result even after many attempts.
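The idea of choosing the best segmentation path can be sketched as follows. This is only an illustration under stated assumptions: the graph, the per-segment (letter, confidence) scores and the rule of ranking paths by mean confidence are all hypothetical, and the selection rule of the reference paper may differ.

```python
def all_paths(graph, start, end):
    """Enumerate every path from cut `start` to cut `end` in a cut graph.

    `graph` maps a cut index to the later cut indices it can be paired with.
    """
    if start == end:
        return [[start]]
    paths = []
    for nxt in graph.get(start, []):
        for tail in all_paths(graph, nxt, end):
            paths.append([start] + tail)
    return paths


def best_segmentation(graph, scores, start, end):
    """Pick the path whose segments have the highest mean confidence.

    `scores[(i, j)]` is a (letter, confidence) pair for the segment lying
    between cuts i and j. Ranking by mean confidence is an assumption.
    """
    best_text, best_mean = None, -1.0
    for path in all_paths(graph, start, end):
        edges = list(zip(path, path[1:]))
        if not edges:
            continue  # start == end: nothing to recognize
        mean = sum(scores[e][1] for e in edges) / len(edges)
        if mean > best_mean:
            best_text = "".join(scores[e][0] for e in edges)
            best_mean = mean
    return best_text, best_mean


# Hypothetical window with border cuts 0 and 5 and two inner candidates.
example_graph = {0: [2, 3], 2: [5], 3: [5]}
example_scores = {
    (0, 2): ("P", 0.4), (2, 5): ("B", 0.5),
    (0, 3): ("A", 0.9), (3, 5): ("B", 0.8),
}
text, confidence = best_segmentation(example_graph, example_scores, 0, 5)
```

Enumerating every path is exponential in the number of cuts, which is acceptable for a small illustration but is one reason why the number of candidate cuts has to be kept small in practice.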
Figure 5.1: Overview of the algorithm’s four components
As Figure 5.1 shows, the algorithm consists of four main
components:
• Cut - Point Detector: Finds all the potential ways to segment the
captcha.
• Slicer: Extracts the segments and combines them into a graph.
• Scorer: Performs Optical Character Recognition (OCR) on the segments
and assigns a recognition confidence score to each one of them.
• Arbiter: Processes the scores and determines the
most likely letters.
Figure 5.2: How Algorithm Works
The graph representation in Figure 5.2 is used to find and store all possible
segmentations that can be derived from the captcha at once. This
is what allows the algorithm to solve the segmentation
and recognition problems simultaneously. Each component is discussed in detail below.
5.1.1 Cut - Point Detector
The Cut - Point Detector is the first step in this algorithm. As
mentioned earlier, it generates the set of candidate cut lines. Image segmentation
is the process of partitioning a digital image into multiple segments (sets of pixels,
also known as superpixels). The goal of segmentation is to simplify and/or change
the representation of an image into something that is more meaningful and easier
to analyze. Image segmentation is typically used to locate objects and boundaries
(lines, curves, etc.) in images. More precisely, image segmentation is the process
of assigning a label to every pixel in an image such that pixels with the same label
share certain characteristics.
To define a cut we need two points (pixels), which together determine
a line segment along which to cut. These two points are obtained by
examining the second derivative of the curve generated by following the bottom pixels
of the captcha, and of the curve generated by following the top pixels of the captcha.
This gives two curves: the first is the second derivative of the top pixels of the
captcha and the second is the second derivative of the bottom pixels of the captcha.
On both curves we mark the points to be used for segmentation.
These points are the inflection points [31] of the top and bottom outlines, where
the second derivative changes sign. An
inflection point is a point on a curve at which the curve changes from being concave
(concave downward) to convex (concave upward), or vice versa. This yields a
set of inflection points on the first curve and another on the second curve. The
inflection points on the first (top) curve are marked in red and those on the
bottom curve are marked in blue.
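A minimal sketch of this step is given below, assuming a binarized image stored as a list of rows (1 for ink, 0 for background). The helper names are hypothetical, and the discrete second difference is used as a simple stand-in for the second derivative; the reference paper may smooth or compute the curves differently.

```python
def outline_profile(image, from_top=True):
    """Row index of the first ink pixel met in each column.

    Columns are scanned from the top or from the bottom. A column with no
    ink gets the image height (top scan) or -1 (bottom scan).
    """
    height = len(image)
    rows = range(height) if from_top else range(height - 1, -1, -1)
    profile = []
    for x in range(len(image[0])):
        hit = next((y for y in rows if image[y][x]), None)
        if hit is None:
            hit = height if from_top else -1
        profile.append(hit)
    return profile


def second_difference(profile):
    """Discrete second derivative, defined for interior columns only."""
    return [profile[i - 1] - 2 * profile[i] + profile[i + 1]
            for i in range(1, len(profile) - 1)]


def inflection_columns(profile):
    """Columns at which the second difference changes sign."""
    columns, previous = [], 0
    for i, value in enumerate(second_difference(profile)):
        if value == 0:
            continue                       # flat curvature: no information
        if previous != 0 and (value > 0) != (previous > 0):
            columns.append(i + 1)          # value i belongs to column i + 1
        previous = value
    return columns


# Hypothetical outline that curves upward and then flattens out.
example_profile = [0, 1, 4, 9, 14, 17, 18]
example_inflections = inflection_columns(example_profile)
```

Running outline_profile once with from_top=True and once with from_top=False gives the two curves described above, and inflection_columns applied to each gives the candidate points for the top and bottom ends of the cuts.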
Once the points are available, the process of finding all possible
cut lines begins. Each cut is constructed by connecting two inflection points,
one from the top and one from the bottom. Doing this for the entire
curve gives a set of cuts called the Potential Cuts, which
are marked over the captcha. The captcha with its Potential Cuts marked is then
given as input to the Slicer.
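The pairing of top and bottom points can be sketched in a few lines. The function name potential_cuts and the sample coordinates are assumptions made for illustration; each point is taken to be an (x, y) pixel position.

```python
from itertools import product


def potential_cuts(top_points, bottom_points):
    """Pair every top inflection point with every bottom inflection point.

    Each cut is returned as a (top_point, bottom_point) pair, i.e. the two
    ends of the line segment along which the image could be cut.
    """
    return list(product(top_points, bottom_points))


# Hypothetical inflection points on a 10 pixel high captcha.
top = [(2, 0), (5, 1)]
bottom = [(3, 9), (6, 8), (9, 9)]
cuts = potential_cuts(top, bottom)
```

Because every top point is paired with every bottom point, the number of cuts is the product of the two counts (six in this example), which is why the set grows quickly and has to be pruned later.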
Figure 5.3: Example of the algorithm successfully applied to a Yahoo captcha
5.1.2 Slicer
The Slicer takes as input a captcha with the potential cuts marked over it.
It applies some heuristics [30] to extract the meaningful potential
segments based on the cut points and builds the graph as shown in Figure 5.4. A
potential segment is considered meaningful if the two cuts that define its left and
right boundaries are sufficiently far apart, yet not too far apart. The
formation of the graph (Figure 5.5) works as follows. The entire
text based captcha is divided into a set of windows, each covering 2 characters at a
time. In the figure, the characters A and B are the two
characters of the captcha that lie in the window. This window contains 4 potential
cuts, which were found in the previous step.
Next, the content of each region between cuts is examined. Initially
the algorithm takes the region between cuts 0 and 1, where 0 and 5 are the
borders of the window, which are also treated as cuts. The content is analyzed, and a
weight is assigned to the recognized part along with the possible character. Moving to
the next region, between cuts 1 and 2, this region does not give a
meaningful segment, so it is discarded. After all the values have been obtained, a graph
is built with the cut points as the nodes and with the edges labeled with the possible
character and the assigned weight. Based on this graph the best cut can be found,
here cut 3, because the regions from 0 to 3 and from 3 to 5 give nearly the same
character and weight values. In simple terms, a cut is considered a
potential cut if its distance from the black pixels is sufficiently
large, yet not too large.
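The construction of the segment graph can be sketched as follows. The function name, the cut positions and the min_width and max_width thresholds are all hypothetical; the sketch only encodes the rule that two cuts form a meaningful segment when they are sufficiently far apart, yet not too far apart.

```python
def build_segment_graph(cut_xs, min_width, max_width):
    """Link each cut to the later cuts that close a plausible segment.

    `cut_xs` holds the horizontal position of each cut, sorted from left to
    right. The result maps a cut index to the indices of the cuts that may
    serve as the right boundary of a segment starting at that cut.
    """
    graph = {i: [] for i in range(len(cut_xs))}
    for i, left in enumerate(cut_xs):
        for j in range(i + 1, len(cut_xs)):
            width = cut_xs[j] - left
            if min_width <= width <= max_width:
                graph[i].append(j)         # segment (i, j) is kept
    return graph


# Hypothetical window: cuts 0 and 5 are its borders, 1 to 4 are candidates.
window_cuts = [0, 4, 6, 12, 15, 24]
segment_graph = build_segment_graph(window_cuts, min_width=8, max_width=14)
```

With these illustrative values the only segment starting at cut 0 ends at cut 3, and cut 3 in turn reaches the right border 5, which mirrors the example in the text where cut 3 is the best cut. Narrow regions such as the one between cuts 1 and 2 produce no edge and are discarded.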
Figure 5.4: Cut Optimization
Naturally, as the number of potential cuts increases, the computation time also
increases. It was found that this algorithm took 9 hours to slice
a captcha containing 12 characters. The only
remedy for this drawback is to decrease the number of cuts in the Potential Cut set.
To optimize the algorithm, a new approach was formulated which works by pruning
(removing) near-duplicate and improbable cuts from the set of potential cut points.
First, all the cuts that have an angle > 30° are removed. Then the ratio of
white pixels to black pixels is examined to eliminate cut lines that pass through too
many black pixels, since they are most likely cutting through the middle of a letter.
Finally, the pixel intensities of the left and right boundaries are compared to
estimate whether the cut marks a transition between two letters.
Figure 5.5: Graph creation in Slicer
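The first two pruning rules can be sketched as follows. Several details are assumptions made for illustration: the angle is measured from the vertical axis, the black pixel test is expressed as a maximum fraction of ink along the cut (max_ink, a hypothetical threshold), and the third rule, which compares the intensities on both sides of the cut, is omitted.

```python
import math


def cut_angle_degrees(cut):
    """Angle between the cut and the vertical axis, in degrees."""
    (x1, y1), (x2, y2) = cut
    return math.degrees(math.atan2(abs(x2 - x1), abs(y2 - y1)))


def ink_fraction(image, cut):
    """Fraction of the pixels sampled along the cut that are ink (1)."""
    (x1, y1), (x2, y2) = cut
    steps = max(abs(x2 - x1), abs(y2 - y1), 1)
    hits = 0
    for i in range(steps + 1):
        t = i / steps
        x = round(x1 + t * (x2 - x1))
        y = round(y1 + t * (y2 - y1))
        hits += image[y][x]
    return hits / (steps + 1)


def prune_cuts(image, cuts, max_angle=30.0, max_ink=0.5):
    """Keep cuts that are steep enough and cross little ink."""
    return [cut for cut in cuts
            if cut_angle_degrees(cut) <= max_angle
            and ink_fraction(image, cut) <= max_ink]


# Toy image: two blocks of ink separated by a blank column at x = 2.
toy_image = [
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
]
candidates = [
    ((2, 0), (2, 3)),   # vertical, through the gap
    ((4, 0), (4, 3)),   # vertical, through the middle of a block
    ((0, 0), (5, 3)),   # strongly slanted
]
kept = prune_cuts(toy_image, candidates)
```

In this toy case only the cut through the blank column survives: the second candidate crosses nothing but ink and the third is slanted by more than 30 degrees.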
5.1.3 Scorer
The scorer assigns a score to each segmented
character. It traverses the graph of potential segments, applies OCR to each, and
assigns a confidence value. The previous step generates
the candidate character segments, called the potential segments. The Scorer
scans these segments and generates a score for each. The KNN algorithm is then
used to find the class to which each potential segment
belongs.
The KNN algorithm [29] is the k Nearest Neighbor algorithm. It
is a classification algorithm that assigns an element
to a class by taking a vote among the k training examples nearest to it
in the feature space. Here a modified version of the KNN algorithm is used.
First, for each potential segment in the captcha, a score
is calculated at the pixel level for the corresponding character, and based on the
overall confidence value a recognition confidence score is assigned. This recognition
confidence score is used as the input to the KNN algorithm, and its value is compared
with the values of the surrounding classes. A class contains a set of similar
characters; for example, A and 4 belong to the same class, and similarly 0 and O belong
to the same class. Elements of the
same class have nearly the same value because they look alike.
Allocating each captcha segment to a particular class is the main task done here.
Segments are processed at the pixel level, as this has been demonstrated to be
the best approach for text recognition. The KNN algorithm is preferred here
because of the following factors: computation at the pixel level, noise resistance and
computational speed. The noise resistance arises from using a relatively small k
(less than 10) in the KNN to identify the nearest neighbors. This is essential in this
case because most of the potential segments generated by the slicer are meaningless
and belong to the garbage class.
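A plain majority-vote KNN classifier with a garbage class can be sketched as follows. This is not the modified KNN used by the authors: the squared Euclidean distance, the use of the vote fraction as a confidence value and the tiny training set are all assumptions made for illustration.

```python
from collections import Counter


def knn_classify(sample, training, k=5):
    """Return (label, confidence) from a vote among the k nearest examples.

    `training` is a list of (pixel_vector, label) pairs. The confidence is
    the fraction of the k neighbors that voted for the winning label.
    """
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    nearest = sorted(training,
                     key=lambda example: squared_distance(sample, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


# Hypothetical two-pixel "segments"; real segments have one value per pixel.
training_set = [
    ([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"),
    ([5, 5], "B"), ([5, 6], "B"),
    ([9, 9], "garbage"),
]
clean_label, clean_conf = knn_classify([0.2, 0.1], training_set, k=3)
noisy_label, noisy_conf = knn_classify([5, 5.4], training_set, k=3)
```

In the second query one of the three neighbors belongs to the garbage class, so the winning label B is returned with a lower confidence than in the first query, where all three neighbors agree.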
A metric distance function is a function that defines a distance between elements
of a set. It was realized that the problem with a standard distance function was
that it assigned an equal weight to each pixel regardless of its position in the
segment or its gray-scale value. It turns out that pixels on the edge of segments
are less meaningful than pixels in the center, precisely because they are shared
between characters that have been collapsed together. Very good results were
achieved on all captcha schemes by assigning higher weight to pixels nearer the
center of the segment, and to darker ones.
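A minimal sketch of such a weighted distance is given below. The exact weighting
used by the algorithm is not stated in this report, so the linear fall-off from
center to edge, the factor 0.5, and the names `pixel_weight` and
`weighted_distance` are all assumptions chosen for illustration.

```python
import math

def pixel_weight(x, width, gray, max_gray=255):
    """Hypothetical weight: larger near the center column and for darker pixels."""
    center = (width - 1) / 2
    # 1.0 at the center column, falling linearly to 0.5 at the edges.
    position = 1.0 - 0.5 * abs(x - center) / center if center else 1.0
    # 1.0 for black (gray = 0), 0.5 for white (gray = max_gray).
    darkness = 1.0 - 0.5 * gray / max_gray
    return position * darkness

def weighted_distance(seg_a, seg_b):
    """Weighted Euclidean distance between two equal-size gray-scale segments.

    Each segment is a list of rows and each row is a list of gray values.
    The darker of the two compared pixels sets the darkness weight, which
    keeps the function symmetric in its arguments.
    """
    total = 0.0
    for row_a, row_b in zip(seg_a, seg_b):
        width = len(row_a)
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            total += pixel_weight(x, width, min(a, b)) * (a - b) ** 2
    return math.sqrt(total)

reference = [[0, 0, 0]]
differs_in_center = [[0, 255, 0]]
differs_on_edge = [[255, 0, 0]]
```

Under these assumed weights, a mismatch in the center column contributes more to
the distance than the same mismatch on an edge column.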
5.1.4 Arbiter
The arbiter is the final component of this algorithm. It takes the output of the
Scorer, that is, the recognition confidence score, checks it against the trained
values of the class members, and then generates the solution for that segment.
The Scorer reports which class a given segment belongs to. The arbiter then goes
through all the data in that particular class and compares each member with the
segmented character.
For this step, an ensemble learning approach is used. Ensemble learning [27] is a
technique for combining many weak learners in an attempt to produce a strong
learner. The term ensemble is usually reserved for methods that generate multiple
hypotheses using the same base learner. Each class contains the set of segmented
characters grouped on the basis of their confidence scores, and the arbiter
weighs the available hypotheses against the input to reach the most likely
solution.
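The idea of combining several weak hypotheses can be illustrated with a simple
confidence-weighted vote. This is not the arbiter described in the reference
paper; the function name `arbitrate` and the summed-confidence rule are
assumptions used only to show how an ensemble decision might be formed.

```python
from collections import defaultdict

def arbitrate(hypotheses):
    """Pick the label with the largest summed confidence.

    hypotheses is a list of (label, confidence) pairs, one per weak learner.
    """
    totals = defaultdict(float)
    for label, confidence in hypotheses:
        totals[label] += confidence
    return max(totals, key=totals.get)

# Two weaker votes for "0" together outweigh one stronger vote for "O".
winner = arbitrate([("O", 0.4), ("0", 0.35), ("0", 0.3), ("Q", 0.2)])
```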
5.2 Reinforcement Learning
Reinforcement Learning [28] is an important part of this algorithm. Reinforcement
learning is an area of machine learning inspired by behaviorist psychology,
concerned with how software agents ought to take actions in an environment so as
to maximize reward. Reinforcement learning is based on the idea of learning from
feedback before acting again, and the same idea is applied in this algorithm to
make it more accurate.
The traditional way to train a character classifier is to provide a set of labeled
captchas which are already segmented and then let the classifier learn to recognize
each character from those segments using the labels. Preparing such labeled,
segmented captchas is itself a process that requires intelligence. This process
assumes that the classifier is given the correct number of segments, one for each
letter in the captcha. Based on these labeled captchas, the algorithm learns the
schemes used in traditional text-based captchas and then generates the solution.
In the "segment then recognize" approach, this assumption holds because the
segmentation is handled by a vision algorithm that is not part of the classifier
itself; that is, the segmentation algorithm and the recognition algorithm are two
independent algorithms.
(a) captcha error recognition
(b) captcha's bad segmentation and bad recognition
Figure 5.6: Reinforced Learning.
This algorithm also uses a reinforcement learning approach in which a human plays
a part. Instead of providing the classifier with labeled examples of valid
segments, the algorithm asks the human to explain the segments that have been
misclassified, and then learns from the feedback. Training starts with the
traditional method: the algorithm first processes a set of labeled captchas. Some
captchas in this set fail to decode, and these failed captchas are saved. For
each failed captcha, the algorithm asks for human feedback when a segment
surrounded by two correctly classified segments is misclassified. In those cases,
the algorithm needs human expertise because the misclassification could be due
either to improper segmentation or to bad recognition. If the error was due to
improper segmentation, the segment is discarded. If the error was due to a
recognition error, the segment is added to the classifier training set. When
all the cases are reviewed, the algorithm is retrained with the enriched dataset.
In practice, even a single round of reinforcement learning is enough to
significantly improve accuracy.
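The review step above can be sketched as a small function that builds the
enriched training set. The tuple layout and the cause labels "segmentation" and
"recognition" are assumptions made for this sketch; the retraining step itself is
left out.

```python
def apply_feedback(training_set, reviewed_segments):
    """Return an enriched copy of training_set after human review.

    reviewed_segments is a list of (pixels, true_label, cause) tuples, where
    cause is "segmentation" or "recognition" as judged by the human reviewer.
    """
    enriched = list(training_set)
    for pixels, true_label, cause in reviewed_segments:
        if cause == "recognition":
            # A valid segment that was misread: keep it as a new labeled example.
            enriched.append((pixels, true_label))
        # Segments that resulted from improper segmentation are discarded.
    return enriched

training = [([0, 0], "A")]
reviewed = [
    ([1, 1], "B", "recognition"),
    ([2, 2], "C", "segmentation"),
]
enriched = apply_feedback(training, reviewed)
```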
5.3 Dealing with Occluding Lines.
To make captchas more difficult for any algorithm to break, captcha designers use
occluding lines. Occluding lines are extraneous lines drawn across the characters
to disrupt captcha-breaking algorithms. This algorithm was initially unable to
handle occluding lines, but the problem was solved later.
The initial attempt was to introduce a new, independent line-removal algorithm
based on the soft margin algorithm.
Later, two simpler methods were formulated. The first method was to add a new
class to the Scorer. Because the Scorer relies on the KNN algorithm, which permits
discontinuous character classes, a new class could be added that collects many
different shapes of line. The second method relies on the cut-point detector: as
mentioned earlier, it computes the second derivative of the curve and finds the
inflection points to locate potential cuts, which makes it well suited to
ignoring flat parts.
Figure 5.7: Occluding Lines.
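The second method can be illustrated with discrete second differences. In the
sketch below, a flat or straight stretch has a second difference of zero
everywhere, so it produces no inflection points and therefore no candidate cuts.
The function name `inflection_points` and the sampled contour values are
assumptions, and exact zeros are not treated as sign changes in this simplified
version.

```python
def inflection_points(ys):
    """Return indices in ys where the discrete second derivative changes sign.

    ys is a list of contour heights sampled at consecutive x positions.
    """
    # Central second difference at index i: y[i-1] - 2*y[i] + y[i+1].
    second = [ys[i - 1] - 2 * ys[i] + ys[i + 1] for i in range(1, len(ys) - 1)]
    points = []
    for j in range(1, len(second)):
        if second[j - 1] * second[j] < 0:
            # second[j] belongs to ys[j + 1], the sample after the sign change.
            points.append(j + 1)
    return points

flat_line = [5] * 8
curved_contour = [0, 1, 4, 9, 12, 13, 12]
```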
5.4 Sequential Recognition
For every algorithm, one of the prime factors that must always be considered is
the computational cost, which reflects the performance of the algorithm on its
inputs. As seen in the previous section, the computational cost of the algorithm
increases with the size of the captcha, because both the number of characters to
be segmented and the number of characters to be recognized increase. This is a
serious issue for efficiency.
This problem is addressed with a separate variation of the algorithm that makes
the character recognition process more efficient by using local recognition.
Local recognition is a sub-process of the main recognition process. It makes
local decisions using a window that takes in two letters at a time. Considering
two characters at a time yielded significantly better results than looking at one
or three characters at a time. Within a window, all possible cuts are examined.
Suppose there are 3 cuts in a window of two characters. The first step is to take
one of the 3 cuts and perform segmentation, which yields two separated
characters. The pixel-level score of these separated characters is calculated and
stored. This process is repeated for every cut present in the window. The maximum
score is then selected as the recognition confidence score. However, this local
decision process is prone to errors.
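The local decision described above can be sketched as a loop over the candidate
cuts in one window. The scoring function is passed in as a parameter because the
report does not define it precisely; `best_cut`, `score_segment`, the toy scorer,
and the choice to add the two segment scores are assumptions for illustration.

```python
def best_cut(window, cuts, score_segment):
    """Try every candidate cut in a two-character window and keep the best.

    window is a sequence of pixel columns, cuts are column indices, and
    score_segment is a hypothetical callable that returns a recognition
    confidence for one segment.
    """
    best = None
    for cut in cuts:
        left, right = window[:cut], window[cut:]
        score = score_segment(left) + score_segment(right)
        if best is None or score > best[1]:
            best = (cut, score)
    return best

def toy_scorer(segment):
    # Toy rule: a segment scores high only if it is exactly 3 columns wide.
    return 1.0 if len(segment) == 3 else 0.1

result = best_cut(list(range(6)), [2, 3, 4], toy_scorer)
```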
Figure 5.8: Sequential Recognition.
The main source of error in this method is that if the characters of the captcha
are heavily merged with one another, the window is unlikely to be placed
correctly and the algorithm can give an entirely wrong output. It was also found
that some captcha schemes are oriented in one specific direction, either left or
right. This local decision process is more effective on left-oriented captchas
than on right-oriented ones.
The best solution to this issue is to perform sequential recognition from both
directions and then combine the two recognition scores to improve the overall
accuracy. This is called the left-right approach. It runs two local decision
processes simultaneously: one performs recognition from left to right and the
other from right to left. After both processes complete successfully, their
results are combined to obtain the best score.
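One simple way to combine the two passes is to keep, for each character position,
the reading with the higher confidence. The report does not state how the scores
are combined, so this rule, the function name `combine_directions`, and the
assumption that both passes yield the same number of characters are illustrative
only.

```python
def combine_directions(left_to_right, right_to_left):
    """Merge two per-position readings, keeping the higher-confidence one.

    Each argument is a list of (label, confidence) pairs in left-to-right
    character order; the right-to-left pass is assumed to have been reversed
    back into that order before it is passed in.
    """
    combined = []
    for forward, backward in zip(left_to_right, right_to_left):
        combined.append(max(forward, backward, key=lambda reading: reading[1]))
    return "".join(label for label, _ in combined)

forward_pass = [("a", 0.9), ("b", 0.4), ("c", 0.7)]
backward_pass = [("a", 0.8), ("d", 0.6), ("c", 0.9)]
solution = combine_directions(forward_pass, backward_pass)
```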
Chapter 6
Areas of Improvement
The holistic approach [9], which performs segmentation and recognition
simultaneously, is being formulated here for the first time. Though this algorithm
produces good results, it is just a first rough implementation. This chapter
describes some of the most promising directions for improvement.
6.1 Learn the KNN weights
As discussed earlier, the KNN algorithm is used to classify the characters. The
current implementation uses a single manually chosen set of weights for the KNN
distance computation that performed well on the captcha schemes provided
initially. It is believed that automatically learning those weights for each
captcha scheme would improve accuracy, particularly for schemes that use unusual
fonts or specific distortions. It is also believed that this can be accomplished
fully unsupervised, similar to the cut-point detector and slicer phases of the
algorithm.
6.2 Improve cut-point elimination
Since the computation time is directly related to the number of potential
segments, the first optimization was to come up with heuristics to reduce the
number of cut points considered by the cut-point detector. This optimization works
by pruning near-duplicate and improbable cuts from the set of potential cut
points. First, all cuts that have an angle greater than 30 degrees are removed.
Then the ratio of white pixels to black pixels is examined to eliminate cut lines
that pass through too many black pixels, since they are most likely cutting
through the middle of a letter. Finally, the pixel intensities of the left and
right boundaries are compared to estimate whether the cut marks a transition
between two letters. Finding a better set of heuristics that are both generic and
more precise is an open question.
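The first two pruning rules can be sketched as a simple filter. Only the
30-degree limit comes from the text; the dictionary layout of a cut, the 0.2
black-pixel ratio, and the gray threshold of 128 are assumptions made for this
sketch. The boundary-intensity comparison is omitted.

```python
def prune_cuts(cuts, max_angle=30.0, max_black_ratio=0.2):
    """Drop improbable candidate cuts.

    Each cut is a dict with "angle" (degrees from vertical) and "pixels"
    (a non-empty list of gray values along the cut, 0 = black, 255 = white).
    """
    kept = []
    for cut in cuts:
        if abs(cut["angle"]) > max_angle:
            continue  # too slanted to be a plausible cut
        black = sum(1 for p in cut["pixels"] if p < 128)
        if black / len(cut["pixels"]) > max_black_ratio:
            continue  # probably slices through the middle of a letter
        kept.append(cut)
    return kept

candidates = [
    {"angle": 10, "pixels": [255] * 10},            # kept
    {"angle": 45, "pixels": [255] * 10},            # removed: angle too large
    {"angle": 5, "pixels": [0] * 5 + [255] * 5},    # removed: too many black pixels
]
remaining = prune_cuts(candidates)
```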
6.3 Additional Occlusion
As pointed out earlier, the Baidu and CNN captcha schemes use occluding lines with
low curvature. While results on these captcha schemes are very good and our
algorithm properly detects lines, future work should investigate in depth how
various types of lines, e.g., sine waves that have a high curvature, impact the
recognition rate. It should also consider other types of occlusion, e.g., blobs.
To date, we have not found real-world captcha schemes that employ this type of
occlusion; perhaps occlusion of this type presents usability challenges that make
it impractical for humans.
6.4 Explore deep neural networks
A primary contribution of this work is to demonstrate the effectiveness of
performing segmentation and recognition simultaneously. Accordingly, other
algorithms that are able to process captchas in a holistic manner were also
considered. In particular, with collaborators, experiments were conducted with
deep convolutional neural networks, similar to those in [17]. These experiments
have confirmed the benefits of a unified approach, and have achieved
captcha-solving results that equal or improve upon those presented here. For
certain ReCaptcha data sets, these new results show such dramatic improvement in
accuracy, while using large-scale training sets, that they suggest that deep
neural networks may hold a substantial advantage over humans for solving
text-based captchas [18].
Chapter 7
Future Works of captcha system
With the demonstration (through research publications) that character recognition
CAPTCHAs are vulnerable to computer vision based attacks, some researchers have
proposed alternatives to character recognition, in the form of image recognition
CAPTCHAs which require users to identify simple objects in the images presented.
The argument is that object recognition is typically considered a more challenging
problem than character recognition, due to the limited domain of characters and
digits in the English alphabet. This is the reason why an image, rather than a
set of characters, is used as the captcha. When captchas were invented, the
designers realized that with the passage of time one of two things would happen:
either captchas
would remain an invaluable way to differentiate humans and computers, or very high
quality OCR would become readily available.
The entire discussion so far has been about solving text-based captchas. It was
found that text-based captchas are approaching the end of their usefulness, as
they have become relatively easy to decode. By performing segmentation and
recognition together, this algorithm was able to decode many captchas
successfully. It is therefore plausible that, with further improvements to this
algorithm, a much higher decoding rate for text-based captchas could be achieved
in the near future. For all these reasons, the need for a new type of reverse
Turing test is growing.
The first potential method is simply to find a more difficult problem in computer
vision, for example by incorporating video or by requiring the user to perform a
higher-order cognitive task such as circling or rotating an object. Following the
failure of text-based captchas, the new captcha schemes that arrived were audio
and video captchas. But audio captchas also failed, as they could be decoded
using the output of a speech-to-text recognizer. Video captchas can also be
decoded by extracting the frames of the video and then analyzing each frame to
decode the captcha.
Datta et al. published a paper at the ACM Multimedia '05 Conference, named
IMAGINATION (IMAge Generation for INternet AuthenticaTION), proposing a
systematic approach to image recognition captchas. According to that paper, a set
of images is distorted in such a way that state-of-the-art image recognition
approaches fail to recognize them, while humans can still solve the captcha,
though with some difficulty [20].
Microsoft developed the Animal Species Image Recognition for Restricting Access
(ASIRRA) captcha (Figure 7.1a), which asks users to distinguish cats from dogs,
and offered a beta version of it for websites to use. Microsoft claimed: "Asirra
is easy for users; it can be solved by humans 99.6% of the time in under 30
seconds. Anecdotally, users seemed to find the experience of using Asirra much
more enjoyable than a text-based CAPTCHA." This solution was described in a 2007
paper in the Proceedings of the 14th ACM Conference on Computer and
Communications Security (CCS). However, this project was closed in October 2014
and is no longer available. Less than a year after its release, Asirra was
successfully broken using a classifier trained to recognize image textures.
The MintEye captcha (Figure 7.1b) was a modern captcha scheme in which an image is
distorted and the user has to undistort it to restore the original picture [6].
This scheme, which relies on undistorting an image, was broken by a very simple
attack based on Sobel operators that required only 23 lines of Python [5]. As a
result, this scheme was also rejected.
(a) Asirra captcha (b) MintEye captcha
Figure 7.1: Captcha Future.
Mitra et al. have suggested using emergent images as an alternative way to encode
information in video that might be robust against computer vision algorithms.
Emergent images are still or moving images in which objects at first appear only
with effort and concentration, but once recognized are very easy to see again,
even after several months or years. In effect, once a user has recognized the
object, he or she remembers it [23]. Emergence refers to the unique human ability
to aggregate information from seemingly meaningless pieces, and to perceive a
whole that is meaningful.
Recently, game-based captchas have been developed [4]. Such a captcha gives the
user a simple game to complete in order to reach a target goal, on the premise
that only humans can solve it. However, implementing this idea has proven to be
difficult, as the game captcha schemes of the leading game captcha provider, Are
You a Human, have been broken [22].
NuCaptcha is an early fraud detection service which utilizes behavior analytics to
provision threat-appropriate, animated video captchas. NuCaptcha is developed and
operated by the Canadian firm NuData Security. Static image-based captchas are
routinely used to prevent automated sign-ups to websites by using text or images
of words disguised so that optical character recognition (OCR) software has
trouble reading them [10]. However, in common captcha systems, users often fail to
correctly solve the captcha 7% - 25% of the time. NuCaptcha uses animated video
technology that it claims makes puzzles easier for humans to solve, but harder for
bots and hackers to decipher [24].
Cognitive Behavior: Another method to determine whether the user is a human or a
machine is based on solving speed, i.e., the time a human takes to solve a
captcha differs from the time a computer takes [25]. Based on this difference in
solving time, one can recognize who is human and who is a machine.
Leveraging reputation: In addition to considering how a reverse Turing test is
solved, captcha providers could consider the identity of the solver, for example the
IP address, the geographic location, etc. If a good enough proof of identity can be
established, providers can use this reputation to adapt the difficulty of the reverse
Turing test.
Chapter 8
Conclusion
Here a detailed explanation and study was made of an approach that solves captchas
in a single step, using machine learning to attack the segmentation and the
recognition problems simultaneously. Performing both operations jointly allows
this algorithm to exploit information and context that are not available when
they are done sequentially.
This algorithm was able to solve many prominent real-world captcha schemes
that use both negative kerning and occluding lines without any modification to
the algorithm. The algorithm was able to achieve a 38.68% recognition rate on
Baidu 2011, 55.22% on Baidu 2013, 51.09% on CNN, 51.39% on eBay, 22.67% on
ReCaptcha 2011, 22.34% on ReCaptcha 2013, 28.29% on Wikipedia, and 5.33% on
Yahoo.
This study suggests that reverse Turing tests need to be improved going forward.
The effectiveness and universality of the results suggest that combining
segmentation and recognition is the next evolution of captcha solving, and that it
supersedes the sequential approach used in earlier works. With these advances, it
seems that purely text-based captchas are likely to have declining utility;
significant effort may be needed to rethink the way we perform reverse Turing
tests.
Bibliography
[1] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, John C. Mitchell,”The
End is Nigh: Generic Solving of Text-based CAPTCHAs”’,2013.
[2] P. Golle. Machine learning attacks against the asirra captcha. In ACM CCS
2008, 2008.
[3] R. Gossweiler, M. Kamvar, and S. Baluja. Whats up captcha? a captcha based
on image orientation. In World Wide Web, 2009.
[4] Are you human ? http://areyouahuman.com/.
[5] Breaking the minteye image captcha in 23 lines of python. Blog post
http://www.jwandrews.co.uk/2013/01/breakingthe-minteye-image-%captcha-
in-23-lines-of-python.
[6] Minteye captcha. website: http://www.minteye.com/, 2013.
[7] A. S. E. Ahmad, J. Yan, and M. Tayara. The robustness of google captchas.
Technical report, Newcastle University, 2011.
[8] E. Athanasopoulos and S. Antonatos. Enhanced captchas: Using animation
to tell humans and computers apart. In IFIP International Federation for
Information Processing, 2006.
[9] P. Baecher, N. Buscher, M. Fischlin, and B. Milde. Breaking recaptcha: A
holistic approach via shape recognition. In Future Challenges in Security and
Privacy for Academia and Industry, pages 56-67. Springer, 2011.
[10] E. Bursztein. How we broke the nucaptcha video scheme and what we propose
to fix it. blog post http://elie.im/blog/security/howwe-broke-the-nucaptcha-
videoscheme-%and-what-we-propose-tofix- it/, February 2012.
[11] E. Bursztein, R. Bauxis, H. Paskov, D. Perito, C. Fabry, and J. C. Mitchell.
The failure of noisebased non-continuous audio captchas. In Security and
Privacy, 2011.
[12] E. Bursztein and S. Bethard. Decaptcha: breaking 75% of eBay audio
CAPTCHAs. In Proceedings of the 3rd USENIX conference on Offensive
technologies, page 8. USENIX Association, 2009.
[13] E. Bursztein, M. Martin, and J. Mitchell. Text-based captcha strengths and
weaknesses. In Proceedings of the 18th ACM conference on Computer and
communications security, CCS '11, pages 125-138, New York, NY, USA, 2011.
ACM.
[14] E. Bursztein, A. Moscicki, C. Fabry, S. Bethard, D. Jurafsky, and J. C.
Mitchell. Easy does it: More usable captchas. CHI, 2014.
[15] K. Chellapilla, K. Larson, P. Simard, and M. Czerwinski. Computers beat
humans at single character recognition in reading based human interaction
proofs (hips). In CEAS, 2005.
[16] K. Chellapilla and P. Simard. Using machine learning to break visual human
interaction proofs (HIPs). Advances in Neural Information Processing
Systems, 17, 2004.
[17] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S.
Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale
unsupervised learning. In ICML, 2011.
[18] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit
number recognition from street view imagery using deep convolution neural
networks. arXiv preprint arXiv:1312.6082, 2013.
[19] J. Yan and A. El Ahmad. A Low-cost Attack on a Microsoft CAPTCHA. In
Proceedings of the 15th ACM conference on Computer and communications
security, pages 543-554. ACM, 2008.
[20] ”Imagination Paper”. Infolab.stanford.edu. Retrieved 2013-09-28.
[21] ”Asirra is a human interactive proof that asks users to identify photos of cats
and dogs”.
[22] Spamtech. Cracking the areyouahuman captcha.
http://spamtech.co.uk/software/bots/cracking-the-
areyouhumancaptcha/,2012.
[23] N. J. Mitra, H.-K. Chu, T.-Y. Lee, L.Wolf, H. Yeshurun, and D. Cohen-Or.
Emerging images. ACM Transactions on Graphics, 28(5), 2009. to appear.
[24] Y. Xu, G. Reynaga, S. Chiasson, J.-M. Frahm, F. Monrose, and P. van
Oorschot. Security and usability challenges of moving-object captchas: Decod-
ing codewords in motion. In Usenix Security, 2012.
[25] C. Cruz-Perez, O. Starostenko, F. Uceda-Ponga, V. Alarcon-Aquino, and L.
Reyes-Cabrera. Breaking recaptchas with unpredictable collapse: heuristic
character segmentation and recognition. In Pattern Recognition, pages 155-165.
Springer, 2012.
[26] C. Cortes and V. Vapnik. Support-vector networks. Machine learning,
September 2014.
[27] Opitz, D.; Maclin, R. (1999). ”Popular ensemble methods: An empirical
study”. Journal of Artificial Intelligence Research 2014.
[28] Sutton, Richard S. (1984). Temporal Credit Assignment in Reinforcement
Learning (PhD thesis). University of Massachusetts, Amherst, MA.
[29] Altman, N. S. (1992). ”An introduction to kernel and nearest-neighbor
nonparametric regression”. The American Statistician 46, September 2014
[30] Pearl, Judea (1983). Heuristics: Intelligent Search Strategies for Computer
Problem Solving. New York, Addison-Wesley, December 2014.
[31] http://www.encyclopediaofmath.org/index.php/Point of inflection,
January 2015.
[32] Barghout, Lauren, and Lawrence W. Lee. ”Perceptual information processing
system.” Paravue Inc. U.S. Patent Application 10/618,543, filed July 11, 2014.
Dept. of Computer Science & Engg. 27 SIMAT, Vavanoor

Generic Solving Of Text Based Captcha

  • 1.
    Generic Solving ofText-based captchas A seminar report submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science & Engineering Eighth Semester 2011 Admission
  • 2.
    ABSTRACT Over the lastdecade, it has become well-established that a captchas ability to with- stand automated solving lies in the difficulty of segmenting the image into individual characters. The standard approach to solve captchas automatically has been a se- quential process wherein a segmentation algorithm splits the image into segments that contain individual characters, followed by a character recognition step that uses machine learning. While this approach has been effective against particular captcha schemes, its generality is limited by the segmentation step, which is hand-crafted to defeat the distortion at hand. No general algorithm is known for the character collapsing anti-segmentation tech- nique used by most prominent real world captcha schemes. Here a novel approach to solve captchas in a single step that uses machine learning to attack the segmen- tation and the recognition problems simultaneously is formulated. Performing both operations jointly allows the algorithm to exploit information and context that is not available when they are done sequentially. At the same time, it removes the need for any hand-crafted component, making the approach generalize to new captcha schemes where the previous approach cannot. Many websites use captchas, or Completely Automated Public Turing tests to tell Computers and Humans Apart, to block automated interaction with their sites. For example, G mail uses captchas to block access by automated spammers, eBay[12] uses captchas to improve its marketplace by blocking bots from flooding the site with scams, and Facebook uses captchas to limit creation of fraudulent profiles used to spam honest users or cheat at games. The most widely used captcha schemes use combinations of distorted characters and obfuscation techniques that humans can recognize but that may be difficult for automated scripts. 
captchas are sometimes called reverse Turing tests, because they are intended to allow a computer to deter- mine whether a remote client is human or machine.
  • 3.
    Contents List of Figuresii 1 Introduction 1 2 Outline 3 3 Motivation 4 4 Approaches and Data Set 5 4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5 Algorithm 10 5.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5.1.1 Cut - Point Detector . . . . . . . . . . . . . . . . . . . . . . . 11 5.1.2 Slicer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.1.3 Scorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5.1.4 Arbiter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.3 Dealing with Occluding Lines. . . . . . . . . . . . . . . . . . . . . . . 17 5.4 Sequential Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 17 6 Areas of Improvement 19 6.1 Learn the KNN weights . . . . . . . . . . . . . . . . . . . . . . . . . 19 6.2 Improve cut-point elimination . . . . . . . . . . . . . . . . . . . . . . 19 6.3 Additional Occlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6.4 Explore deep neural networks . . . . . . . . . . . . . . . . . . . . . . 20 7 Future Works of captcha system 21 8 Conclusion 24 Bibliography 25 i
  • 4.
    List of Figures 4.1Segmentation then Recognition. . . . . . . . . . . . . . . . . . . . . . 6 4.2 Various Distortion in Negatively Kerned captcha. . . . . . . . . . . . 7 4.3 15 Best Captchas over which test was conducted. . . . . . . . . . . . 8 4.4 Captchas over which test was conducted. . . . . . . . . . . . . . . . . 9 5.1 Overview of the algorithm’s four components . . . . . . . . . . . . . . 10 5.2 How Algorithm Works . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.3 Example of the algorithm successfully applied to a Yahoo captcha . . 12 5.4 Cut Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.5 Graph creation in Slicer . . . . . . . . . . . . . . . . . . . . . . . . . 14 5.6 Reinforced Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 5.7 Occluding Lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.8 Sequential Recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7.1 Captcha Future. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 ii
  • 5.
    Chapter 1 Introduction A captchastand for Completely Automated Public Turing test to tell Computers and Humans Apart [8]is a type of challenge-response test used in computing to de- termine whether or not the user is human. From the abbreviation it was clear that captcha is a Turing test. Turing test is a test a machines intelligence level, to know whether the machines intelligence reaches up to the level of humans. captcha was found in 2000 by Luis von Ahn, Manuel Blum and Nicholas J. Hopper of Carnegie Mellon University and John Langford of IBM[14]. We know that captcha is used to determine whether user is machine or human but this was old concept. With the implementation of the captcha breaking system as mentioned in the reference paper most of the captchas can be decode by the algorithm designed by the authors. So in order to undergo Turing Test captcha is old fashion now stronger method or captcha has to be designed. This captcha breaking system can break most of the captcha in various web application systems The standard approach to solve captcha automatically (i.e. by a computational de- vice) is by sequential processing. This Sequential processing consists of two major functions they are Segmentation [32] and Recognition. The segmentation algorithm splits the image into segments that contain individual characters. The recognition algorithm uses machine learning to recognize a single character. After recogniz- ing all the character the machine can generate the perfect decoded format of the captcha. After segmentation of captcha, Recognition is performed that is why it is called sequential processing. This approach is effective only against a particular set of captcha schemes. In some captcha schemes the sequential processing will fail at segmentation step. These exceptional captcha schemes follow hand-crafted to technique. There is no general algorithm known for the character segmentation process for hand real world captcha schemes. 
Due to this drawback the traditional sequential processing failed. Here the discussion is about the algorithm which is not sequential but simultane- ous processing. That is the two major functions Segmentation and Recognition are executed simultaneously over the captcha. Performing both operations jointly allows the algorithm to get full information for machine learning and context which was not available when sequential Algorithm was used. It also removes the hand-crafted schemes, making this approach the generalized approach to new captcha schemes where the previous approach cannot. Many websites use captchas, Gmail uses captchas[7] to block spam access, eBay uses 1
  • 6.
    Generic Solving ofText-based captchas captchas to prevent flooding into the site with scams, and Facebook uses captchas to limit creation of system based fake profile generation. users or cheat at games. The most widely used captcha schemes use combinations of distorted characters and obfuscation techniques that humans can recognize but that may be difficult for au- tomated scripts. Captchas are sometimes called reverse Turing tests, because they are intended to allow a computer to determine whether a remote client is human or machine[15]. The effectiveness and universality of the results suggests that combin- ing segmentation and recognition is the next evolution of automated captcha solving, and can suppress the sequential approach used in earlier works. After comparing the accuracy of the algorithm with the accuracy of humans it was found that purely text based captchas[16] may be nearing their end, and provides early steps toward rethinking how reverse Turing tests can be performed securely. Dept. of Computer Science & Engg. 2 SIMAT, Vavanoor
Chapter 2 Outline

This report presents a new approach to breaking text-based captchas. Breaking a captcha with an algorithm is not in itself a good deed. The main purpose of this report is to help researchers understand the approach used to break captchas, so that stronger and more complex reverse Turing tests can be designed. The algorithm concentrates only on solving text-based captchas.

The first chapter introduces the proposed algorithm for solving text-based captchas, together with a brief account of the inventors of the captcha and the basic information needed to understand captchas. The third chapter describes the relevance of and motivation behind the study of this algorithm. The chapter Approaches and Data Set describes the important approaches to solving a captcha and the methods used to implement this algorithm. The Data Set section lists the most widely known text-based captcha schemes. The fifth chapter defines and illustrates the algorithm in detail, including all the sub-processes it uses. The sixth chapter describes future work to improve the reverse Turing test process, and mentions new captcha schemes that could replace the traditional text-based captcha. The final chapter is the conclusion, which summarizes the entire report. The conclusion is followed by the Bibliography.
Chapter 3 Motivation

The main motivation of this method is to bring forward a new concept for solving captchas. Captchas are believed not to be solvable by machines, but the authors of the main reference paper have proposed a new algorithm that solves them. The authors state that they are publishing this algorithm to advance the field of reverse Turing tests and for academic research. The algorithm is complex, and more costly to reproduce than employing cheap manual labor to solve captchas. Because the algorithm solves captchas with high accuracy, it pushes designers and researchers to invent new and more complex captcha systems, so that the level of security in the field of Computer Science can be raised further.
Chapter 4 Approaches and Data Set

4.1 Approach

As mentioned in the Introduction, this algorithm is applicable only to text-based captchas, so the discussion is restricted to text-based captcha systems. Text-based captchas [13] are treated as images, and so various image processing techniques are applied to them. This section discusses the approaches made in the past, the approach used to implement an automated captcha-solving system, and its limitations. Automated captcha solving consists of two main processes:

• Segmentation - Segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. Here, segmentation means dividing the captcha into its individual characters. This process is the most difficult and complex to design. It uses concepts from image processing, since the captcha is an image. In practice, it is said that no algorithm performs this segmentation accurately in general.

• Recognition - Recognition is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. Here, recognition means identifying each distorted, segmented character with the help of machine learning. Machine learning is used because machine learning algorithms have been found to consistently outperform humans at single-character recognition. The recognition step is what makes the algorithm intelligent.
Figure 4.1: Segmentation then Recognition.

There are generally two main methods to implement an automated captcha-solving system:
• Segment then Recognize Method
• Segment and Recognize Together Method

Figure 4.1 shows the processes involved in decoding, or breaking, a text-based captcha scheme. The process of solving a captcha automatically can be divided into five generic steps: pre-processing, segmentation, post-segmentation, recognition, and post-processing. While segmentation (the separation of a sequence of characters into individual characters) and recognition (identifying those characters) are intuitive and generally understood, there are good reasons for considering the additional pre-processing and post-processing steps as part of a standard process. For example, pre-processing can remove background patterns or eliminate other additions to the image that could interfere with segmentation, while post-segmentation steps can clean up the segmentation output by normalizing the size of each image or otherwise performing steps distinct from segmentation. After recognition, post-processing can improve accuracy by, for example, applying spell checking to any captcha that is based on actual words (such as Slashdot).

Based on this generic captcha-solving architecture, experiments with various specific algorithms were run on the captchas of various popular websites. From this analysis, a set of techniques was identified that make captchas more difficult to solve automatically. By varying these techniques, a larger set of techniques was created, which made it possible to study the effect of each of these features in detail and to refine the automated attack methods.
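The five generic steps can be pictured as a chain of small functions. The sketch below is only an illustration under strong simplifying assumptions: the image is a list of strings in which '#' is ink and '*' is background noise, every character is assumed to be exactly three columns wide, recognition is a lookup in a tiny hand-made glyph table rather than machine learning, and all function names (remove_noise, fixed_width_split, trim_columns, lookup, spell_check, solve_captcha) are hypothetical and not taken from the reference paper.

```python
from typing import Dict, List, Tuple

Image = List[str]  # rows of equal length; '#' = ink, '.' = background, '*' = noise

def remove_noise(img: Image) -> Image:
    """Pre-processing: erase every noise pixel."""
    return [row.replace("*", ".") for row in img]

def fixed_width_split(img: Image, width: int = 3) -> List[Image]:
    """Segmentation (toy version): cut the image every `width` columns."""
    return [[row[x:x + width] for row in img] for x in range(0, len(img[0]), width)]

def trim_columns(piece: Image) -> Tuple[str, ...]:
    """Post-segmentation: normalise a piece by dropping its empty columns."""
    keep = [x for x in range(len(piece[0])) if any(row[x] == "#" for row in piece)]
    return tuple("".join(row[x] for x in keep) for row in piece)

# Hand-made glyph table standing in for a trained classifier (assumption).
GLYPHS: Dict[Tuple[str, ...], str] = {
    ("#.#", "###", "#.#"): "H",
    ("#", "#", "#"): "I",
}

def lookup(glyph: Tuple[str, ...]) -> str:
    """Recognition (toy version): exact match against the glyph table."""
    return GLYPHS.get(glyph, "?")

def spell_check(text: str, words: List[str]) -> str:
    """Post-processing: snap to the closest dictionary word of equal length."""
    candidates = [w for w in words if len(w) == len(text)]
    if text in words or not candidates:
        return text
    return min(candidates, key=lambda w: sum(a != b for a, b in zip(w, text)))

def solve_captcha(img: Image, words: List[str]) -> str:
    """Run the five steps in order."""
    clean = remove_noise(img)
    pieces = fixed_width_split(clean)
    glyphs = [trim_columns(p) for p in pieces]
    raw = "".join(lookup(g) for g in glyphs)
    return spell_check(raw, words)
```

Each step only sees the output of the one before it, which is exactly the property the sequential approach relies on and the joint approach gives up.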
The Segment then Recognize Method is the traditional method, in which segmentation is done first and the segmented characters are then passed to a recognition algorithm that uses machine learning to recognize each character. While this approach has been effective against particular captcha schemes, its generality is limited by the segmentation step, which is hand-crafted to defeat the distortion at hand. No general algorithm is known for the character-collapsing anti-segmentation technique used by most prominent real-world captcha schemes. This technique is called negative kerning, and it is a variant of the object occlusion problem.
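A small experiment makes the limitation concrete. The sketch below uses a deliberately naive hand-crafted segmenter that cuts wherever a column contains no ink; this is an assumed stand-in for hand-crafted segmentation in general, not a method from the reference paper, and the name split_on_gaps and the toy images are hypothetical. It separates two spaced characters but returns a single blob once the same characters are pushed together until they touch.

```python
from typing import List

Image = List[str]  # rows of equal length; '#' = ink, '.' = background

def split_on_gaps(img: Image) -> List[Image]:
    """Cut the image at every run of columns that contains no ink."""
    width = len(img[0])
    has_ink = [any(row[x] == "#" for row in img) for x in range(width)]
    pieces: List[Image] = []
    start = None
    # A trailing False acts as a sentinel that closes the last piece.
    for x, inked in enumerate(has_ink + [False]):
        if inked and start is None:
            start = x
        elif not inked and start is not None:
            pieces.append([row[start:x] for row in img])
            start = None
    return pieces

# "H" and "I" separated by one blank column.
spaced = ["#.#.#",
          "###.#",
          "#.#.#"]

# The same glyphs with the gap removed, as in a collapsed captcha.
kerned = ["#.##",
          "####",
          "#.##"]

print(len(split_on_gaps(spaced)))  # two pieces
print(len(split_on_gaps(kerned)))  # one piece: the gap-based segmenter fails
```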
(a) captcha with no noise and distortion (b) captcha with cut through lines (c) captcha with color and distortion
Figure 4.2: Various Distortion in Negatively Kerned captcha.

Negative kerning (Figure 4.2) is a character-collapsing technique in which the space between the characters is removed and each character is occluded by its neighboring character. Occluding here means joining or overlapping the characters. Along with occluding the characters, extra noise, distortion and randomization are added to prevent side channel attacks. The noise takes the form of added colors and distorted lines cutting through the text, which make the captcha more complex. A side channel attack is the process of recognizing the captcha content by continuously learning each character in the captcha and predicting the result. When noise is added to a negatively kerned captcha, a side channel attack [19] becomes difficult to carry out.

Figure 4.2a shows a captcha that has undergone negative kerning with no noise [11], no occluding lines and no external distortions added. Figure 4.2b shows a captcha to which negative kerning and occluding lines have been applied to add distortion and cause confusion. Figure 4.2c shows a simple captcha with distortions in the form of color, occluding lines and dots; here no negative kerning is applied.

Negative kerning is considered the most secure method for preventing segmentation because it has successfully withstood years of attacks. Almost all of the most prominently used captcha schemes rely on it. The other method of choice to prevent segmentation, which seems to have fallen out of fashion after a successful wave of attacks, is to use occluding lines.
A captcha's ability to withstand automated solving lies in the difficulty of segmenting the image into individual characters rather than in recognizing the characters themselves. In other words, segmentation is the difficult part of automated captcha solving.

Up to now, two kinds of attack have been formulated for automated captcha solving.

The first type of attack is to apply a side channel attack to all types of captcha. In this method the segmentation algorithm divides the captcha into individual characters together with the distortion attached to each character. The machine learning part then tries to remove the distortion or predict the character and generate the output. This approach is not very favorable, because the defender can easily protect the captcha by making it difficult to segment, and even if segmentation is carried out the output will not be reliable. As a result, this attack cannot be applied to all captchas.

The second type of attack focuses on finding weaknesses in the distortion algorithms of particular captcha schemes. The attacker designs a special segmentation algorithm, based on image processing and morphological segmentation, to remove the distortion. The image processing algorithm is given features for recognizing the twists and turns in the captcha text, and uses them to bring the text into an understandable form. The morphological segmentation fills in the missing data based on the information obtained from the image processing part. This attack was also of limited use, since it was applicable only to the reCaptcha 2011 scheme. Later, in 2013, a group of researchers examined hollow captchas specifically and were able to solve all of those captcha schemes by extending the segment then recognize approach into a process of nine consecutive steps.

Up to now, research in captcha solving has followed the exploit-patch cycle. The exploit-patch method was tried on the 15 best captcha schemes, shown in Figure 4.3. In the exploit-patch cycle the attacker finds a flaw in a particular anti-segmentation technique, and the defender then either patches it, that is, removes the flaw in the anti-segmentation technique, or moves on to a new technique. The limitation of the segment then recognize approach has been the attacker's ability to find new flaws. The proposed algorithm overcomes this limitation by segmenting and recognizing the captcha simultaneously, thus removing the need for manually discovered heuristics to segment captchas [1].

Figure 4.3: 15 Best Captchas over which test was conducted.
4.2 Dataset

This section describes the widely used captcha schemes on which the algorithm was tested. The authors evaluated the efficiency and complexity of their algorithm on the dataset given in Figure 4.4.

Figure 4.4: Captchas over which test was conducted.

Most of these six captcha schemes depend on negative kerning to prevent segmentation. Since 2011, some of those schemes, namely Baidu and ReCaptcha, have evolved. To keep the evaluation of the algorithm relevant to the state of the art in captcha design, the authors extended their corpus in 2013 to include the new versions of Baidu and ReCaptcha. A corpus is a large and structured set of texts used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. Here the language is the captcha.

As the figure shows, Baidu and ReCaptcha evolved (the updated version is the 2013 version) in two radically different ways: Baidu decided to use hollow letters, whereas ReCaptcha introduced more aggressive distortions. The algorithm's success rate decreased on the new version of ReCaptcha compared to the previous version. On the other hand, surprisingly, its accuracy increased significantly on the newer version of Baidu.
Chapter 5 Algorithm

This chapter gives an overview of the algorithm along with a description of its major components. As mentioned earlier, learning for recognition is a very important part of this algorithm, and reinforcement learning is the main reason for its accuracy. The previous chapter described that occluding lines, and not only negative kerning, are used in creating captchas, so this chapter also covers solving occluding lines in a generic manner, since that is a natural extension of the algorithm. Optimizations and trade-offs that can be applied to the algorithm are discussed in the next chapter.

5.1 Algorithm Overview

As mentioned earlier, this algorithm performs segmentation and recognition together. The first step is to find all possible ways to segment the given captcha, that is, the set of ways in which each character of the captcha could be cut out. An unstructured captcha image can be segmented in many different ways. Once the set of candidate segmentations is available, the algorithm decides which combination is most likely to be the correct one: after analyzing all the possible segmentation paths, it picks the path (set of segments) that maximizes the recognition rate. This contrasts with the segment then recognize approach, where an uninformed segmentation algorithm passes at most a small number of possible segmentations to an independent recognition algorithm. As a result, that approach is more time consuming and may or may not generate a correct result even after many attempts.

Figure 5.1: Overview of the algorithm's four components

As Figure 5.1 makes clear, this algorithm consists of four main components:
• Cut-Point Detector: It finds all the potential ways to segment the captcha.
• Slicer: It extracts the segments and combines them into a graph.
• Scorer: It performs Optical Character Recognition (OCR) on the segments and assigns a recognition confidence score to each one of them.
• Arbiter: It is responsible for processing the scores and determining the most likely letters.

Figure 5.2: How Algorithm Works

The graph representation in Figure 5.2 is used to find and store, all at once, every possible segmentation that can be derived from the captcha. This is what allows the algorithm to solve the segmentation and recognition problems simultaneously. The working of each component is discussed in detail below.

5.1.1 Cut-Point Detector

The Cut-Point Detector is the first step of this algorithm. As mentioned earlier, it generates the set of candidate segmentation areas. Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
To perform the segmentation, two points (pixels) are needed to construct each cut line. These points are obtained by examining the second derivative of the curve generated by following the top pixels of the captcha, and of the curve generated by following the bottom pixels of the captcha. This gives two curves: the first is the second derivative of the top pixels of the captcha, and the second is the second derivative of the bottom pixels of the captcha.

On both curves, the points at which cuts may be made have to be marked. These are the inflection points [31] of the curve. An inflection point is a point on a curve at which the curve changes from being concave (concave downward) to convex (concave upward), or vice versa. This yields one set of inflection points on the first curve and another on the second curve. The inflection points on the top curve are marked in red and the inflection points on the bottom curve are marked in blue.

Once the points are available, the process of finding all possible cut lines begins. Each cut is constructed by connecting two inflection points, one from the top and one from the bottom. Doing this for the entire curve gives a set of cuts called the Potential Cuts. These Potential Cuts are marked over the captcha, and the captcha with the Potential Cuts is given as input to the Slicer.

Figure 5.3: Example of the algorithm successfully applied to a Yahoo captcha
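The cut-point detection described above can be sketched in a few lines. The code below is a minimal illustration, assuming the captcha is a binary image given as a list of strings with '#' for ink, that the second derivative is approximated by a discrete second difference, and that an inflection point is reported where that difference changes sign. The helper names (top_contour, bottom_contour, inflection_points, candidate_cuts, detect_cuts) are hypothetical and are not taken from the reference paper.

```python
from typing import List, Tuple

Image = List[str]  # rows of equal length; '#' = ink, anything else = background

def top_contour(img: Image) -> List[int]:
    """Row index of the first ink pixel in each column (len(img) if the column is empty)."""
    height = len(img)
    contour = []
    for x in range(len(img[0])):
        rows = [y for y in range(height) if img[y][x] == "#"]
        contour.append(rows[0] if rows else height)
    return contour

def bottom_contour(img: Image) -> List[int]:
    """Row index of the last ink pixel in each column (-1 if the column is empty)."""
    height = len(img)
    contour = []
    for x in range(len(img[0])):
        rows = [y for y in range(height) if img[y][x] == "#"]
        contour.append(rows[-1] if rows else -1)
    return contour

def inflection_points(curve: List[int]) -> List[int]:
    """Indexes where the discrete second difference of the curve changes sign."""
    points = []
    prev_sign, prev_i = 0, 0
    for i in range(1, len(curve) - 1):
        d2 = curve[i - 1] - 2 * curve[i] + curve[i + 1]
        sign = (d2 > 0) - (d2 < 0)
        if sign == 0:
            continue  # flat stretch: no curvature information here
        if prev_sign and sign != prev_sign:
            # Report the midpoint between the two samples of opposite curvature.
            points.append((prev_i + i) // 2)
        prev_sign, prev_i = sign, i
    return points

def candidate_cuts(top_pts: List[int], bottom_pts: List[int]) -> List[Tuple[int, int]]:
    """Connect every top inflection point to every bottom inflection point."""
    return [(t, b) for t in top_pts for b in bottom_pts]

def detect_cuts(img: Image) -> List[Tuple[int, int]]:
    """Potential cuts as (top column, bottom column) pairs."""
    return candidate_cuts(inflection_points(top_contour(img)),
                          inflection_points(bottom_contour(img)))
```

Pairing every top point with every bottom point is what makes the number of potential cuts grow quickly, which is why the pruning described in the Slicer section matters.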
5.1.2 Slicer

The Slicer receives as input a captcha with the potential cuts marked over it. The slicer applies some heuristics [30] to extract the meaningful potential segments based on the cut points, and builds the graph as shown in Figure 5.4. A potential segment is considered meaningful if the two cuts that define its left and right boundaries are sufficiently far apart, yet not too far apart.

The formation of the graph (Figure 5.5) is a distinctive process. In this algorithm the entire text-based captcha is divided into a set of windows, each covering 2 characters at a time. As shown in the figure, the characters A and B are two characters of the captcha and reside in the window. This window contains 4 potential cuts found in the previous step, numbered 1 to 4, while 0 and 5 are the borders of the window and are also treated as cuts. The content of each region between cuts is then examined. Initially the algorithm takes the region between cuts 0 and 1. The content is analyzed, and a weight is assigned to the recognized part along with the possible character. Moving on, the region between cuts 1 and 2 does not give a meaningful segment, so it is discarded. After all the values have been obtained, a graph is built in which the numbered cuts are the nodes and each edge carries the possible character and the weight assigned to it. Based on this graph the best cut can be found, here cut 3, because the regions from 0 to 3 and from 3 to 5 each give a consistent character with nearly the same weight. In simple terms, a cut or segment is considered potential if the distance from the cut to the black pixels is sufficiently far, yet not too far.

Figure 5.4: Cut Optimization

Naturally, as the number of potential cuts increases, the computation time also increases.
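Choosing the best segmentation from such a graph amounts to a best-path search. The sketch below is one possible reading, not the reference implementation: cuts are numbered nodes, each meaningful segment is an edge from its left cut to its right cut carrying a guessed character and a confidence, and a path is scored by the product of its confidences (an assumption; the report only says the path should maximize the recognition rate). The function name best_segmentation and the example confidences are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, int]           # (left cut, right cut), with left < right
Guess = Tuple[str, float]        # (character, confidence in (0, 1])

def best_segmentation(n_cuts: int, edges: Dict[Edge, Guess]) -> Tuple[str, float]:
    """Best path from cut 0 to cut n_cuts - 1, scored by the product of confidences."""
    # best[c] = (log-score, decoded text) of the best path that ends at cut c.
    best: List[Tuple[float, str]] = [(-math.inf, "")] * n_cuts
    best[0] = (0.0, "")
    for end in range(1, n_cuts):
        for (left, right), (char, conf) in edges.items():
            if right != end or conf <= 0 or best[left][0] == -math.inf:
                continue
            score = best[left][0] + math.log(conf)
            if score > best[end][0]:
                best[end] = (score, best[left][1] + char)
    score, text = best[-1]
    return text, math.exp(score)

# A window like the one described above: borders 0 and 5, inner cuts 1 to 4.
# The region (1, 2) is absent because it was discarded as not meaningful.
window = {
    (0, 1): ("I", 0.30),
    (0, 3): ("A", 0.90),
    (1, 3): ("V", 0.20),
    (3, 4): ("I", 0.40),
    (3, 5): ("B", 0.80),
    (4, 5): ("3", 0.30),
}
text, confidence = best_segmentation(6, window)
print(text, round(confidence, 2))
```

Because every edge points from a lower-numbered cut to a higher-numbered one, processing the cuts in increasing order is enough to guarantee that each partial score is final before it is reused.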
It was found that with this algorithm it took 9 hours to perform slicing on a captcha containing 12 characters. The only remedy for this drawback is to decrease the number of cuts in the Potential Cut set. To optimize the algorithm, a new approach was formulated that works by pruning (removing) near-duplicate and improbable cuts from the set of potential cut points. First, all the cuts that have an angle greater than 30 degrees are removed. Then the ratio of white pixels to black pixels is examined to eliminate cut lines that pass through too many black pixels, since they are most likely cutting through the middle of a letter. Finally, the pixel intensities of the left and right boundaries are compared to estimate whether the cut marks a transition between two letters.

Figure 5.5: Graph creation in Slicer

5.1.3 Scorer

The scorer assigns a score to each segmented character. It traverses the graph of potential segments, applies OCR and then assigns a confidence value. The previous step produces candidate character regions, called the potential segments. The Scorer analyzes these segments and generates a score for each. After the score has been generated, the class to which the potential segment belongs is determined using the KNN algorithm.

The KNN algorithm [29] is the k Nearest Neighbor algorithm. It is a classification algorithm that assigns an element to a class by looking at the k closest elements in the feature space and taking the class most common among them. Here a modified version of the KNN algorithm is used. First, once the potential segments in the captcha are available, the score is calculated at the pixel level for the corresponding character, and then, based on the overall confidence value, a recognition confidence score is assigned. This recognition confidence score is used as the input to the KNN algorithm, and its value is checked against the surrounding class values. A class contains a set of similar characters; for example, A and 4 belong to the same class, and similarly 0 and O belong to the same class. Elements of the same class have nearly the same value because the characters look alike. Allocating each captcha segment to a particular class is the prime task done here.

Segments are processed at the pixel level, as this has been demonstrated to be the best approach for text recognition. The KNN algorithm is preferred here because of the following factors: computation at the pixel level, noise resistance and computational speed.
The noise resistance arises from using a relatively small k (less than 10) in the KNN to identify the nearest neighbors. This is essential in this case because most of the potential segments generated by the slicer are meaningless and belong to the garbage class.
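The classification step can be illustrated with a plain k-nearest-neighbor vote that includes a garbage class. This is a minimal sketch under stated assumptions: segments are represented as short tuples of pixel values, the distance is an unweighted squared Euclidean distance, and the confidence is simply the share of the k neighbors that voted for the winning label. The report's modified KNN is not reproduced here, and the name knn_classify and the toy training data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (pixel vector, label)

def knn_classify(pixels: Sequence[float], training: List[Sample], k: int = 3) -> Tuple[str, float]:
    """Return (label, confidence) from a majority vote among the k nearest samples."""
    # Squared Euclidean distance at the pixel level; ties fall back to label order.
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(pixels, vec)), label)
        for vec, label in training
    )
    neighbours = ranked[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)

# Toy training set: four-pixel "segments", including meaningless ones.
training = [
    ((1, 0, 1, 1), "A"),
    ((1, 0, 1, 0), "A"),
    ((1, 1, 1, 1), "O"),
    ((0, 1, 1, 1), "O"),
    ((0, 0, 0, 1), "garbage"),
    ((0, 0, 1, 0), "garbage"),
    ((0, 0, 0, 0), "garbage"),
]

print(knn_classify((1, 0, 1, 1), training))
print(knn_classify((0, 0, 0, 0), training))
```

Keeping k small means that a handful of close matches decides the label, so a meaningless segment that sits near other garbage samples is rejected rather than being averaged into a character class.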
A metric distance function is a function that defines a distance between elements of a set. It was realized that the problem with the initial distance function was that it assigned an equal weight to each pixel regardless of its position in the segment or its gray scale value. It turns out that pixels on the edge of segments are less meaningful than pixels in the center, precisely because they are shared between characters that have been collapsed together. Very good results were achieved on all captcha schemes by assigning a higher weight to pixels nearer the center of the segment, and to darker ones.

5.1.4 Arbiter

The arbiter is the final component of this algorithm. It takes the output of the Scorer, that is, the recognition confidence scores, checks them against the trained values of the class members, and then generates the solution for each segment. The scorer determines the class to which a given segment belongs. The arbiter then accesses the data in that class and compares each member with the segmented character. For this, an ensemble learning approach is used. Ensemble learning [27] is a technique for combining many weak learners in an attempt to produce a strong learner. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. Each class contains the set of segmented characters grouped on the basis of their confidence scores, and the algorithm studies the input against them in depth to reach the most likely solution.

5.2 Reinforcement Learning

Reinforcement Learning [28] is an important part of this algorithm. Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize reward.
As a result, reinforcement learning can make the algorithm smarter. Reinforcement learning is based on the concept of "making understand first and then react", and the same approach can be used in this algorithm.

The traditional way to train a character classifier is to provide a set of labeled captchas that are already segmented, and then let the classifier learn to recognize each character from those segments using the labels. Providing labeled, segmented captchas therefore requires intelligent effort. This process assumes that the classifier is given the correct number of segments, one for each letter in the captcha. Based on these labeled captchas the algorithm learns the various schemes used in traditional text-based captchas and then generates the solution. In the "segment then recognize" approach, this assumption holds because the segmentation is handled by a vision algorithm that is not part of the classifier itself. That is, the segmentation algorithm and the recognition algorithm are two independent algorithms.

(a) captcha error recognition (b) captcha's bad segmentation and bad recognition
Figure 5.6: Reinforced Learning.

This algorithm also uses a reinforcement learning approach in which a human plays a part. Instead of providing the classifier with labeled examples of valid segments, the algorithm asks the human to explain the segments that have been misclassified, and then learns from the feedback. Training starts in the traditional way: the algorithm first processes a set of labeled captchas. Some captchas in this set fail to decode, and those captchas are saved. For the failed captchas, the algorithm asks for human feedback when a segment surrounded by two correctly classified segments is misclassified. In those cases the algorithm needs human expertise, because the misclassification could be due either to improper segmentation or to bad recognition. If the error was due to improper segmentation, the segment is discarded. If the error was due to a recognition error, the segment is added to the classifier training set. When all the cases have been reviewed, the algorithm is retrained with the enriched dataset. In practice, even a single round of reinforcement learning is enough to significantly improve accuracy.
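One round of the feedback procedure described above can be sketched as follows. This is a simplified illustration, assuming that a failed captcha is stored as its list of segments together with the predicted and true text of equal length, and that the human answers with either "segmentation" or "recognition". The names segments_needing_review, apply_feedback and ask_human are hypothetical.

```python
from typing import Callable, List, Tuple

Failure = Tuple[List[str], str, str]      # (segments, predicted text, true text)
TrainingSet = List[Tuple[str, str]]       # (segment, correct label)

def segments_needing_review(predicted: str, truth: str) -> List[int]:
    """Positions that are wrong while both neighbours are classified correctly."""
    if len(predicted) != len(truth):
        return []
    return [
        i for i in range(1, len(truth) - 1)
        if predicted[i] != truth[i]
        and predicted[i - 1] == truth[i - 1]
        and predicted[i + 1] == truth[i + 1]
    ]

def apply_feedback(
    training_set: TrainingSet,
    failures: List[Failure],
    ask_human: Callable[[str, str], str],
) -> TrainingSet:
    """Enrich the training set with segments the human blames on recognition."""
    for segments, predicted, truth in failures:
        for i in segments_needing_review(predicted, truth):
            verdict = ask_human(segments[i], truth[i])
            if verdict == "recognition":
                # Valid segment, wrong label: keep it as a new training example.
                training_set.append((segments[i], truth[i]))
            # A "segmentation" verdict means the segment itself is bad: discard it.
    return training_set
```

After this pass the classifier would be retrained on the enriched training set; the retraining step itself depends on the classifier and is left out of the sketch.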
5.3 Dealing with Occluding Lines

To make captchas more difficult for any algorithm to decode or break, captcha designers use occluding lines. Occluding lines are extra, unwanted lines added to disrupt the functioning of captcha-breaking algorithms. This algorithm was initially not able to handle occluding lines, but the problem was solved later.

The initial attempt was to introduce a separate algorithm for removing the lines, based on the soft margin algorithm. Later, two simpler methods were formulated. The first method is to add a new class to the scorer. Because the scorer works with the KNN algorithm, which permits discontinuous character classes, a new class could be added. This new class is a collection of many different shapes of line. The second method relies on the fact that, as mentioned earlier, the cut-point detector works by taking the second derivative of the curve and then finding the inflection points to obtain the potential cuts; this makes it well suited to ignoring flat parts.

Figure 5.7: Occluding Lines.

5.4 Sequential Recognition

For every algorithm, one of the prime factors that always needs to be considered is the computational cost. The computational cost reflects the performance of the algorithm for any kind of related input. The previous section showed that the computational cost of the algorithm increases with the size of the captcha, because both the number of characters to be segmented and the number of characters to be recognized increase. This is a serious issue from the point of view of efficiency.

This problem is solved in an interesting manner. A separate variation of the algorithm is designed, which increases the efficiency of the character recognition process by using local recognition. Local recognition is a sub-process of the main recognition process. It makes a local decision within a window that takes in two letters at a time. Considering two characters at a time yielded significantly better results than looking at one or three characters at a time. Within a window, all the possible cuts are scanned. Suppose there are 3 cuts in a window of two characters. The first step is to take any one of the 3 cuts and perform the segmentation, which gives two separated characters. The score of these separated characters is calculated at the pixel level and stored. This process is repeated for every cut present in the window. After that, the maximum value is selected as the recognition confidence score. However, this local decision process is subject to many errors.

Figure 5.8: Occluding Lines.

The main source of error in this method is that if the characters of the captcha are heavily joined to each other, the chances of creating the proper window are low and the algorithm will give an entirely wrong output.
It was also found that some captcha schemes are oriented in one specific direction, either left or right. The local decision process is more effective on left-oriented captchas than on right-oriented captchas.

The best solution to this issue is to perform sequential recognition from both directions and then combine the two recognition scores to improve the overall accuracy. This is called the left-right approach. It is done by executing two local decision processes simultaneously. One local decision process performs recognition from left to right and the other performs the same process from right to left. After both processes complete successfully, the results are combined to get the best score.
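The report does not say how the two directional results are combined, so the sketch below shows one simple possibility under explicit assumptions: each pass returns a list of (character, confidence) pairs in the order it read them, positions are compared one to one when both passes found the same number of characters, and otherwise the pass with the higher mean confidence wins. The function names mean_confidence and combine_passes are hypothetical.

```python
from typing import List, Tuple

Pass = List[Tuple[str, float]]  # (character, confidence) in reading order of the pass

def mean_confidence(result: Pass) -> float:
    """Average confidence of a pass, or 0.0 for an empty pass."""
    return sum(conf for _, conf in result) / len(result) if result else 0.0

def combine_passes(left_to_right: Pass, right_to_left: Pass) -> str:
    """Merge the two directional passes into one decoded string."""
    # Put the right-to-left pass back into normal reading order.
    reversed_pass = list(reversed(right_to_left))
    if len(reversed_pass) != len(left_to_right):
        # The passes disagree on the number of characters: trust the surer one.
        best = left_to_right
        if mean_confidence(reversed_pass) > mean_confidence(left_to_right):
            best = reversed_pass
        return "".join(char for char, _ in best)
    decoded = []
    for (c1, s1), (c2, s2) in zip(left_to_right, reversed_pass):
        # Same character from both sides, or keep the higher-confidence guess.
        decoded.append(c1 if c1 == c2 or s1 >= s2 else c2)
    return "".join(decoded)

ltr = [("A", 0.9), ("8", 0.4), ("C", 0.7)]
rtl = [("C", 0.8), ("B", 0.6), ("A", 0.5)]  # read starting from the right edge
print(combine_passes(ltr, rtl))
```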
Chapter 6
Areas of Improvement

The holistic approach [9] of performing segmentation and recognition simultaneously is formulated here for the first time. Though this algorithm produces good results, it is only a first, rough implementation. This chapter describes some of the most promising directions for improvement.

6.1 Learn the kNN weights

As discussed earlier, the algorithm uses kNN to classify characters. The current implementation uses a single, manually chosen set of weights for the kNN distance computation, which performed well on the set of captcha schemes provided initially. Automatically learning those weights for each captcha scheme is expected to improve accuracy, particularly for schemes that use unusual fonts or specific distortions. It is believed that this can be accomplished fully unsupervised, similar to the cut-point detector and slicer phases of the algorithm.

6.2 Improve cut-point elimination

Because the computation time is directly related to the number of potential segments, the first optimization was to devise heuristics that reduce the number of cut points considered by the cut-point detector. This optimization works by pruning near-duplicate and improbable cuts from the set of potential cut points. First, all cuts that have an angle greater than 30° are removed. Then the ratio of white pixels to black pixels is examined to eliminate cut lines that pass through too many black pixels, since these are most likely cutting through the middle of a letter. Finally, the pixel intensities of the left and right boundaries are compared to estimate whether the cut marks a transition between two letters. Finding a better set of heuristics that are both generic and more precise is an open question.

6.3 Additional Occlusion

As pointed out earlier, the Baidu and CNN captcha schemes use occluding lines with low curvature.
While results on these captcha schemes are very good and our algorithm properly detects lines, future work should investigate in depth how various types of lines, e.g., sine waves that have a high curvature, impact the recognition rate. It should also consider other types of occlusion, e.g., blobs. To date, we have not found real-world captcha schemes that employ this type of occlusion; perhaps occlusion of this type presents usability challenges that make it impractical for humans.

6.4 Explore deep neural networks

A primary contribution of this work is to demonstrate the effectiveness of performing segmentation and recognition simultaneously. Accordingly, other algorithms that are able to process captchas in a holistic manner were also considered. In particular, experiments were conducted with collaborators using deep convolutional neural networks, similar to those in [17]. These experiments have confirmed the benefits of a unified approach, and have achieved captcha-solving results that equal or improve upon those presented in this report. For certain ReCaptcha data sets, these new results show such dramatic improvement in accuracy, while using large-scale training sets, that they suggest that deep neural networks may hold a substantial advantage over humans for solving text-based captchas [18].
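Returning to the cut-point elimination of Section 6.2, the first two pruning heuristics can be sketched as follows. This is a hedged illustration only, not the report's code. It assumes a binarized image stored as rows of 0/1 values (1 = black), a straight cut described by its x position on the top and bottom rows, an angle measured from the vertical in degrees, and a hypothetical threshold `max_black` for how much ink a cut may cross. Only the 30° limit comes from the text; the white-to-black ratio is expressed here as a black-pixel fraction, and the boundary-intensity comparison and near-duplicate removal are left out.

```python
import math
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels, 1 = black (ink), 0 = white
Cut = Tuple[int, int]    # (x on the top row, x on the bottom row) of a straight cut


def cut_angle_degrees(cut: Cut, height: int) -> float:
    """Angle between the cut line and the vertical axis, in degrees."""
    x_top, x_bottom = cut
    return math.degrees(math.atan2(abs(x_bottom - x_top), max(height - 1, 1)))


def black_fraction(image: Image, cut: Cut) -> float:
    """Fraction of the pixels lying on the cut line that are black."""
    height = len(image)
    x_top, x_bottom = cut
    black = 0
    for y, row in enumerate(image):
        t = y / (height - 1) if height > 1 else 0.0
        x = round(x_top + t * (x_bottom - x_top))  # x of the cut on this row
        black += row[x]
    return black / height


def prune_cuts(image: Image, cuts: List[Cut],
               max_angle: float = 30.0, max_black: float = 0.2) -> List[Cut]:
    """Drop cuts that are too slanted or that pass through too much ink."""
    kept = []
    for cut in cuts:
        if cut_angle_degrees(cut, len(image)) > max_angle:
            continue  # improbable: too far from vertical
        if black_fraction(image, cut) > max_black:
            continue  # most likely cutting through the middle of a letter
        kept.append(cut)
    return kept


if __name__ == "__main__":
    # Two solid "letters" separated by a white gap in columns 2 and 3.
    image = [[1, 1, 0, 0, 1, 1] for _ in range(5)]
    print(prune_cuts(image, [(2, 2), (1, 1), (0, 5), (2, 3)]))
```

Both thresholds are tunable per scheme, which matches the report's remark that a more generic and more precise set of heuristics is still an open question.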
Chapter 7
Future Work on Captcha Systems

With the demonstration (through research publications) that character recognition CAPTCHAs are vulnerable to computer-vision-based attacks, some researchers have proposed alternatives to character recognition in the form of image recognition CAPTCHAs, which require users to identify simple objects in the images presented. The argument is that object recognition is typically considered a more challenging problem than character recognition, because the characters and digits of the English alphabet form a limited domain. This is why such captchas present an image rather than a set of characters. When captchas were invented, the designers realized that with the passage of time one of two things would happen: either captchas would remain an invaluable way to differentiate humans and computers, or very high quality OCR would become readily available.

The description so far was based entirely on solving text-based captchas, and it was found that the use of text-based captchas is nearing its end, as they are quite simple to decode. By using segmentation and recognition together, this algorithm was able to decode many captchas successfully. It therefore seems likely that, in the near future, an updated version of this algorithm could achieve 100% decoding of text-based captchas. For all these reasons, the need for a new type of reverse Turing test is growing.

The first potential method is simply to find a more difficult problem in computer vision, for example incorporating video or requiring the user to perform a higher-order cognitive task such as circling or rotating an object. After the failure of text-based captchas, the new captcha schemes that arrived were the audio and the video captcha. The audio captcha failed, however, because it could be decoded using the output of a speech-to-text recognizer. The video captcha can also be decoded successfully by extracting the frames of the video and then analyzing each frame to decode the captcha.

Datta et al. published a paper at the ACM Multimedia '05 Conference, named IMAGINATION (IMAge Generation for INternet AuthenticaTION), proposing a systematic approach to image recognition captchas. According to that paper, a set of images is distorted in such a way that state-of-the-art image recognition approaches fail to recognize them. However, humans could solve this captcha only with considerable difficulty [20].
Microsoft developed Animal Species Image Recognition for Restricting Access (ASIRRA), which asks users to distinguish cats from dogs. Microsoft had a beta version of this for websites to use. Microsoft claims: "Asirra is easy for users; it can be solved by humans 99.6% of the time in under 30 seconds. Anecdotally, users seemed to find the experience of using Asirra much more enjoyable than a text-based CAPTCHA." This solution was described in a 2007 paper in the Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS). However, this project was closed in October 2014 and is no longer available.

The Asirra captcha (Figure 7.1a), which asked users to distinguish between cats and dogs, was successfully broken less than a year after its release using a classifier trained to recognize image textures. The MintEye captcha (Figure 7.1b) was a more modern scheme in which an image is distorted and the user has to undistort it to restore the original picture [6]. This scheme, which relies on undistorting an image, was broken by a very simple attack based on Sobel operators that required only 23 lines of Python [5]. As a result, this scheme was also abandoned.

Figure 7.1: Captcha Future. (a) Asirra captcha (b) MintEye captcha

Mitra et al. have suggested using emergent images as an alternative way to encode information in video that might be robust against computer vision algorithms. Emergent images are still or moving images in which objects at first appear only with effort and concentration, but once recognized are very easy to see again, even after several months or years. In effect, once a user has recognized the object, he or she remembers it forever [23]. Emergence refers to the unique human ability to aggregate information from seemingly meaningless pieces, and to perceive a whole that is meaningful.

Recently, game-based captchas have been developed [4].
This kind of captcha gives the user a simple game to complete in order to reach a target goal, on the premise that only humans can solve it. However, implementing this idea has proven to be difficult, as the game captcha schemes of the leading game captcha provider, Are You a Human, have been broken [22].

NuCaptcha is an early fraud detection service which utilizes behavior analytics to provision threat-appropriate, animated video captchas. NuCaptcha is developed and operated by the Canada-based firm NuData Security. Static image-based captchas are routinely used to prevent automated sign-ups to websites by using text or images of words disguised so that optical character recognition (OCR) software has trouble reading them [10]. However, in common captcha systems, users fail to solve the captcha correctly 7% - 25% of the time. NuCaptcha uses animated video technology that it claims makes puzzles easier for humans to solve, but harder for bots and hackers to decipher [24].

Cognitive behavior: Another method to determine whether the user is a human or a machine is based on how quickly each solves the challenge, since the time a human takes to solve a captcha differs from the time a computer takes [25]. Based on this difference in solving time, one can tell who is a human and who is a machine.

Leveraging reputation: In addition to considering how a reverse Turing test is solved, captcha providers could consider the identity of the solver, for example the IP address, the geographic location, etc. If a good enough proof of identity can be established, providers can use this reputation to adapt the difficulty of the reverse Turing test.
Chapter 8
Conclusion

This report presented a detailed explanation and study of an approach that solves captchas in a single step, using machine learning to attack the segmentation and the recognition problems simultaneously. Performing both operations jointly allows the algorithm to exploit information and context that is not available when they are done sequentially. The algorithm was able to solve many prominent real-world captcha schemes that use both negative kerning and occluding lines without any modification to the algorithm. It achieved a recognition rate of 38.68% on Baidu 2011, 55.22% on Baidu 2013, 51.09% on CNN, 51.39% on eBay, 22.67% on ReCaptcha 2011, 22.34% on ReCaptcha 2013, 28.29% on Wikipedia, and 5.33% on Yahoo.

This study of the algorithm also shows that reverse Turing tests may have to be improved going forward. The effectiveness and universality of the results suggest that combining segmentation and recognition is the next evolution of captcha solving, and that it supersedes the sequential approach used in earlier works. With these advances, it seems that purely text-based captchas are likely to have declining utility; significant effort may be needed to rethink the way we perform reverse Turing tests.
Bibliography

[1] E. Bursztein, J. Aigrain, A. Moscicki, and J. C. Mitchell. The end is nigh: Generic solving of text-based CAPTCHAs. 2013.
[2] P. Golle. Machine learning attacks against the Asirra captcha. In ACM CCS 2008, 2008.
[3] R. Gossweiler, M. Kamvar, and S. Baluja. What's up captcha? A captcha based on image orientation. In World Wide Web, 2009.
[4] Are you human? http://areyouahuman.com/.
[5] Breaking the MintEye image captcha in 23 lines of Python. Blog post, http://www.jwandrews.co.uk/2013/01/breakingthe-minteye-image-%captcha- in-23-lines-of-python.
[6] MintEye captcha. Website: http://www.minteye.com/, 2013.
[7] A. S. E. Ahmad, J. Yan, and M. Tayara. The robustness of Google CAPTCHAs. Technical report, Newcastle University, 2011.
[8] E. Athanasopoulos and S. Antonatos. Enhanced captchas: Using animation to tell humans and computers apart. In IFIP International Federation for Information Processing, 2006.
[9] P. Baecher, N. Buscher, M. Fischlin, and B. Milde. Breaking reCAPTCHA: A holistic approach via shape recognition. In Future Challenges in Security and Privacy for Academia and Industry, pages 56-67. Springer, 2011.
[10] E. Bursztein. How we broke the NuCaptcha video scheme and what we propose to fix it. Blog post, http://elie.im/blog/security/howwe-broke-the-nucaptcha- videoscheme-%and-what-we-propose-tofix- it/, February 2012.
[11] E. Bursztein, R. Bauxis, H. Paskov, D. Perito, C. Fabry, and J. C. Mitchell. The failure of noise-based non-continuous audio captchas. In Security and Privacy, 2011.
[12] E. Bursztein and S. Bethard. Decaptcha: Breaking 75% of eBay audio CAPTCHAs. In Proceedings of the 3rd USENIX Conference on Offensive Technologies, page 8. USENIX Association, 2009.
[13] E. Bursztein, M. Martin, and J. Mitchell. Text-based captcha strengths and weaknesses. In Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS '11, pages 125-138, New York, NY, USA, 2011. ACM.
[14] E. Bursztein, A. Moscicki, C. Fabry, S. Bethard, D. Jurafsky, and J. C. Mitchell. Easy does it: More usable captchas. CHI, 2014.
[15] K. Chellapilla, K. Larson, P. Simard, and M. Czerwinski. Computers beat humans at single character recognition in reading based human interaction proofs (HIPs). In CEAS, 2005.
[16] K. Chellapilla and P. Simard. Using machine learning to break visual human interaction proofs (HIPs). Advances in Neural Information Processing Systems, 17, 2004.
[17] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2011.
[18] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.
[19] J. Yan and A. El Ahmad. A low-cost attack on a Microsoft CAPTCHA. In Proceedings of the 15th ACM Conference on Computer and Communications Security, pages 543-554. ACM, 2008.
[20] "Imagination Paper". Infolab.stanford.edu. Retrieved 2013-09-28.
[21] "Asirra is a human interactive proof that asks users to identify photos of cats and dogs".
[22] Spamtech. Cracking the areyouahuman captcha. http://spamtech.co.uk/software/bots/cracking-the- areyouhumancaptcha/, 2012.
[23] N. J. Mitra, H.-K. Chu, T.-Y. Lee, L. Wolf, H. Yeshurun, and D. Cohen-Or. Emerging images. ACM Transactions on Graphics, 28(5), 2009. To appear.
[24] Y. Xu, G. Reynaga, S. Chiasson, J.-M. Frahm, F. Monrose, and P. van Oorschot. Security and usability challenges of moving-object captchas: Decoding codewords in motion. In Usenix Security, 2012.
[25] C. Cruz-Perez, O. Starostenko, F. Uceda-Ponga, V. Alarcon-Aquino, and L. Reyes-Cabrera. Breaking reCAPTCHAs with unpredictable collapse: Heuristic character segmentation and recognition. In Pattern Recognition, pages 155-165. Springer, 2012.
[26] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, September 2014.
[27] D. Opitz and R. Maclin (1999). "Popular ensemble methods: An empirical study". Journal of Artificial Intelligence Research, 2014.
[28] R. S. Sutton (1984). Temporal Credit Assignment in Reinforcement Learning (PhD thesis). University of Massachusetts, Amherst, MA.
[29] N. S. Altman (1992). "An introduction to kernel and nearest-neighbor nonparametric regression". The American Statistician 46, September 2014.
[30] J. Pearl (1983). Heuristics: Intelligent Search Strategies for Computer Problem Solving. New York, Addison-Wesley, December 2014.
[31] http://www.encyclopediaofmath.org/index.php/Point of inflection, January 2015.
[32] L. Barghout and L. W. Lee. "Perceptual information processing system." Paravue Inc. U.S. Patent Application 10/618,543, filed July 11, 2014.