Machine Learning Methods
for CAPTCHA Recognition
Rachel Shadoan
Zachery Tidwell, II
CAPTCHA
Completely Automated Public Turing Test to tell Computers and Humans Apart


Why are they interesting?
  o Harder than normal text recognition
       On par with handwriting recognition and
       reading damaged text
  o Techniques translate well to other problems
       Facial recognition (Gonzaga, 2002)
       Weed identification (Yang, 2000)
  o Near-infinite data sets
       Easier to avoid over-fitting
Hypothesis

CAPTCHA recognition can be accomplished to a high
degree of accuracy using machine learning methods
with minimal preprocessing of inputs.
Methods

Tools
  o JCaptcha
  o Image Processing

Learning Methods
  o Feed-forward Neural Nets
  o Self-Organizing Maps
  o K-Means
  o Cluster Classification

Segmentation Methods
  o Overlapping
  o Whitespace
  o K-Means
JCaptcha

o Open-source CAPTCHA
  generation software
o Highly configurable
   Can produce CAPTCHAs of
   many levels of difficulty

o Check it out at:
  http://jcaptcha.sourceforge.net
Image Processing
Sparse Image
  Represents images as an unbounded set of pixels
  Each pixel is a value between 0 and 1 plus a coordinate pair
  Center each image before turning it into a matrix of 0s and 1s

[Figure: example CAPTCHA, original vs. after transformation]
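As a rough illustration of this preprocessing, here is a minimal Python sketch. All names are our own, and the 0.5 ink threshold and 20 x 20 grid size are assumptions; the slides specify only centering and a 0/1 matrix.

```python
import numpy as np

def center_and_rasterize(pixels, size=20):
    """Center a sparse image, given as (x, y, value) triples with
    values in [0, 1], then render it as a size x size matrix of
    0s and 1s. Hypothetical sketch; thresholds are assumptions."""
    coords = np.array([(x, y) for x, y, _ in pixels], dtype=float)
    values = [v for _, _, v in pixels]

    # Shift the pixel cloud so its centroid sits at the grid center.
    coords -= coords.mean(axis=0)
    coords += size / 2.0

    grid = np.zeros((size, size), dtype=int)
    for (x, y), v in zip(coords, values):
        ix, iy = int(round(x)), int(round(y))
        if 0 <= ix < size and 0 <= iy < size and v >= 0.5:
            grid[iy, ix] = 1  # 0.5 ink threshold is an assumption
    return grid
```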
Feed-Forward Neural Nets

As covered in class
Self-Organizing Maps

Training
  Initialize N buckets to random values
  For each input:
    Find the bucket that is "closest" to the input
    Adjust the "closest" bucket to more closely match
    the input using an exponential average

Collection
  For many inputs:
    Sort each input into the bucket it most closely matches
  For each bucket and each character:
    Calculate the probability of that character going
    into that bucket
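A minimal Python sketch of the training loop above, assuming Euclidean distance for "closest" and a fixed learning rate for the exponential average (neither detail is specified on the slide):

```python
import numpy as np

def train_som(inputs, n_buckets=200, alpha=0.1, seed=None):
    """Topology-free SOM training as on the slide: find the closest
    bucket and pull it toward the input with an exponential average.
    alpha (learning rate) and Euclidean distance are assumptions."""
    rng = np.random.default_rng(seed)
    buckets = rng.random((n_buckets, inputs.shape[1]))  # random init
    for x in inputs:
        best = np.argmin(np.linalg.norm(buckets - x, axis=1))
        # Exponential average pulls the winner toward the input.
        buckets[best] = (1 - alpha) * buckets[best] + alpha * x
    return buckets
```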
K-Means
• Very similar to Self-Organizing Maps (SOMs)
• Can use the same classifying mechanism as used for SOMs
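A sketch of that shared classifying mechanism, per the Collection step above: tally which characters land in which bucket, then label a new chunk with its bucket's most probable character. The unsmoothed probability estimates and argmax tie-breaking are assumptions.

```python
import numpy as np
from collections import Counter, defaultdict

def collect_probabilities(buckets, inputs, labels):
    """Collection step: estimate P(character | bucket) by sorting
    labeled chunks into their closest buckets and counting."""
    counts = defaultdict(Counter)
    for x, label in zip(inputs, labels):
        best = np.argmin(np.linalg.norm(buckets - x, axis=1))
        counts[best][label] += 1
    return {b: {ch: n / sum(c.values()) for ch, n in c.items()}
            for b, c in counts.items()}

def classify(buckets, probs, x):
    """Label a chunk with its bucket's most probable character."""
    best = np.argmin(np.linalg.norm(buckets - x, axis=1))
    table = probs.get(best)
    return max(table, key=table.get) if table else None
```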
Overlapping Segmentation
• Divide the image into a fixed number of
  overlapping tiles of the same size
• In our case, 20 x 20 pixels with a 50% overlap
• Discard chunks under a certain size and chunks
  that are all white

Note (on the example chunk pictured): This is a B with
part of it cut off, not an E. Therein lies the rub.
Whitespace Segmentation
• Iterate through the image from left to right;
  segment when a full column of whitespace is
  encountered
• Works perfectly for well-spaced text
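A minimal sketch of this column scan, assuming a binary image where 1 means ink:

```python
import numpy as np

def whitespace_segments(image):
    """Scan columns left to right in a binary image (1 = ink) and
    cut a segment each time a fully white column is reached."""
    ink_cols = image.any(axis=0)  # True where a column contains ink
    segments, start = [], None
    for x, has_ink in enumerate(ink_cols):
        if has_ink and start is None:
            start = x  # entering a letter
        elif not has_ink and start is not None:
            segments.append(image[:, start:x])  # cut at white column
            start = None
    if start is not None:
        segments.append(image[:, start:])  # letter at the right edge
    return segments
```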
K-Means Segmentation
• Performs better
  than heuristic
  segmentation on
  closely-packed
  inputs
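The slides do not spell out how K-Means drives segmentation; one plausible reading, sketched below, clusters the ink pixels' coordinates into k groups, one per expected letter (plain Lloyd's algorithm, with the initialization and iteration count assumed).

```python
import numpy as np

def kmeans_segments(image, k, n_iter=20, seed=None):
    """Cluster ink-pixel (x, y) coordinates into k groups, one per
    expected letter. Plain Lloyd's algorithm; the initialization
    and iteration count are assumptions."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(image)
    points = np.column_stack([xs, ys]).astype(float)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each ink pixel to its nearest center, then recenter.
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = points[assign == j].mean(axis=0)
    return [points[assign == j] for j in range(k)]
```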
Segmentation Comparison

[Figure: two example CAPTCHAs, each segmented three ways:
Even-width, Whitespace, K-Means]
Experiment 1
Machine Learning Method:
  Self-Organizing Map
Topology:
  200 buckets, initialized randomly
Inputs:
  3 letter CAPTCHAs
  Random fonts
  Letters A-G
  "Chunked" using overlapping segmentation
Experiment 1 Results
Buckets fell into three primary categories:

  Distinguishable letters
  Chunks with halves of two letters
  Indistinguishable noise

[Figure: example bucket contents for each category]
Experiment 1 Results

[Figure: results chart]
Experiment 2
ML Method:
  Neural Net
Topology:
  Fully connected
  400 inputs
  50 node hidden layer
  7 outputs, one per letter: "Contains A? 0 or 1" ... "Contains G? 0 or 1"

[Diagram: 400 input nodes -> 50 hidden nodes -> 7 output nodes]

Inputs:
  Single letter CAPTCHAs
  Random fonts
  Letters A-G
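A sketch of this topology in Python; the sigmoid activations, weight initialization, and 0.5 decision threshold are assumptions, since the slides give only the layer sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ContainsLetterNet:
    """Fully connected 400-50-7 net: one output per letter A-G,
    each read as 'contains this letter? 0 or 1'. Sigmoid units and
    the 0.5 threshold are assumptions; the slides give only sizes."""

    def __init__(self, seed=None):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (400, 50))  # input -> hidden
        self.w2 = rng.normal(0.0, 0.1, (50, 7))    # hidden -> output

    def predict(self, x):
        hidden = sigmoid(x @ self.w1)   # 400 -> 50
        out = sigmoid(hidden @ self.w2) # 50 -> 7
        return (out >= 0.5).astype(int) # A..G: 0 or 1 each
```

Here x would be a flattened 20 x 20 centered chunk (400 values), matching the sparse-image preprocessing earlier.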
Experiment 2 Results

[Figure: Neural Net Learning Curve]
Experiment 2 Results

Past a certain number of nodes in the hidden layer, the
topology ceases to have a huge impact on accuracy.

[Figure: Neural Net Accuracy vs. Size of Hidden Layer]
Experiment 3
ML Method: SOM
Topology:
  500 buckets

ML Method: Neural Net
Topology:
  Fully connected
  400 inputs
  1000 node hidden layer
  7 outputs

Inputs:
  4 letter CAPTCHAs
  Random fonts
  Letters A-G
Experiment 3 Results

[Figure: Neural Net vs. SOM on CAPTCHAs of Length 4, Letters A-G]
Experiment 4
ML Method: SOM
Topology:
  500 buckets

ML Method: Neural Net
Topology:
  Fully connected
  400 inputs
  1000 node hidden layer
  7 outputs

Inputs:
  4 letter CAPTCHAs
  Random fonts
  Letters A-Z
Experiment 4 Results

[Figure: Neural Net vs. SOM on CAPTCHAs of Length 4, Letters A-Z]
Experiment 5
ML Method: SOM
Topology:
  500 buckets

ML Method: Neural Net
Topology:
  Fully connected
  400 inputs
  1000 node hidden layer
  7 outputs

Inputs:
  5 letter CAPTCHAs
  Random fonts
  Letters A-Z
Experiment 5 Results

[Figure: Neural Net vs. SOM on CAPTCHAs of Length 5, Letters A-Z]
What it all means
• Increasing the number of characters dramatically
  decreases total accuracy, because segmentation
  quality decreases
• The true positive rate goes down when segmentation
  quality decreases
• Hence, better segmentation is the key
Future Work
Improved Segmentation
   o Wirescreen segmentation
   o Ensemble techniques
Improved True Positive Rates with Current
  System
   o Ensemble techniques
New problems
   o Handwriting recognition
   o Botnet of doom
Questions?
