Machine Learning Methods
for CAPTCHA Recognition
Rachel Shadoan
Zachery Tidwell, II
Constantine Priemski
Navya Chandana and Shakeeb
CAPTCHA
Completely Automated Public Turing Test to tell Computers and Humans Apart
Why are they interesting?
o Harder than normal text recognition
On par with handwriting recognition and reading damaged text
o Techniques translate well to other problems
Facial recognition (Gonzaga, 2002)
Weed identification (Yang, 2000)
o Near-infinite data sets
Easier to avoid over-fitting
Hypothesis
CAPTCHA recognition can be accomplished to a high degree of accuracy using machine learning methods with minimal preprocessing of inputs.
Methods
Learning Methods
o Feed-forward Neural Nets
o Self-Organizing Maps
o K-Means
o Cluster Classification
Segmentation Methods
o Overlapping
o Whitespace
o K-Means
Tools
o JCaptcha
o Image Processing
JCaptcha
o Open-source CAPTCHA
generation software
o Highly configurable
Can produce CAPTCHAs of many levels of difficulty
o Check it out at:
http://jcaptcha.sourceforge.net
Image Processing
Sparse Image
Represents images as an unbounded set of pixels
Each pixel is a value between 0 and 1 paired with a coordinate
Center each image before converting it into a matrix of 0s and 1s
[Figure: a sparse image before (original) and after the centering transformation]
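A minimal sketch of the centering step, assuming the sparse image is a set of (x, y, value) triples; the function name and the 20 x 20 output size are illustrative assumptions, not the authors' code:

```python
import numpy as np

def center_sparse_image(pixels, size=20):
    """Center a sparse image (iterable of (x, y, value) triples) in a
    size x size matrix of 0s and 1s. Illustrative sketch only."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    # Offset so the bounding box of the ink sits at the grid's center.
    cx = (min(xs) + max(xs)) // 2
    cy = (min(ys) + max(ys)) // 2
    grid = np.zeros((size, size), dtype=int)
    for x, y, v in pixels:
        gx = x - cx + size // 2
        gy = y - cy + size // 2
        if 0 <= gx < size and 0 <= gy < size:
            grid[gy, gx] = 1 if v >= 0.5 else 0  # threshold to 0/1
    return grid
```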
Feed-Forward Neural Nets
As covered in class
Self-Organizing Maps
Training
o Initialize N buckets to random values
o For each input:
   Find the bucket that is "closest" to the input
   Adjust the "closest" bucket to more closely match the input using an exponential average
Collection
o For many inputs:
   Sort each input into the bucket it most closely matches
o For each bucket and each character:
   Calculate the probability of that character going into that bucket
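The training and collection steps above might be sketched as follows. This is a simplification under assumed names (learning rate, bucket count), not the authors' implementation:

```python
import numpy as np

def train_som(inputs, n_buckets=200, rate=0.5, seed=0):
    """Topology-free SOM as described: random buckets, then move the
    nearest bucket toward each input by an exponential average."""
    rng = np.random.default_rng(seed)
    buckets = rng.random((n_buckets, inputs.shape[1]))
    for x in inputs:
        # Find the bucket "closest" to the input (Euclidean distance).
        i = np.argmin(np.linalg.norm(buckets - x, axis=1))
        # Exponential-average update toward the input.
        buckets[i] += rate * (x - buckets[i])
    return buckets

def collect(buckets, inputs, labels):
    """Estimate P(character | bucket) from labeled inputs."""
    counts = {}
    for x, c in zip(inputs, labels):
        i = int(np.argmin(np.linalg.norm(buckets - x, axis=1)))
        counts.setdefault(i, {}).setdefault(c, 0)
        counts[i][c] += 1
    return {i: {c: n / sum(d.values()) for c, n in d.items()}
            for i, d in counts.items()}
```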
K-Means
• Very similar to Self-Organizing Maps (SOMs)
• Can use the same classifying mechanism as used for SOMs
Overlapping Segmentation
• Divide the image into a fixed number of overlapping tiles of the same size
• In our case, 20 x 20 pixels with a 50% overlap
• Discard chunks under a certain size and chunks that are all white
Note: This is a B with part of it cut off, not an E. Therein lies the rub.
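A sketch of this chunking scheme, assuming a binary image given as a list of rows; the tile size, overlap, and minimum-ink threshold used to discard all-white chunks are parameters here by assumption:

```python
def overlapping_tiles(img, tile=20, overlap=0.5, min_ink=5):
    """Cut a binary image (list of rows of 0/1) into fixed-size tiles
    with horizontal overlap, dropping all-white / near-empty chunks."""
    step = int(tile * (1 - overlap))      # 10-pixel stride at 50% overlap
    w = len(img[0])
    tiles = []
    for x0 in range(0, max(w - step, 1), step):
        chunk = [row[x0:x0 + tile] for row in img]
        ink = sum(sum(r) for r in chunk)  # count of non-white pixels
        if ink >= min_ink:                # discard all-white / tiny chunks
            tiles.append(chunk)
    return tiles
```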
Whitespace Segmentation
• Iterate through the image from left to right, segmenting when a full column of whitespace is encountered
• Works perfectly for well-spaced text
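The left-to-right scan can be sketched as follows (assuming a binary image given as a list of rows; names are illustrative):

```python
def whitespace_segment(img):
    """Segment a binary image wherever a full column of whitespace
    (all zeros) is found, scanning left to right. Returns a list of
    (start, end) column ranges, one per character."""
    w = len(img[0])
    segments, start = [], None
    for x in range(w):
        col_has_ink = any(row[x] for row in img)
        if col_has_ink and start is None:
            start = x                    # entering a character
        elif not col_has_ink and start is not None:
            segments.append((start, x))  # full white column: cut here
            start = None
    if start is not None:
        segments.append((start, w))      # character touching right edge
    return segments
```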
K-Means Segmentation
• Performs better than heuristic segmentation on closely-packed inputs
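One way such a segmenter could work, sketched as 1-D k-means over the x-coordinates of ink pixels with k set to the expected character count. This is a simplification by assumption; the actual method may cluster full (x, y) pixel positions:

```python
import numpy as np

def kmeans_segment(img, k, iters=20, seed=0):
    """Cluster ink-pixel x-coordinates into k groups; each cluster
    center approximates one character's horizontal position."""
    xs = np.array([x for row in img for x, v in enumerate(row) if v],
                  dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(xs, size=k, replace=False)
    for _ in range(iters):
        # Assign each ink pixel to its nearest center, then recenter.
        assign = np.argmin(np.abs(xs[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = xs[assign == j].mean()
    return np.sort(centers)
```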
Segmentation Comparison
[Figure: example segmentations comparing the even-width, K-Means, and whitespace methods]
Experiment 1
Machine Learning Method:
Self-Organizing Map
Topology
200 buckets, initialized randomly
Inputs:
3-letter CAPTCHAs
Random fonts
Letters A-G
“Chunked” using overlapping segmentation
Experiment 1 Results
Buckets fell into three primary categories:
o Distinguishable letters
o Chunks with halves of two letters
o Indistinguishable noise
Experiment 1 Results
Experiment 2
ML Method:
Neural Net
Topology:
Fully connected
400 inputs
50 node hidden layer
7 outputs
Inputs:
Single-letter CAPTCHAs
Random fonts
Letters A-G
[Diagram: 400 input nodes, 50 hidden nodes, 7 output nodes, one per letter A-G, each outputting 0 or 1 for whether the CAPTCHA contains that letter]
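To make the topology concrete, a forward pass of such a 400-50-7 net might look like this. The weights here are random and untrained, and sigmoid activations are an assumption, purely to illustrate the shape:

```python
import numpy as np

def forward(x, w1, b1, w2, b2):
    """Fully connected 400-50-7 forward pass: 400 pixel inputs,
    50 sigmoid hidden units, 7 sigmoid outputs (one per letter A-G,
    thresholded at 0.5 for a 0-or-1 'contains this letter?' answer)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(x @ w1 + b1)      # hidden layer activations
    return sigmoid(h @ w2 + b2)   # per-letter scores in (0, 1)

rng = np.random.default_rng(0)
w1, b1 = 0.1 * rng.normal(size=(400, 50)), np.zeros(50)
w2, b2 = 0.1 * rng.normal(size=(50, 7)), np.zeros(7)
x = rng.integers(0, 2, size=400).astype(float)  # a flattened 20x20 chunk
out = forward(x, w1, b1, w2, b2)
```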
Neural Net Learning Curve
Experiment 2 Results
Neural Net Accuracy vs. Size of Hidden Layer
Past a certain number of nodes in the hidden layer, the topology ceases to have a huge impact on accuracy.
Experiment 3
ML Method:
Neural Net
Topology:
Fully connected
400 inputs
1000 node hidden layer
7 outputs
ML Method:
SOM
Topology:
500 buckets
Inputs:
4-letter CAPTCHAs
Random fonts
Letters A-G
Experiment 3
Neural Net vs. SOM on CAPTCHAs of Length 4, Letters A-G
Experiment 4
ML Method:
Neural Net
Topology:
Fully connected
400 inputs
1000 node hidden layer
7 outputs
ML Method:
SOM
Topology:
500 buckets
Inputs:
4-letter CAPTCHAs
Random fonts
Letters A-Z
Experiment 4
Neural Net vs. SOM on CAPTCHAs of Length 4, Letters A-Z
Experiment 5
ML Method:
Neural Net
Topology:
Fully connected
400 inputs
1000 node hidden layer
7 outputs
ML Method:
SOM
Topology:
500 buckets
Inputs:
5-letter CAPTCHAs
Random fonts
Letters A-Z
Experiment 5
Neural Net vs. SOM on CAPTCHAs of Length 5, Letters A-Z
What it all means
• Increasing the number of characters dramatically decreases total accuracy because segmentation quality decreases
• The true positive rate drops as segmentation quality decreases
• Hence, better segmentation is the key
Future Work
Improved Segmentation
o Wirescreen segmentation
o Ensemble techniques
Improved True Positive Rates with Current
System
o Ensemble techniques
New problems
o Handwriting recognition
o Bot net of doom
Questions?
