Machine Learning Methods
for CAPTCHA Recognition
Rachel Shadoan
Zachery Tidwell, II
CAPTCHA
Completely Automated Public Turing Test to tell Computers and Humans Apart


Why are they interesting?
  o Harder than normal text recognition
       On par with handwriting recognition and
       reading damaged text
  o Techniques translate well to other problems
       Facial recognition (Gonzaga, 2002)
       Weed identification (Yang, 2000)
  o Near-infinite data sets
       Easier to avoid over-fitting
Hypothesis

CAPTCHA recognition can be accomplished to a high
degree of accuracy using machine learning methods
with minimal preprocessing of inputs.
Methods

Tools
  o JCaptcha
  o Image Processing

Learning Methods
  o Feed-forward Neural Nets
  o Self-Organizing Maps
  o K-Means
  o Cluster Classification

Segmentation Methods
  o Overlapping
  o Whitespace
  o K-Means
JCaptcha

o Open-source CAPTCHA
  generation software
o Highly configurable
   Can produce CAPTCHAs of
   many levels of difficulty

o Check it out at:
  http://jcaptcha.sourceforge.net
Image Processing
Sparse Image
  Represents images as an unbounded set of pixels
  Each pixel is a value between 0 and 1 plus a coordinate pair
  Center each image before turning it into a matrix of 0s and 1s

[Figure: example CAPTCHA, original vs. after transformation]
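As a rough illustration of this preprocessing, here is a minimal Python sketch. All names are our own, and the 0.5 ink threshold and 20 x 20 grid size are assumptions; the slides specify only centering and a 0/1 matrix.

```python
import numpy as np

def center_and_rasterize(pixels, size=20):
    """Center a sparse image, given as (x, y, value) triples with
    values in [0, 1], then render it as a size x size matrix of
    0s and 1s. Hypothetical sketch; thresholds are assumptions."""
    coords = np.array([(x, y) for x, y, _ in pixels], dtype=float)
    values = [v for _, _, v in pixels]

    # Shift the pixel cloud so its centroid sits at the grid center.
    coords -= coords.mean(axis=0)
    coords += size / 2.0

    grid = np.zeros((size, size), dtype=int)
    for (x, y), v in zip(coords, values):
        ix, iy = int(round(x)), int(round(y))
        if 0 <= ix < size and 0 <= iy < size and v >= 0.5:
            grid[iy, ix] = 1  # 0.5 ink threshold is an assumption
    return grid
```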
Feed-Forward Neural Nets

As covered in class
Self-Organizing Maps

Training
  Initialize N buckets to random values
  For each input:
    Find the bucket that is "closest" to the input
    Adjust the "closest" bucket to more closely match
    the input using an exponential average

Collection
  For many inputs:
    Sort each input into the bucket it most closely matches
  For each bucket and each character:
    Calculate the probability of that character going
    into that bucket
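A minimal Python sketch of the training loop above, assuming Euclidean distance for "closest" and a fixed learning rate for the exponential average (neither detail is specified on the slide):

```python
import numpy as np

def train_som(inputs, n_buckets=200, alpha=0.1, seed=None):
    """Topology-free SOM training as on the slide: find the closest
    bucket and pull it toward the input with an exponential average.
    alpha (learning rate) and Euclidean distance are assumptions."""
    rng = np.random.default_rng(seed)
    buckets = rng.random((n_buckets, inputs.shape[1]))  # random init
    for x in inputs:
        best = np.argmin(np.linalg.norm(buckets - x, axis=1))
        # Exponential average pulls the winner toward the input.
        buckets[best] = (1 - alpha) * buckets[best] + alpha * x
    return buckets
```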
K-Means
• Very similar to Self-Organizing Maps (SOMs)
• Can use the same classifying mechanism as used for SOMs
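A sketch of that shared classifying mechanism, per the Collection step above: tally which characters land in which bucket, then label a new chunk with its bucket's most probable character. The unsmoothed probability estimates and argmax tie-breaking are assumptions.

```python
import numpy as np
from collections import Counter, defaultdict

def collect_probabilities(buckets, inputs, labels):
    """Collection step: estimate P(character | bucket) by sorting
    labeled chunks into their closest buckets and counting."""
    counts = defaultdict(Counter)
    for x, label in zip(inputs, labels):
        best = np.argmin(np.linalg.norm(buckets - x, axis=1))
        counts[best][label] += 1
    return {b: {ch: n / sum(c.values()) for ch, n in c.items()}
            for b, c in counts.items()}

def classify(buckets, probs, x):
    """Label a chunk with its bucket's most probable character."""
    best = np.argmin(np.linalg.norm(buckets - x, axis=1))
    table = probs.get(best)
    return max(table, key=table.get) if table else None
```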
Overlapping Segmentation
• Divide the image into a fixed number of
  overlapping tiles of the same size
• In our case, 20 x 20 pixels with a 50% overlap
• Discard chunks under a certain size and chunks
  that are all white

Note (on the example chunk pictured): This is a B with
part of it cut off, not an E. Therein lies the rub.
Whitespace Segmentation
• Iterate through the image from left to right;
  segment when a full column of whitespace is
  encountered
• Works perfectly for well-spaced text
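A minimal sketch of this column scan, assuming a binary image where 1 means ink:

```python
import numpy as np

def whitespace_segments(image):
    """Scan columns left to right in a binary image (1 = ink) and
    cut a segment each time a fully white column is reached."""
    ink_cols = image.any(axis=0)  # True where a column contains ink
    segments, start = [], None
    for x, has_ink in enumerate(ink_cols):
        if has_ink and start is None:
            start = x  # entering a letter
        elif not has_ink and start is not None:
            segments.append(image[:, start:x])  # cut at white column
            start = None
    if start is not None:
        segments.append(image[:, start:])  # letter at the right edge
    return segments
```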
K-Means Segmentation
• Performs better
  than heuristic
  segmentation on
  closely-packed
  inputs
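The slides do not spell out how K-Means drives segmentation; one plausible reading, sketched below, clusters the ink pixels' coordinates into k groups, one per expected letter (plain Lloyd's algorithm, with the initialization and iteration count assumed).

```python
import numpy as np

def kmeans_segments(image, k, n_iter=20, seed=None):
    """Cluster ink-pixel (x, y) coordinates into k groups, one per
    expected letter. Plain Lloyd's algorithm; the initialization
    and iteration count are assumptions."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(image)
    points = np.column_stack([xs, ys]).astype(float)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each ink pixel to its nearest center, then recenter.
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = points[assign == j].mean(axis=0)
    return [points[assign == j] for j in range(k)]
```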
Segmentation Comparison

[Figure: two example CAPTCHAs, each segmented three ways:
Even-width, Whitespace, K-Means]
Experiment 1
Machine Learning Method:
  Self-Organizing Map
Topology:
  200 buckets, initialized randomly
Inputs:
  3 letter CAPTCHAs
  Random fonts
  Letters A-G
  "Chunked" using overlapping segmentation
Experiment 1 Results
Buckets fell into three primary categories:

  Distinguishable letters
  Chunks with halves of two letters
  Indistinguishable noise

[Figure: example bucket contents for each category]
Experiment 1 Results

[Figure: results chart]
Experiment 2
ML Method:
  Neural Net
Topology:
  Fully connected
  400 inputs
  50 node hidden layer
  7 outputs, one per letter: "Contains A? 0 or 1" ... "Contains G? 0 or 1"

[Diagram: 400 input nodes -> 50 hidden nodes -> 7 output nodes]

Inputs:
  Single letter CAPTCHAs
  Random fonts
  Letters A-G
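A sketch of this topology in Python; the sigmoid activations, weight initialization, and 0.5 decision threshold are assumptions, since the slides give only the layer sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ContainsLetterNet:
    """Fully connected 400-50-7 net: one output per letter A-G,
    each read as 'contains this letter? 0 or 1'. Sigmoid units and
    the 0.5 threshold are assumptions; the slides give only sizes."""

    def __init__(self, seed=None):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (400, 50))  # input -> hidden
        self.w2 = rng.normal(0.0, 0.1, (50, 7))    # hidden -> output

    def predict(self, x):
        hidden = sigmoid(x @ self.w1)   # 400 -> 50
        out = sigmoid(hidden @ self.w2) # 50 -> 7
        return (out >= 0.5).astype(int) # A..G: 0 or 1 each
```

Here x would be a flattened 20 x 20 centered chunk (400 values), matching the sparse-image preprocessing earlier.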
Experiment 2 Results

[Figure: Neural Net Learning Curve]
Experiment 2 Results

Past a certain number of nodes in the hidden layer, the
topology ceases to have a huge impact on accuracy.

[Figure: Neural Net Accuracy vs. Size of Hidden Layer]
Experiment 3
ML Method: SOM
Topology:
  500 buckets

ML Method: Neural Net
Topology:
  Fully connected
  400 inputs
  1000 node hidden layer
  7 outputs

Inputs:
  4 letter CAPTCHAs
  Random fonts
  Letters A-G
Experiment 3 Results

[Figure: Neural Net vs. SOM on CAPTCHAs of Length 4, Letters A-G]
Experiment 4
ML Method: SOM
Topology:
  500 buckets

ML Method: Neural Net
Topology:
  Fully connected
  400 inputs
  1000 node hidden layer
  7 outputs

Inputs:
  4 letter CAPTCHAs
  Random fonts
  Letters A-Z
Experiment 4 Results

[Figure: Neural Net vs. SOM on CAPTCHAs of Length 4, Letters A-Z]
Experiment 5
ML Method: SOM
Topology:
  500 buckets

ML Method: Neural Net
Topology:
  Fully connected
  400 inputs
  1000 node hidden layer
  7 outputs

Inputs:
  5 letter CAPTCHAs
  Random fonts
  Letters A-Z
Experiment 5 Results

[Figure: Neural Net vs. SOM on CAPTCHAs of Length 5, Letters A-Z]
What it all means
• Increasing the number of characters dramatically
  decreases total accuracy, because segmentation
  quality decreases
• The true positive rate goes down when segmentation
  quality decreases
• Hence, better segmentation is the key
Future Work
Improved Segmentation
   o Wirescreen segmentation
   o Ensemble techniques
Improved True Positive Rates with Current
  System
   o Ensemble techniques
New problems
   o Handwriting recognition
   o Botnet of doom
Questions?
