Bot detection algorithm
Published in Technology, Education
  • 1. Bot Detection Algorithm
    Parinita, Computational Linguistics Masters Program, University of Washington
  • 2. Seven pitfalls to avoid when running controlled experiments on the web
    Picking an OEC for which it is easy to beat the control by doing something clearly “wrong” from a business perspective
    Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics
    Using standard statistical formulas for computations of variance and power
    Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates
    Neglecting to filter robots
    Failing to validate each step of the analysis pipeline and the OEC components
    Forgetting to control for all differences, and assuming that humans can keep the variants in sync
    Source: http://exp-platform.com/Documents/2009-ExPpitfalls.pdf
  • 3. In practice, identifying robots is difficult
    However, the navigation patterns of web robots are distinct from those of human users in terms of
    Average rate of queries submitted
    Interval between successive queries
    Coverage of the web site
    Length of the sessions
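The four characteristics above can be computed per session from a parsed access log. The following is a minimal sketch, not a method from the slides: the `SessionFeatures` and `session_features` names, the `(timestamp, path)` input shape, and the `site_pages` set are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionFeatures:
    length: int           # number of requests in the session
    duration: float       # seconds between first and last request
    rate: float           # requests per second over the session
    mean_interval: float  # average gap between successive requests (seconds)
    coverage: float       # fraction of known site pages visited


def session_features(requests, site_pages):
    """Compute navigation features for one session.

    requests:   list of (timestamp_seconds, path) tuples (hypothetical shape).
    site_pages: set of all known paths on the site (hypothetical input).
    """
    if not requests:
        raise ValueError("empty session")
    ordered = sorted(requests)
    times = [t for t, _ in ordered]
    length = len(ordered)
    duration = times[-1] - times[0]
    gaps = [b - a for a, b in zip(times, times[1:])]
    mean_interval = sum(gaps) / len(gaps) if gaps else 0.0
    # A zero-duration session has no meaningful rate; fall back to the count.
    rate = length / duration if duration > 0 else float(length)
    visited = {path for _, path in ordered}
    coverage = len(visited & site_pages) / len(site_pages) if site_pages else 0.0
    return SessionFeatures(length, duration, rate, mean_interval, coverage)
```

A session with many requests, short intervals, and high coverage would look more robot-like under these measures; where to draw the line is left to the classifier.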
  • 4. Statistical bot detection model works better than a rule-based system
    Approach A: Heuristic-Regression Approach to bot pattern identification, classification algorithm: Decision Trees
    Extract from the set of heuristic rules
    If Query = 'robots.txt' then Confidence_factor = 1.0
    If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
    If default then no identification
    If User_agent = 'Mozilla' and Session_length > 50 then Confidence_factor = 0.87
    [Figure: example of a regression tree for the crawlers' confidence factor]
    [Figure: overall perspective of the processing]
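The extracted rules on this slide can be written as a small function. This is only a sketch: the function name and argument names are hypothetical, and it assumes the rules are tried in order of decreasing confidence with the default rule as the final fallback, which the slide does not state.

```python
def confidence_factor(query, user_agent, session_length, session_duration):
    """Return the crawler confidence factor, or None when no rule fires.

    session_duration is in seconds. The evaluation order is an assumption.
    """
    if query == "robots.txt":
        return 1.0
    if session_length > 100 and session_duration < 10:
        return 0.95
    if user_agent == "Mozilla" and session_length > 50:
        return 0.87
    return None  # default rule: no identification
```

In the heuristic-regression approach these hand-written confidence factors serve as the target that the regression tree learns to predict, so the rules themselves are only the starting point.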
  • 5. Bot detection based on Hidden Markov Models
    Approach B: Hidden Markov Model, classification algorithm: Hidden Markov Model
    Use an HMM to describe the robot access pattern, then detect robots based on that access model
    One or more requests from the same user that arrive in the same time unit are called a batch arrival
    Calculate the sequences r_t and R_t from server logs (r_t is the number of requests in the t-th time unit, and R_t is the sum of requests over a given time interval)
    Because human users and robots behave differently, their requests show different burst levels, which are reflected in r_t and R_t
    We assume that the process of batch arrival is controlled by a special Markov chain with M different states
    The detection method relies on the fact that most robots have similar request arrival patterns because they follow the same design guidelines
    The observed robot sequences are used to train a robot request pattern model; the likelihood of each incoming request sequence is then calculated against that model
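The scoring step described above can be sketched with the standard scaled forward algorithm for a discrete-emission HMM. This is an illustration under assumptions, not the referenced paper's implementation: the function names are hypothetical, r_t is assumed to be mapped to a small set of burst-level symbols before scoring, and the model parameters are assumed to have been trained elsewhere (for example with Baum-Welch) on known robot sessions.

```python
import math
from collections import Counter


def requests_per_unit(timestamps, unit=1.0):
    """r_t: number of requests in each time unit, from first to last request."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // unit) for t in timestamps)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def log_likelihood(obs, start_p, trans_p, emit_p):
    """Log-likelihood of a symbol sequence under an HMM (scaled forward pass).

    obs:     non-empty sequence of observation symbol indices
    start_p: start_p[i] is the initial probability of state i
    trans_p: trans_p[i][j] is the probability of moving from state i to j
    emit_p:  emit_p[i][k] is the probability that state i emits symbol k
    """
    n = len(start_p)
    alpha = [start_p[i] * emit_p[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans_p[i][j] for i in range(n)) * emit_p[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total
```

A session would then be flagged as a robot when its log-likelihood under the robot model exceeds a threshold chosen on held-out data; normalizing by sequence length before comparing is a common choice, since longer sequences naturally score lower.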
  • 6. Bot detection based on Bayesian Approach
    Approach C: Bayesian Approach, classification algorithm: Naïve Bayes
    This approach employs a Naïve Bayes network to classify the HTTP sessions of a web server access log as crawler or human induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic.
    Algorithm for bot detection system based on Bayesian Approach
    Access-log analysis and session identification
    Session features are selected to be used as variables (nodes) in the Bayesian network
    Construction of the Bayesian network structure
    (a) Labeling of the set of training examples. At this step, sessions are classified as crawler- or human-initiated sessions to form the set of examples of the two classes
    (b) Learning the required Bayesian network parameters using the set of training examples derived from step a
    (c) Quantification of the Bayesian network using the learned parameters
    Classification: we extract the features of each session and use them as evidence to be inserted into the Bayesian network model. A probability of each session being a crawler is thus derived.
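The learning and classification steps above can be sketched as a small categorical Naive Bayes classifier. This is a minimal illustration, not the referenced paper's network: the `NaiveBayes` class, the Laplace smoothing parameter `alpha`, and the feature names in the usage example are assumptions, and each feature is assumed to be already discretized.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, sessions, labels):
        """sessions: list of {feature_name: discrete_value}; labels: class per session."""
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for feats, label in zip(sessions, labels):
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def predict_proba(self, feats):
        """Return {label: posterior probability} for one session's features."""
        total = sum(self.class_counts.values())
        log_post = {}
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # class prior
            for name, value in feats.items():
                count = self.value_counts[(label, name)][value]
                k = len(self.values[name]) or 1
                lp += math.log((count + self.alpha) / (n + self.alpha * k))
            log_post[label] = lp
        # Normalize in log space for numerical stability.
        m = max(log_post.values())
        exp = {c: math.exp(v - m) for c, v in log_post.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


# Usage with hypothetical features and hand-labeled training sessions.
training = [
    {"robots_txt": True, "images": "low"},
    {"robots_txt": True, "images": "low"},
    {"robots_txt": False, "images": "high"},
    {"robots_txt": False, "images": "high"},
]
labels = ["crawler", "crawler", "human", "human"]
model = NaiveBayes().fit(training, labels)
posterior = model.predict_proba({"robots_txt": True, "images": "low"})
```

Here `posterior["crawler"]` is the derived probability that the session was crawler-initiated; a decision threshold on it would be tuned against the labeled examples.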
  • 7. References
    Crook, Thomas and Frasca, Brian and Kohavi, Ron and Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.
    Tan, Pang-Ning and Kumar, Vipin. 2002. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, Vol. 6, No. 1, pp. 9-35. http://citeseer.ist.psu.edu/article/tan02discovery.html
    Stassopoulou, Athena and Dikaiakos, Marios D. 2009. Web Robot Detection: A Probabilistic Reasoning Approach. Computer Networks.
    Lu, Wei-Zhou and Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.
    Alves, Ronnie and Belo, Orlando and Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.