Bot detection algorithm


Published in: Technology, Education


  1. Bot Detection Algorithm
     Parinita, Computational Linguistics Masters Program, University of Washington
  2. Seven Pitfalls to Avoid When Running Controlled Experiments on the Web
     1. Picking an OEC for which it is easy to beat the control by doing something clearly "wrong" from a business perspective
     2. Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics
     3. Using standard statistical formulas for computations of variance and power
     4. Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates
     5. Neglecting to filter robots
     6. Failing to validate each step of the analysis pipeline and the OEC components
     7. Forgetting to control for all differences, and assuming that humans can keep the variants in sync
     Source: Crook et al. 2009 (see References)
  3. In practice, identifying robots is difficult
     But the navigation patterns of web robots are distinct from those of human users in terms of:
     1. Average rate of submitted queries
     2. Length of the sessions
     3. Interval between successive queries
     4. Coverage of the web site
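The four navigation-pattern features above can be computed per session from a server log. The sketch below is illustrative only: the log-record layout, the user/session mapping, and the `total_site_pages` parameter are all assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical log entries: (user_id, timestamp_seconds, url).
LOG = [
    ("u1", 0.0, "/a"), ("u1", 0.5, "/b"), ("u1", 1.0, "/c"), ("u1", 1.5, "/d"),
    ("u2", 0.0, "/a"), ("u2", 30.0, "/a"), ("u2", 95.0, "/b"),
]

def session_features(entries, total_site_pages=10):
    """Compute the four navigation-pattern features for one session."""
    times = [t for _, t, _ in entries]
    urls = [u for _, _, u in entries]
    duration = (max(times) - min(times)) or 1.0  # avoid div-by-zero for 1-hit sessions
    intervals = [b - a for a, b in zip(times, times[1:])]
    return {
        "query_rate": len(entries) / duration,            # requests per second
        "mean_interval": (sum(intervals) / len(intervals)) if intervals else 0.0,
        "coverage": len(set(urls)) / total_site_pages,    # fraction of site visited
        "session_length": len(entries),
    }

# Group log records into per-user sessions, then score each one.
sessions = defaultdict(list)
for rec in LOG:
    sessions[rec[0]].append(rec)

for user, entries in sessions.items():
    print(user, session_features(entries))
```

A very high `query_rate` with broad `coverage` is the robot-like corner of this feature space; the thresholds that separate the classes are what the approaches on the following slides learn.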
  4. A statistical bot-detection model works better than a rule-based system
     Approach A: heuristic-regression approach to bot-pattern identification; classification algorithm: decision trees.
     Extract of the set of heuristic rules:
     - If Query = 'robots.txt' then Confidence_factor = 1.0
     - If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
     - If User_agent = 'Mozilla' and Session_length > 50 then Confidence_factor = 0.87
     - If default then no identification
     [Slide figures: example regression tree for the crawlers' confidence factor; overall perspective of the processing]
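A minimal sketch of how the extracted heuristic rules might be applied, checking the more specific rules first and falling through to the default. The session field names (`query`, `length`, `duration`, `user_agent`) are assumptions for illustration, not the paper's schema.

```python
def crawler_confidence(session):
    """Apply the slide's heuristic rules in priority order.

    Returns the crawler confidence factor, or None when no rule fires
    (the slide's "default: no identification" case)."""
    if session.get("query") == "robots.txt":
        return 1.0
    if session.get("length", 0) > 100 and session.get("duration", float("inf")) < 10:
        return 0.95
    if session.get("user_agent") == "Mozilla" and session.get("length", 0) > 50:
        return 0.87
    return None  # default: no identification

print(crawler_confidence({"query": "robots.txt"}))
print(crawler_confidence({"user_agent": "Mozilla", "length": 60}))
```

The slide's point is that such hand-written thresholds are brittle; the regression-tree step learns the confidence factors from data instead of fixing them by hand.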
  5. Bot detection based on Hidden Markov Models
     Approach B: hidden Markov model; classification algorithm: hidden Markov model.
     - Use an HMM to describe the robot access pattern, then detect robots against that access model.
     - One or more requests from the same user that arrive in the same time unit are called a batch arrival.
     - Calculate the sequences r_t and R_t from the server logs, where r_t is the number of requests in the t-th time unit and R_t is the summation of requests over a given time interval.
     - Because human users and robots behave differently, their burst levels differ, and this difference is reflected in r_t and R_t.
     - We assume the batch-arrival process is governed by a special Markov chain with M different states.
     - The detection method relies on the fact that most robots have similar request-arrival patterns because they obey the same design guidelines.
     - We use the observed robot sequences to train a robot request-pattern model, then calculate the likelihood of each incoming request sequence against that model.
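The likelihood-scoring step can be illustrated with a hand-rolled forward algorithm over quantized batch sizes r_t. Everything below is a toy: the two hidden burst levels, the three r_t symbols, and all parameter values are invented for illustration, not trained values from the paper.

```python
import math

# Toy "robot request pattern" HMM over quantized batch sizes r_t:
# symbol 0 = idle, 1 = low, 2 = burst.
STATES = 2                        # hidden burst levels (M = 2 here)
PI = [0.2, 0.8]                   # initial state distribution
A = [[0.6, 0.4], [0.1, 0.9]]      # state transition matrix
B = [[0.7, 0.25, 0.05],           # emission probabilities per state:
     [0.05, 0.25, 0.7]]           # state 1 favors bursty arrivals

def log_likelihood(obs):
    """Forward algorithm: log P(obs | model), summed over all state paths."""
    alpha = [PI[s] * B[s][obs[0]] for s in range(STATES)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(STATES)) * B[t][o]
                 for t in range(STATES)]
    return math.log(sum(alpha))

# Sequences scoring high under the robot model are flagged as robot traffic.
bursty = [2, 2, 2, 1, 2, 2]       # robot-like batch arrivals
calm = [0, 0, 1, 0, 0, 1]         # human-like batch arrivals
print(log_likelihood(bursty), log_likelihood(calm))
```

In practice the threshold between "robot-like" and "human-like" likelihoods would be chosen on held-out labeled sessions, and long sequences would need log-space or scaled forward recursions to avoid underflow.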
  6. Bot detection based on a Bayesian approach
     Approach C: Bayesian approach; classification algorithm: naïve Bayes.
     This approach employs a naïve Bayes network to classify the HTTP sessions of a web-server access log as crawler- or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic.
     Algorithm for the Bayesian bot-detection system:
     1. Access-log analysis and session identification.
     2. Session features are selected to be used as variables (nodes) in the Bayesian network.
     3. Construction of the Bayesian network structure.
     4. Learning:
        (a) Label the set of training examples: sessions are classified as crawler- or human-initiated to form the examples of the two classes.
        (b) Learn the required Bayesian network parameters from the training examples derived in step (a).
        (c) Quantify the Bayesian network using the learned parameters.
     5. Classification: extract the features of each session and insert them as evidence into the Bayesian network. This yields the probability that each session is a crawler.
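The learning and classification steps above can be sketched with a tiny hand-rolled naïve Bayes classifier over binary session features. The feature names, the training examples, and the Laplace smoothing are all assumptions for illustration; the paper's actual evidence variables and network structure differ.

```python
import math
from collections import defaultdict

FEATURES = ["hit_robots_txt", "high_request_rate", "fetched_images"]

# Invented labeled sessions (step 4a): label 1 = crawler, 0 = human.
TRAIN = [
    ({"hit_robots_txt": 1, "high_request_rate": 1, "fetched_images": 0}, 1),
    ({"hit_robots_txt": 0, "high_request_rate": 1, "fetched_images": 0}, 1),
    ({"hit_robots_txt": 0, "high_request_rate": 0, "fetched_images": 1}, 0),
    ({"hit_robots_txt": 0, "high_request_rate": 0, "fetched_images": 1}, 0),
]

def train(examples):
    """Step 4b/4c: learn class priors and Laplace-smoothed Bernoulli parameters."""
    counts = defaultdict(int)
    feat_on = defaultdict(lambda: defaultdict(int))
    for x, y in examples:
        counts[y] += 1
        for f in FEATURES:
            feat_on[y][f] += x[f]
    prior = {y: counts[y] / len(examples) for y in counts}
    theta = {y: {f: (feat_on[y][f] + 1) / (counts[y] + 2) for f in FEATURES}
             for y in counts}
    return prior, theta

def p_crawler(x, prior, theta):
    """Step 5: posterior P(crawler | session evidence) via Bayes' rule."""
    logp = {}
    for y in prior:
        lp = math.log(prior[y])
        for f in FEATURES:
            p = theta[y][f]
            lp += math.log(p if x[f] else 1 - p)
        logp[y] = lp
    z = sum(math.exp(v) for v in logp.values())
    return math.exp(logp[1]) / z

prior, theta = train(TRAIN)
print(p_crawler({"hit_robots_txt": 1, "high_request_rate": 1, "fetched_images": 0},
                prior, theta))
```

Naïve Bayes assumes the evidence features are conditionally independent given the class; a full Bayesian network, as in the paper, can also encode dependencies between the features.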
  7. References
     - Crook, Thomas; Frasca, Brian; Kohavi, Ron; Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.
     - Tan, Pang-Ning; Kumar, Vipin. 2002. Discovery of Web Robot Sessions Based on Their Navigational Patterns. Data Mining and Knowledge Discovery, Vol. 6, No. 1, pp. 9-35.
     - Stassopoulou, Athena; Dikaiakos, Marios D. 2009. Web Robot Detection: A Probabilistic Reasoning Approach. Computer Networks.
     - Lu, Wei-Zhou; Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.
     - Alves, Ronnie; Belo, Orlando; Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.