Bot Detection Algorithm
Parinita, Computational Linguistics Master's Program, University of Washington
7 Pitfalls to avoid when running controlled experimentation on the web
1. Picking an OEC for which it is easy to beat the control by doing something clearly "wrong" from a business perspective
2. Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics
3. Using standard statistical formulas for computations of variance and power
4. Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates
5. Neglecting to filter robots
6. Failing to validate each step of the analysis pipeline and the OEC components
7. Forgetting to control for all differences, and assuming that humans can keep the variants in sync
Source: http://exp-platform.com/Documents/2009-ExPpitfalls.pdf
In practice, identifying robots is difficult.
However, the navigation patterns of web robots are distinct from those of human users in terms of:
1. Average rate of queries submitted
2. Length of the sessions
3. Interval between successive queries
4. Coverage of the web site
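The four navigation characteristics listed above can be computed per session from a parsed access log. The following is a minimal sketch, not a method taken from the cited papers: the function name `session_features`, the `(timestamp, url)` input shape, and the definition of coverage as the fraction of known site pages visited are all assumptions made for illustration.

```python
def session_features(requests, site_pages):
    """Summarize one session.

    requests   -- non-empty list of (timestamp_in_seconds, url) tuples (assumed shape)
    site_pages -- non-empty set of all known page URLs on the site (assumed to be available)
    """
    if not requests or not site_pages:
        raise ValueError("need at least one request and one known page")
    times = sorted(t for t, _ in requests)
    length = len(times)                          # length of the session, in requests
    duration = times[-1] - times[0]              # seconds from first to last request
    gaps = [b - a for a, b in zip(times, times[1:])]
    visited = {url for _, url in requests} & site_pages
    return {
        "session_length": length,
        # Rate is undefined when every request shares one timestamp.
        "query_rate": length / duration if duration > 0 else None,
        "mean_interval": sum(gaps) / len(gaps) if gaps else None,
        "coverage": len(visited) / len(site_pages),
    }
```

A short, fast session that touches a large share of the site would show a high `query_rate`, a small `mean_interval`, and high `coverage`, which is the kind of pattern the slide attributes to robots.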
Statistical bot detection model works better than a rule-based system
Approach A: Heuristic-Regression Approach to bot pattern identification; classification algorithm: Decision Trees
Extract of the set of heuristic rules:
If Query = 'robots.txt' then Confidence_factor = 1.0
If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
If User_agent = 'Mozilla' and Session_length > 50 then Confidence_factor = 0.87
If default then no identification
[Figure: Example of regression tree for the crawlers' confidence factor]
[Figure: Overall perspective of the processing]
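The rule extract above can be read as an ordered list of checks that assign a confidence factor to a session. The sketch below encodes only those four rules; the function name, the argument names, and the choice to return `None` for "no identification" are assumptions, and the regression-tree stage of the approach is not shown.

```python
def confidence_factor(query, user_agent, session_length, session_duration_secs):
    """Return the crawler confidence factor for a session, or None if no rule fires."""
    if query == "robots.txt":
        return 1.0
    if session_length > 100 and session_duration_secs < 10:
        return 0.95
    if user_agent == "Mozilla" and session_length > 50:
        return 0.87
    return None  # default rule: no identification
```

For example, `confidence_factor("index.html", "Mozilla", 60, 300)` falls through the first two rules and returns 0.87 from the third.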
Bot detection based on Hidden Markov Models
Approach B: Hidden Markov Model; classification algorithm: Hidden Markov Model
Use an HMM to describe the robot access pattern, then detect robots based on that access model.
One or more requests from the same user that arrive in the same time unit are called a batch arrival.
Calculate the sequences r_t and R_t from the server logs (r_t is the number of requests in the t-th time unit, and R_t is the sum of requests over a given time interval).
Because human users and robots behave differently, their requests show different burst levels, and this difference is reflected in r_t and R_t.
We assume that the batch arrival process is governed by a Markov chain with M different states.
The detection method relies on the fact that most robots have similar request arrival patterns because they follow the same design guidelines.
We use observed robot sequences to train a robot request pattern model, then calculate the likelihood of each incoming request sequence under that model.
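The scoring step described above (computing r_t from the log and evaluating the likelihood of a sequence under a robot model) can be sketched with the standard forward algorithm for a discrete HMM. This is an illustration under stated assumptions, not the cited authors' implementation: the quantization of r_t into three burst levels, the two-state model, and every probability in `ROBOT_MODEL` are hypothetical, and training (for example with Baum-Welch) is omitted.

```python
import math
from collections import Counter

def requests_per_unit(timestamps, unit=1.0):
    """Return r_t: the number of requests in each consecutive time unit."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((ts - start) // unit) for ts in timestamps)
    return [counts.get(t, 0) for t in range(max(counts) + 1)]

def quantize(r_t, thresholds=(0, 2)):
    """Map counts to symbols: 0 = idle, 1 = light (1-2 requests), 2 = burst (3+)."""
    return [sum(r > th for th in thresholds) for r in r_t]

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total

# Hypothetical hand-set parameters: state 0 = pausing, state 1 = bursting.
ROBOT_MODEL = {
    "start": [0.2, 0.8],
    "trans": [[0.3, 0.7], [0.1, 0.9]],
    "emit": [[0.8, 0.15, 0.05], [0.05, 0.15, 0.8]],
}
```

A sequence would be flagged as robot-like when `log_likelihood(quantize(requests_per_unit(ts)), **ROBOT_MODEL)` exceeds a threshold chosen on labeled data; the threshold itself is not specified in the slide.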
Bot detection based on Bayesian Approach
Approach C: Bayesian Approach; classification algorithm: Naïve Bayes
This approach employs a Naïve Bayes network to classify the HTTP sessions of a web server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic.
Algorithm for a bot detection system based on the Bayesian approach:
1. Access-log analysis and session identification
2. Selection of the session features to be used as variables (nodes) in the Bayesian network
3. Construction of the Bayesian network structure
4. Learning:
(a) Labeling of the set of training examples. At this step, sessions are classified as crawler-initiated or human-initiated to form the set of examples of the two classes
(b) Learning the required Bayesian network parameters using the set of training examples derived from step (a)
(c) Quantification of the Bayesian network using the learned parameters
5. Classification: we extract the features of each session and use them as evidence to be inserted into the Bayesian network model. A probability of each session being a crawler is thus derived.
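The learning and classification steps above can be illustrated with a small Naïve Bayes classifier over discretized session features. This is a sketch under assumptions rather than the cited system: the feature names in the usage note, the dictionary-based model layout, and the Laplace smoothing constant `alpha` are hypothetical choices.

```python
import math
from collections import Counter, defaultdict

def train(sessions, labels, alpha=1.0):
    """Learn class priors and per-feature value counts from labeled sessions."""
    classes = sorted(set(labels))
    counts = {c: defaultdict(Counter) for c in classes}
    values = defaultdict(set)
    for feats, label in zip(sessions, labels):
        for name, value in feats.items():
            counts[label][name][value] += 1
            values[name].add(value)
    return {"classes": classes, "class_counts": Counter(labels),
            "counts": counts, "values": values, "alpha": alpha, "n": len(labels)}

def posterior(model, feats, target="crawler"):
    """Return P(target | feats) under the naive independence assumption."""
    log_scores = {}
    for c in model["classes"]:
        score = math.log(model["class_counts"][c] / model["n"])
        for name, value in feats.items():
            # Laplace smoothing keeps unseen feature values from zeroing the product.
            num = model["counts"][c][name][value] + model["alpha"]
            den = model["class_counts"][c] + model["alpha"] * len(model["values"][name])
            score += math.log(num / den)
        log_scores[c] = score
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return math.exp(log_scores[target] - top) / total
```

Usage with hypothetical features: after `model = train(sessions, labels)`, where each session is a dictionary such as `{"robots_txt": True, "images": "low"}` and each label is `"crawler"` or `"human"`, calling `posterior(model, new_session)` gives the estimated probability that the new session is a crawler.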
References
Crook, Thomas and Frasca, Brian and Kohavi, Ron and Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.
Tan, Pang-Ning and Kumar, Vipin. 2002. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, Vol. 6, No. 1, pp. 9-35. http://citeseer.ist.psu.edu/article/tan02discovery.html
Stassopoulou, Athena and Dikaiakos, Marios D. 2009. Web Robot Detection: A Probabilistic Reasoning Approach. Computer Networks.
Lu, Wei-Zhou and Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.
Alves, Ronnie and Belo, Orlando and Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.