- 1. Bot Detection Algorithm
  Parinita, Computational Linguistics Masters Program, University of Washington
- 2. 7 pitfalls to avoid when running controlled experimentation on the web
  1. Picking an OEC (Overall Evaluation Criterion) for which it is easy to beat the control by doing something clearly "wrong" from a business perspective
  2. Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics
  3. Using standard statistical formulas for computations of variance and power
  4. Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates
  5. Neglecting to filter robots
  6. Failing to validate each step of the analysis pipeline and the OEC components
  7. Forgetting to control for all differences, and assuming that humans can keep the variants in sync
  Source: http://exp-platform.com/Documents/2009-ExPpitfalls.pdf
- 3. In practice, identifying robots is difficult
  But the navigation patterns of web robots are distinct from those of human users in terms of:
  1. Average rate of queries submitted
  2. Length of the sessions
  3. Interval between successive queries
  4. Coverage of the web site
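The four navigation measures on this slide can be computed per session from a server log. The following is a minimal sketch, not the method of any cited paper: the `SessionFeatures` type, the `session_features` function, the `(timestamp, url)` input shape, and the `site_page_count` parameter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionFeatures:
    request_rate: float   # average requests per second over the session
    session_length: int   # number of requests in the session
    mean_interval: float  # mean seconds between successive requests
    coverage: float       # fraction of the site's pages that were visited


def session_features(requests, site_page_count):
    """Summarize one session.

    requests: list of (timestamp_seconds, url) tuples in time order
              (an assumed input format, not taken from the slides).
    site_page_count: total number of distinct pages on the site.
    """
    if not requests:
        raise ValueError("session has no requests")
    times = [t for t, _ in requests]
    n = len(requests)
    duration = times[-1] - times[0]
    intervals = [b - a for a, b in zip(times, times[1:])]
    mean_interval = sum(intervals) / len(intervals) if intervals else 0.0
    # Convention assumed here: a zero-duration session counts as one second.
    rate = n / duration if duration > 0 else float(n)
    coverage = len({url for _, url in requests}) / site_page_count
    return SessionFeatures(rate, n, mean_interval, coverage)


features = session_features([(0, "/a"), (2, "/b"), (4, "/a")], site_page_count=4)
print(features)
```

A robot session would typically show a high request rate, short intervals, and broad coverage, but any thresholds would have to come from labeled data.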
- 4. Statistical bot detection model works better than a rule-based system
  Approach A: Heuristic-Regression approach to bot pattern identification. Classification algorithm: decision trees.
  Extract of the set of heuristic rules:
  - If Query = 'robots.txt' then Confidence_factor = 1.0
  - If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
  - If default then no identification
  - If User_agent = 'Mozilla' and Session_length > 50 then Confidence_factor = 0.87
  [Figure: Example of regression tree for the crawlers' confidence factor]
  [Figure: Overall perspective of the processing]
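The extract of heuristic rules on this slide could be written as a small function. This is only a sketch under assumptions: the function name and argument names are hypothetical, the rules are checked in descending confidence order, and the slide's "default" rule is treated as the fall-through case that returns `None`.

```python
def heuristic_confidence(query, user_agent, session_length, session_duration_secs):
    """Return a crawler confidence factor, or None when no rule fires.

    The thresholds and confidence values are the ones listed on the slide;
    the evaluation order is an assumption made for this sketch.
    """
    if query == "robots.txt":
        return 1.0
    if session_length > 100 and session_duration_secs < 10:
        return 0.95
    if user_agent == "Mozilla" and session_length > 50:
        return 0.87
    # Default rule: no identification.
    return None


print(heuristic_confidence("robots.txt", "ExampleBot", 1, 1))
print(heuristic_confidence("index.html", "Mozilla", 60, 120))
```

In the heuristic-regression approach, confidence factors like these would serve as the target that a regression tree learns to predict from session attributes.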
- 5. Bot detection based on Hidden Markov Models
  Approach B: Hidden Markov Model. Classification algorithm: Hidden Markov Model.
  - Use an HMM to describe the robot access pattern, then detect robots based on that access model
  - One or more requests from the same user that arrive in the same time unit are called a batch arrival
  - Calculate the sequences of r_t and R_t from server logs (r_t is the number of requests in the t-th time unit, and R_t is the summation of requests in a given time interval)
  - Because human users and robots behave differently, their burst levels differ, and this difference is reflected in r_t and R_t
  - We assume that the process of batch arrival is controlled by a special Markov chain with M different states
  - Our detection method relies on the fact that most robots have similar request arrival patterns because they obey the same design guidelines
  - We use the observed sequences of robots to train a robot request pattern model, then calculate the likelihood of an incoming request sequence against that model
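The scoring step on this slide, computing the likelihood of a request sequence under a trained model, can be sketched with the standard forward algorithm for a discrete HMM. Everything specific below is an assumption for illustration: the function names, the one-second time unit, the mapping of r_t to three burst levels, and the hand-picked model parameters (a real model would be trained on observed robot sequences).

```python
import math
from collections import Counter


def requests_per_unit(timestamps, unit_secs=1.0):
    """r_t: number of requests in each time unit, from first to last request."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((ts - start) // unit_secs) for ts in timestamps)
    return [counts.get(t, 0) for t in range(max(counts) + 1)]


def burst_level(r):
    """Map a request count to a discrete symbol: 0, 1, or 2 (two or more)."""
    return min(r, 2)


def log_likelihood(obs, start_p, trans_p, emit_p):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    if not obs:
        raise ValueError("observation sequence is empty")
    n_states = len(start_p)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    log_l = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans_p[p][s] for p in range(n_states))
                * emit_p[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_l += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_l


# Hypothetical two-state robot model (not trained, values chosen by hand).
robot_start = [0.5, 0.5]
robot_trans = [[0.9, 0.1], [0.1, 0.9]]
robot_emit = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]

r_t = requests_per_unit([0.0, 0.2, 0.5, 1.1, 1.3, 2.4, 2.6])
symbols = [burst_level(r) for r in r_t]
print(r_t, log_likelihood(symbols, robot_start, robot_trans, robot_emit))
```

A session would then be flagged as a robot when its log-likelihood under the robot model exceeds some threshold; choosing that threshold is outside the scope of this sketch.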
- 6. Bot detection based on Bayesian Approach
  Approach C: Bayesian Approach. Classification algorithm: Naïve Bayes.
  This approach employs a Naïve Bayes network to classify the HTTP sessions of a web server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic.
  Algorithm for a bot detection system based on the Bayesian approach:
  1. Access-log analysis and session identification
  2. Session features are selected to be used as variables (nodes) in the Bayesian network
  3. Construction of the Bayesian network structure
  4. Learning:
     (a) Labeling of the set of training examples. At this step, sessions are classified as crawler-initiated or human-initiated to form the set of examples of the two classes
     (b) Learning the required Bayesian network parameters using the set of training examples derived from step (a)
     (c) Quantification of the Bayesian network using the learned parameters
  5. Classification: we extract the features of each session and use them as evidence to be inserted into the Bayesian network model. A probability of each session being a crawler is thus derived.
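The learning and classification steps above can be sketched with a small Naïve Bayes classifier over discrete session features. This is an illustration under assumptions, not the network from the cited paper: the function names, the feature names (`robots_txt`, `images`), the toy training data, and the use of add-one smoothing are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def train(sessions, labels):
    """Learn class and per-feature value counts from labeled sessions.

    sessions: list of dicts mapping feature name -> discrete value.
    labels:   list of "crawler" / "human" strings, one per session.
    """
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> values seen in training
    for feats, label in zip(sessions, labels):
        for name, value in feats.items():
            value_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return class_counts, value_counts, feature_values


def crawler_probability(model, feats):
    """Posterior probability that a session with these features is a crawler."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    log_post = {}
    for label, n in class_counts.items():
        lp = math.log(n / total)
        for name, value in feats.items():
            count = value_counts[(label, name)][value]
            # Add-one smoothing so unseen values do not zero out the product.
            lp += math.log((count + 1) / (n + len(feature_values[name])))
        log_post[label] = lp
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return math.exp(log_post.get("crawler", float("-inf")) - top) / norm


# Toy training set with hypothetical features.
sessions = [
    {"robots_txt": True, "images": "low"},
    {"robots_txt": True, "images": "low"},
    {"robots_txt": False, "images": "high"},
    {"robots_txt": False, "images": "high"},
]
labels = ["crawler", "crawler", "human", "human"]
model = train(sessions, labels)
print(crawler_probability(model, {"robots_txt": True, "images": "low"}))
```

Working in log space keeps the product of many small probabilities numerically stable when more features are added.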
- 7. References
  - Crook, Thomas; Frasca, Brian; Kohavi, Ron; Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.
  - Tan, Pang-Ning; Kumar, Vipin. 2002. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, Vol. 6, No. 1, pp. 9-35. http://citeseer.ist.psu.edu/article/tan02discovery.html
  - Stassopoulou, Athena; Dikaiakos, Marios D. 2009. Web Robot Detection: A Probabilistic Reasoning Approach. Computer Networks.
  - Lu, Wei-Zhou; Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.
  - Alves, Ronnie; Belo, Orlando; Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.
