
- 1. Bot Detection Algorithm
  Parinita, Computational Linguistics Masters Program, University of Washington
- 2. 7 pitfalls to avoid when running controlled experiments on the web
  1. Picking an OEC (Overall Evaluation Criterion) for which it is easy to beat the control by doing something clearly “wrong” from a business perspective
  2. Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics
  3. Using standard statistical formulas for computations of variance and power
  4. Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates
  5. Neglecting to filter robots
  6. Failing to validate each step of the analysis pipeline and the OEC components
  7. Forgetting to control for all differences, and assuming that humans can keep the variants in sync
  Source: http://exp-platform.com/Documents/2009-ExPpitfalls.pdf
- 3. In practice, identifying robots is difficult
  However, the navigation patterns of web robots are distinct from those of human users in terms of:
  1. Average rate of queries submitted
  2. Length of the sessions
  3. Interval between successive queries
  4. Coverage of the web site
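The four session characteristics listed on slide 3 could be computed from a server log roughly as follows. This is a minimal sketch, not the method of any cited paper: the `SessionFeatures` class, the `session_features` function, and the input layout (a time-sorted, non-empty list of `(timestamp_seconds, url)` pairs plus a non-empty set of known site pages) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionFeatures:
    query_rate: float      # requests per second over the whole session
    session_length: int    # number of requests in the session
    mean_interval: float   # mean seconds between successive requests
    coverage: float        # fraction of the site's known pages that were visited


def session_features(requests, site_pages):
    """Summarize one session.

    requests:   non-empty list of (timestamp_seconds, url), sorted by time (assumed layout)
    site_pages: non-empty set of all known page URLs on the site (assumed input)
    """
    times = [t for t, _ in requests]
    visited = {url for _, url in requests}
    duration = times[-1] - times[0]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return SessionFeatures(
        # A zero-duration session has no meaningful rate; fall back to the request count.
        query_rate=len(requests) / duration if duration > 0 else float(len(requests)),
        session_length=len(requests),
        mean_interval=sum(gaps) / len(gaps) if gaps else 0.0,
        coverage=len(visited & site_pages) / len(site_pages),
    )
```

A robot session would be expected to show a high query rate, short intervals, and broad coverage compared with a human session, although the thresholds that separate the two have to come from data.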
- 4. Statistical bot detection model works better than a rule-based system
  Approach A: Heuristic-regression approach to bot pattern identification; classification algorithm: decision trees
  Extract from the set of heuristic rules:
  - If Query = ‘robots.txt’ then Confidence_factor = 1.0
  - If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
  - If default then no identification
  - If User_agent = ’Mozilla’ and Session_length > 50 then Confidence_factor = 0.87
  [Figure: Example of regression tree for the crawlers’ confidence factor]
  [Figure: Overall perspective of the processing]
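The rule extract on slide 4 can be read as a chain of checks with a fall-through default. The sketch below encodes only the rules shown on the slide; the function name `crawler_confidence`, its parameter names, and the choice to return `None` for "no identification" are assumptions, and the order in which the rules are tried is also an assumption, since the slide does not state it.

```python
def crawler_confidence(query, user_agent, session_length, session_duration_secs):
    """Return the crawler confidence factor from the slide's rules, or None.

    None stands for the slide's "no identification" default (an assumed encoding).
    """
    if query == "robots.txt":
        return 1.0
    if session_length > 100 and session_duration_secs < 10:
        return 0.95
    # The slide writes the test as equality with 'Mozilla'; it is kept literal here.
    if user_agent == "Mozilla" and session_length > 50:
        return 0.87
    return None
```

For example, `crawler_confidence("index.html", "Mozilla", 60, 300)` matches only the third rule and yields 0.87, while a short session that requests `robots.txt` is flagged with confidence 1.0.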
- 5. Bot detection based on Hidden Markov Models
  Approach B: Hidden Markov Model; classification algorithm: Hidden Markov Model
  - Use an HMM to describe the robot access pattern, then detect robots based on that access model
  - One or more requests from the same user that arrive in the same time unit are called a batch arrival
  - Calculate the sequences r_t and R_t from the server logs (r_t is the number of requests in the t-th time unit, and R_t is the sum of requests over a given time interval)
  - Because human users and robots behave differently, their burst levels differ, and this difference is reflected in r_t and R_t
  - We assume that the batch arrival process is controlled by a special Markov chain with M different states
  - Our detection method is based on the fact that most robots have similar request arrival patterns because they follow the same design guidelines
  - We use the observed robot sequences to train a robot request pattern model, then calculate the likelihood of an incoming request sequence against that model
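Two steps from slide 5 lend themselves to a short sketch: counting requests per time unit (r_t) and scoring a sequence against a trained model. The code below is an illustration under stated assumptions, not the cited authors' implementation. It treats the model as a plain discrete HMM scored with the standard scaled forward algorithm; the names `batch_counts` and `log_likelihood` and the parameter layout (`start`, `trans`, `emit`) are hypothetical, and training the parameters from observed robot sequences is not shown.

```python
import math


def batch_counts(timestamps, unit=1.0):
    """r_t: number of requests falling in each consecutive time unit."""
    first = min(timestamps)
    n_units = int((max(timestamps) - first) // unit) + 1
    counts = [0] * n_units
    for ts in timestamps:
        counts[int((ts - first) // unit)] += 1
    return counts


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    obs:   non-empty list of symbol indices (e.g. r_t values capped at a maximum level)
    start: start[i]    = P(first state is i)
    trans: trans[i][j] = P(next state is j | current state is i)
    emit:  emit[i][k]  = P(symbol k | state i)
    """
    m = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(m)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(m)) * emit[j][obs[t]]
                for j in range(m)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numeric underflow
    return total
```

In use, an incoming session's r_t sequence would be mapped to symbol indices and scored with `log_likelihood` against the robot model; a score above some threshold would mark the session as robot-like. The threshold itself is not given on the slide and would have to be chosen from labeled data.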
- 6. Bot detection based on Bayesian Approach
  Approach C: Bayesian approach; classification algorithm: Naïve Bayes
  This approach employs a Naïve Bayes network to classify the HTTP sessions of a web server access log as crawler- or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic.
  Algorithm for a bot detection system based on the Bayesian approach:
  1. Access-log analysis and session identification
  2. Selection of the session features to be used as variables (nodes) in the Bayesian network
  3. Construction of the Bayesian network structure
  4. Learning:
     (a) Labeling of the set of training examples. At this step, sessions are classified as crawler- or human-initiated to form the set of examples of the two classes
     (b) Learning the required Bayesian network parameters using the set of training examples derived from step (a)
     (c) Quantification of the Bayesian network using the learned parameters
  5. Classification: we extract the features of each session and use them as evidence to be inserted into the Bayesian network model. A probability of each session being a crawler is thus derived.
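The learning and classification steps on slide 6 can be sketched with a small Naïve Bayes classifier over discrete session features. This is a generic sketch, not the network from the cited paper: the function names, the feature names used in the usage note, and the add-one (Laplace) smoothing are assumptions made for illustration.

```python
from collections import defaultdict


def train_naive_bayes(examples):
    """Learn counts from labeled sessions.

    examples: list of (features, label), where features maps a feature name to a
    discrete value and label is "crawler" or "human" (assumed encoding).
    """
    class_counts = defaultdict(int)   # label -> number of sessions
    value_counts = defaultdict(int)   # (label, feature, value) -> count
    values_seen = defaultdict(set)    # feature -> distinct values observed
    for features, label in examples:
        class_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name, value)] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen


def crawler_probability(model, features):
    """Posterior probability that a session with these features is a crawler."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior
        for name, value in features.items():
            k = len(values_seen[name] | {value})  # number of possible values
            # Add-one smoothing keeps unseen values from zeroing the product.
            score *= (value_counts[(label, name, value)] + 1) / (n + k)
        scores[label] = score
    return scores.get("crawler", 0.0) / sum(scores.values())
```

With two crawler sessions labeled `{"robots_txt": True, "images": False}` and two human sessions labeled `{"robots_txt": False, "images": True}`, a new session that requested `robots.txt` and no images gets a crawler probability of 0.9. With many features, summing log probabilities instead of multiplying would be the safer choice numerically.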
- 7. References
  - Crook, Thomas; Frasca, Brian; Kohavi, Ron; Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.
  - Tan, Pang-Ning; Kumar, Vipin. 2002. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, Vol. 6, No. 1, pp. 9-35. http://citeseer.ist.psu.edu/article/tan02discovery.html
  - Stassopoulou, Athena; Dikaiakos, Marios D. 2009. Web Robot Detection: A Probabilistic Reasoning Approach. Computer Networks.
  - Lu, Wei-Zhou; Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.
  - Alves, Ronnie; Belo, Orlando; Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.
