A Density Based Clustering Approach
for Web Robot Detection
Mahdieh Zabihi,
M.Sc. Student (Computer Engineering),
Imam Reza International University.
Majid Vafaei Jahan,
Department of Computer Engineering, Islamic Azad University.
Javad Hamidzadeh,
Faculty of Computer Engineering and Information Technology,
Sadjad University of Technology.
Mashhad, Iran
29 October, 2014, Ferdowsi University of Mashhad
Web Robots
• Web robots: programs traveling the web
autonomously, starting from a ‘seed’ list of web pages
and recursively visiting accessible documents.
• Other names: Crawlers, Wanderers, Spiders,
Harvesters.
• Main Goal: Discover and retrieve content and
knowledge from the Web-based systems and services.
• Examples: Search-engine crawlers, Shopping bots,
focused crawlers, profile-driven crawlers, email
harvesters.
ICCKE 2014- M. Zabihi, M. Majid Vafaei Jahan, J. Hamidzadeh
Web Robots (cont)
• Growing need for advanced information and
knowledge-retrieval tools on the Web.
o Remarkable increase in the number of crawlers.
• Occupying the network bandwidth and
reducing the performance of web servers.
o Growing need to distinguish robots from human
users.
Main obstacles for robot detection
• Negligence in web robot design.
• Failure to follow instructions on how to design
a robot.
• Robots changing their characteristics in order to
imitate human behavior.
Web Robot Detection Problem
• Given a set of HTTP records R = {r1, r2, …, rn}, find the
set of sessions S = {s1, s2, …, sm} such that:
$S = \bigcup_{i=1}^{m} s_i, \quad \{\, s_i \mid s_i \subseteq R,\ \bigcup_{i=1}^{m} s_i = R \,\}$
o n: number of HTTP records stored in the original access log file
o m: number of all sessions in S.
• Each session is labeled by a function f:
$f : S \to \{0, 1\}, \quad f(s_i) = \begin{cases} 0 & s_i \text{ is Human} \\ 1 & s_i \text{ is Robot} \end{cases}$
• Function f is the DBSCAN clustering algorithm.
Related Works
[Figure: block diagram of the general detection pipeline. Boxes: Access Log File; Pre-Processing of Access Log File (Session Identification, Feature Extraction, Session Labeling); Learning Data Set; Test Data Set; Learning Model.]
Doran, D. and Gokhale, S. S. (2010). Web robot detection techniques: overview and limitations. Data
Mining and Knowledge Discovery, 22: 183-210.
Related Works (cont)
• Bayesian Network [1]
• Hidden Markov Model [2]
• Decision Tree [3]
• Fuzzy Inference System [4]
• Neural Network [5]
• …
Drawback: the conventional methods used for session labeling
are unreliable.
1. Stassopoulou, A. and Dikaiakos, M. D. (2009). Web robot detection: a probabilistic reasoning approach.
Computer Networks, 53: 265-278.
2. Lu, W.-Z. and Yu, S.-Z. (2006). Web robot detection based on hidden Markov model. International Conference
on Communications, Circuits and Systems, 1806-1810.
3. Tan, P.-N. and Kumar, V. (2002). Discovery of web robot sessions based on their navigational patterns.
Data Mining and Knowledge Discovery, 6: 9-35.
4. Zabihi, M., Vafaei Jahan, M., and Hamidzadeh, J. (2014). Fuzzy Intrusion Detection of Web Robots in
Computer Networks. International Journal of Applied Mathematics in Engineering, Management
and Technology, 152-158.
5. Bomhardt, C., Gaul, W., and Schmidt-Thieme, L. (2005). Web Robot detection pre-processing web log
files for Robot Detection. New Developments in Classification and Data Analysis: 113-124.
Related Works (cont)
• SOM Clustering Algorithm as the
unsupervised learning technique.
• Using the labels of sessions only for the
supervised evaluation of their proposed
methods.
Stevanovic, D., Vlajic, N., and An, A. (2013). Detection of malicious and non-malicious website visitors
using unsupervised neural network learning. Applied Soft Computing, 13(1): 698-708.
The Proposed Method (DBC_WRD)
• DBSCAN Clustering Algorithm:
– Very effective at clustering data sets of significantly
more than a few thousand objects.
– Able to find complicated cluster shapes in a reasonable
amount of time.
– Requires only two input parameters and supports the
user in determining their values.
• Two new features to distinguish robots from human
users:
– Based on the navigational patterns of visitors and the
resources they request.
– Selected so that they do not change over time, in order
to keep the technique effective across server domains
and in the face of evolving robot traffic.
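As an illustration of the two-parameter interface mentioned above, here is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is not the WEKA implementation used in the experiments; the function names, the Euclidean distance, and the toy points are assumptions made for this example only.

```python
from math import dist

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE (-1) marks points in no cluster."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:      # not a core point (may become a border point later)
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:      # already assigned: nothing more to do
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels

# Toy data (hypothetical): two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In the setting of these slides, each point would be the feature vector of one session, and eps and min_pts correspond to the two input parameters of the algorithm.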
Pre Processing of Access Log
1. Session Identification:
for each r ∈ R do
  for each s ∈ S do
    if (r.time - s.LastTime > thre) then
      close s;
    else
      if (containsIP(s, r.ip) or containsAgent(s, r.agent)) then
        add r to s.list;
      else
        create new session s';
        add r to s'.list;
  end;
end;
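The timeout-based grouping above can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the authors' code: sessions are keyed by the (IP, user agent) pair rather than by either field, timestamps are plain seconds, and the 1800-second threshold and the `Record` type are hypothetical.

```python
from collections import namedtuple

# Hypothetical record type: time is a timestamp in seconds.
Record = namedtuple("Record", "time ip agent")

def identify_sessions(records, threshold=1800):
    """Group records into sessions keyed by (ip, agent), split on inactivity."""
    open_sessions = {}   # (ip, agent) -> list of records in the currently open session
    closed = []
    for r in sorted(records, key=lambda rec: rec.time):
        key = (r.ip, r.agent)
        current = open_sessions.get(key)
        if current is not None and r.time - current[-1].time > threshold:
            closed.append(current)       # gap exceeds the threshold: close the session
            current = None
        if current is None:
            current = []                 # start a new session for this visitor
            open_sessions[key] = current
        current.append(r)
    return closed + list(open_sessions.values())

records = [
    Record(0, "1.1.1.1", "A"),
    Record(100, "1.1.1.1", "A"),
    Record(5000, "1.1.1.1", "A"),   # more than 1800 s after the previous request
    Record(50, "2.2.2.2", "B"),
]
sessions = identify_sessions(records)
```

Here the first visitor yields two sessions because of the long gap, and the second visitor yields one.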
Pre Processing of Access Log
2. Feature Extraction:
– Trap file request
– Maximum rate of browser file request (new)
– Penalty (new)
– Percentage of 304 response codes
Pre Processing of Access Log
Trap File Requests:
– A binary feature which demonstrates whether a session
contains a request for Trap files.
– Trap files: the resources that should never be requested by
human users.
• There is no link from the web site to these files.
• Most users are unaware of them.
• ‘cmd.exe’, ‘robots.txt’, ‘sitemap.xml’ (new).
– ‘sitemap.xml’: a guide for most search-engine web robots
that lists the web pages and URLs of a web site,
including those the robots cannot easily discover
on their own.
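The binary trap-file feature could be extracted along these lines. This is a small sketch: the function name and the representation of a session as a list of requested URLs are assumptions, and only the three file names listed on the slide are used.

```python
from urllib.parse import urlsplit

# Trap file names taken from the slide.
TRAP_FILES = {"cmd.exe", "robots.txt", "sitemap.xml"}

def has_trap_request(urls):
    """Return 1 if any requested URL targets a trap file, else 0."""
    for url in urls:
        # Compare only the last path segment, ignoring any query string.
        filename = urlsplit(url).path.rsplit("/", 1)[-1].lower()
        if filename in TRAP_FILES:
            return 1
    return 0
```

For example, a session that requested `/index.html` and `/robots.txt` gets the value 1, while a session whose URLs merely mention a trap file name in the query string gets 0.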
Pre Processing of Access Log
Maximum rate of browser file request (new):
• When a user types a URL in the address bar or clicks a link
to visit a new page:
1. The browser analyzes the requested web page.
2. The browser automatically sends a barrage of requests
to the server to retrieve all files embedded in the page:
– js, css, woff, eot, svg, ttf, jsp, asp, aspx, tpl, xsl, cfm, xml, swf,
flv, fla, f4v, sw, raw, amr, bwf, …
• In contrast to humans, web robots can freely decide which
resources to request.
Pre Processing of Access Log
Penalty:
• A numerical attribute based on the navigational
patterns of humans.
o Human sessions involve a large number of back-and-forward
movements and loops, because:
1. the user's view is restricted by the link structure
of the site while looking for the required information,
2. the web browser's history offers “back” and “forward”
options,
3. users become disoriented during their visits.
o In contrast, after the first crawl of a site, robots can detect
where the required information resides and restrict their
subsequent requests to specific areas of that site.
Pre Processing of Access Log
• Penalizing each back-and-forward navigation or
loop.
• A larger value for this attribute among human
users than web robots.
[Figure: example site graph with pages a, b, c, d, e.]
Example: S = a, b, a, c, d, c, d, e, d, a
Penalty = 5
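The slides do not spell out the exact counting rule. One rule that reproduces the example value is to add one penalty point for every request to a page already visited earlier in the same session; the sketch below implements that assumed rule, with a hypothetical function name.

```python
def penalty(pages):
    """Count requests for pages already visited earlier in the session."""
    seen = set()
    score = 0
    for page in pages:
        if page in seen:
            score += 1       # back-and-forward move or loop: page was seen before
        else:
            seen.add(page)
    return score

# The session from the slide: S = a, b, a, c, d, c, d, e, d, a
example = penalty(["a", "b", "a", "c", "d", "c", "d", "e", "d", "a"])
```

With this rule the revisits in the example are a, c, d, d, a, which gives the penalty of 5 stated on the slide.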
Pre Processing of Access Log
Percentage of 304 response codes
• A numerical attribute calculated as the
percentage of responses with status code 304
in a session.
• A response sent with this code indicates that
the resource has not been modified since the
version specified by the request headers.
• Web browsers tend to have a higher percentage
of 304 response codes than robots.
Bomhardt, C., Gaul, W., and Schmidt-Thieme, L. (2005). Web Robot detection pre-processing web log files
for Robot Detection. New Developments in Classification and Data Analysis: 113-124.
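A minimal sketch of the 304 feature, assuming a session is represented by the list of HTTP status codes of its responses (the function name and the convention of returning 0 for an empty session are assumptions):

```python
def percent_304(status_codes):
    """Percentage (0-100) of responses in a session with status code 304."""
    if not status_codes:
        return 0.0           # assumed convention for an empty session
    hits = sum(1 for code in status_codes if code == 304)
    return 100.0 * hits / len(status_codes)
```

For instance, a session with responses 200, 304, 304, 404 has a value of 50.0.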
Pre Processing of Access Log
3. Session Labeling:
The session labeling procedure:
DB_robots: set of web robots detected by WebLog Expert.
for each s ∈ S do
  if containsTrapFile(s) then
    s.label = 1;
  else if contains(DB_robots, s.ip) or contains(DB_robots, s.agent) then
    s.label = 1;
  else
    s.label = 0;
  end;
end;
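The labeling procedure translates almost directly into Python. In this sketch a session is a dictionary with hypothetical keys `trap`, `ip` and `agent`, and `db_robots` is assumed to be a set holding both the IP addresses and the user-agent strings of known robots.

```python
def label_session(session, db_robots):
    """Return 1 (robot) or 0 (human) following the labeling procedure."""
    if session["trap"]:                  # the session requested a trap file
        return 1
    if session["ip"] in db_robots or session["agent"] in db_robots:
        return 1                         # known robot by IP address or user agent
    return 0

# Hypothetical entries for illustration only.
db_robots = {"66.249.66.1", "ExampleBot/1.0"}
sessions = [
    {"trap": 1, "ip": "10.0.0.1", "agent": "Mozilla/5.0"},
    {"trap": 0, "ip": "10.0.0.2", "agent": "ExampleBot/1.0"},
    {"trap": 0, "ip": "10.0.0.3", "agent": "Mozilla/5.0"},
]
session_labels = [label_session(s, db_robots) for s in sessions]
```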
Experiments
Specifications    Imam Reza University    ArticleBaz
Name              D1                      D2
# of Requests     311633                  372304
# of Sessions     17969                   22092
# of Humans       16799                   18144
# of Robots       1170                    3948
• The implementation of the DBSCAN algorithm provided in
WEKA data mining software:
o Epsilon=1, MinPts=6
Experiments (cont)
• Final distribution of instances of data set D1:
$recall(i, j) = \frac{n_{ij}}{n_i}$
[Figure: two bar charts, "recall Metric for Robot Cluster" and "recall Metric for Human Cluster", each with Robot and Human bars.]
Experiments (cont.)
• SOM Clustering Algorithm:
– MATLAB’s Neural Network Toolbox
• A SOM comprising 100 neurons in a 10-by-10
hexagonal arrangement.
• Training runs for 200 epochs.
Experiments (cont)
Evaluation metric    DBSCAN (Avg.)    SOM (Avg.)
RI                   0.997            0.72
Jaccard              0.964            0.265
Entropy              0.0215           0.345
Purity               0.977            0.86

$RI = \frac{TP + TN}{TP + TN + FP + FN}$
$Jaccard = \frac{TP}{TP + FP + FN}$
$Entropy = \sum_{j=1}^{c} \frac{n_j}{n} e_j, \quad e_j = -\sum_{i=1}^{L} \frac{n_{ij}}{n_j} \log_2 \frac{n_{ij}}{n_j}$
$Purity = \sum_{j=1}^{c} \frac{n_j}{n} p_j, \quad p_j = \max_i \frac{n_{ij}}{n_j}$
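The four evaluation metrics on this slide can be computed with a few lines of Python. This is a sketch under the assumption that TP, TN, FP and FN are already counted and that `table[j][i]` holds n_ij, the number of sessions of class i in cluster j; the function names are hypothetical.

```python
from math import log2

def rand_index(tp, tn, fp, fn):
    """RI = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def jaccard(tp, fp, fn):
    """Jaccard = TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)

def entropy_and_purity(table):
    """table[j][i] = n_ij, the number of class-i sessions in cluster j."""
    n = sum(sum(row) for row in table)
    entropy = 0.0
    purity = 0.0
    for row in table:
        n_j = sum(row)
        if n_j == 0:
            continue                     # an empty cluster contributes nothing
        # e_j: entropy of the class distribution inside cluster j
        e_j = -sum((n_ij / n_j) * log2(n_ij / n_j) for n_ij in row if n_ij > 0)
        # p_j: share of the dominant class inside cluster j
        p_j = max(row) / n_j
        entropy += (n_j / n) * e_j
        purity += (n_j / n) * p_j
    return entropy, purity
```

A perfect two-cluster split such as `[[10, 0], [0, 10]]` gives entropy 0 and purity 1, while a single mixed cluster `[[5, 5]]` gives entropy 1 and purity 0.5. Lower entropy and higher purity indicate better clusters.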
Experiments (cont)
Some examples of web robots that imitate human behavior:

Robot name                User-agent string
BingPreview               Mozilla/5.0(Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
Google Web Preview        Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)
Huawei Symantec Spider    HuaweiSymantecSpider/1.0+DSE-support@huaweisymantec.com
Conclusions and Remarks
• According to the supervised evaluations:
DBC_WRD achieves a Jaccard metric of 96% and
produces two clusters with entropy and purity
of 0.0215 and 0.97, respectively.
• From the standpoint of clustering quality and
accuracy:
– DBC_WRD performs better than the state-of-the-art
algorithm.
• Some popular non-malicious web robots are
difficult to identify because they imitate
human behavior.
Mahdieh Zabihi
M.Sc. Student (Computer Engineering), Imam Reza International University.
m.zabihi@imamreza.ac.ir
Majid Vafaei Jahan
Department of Computer Engineering, Islamic Azad University.
vafaeiJahan@mshdiau.ac.ir
Javad Hamidzadeh
Department of Computer Engineering, Faculty of Computer Engineering and
Information Technology, Sadjad University of Technology.
j_hamidzadeh@sadjad.ac.ir

A density based clustering approach for web robot detection

  • 1. A Density Based Clustering Approach for Web Robot Detection Mahdieh Zabihi, M.Sc. Student (Computer Engineering), Imam Reza International University. Majid Vafaei Jahan, Department of Computer Engineering, Islamic Azad University. Javad Hamidzadeh, Faculty of Computer Engineering and Information Technology, Sadjad University of Technology. Mashhad, Iran 29 October, 2014, Ferdowsi University of Mashhad
  • 2. Web Robots • Web robots: programs traveling the web autonomously, starting from a ‘seed’ list of web pages and recursively visiting accessible documents. • Other names: Crawlers, Wanderers, Spiders, Harvesters. • Main Goal: Discover and retrieve content and knowledge from the Web-based systems and services. • Examples: Search-engine crawlers, Shopping bots, focused crawlers, profile-driven crawlers, email harvesters. ICCKE 2014- M. Zabihi, M. Majid Vafaei Jahan, J. Hamidzadeh 1/22
  • 3. Web Robots (cont) • Growing need for advanced information and knowledge-retrieval tools on the Web. o Remarkable increase in the number of crawlers. • Occupying the network bandwidth and reducing the performance of web servers. o Growing need to distinguish robots from human users. ICCKE 2014- M. Zabihi, M. Majid Vafaei Jahan, J. Hamidzadeh 2/22
  • 4. Main obstacles for robot detection • Negligence in web robot design. • Failure to follow instructions on how to design a robot. • Changing robot’s characteristics, in order to imitate human’s behavior. ICCKE 2014- M. Zabihi, M. Majid Vafaei Jahan, J. Hamidzadeh 3/22
  • 5. Web Robot Detection Problem • Given a set of HTTP records R= {r1,r2,…,rn}; find the set of sessions S= {s1,s2,…,sm}, such that: o n: number of http records stored in the original access log file o m: number of all sessions in S. • Function f is the DBSCAN clustering algorithm. 1 1 ,{ | , }i m m i i i i i sS s R s R s       0 s is Human : {0,1}, ( ) 1 s is Robot i i i f S f s     ICCKE 2014- M. Zabihi, M. Majid Vafaei Jahan, J. Hamidzadeh 4/22
  • 6. Related Works Access Log file Session Identification Feature Extraction Session LabelingTest Data Set Learning Data Set Learning Model Pre Processing Of Access Log File Doran, D. and Gokhale, S. S. (2010). Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22: 183-210. 5/22
  • 7. Related Works (cont) • Bayesian Network [1] • Hidden Markov Model [2] • Decision Tree [3] • Fuzzy Inference System [4] • Neural Network [5] • …  Being unreliable of the conventional methods used for session labeling. 6/22 1. Stassopoulou and M. D. Dikaiakos. (2009). "web robot detection: a probabilistic reasoning approach", Computer Network, vol. 53, pp. 265-278. 2. Lu, WZ., Yu, SZ. (2006). Web robot detection based on hidden Markov model. international conference on communications, circuits and systems, 1806–1810. 3. Tan, P.-N. and Kumar, V. (2002). Discovery of web robot sessions based on their navigational patterns. Data Mining and Knowledge Discovery, 6: 9-35. 4. Zabihi, M., Vafaei Jahan, M., and Hamidzadeh, J. (2014). Fuzzy Intrusion Detection of Web Robots in Computer Networks. International Journal of Applied Mathematics in Engineering, Management and Technology, 152-158. 5. Bomhardt, C., Gaul, W., and Schmidt-Thieme, L. (2005). Web Robot detection pre-processing web log files for Robot Detection. New Developments in Classification and Data Analysis: 113-124.
  • 8. Related Works (cont)
      • The SOM clustering algorithm is used as the unsupervised learning technique.
      • Session labels are used only for the supervised evaluation of the proposed methods.
      Stevanovic, D., Vlajic, N., and An, A. (2013). Detection of malicious and non-malicious website visitors using unsupervised neural network learning. Applied Soft Computing, 13(1): 698-708.
  • 9. The Proposed Method (DBC_WRD)
      • DBSCAN clustering algorithm:
          – It is very effective at clustering data sets of significantly more than a few thousand objects.
          – It can find complicated cluster shapes in a reasonable amount of time.
          – It has only two input parameters and supports the user in determining their values.
      • Two new features to distinguish robots from human users:
          – They are based on the navigational patterns of visitors and the resources they request.
          – They are selected so that they do not change over time, which keeps the technique effective across server domains and in the face of evolving robot traffic.
  • 10. Pre Processing of Access Log
      1. Session Identification:
          for each record r in R do
              for each session s in S do
                  if (r.time - s.LastTime > thre) then close s;
                  else if (containsIP(s, r.ip) or containsAgent(s, r.agent)) then add r to s.list;
                  else create a new session and add r to it;
              end;
          end;
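The session identification loop above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `Record` and `Session` classes, the function name `identify_sessions`, and the 1800-second default for the threshold are assumptions (the slide only names the threshold `thre`). It keeps the slide's rule that a record joins an open session when either its IP or its user agent matches.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    ip: str
    agent: str
    time: float  # request time in seconds


@dataclass
class Session:
    records: list = field(default_factory=list)
    closed: bool = False

    @property
    def last_time(self):
        return self.records[-1].time


def identify_sessions(records, threshold=1800.0):
    """Group records into sessions; threshold is a hypothetical value for `thre`."""
    sessions = []
    for r in sorted(records, key=lambda rec: rec.time):
        target = None
        for s in sessions:
            if s.closed:
                continue
            if r.time - s.last_time > threshold:
                s.closed = True  # idle for too long: close the session
            elif any(x.ip == r.ip or x.agent == r.agent for x in s.records):
                target = s  # same IP or same agent: record belongs here
                break
        if target is None:
            target = Session()  # no open session matched: start a new one
            sessions.append(target)
        target.records.append(r)
    return sessions
```

Sorting by time first is a design choice of this sketch; access logs are usually already in chronological order.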
  • 11. Pre Processing of Access Log
      2. Feature Extraction:
          – Trap file request
          – Maximum rate of browser file request (new)
          – Penalty (new)
          – Percentage of 304 response codes
  • 12. Pre Processing of Access Log
      Trap File Requests:
          – A binary feature that indicates whether a session contains a request for a trap file.
          – Trap files: resources that should never be requested by human users.
              • There is no link from the web site to these files.
              • Most users are unaware of them.
              • ‘cmd.exe’, ‘robots.txt’, ‘sitemap.xml’ (new).
          – ‘sitemap.xml’: a guideline for most search engine web robots. It contains a list of all web pages and URL addresses of a web site that cannot be discovered easily by search engine web robots.
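As an illustration of the trap file feature, the following sketch checks the file name of each requested URL against the three trap files listed on the slide. The function name `requests_trap_file` and the choice to compare only the last path segment, case-insensitively, are assumptions.

```python
from urllib.parse import urlsplit

# Trap files named on the slide.
TRAP_FILES = {"cmd.exe", "robots.txt", "sitemap.xml"}


def requests_trap_file(urls):
    """Return 1 if any requested URL targets a trap file, else 0."""
    for url in urls:
        # Look only at the last path segment, ignoring the query string.
        name = urlsplit(url).path.rsplit("/", 1)[-1].lower()
        if name in TRAP_FILES:
            return 1
    return 0
```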
  • 13. Pre Processing of Access Log
      Maximum rate of browser file request (new):
      • When a user types a URL in the address bar or clicks a link to visit a new page:
          1. The browser analyzes the requested web page.
          2. The browser automatically sends a barrage of requests to the server to retrieve all files embedded in the page: js, css, woff, eot, svg, ttf, jsp, asp, aspx, tpl, xsl, cfm, xml, swf, flv, fla, f4v, sw, raw, amr, bwf,…
      • In contrast to humans, web robots can freely decide which resources to request.
  • 14. Pre Processing of Access Log
      Penalty:
      • A numerical attribute based on the navigational patterns of humans.
          o Human navigation involves a large number of frequent back-and-forward movements and loops, because:
              1. Humans have a view restricted by the link structure of a site when looking for the required information.
              2. Web browsers offer “back” and “forward” options through the browsing history.
              3. Humans become disoriented during their visits.
          o In contrast, after the first crawl of a site, robots can detect where the required information resides and restrict their subsequent requests to specific areas of that site.
  • 15. Pre Processing of Access Log
      • Each back-and-forward navigation or loop is penalized.
      [Figure: link graph of a site with pages a, b, c, d, e.]
      • Example: S = a, b, a, c, d, c, d, e, d, a gives Penalty = 5.
      • This attribute takes larger values for human users than for web robots.
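The slide does not spell out the exact counting rule for the penalty. One reading that reproduces the example value is to add one penalty point for every request to a page already visited earlier in the same session. The sketch below implements that reading only; the rule and the function name `penalty` are assumptions, not the authors' definition.

```python
def penalty(pages):
    """Count requests that return to a page already visited in the session."""
    seen = set()
    score = 0
    for page in pages:
        if page in seen:
            score += 1  # a back move or a loop back to a known page
        seen.add(page)
    return score


# The slide's example session: a, b, a, c, d, c, d, e, d, a
print(penalty(["a", "b", "a", "c", "d", "c", "d", "e", "d", "a"]))  # prints 5
```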
  • 16. Pre Processing of Access Log
      Percentage of 304 response codes:
      • A numerical attribute calculated as the percentage of responses with status code 304 in a session.
      • A response with this code indicates that the resource has not been modified since the version specified by the request headers.
      • Web browsers tend to have a higher percentage of 304 response codes than robots.
      Bomhardt, C., Gaul, W., and Schmidt-Thieme, L. (2005). Web robot detection: preprocessing web log files for robot detection. New Developments in Classification and Data Analysis: 113-124.
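The attribute is a simple ratio, sketched below. The function name `percent_304` and the convention of returning 0.0 for an empty session are assumptions.

```python
def percent_304(status_codes):
    """Percentage of responses in a session with HTTP status 304 (Not Modified)."""
    if not status_codes:
        return 0.0  # assumed convention for an empty session
    hits = sum(1 for code in status_codes if code == 304)
    return 100.0 * hits / len(status_codes)
```

For example, a session with status codes 200, 304, 304, 404 yields 50.0.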
  • 17. Pre Processing of Access Log
      3. Session Labeling:
      The session labeling procedure:
          DB_robots: set of web robots detected by WebLog Expert.
          for each session s in S do
              if containsTrapFile(s) then s.label = 1;
              else if contains(DB_robots, s.ip) or contains(DB_robots, s.agent) then s.label = 1;
              else s.label = 0;
              end;
          end;
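A minimal Python sketch of the labeling procedure follows. Representing DB_robots as two plain sets (`robot_ips` and `robot_agents`) is an assumption made for illustration, as are the function and parameter names; the slide only says that DB_robots holds the robots detected by WebLog Expert.

```python
# Trap files named earlier in the slides.
TRAP_FILES = {"cmd.exe", "robots.txt", "sitemap.xml"}


def label_session(ip, agent, requested_files, robot_ips, robot_agents):
    """Return 1 (robot) or 0 (human) following the slide's three rules."""
    if any(name in TRAP_FILES for name in requested_files):
        return 1  # the session requested a trap file
    if ip in robot_ips or agent in robot_agents:
        return 1  # the IP or agent is in the known-robot database
    return 0
```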
  • 18. Experiments
      Specifications:
          Data set name     Imam Reza University    ArticleBaz
          Name              D1                      D2
          # of requests     311633                  372304
          # of sessions     17969                   22092
          # of humans       16799                   18144
          # of robots       1170                    3948
      • The DBSCAN implementation provided in the WEKA data mining software was used:
          o Epsilon = 1, MinPts = 6
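To make the roles of Epsilon and MinPts concrete, here is a small pure-Python DBSCAN sketch. It is not the WEKA implementation used in the experiments: the Euclidean distance, the brute-force neighbor search, and all names are assumptions. A point is a core point when at least `min_pts` points (itself included) lie within distance `eps`.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps=1.0, min_pts=6):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

The brute-force neighbor search is quadratic in the number of points, so this sketch suits small illustrations rather than data sets the size of D1 or D2.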
  • 19. Experiments (cont)
      • Final distribution of instances of data set D1:
          $recall(i, j) = \frac{n_{ij}}{n_i}$
      [Figure: two bar charts, "recall Metric for Robot Cluster" and "recall Metric for Human Cluster", each showing the recall of the Robot and Human classes.]
  • 20. Experiments (cont.)
      • SOM clustering algorithm:
          – Implemented with MATLAB’s Neural Network Toolbox.
              • A SOM comprising 100 neurons in a 10-by-10 hexagonal arrangement.
              • Training runs for 200 epochs.
  • 21. Experiments (cont)
          Evaluation metric    DBSCAN (Avg.)    SOM (Avg.)
          RI                   0.997            0.72
          Jaccard              0.964            0.265
          Entropy              0.0215           0.345
          Purity               0.977            0.86
      $RI = \frac{TP + TN}{TP + TN + FP + FN}$
      $Jaccard = \frac{TP}{TP + FP + FN}$
      $Entropy = \sum_{j=1}^{n_c} \frac{n_j}{n} e_j, \quad e_j = -\sum_{i=1}^{n_L} \frac{n_{ij}}{n_j} \log_2 \frac{n_{ij}}{n_j}$
      $Purity = \sum_{j=1}^{n_c} \frac{n_j}{n} p_j, \quad p_j = \max_i \frac{n_{ij}}{n_j}$
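The four evaluation metrics can be computed as in the sketch below. The function names and the layout of the contingency table (`table[i][j]` holds n_ij, the number of sessions of class i placed in cluster j) are assumptions.

```python
import math


def rand_index(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def jaccard(tp, fp, fn):
    return tp / (tp + fp + fn)


def entropy_and_purity(table):
    """table[i][j] = n_ij: sessions of class i assigned to cluster j."""
    n = sum(sum(row) for row in table)
    entropy = 0.0
    purity = 0.0
    for j in range(len(table[0])):
        column = [row[j] for row in table]
        n_j = sum(column)
        if n_j == 0:
            continue  # an empty cluster contributes nothing
        # e_j: entropy of the class distribution inside cluster j
        e_j = -sum((c / n_j) * math.log2(c / n_j) for c in column if c > 0)
        # p_j: share of the dominant class inside cluster j
        p_j = max(column) / n_j
        entropy += (n_j / n) * e_j
        purity += (n_j / n) * p_j
    return entropy, purity
```

A perfect clustering has entropy 0 and purity 1, which is why lower entropy and higher purity indicate better clusters.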
  • 22. Experiments (cont)
      Some examples of web robots that imitate human behavior:
          Robot name: BingPreview
              User-agent string: Mozilla/5.0(Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
          Robot name: Google Web Preview
              User-agent string: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)
          Robot name: Huawei Symantec Spider
              User-agent string: HuaweiSymantecSpider/1.0+DSE-support@huaweisymantec.com
  • 23. Conclusions and Remarks
      • According to the supervised evaluations, DBC_WRD achieves a Jaccard index of 96% and produces two clusters with an entropy of 0.0215 and a purity of 0.97.
      • From the standpoint of clustering quality and accuracy:
          – DBC_WRD performs better than the state-of-the-art algorithm.
      • Some popular non-malicious web robots imitate human behavior, which makes them difficult to identify.
  • 24. Mahdieh Zabihi M.Sc. Student (Computer Engineering), Imam Reza International University. m.zabihi@imamreza.ac.ir Majid Vafaei Jahan Department of Computer Engineering, Islamic Azad University. vafaeiJahan@mshdiau.ac.ir Javad Hamidzadeh Department of Computer Engineering, Faculty of Computer Engineering and Information Technology, Sadjad University of Technology. j_hamidzadeh@sadjad.ac.ir