Distinguishing humans from web robots is a central issue in computer network security, known as the robot detection problem. An accurate solution can protect web sites from intrusions by malicious robots and improve the performance of web servers by prioritizing human users. In this article, we propose a density-based method called DBC_WRD (Density Based Clustering for Web Robot Detection) to discover web robot traffic in two large real data sets. We treat the visitors as spatial instances and introduce two new features to describe and distinguish them. These attributes are based on the behavioral patterns of web visitors and remain invariant over time. To address one of the known disadvantages of DBSCAN, the density-based clustering algorithm used in this paper, we use only 4 features, keeping the dimensionality low. According to the supervised evaluations, DBC_WRD achieves a Jaccard score of 96% and produces two clusters with entropy and purity rates of 0.0215 and 0.97, respectively. Furthermore, the comparisons show that DBC_WRD outperforms state-of-the-art algorithms in terms of clustering quality and accuracy. Finally, we conclude that some non-malicious popular web robots are difficult to identify because they imitate human behavior.
1. A Density Based Clustering Approach
for Web Robot Detection
Mahdieh Zabihi,
M.Sc. Student (Computer Engineering),
Imam Reza International University.
Majid Vafaei Jahan,
Department of Computer Engineering, Islamic Azad University.
Javad Hamidzadeh,
Faculty of Computer Engineering and Information Technology,
Sadjad University of Technology.
Mashhad, Iran
29 October, 2014, Ferdowsi University of Mashhad
2. Web Robots
• Web robots: programs traveling the web
autonomously, starting from a ‘seed’ list of web pages
and recursively visiting accessible documents.
• Other names: Crawlers, Wanderers, Spiders,
Harvesters.
• Main goal: discover and retrieve content and
knowledge from Web-based systems and services.
• Examples: Search-engine crawlers, Shopping bots,
focused crawlers, profile-driven crawlers, email
harvesters.
ICCKE 2014 - M. Zabihi, M. Vafaei Jahan, J. Hamidzadeh
1/22
3. Web Robots (cont)
• Growing need for advanced information and
knowledge-retrieval tools on the Web.
o Remarkable increase in the number of crawlers.
• Occupying the network bandwidth and
reducing the performance of web servers.
o Growing need to distinguish robots from human
users.
4. Main obstacles for robot detection
• Negligence in web robot design.
• Failure to follow instructions on how to design
a robot.
• Changing a robot's characteristics in order to
imitate human behavior.
5. Web Robot Detection Problem
• Given a set of HTTP records R= {r1,r2,…,rn}; find the
set of sessions S= {s1,s2,…,sm}, such that:
o n: number of http records stored in the original access log file
o m: number of all sessions in S.
• Function f is the DBSCAN clustering algorithm.
S = { s_i | s_i ⊆ R, i = 1, …, m,  ⋃_{i=1}^{m} s_i = R }

f : S → {0, 1},  f(s_i) = 0 if s_i is human, 1 if s_i is robot
6. Related Works
Typical pipeline:
Access log file → pre-processing of the access log file
(Session Identification → Feature Extraction → Session Labeling)
→ split into a learning data set and a test data set → learning model.
Doran, D. and Gokhale, S. S. (2010). Web robot detection techniques: overview and limitations. Data
Mining and Knowledge Discovery, 22: 183-210.
7. Related Works (cont)
• Bayesian Network [1]
• Hidden Markov Model [2]
• Decision Tree [3]
• Fuzzy Inference System [4]
• Neural Network [5]
• …
The conventional methods used for
session labeling are unreliable.
1. Stassopoulou, A. and Dikaiakos, M. D. (2009). Web robot detection: a probabilistic reasoning approach.
Computer Networks, 53: 265-278.
2. Lu, W.-Z. and Yu, S.-Z. (2006). Web robot detection based on hidden Markov model. International
Conference on Communications, Circuits and Systems, 1806-1810.
3. Tan, P.-N. and Kumar, V. (2002). Discovery of web robot sessions based on their navigational patterns.
Data Mining and Knowledge Discovery, 6: 9-35.
4. Zabihi, M., Vafaei Jahan, M., and Hamidzadeh, J. (2014). Fuzzy intrusion detection of web robots in
computer networks. International Journal of Applied Mathematics in Engineering, Management
and Technology, 152-158.
5. Bomhardt, C., Gaul, W., and Schmidt-Thieme, L. (2005). Web robot detection - pre-processing web log
files for robot detection. New Developments in Classification and Data Analysis: 113-124.
8. Related Works (cont)
• SOM Clustering Algorithm as the
unsupervised learning technique.
• Using the labels of sessions only for the
supervised evaluation of their proposed
methods.
Stevanovic, D., Vlajic, N., and An, A. (2013). Detection of malicious and non-malicious website visitors
using unsupervised neural network learning. Applied Soft Computing, 13(1): 698-708.
9. The Proposed Method (DBC_WRD)
• DBSCAN clustering algorithm:
– Effective at clustering data sets of significantly more
than a few thousand objects.
– Finds complicated cluster shapes in a reasonable
amount of time.
– Requires only two input parameters and supports the
user in determining their values.
• Two new features to distinguish robots from human
users:
– Based on the navigational patterns of visitors and the
resources they request.
– Selected so that they do not change over time, keeping
the technique effective across server domains and in
the face of evolving robot traffic.
10. Pre Processing of Access Log
1. Session Identification:
for each record r in the access log do
    for each open session s do
        if (r.time - s.LastTime > thre) then
            close s;
        else if (containsIP(s, r.ip) or containsAgent(s, r.agent)) then
            add r to s.list;
        else
            create a new session s';
            add r to s'.list;
    end;
end;
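The loop above can be sketched in Python. This is a hypothetical implementation, not the paper's code: records are assumed to be dicts with `ip`, `agent`, and `time` keys, sessions are keyed by (ip, agent), and the 30-minute idle threshold is an assumed value for `thre`.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=30)  # assumed value for the idle threshold

def identify_sessions(records):
    """Group HTTP log records into sessions keyed by (ip, agent),
    closing a session once the idle gap exceeds THRESHOLD."""
    open_sessions = {}   # (ip, agent) -> list of records
    closed = []
    for r in sorted(records, key=lambda r: r["time"]):
        key = (r["ip"], r["agent"])
        s = open_sessions.get(key)
        # close the session if the visitor has been idle too long
        if s is not None and r["time"] - s[-1]["time"] > THRESHOLD:
            closed.append(s)
            s = None
        if s is None:        # create a new session for this visitor
            s = []
            open_sessions[key] = s
        s.append(r)
    closed.extend(open_sessions.values())
    return closed
```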
11. Pre Processing of Access Log
2. Feature Extraction:
– Trap file request
– Maximum rate of browser file request (new)
– Penalty (new)
– Percentage of 304 response codes
12. Pre Processing of Access Log
Trap File Requests:
– A binary feature that indicates whether a session
contains a request for trap files.
– Trap files: resources that should never be requested by
human users.
• There is no link from the web site to these files.
• Most users are unaware of them.
• ‘cmd.exe’, ‘robots.txt’, ‘sitemap.xml’ (new).
– ‘sitemap.xml’: a guideline for most search-engine web
robots; it lists all web pages and URL addresses of a
site that search-engine robots cannot easily discover
on their own.
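A minimal sketch of this feature, assuming each request is available as a URL path string (the trap-file names are the ones listed on the slide):

```python
TRAP_FILES = {"cmd.exe", "robots.txt", "sitemap.xml"}  # names from the slide

def contains_trap_file(requested_paths):
    """Binary feature: 1 if the session requested any trap file, else 0."""
    return int(any(p.rsplit("/", 1)[-1].lower() in TRAP_FILES
                   for p in requested_paths))
```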
13. Pre Processing of Access Log
Maximum rate of browser file request (new):
• Typing a URL address in the address bar or clicking on a link
to visit a new page:
1. The requested web page is analyzed by the browser.
2. A barrage of requests is automatically sent by the browser
to the server to fetch all files embedded in the page:
js, css, woff, eot, svg, ttf, jsp, asp, aspx, tpl, xsl, cfm, xml, swf,
flv, fla, f4v, sw, raw, amr, bwf,…
• In contrast to humans, web robots can freely decide which
resources are worth requesting.
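The slide does not give the exact formula for this feature, so the sketch below shows one plausible proxy only: the share of a session's requests that target the embedded file types a browser fetches automatically (a subset of the slide's list). The function name and formula are assumptions, not the paper's definition.

```python
# Embedded-resource extensions a browser requests automatically
# (subset of the slide's list; the exact feature definition is assumed).
BROWSER_TYPES = {"js", "css", "woff", "eot", "svg", "ttf", "xml", "swf"}

def browser_file_share(paths):
    """Fraction of a session's requests targeting browser-fetched files."""
    exts = (p.rsplit(".", 1)[-1].lower() for p in paths if "." in p)
    return sum(e in BROWSER_TYPES for e in exts) / max(len(paths), 1)
```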
14. Pre Processing of Access Log
Penalty:
• A numerical attribute based on the navigational patterns
of humans, which involve a large number of frequent
back-and-forward movements and loops, because:
1. A human's view is restricted by the link structure of
a site when searching for the required information,
2. Web browsers offer "back" and "forward" options in
their history,
3. Humans can become disoriented during their visits.
• In contrast, after the first crawl of a site, robots can detect
where the required information resides and restrict their
subsequent requests to specific areas of that site.
15. Pre Processing of Access Log
• Penalizing each back-and-forward navigation or
loop
[Figure: example site graph with pages a, b, c, d, e]
Example session: S = a, b, a, c, d, c, d, e, d, a → Penalty = 5
• Human users tend to show larger values for this
attribute than web robots.
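One plausible reading of the Penalty feature, which reproduces the slide's example, is to penalize every request for a page already visited in the session. This is an assumption; the paper's exact rule may differ.

```python
def penalty(session_pages):
    """Count requests for pages already visited in the session --
    an assumed reading of the Penalty feature that matches the
    slide's worked example."""
    seen, p = set(), 0
    for page in session_pages:
        if page in seen:
            p += 1          # back-and-forward move or loop
        else:
            seen.add(page)
    return p

# The slide's session S = a, b, a, c, d, c, d, e, d, a:
print(penalty(list("abacdcdeda")))  # -> 5
```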
16. Pre Processing of Access Log
Percentage of 304 response codes
• A numerical attribute calculated as the
percentage of responses with status code 304
in a session.
• A response sent with this code indicates that
the resource has not been modified since the
version specified by the request headers.
• Web browsers tend to produce a higher percentage
of 304 response codes than robots.
Bomhardt, C., Gaul, W., and Schmidt-Thieme, L. (2005). Web Robot detection pre-processing web log files
for Robot Detection. New Developments in Classification and Data Analysis: 113-124.
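The feature itself is a one-liner; the sketch below assumes a session's responses are available as a list of integer status codes.

```python
def pct_304(status_codes):
    """Percentage of a session's responses with HTTP status 304
    (Not Modified)."""
    if not status_codes:
        return 0.0
    return 100.0 * status_codes.count(304) / len(status_codes)
```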
17. Pre Processing of Access Log
3. Session Labeling:
The session labeling procedure:
DB_robots: set of web robots detected by WebLog Expert.
for each session s in S do
if containsTrapFile (s) then
s.label =1;
else if contains (DB_robots, s.ip) or contains (DB_robots, s.agent) then
s.label =1;
else
s.label =0;
end;
end;
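The labeling rule above can be sketched as follows; the session layout (a dict with `trap`, `ip`, and `agent` fields) is an assumption, and the robot IP/agent sets stand in for the database produced by WebLog Expert.

```python
def label_session(session, db_robot_ips, db_robot_agents):
    """Slide 17's rule: 1 = robot, 0 = human. `db_robot_*` stand in
    for the known-robot database (WebLog Expert in the paper)."""
    if session["trap"]:                      # requested a trap file
        return 1
    if session["ip"] in db_robot_ips or session["agent"] in db_robot_agents:
        return 1
    return 0
```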
18. Experiments
Specifications of the data sets:

  Data set        Imam Reza University   ArticleBaz
  Name            D1                     D2
  # of Requests   311633                 372304
  # of Sessions   17969                  22092
  # of Humans     16799                  18144
  # of Robots     1170                   3948
• The implementation of the DBSCAN algorithm provided in
WEKA data mining software:
o Epsilon=1, MinPts=6
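The clustering step can be sketched with scikit-learn's DBSCAN in place of WEKA's implementation. The parameter values follow the slide; the 4-feature session matrix here is random placeholder data, not the paper's features.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder: 50 sessions, each described by the 4 extracted features.
rng = np.random.default_rng(0)
X = rng.random((50, 4))

# Parameters from the slide: Epsilon=1, MinPts=6.
labels = DBSCAN(eps=1.0, min_samples=6).fit_predict(X)
print(labels)   # cluster id per session (-1 = noise)
```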
19. Experiments (cont)
• Final distribution of instances of data set D1:
recall(i, j) = n_ij / n_i
(n_ij: number of class-i instances assigned to cluster j; n_i: total number of class-i instances)

[Bar charts: recall metric for the Robot cluster and for the Human cluster]
20. Experiments (cont)
• SOM Clustering Algorithm:
– MATLAB’s Neural Network Toolbox
• A SOM comprising 100 neurons in a 10-by-10
hexagonal arrangement.
• Training is run for 200 epochs.
21. Experiments (cont)
  Evaluation metric   DBSCAN (avg.)   SOM (avg.)
  Rand Index (RI)     0.997           0.72
  Jaccard             0.964           0.265
  Entropy             0.0215          0.345
  Purity              0.977           0.86

RI = (TP + TN) / (TP + TN + FP + FN)
Jaccard = TP / (TP + FP + FN)
Entropy: e_j = -Σ_{i=1}^{L} (n_ij / n_j) log2(n_ij / n_j),  e = Σ_{j=1}^{c} (n_j / n) e_j
Purity:  p_j = max_i (n_ij / n_j),  p = Σ_{j=1}^{c} (n_j / n) p_j
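The four metrics above can be sketched directly from ground-truth class labels and predicted cluster ids; RI and Jaccard are computed by pair counting, entropy and purity from per-cluster class counts. This is an illustrative implementation, not the paper's evaluation code.

```python
import math

def pair_counts(truth, pred):
    """TP/TN/FP/FN over all pairs of instances (clustering pair counting)."""
    tp = tn = fp = fn = 0
    n = len(truth)
    for i in range(n):
        for j in range(i + 1, n):
            same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
            if same_t and same_p:
                tp += 1
            elif not same_t and not same_p:
                tn += 1
            elif same_p:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn

def rand_index(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def jaccard(tp, tn, fp, fn):
    return tp / (tp + fp + fn)

def entropy_purity(truth, pred):
    """Weighted cluster entropy e and purity p, as in the formulas above."""
    n = len(truth)
    clusters = {}
    for t, p in zip(truth, pred):
        clusters.setdefault(p, []).append(t)
    e = pu = 0.0
    for members in clusters.values():
        nj = len(members)
        counts = [members.count(c) for c in set(members)]
        e += (nj / n) * -sum((c / nj) * math.log2(c / nj) for c in counts)
        pu += (nj / n) * max(counts) / nj
    return e, pu
```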
22. Experiments (cont)
Some examples of web robots that imitate human behavior:

  Robot name             User-agent string
  BingPreview            Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
  Google Web Preview     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)
  HuaweiSymantecSpider   HuaweiSymantecSpider/1.0+DSE-support@huaweisymantec.com
23. Conclusions and Remarks
• According to the supervised evaluations,
DBC_WRD achieves a Jaccard score of 96% and
produces two clusters with entropy and purity
rates of 0.0215 and 0.97, respectively.
• From the standpoint of clustering quality and
accuracy:
– DBC_WRD performs better than state-of-the-art
algorithms.
• Some non-malicious popular web robots are
difficult to identify because they imitate
human behavior.
24. Mahdieh Zabihi
M.Sc. Student (Computer Engineering), Imam Reza International University.
m.zabihi@imamreza.ac.ir
Majid Vafaei Jahan
Department of Computer Engineering, Islamic Azad University.
vafaeiJahan@mshdiau.ac.ir
Javad Hamidzadeh
Department of Computer Engineering, Faculty of Computer Engineering and
Information Technology, Sadjad University of Technology.
j_hamidzadeh@sadjad.ac.ir