Despite the emergence of Web 2.0 applications and the growing need for well-behaved Web robots, malicious robots can expose Web sites to irreparable harm. Regardless of their behavior, Web robots may consume bandwidth and degrade the performance of Web servers. Although many studies have tried to characterize and classify Web visitors, little attention has been paid to feature selection, that is, to dynamically choosing the attributes used to describe Web sessions. This research therefore proposes a new algorithm based on Fuzzy Rough Set Theory (RST) to better characterize and cluster the visitors of three real Web sites. RST describes how a collection of data may be separated based on a decision boundary and an indiscernibility relation (Pawlak, 1982). The results of this research were reported in “A soft computing approach for benign and malicious web robot detection”, published in the journal Expert Systems with Applications (impact factor: 3.928).
The source code is also publicly available.
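The indiscernibility relation mentioned in the abstract can be illustrated with a small sketch. The code below uses classical (crisp) rough sets, not the fuzzy rough variant the proposed algorithm builds on, and it is not the authors' implementation. The session ids, the attribute names (`requests_robots_txt`, `image_ratio`), and the set labelled `robots` are hypothetical toy values chosen only for illustration.

```python
from collections import defaultdict


def indiscernibility_classes(objects, attributes):
    """Group object ids whose values agree on every chosen attribute."""
    classes = defaultdict(set)
    for obj_id, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())


def approximations(objects, attributes, target):
    """Return the (lower, upper) rough-set approximations of `target`."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(objects, attributes):
        if block <= target:   # every member of the class is in the target
            lower |= block
        if block & target:    # at least one member of the class is in the target
            upper |= block
    return lower, upper


# Hypothetical toy sessions described by two discretized attributes.
sessions = {
    "s1": {"requests_robots_txt": "yes", "image_ratio": "low"},
    "s2": {"requests_robots_txt": "yes", "image_ratio": "low"},
    "s3": {"requests_robots_txt": "no", "image_ratio": "high"},
    "s4": {"requests_robots_txt": "no", "image_ratio": "low"},
    "s5": {"requests_robots_txt": "no", "image_ratio": "low"},
}
robots = {"s1", "s2", "s4"}  # assumed set of robot sessions

lower, upper = approximations(
    sessions, ["requests_robots_txt", "image_ratio"], robots
)
boundary = upper - lower  # sessions the attributes cannot decide on
print(sorted(lower), sorted(upper), sorted(boundary))
```

In this toy data, sessions s4 and s5 share the same attribute values but only s4 is in the robot set, so both fall into the boundary region. A non-empty boundary of this kind is a sign that the chosen attributes do not fully separate the two groups, which is the sort of signal a rough-set-based feature selection step can exploit.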
3. Introduction
■ What are web robots (crawlers)?
– Web robots (also called Web crawlers) are employed by Web technologies to collect and
scrutinize the dynamic content that repositories contain. Perhaps surprisingly, robot agents
are likely the dominant type of agent on the Web today.
■ What types of web robots are there?
– Benign (well-behaved) robots, which power the Web search engines society relies on to find
information on the Web. They are also core to the “Internet of Things” concept.
– Malicious robots, which pose a threat to the performance, privacy of information, and security
of a Web server. They behave aggressively, with specialized functions to harvest e-mail
addresses, perform click fraud, and access information behind ‘paywalls’ or login screens
where only authorized human users should be allowed access.
■ Why should web robots be distinguished?
– To control the traffic of web servers and to protect their security.
4. Related work
■ Using Classification
– Drawback: the labels cannot be trusted
■ Using Clustering
– No feature selection (using static features) [1]
– Feature selection with regard to the distribution of data (t-test) [2]
1. Stevanovic, D., Vlajic, N., & An, A. (2013). Detection of malicious and non-malicious website visitors using unsupervised neural
network learning. Applied Soft Computing, 13(1), 698–708.
2. Zabihi, M., Jahan, M. V., & Hamidzadeh, J. (2014a). A density based clustering approach for web robot detection. In Proceedings of
2014 4th international e-conference on computer and knowledge engineering (pp. 23–28). IEEE.
9. Conclusion
■ Through a soft computing approach to feature selection,
SMART achieves accurate detection of benign and malicious
web robot traffic with features tailored to each specific web
server. An experimental evaluation compared SMART against
other leading soft computing algorithms for this task, using
data from live web servers in different domains of the
internet, and demonstrated the variety of features
considered.