Research Project - Master's in Data Analytics
Statistical and machine learning techniques learned as part of the Data Analytics coursework are applied in this thesis project to solve the problem of malicious web page detection.
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...Edward Curry
Digital transformation is driving a new wave of large-scale datafication in every aspect of our world. Today our society creates data ecosystems where data moves among actors within complex information supply chains that can form around an organization, community, sector, or smart environment. These ecosystems of data can be exploited to transform our world and present new challenges and opportunities in the design of intelligent systems. This talk presents my recent work on using the dataspace paradigm as a best-effort approach to data management within data ecosystems. The talk explores the theoretical foundations and principles of dataspaces and details a set of specialized best-effort techniques and models to enable loose administrative proximity and semantic integration of heterogeneous data sources. Finally, I share my perspectives on future dataspace research challenges, including multimedia data, data governance and the role of dataspaces to enable large-scale data sharing within Europe to power data-driven AI.
Quality, Relevance and Importance in Information Retrieval with Fuzzy Semanti...tmra
We propose a framework for ranking information based on quality, relevance and importance, and argue that a socio-semantic contextual approach that extends topicality can lead to increased value of information retrieval systems. We use Topic Maps to implement our framework, and discuss procedures for calculating the resource ranking. A fuzzy neural network approach is envisioned to complement the process of manual metadata creation.
Scratchpads: the Virtual Research Environment for biodiversity dataVince Smith
Rycroft, S., Roberts, D., Smith, V., Heaton, A., Bouton, K., Livermore, L., Koureas, D., Baker, E. 2013. Scratchpads: the Virtual Research Environment for biodiversity data. TDWG, Biodiversity Information Standards. Grand Hotel Mediterraneo Florence, Italy, 27 Oct - 1 Nov., 2013.
GENI Engineering Conference -- Ian FosterIan Foster
I was invited to talk at the 18th GENI Engineering Conference (http://groups.geni.net/geni/wiki/GEC18Agenda) on experiences in the Grid community with creating and operating large shared infrastructures. I chose to focus on our experiences using Software as a Service (SaaS: aka Cloud) to reduce barriers to the use of the capabilities required to create and operate virtual organizations.
Extending Memory on the Web via Human-Centric Knowledge Exchange Network. Presented at W3C Workshop on Social Standards: The Future of Business, 7-8 August 2013, San Francisco, USA
This presentation is intended to give some brief advice for those publishing digital content (digital images, cultural heritage, scholarly information, etc.) on the Internet, and in particular on how to ensure good visibility via Google and other portals.
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Making sure your content is licensed and discoverable
A presentation from the JISC Programme Meeting for its Content Programme for 2011 http://www.jisc.ac.uk/whatwedo/programmes/digitisation/econtent11.aspx
American Art Collaborative Linked Open Data presentation to "The Networked Cu...American Art Collaborative
An August 2017 presentation by Eleanor Fink to "The Networked Curator: Association of Art Museum Curators Foundation Digital Literacy Workshop for Art Curators"
A Review on Pattern Discovery Techniques of Web Usage MiningIJERA Editor
In recent years, with the development of Internet technology, the growth of the World Wide Web has exceeded all expectations. A great deal of information is available in different formats, and retrieving interesting content has become a very difficult task. One possible approach to this problem is Web Usage Mining (WUM), an important application of Web Mining. Extracting the hidden knowledge in the log files of a web server, recognizing the various interests of web users, and discovering customer behaviour while at a site are commonly cited applications of web usage mining. In this paper we provide an updated, focused survey of web usage mining techniques.
"From Big Data to Smart data"
Jie (Jack) Yang, Associate Research Fellow, SMART Infrastructure Facility, presented a summary of his research as part of the SMART Seminar Series on 28 April 2016.
For more information, visit the event page at: http://smart.uow.edu.au/events/UOW212890.html.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.vivatechijri
In this technological age there are many ways an attacker can gain illegitimate access to people's sensitive information. One of these is phishing: the activity of misleading people into entering their sensitive information on fraudulent websites that look like the real ones. The phisher's aim is to steal personal information, bank details, etc. Day by day it is getting riskier to enter personal information on websites, for fear that the site is a phishing attack that can steal sensitive information. That is why phishing website detection is necessary, to alert the user and block the website. Automated detection of phishing attacks is needed, and machine learning is one of the most efficient techniques for it, as it removes the drawbacks of existing approaches. An efficient machine learning model combined with a content-based approach proves very effective at detecting phishing websites.
Our proposed system uses a hybrid approach that combines a machine-learning-based method and a content-based method. URL-based features are extracted and passed to the machine learning model, while in the content-based approach the TF-IDF algorithm detects a phishing website using the top keywords of a web page. This hybrid approach achieves a highly efficient result. Finally, our system notifies and alerts the user as to whether the website is phishing or legitimate.
Integrated Web Recommendation Model with Improved Weighted Association Rule M...ijdkp
The World Wide Web plays a significant role in human life, and it requires continual technological improvement to satisfy user needs. Web log data is essential for improving the performance of the web; it is large, heterogeneous and diverse. Analyzing web log data is a tedious process for web developers, web designers, technologists and end users. In this work, a new weighted association mining algorithm is developed to identify the best association rules, useful for web site restructuring and recommendation, that reduce false visits and improve users' navigation behaviour. The algorithm finds the frequent item sets in a large uncertain database. Repeated scanning of the database is the problem with existing algorithms, leading to complex output sets and a time-consuming process. The proposed algorithm scans the database only once, at the beginning of the process, and the generated frequent item sets are stored in the database. Evaluation parameters such as support, confidence, lift and number of rules are used to compare the performance of the proposed algorithm and a traditional association mining algorithm. The new algorithm produced the best results, helping developers restructure their websites to meet the requirements of end users within a short time span.
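The evaluation parameters named above (support, confidence, lift) can be sketched directly over a toy set of page-visit sessions; the sessions and the candidate rule below are illustrative, not the paper's data.

```python
# Rule metrics for association mining over toy navigation sessions.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]
n = len(sessions)

def support(itemset):
    # Fraction of sessions containing every page in the itemset.
    return sum(itemset <= s for s in sessions) / n

def confidence(antecedent, consequent):
    # P(consequent | antecedent) estimated from the sessions.
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # >1 means the antecedent makes the consequent more likely.
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"products"}, {"cart"})
print(support(rule[0] | rule[1]))  # 0.5
print(confidence(*rule))           # ≈ 0.667
print(lift(*rule))                 # ≈ 1.333
```

A weighted variant would multiply each session's contribution by a page weight (e.g. navigation order or dwell time) instead of counting sessions equally.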
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM csandit
With the increasing growth of the Internet and the World Wide Web, information retrieval (IR) has attracted much attention in recent years. Quick, accurate and high-quality information mining is the core concern of successful search companies. Likewise, spammers try to manipulate IR systems to fulfil their stealthy needs. Spamdexing (also known as web spamming) is one of the spamming techniques of adversarial IR, allowing users to manipulate the ranking of specific documents in the search engine result page (SERP). Spammers take advantage of different features of the web indexing system for notorious motives. Suitable machine learning approaches can be useful in the analysis of spam patterns and the automated detection of spam. This paper examines content-based features of web documents and discusses the potential of feature selection (FS) in upcoming studies to combat web spam. The objective of feature selection is to select the salient features that improve prediction performance and help understand the underlying data-generation techniques. A publicly available web data set, WEBSPAM-UK2007, is used for all evaluations.
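The feature-selection idea described above can be sketched as ranking candidate content features by a univariate score and keeping the top k; the data below is synthetic, not the WEBSPAM-UK2007 set.

```python
# Univariate feature selection sketch: keep the k most salient features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 6))  # 6 candidate content features
# Make feature 2 informative: the label is derived from it.
y = (X[:, 2] > 5).astype(int)

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the 2 selected features
```

With real spam data the scores would come from held-out evaluation rather than a synthetic label, but the mechanics are the same: score each feature, keep the salient subset, retrain.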
A NEW IMPROVED WEIGHTED ASSOCIATION RULE MINING WITH DYNAMIC PROGRAMMING APPR...cscpconf
With the rapid development of the Internet, web search has taken an important role in our ordinary life. In web search, mining frequent patterns in large databases is a major research area. With the increase of user activity on the web, searching methods that predict the next page a user will visit play a major role. Web searching methods help provide quality results and timely answers, and also offer customized navigation. In web search, association rule mining is an important data analysis method for discovering associated web pages. Most researchers have implemented association mining using the Apriori algorithm with a binary representation; the problem with this approach is that it does not address issues such as the navigation order of web pages. To overcome this, researchers proposed a weighted Apriori that maintains navigation order but is unable to produce optimal results. With the goal of a more favourable result, we propose a novel approach that combines weighted Apriori and dynamic programming. Experimental results show that this approach maintains the navigation order of web pages and achieves a better solution. The proposed technique enhances web site effectiveness, increases users' browsing knowledge, improves prediction accuracy and decreases computational complexity.
WEB ATTACK PREDICTION USING STEPWISE CONDITIONAL PARAMETER TUNING IN MACHINE ...IJCNCJournal
There is rapid growth in internet and website usage. A wide variety of devices are used to access websites, such as mobile phones, tablets, laptops, and personal computers. Attackers are finding more and more vulnerabilities on websites that they can exploit for malicious purposes. A web application attack occurs when cyber criminals gain access to unauthorized areas. Typically, attackers look for vulnerabilities in web applications at the application layer; SQL injection and cross-site scripting attacks are used to access web applications and obtain sensitive data. A key objective of this work is to develop new features and investigate how automatic tuning of machine learning techniques can improve the performance of web attack detection, using the HTTP CSIC datasets to detect and block attacks. The proposed model applies stepwise conditional parameter tuning to machine learning algorithms: a dynamic, automated way of choosing and tuning parameters based on the better outcome. This work also compares the performance of the proposed model on two datasets.
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSAM Publications
In the current era, millions of clients access the Internet and the World Wide Web (WWW) daily to search for information and meet their needs. Web mining is a technique for automatically discovering and extracting information from the WWW. Websites are a common stage for exchanging information between users. Web mining is an application of data mining techniques for extracting information from web data; its areas are web content mining, web usage mining and web structure mining, all of which focus on knowledge discovery from the web. Web content mining involves techniques for summarization, classification and clustering, and the process of extracting or discovering useful information from web pages, including images, audio, video and metadata. Web usage mining is the process of extracting information from web server logs. Web structure mining uses graph theory to analyse the node and connection structure of a website, and deals with the hyperlink structure of the web. Web mining is a part of data mining that relates to various research communities such as information retrieval, database management systems and artificial intelligence.
Use of hog descriptors in phishing detectionSelman Bozkır
In this paper we dive into the details of an anti-phishing detection system that employs HOG features.
* The presentation includes a voice recording
Similar to Classifying malicious websites using an ensemble weighted features (20)
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computation and can thus also reduce iteration time. Road networks often have chains that can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
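For reference against these optimizations, the unoptimized baseline they improve on is plain power-iteration PageRank. The tiny graph and the parameters below (damping 0.85, L1 tolerance 1e-10) are illustrative.

```python
# Baseline power-iteration PageRank over an adjacency-list graph.
def pagerank(graph, damping=0.85, tol=1e-10):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    while True:
        new = {v: (1 - damping) / n for v in nodes}
        for u, outs in graph.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c (it receives the most links)
```

Every optimization listed above (convergence skipping, chain short-circuiting, component-wise topological processing) changes which vertices this loop touches per iteration, not the fixed point it converges to.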
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms like PageRank often operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph has no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Classifying malicious websites using an ensemble weighted features
1. Detecting Malicious Web Pages Using An Ensemble Weighted Average Model
- Research Project Presentation
Dharmendra Lalji Vishwakarma
X18108181
MSc in Data Analytics – Cohort A
September 2018-19
2. Area of Study & Motivation
Increase in internet users
- Popularity of cyber crimes
- Websites as a medium of attack
Cyber-criminal activities such as ransomware, botnets, information stealing, DDoS, etc.
- Lead to loss of information privacy
- Losses to businesses
3. Present Solutions
1. Education & legislation
2. Hand-crafted techniques
1. Static technique – black-listing & white-listing approach
2. Dynamic technique – useful for creating blacklists
3. Intelligent machine learning models – using features present in the malicious webpage
1. Recent case study – keyword-density approach (Altay et al., 2018)
4. Research Question
How can a weighted average ensemble of the feature sets of keyword density, URL features and JavaScript code offer substantial improvements over a keyword-density predictor in identifying malicious web pages?
5. Research Objectives
• Analysing important attributes, such as URL length, of URL characteristics in distinguishing the malicious class.
• Reproducing the keyword-density method of classifying webpages; it acts as a baseline model for an improved version of classification on a similar dataset.
• Experimenting with each independent feature against the outcome to see its contribution to the prediction.
• Dynamically calculating the weights for each feature set for classification using an ensemble weighted approach.
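The ensemble weighted approach in the last objective can be sketched as follows: each feature-set model outputs a probability, and the weights combine them into one score. All numbers below are illustrative stand-ins for the real keyword-density, URL and JavaScript models; the choice of validation accuracy as the weight source is an assumption.

```python
# Weighted average ensemble over per-feature-set model probabilities.
def weighted_average_ensemble(probabilities, weights):
    total = sum(weights)
    score = sum(p * w for p, w in zip(probabilities, weights)) / total
    return score, int(score >= 0.5)  # 1 = malicious, 0 = benign

# Probability that a page is malicious, per feature-set model (toy values):
p_keyword, p_url, p_js = 0.40, 0.90, 0.70
# Weights, e.g. derived from each model's validation accuracy (assumed):
weights = [0.80, 0.92, 0.85]

score, label = weighted_average_ensemble([p_keyword, p_url, p_js], weights)
print(round(score, 3), label)  # 0.678 1
```

Note how the keyword model alone would have voted benign (0.40 < 0.5), but the URL and JavaScript models outvote it, which is exactly the improvement over the keyword-density baseline the research question targets.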
6. Literature Review
• Detection of malicious websites using URL features
• (Chakraborty and Lin, 2017) and (Kim et al., 2018)
• Malicious websites detection using JavaScript codes
• (Liu et al., 2018) and (Stokes et al., 2018)
• Using machine learning with a content-based approach
• (Altay et al., 2018) and (Saxe et al., 2018)
• Using Hybrid features approach
• (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015)
• Review of Ensemble learning
• (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
11. Features Extraction - HTML
• Sklearn pipeline – TF-IDF Vectoriser module
• Takes care of text processing such as tokenisation, stop-word removal, stemming & n-grams.
19. Discussion
• URL-based models proved to be the best classifiers.
• Dataset differences (2019)
• Data extraction differences (tools, legal policies & techniques)
20. Future Work
• Browser plugins
• More features can be added, such as DNS and server relations.
• Combination of static & dynamic techniques.
• Predicting broader categories of classes, e.g. threat types.
21. References
• Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, Soft Computing.
• Chakraborty, G. and Lin, T. T. (2017). A URL address aware classification of malicious websites for online security during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1-6.
• Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131.
• Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A Markov detection tree-based centralized scheme to automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038.
• Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using DNS records and domain name features, Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7.
• Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020.
• Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious web pages through analysis of underlying DNS and web server relationships, 2008 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935-941.
• Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with JavaScript and VBScript, CoRR abs/1805.05603.
• Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining, Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
Hello Everyone! My name is Dharmendra Vishwakarma. This is a presentation of the Research Project for Master’s in Data Analytics course. The research topic is on “Detecting malicious web pages using an ensemble weighted average model”.
The area of my study is a mix of the cyber security and data analytics domains.
1. With advancements in communication technologies and the ever-growing internet, most services are online nowadays, such as e-banking, social networking, e-commerce and entertainment. Due to the easy availability of services and information, users tend to browse the internet freely without knowing its negative side. These services are exploited by cyber attackers to steal useful and private user-sensitive information.
2. The cyber attackers use websites as a medium to redirect users to their malicious network for further attacks, or use drive-by-download software to install malware locally on the user's computer. This enables attackers to perform other cyber-criminal activities such as ransomware, botnets, information stealing and DDoS. These lead to loss of information privacy and, in many cases, losses to the businesses.
Three main categories of solutions exist to address this problem.
Firstly, users are taught prevention techniques through education, and government
legislation is used to discourage such activities. However, due to the busy nature of business, people often still make mistakes in real-world scenarios.
The second approach consists of hand-crafted computerised techniques to prevent phishing activities. It usually involves static techniques such as blacklisting and whitelisting.
A dynamic approach is also used, wherein web pages are observed in a virtual sandbox environment to detect deceptive behaviour. However, this method is not suitable for real-time detection and is mainly employed to build blacklists of URLs.
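As an illustration, the static blacklist approach mentioned above amounts to a simple set lookup against known-bad hosts. A minimal sketch (the blacklist contents here are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical blacklist contents, for illustration only.
BLACKLIST = {"evil.example.com", "phish.example.net"}

def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host appears in the static blacklist."""
    host = urlparse(url).netloc.lower()
    return host in BLACKLIST
```

The weakness noted in the talk follows directly from this sketch: any host not yet on the list passes the check, which is why static lists lag behind newly created malicious pages.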
Lastly, intelligent machine learning models are used to solve this problem using features present in the website. A recent study using a keyword-density-based approach for detecting malicious websites has shown significant accuracy. However, the content present on the page alone cannot be a decisive factor in determining
the deceptive nature of a website, given the varying nature of attacks.
So, the research question for my proposal is “”.
And the specific objectives of this research are “”.
This research proposal considers several other important factors alongside the content-based approach. These factors
are URL-based features, DNS information, server details, and the JavaScript code present on the page. These factors contribute to the final decision, as the URL alone cannot
efficiently detect the phishing behaviour of a website.
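A few of the URL-based features mentioned above could be extracted with a sketch like this. The feature names are illustrative, not the exact feature set used in the project:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Extract a few illustrative URL-based features: length, digit count,
    dot count, '@' presence, and whether the host is a raw IP address."""
    host = urlparse(url).netloc
    return {
        "url_length": len(url),
        "num_digits": sum(c.isdigit() for c in url),
        "num_dots": url.count("."),
        "has_at_symbol": "@" in url,
        # Raw-IP hosts are a common phishing indicator.
        "host_is_ip": host.replace(".", "").isdigit(),
    }
```

Analogous extractors would be written for the DNS, server, JavaScript, and keyword-density feature groups, each feeding its own model.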
The main contribution of this research will be the use of ensemble learning to combine the outputs of the individual models into a final classification decision.
The literature review suggests the following trends.
Many authors have considered different features of malicious websites, such as the URL, DNS, JavaScript, and page contents.
All of this previous research addressed different aspects of malicious threats. However, there is a need for a hybrid solution that can detect malicious content even if one feature set fails to detect it. For instance, web threats can appear in many forms within a page, such as XSS, phishing, or a DDoS attack. The idea is to weight each model's impact on the final decision.
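The weighted-impact idea above can be sketched as a weighted average of the individual models' predicted probabilities. The probabilities and weights below are illustrative; in the project the weights would be determined during training:

```python
def weighted_ensemble(probs, weights, threshold=0.5):
    """Combine per-model malicious-class probabilities via a normalised
    weighted average; classify as malicious if the score exceeds threshold."""
    total = sum(weights)
    score = sum(p * w for p, w in zip(probs, weights)) / total
    return score, score > threshold
```

For example, with three models (say URL, content, and JavaScript based) predicting 0.9, 0.2, and 0.8 under weights 2, 1, 1, the combined score is 0.7, so the page is flagged as malicious even though one feature set missed it.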
The research methodology is based on CRISP-DM, a well-established methodology for data mining projects.
Each task of the research is therefore divided into the six CRISP-DM phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.
The dataset for this research is as follows:
100 thousand benign URLs will be extracted from Alexa, and 20 thousand malicious URLs will be downloaded from PhishTank.
Both datasets have been used previously in the literature. Since a comparison will be made against a baseline model, the same dataset is used.
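Assembling the labelled dataset from the two sources could be sketched as follows; the input lists stand in for the downloaded Alexa and PhishTank URL files:

```python
def build_dataset(benign_urls, malicious_urls):
    """Label benign (Alexa) URLs as 0 and malicious (PhishTank) URLs as 1.
    Note: with 100k benign vs 20k malicious URLs the classes are imbalanced
    roughly 5:1, which matters when choosing evaluation metrics later."""
    return [(u, 0) for u in benign_urls] + [(u, 1) for u in malicious_urls]
```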
Box plots were used for outlier detection:
- URL length shows outliers, which were explored further by class.
- The data is not normally distributed; most attributes are right-skewed.
- Correlated attributes were detected; for example, cookies_ref_count is related to setinterval time.
- The remaining attributes appear fine and equally important for model building.
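The box-plot outlier check above corresponds to the standard 1.5 × IQR rule. A minimal sketch, with quantile interpolation simplified relative to what a plotting library would do:

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the same rule
    a box plot uses to draw points beyond the whiskers."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        idx = q * (len(xs) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(xs) - 1)
        frac = idx - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo_fence or v > hi_fence]
```

Applied to a right-skewed attribute such as URL length, this flags the long tail of unusually long URLs, which is what prompted the per-class exploration noted above.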
The implementation is as follows.
Web pages from the dataset are extracted and stored along with their URLs. Features related to keyword density, the URL, JavaScript code, and DNS server relationships are extracted during the feature extraction process. These features, together with the class variable, are supplied to the individual machine learning models. Their outputs are given as input to the weighted ensemble model. In this way, dynamic weights are determined and a trained model is generated. The entire process is split into training and prediction. During prediction, unseen web pages are evaluated on the predictive model. Evaluation is conducted using precision, recall, F1-score, area under the ROC curve, and 10-fold cross-validation. Furthermore, a statistical test is carried out to check the significance of the model.
The ensemble technique has the lowest error among the individual models.
These are the references used in the presentation.