Detecting Malicious Web Pages Using an Ensemble Weighted Average Model
- Research Project Presentation
Dharmendra Lalji Vishwakarma
X18108181
MSc in Data Analytics, Cohort A
September 2018-19
Area of Study & Motivation
Increase in internet users
- Growing prevalence of cyber crime
- Websites as a medium of attack
Cyber-criminal activities such as ransomware, botnets, information stealing and DDoS
- Lead to loss of information privacy
- Cause losses to businesses
Present Solutions
1. Education & Legislation
2. Hand-Crafted Techniques
   1. Static technique: blacklisting & whitelisting approach
   2. Dynamic technique: useful for creating blacklists
3. Intelligent machine learning models: use features present in the malicious webpage
   1. Recent case study: keyword-density approach (Altay et al., 2018)
Research Question
How can a weighted average ensemble over the keyword-density, URL and JavaScript code
feature sets offer substantial improvements over a keyword-density predictor in
identifying malicious web pages?
Research Objectives
• Analysing important URL attributes, such as URL length, for distinguishing the
malicious class.
• Reproducing the keyword-density method of classifying webpages. It acts as the
baseline against which the improved classifier is compared on a similar dataset.
• Experimenting with each feature set independently against the outcome to see
its contribution to the prediction.
• Dynamically calculating the weight of each feature set for classification
using a weighted ensemble approach.
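The last objective can be read as a search over candidate weights. The sketch below is one possible illustration, not the project's actual procedure: it assumes each of the three models (URL, JS, KW) outputs a malicious-class probability, and it tries every integer weight triple from a small hypothetical grid, keeping the one with the best validation accuracy. All function names and the validation data are made up for the example.

```python
from itertools import product


def weighted_average(probs, weights):
    """Weighted mean of per-model malicious-class probabilities."""
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)


def accuracy(rows, labels, weights, threshold=0.5):
    """Share of pages whose thresholded ensemble score matches the label."""
    hits = sum(
        (weighted_average(probs, weights) >= threshold) == bool(label)
        for probs, label in zip(rows, labels)
    )
    return hits / len(labels)


def search_weights(rows, labels, candidates=(1, 2, 3)):
    """Try every integer weight triple and keep the most accurate one."""
    return max(
        product(candidates, repeat=3),
        key=lambda w: accuracy(rows, labels, w),
    )


# Hypothetical validation data: (URL, JS, KW) probabilities per page.
val_probs = [
    (0.9, 0.8, 0.4),
    (0.2, 0.1, 0.6),
    (0.7, 0.3, 0.3),
    (0.4, 0.2, 0.7),
]
val_labels = [1, 0, 1, 0]  # 1 = malicious, 0 = benign

best = search_weights(val_probs, val_labels)
print(best, accuracy(val_probs, val_labels, best))
```

An exhaustive grid is only practical for a handful of models and weight values; with more feature sets a randomised or gradient-based search would be a more natural choice.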
Literature Review
• Detection of malicious websites using URL features
• (Chakraborty and Lin, 2017) and (Kim et al., 2018)
• Malicious website detection using JavaScript code
• (Liu et al., 2018) and (Stokes et al., 2018)
• Using machine learning with a content-based approach
• (Altay et al., 2018) and (Saxe et al., 2018)
• Using a hybrid-features approach
• (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015)
• Review of Ensemble learning
• (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
Research Methodology
• CRISP-DM methodology (Wirth, 2000)
Data Set Description
• Sources:
• Alexa – Benign Websites
• PhishTank – Malicious Websites
Features Extraction - URL
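As a rough illustration of URL feature extraction, the sketch below derives a few lexical features from a URL string with Python's standard library. Only URL length is named in the slides; the other features and the function name url_features are assumptions chosen for the example, not the feature set used in the project.

```python
from urllib.parse import urlparse


def url_features(url):
    """Return a few simple lexical features for one URL (illustrative set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),            # the attribute named in the slides
        "host_length": len(host),          # assumed extra feature
        "num_dots": host.count("."),       # assumed extra feature
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


features = url_features("http://example.com/login?id=42")
print(features)
```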
Features Extraction - JavaScript
Features Extraction - HTML
• Sklearn pipeline with the TF-IDF Vectoriser module (TfidfVectorizer)
• Takes care of text processing such as tokenisation, stop-word removal,
stemming (through a custom tokeniser) & n-grams.
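To make the text-processing steps concrete, here is a small pure-Python sketch of what a TF-IDF vectoriser computes: tokenisation, stop-word removal, unigram and bigram generation, smoothed IDF weighting and L2 normalisation. It is a simplified stand-in written for illustration, not scikit-learn's implementation; the stop-word list and sample documents are made up, and stemming is left out.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "and", "of", "your"}  # tiny illustrative list


def tokenize(text, ngram_range=(1, 2)):
    """Lowercase, split into words, drop stop words, build n-grams."""
    words = [
        w for w in re.findall(r"[a-z0-9]+", text.lower())
        if w not in STOP_WORDS
    ]
    lo, hi = ngram_range
    return [
        " ".join(words[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(words) - n + 1)
    ]


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors


docs = ["verify your account to login", "read the latest news"]
vectors = tfidf(docs)
print(sorted(vectors[0]))
```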
Final Generated Data Set
After data cleansing: duplicates & null values removed
EDA-1
EDA-2
EDA-3
Implementation
Results
Final Ensemble Results
Optimal weights (URL, JS, KW): (2, 3, 2)
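Assuming the ensemble averages the malicious-class probabilities of the three models (the slides do not say whether probabilities or hard votes are combined), the reported weights would enter the score as follows. The symbols and the numeric probabilities are illustrative, not results from the study.

```latex
\hat{p} = \frac{2\,p_{\mathrm{URL}} + 3\,p_{\mathrm{JS}} + 2\,p_{\mathrm{KW}}}{2 + 3 + 2}
% Hypothetical page with p_URL = 0.9, p_JS = 0.6, p_KW = 0.4:
\hat{p} = \frac{2(0.9) + 3(0.6) + 2(0.4)}{7} = \frac{4.4}{7} \approx 0.63
```

With a decision threshold of 0.5, such a page would be labelled malicious.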
Comparison Results
McNemar’s test on the contingency table
- Showed a statistically significant difference between the developed models.
- α = 0.05, p < 0.05
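For readers unfamiliar with the test, the sketch below computes McNemar's chi-square statistic with continuity correction from the two discordant cells of a 2x2 contingency table, using only the standard library. The counts are hypothetical and are not the study's results; the function name mcnemar is also just a label for this example.

```python
import math


def mcnemar(b, c):
    """McNemar's chi-square test with continuity correction.

    b: pages model A classified correctly and model B incorrectly
    c: pages model A classified incorrectly and model B correctly
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical discordant counts, not taken from the presentation.
stat, p = mcnemar(b=15, c=40)
significant = p < 0.05  # compare against alpha = 0.05
print(round(stat, 3), significant)
```

When the discordant counts are small, an exact binomial version of the test is usually preferred over the chi-square approximation.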
Discussion
• URL-based models proved to be the best classifiers.
• Dataset difference (2019)
• Data extraction differences (Tools, Legal policies & Techniques)
Future Work
• Browser plugins
• More features can be added, such as DNS and server relationships.
• Combination of static & dynamic techniques.
• Predicting broader categories of classes, e.g. threat types.
References
• Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, Soft Computing.
• Chakraborty, G. and Lin, T. T. (2017). A URL address aware classification of malicious websites for online security during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1-6.
• Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131.
• Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A Markov detection tree-based centralized scheme to automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038.
• Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using DNS records and domain name features, Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7.
• Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020.
• Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious web pages through analysis of underlying DNS and web server relationships, 2008 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935-941.
• Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with JavaScript and VBScript, CoRR abs/1805.05603.
• Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining, Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
Thank You!

Classifying malicious websites using an ensemble weighted features

  • 1. Detecting MaliciousWeb Pages Using An EnsembleWeighted Average Model - Research Project Presentation Dharmendra Lalji Vishwakarma X18108181 MSc in DataAnalytics – CohortA September 2018-19
  • 2. Area of Study & Motivation Increase in internet Users - Popularity of Cyber Crimes - Websites as a medium of attack Cyber-criminal activities such as ransomware, botnet, information stealing, and DDOS etc. - Leads to loss of Information privacy - Loss to the businesses 1 2
  • 3. Present Solutions – 1. Education & Legislation 2. Hand Crafted Techniques 1. Static Technique - Black-listing & White-listing Approach. 2. Dynamic Technique – Useful for creating blacklists 3. Intelligent Machine learning models – Using features present in the malicious webpage. 1. Recent case study – Keyword-density approach (Altay et al., 2018) 3
  • 4. Research Question How can weighted average ensemble of features set of keyword-density, URL features and JavaScript Code offer substantial improvements to keyword- density predictor in identifying malicious web pages?
  • 5. ResearchObjectives • Analysing the important attributes such as URL length for URL characteristics in distinguishing malicious class. • Reproducing the keyword-density methods of classifying webpages. It acts as a baseline model over an improved version of classification for the similar dataset. • Experimenting with each independent feature against the outcome to see their contribution in the prediction. • Dynamically calculating the weights for each feature set for classification using an ensemble weighted approach.
  • 6. Literature Review • Detection of malicious websites using URL features • (Chakraborty and Lin, 2017) and (Kim et al., 2018) • Malicious websites detection using JavaScript codes • (Liu et al., 2018) and (Stokes et al., 2018) • Using machine learning with a content-based approach • (Altay et al., 2018) and (Saxe et al., 2018) • Using Hybrid features approach • (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015) • Review of Ensemble learning • (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
  • 7. Research Methodology • CRISP-DM Methodology (Wirth, 2000)
  • 8. Data Set Description • Sources: • Alexa – Benign Websites • PhishTank – Malicious Websites
  • 11. Features Extraction - HTML • Sklearn pipeline – TF-IDFVectoriser module • Takes care ofText processing such as tokenisation, stop word removal, stemming & n-grams.
  • 12. Final generated Data Set After Data Cleansing – Duplicates & Null values
  • 13. EDA-1
  • 14. EDA-2
  • 15. EDA-3
  • 17. Results • Final Ensemble Results • Optimal Weights – (2, 3, 2) – (URL, JS, KW)
  • 18. Comparison Results • McNemar’s Test on contingency table - Showed a statistically significant difference between the developed models. - α = 0.05, p < 0.05
  • 19. Discussion • URL-based models proved to be the best classifiers. • Dataset difference (2019) • Data extraction differences (Tools, Legal policies & Techniques)
  • 20. Future Work • Browser plugins • More features can be added, such as DNS and server relationships. • Combination of static & dynamic techniques. • Predicting broader categories of classes, e.g. threat types.
  • 21. References • Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, Soft Computing. • Chakraborty, G. and Lin, T. T. (2017). A URL address aware classification of malicious websites for online security during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1-6. • Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131. • Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A Markov detection tree-based centralized scheme to automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038. • Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using DNS records and domain name features, Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7. • Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020. • Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious web pages through analysis of underlying DNS and web server relationships, 2008 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935-941. • Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with JavaScript and VBScript, CoRR abs/1805.05603. • Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining, Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
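
The McNemar's test on slide 18 compares two classifiers through the cases where they disagree. The slides do not say which variant of the test was used, so the following is only a minimal sketch assuming the chi-square form with continuity correction. The function name `mcnemar_test` and the counts in the usage lines are hypothetical and are not results from the presentation.

```python
import math
from typing import Tuple


def mcnemar_test(b: int, c: int) -> Tuple[float, float]:
    """McNemar's chi-square test with continuity correction (assumed variant).

    b: pages model A classified correctly and model B misclassified
    c: pages model A misclassified and model B classified correctly
    Returns (statistic, p_value).
    """
    if b + c == 0:
        # The two models never disagree, so there is no evidence of a difference.
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant counts from a 2x2 contingency table.
stat, p = mcnemar_test(b=40, c=15)
print(f"chi2={stat:.3f}, p={p:.4f}, significant at alpha=0.05: {p < 0.05}")
```

Only the two discordant cells of the contingency table enter the statistic; pages that both models classify correctly, or both misclassify, carry no information about which model is better.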

Editor's Notes

  1. Hello Everyone! My name is Dharmendra Vishwakarma. This is a presentation of the Research Project for Master’s in Data Analytics course. The research topic is on “Detecting malicious web pages using an ensemble weighted average model”.
  2. The area of my study is a mix of the cyber security and data analytics domains. 1. With advances in communication technologies and an ever-growing internet, most services are online nowadays, such as e-banking, social networking, e-commerce and entertainment. Due to the easy availability of services and information, users tend to browse the internet freely without knowing its negative side. These services are exploited by cyber attackers to steal private and sensitive user information. 2. Cyber attackers use websites as a medium to redirect users to their malicious network for further attacks, or use drive-by-download software to install malware locally on the user’s computer. This enables attackers to perform other cyber-criminal activities such as ransomware, botnets, information stealing and DDoS. These lead to loss of information privacy and, in many cases, losses to businesses.
  3. To solve this problem, there are primarily three categories of solutions. Firstly, users are taught prevention techniques through education, and legislation through government initiatives discourages such activities. However, given the busy nature of business, people often make mistakes in real-world scenarios. The second approach consists of computerised hand-crafted techniques to prevent phishing activities. It usually involves static techniques such as blacklisting and whitelisting. There is also a dynamic approach, in which a virtual sandbox environment is used to observe the behaviour of web pages in order to detect deceptive behaviour. This method is not ideal for real-time detection, but it can be employed for creating a blacklist of URLs. Lastly, intelligent machine learning models are used to solve this problem using features present in the website. A recent study using a keyword density-based approach for detecting malicious websites has shown significant accuracy. However, given the varying nature of attacks, page content alone cannot be the only significant factor in determining the deceptive nature of a website.
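  4. So, the research question for my proposal is: “How can a weighted average ensemble of the keyword-density, URL and JavaScript code feature sets offer substantial improvements over the keyword-density predictor in identifying malicious web pages?”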
  5. The specific objectives of this research are those listed on the slide. Along with the content-based approach, this research proposal considers various other important factors: URL-based features, DNS information, server details and the JavaScript code present on the page. These factors can all contribute to the final decision, as the URL alone cannot efficiently detect the phishing behaviour of a website. The main contribution of this research will be the use of ensemble learning to decide the final classification result from the individual models.
  6. The literature review suggests the following trends. Many authors have considered different features of malicious websites, such as URL, DNS, JavaScript and page contents. All of this previous research considered different aspects of malicious threats to develop solutions. However, there is a need for a hybrid set of solutions that can detect malicious content even if one feature set fails to detect it. For instance, web threats can appear in many forms within a page, such as XSS, phishing or a DDoS attack. The idea is to give each feature set a weighted impact on the final decision.
  7. The research methodology is based on CRISP-DM, a successful methodology for data mining projects. Each task of the research is therefore divided into the six phases of the CRISP-DM paradigm.
  8. The dataset for this research is as follows: 100 thousand benign URLs will be extracted from Alexa and 20 thousand malicious URLs will be downloaded from PhishTank. Both datasets have been used previously in the literature. Since a comparison will be made against the baseline model, the same dataset is considered.
  13. Box plot for outlier detection: URL length shows outliers, which are explored further by class.
  14. The data is not normally distributed; most of it is right-skewed.
  15. Correlated attributes are detected; for example, cookies_ref_count is related to setinterval time. The rest seem fine and equally important for model building.
  16. The implementation is as follows. Web pages from the dataset are extracted and stored along with their URLs. Features related to keyword density, URL, JavaScript code and DNS server relationships are extracted by the feature extraction process. These features, together with the class variable, are supplied to individual machine learning models. Their outputs are given as input to the weighted ensemble model. In this way the dynamic weights are determined and the trained model is generated. The entire process is split into training and prediction. During prediction, unseen web pages are evaluated with the predictive model. The evaluation is conducted using precision, recall, F1-score, area under the ROC curve and 10-fold cross-validation. Furthermore, a statistical test is carried out to check the significance of the model.
  17. The ensemble technique has a lower error than any of the individual models.
  18. These are the references used in the presentation.
  19. Thank You.
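
The weighted ensemble step described in note 16 can be illustrated with a short sketch. The presentation does not state how the weights were searched, so the small grid search below is an assumption, as are the function names (`weighted_average`, `accuracy`, `search_weights`) and the toy probabilities. Only the weight triple (2, 3, 2) for (URL, JS, KW) comes from the slides.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def weighted_average(probs: Dict[str, List[float]],
                     weights: Dict[str, float]) -> List[float]:
    """Combine per-model malicious-class probabilities into one score per page."""
    total = sum(weights[name] for name in probs)
    n_pages = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs) / total
        for i in range(n_pages)
    ]


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of pages whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def search_weights(probs: Dict[str, List[float]], labels: Sequence[int],
                   candidates: Sequence[int] = (1, 2, 3)
                   ) -> Tuple[Dict[str, int], float]:
    """Assumed strategy: try every integer weight combination, keep the first best."""
    names = list(probs)
    best_w: Dict[str, int] = {}
    best_acc = -1.0
    for combo in product(candidates, repeat=len(names)):
        w = dict(zip(names, combo))
        acc = accuracy(weighted_average(probs, w), labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Hypothetical validation-set probabilities from the three individual models.
probs = {
    "url": [0.9, 0.2, 0.7, 0.1],
    "js": [0.8, 0.4, 0.4, 0.2],
    "kw": [0.4, 0.6, 0.7, 0.3],
}
labels = [1, 0, 1, 0]  # 1 = malicious, 0 = benign

# Weights reported on slide 17, in the order (URL, JS, KW).
scores = weighted_average(probs, {"url": 2, "js": 3, "kw": 2})
best_w, best_acc = search_weights(probs, labels)
print(scores, best_w, best_acc)
```

In this toy data the keyword-density model alone would misclassify the first two pages at a 0.5 threshold, while the weighted average classifies all four correctly, which is the motivation the notes give for combining feature sets.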