Research Project - Master's in Data Analytics
Statistical and machine learning techniques learned as part of the Data Analytics coursework are applied in this thesis project to solve the problem of malicious web page detection.
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...Edward Curry
Digital transformation is driving a new wave of large-scale datafication in every aspect of our world. Today our society creates data ecosystems where data moves among actors within complex information supply chains that can form around an organization, community, sector, or smart environment. These ecosystems of data can be exploited to transform our world and present new challenges and opportunities in the design of intelligent systems. This talk presents my recent work on using the dataspace paradigm as a best-effort approach to data management within data ecosystems. The talk explores the theoretical foundations and principles of dataspaces and details a set of specialized best-effort techniques and models to enable loose administrative proximity and semantic integration of heterogeneous data sources. Finally, I share my perspectives on future dataspace research challenges, including multimedia data, data governance and the role of dataspaces to enable large-scale data sharing within Europe to power data-driven AI.
Quality, Relevance and Importance in Information Retrieval with Fuzzy Semanti...tmra
We propose a framework for ranking information based on quality, relevance and importance, and argue that a socio-semantic contextual approach that extends topicality can lead to increased value of information retrieval systems. We use Topic Maps to implement our framework, and discuss procedures for calculating the resource ranking. A fuzzy neural network approach is envisioned to complement the process of manual metadata creation.
Scratchpads: the Virtual Research Environment for biodiversity dataVince Smith
Rycroft, S., Roberts, D., Smith, V., Heaton, A., Bouton, K., Livermore, L., Koureas, D., Baker, E. 2013. Scratchpads: the Virtual Research Environment for biodiversity data. TDWG, Biodiversity Information Standards. Grand Hotel Mediterraneo Florence, Italy, 27 Oct - 1 Nov., 2013.
GENI Engineering Conference -- Ian FosterIan Foster
I was invited to talk at the 18th GENI Engineering Conference (http://groups.geni.net/geni/wiki/GEC18Agenda) on experiences in the Grid community with creating and operating large shared infrastructures. I chose to focus on our experiences using Software as a Service (SaaS: aka Cloud) to reduce barriers to the use of the capabilities required to create and operate virtual organizations.
Extending Memory on the Web via Human-Centric Knowledge Exchange Network. Presented at W3C Workshop on Social Standards: The Future of Business, 7-8 August 2013, San Francisco, USA
This presentation is intended to give some brief advice for those publishing digital content (digital images, cultural heritage, scholarly information, etc.) on the Internet, and in particular on how to ensure good visibility via Google and other portals.
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Making sure your content is licensed and discoverable
A presentation from the JISC Programme Meeting for its Content Programme for 2011 http://www.jisc.ac.uk/whatwedo/programmes/digitisation/econtent11.aspx
American Art Collaborative Linked Open Data presentation to "The Networked Cu...American Art Collaborative
An August 2017 presentation by Eleanor Fink to "The Networked Curator: Association of Art Museum Curators Foundation Digital Literacy Workshop for Art Curators"
A Review on Pattern Discovery Techniques of Web Usage MiningIJERA Editor
In recent years, with the development of Internet technology, the growth of the World Wide Web has exceeded all expectations. A great deal of information is available in different formats, and retrieving interesting content has become a very difficult task. One possible approach to this problem is Web Usage Mining (WUM), an important application of Web Mining. Extracting the hidden knowledge in the log files of a web server, recognizing the various interests of web users, and discovering customer behaviour while at a site are commonly cited applications of web usage mining. In this paper we provide an updated, focused survey of web usage mining techniques.
"From Big Data to Smart data"
Jie (Jack) Yang, Associate Research Fellow, SMART Infrastructure Facility, presented a summary of his research as part of the SMART Seminar Series on 28 April 2016.
For more information, visit the event page at: http://smart.uow.edu.au/events/UOW212890.html.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.vivatechijri
In this technological age there are many ways an attacker can gain illegitimate access to people's sensitive information. One of these is phishing: the activity of misleading people into entering their sensitive information on fraudulent websites that look like the real ones. The phisher's aim is to steal personal information, bank details, etc. Day by day it is getting riskier to enter personal information on websites, for fear that the site is a phishing attack that can steal sensitive information. That is why phishing website detection is necessary, to alert the user and block the website. Automated detection of phishing attacks is needed, and machine learning is one of the most efficient techniques for it, as it removes the drawbacks of existing approaches. An efficient machine learning model combined with a content-based approach proves very effective at detecting phishing websites.
Our proposed system uses a hybrid approach that combines a machine-learning-based method and a content-based method. URL-based features are extracted and passed to the machine learning model, while in the content-based approach the TF-IDF algorithm detects a phishing website using the top keywords of a web page. This hybrid approach achieves a highly efficient result. Finally, our system notifies and alerts the user as to whether the website is phishing or legitimate.
Integrated Web Recommendation Model with Improved Weighted Association Rule M...ijdkp
The World Wide Web plays a significant role in human life, and it requires continual technological improvement to satisfy user needs. Web log data is essential for improving the performance of the web; it is large, heterogeneous and diverse. Analyzing web log data is a tedious process for web developers, web designers, technologists and end users. In this work, a new weighted association mining algorithm is developed to identify the best association rules, useful for web site restructuring and recommendation, that reduce false visits and improve users' navigation behaviour. The algorithm finds the frequent item sets in a large uncertain database. Repeated scanning of the database is the problem with existing algorithms, leading to complex output sets and a time-consuming process. The proposed algorithm scans the database only once, at the beginning of the process, and the generated frequent item sets are stored in the database. Evaluation parameters such as support, confidence, lift and number of rules are used to compare the performance of the proposed algorithm and a traditional association mining algorithm. The new algorithm produced the best results, helping developers restructure their websites to meet the requirements of end users within a short time span.
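The evaluation parameters named above (support, confidence, lift) can be sketched directly over a toy set of page-visit sessions; the sessions and the candidate rule below are illustrative, not the paper's data.

```python
# Rule metrics for association mining over toy navigation sessions.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]
n = len(sessions)

def support(itemset):
    # Fraction of sessions containing every page in the itemset.
    return sum(itemset <= s for s in sessions) / n

def confidence(antecedent, consequent):
    # P(consequent | antecedent) estimated from the sessions.
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # >1 means the antecedent makes the consequent more likely.
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"products"}, {"cart"})
print(support(rule[0] | rule[1]))  # 0.5
print(confidence(*rule))           # ≈ 0.667
print(lift(*rule))                 # ≈ 1.333
```

A weighted variant would multiply each session's contribution by a page weight (e.g. navigation order or dwell time) instead of counting sessions equally.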
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM csandit
With the increasing growth of the Internet and the World Wide Web, information retrieval (IR) has attracted much attention in recent years. Quick, accurate and high-quality information mining is the core concern of successful search companies. Likewise, spammers try to manipulate IR systems to fulfil their stealthy needs. Spamdexing (also known as web spamming) is one of the spamming techniques of adversarial IR, allowing users to manipulate the ranking of specific documents in the search engine result page (SERP). Spammers take advantage of different features of the web indexing system for notorious motives. Suitable machine learning approaches can be useful in the analysis of spam patterns and the automated detection of spam. This paper examines content-based features of web documents and discusses the potential of feature selection (FS) in upcoming studies to combat web spam. The objective of feature selection is to select the salient features that improve prediction performance and help understand the underlying data-generation techniques. A publicly available web data set, WEBSPAM-UK2007, is used for all evaluations.
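The feature-selection idea described above can be sketched as ranking candidate content features by a univariate score and keeping the top k; the data below is synthetic, not the WEBSPAM-UK2007 set.

```python
# Univariate feature selection sketch: keep the k most salient features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 6))  # 6 candidate content features
# Make feature 2 informative: the label is derived from it.
y = (X[:, 2] > 5).astype(int)

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the 2 selected features
```

With real spam data the scores would come from held-out evaluation rather than a synthetic label, but the mechanics are the same: score each feature, keep the salient subset, retrain.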
A NEW IMPROVED WEIGHTED ASSOCIATION RULE MINING WITH DYNAMIC PROGRAMMING APPR...cscpconf
With the rapid development of the Internet, web search has taken an important role in our ordinary life. In web search, mining frequent patterns in large databases is a major research area. With the increase of user activity on the web, searching methods that predict the next page a user will visit play a major role. Web searching methods help provide quality results and timely answers, and also offer customized navigation. In web search, association rule mining is an important data analysis method for discovering associated web pages. Most researchers have implemented association mining using the Apriori algorithm with a binary representation; the problem with this approach is that it does not address issues such as the navigation order of web pages. To overcome this, researchers proposed a weighted Apriori that maintains navigation order but is unable to produce optimal results. With the goal of a more favourable result, we propose a novel approach that combines weighted Apriori and dynamic programming. Experimental results show that this approach maintains the navigation order of web pages and achieves a better solution. The proposed technique enhances web site effectiveness, increases users' browsing knowledge, improves prediction accuracy and decreases computational complexity.
WEB ATTACK PREDICTION USING STEPWISE CONDITIONAL PARAMETER TUNING IN MACHINE ...IJCNCJournal
There is rapid growth in internet and website usage. A wide variety of devices are used to access websites, such as mobile phones, tablets, laptops, and personal computers. Attackers are finding more and more vulnerabilities on websites that they can exploit for malicious purposes. A web application attack occurs when cyber criminals gain access to unauthorized areas. Typically, attackers look for vulnerabilities in web applications at the application layer; SQL injection and cross-site scripting attacks are used to access web applications and obtain sensitive data. A key objective of this work is to develop new features and investigate how automatic tuning of machine learning techniques can improve the performance of web attack detection, using the HTTP CSIC datasets to detect and block attacks. The proposed model applies stepwise conditional parameter tuning to machine learning algorithms: a dynamic, automated way of choosing and tuning parameters based on the better outcome. This work also compares the performance of the proposed model on two datasets.
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSAM Publications
In the current era, millions of clients access the Internet and the World Wide Web (WWW) daily to search for information and meet their needs. Web mining is a technique for automatically discovering and extracting information from the WWW. Websites are a common stage for exchanging information between users. Web mining is an application of data mining techniques for extracting information from web data; its areas are web content mining, web usage mining and web structure mining, all of which focus on knowledge discovery from the web. Web content mining involves techniques for summarization, classification and clustering, and the process of extracting or discovering useful information from web pages, including images, audio, video and metadata. Web usage mining is the process of extracting information from web server logs. Web structure mining uses graph theory to analyse the node and connection structure of a website, and deals with the hyperlink structure of the web. Web mining is a part of data mining that relates to various research communities such as information retrieval, database management systems and artificial intelligence.
Use of hog descriptors in phishing detectionSelman Bozkır
In this paper we dive into the details of an anti-phishing detection system that employs HOG features.
* The presentation includes a voice recording
Similar to Classifying malicious websites using an ensemble weighted features (20)
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computation and can thus also reduce iteration time. Road networks often have chains that can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
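For reference against these optimizations, the unoptimized baseline they improve on is plain power-iteration PageRank. The tiny graph and the parameters below (damping 0.85, L1 tolerance 1e-10) are illustrative.

```python
# Baseline power-iteration PageRank over an adjacency-list graph.
def pagerank(graph, damping=0.85, tol=1e-10):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    while True:
        new = {v: (1 - damping) / n for v in nodes}
        for u, outs in graph.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c (it receives the most links)
```

Every optimization listed above (convergence skipping, chain short-circuiting, component-wise topological processing) changes which vertices this loop touches per iteration, not the fixed point it converges to.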
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms like PageRank often operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph has no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Classifying malicious websites using an ensemble weighted features
1. Detecting Malicious Web Pages Using An Ensemble Weighted Average Model
- Research Project Presentation
Dharmendra Lalji Vishwakarma
X18108181
MSc in Data Analytics – Cohort A
September 2018-19
2. Area of Study & Motivation
Increase in internet users
- Popularity of cyber crimes
- Websites as a medium of attack
Cyber-criminal activities such as ransomware, botnets, information stealing, DDoS, etc.
- Lead to loss of information privacy
- Losses to businesses
3. Present Solutions
1. Education & legislation
2. Hand-crafted techniques
1. Static technique – black-listing & white-listing approach
2. Dynamic technique – useful for creating blacklists
3. Intelligent machine learning models – using features present in the malicious webpage
1. Recent case study – keyword-density approach (Altay et al., 2018)
4. Research Question
How can a weighted average ensemble of the feature sets of keyword density, URL features and JavaScript code offer substantial improvements over a keyword-density predictor in identifying malicious web pages?
5. Research Objectives
• Analysing important attributes, such as URL length, of URL characteristics in distinguishing the malicious class.
• Reproducing the keyword-density method of classifying webpages; it acts as a baseline model for an improved version of classification on a similar dataset.
• Experimenting with each independent feature against the outcome to see its contribution to the prediction.
• Dynamically calculating the weights for each feature set for classification using an ensemble weighted approach.
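The ensemble weighted approach in the last objective can be sketched as follows: each feature-set model outputs a probability, and the weights combine them into one score. All numbers below are illustrative stand-ins for the real keyword-density, URL and JavaScript models; the choice of validation accuracy as the weight source is an assumption.

```python
# Weighted average ensemble over per-feature-set model probabilities.
def weighted_average_ensemble(probabilities, weights):
    total = sum(weights)
    score = sum(p * w for p, w in zip(probabilities, weights)) / total
    return score, int(score >= 0.5)  # 1 = malicious, 0 = benign

# Probability that a page is malicious, per feature-set model (toy values):
p_keyword, p_url, p_js = 0.40, 0.90, 0.70
# Weights, e.g. derived from each model's validation accuracy (assumed):
weights = [0.80, 0.92, 0.85]

score, label = weighted_average_ensemble([p_keyword, p_url, p_js], weights)
print(round(score, 3), label)  # 0.678 1
```

Note how the keyword model alone would have voted benign (0.40 < 0.5), but the URL and JavaScript models outvote it, which is exactly the improvement over the keyword-density baseline the research question targets.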
6. Literature Review
• Detection of malicious websites using URL features
• (Chakraborty and Lin, 2017) and (Kim et al., 2018)
• Malicious websites detection using JavaScript codes
• (Liu et al., 2018) and (Stokes et al., 2018)
• Using machine learning with a content-based approach
• (Altay et al., 2018) and (Saxe et al., 2018)
• Using Hybrid features approach
• (Akiyama et al., 2017) and (Kazemian and Ahmed, 2015)
• Review of Ensemble learning
• (Nagaraj et al., 2018) and (Anne Ubing et al., 2019)
11. Features Extraction - HTML
• Sklearn pipeline – TF-IDF Vectoriser module
• Takes care of text processing such as tokenisation, stop-word removal, stemming & n-grams.
19. Discussion
• URL-based models proved to be the best classifiers.
• Dataset differences (2019)
• Data extraction differences (tools, legal policies & techniques)
20. Future Work
• Browser plugins
• More features can be added, such as DNS and server relations.
• Combination of static & dynamic techniques.
• Predicting broader categories of classes, e.g. threat types.
21. References
• Altay, B., Dokeroglu, T. and Cosar, A. (2018). Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection, Soft Computing.
• Chakraborty, G. and Lin, T. T. (2017). A URL address aware classification of malicious websites for online security during web-surfing, 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1-6.
• Kim, S., Kim, J., Nam, S. and Kim, D. (2018). WebMon: ML- and YARA-based malicious webpage detection, Computer Networks 137: 119-131.
• Liu, J., Xu, M., Wang, X., Shen, S. and Li, M. (2018). A Markov detection tree-based centralized scheme to automatically identify malicious webpages on cloud platforms, IEEE Access 6: 74025-74038.
• Messabi, K. A., Aldwairi, M., Yousif, A. A., Thoban, A. and Belqasmi, F. (2018). Malware detection using DNS records and domain name features, Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, ICFNDS '18, ACM, New York, NY, USA, pp. 29:1-29:7.
• Saxe, J., Harang, R. E., Wild, C. and Sanders, H. (2018). A deep learning approach to fast, format-agnostic detection of malicious web content, CoRR abs/1804.05020.
• Seifert, C., Welch, I., Komisarczuk, P., Aval, C. U. and Endicott-Popovsky, B. (2008). Identification of malicious web pages through analysis of underlying DNS and web server relationships, 2008 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935-941.
• Stokes, J. W., Agrawal, R. and McDonald, G. (2018). Neural classification of malicious scripts: A study with JavaScript and VBScript, CoRR abs/1805.05603.
• Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining, Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39.
Hello Everyone! My name is Dharmendra Vishwakarma. This is a presentation of the Research Project for Master’s in Data Analytics course. The research topic is on “Detecting malicious web pages using an ensemble weighted average model”.
The area of my study is a mix of the cyber security and data analytics domains.
1. With advancements in communication technologies and the ever-growing internet, most services are online nowadays, such as e-banking, social networking, e-commerce and entertainment. Due to the easy availability of services and information, users tend to browse the internet freely without knowing its negative side. These services are exploited by cyber attackers to steal useful and private user-sensitive information.
2. The cyber attackers use websites as a medium to redirect users to their malicious network for further attacks, or use drive-by-download software to install malware locally on the user's computer. This enables attackers to perform other cyber-criminal activities such as ransomware, botnets, information stealing and DDoS. These lead to loss of information privacy and, in many cases, losses to the businesses.
Three main categories of solutions exist to address this problem.
Firstly, users are taught prevention techniques through education, and government
legislation is used to discourage such activities. However, due to the busy nature of business, people often still make mistakes in real-world scenarios.
The second approach consists of hand-crafted computerised techniques to prevent phishing activities. It usually involves static techniques such as blacklisting and whitelisting.
A dynamic approach is also used, wherein web pages are observed in a virtual sandbox environment to detect deceptive behaviour. However, this method is not suitable for real-time detection and is mainly employed to build blacklists of URLs.
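As an illustration, the static blacklist approach mentioned above amounts to a simple set lookup against known-bad hosts. A minimal sketch (the blacklist contents here are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical blacklist contents, for illustration only.
BLACKLIST = {"evil.example.com", "phish.example.net"}

def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host appears in the static blacklist."""
    host = urlparse(url).netloc.lower()
    return host in BLACKLIST
```

The weakness noted in the talk follows directly from this sketch: any host not yet on the list passes the check, which is why static lists lag behind newly created malicious pages.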
Lastly, intelligent machine learning models are used to solve this problem using features present in the website. A recent study using a keyword-density-based approach for detecting malicious websites has shown significant accuracy. However, the content present on the page alone cannot be a decisive factor in determining
the deceptive nature of a website, given the varying nature of attacks.
So, the research question for my proposal is “”.
And the specific objectives of this research are “”.
This research proposal considers several other important factors alongside the content-based approach. These factors
are URL-based features, DNS information, server details, and the JavaScript code present on the page. These factors contribute to the final decision, as the URL alone cannot
efficiently detect the phishing behaviour of a website.
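A few of the URL-based features mentioned above could be extracted with a sketch like this. The feature names are illustrative, not the exact feature set used in the project:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Extract a few illustrative URL-based features: length, digit count,
    dot count, '@' presence, and whether the host is a raw IP address."""
    host = urlparse(url).netloc
    return {
        "url_length": len(url),
        "num_digits": sum(c.isdigit() for c in url),
        "num_dots": url.count("."),
        "has_at_symbol": "@" in url,
        # Raw-IP hosts are a common phishing indicator.
        "host_is_ip": host.replace(".", "").isdigit(),
    }
```

Analogous extractors would be written for the DNS, server, JavaScript, and keyword-density feature groups, each feeding its own model.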
The main contribution of this research will be the use of ensemble learning to combine the outputs of the individual models into a final classification decision.
The literature review suggests the following trends.
Many authors have considered different features of malicious websites, such as the URL, DNS, JavaScript, and page contents.
All of this previous research addressed different aspects of malicious threats. However, there is a need for a hybrid solution that can detect malicious content even if one feature set fails to detect it. For instance, web threats can appear in many forms within a page, such as XSS, phishing, or a DDoS attack. The idea is to weight each model's impact on the final decision.
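The weighted-impact idea above can be sketched as a weighted average of the individual models' predicted probabilities. The probabilities and weights below are illustrative; in the project the weights would be determined during training:

```python
def weighted_ensemble(probs, weights, threshold=0.5):
    """Combine per-model malicious-class probabilities via a normalised
    weighted average; classify as malicious if the score exceeds threshold."""
    total = sum(weights)
    score = sum(p * w for p, w in zip(probs, weights)) / total
    return score, score > threshold
```

For example, with three models (say URL, content, and JavaScript based) predicting 0.9, 0.2, and 0.8 under weights 2, 1, 1, the combined score is 0.7, so the page is flagged as malicious even though one feature set missed it.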
The research methodology is based on CRISP-DM, a well-established methodology for data mining projects.
Each task of the research is therefore divided into the six CRISP-DM phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.
The dataset for this research is as follows:
100 thousand benign URLs will be extracted from Alexa, and 20 thousand malicious URLs will be downloaded from PhishTank.
Both datasets have been used previously in the literature. Since a comparison will be made against a baseline model, the same dataset is used.
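Assembling the labelled dataset from the two sources could be sketched as follows; the input lists stand in for the downloaded Alexa and PhishTank URL files:

```python
def build_dataset(benign_urls, malicious_urls):
    """Label benign (Alexa) URLs as 0 and malicious (PhishTank) URLs as 1.
    Note: with 100k benign vs 20k malicious URLs the classes are imbalanced
    roughly 5:1, which matters when choosing evaluation metrics later."""
    return [(u, 0) for u in benign_urls] + [(u, 1) for u in malicious_urls]
```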
Box plots were used for outlier detection:
- URL length shows outliers, which were explored further by class.
- The data is not normally distributed; most attributes are right-skewed.
- Correlated attributes were detected; for example, cookies_ref_count is related to setinterval time.
- The remaining attributes appear fine and equally important for model building.
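The box-plot outlier check above corresponds to the standard 1.5 × IQR rule. A minimal sketch, with quantile interpolation simplified relative to what a plotting library would do:

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the same rule
    a box plot uses to draw points beyond the whiskers."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        idx = q * (len(xs) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(xs) - 1)
        frac = idx - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo_fence or v > hi_fence]
```

Applied to a right-skewed attribute such as URL length, this flags the long tail of unusually long URLs, which is what prompted the per-class exploration noted above.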
The implementation is as follows.
Web pages from the dataset are extracted and stored along with their URLs. Features related to keyword density, the URL, JavaScript code, and DNS server relationships are extracted during the feature extraction process. These features, together with the class variable, are supplied to the individual machine learning models. Their outputs are given as input to the weighted ensemble model. In this way, dynamic weights are determined and a trained model is generated. The entire process is split into training and prediction. During prediction, unseen web pages are evaluated on the predictive model. Evaluation is conducted using precision, recall, F1-score, area under the ROC curve, and 10-fold cross-validation. Furthermore, a statistical test is carried out to check the significance of the model.
The ensemble technique has the lowest error among the individual models.
These are the references used in the presentation.