Base paper Title: PhiKitA: Phishing Kit Attacks Dataset for Phishing Websites Identification
Modified Title: PhiKitA: Phishing Kit Attacks Dataset for the Identification of Phishing
Websites
Abstract
Recent studies have shown that phishers are using phishing kits to deploy phishing
attacks faster, more easily and on a larger scale. Detecting phishing kits in deployed websites might
help to detect phishing campaigns earlier. To the best of our knowledge, there are no datasets
providing a set of phishing kits together with the phishing websites deployed from them. In this
work, we propose PhiKitA, a novel dataset that contains phishing kits and also phishing
websites generated using these kits. We have applied MD5 hashes, fingerprints, and DOM graph
representation algorithms to obtain baseline results on PhiKitA in three experiments:
familiarity analysis of phishing kit samples, phishing website detection and identifying the
source of a phishing website. In the familiarity analysis, we find evidence of different types of
phishing kits and a small phishing campaign. In the binary classification problem for phishing
detection, the graph representation algorithm achieved an accuracy of 92.50%, showing that
the phishing kit data contain useful information to classify phishing. Finally, the MD5 hash
representation obtained a 39.54% F1 score, which means that this algorithm does not extract
enough information to distinguish phishing websites and their phishing kit sources properly.
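The MD5 baseline mentioned above can be illustrated with a minimal sketch. It assumes a kit and a website are each available as a mapping from file path to file bytes; the helper names `md5_set` and `hash_overlap` and the sample files are hypothetical and are not taken from the base paper. The sketch shows one plausible reason exact hashes carry little information: only byte-identical files match, so a PHP source file in a kit and the HTML it renders on the live site never share a hash.

```python
import hashlib


def md5_set(files):
    """Return the set of MD5 hex digests for a {path: bytes} mapping."""
    return {hashlib.md5(content).hexdigest() for content in files.values()}


def hash_overlap(kit_files, site_files):
    """Fraction of the website's file hashes that also occur in the kit."""
    kit, site = md5_set(kit_files), md5_set(site_files)
    return len(kit & site) / len(site) if site else 0.0


# Hypothetical sample data: the stylesheet is copied verbatim, but the
# PHP source in the kit differs from the HTML served by the website.
kit = {
    "index.php": b"<?php echo 'login'; ?>",
    "style.css": b"body{margin:0}",
}
site = {
    "index.html": b"<html>login</html>",
    "style.css": b"body{margin:0}",
}

overlap = hash_overlap(kit, site)
print(f"shared file hashes: {overlap:.0%}")
```

Only the unchanged stylesheet matches here, so half of the site's files are attributed to the kit even though the whole site was generated from it.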
Existing System
The Internet has become increasingly accessible around the world in recent decades,
going from 20% of the world population with Internet access in 2005 to 63% in 2021 [1]. This
share represents 4.9 billion people using the Internet. With this exponential growth,
protecting Internet users and their data has become a concern for Law
Enforcement Agencies (LEAs), research centres, and people in general. As a
result, researchers have focused on important topics related to cybersecurity.
Recent relevant works include spam identification or classification [2], [3], bot detection for
early response to Distributed Denial of Service (DDoS) attacks [4], [5], algorithms to classify
suspicious content on the darknet [6], [7], [8], [9], and even image analysis as a forensic tool
to detect criminal activity on videos [10], [11]. Phishing is a cybercrime that uses social
engineering and aims to deceive people and steal their financial account credentials or other
sensitive data [12]. Phishers imitate websites to impersonate well-trusted companies and
request victims’ personal and sensitive information, as shown in Figure 1. Phishing has become
one of the most common cyberattacks due to the exponential growth of the Internet [13]. The
Anti-Phishing Working Group (APWG) reported a huge increase in the second quarter of 2022
[14], finding 1,097,811 phishing attacks, a record number, making that quarter the worst ever
observed. This increase in the number of attacks has also been motivated by the changes that
have taken place during the pandemic, as the studies by Hijji et al. [15] and Alzahrani et al.
[16] suggest.
Drawbacks in Existing System
• Bias and Representativeness:
Datasets may not be representative of the diverse range of phishing attacks and
techniques. Some datasets might focus more on certain types of phishing attacks,
potentially leading to biased models that perform well on those specific types but
struggle with others.
• Labeling Accuracy:
The accuracy of labeling phishing instances in a dataset is crucial. Manual labeling
can be subjective and may introduce errors. Additionally, there could be cases of
mislabeling or ambiguity, affecting the quality of the dataset.
• Evaluation Metrics:
The choice of evaluation metrics is crucial. Some metrics might not fully capture
the effectiveness of a phishing detection model. For example, accuracy may not be
sufficient if the dataset is imbalanced.
• Dataset Size:
The size of the dataset can impact the model's performance. A small dataset might
not capture the complexity and diversity of real-world phishing attacks, leading to
overfitting.
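The point about evaluation metrics on imbalanced data can be made concrete with a small sketch. The labels below are invented for illustration (5 phishing pages among 100 samples) and the helper functions are hypothetical, written in plain Python rather than taken from any particular library. A classifier that labels every page as legitimate reaches 95% accuracy while detecting no phishing at all, which the F1 score exposes.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented, imbalanced labels: 1 = phishing, 0 = legitimate.
y_true = [1] * 5 + [0] * 95
always_legit = [0] * 100                   # never flags anything
partial = [1, 1, 1, 0, 0] + [0] * 95       # finds 3 of the 5 phishing pages

for name, y_pred in [("always legitimate", always_legit), ("partial", partial)]:
    print(name, accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Both classifiers score 95% or higher on accuracy, but only the second one has a non-zero F1 score, so reporting F1 alongside accuracy separates them.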
Proposed System
• Presents the proposed methodology to collect phishing kit samples and phishing website
attacks, together with the content of the collected dataset. Section IV of the base paper describes the
experimental setup, the proposed experiments on the data and the algorithms evaluated.
• Proposes one of the first methods that use phishing kits as a source of information to
identify phishing attacks.
• Proposes a fingerprint representation based on file names, paths and strings in the
phishing kits.
• Proposed by Orunsolu et al. due to the need to train a machine learning model. We
concluded that a larger dataset is necessary to do this process properly.
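The fingerprint idea listed above (file names, paths and strings) might be sketched as follows. This is an illustrative assumption, not the base paper's implementation: the token prefixes, the regular expression for quoted strings, the four-character minimum, the Jaccard comparison and the two sample kits are all hypothetical choices made for this example.

```python
import re

# Quoted string literals of at least four characters (an arbitrary cutoff).
STRING_LITERAL = re.compile(r"['\"]([^'\"]{4,})['\"]")


def fingerprint(files):
    """Build a token set from paths, file names and quoted strings.

    `files` maps a relative path to the file's text content.
    """
    tokens = set()
    for path, text in files.items():
        tokens.add("path:" + path)
        tokens.add("name:" + path.rsplit("/", 1)[-1])
        for literal in STRING_LITERAL.findall(text):
            tokens.add("str:" + literal)
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical kits: same files and drop address, different root folder.
kit_a = {
    "login/index.php": "<?php $to = 'drop@example.com'; ?>",
    "login/css/main.css": "body{}",
}
kit_b = {
    "secure/index.php": "<?php $to = 'drop@example.com'; ?>",
    "secure/css/main.css": "body{}",
}

similarity = jaccard(fingerprint(kit_a), fingerprint(kit_b))
print(f"fingerprint similarity: {similarity:.2f}")
```

Unlike an exact hash, this representation still yields a partial match when a kit is repackaged under a different folder name, because file names and embedded strings survive the change.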
Algorithm
• PhishTank Dataset: PhishTank is a community-driven website that tracks phishing
websites. They may provide datasets that can be used for research purposes.
• Kaggle Datasets: Kaggle often hosts machine learning competitions and provides
datasets for various purposes. You can search Kaggle for datasets related to phishing
or cybersecurity.
• CERT Datasets: Computer Emergency Response Teams (CERTs) often release
datasets for research purposes. CERTs from different countries may have datasets
related to phishing attacks.
Advantages
• Realistic Scenarios: A dataset based on phishing kit attacks could provide realistic
and diverse scenarios encountered in actual phishing incidents. This realism enhances
the training of identification algorithms by exposing them to a wide range of phishing
techniques.
• Behavioral Patterns: Phishing kits often exhibit specific behavioral patterns or
characteristics. By using a dataset that includes these patterns, machine learning
algorithms can learn to recognize and distinguish features associated with phishing
attacks, aiding in accurate identification.
• Dynamic Threat Landscape: Phishing attacks are dynamic and constantly evolving.
A dataset based on phishing kit attacks can reflect the changing tactics used by
attackers. This adaptability is crucial for training algorithms that can keep up with the
evolving threat landscape.
• Security Research: The availability of a dedicated dataset enables security
researchers to conduct in-depth studies on phishing attacks, identify emerging trends,
and propose new countermeasures to enhance overall cybersecurity.
Hardware Specification
• Processor : Intel Core i3 processor
• RAM : 4 GB
• Hard disk : 500 GB
Software Specification
• Operating System : Windows 10/11
• Front End : Python
• Back End : MySQL Server
• IDE Tools : PyCharm
