MINI PROJECT – MSCP34
PHISHING DETECTION USING
MACHINE LEARNING
GUIDED BY:
DR. CHANDRA MOULI P.V.S.S.R.
HEAD OF THE DEPARTMENT
DEPARTMENT OF COMPUTER SCIENCE
PRESENTED BY:
KAVITA – P211307
ANKIT KUMAR – P211303
M.Sc. COMPUTER SCIENCE,
DEPARTMENT OF COMPUTER SCIENCE
Presentation Outline
BACKGROUND: Definition, History, Impacts
DATA COLLECTION: Dataset Descriptions
PRE-PROCESSING: Feature Engineering and EDA
MODELING & DEPLOYMENT: Ten Models Tested, Deployment in Streamlit Application
CONCLUSIONS: Final Thoughts and Recommendations
BACKGROUND
What is phishing?
Phishing is a form of cybercrime in which an attacker posing as a
reputable entity or person contacts a target via email, telephone, or
text message. The attacker then lures the target to a counterfeit
website and tricks them into providing sensitive data.
The purpose of this project is to help individuals identify phishing
URLs and so practice safer habits online.
Types of Phishing Tactics
96% of phishing attacks arrive by email.
3% of phishing attacks are carried out over the telephone.
This is also known as vishing.
1% of phishing attacks are carried out via text message.
This is also known as smishing.
Top Three Types of Data
CREDENTIALS: Passwords, usernames, PIN numbers, credit card information
PERSONAL DATA: Name, address, email address
MEDICAL: Treatment information, insurance claims
PayPal
2021 Phishing Statistics
30% of phishing emails are opened by users, and 12% of these targeted
users click on the malicious link or attachment.
97% of users are unable to recognize a sophisticated phishing email.
Common Features of Phishing Emails
- Sense of Urgency
- Unusual Sender
- Too Good to Be True
- Hyperlinks
According to the FBI, phishing incidents more than doubled, from
114,702 incidents in 2019 to 241,324 incidents in 2020. The increase
in remote work could be to blame.
As the internet becomes a major medium for economic transactions and
communications, online trust and cybercrime have become increasingly
important areas of study.
DATA COLLECTION
Data was collected from two separate datasets. Phishing URLs were
pulled from websites such as PhishTank and OpenPhish, and legitimate
URLs were pulled from websites such as Alexa and Common Crawl.
There were 545,895 instances in total, with a baseline accuracy of 72.1%.
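A baseline of this kind is usually the accuracy obtained by always predicting the most common class. The sketch below shows that calculation on toy labels. The function name, the label encoding (0 = legitimate, 1 = phishing), and the toy counts are assumptions for illustration. The slides do not state which class is the majority or give the per-class counts.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical toy labels, not the project's data: 0 = legitimate, 1 = phishing.
toy_labels = [0] * 721 + [1] * 279
print(f"{baseline_accuracy(toy_labels):.1%}")
```

Any model worth deploying should score clearly above this number on held-out data.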
PRE-PROCESSING
Feature Extraction
Using a function from the urllib library, the protocol, domain, path, query,
and fragment were extracted from each URL, and a column was created for each.
The protocol column was then dropped, because more sophisticated phishing
URLs also use https:// and so appear secure.
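The slide does not name the urllib function that was used. A minimal sketch, assuming `urllib.parse.urlparse`, shows how one URL splits into the five parts listed above. The helper name `split_url` and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def split_url(url):
    """Split a URL into the five components named on the slide."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,   # later dropped from the feature set
        "domain": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Hypothetical example URL, for illustration only.
example = split_url("https://login.example.com/account/verify?id=42&src=mail#top")
print(example)
```

Each key can then become a column when the function is applied to every URL in the dataset.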
Feature Extraction
Length of the URL, domain, path, query, and fragment is extracted.
Quantity of specific characters in the URL, domain, path, query, and
fragment is extracted. These characters include:
- = . ? @ ~ & ! + * , # $ % space
65 Total Features Used in Model
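The length and character-count features can be sketched as below, assuming `urllib.parse.urlparse` for the split. The function name and the column-naming scheme are hypothetical. Counting all 15 listed characters in all five parts gives 80 columns, not the 65 reported on the slide, so the project must have used a subset that the slides do not specify.

```python
from urllib.parse import urlparse

# The characters listed on the slide; " " stands for the space character.
SPECIAL_CHARS = ["-", "=", ".", "?", "@", "~", "&", "!", "+", "*", ",", "#", "$", "%", " "]


def url_features(url):
    """Return length and character-count features for a URL and its parts."""
    parts = urlparse(url)
    components = {
        "url": url,
        "domain": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    features = {}
    for name, text in components.items():
        features[f"{name}_length"] = len(text)
        for ch in SPECIAL_CHARS:
            label = "space" if ch == " " else ch
            features[f"{name}_count_{label}"] = text.count(ch)
    return features


# Hypothetical example URL, for illustration only.
feats = url_features("http://secure-login.example.com/a.b/c?user=x&id=1#frag")
print(feats["domain_length"], feats["domain_count_."], feats["query_count_="])
```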
MODEL SELECTION & EVALUATION
Models Tested
- Stochastic Gradient Descent Classifier
- Logistic Regression
- Support Vector Machine
- AdaBoost
- Gradient Boost
- Decision Tree Classifier
- Bagging Classifier
- K-Nearest Neighbors Classifier
- Extra Trees Classifier
- Random Forest Classifier
Baseline: 72.1%
GridSearchCV and RandomizedSearchCV were used to tune the
highest-scoring models. Once the best model was determined,
hyperparameter tuning continued on that model.
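A grid search of the kind described above can be sketched with scikit-learn as follows. The synthetic data, the choice of Random Forest as the estimator, and the parameter grid are all assumptions for illustration. The slides do not list the grids or settings that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real project used URL-derived features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical grid: every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refit best model on held-out data
```

RandomizedSearchCV takes the same estimator but samples a fixed number of combinations, which is cheaper when the grid is large.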
Model Selection
MODEL                 TRAINING SCORE   TESTING SCORE   USED FOR DEPLOYMENT
k-Nearest Neighbors   94.8%            93.2%
Decision Trees        97.6%            94.3%
Extra Trees           97.9%            94.4%
Random Forest         97.0%            94.5%
Model Evaluation
Accuracy: 94.5%
Recall: 85.8%
Specificity: 97.7%
Precision: 93.6%
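These four metrics all derive from the counts in a confusion matrix. The sketch below uses the standard definitions and assumes phishing is the positive class, which the slide does not state. The function name and the counts are hypothetical and are not the project's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four slide metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all correct / all cases
        "recall": tp / (tp + fn),        # share of phishing URLs that were caught
        "specificity": tn / (tn + fp),   # share of legitimate URLs that were passed
        "precision": tp / (tp + fp),     # share of phishing flags that were right
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=5, tn=195, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

A recall lower than the specificity, as on the slide, means the model misses phishing URLs more often than it wrongly flags legitimate ones.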
CONCLUSIONS
How to Avoid Phishing Attacks
1. STAY INFORMED: Learn about new phishing techniques as they are
developed, to avoid falling prey to one.
2. UTILIZE ‘FISHING FOR PHISHERS’: When in doubt, use the ‘Fishing for
Phishers’ app to verify the authenticity of a website.
3. THINK BEFORE YOU CLICK: Never click on hyperlinks without examining
the hidden URL.
Thank you!
Any questions?
