SlideShare a Scribd company logo
1 of 4
Download to read offline
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/328541785
Phishing Website Detection using Machine Learning Algorithms
Article  in  International Journal of Computer Applications · October 2018
DOI: 10.5120/ijca2018918026
CITATIONS
12
READS
27,305
2 authors:
Some of the authors of this publication are also working on these related projects:
Reinforcement Learning for anomalous path detection using wearable View project
Phishing Website Detection using Machine Learning Algorithms View project
Rishikesh Mahajan
Somaiya Vidyavihar
1 PUBLICATION   12 CITATIONS   
SEE PROFILE
Irfan Siddavatam
Somaiya Vidyavihar
42 PUBLICATIONS   90 CITATIONS   
SEE PROFILE
All content following this page was uploaded by Rishikesh Mahajan on 14 June 2019.
The user has requested enhancement of the downloaded file.
International Journal of Computer Applications (0975 – 8887)
Volume 181 – No. 23, October 2018
45
Phishing Website Detection using Machine Learning
Algorithms
Rishikesh Mahajan
MTECH Information Technology
K.J. Somaiya College of Engineering, Mumbai - 77
Irfan Siddavatam
Professor, Dept. Information Technology
K.J. Somaiya College of Engineering, Mumbai - 77
ABSTRACT
Phishing attack is a simplest way to obtain sensitive
information from innocent users. Aim of the phishers is to
acquire critical information like username, password and bank
account details. Cyber security persons are now looking for
trustworthy and steady detection techniques for phishing
websites detection. This paper deals with machine learning
technology for detection of phishing URLs by extracting and
analyzing various features of legitimate and phishing URLs.
Decision Tree, random forest and Support vector machine
algorithms are used to detect phishing websites. Aim of the
paper is to detect phishing URLs as well as narrow down to
best machine learning algorithm by comparing accuracy rate,
false positive and false negative rate of each algorithm.
Keywords
Phishing attack, Machine learning
1. INTRODUCTION
Nowadays Phishing becomes a main area of concern for
security researchers because it is not difficult to create the
fake website which looks so close to legitimate website.
Experts can identify fake websites but not all the users can
identify the fake website and such users become the victim of
phishing attack. Main aim of the attacker is to steal banks
account credentials. In United States businesses, there is a loss
of US$2billion per year because their clients become victim to
phishing [1]. In 3rd Microsoft Computing Safer Index Report
released in February 2014, it was estimated that the annual
worldwide impact of phishing could be as high as $5 billion
[2]. Phishing attacks are becoming successful because lack of
user awareness. Since phishing attack exploits the weaknesses
found in users, it is very difficult to mitigate them but it is
very important to enhance phishing detection techniques.
The general method to detect phishing websites by updating
blacklisted URLs, Internet Protocol (IP) to the antivirus
database which is also known as “blacklist" method. To evade
blacklists attackers uses creative techniques to fool users by
modifying the URL to appear legitimate via obfuscation and
many other simple techniques including: fast-flux, in which
proxies are automatically generated to host the web-page;
algorithmic generation of new URLs; etc. Major drawback of
this method is that, it cannot detect zero-hour phishing attack.
Heuristic based detection which includes characteristics that
are found to exist in phishing attacks in reality and can detect
zero-hour phishing attack, but the characteristics are not
guaranteed to always exist in such attacks and false positive
rate in detection is very high [3].
To overcome the drawbacks of blacklist and heuristics based
method, many security researchers now focused on machine
learning techniques. Machine learning technology consists of
a many algorithms which requires past data to make a
decision or prediction on future data. Using this technique,
algorithm will analyze various blacklisted and legitimate
URLs and their features to accurately detect the phishing
websites including zero- hour phishing websites.
2. DATASET
URLs of benign websites were collected from
www.alexa.com and The URLs of phishing websites were
collected from www.phishtank.com. The data set consists of
total 36,711 URLs which include 17058 benign URLs and
19653 phishing URLs. Benign URLs are labelled as “0” and
phishing URLs are labelled as “1”.
3. FEATURE EXTRACTION
We have implemented python program to extract features
from URL. Below are the features that we have extracted for
detection of phishing URLs.
1) Presence of IP address in URL: If IP address
present in URL then the feature is set to 1 else set to 0.
Most of the benign sites do not use IP address as an URL
to download a webpage. Use of IP address in URL
indicates that attacker is trying to steal sensitive
information.
2) Presence of @ symbol in URL: If @ symbol present
in URL then the feature is set to 1 else set to 0. Phishers
add special symbol @ in the URL leads the browser to
ignore everything preceding the “@” symbol and the real
address often follows the “@” symbol [4].
3) Number of dots in Hostname: Phishing URLs have
many dots in URL. For example
http://shop.fun.amazon.phishing.com, in this URL
phishing.com is an actual domain name, whereas use of
“amazon” word is to trick users to click on it. Average
number of dots in benign URLs is 3. If the number of
dots in URLs is more than 3 then the feature is set to 1
else to 0.
4) Prefix or Suffix separated by (-) to domain: If
domain name separated by dash (-) symbol then feature
is set to 1 else to 0. The dash symbol is rarely used in
legitimate URLs. Phishers add dash symbol (-) to the
domain name so that users feel that they are dealing with
a legitimate webpage. For example Actual site is
http://www.onlineamazon.com but phisher can create
another fake website like http://www.online-amazon.com
to confuse the innocent users.
5) URL redirection: If “//” present in URL path then
feature is set to 1 else to 0. The existence of “//” within
the URL path means that the user will be redirected to
another website [4].
6) HTTPS token in URL: If HTTPS token present in
International Journal of Computer Applications (0975 – 8887)
Volume 181 – No. 23, October 2018
46
URL then the feature is set to 1 else to 0. Phishers may
add the “HTTPS” token to the domain part of a URL in
order to trick users. For example, http://https-www-
paypal-it-mpp-home.soft-hair.com [4].
7) Information submission to Email: Phisher might
use “mail()” or “mailto:” functions to redirect the user’s
information to his personal email[4]
. If such functions are
present in the URL then feature is set to 1 else to 0.
8) URL Shortening Services “TinyURL”: TinyURL
service allows phisher to hide long phishing URL by
making it short. The goal is to redirect user to phishing
websites. If the URL is crafted using shortening services
(like bit.ly) then feature is set to 1 else 0
9) Length of Host name: Average length of the benign
URLs is found to be a 25, If URL’s length is greater than
25 then the feature is set to 1 else to 0
10) Presence of sensitive words in URL: Phishing sites
use sensitive words in its URL so that users feel that they
are dealing with a legitimate webpage. Below are the
words that found in many phishing URLs :- 'confirm',
'account', 'banking', 'secure', 'ebyisapi', 'webscr', 'signin',
'mail', 'install', 'toolbar', 'backup', 'paypal', 'password',
'username', etc;
11) Number of slash in URL: The number of slashes in
benign URLs is found to be a 5; if number of slashes in
URL is greater than 5 then the feature is set to 1 else to 0.
12) Presence of Unicode in URL: Phishers can make a
use of Unicode characters in URL to trick users to click
on it. For example the domain “xn--80ak6aa92e.com” is
equivalent to "аррӏе.com". Visible URL to user is
"аррӏе.com" but after clicking on this URL, user will
visit to “xn--80ak6aa92e.com” which is a phishing site.
13) Age of SSL Certificate: The existence of HTTPS is
very important in giving the impression of website
legitimacy [4]
. But minimum age of the SSL certificate of
benign website is between 1 year to 2 year.
14) URL of Anchor: We have extracted this feature by
crawling the source code oh the URL. URL of the anchor
is defined by <a> tag. If the <a> tag has a maximum
number of hyperlinks which are from the other domain
then the feature is set to 1 else to 0.
15) IFRAME: We have extracted this feature by crawling
the source code of the URL. This tag is used to add
another web page into existing main webpage. Phishers
can make use of the “iframe” tag and make it invisible
i.e. without frame borders [4]. Since border of inserted
webpage is invisible, user seems that the inserted web
page is also the part of the main web page and can enter
sensitive information.
16) Website Rank: We extracted the rank of websites and
compare it with the first One hundred thousand websites
of Alexa database. If rank of the website is greater than
10,0000 then feature is set to 1 else to 0.
4. MACHINE LEARNING ALGORITHM
Three machine learning classification model Decision Tree,
Random forest and Support vector machine has been selected
to detect phishing websites.
4.1 Decision Tree Algorithm [5]
One of the most widely used algorithm in machine learning
technology. Decision tree algorithm is easy to understand and
also easy to implement. Decision tree begins its work by
choosing best splitter from the available attributes for
classification which is considered as a root of the tree.
Algorithm continues to build tree until it finds the leaf node.
Decision tree creates training model which is used to predict
target value or class in tree representation each internal node
of the tree belongs to attribute and each leaf node of the tree
belongs to class label. In decision tree algorithm, gini index
and information gain methods are used to calculate these
nodes.
4.2 Random Forest Algorithm [6]
Random forest algorithm is one of the most powerful
algorithms in machine learning technology and it is based on
concept of decision tree algorithm. Random forest algorithm
creates the forest with number of decision trees. High number
of tree gives high detection accuracy.
Creation of trees are based on bootstrap method. In bootstrap
method features and samples of dataset are randomly selected
with replacement to construct single tree. Among randomly
selected features, random forest algorithm will choose best
splitter for the classification and like decision tree algorithm;
Random forest algorithm also uses gini index and information
gain methods to find the best splitter. This process will get
continue until random forest creates n number of trees.
Each tree in forest predicts the target value and then algorithm
will calculate the votes for each predicted target. Finally
random forest algorithm considers high voted predicted target
as a final prediction.
4.3 Support Vector Machine Algorithm [7]
Support vector machine is another powerful algorithm in
machine learning technology. In support vector machine
algorithm each data item is plotted as a point in n-dimensional
space and support vector machine algorithm constructs
separating line for classification of two classes, this separating
line is well known as hyperplane.
Support vector machine seeks for the closest points called as
support vectors and once it finds the closest point it draws a
line connecting to them. Support vector machine then
construct separating line which bisects and perpendicular to
the connecting line. In order to classify data perfectly the
margin should be maximum. Here the margin is a distance
between hyperplane and support vectors. In real scenario it is
not possible to separate complex and non linear data, to solve
this problem support vector machine uses kernel trick which
transforms lower dimensional space to higher dimensional
space.
International Journal of Computer Applications (0975 – 8887)
Volume 181 – No. 23, October 2018
47
Fig. 1 Detection accuracy comparison
5. IMPLEMENTATION AND RESULT
Scikit-learn tool has been used to import Machine learning
algorithms. Dataset is divided into training set and testing set
in 50:50, 70:30 and 90:10 ratios respectively. Each classifier
is trained using training set and testing set is used to evaluate
performance of classifiers. Performance of classifiers has been
evaluated by calculating classifier's accuracy score, false
negative rate and false positive rate.
Table 1: Classifier's performance
Dataset
Split
ratio
Classifiers
Accuracy
Score
False
Negative
Rate
False
Positive
Rate
50:50
Decision Tree 96.71 3.69 2.93
Random
Forest
96.72 3.69 2.91
Support
vector
machine
96.40 5.26 2.08
70:30
Decision Tree 96.80 3.43 2.99
Random
Forest
96.84 3.35 2.98
Support
vector
machine
96.40 5.13 2.17
90:10
Decision Tree 97.11 3.18 2.66
Random
Forest
97.14 3.14 2.61
Support
vector
machine
96.51 4.73 2.34
Result shows that Random forest algorith1m gives better
detection accuracy which is 97.14 with lowest false negative
rate than decision tree and support vector machine algorithms.
Result also shows that detection accuracy of phishing
websites increases as more dataset used as training dataset.
All classifiers perform well when 90% of data used as training
dataset.
Fig. 1 show the detection accuracy of all classifiers when
50%, 70% and 90% of data used as training dataset and graph
clearly shows that detection accuracy increases when 90% of
data used as training dataset and random forest detection
accuracy is maximum than other two classifiers.
6. CONCLUSION
This paper aims to enhance detection method to detect
phishing websites using machine learning technology. We
achieved 97.14% detection accuracy using random forest
algorithm with lowest false positive rate. Also result shows
that classifiers give better performance when we used more
data as training data.
In future hybrid technology will be implemented to detect
phishing websites more accurately, for which random forest
algorithm of machine learning technology and blacklist
method will be used.
7. REFERENCES
[1] Gunter Ollmann, “The Phishing Guide Understanding &
Preventing Phishing Attacks”, IBMInternet Security
Systems, 2007.
[2] https://resources.infosecinstitute.com/category/enterprise
/phishing/the-phishing-landscape/phishing-data-attack-
statistics/#gref
[3] Mahmoud Khonji, Youssef Iraqi, "Phishing Detection: A
Literature Survey IEEE, and Andrew Jones, 2013
[4] Mohammad R., Thabtah F. McCluskey L., (2015)
Phishing websites dataset. Available:
https://archive.ics.uci.edu/ml/datasets/Phishing+Websites
Accessed January 2016
[5] http://dataaspirant.com/2017/01/30/how-decision-tree-
algorithm-works/
[6] http://dataaspirant.com/2017/05/22/random-forest-
algorithm-machine-learing/
[7] https://www.kdnuggets.com/2016/07/support-vector-
machines-simple-explanation.html
[8] www.alexa.com
[9] www.phishtank.com
IJCATM : www.ijcaonline.org
View publication stats
View publication stats

More Related Content

What's hot

PHISHING DETECTION
PHISHING DETECTIONPHISHING DETECTION
PHISHING DETECTION
umme ayesha
 
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
cscpconf
 
Knowledge base compound approach against phishing attacks using some parsing ...
Knowledge base compound approach against phishing attacks using some parsing ...Knowledge base compound approach against phishing attacks using some parsing ...
Knowledge base compound approach against phishing attacks using some parsing ...
csandit
 
A survey on detection of website phishing using mcac technique
A survey on detection of website phishing using mcac techniqueA survey on detection of website phishing using mcac technique
A survey on detection of website phishing using mcac technique
bhas_ani
 

What's hot (20)

Root conf digitalskimming-v4_arjunbm
Root conf digitalskimming-v4_arjunbmRoot conf digitalskimming-v4_arjunbm
Root conf digitalskimming-v4_arjunbm
 
PHISHING DETECTION
PHISHING DETECTIONPHISHING DETECTION
PHISHING DETECTION
 
IRJET- Advanced Phishing Identification Technique using Machine Learning
IRJET-  	  Advanced Phishing Identification Technique using Machine LearningIRJET-  	  Advanced Phishing Identification Technique using Machine Learning
IRJET- Advanced Phishing Identification Technique using Machine Learning
 
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET- Detecting the Phishing Websites using Enhance Secure AlgorithmIRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
 
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
 
Knowledge base compound approach against phishing attacks using some parsing ...
Knowledge base compound approach against phishing attacks using some parsing ...Knowledge base compound approach against phishing attacks using some parsing ...
Knowledge base compound approach against phishing attacks using some parsing ...
 
Url filtration
Url filtrationUrl filtration
Url filtration
 
A survey on detection of website phishing using mcac technique
A survey on detection of website phishing using mcac techniqueA survey on detection of website phishing using mcac technique
A survey on detection of website phishing using mcac technique
 
Phishing
PhishingPhishing
Phishing
 
IRJET- Phishing Website Detection based on Machine Learning
IRJET- Phishing Website Detection based on Machine LearningIRJET- Phishing Website Detection based on Machine Learning
IRJET- Phishing Website Detection based on Machine Learning
 
IRJET- Phishing Website Detection System
IRJET- Phishing Website Detection SystemIRJET- Phishing Website Detection System
IRJET- Phishing Website Detection System
 
Most Common Application Level Attacks
Most Common Application Level AttacksMost Common Application Level Attacks
Most Common Application Level Attacks
 
PPT on Phishing
PPT on PhishingPPT on Phishing
PPT on Phishing
 
Web Application Security Tips
Web Application Security TipsWeb Application Security Tips
Web Application Security Tips
 
Review of the machine learning methods in the classification of phishing attack
Review of the machine learning methods in the classification of phishing attackReview of the machine learning methods in the classification of phishing attack
Review of the machine learning methods in the classification of phishing attack
 
Teaching Your Staff About Phishing
Teaching Your Staff About PhishingTeaching Your Staff About Phishing
Teaching Your Staff About Phishing
 
IJSRED-V2I4P0
IJSRED-V2I4P0IJSRED-V2I4P0
IJSRED-V2I4P0
 
IRJET- Preventing Phishing Attack using Evolutionary Algorithms
IRJET-  	  Preventing Phishing Attack using Evolutionary AlgorithmsIRJET-  	  Preventing Phishing Attack using Evolutionary Algorithms
IRJET- Preventing Phishing Attack using Evolutionary Algorithms
 
Phishing
PhishingPhishing
Phishing
 
Analyzing the effectualness of Phishing Algorithms in Web Applications Inques...
Analyzing the effectualness of Phishing Algorithms in Web Applications Inques...Analyzing the effectualness of Phishing Algorithms in Web Applications Inques...
Analyzing the effectualness of Phishing Algorithms in Web Applications Inques...
 

Similar to Phishing_Website_Detection_using_Machine_Learning_.pdf

Report - Final_New_phishila
Report - Final_New_phishilaReport - Final_New_phishila
Report - Final_New_phishila
Ashwin Palani
 
Improving Phishing URL Detection Using Fuzzy Association Mining
Improving Phishing URL Detection Using Fuzzy Association MiningImproving Phishing URL Detection Using Fuzzy Association Mining
Improving Phishing URL Detection Using Fuzzy Association Mining
theijes
 
Artificial intelligence presentation slides.pptx
Artificial intelligence presentation slides.pptxArtificial intelligence presentation slides.pptx
Artificial intelligence presentation slides.pptx
rakhicse
 

Similar to Phishing_Website_Detection_using_Machine_Learning_.pdf (20)

IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
 
Phishing Website Detection using Classification Algorithms
Phishing Website Detection using Classification AlgorithmsPhishing Website Detection using Classification Algorithms
Phishing Website Detection using Classification Algorithms
 
Phishing Website Detection Using Machine Learning
Phishing Website Detection Using Machine LearningPhishing Website Detection Using Machine Learning
Phishing Website Detection Using Machine Learning
 
Detecting Phishing Websites Using Machine Learning
Detecting Phishing Websites Using Machine LearningDetecting Phishing Websites Using Machine Learning
Detecting Phishing Websites Using Machine Learning
 
IRJET - Chrome Extension for Detecting Phishing Websites
IRJET -  	  Chrome Extension for Detecting Phishing WebsitesIRJET -  	  Chrome Extension for Detecting Phishing Websites
IRJET - Chrome Extension for Detecting Phishing Websites
 
Report - Final_New_phishila
Report - Final_New_phishilaReport - Final_New_phishila
Report - Final_New_phishila
 
PUMMP: PHISHING URL DETECTION USING MACHINE LEARNING WITH MONOMORPHIC AND POL...
PUMMP: PHISHING URL DETECTION USING MACHINE LEARNING WITH MONOMORPHIC AND POL...PUMMP: PHISHING URL DETECTION USING MACHINE LEARNING WITH MONOMORPHIC AND POL...
PUMMP: PHISHING URL DETECTION USING MACHINE LEARNING WITH MONOMORPHIC AND POL...
 
PUMMP: Phishing URL Detection using Machine Learning with Monomorphic and Pol...
PUMMP: Phishing URL Detection using Machine Learning with Monomorphic and Pol...PUMMP: Phishing URL Detection using Machine Learning with Monomorphic and Pol...
PUMMP: Phishing URL Detection using Machine Learning with Monomorphic and Pol...
 
Phishing Detection using Decision Tree Model
Phishing Detection using Decision Tree ModelPhishing Detection using Decision Tree Model
Phishing Detection using Decision Tree Model
 
A literature survey on anti phishing
A literature survey on anti phishingA literature survey on anti phishing
A literature survey on anti phishing
 
Detection of Phishing Websites
Detection of Phishing Websites Detection of Phishing Websites
Detection of Phishing Websites
 
Improving Phishing URL Detection Using Fuzzy Association Mining
Improving Phishing URL Detection Using Fuzzy Association MiningImproving Phishing URL Detection Using Fuzzy Association Mining
Improving Phishing URL Detection Using Fuzzy Association Mining
 
IRJET - PHISCAN : Phishing Detector Plugin using Machine Learning
IRJET - PHISCAN : Phishing Detector Plugin using Machine LearningIRJET - PHISCAN : Phishing Detector Plugin using Machine Learning
IRJET - PHISCAN : Phishing Detector Plugin using Machine Learning
 
[IJET V2I5P15] Authors: V.Preethi, G.Velmayil
[IJET V2I5P15] Authors: V.Preethi, G.Velmayil[IJET V2I5P15] Authors: V.Preethi, G.Velmayil
[IJET V2I5P15] Authors: V.Preethi, G.Velmayil
 
A security note for web developers
A security note for web developersA security note for web developers
A security note for web developers
 
Low Cost Page Quality Factors To Detect Web Spam
Low Cost Page Quality Factors To Detect Web SpamLow Cost Page Quality Factors To Detect Web Spam
Low Cost Page Quality Factors To Detect Web Spam
 
Low Cost Page Quality Factors To Detect Web Spam
Low Cost Page Quality Factors To Detect Web SpamLow Cost Page Quality Factors To Detect Web Spam
Low Cost Page Quality Factors To Detect Web Spam
 
Low Cost Page Quality Factors To Detect Web Spam
Low Cost Page Quality Factors To Detect Web Spam Low Cost Page Quality Factors To Detect Web Spam
Low Cost Page Quality Factors To Detect Web Spam
 
Artificial intelligence presentation slides.pptx
Artificial intelligence presentation slides.pptxArtificial intelligence presentation slides.pptx
Artificial intelligence presentation slides.pptx
 
Study on Phishing Attacks and Antiphishing Tools
Study on Phishing Attacks and Antiphishing ToolsStudy on Phishing Attacks and Antiphishing Tools
Study on Phishing Attacks and Antiphishing Tools
 

Recently uploaded

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
meharikiros2
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Recently uploaded (20)

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Signal Processing and Linear System Analysis
Signal Processing and Linear System AnalysisSignal Processing and Linear System Analysis
Signal Processing and Linear System Analysis
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptx
 
👉 Yavatmal Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...
👉 Yavatmal Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...👉 Yavatmal Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...
👉 Yavatmal Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl S...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information Systems
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Computer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesComputer Graphics Introduction To Curves
Computer Graphics Introduction To Curves
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)
 
Ground Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementGround Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth Reinforcement
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .ppt
 
Electromagnetic relays used for power system .pptx
Electromagnetic relays used for power system .pptxElectromagnetic relays used for power system .pptx
Electromagnetic relays used for power system .pptx
 

Phishing_Website_Detection_using_Machine_Learning_.pdf

  • 1. See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/328541785 Phishing Website Detection using Machine Learning Algorithms Article  in  International Journal of Computer Applications · October 2018 DOI: 10.5120/ijca2018918026 CITATIONS 12 READS 27,305 2 authors: Some of the authors of this publication are also working on these related projects: Reinforcement Learning for anomalous path detection using wearable View project Phishing Website Detection using Machine Learning Algorithms View project Rishikesh Mahajan Somaiya Vidyavihar 1 PUBLICATION   12 CITATIONS    SEE PROFILE Irfan Siddavatam Somaiya Vidyavihar 42 PUBLICATIONS   90 CITATIONS    SEE PROFILE All content following this page was uploaded by Rishikesh Mahajan on 14 June 2019. The user has requested enhancement of the downloaded file.
  • 2. International Journal of Computer Applications (0975 – 8887) Volume 181 – No. 23, October 2018 45 Phishing Website Detection using Machine Learning Algorithms Rishikesh Mahajan MTECH Information Technology K.J. Somaiya College of Engineering, Mumbai - 77 Irfan Siddavatam Professor, Dept. Information Technology K.J. Somaiya College of Engineering, Mumbai - 77 ABSTRACT Phishing attack is a simplest way to obtain sensitive information from innocent users. Aim of the phishers is to acquire critical information like username, password and bank account details. Cyber security persons are now looking for trustworthy and steady detection techniques for phishing websites detection. This paper deals with machine learning technology for detection of phishing URLs by extracting and analyzing various features of legitimate and phishing URLs. Decision Tree, random forest and Support vector machine algorithms are used to detect phishing websites. Aim of the paper is to detect phishing URLs as well as narrow down to best machine learning algorithm by comparing accuracy rate, false positive and false negative rate of each algorithm. Keywords Phishing attack, Machine learning 1. INTRODUCTION Nowadays Phishing becomes a main area of concern for security researchers because it is not difficult to create the fake website which looks so close to legitimate website. Experts can identify fake websites but not all the users can identify the fake website and such users become the victim of phishing attack. Main aim of the attacker is to steal banks account credentials. In United States businesses, there is a loss of US$2billion per year because their clients become victim to phishing [1]. In 3rd Microsoft Computing Safer Index Report released in February 2014, it was estimated that the annual worldwide impact of phishing could be as high as $5 billion [2]. Phishing attacks are becoming successful because lack of user awareness. Since phishing attack exploits the weaknesses found in users, it is very difficult to mitigate them but it is very important to enhance phishing detection techniques. The general method to detect phishing websites by updating blacklisted URLs, Internet Protocol (IP) to the antivirus database which is also known as “blacklist" method. To evade blacklists attackers uses creative techniques to fool users by modifying the URL to appear legitimate via obfuscation and many other simple techniques including: fast-flux, in which proxies are automatically generated to host the web-page; algorithmic generation of new URLs; etc. Major drawback of this method is that, it cannot detect zero-hour phishing attack. Heuristic based detection which includes characteristics that are found to exist in phishing attacks in reality and can detect zero-hour phishing attack, but the characteristics are not guaranteed to always exist in such attacks and false positive rate in detection is very high [3]. To overcome the drawbacks of blacklist and heuristics based method, many security researchers now focused on machine learning techniques. Machine learning technology consists of a many algorithms which requires past data to make a decision or prediction on future data. Using this technique, algorithm will analyze various blacklisted and legitimate URLs and their features to accurately detect the phishing websites including zero- hour phishing websites. 2. DATASET URLs of benign websites were collected from www.alexa.com and The URLs of phishing websites were collected from www.phishtank.com. The data set consists of total 36,711 URLs which include 17058 benign URLs and 19653 phishing URLs. Benign URLs are labelled as “0” and phishing URLs are labelled as “1”. 3. FEATURE EXTRACTION We have implemented python program to extract features from URL. Below are the features that we have extracted for detection of phishing URLs. 1) Presence of IP address in URL: If IP address present in URL then the feature is set to 1 else set to 0. Most of the benign sites do not use IP address as an URL to download a webpage. Use of IP address in URL indicates that attacker is trying to steal sensitive information. 2) Presence of @ symbol in URL: If @ symbol present in URL then the feature is set to 1 else set to 0. Phishers add special symbol @ in the URL leads the browser to ignore everything preceding the “@” symbol and the real address often follows the “@” symbol [4]. 3) Number of dots in Hostname: Phishing URLs have many dots in URL. For example http://shop.fun.amazon.phishing.com, in this URL phishing.com is an actual domain name, whereas use of “amazon” word is to trick users to click on it. Average number of dots in benign URLs is 3. If the number of dots in URLs is more than 3 then the feature is set to 1 else to 0. 4) Prefix or Suffix separated by (-) to domain: If domain name separated by dash (-) symbol then feature is set to 1 else to 0. The dash symbol is rarely used in legitimate URLs. Phishers add dash symbol (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example Actual site is http://www.onlineamazon.com but phisher can create another fake website like http://www.online-amazon.com to confuse the innocent users. 5) URL redirection: If “//” present in URL path then feature is set to 1 else to 0. The existence of “//” within the URL path means that the user will be redirected to another website [4]. 6) HTTPS token in URL: If HTTPS token present in
  • 3. International Journal of Computer Applications (0975 – 8887) Volume 181 – No. 23, October 2018 46 URL then the feature is set to 1 else to 0. Phishers may add the “HTTPS” token to the domain part of a URL in order to trick users. For example, http://https-www- paypal-it-mpp-home.soft-hair.com [4]. 7) Information submission to Email: Phisher might use “mail()” or “mailto:” functions to redirect the user’s information to his personal email[4] . If such functions are present in the URL then feature is set to 1 else to 0. 8) URL Shortening Services “TinyURL”: TinyURL service allows phisher to hide long phishing URL by making it short. The goal is to redirect user to phishing websites. If the URL is crafted using shortening services (like bit.ly) then feature is set to 1 else 0 9) Length of Host name: Average length of the benign URLs is found to be a 25, If URL’s length is greater than 25 then the feature is set to 1 else to 0 10) Presence of sensitive words in URL: Phishing sites use sensitive words in its URL so that users feel that they are dealing with a legitimate webpage. Below are the words that found in many phishing URLs :- 'confirm', 'account', 'banking', 'secure', 'ebyisapi', 'webscr', 'signin', 'mail', 'install', 'toolbar', 'backup', 'paypal', 'password', 'username', etc; 11) Number of slash in URL: The number of slashes in benign URLs is found to be a 5; if number of slashes in URL is greater than 5 then the feature is set to 1 else to 0. 12) Presence of Unicode in URL: Phishers can make a use of Unicode characters in URL to trick users to click on it. For example the domain “xn--80ak6aa92e.com” is equivalent to "аррӏе.com". Visible URL to user is "аррӏе.com" but after clicking on this URL, user will visit to “xn--80ak6aa92e.com” which is a phishing site. 13) Age of SSL Certificate: The existence of HTTPS is very important in giving the impression of website legitimacy [4] . But minimum age of the SSL certificate of benign website is between 1 year to 2 year. 14) URL of Anchor: We have extracted this feature by crawling the source code oh the URL. URL of the anchor is defined by <a> tag. If the <a> tag has a maximum number of hyperlinks which are from the other domain then the feature is set to 1 else to 0. 15) IFRAME: We have extracted this feature by crawling the source code of the URL. This tag is used to add another web page into existing main webpage. Phishers can make use of the “iframe” tag and make it invisible i.e. without frame borders [4]. Since border of inserted webpage is invisible, user seems that the inserted web page is also the part of the main web page and can enter sensitive information. 16) Website Rank: We extracted the rank of websites and compare it with the first One hundred thousand websites of Alexa database. If rank of the website is greater than 10,0000 then feature is set to 1 else to 0. 4. MACHINE LEARNING ALGORITHM Three machine learning classification model Decision Tree, Random forest and Support vector machine has been selected to detect phishing websites. 4.1 Decision Tree Algorithm [5] One of the most widely used algorithm in machine learning technology. Decision tree algorithm is easy to understand and also easy to implement. Decision tree begins its work by choosing best splitter from the available attributes for classification which is considered as a root of the tree. Algorithm continues to build tree until it finds the leaf node. Decision tree creates training model which is used to predict target value or class in tree representation each internal node of the tree belongs to attribute and each leaf node of the tree belongs to class label. In decision tree algorithm, gini index and information gain methods are used to calculate these nodes. 4.2 Random Forest Algorithm [6] Random forest algorithm is one of the most powerful algorithms in machine learning technology and it is based on concept of decision tree algorithm. Random forest algorithm creates the forest with number of decision trees. High number of tree gives high detection accuracy. Creation of trees are based on bootstrap method. In bootstrap method features and samples of dataset are randomly selected with replacement to construct single tree. Among randomly selected features, random forest algorithm will choose best splitter for the classification and like decision tree algorithm; Random forest algorithm also uses gini index and information gain methods to find the best splitter. This process will get continue until random forest creates n number of trees. Each tree in forest predicts the target value and then algorithm will calculate the votes for each predicted target. Finally random forest algorithm considers high voted predicted target as a final prediction. 4.3 Support Vector Machine Algorithm [7] Support vector machine is another powerful algorithm in machine learning technology. In support vector machine algorithm each data item is plotted as a point in n-dimensional space and support vector machine algorithm constructs separating line for classification of two classes, this separating line is well known as hyperplane. Support vector machine seeks for the closest points called as support vectors and once it finds the closest point it draws a line connecting to them. Support vector machine then construct separating line which bisects and perpendicular to the connecting line. In order to classify data perfectly the margin should be maximum. Here the margin is a distance between hyperplane and support vectors. In real scenario it is not possible to separate complex and non linear data, to solve this problem support vector machine uses kernel trick which transforms lower dimensional space to higher dimensional space.
  • 4. International Journal of Computer Applications (0975 – 8887) Volume 181 – No. 23, October 2018 47 Fig. 1 Detection accuracy comparison 5. IMPLEMENTATION AND RESULT Scikit-learn tool has been used to import Machine learning algorithms. Dataset is divided into training set and testing set in 50:50, 70:30 and 90:10 ratios respectively. Each classifier is trained using training set and testing set is used to evaluate performance of classifiers. Performance of classifiers has been evaluated by calculating classifier's accuracy score, false negative rate and false positive rate. Table 1: Classifier's performance Dataset Split ratio Classifiers Accuracy Score False Negative Rate False Positive Rate 50:50 Decision Tree 96.71 3.69 2.93 Random Forest 96.72 3.69 2.91 Support vector machine 96.40 5.26 2.08 70:30 Decision Tree 96.80 3.43 2.99 Random Forest 96.84 3.35 2.98 Support vector machine 96.40 5.13 2.17 90:10 Decision Tree 97.11 3.18 2.66 Random Forest 97.14 3.14 2.61 Support vector machine 96.51 4.73 2.34 Result shows that Random forest algorith1m gives better detection accuracy which is 97.14 with lowest false negative rate than decision tree and support vector machine algorithms. Result also shows that detection accuracy of phishing websites increases as more dataset used as training dataset. All classifiers perform well when 90% of data used as training dataset. Fig. 1 show the detection accuracy of all classifiers when 50%, 70% and 90% of data used as training dataset and graph clearly shows that detection accuracy increases when 90% of data used as training dataset and random forest detection accuracy is maximum than other two classifiers. 6. CONCLUSION This paper aims to enhance detection method to detect phishing websites using machine learning technology. We achieved 97.14% detection accuracy using random forest algorithm with lowest false positive rate. Also result shows that classifiers give better performance when we used more data as training data. In future hybrid technology will be implemented to detect phishing websites more accurately, for which random forest algorithm of machine learning technology and blacklist method will be used. 7. REFERENCES [1] Gunter Ollmann, “The Phishing Guide Understanding & Preventing Phishing Attacks”, IBMInternet Security Systems, 2007. [2] https://resources.infosecinstitute.com/category/enterprise /phishing/the-phishing-landscape/phishing-data-attack- statistics/#gref [3] Mahmoud Khonji, Youssef Iraqi, "Phishing Detection: A Literature Survey IEEE, and Andrew Jones, 2013 [4] Mohammad R., Thabtah F. McCluskey L., (2015) Phishing websites dataset. Available: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites Accessed January 2016 [5] http://dataaspirant.com/2017/01/30/how-decision-tree- algorithm-works/ [6] http://dataaspirant.com/2017/05/22/random-forest- algorithm-machine-learing/ [7] https://www.kdnuggets.com/2016/07/support-vector- machines-simple-explanation.html [8] www.alexa.com [9] www.phishtank.com IJCATM : www.ijcaonline.org View publication stats View publication stats