IJSRED-V2I4P0

International Journal of Scientific Research and Engineering Development – Volume 2 Issue 4, July – Aug 2019
Available at www.ijsred.com
ISSN : 2581-7175 ©IJSRED: All Rights are Reserved Page 109
Detection of Legitimate Websites
Mahezabeen N Ilkal
Department of Computer Science and Engineering, SECAB Institute of Engineering and Technology, Bijapur, Karnataka
(zabeenbakshi@gmail.com)
-------------------------------------************************-----------------------------------
Abstract:
Phishing is the practice of mimicking a trusted company's website in order to steal a
victim's private information. Different solutions have been proposed to eliminate phishing.
However, no single solution can be expected to eliminate the phishing threat completely.
Data mining is a promising technique for detecting phishing attacks. In this paper, an
intelligent system to detect phishing attacks is presented. We used different data mining
techniques to assign websites to one of two categories: legitimate or phishing. Different
classifiers were used in order to construct an accurate intelligent system for phishing
website detection.
Keywords—Phishing, WHOIS, cybercrime
-----------------------------------***********************---------------------------------------
I. INTRODUCTION
Phishing is a cybercrime in which an attacker
imitates a trusted entity (a real organization or
person) through email, social media, or another
communication medium, and sends malicious links
or attachments in order to obtain personal data
such as bank details. Victims may suffer financial
loss or identity theft. According to a survey, more
than half (58%) of organizations had seen an
increase in phishing attacks in the past years.
Phishing can take the form of e-mail messages that
claim to come from recognized sources and ask the
recipient to verify an account, enter personal
information, or make a payment (for example,
details that give access to a bank account). Other
forms include fraudulent hijacking of a website
domain name and Google Docs phishing, among
many others. Data obtained through the internet
can be informative, but some content can be
inappropriate or related to harmful phishing
attacks that the user needs to avoid. Phishing is a
fraudulent, criminal attempt to obtain sensitive
personal information such as user names,
passwords and credit card details by posing as a
trustworthy entity in an electronic communication.
It is typically carried out by instant messaging or
email spoofing, and it often directs users to enter
personal information at a fake website whose look
and feel are identical to the legitimate site,
causing victims to suffer identity theft or
financial loss.
Phishing is a type of social engineering
technique used to deceive web users. Users are
often tricked by communications purporting to be
from trusted parties such as social media sites,
online banks, online payment portals or IT
(Information Technology) administrators.
Phishers use it as a lure to exploit the personal
details of unsuspecting users. A phishing website
is a mock website that looks similar in appearance
to a legitimate one but leads to a different
destination. Unsuspecting users submit their data
thinking that these websites belong to trusted
financial institutions. New anti-phishing
techniques emerge continuously, but phishers
respond with new techniques that defeat the
existing anti-phishing mechanisms. Hence there is
a need for an efficient mechanism for the
prediction of phishing websites.
Phishing types:
Spear phishing: Attackers gather personal
information about the victim and use it to
target that victim, increasing their
probability of success.
Clone phishing: Clone phishing creates a
nearly identical copy (a clone) of a
legitimate communication from a trusted
entity and uses it for the attack. The
attacker sends the clone by email from an
address spoofed to appear to come from the
original sender, with the original attachment
or link replaced by a malicious version. It
may claim to be linked to the original entity.
Clone phishing can be used to pivot
(indirectly) from a previously infected
machine and gain a foothold on another
machine, by exploiting the social trust
associated with the inferred connection due
to both parties receiving the original email.
Whaling: Whaling is directed specifically at
senior executives and other high-profile
targets. The content is crafted to target an
upper manager and that person's role in the
company.
Link manipulation: Attackers use some form
of technical deception designed to make a
link in an email (and the spoofed website it
leads to) appear to belong to the spoofed
organization. Common tricks are misspelled
URLs and the use of sub-domains.
Filter evasion: Instead of text, attackers use
images, making it hard for anti-phishing
filters to detect the text commonly used in
phishing emails.
Website forgery: Phishers use JavaScript
commands to alter the address bar of the
website they lead to, either by placing a
picture of a legitimate URL over the address
bar or by closing the original bar and
opening a new one with the legitimate URL.
II. LITERATURE SURVEY
1. Mayank Pandey and Vadlamani Ravi [1]
proposed a model for email classification
that uses twenty-three keywords taken from
the email body as input to a classification
algorithm. The model was tested with
several classifiers, including neural network,
genetic programming, support vector
machine, logistic regression, multilayer
perceptron and decision tree. The best
classification accuracy, 98.12%, was
achieved by genetic programming.
2. Sunil B. Rathod and Tareek M. Pattewar [2]
presented phishing email detection using a
Bayesian classifier, evaluated in terms of
accuracy, precision, time, error and recall.
The accuracy of this classifier is 96.46%.
3. Lew May Form, Kang Leng Chiew, San
Nah Sze and Wei King Tiong [3] used a
support vector machine (SVM) classifier
applied to a set of 9 behaviour-based and
structure-based features. The accuracy of
this classifier is 97.25%, but its weakness is
a relatively small training data set (1000
emails, 50% spam and 50% ham).
4. Tareek M. Pattewar and Sunil B. Rathod [4]
proposed an integrated email classification
algorithm that combines a Bayesian
classifier with phishing website detection
using a decision tree (C4.5). The integrated
classifier gave an accuracy of 95.54%, an
improvement over the Bayesian classifier
alone, which gave 94.86%.
5. Jian Mao, Wenqian Tian and Zhenkai
Liang [5] proposed a system that uses page
component similarity for phishing detection.
It analyses URL tokens to increase
prediction accuracy, relying on the fact that
phishing pages keep their CSS style similar
to that of their target pages. The authors
prototyped Phishing-Alarm as an extension
to the Google Chrome browser and
demonstrated its efficiency in an evaluation
using real-world phishing samples.
6. Zou Futai, Pei Bei and Pan Li [6] apply
graph mining techniques to web phishing
detection. Their approach identifies some
potential phishing sites that cannot be
detected by URL analysis, using the visiting
relation between websites and users.
7. Xin Mei Choo, Kang Leng Chiew and
Nadianatra Musa [7] use a support vector
machine classifier. The technique extracts
and forms a feature set for each web page. It
has two phases, training and testing: the
feature set is extracted during the training
phase, and a page is predicted to be
legitimate or phishing during the testing
phase.
8. Varsharani Ramdas Hawanna, V. Y.
Kulkarni and R. A. Rane [8] proposed a
novel algorithm for phishing website
detection that can identify a number of
phishing websites because it executes several
tests, such as an Alexa ranking test, a
blacklist search test and various URL feature
tests. However, this methodology is only
suited to HTTP URLs.
9. Jun Hu, Hanbing Yan and Yuchun Ji [9]
proposed a system that detects phishing
URLs by analysing the server log
information of legitimate websites.
Whenever a victim opens a phishing URL,
the phishing website refers to the legitimate
website to request resources. These requests
are recorded in the logs of the legitimate
website's servers, and phishing sites can be
detected from these logs.
10. Samuel Marchal, Nidhi Singh and
Giovanni Armano [10] proposed the
Off-the-Hook application for phishing URL
detection. Off-the-Hook has several
important properties: accuracy, brand
independence, speed of decision, good
language independence, resilience to
dynamic phish and resilience to evolution in
phishing techniques.
III. PROBLEM STATEMENT
Phishing websites encourage the growth of
cybercrime and restrict the development of
web services. As a result, there is a strong
need to stop these criminal activities on the
internet by developing robust anti-phishing
solutions.
OBJECTIVE
Phishing is popular among attackers because
it is much easier to trick someone into
clicking a malicious link that looks
legitimate than to break through a
computer's security system to access
confidential information.
Detecting and stopping phishing websites is a
complex and dynamic problem involving many
factors and criteria. Phishing is the practice of
mimicking a trusted company's website in order
to steal a victim's private information. Different
solutions have been proposed to eliminate
phishing. However, no single solution can be
expected to eliminate the phishing threat
completely. Data mining is a promising
technique for detecting phishing attacks. In this
paper, an intelligent system to detect phishing
attacks is presented. We used different data
mining techniques to assign websites to one of
two categories: legitimate or phishing. Different
classifiers were used in order to construct an
accurate intelligent system for phishing website
detection. The project objective is to identify and
classify various types of spam or unwanted data
from various web pages, thereby improving the
overall user experience and security for online
applications.
IV. EXISTING SYSTEM
Many phishing detection systems have been
developed to stop this cybercrime:
• Detecting and blocking phishing websites
manually, in time.
• Enhancing security while developing websites.
• Using various spam filters to block phishing
emails.
• Installing online anti-phishing software on
computers.
One existing detection technique is the
blacklist-based technique, which has a low false
alarm probability but cannot detect websites that
are not in the blacklist database. Because the life
cycle of phishing websites is short and
establishing a blacklist has a long lag time, the
accuracy of blacklist prediction is not high.
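The limitation of blacklisting described above can be illustrated with a minimal Python sketch. The blacklist entries and host names below are hypothetical examples chosen for illustration, not data from this work.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of known phishing hosts (illustrative values only).
BLACKLIST = {"login-paypa1.example", "secure-bank.example"}


def is_blacklisted(url, blacklist=BLACKLIST):
    """Return True if the URL's host name appears in the blacklist."""
    host = urlparse(url).hostname or ""
    return host.lower() in blacklist


known = "http://login-paypa1.example/signin"
fresh = "http://login-paypa2.example/signin"  # new site, not yet listed

print(is_blacklisted(known))  # True: the host was already reported
print(is_blacklisted(fresh))  # False: missed until the list is updated
```

A listed host is rarely a false alarm, but a newly created phishing host passes the check until someone reports it, which is the lag discussed above.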
V. PROPOSED SYSTEM
Phishing is a fraud technique used by
cybercriminals with the intention of exploiting
the financial or personal information of victims.
Several anti-phishing techniques have been
invented, but phishers keep finding new ways of
breaking them and fooling victims. Therefore a
much more efficient technique is needed for the
prediction of malicious URLs.
This project employs machine learning
techniques for the prediction task. Supervised
learning algorithms such as decision tree and
random forest are used to explore the results.
Steps involved in this project:
1. Feature Extraction
Feature extraction is a process in which
dimensionality reduction takes place: an
initial set of raw variables is reduced to more
manageable features for processing. Here it
involves extracting the features of Uniform
Resource Locators (URLs).
2. Pre-processing of data
Pre-processing of data involves removing
data that is unnecessary for training the model.
3. Training the model
i. Decision Tree algorithm
Training the model with the Decision Tree
algorithm and calculating the accuracy.
ii. Random Forest algorithm
Training the model with the Random Forest
algorithm and calculating the accuracy.
The best fit for the problem is then chosen.
4. Evaluating and Testing the model
Any classification algorithm can be used,
such as SVM (Support Vector Machine),
KNN (K Nearest Neighbors) or Naive Bayes,
but we test only with Decision Tree C5.0
and Random Forest, because many works
cited in this project state that Random Forest
is the best fit.
5. Testing the model on an input URL
An input URL is provided from either a
desktop app or a web app and is classified as
legitimate or phishing.
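Steps 3 and 4 amount to computing the accuracy of each trained model on held-out data and keeping the better one. The following Python sketch shows that comparison. The labels and predictions are made-up values used only to exercise the functions; they are not results from this project, and the names `accuracy` and `best_model` are hypothetical helpers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def best_model(y_true, predictions_by_model):
    """Return (name, accuracy) of the model with the highest accuracy."""
    scores = {name: accuracy(y_true, pred)
              for name, pred in predictions_by_model.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Made-up held-out labels: 1 = phishing, 0 = legitimate.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
# Made-up predictions standing in for the output of two trained models.
predictions = {
    "decision_tree": [1, 0, 0, 1, 0, 1, 1, 0],  # 6 of 8 correct
    "random_forest": [1, 0, 1, 1, 0, 1, 1, 0],  # 7 of 8 correct
}

print(best_model(y_test, predictions))  # ('random_forest', 0.875)
```

Accuracy alone can be misleading when one class is much rarer than the other, so precision and recall may be worth reporting alongside it.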
The various features considered are:
1. Long URL to Hide the Suspicious Part
If the length of the URL is greater than or
equal to 54 characters, the URL is
classified as phishing.
2. URLs Having the “@” Symbol
Using the “@” symbol in a URL leads the
browser to ignore everything preceding the
“@” symbol; the real address often follows
it.
3. Redirecting Using “//”
The existence of “//” within the URL path
means that the user will be redirected to
another website. We examine the position at
which the “//” appears: if the URL starts
with “HTTP”, the “//” should appear in the
sixth position, and if the URL employs
“HTTPS”, the “//” should appear in the
seventh position.
4. Sub-Domain and Multi Sub-Domains
A legitimate URL has two dots (or fewer,
since typing the “www.” prefix can be skipped).
If the number of dots is equal to three, the URL
is classified as “Suspicious” since it has one
sub-domain. If the number of dots is greater
than three, it is classified as phishing since it
has multiple sub-domains.
5. Adding a Prefix or Suffix Separated by (-) to the
Domain
The dash symbol is rarely used in legitimate URLs.
Phishers tend to add prefixes or suffixes separated
by (-) to the domain name so that users feel that
they are dealing with a legitimate web page.
6. Using the IP Address
If an IP address is used in place of the domain
name in the URL, users can be sure that someone
is trying to steal their personal information.
Sometimes the IP address is even transformed
into hexadecimal code.
7. Using URL Shortening Services (“TinyURL”)
URL shortening is a method on the World Wide
Web in which a URL is made considerably shorter
and still leads to the required web page. This is
accomplished by means of an HTTP redirect on a
short domain name, which links to the web page
that has the long URL.
8. The Existence of the “HTTPS” Token in the
Domain Part of the URL
Phishers may add the “HTTPS” token to the
domain part of a URL in order to trick users. For
example,
http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/.
9. Abnormal URL
This feature can be extracted from the WHOIS
database. For a legitimate website, identity is
typically part of its URL.

10. Google Index
This feature examines whether a website is in
Google's index. When a site is indexed by
Google, it is displayed in search results
(Webmaster resources, 2014). Phishing web
pages are usually accessible only for a short
period, and as a result many of them may not be
found in the Google index.
11. Website Traffic
This feature measures the popularity of a
website by determining the number of visitors
and the number of pages they visit. Since
phishing websites live for a short period of time,
they may not be recognized by the Alexa
database (Alexa the Web Information Company,
1996). If the domain has no traffic or is not
recognized by the Alexa database, it is classified
as “Phishing”. Otherwise, it is classified as
“Suspicious”.
12. Domain Registration Length - 1 year
Since a phishing website lives for a short period
of time, we believe that trustworthy domains are
regularly paid for several years in advance. In
the data set, the longest-lived fraudulent domains
were used for one year only.
13. Age of Domain - 6 months
This feature can be extracted from the WHOIS
database (Whois 2005). Most phishing websites
live for a short period of time. By reviewing our
data set, we find that the minimum age of a
legitimate domain is 6 months.
14. DNS Record
For phishing websites, either the claimed
identity is not recognized by the WHOIS
database (Whois 2005) or no records are found
for the host name (Pan and Ding 2006). If the
DNS record is empty or not found, the website
is classified as “Phishing”; otherwise it is
classified as “Legitimate”.
15. Statistical-Reports Based Feature
Several parties, such as PhishTank (PhishTank
Stats, 2010-2012) and StopBadware
(StopBadware, 2010-2012), publish statistical
reports on phishing websites at regular
intervals; some are monthly and others are
quarterly.
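Features 1 to 8 depend only on the URL string, so they can be computed without contacting any external service. The Python sketch below is one possible reading of those rules. The 54-character threshold and the dot-count rule come from the text above. The list of shortening hosts, the regular expression for IP addresses, and the decision to count dots in the host name only are assumptions made for illustration, and the function name `url_features` is hypothetical. The sketch expects URLs that include a scheme such as `http://`.

```python
import re
from urllib.parse import urlparse

# Assumed short list of URL shortening hosts; a real list would be longer.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}

# One part of a dotted IPv4 address, in decimal or hexadecimal (e.g. 0xCC).
IP_PART = r"(?:0x[0-9a-f]{1,2}|\d{1,3})"
IP_PATTERN = re.compile(rf"^{IP_PART}(?:\.{IP_PART}){{3}}$", re.IGNORECASE)


def url_features(url):
    """Compute the URL-only features 1-8 for a URL that includes a scheme."""
    host = (urlparse(url).hostname or "").lower()

    # Feature 3: a "//" after the scheme separator suggests a redirect.
    scheme_end = url.find("://")
    redirect = url.rfind("//") > scheme_end + 1

    # Feature 4: dots in the host name give a three-way label.
    dots = host.count(".")
    if dots <= 2:
        subdomain = "legitimate"
    elif dots == 3:
        subdomain = "suspicious"
    else:
        subdomain = "phishing"

    return {
        "long_url": len(url) >= 54,                # feature 1
        "has_at": "@" in url,                      # feature 2
        "redirect": redirect,                      # feature 3
        "subdomain": subdomain,                    # feature 4
        "has_dash": "-" in host,                   # feature 5
        "ip_host": bool(IP_PATTERN.match(host)),   # feature 6
        "shortener": host in SHORTENERS,           # feature 7
        "https_token": "https" in host,            # feature 8
    }


example = "http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/"
for name, value in url_features(example).items():
    print(name, value)
```

The remaining features (WHOIS, Google index, Alexa traffic, DNS) need lookups against external databases and are therefore left out of this sketch.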
Figure 1: Data flow diagram
Module Description
Data reading: This stage converts the CSV data
file into a raw data frame on which analysis and
processing can be done.
Data pre-processing: Pre-processing of data
involves removing data that is unnecessary for
training the model. Data pre-processing is a data
mining technique that transforms raw data into
an understandable format.
Feature extraction: Feature extraction is a
process in which dimensionality reduction takes
place: an initial set of raw variables is reduced
to more manageable groups (features) for
processing, while still accurately and completely
describing the original data set.
Data visualization:
Data visualization is a technique used to
communicate data or information by encoding it
as visual objects (e.g., points, lines or bars)
contained in graphics. The main goal is to
communicate data clearly and efficiently to
users in graphical form.
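The data reading and pre-processing modules can be sketched with the Python standard library alone. The CSV content and column names below are hypothetical, and treating rows with missing values and exact duplicates as "unnecessary data" is an assumption, since the text does not specify what is removed. A list of dictionaries stands in for the data frame mentioned above.

```python
import csv
import io

# Hypothetical CSV content; the column names are assumptions for illustration.
RAW_CSV = """url_length,has_at,has_dash,label
60,1,1,phishing
23,0,0,legitimate
23,0,0,legitimate
,0,1,phishing
"""


def read_rows(text):
    """Data reading: parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """Pre-processing: drop rows with missing values and exact duplicates."""
    seen = set()
    clean = []
    for row in rows:
        values = tuple(row.values())
        if any(v is None or v == "" for v in values) or values in seen:
            continue
        seen.add(values)
        clean.append(row)
    return clean


def to_xy(rows, label_column="label"):
    """Split rows into integer feature vectors X and labels y."""
    X = [[int(v) for k, v in row.items() if k != label_column] for row in rows]
    y = [row[label_column] for row in rows]
    return X, y


rows = preprocess(read_rows(RAW_CSV))
X, y = to_xy(rows)
print(X)  # [[60, 1, 1], [23, 0, 0]]
print(y)  # ['phishing', 'legitimate']
```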
VI. AREA OF DOMAIN
MACHINE LEARNING
Machine learning and artificial intelligence are
branches of computer science. Machine
learning, which centres on developing
computational programs, is an application of
artificial intelligence (AI) that allows a
system to learn and improve automatically
from experience without explicit
instructions, relying instead on patterns and
inference to perform a specific task
effectively. Many supervised learning
algorithms have been employed successfully
in different real applications. Popular
machine learning techniques include the
back-propagation neural network (BPNN),
radial basis function network (RBFN),
support vector machine (SVM), naïve Bayes
classifier (NB), decision tree (C4.5), random
forest (RF) and k-nearest neighbor (kNN).
Decision Tree (C4.5) and Random Forest
(RF):
A decision tree divides a data set into different
categories by assigning labels. It is one of the
most widely used and practical methods for
inductive inference. Instances are classified by
sorting them according to their feature values.
Each node in the tree represents a feature of the
instance to be classified, and each branch
represents a value that the feature can take. The
most common decision tree algorithm is C4.5,
which uses a set of if-then rules to improve
readability and interpretation. Another popular
tree-based method, which can be used for both
classification and regression, is the Random
Forest (RF) classifier. It is a supervised machine
learning algorithm used in data mining that
builds multiple decision trees, each trained
independently on a different sample of the same
data set, and combines them into an ensemble.
The trees are created randomly: each uses a
different subset of the same data set, and its
features are also selected randomly. Therefore,
Random Forest usually achieves more accurate
classification and more stable prediction than a
single tree.
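The idea of training many trees on random samples and random feature subsets, then combining them by vote, can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this project: the trees split on Gini impurity (as in CART) rather than the gain ratio used by C4.5, and each tree draws its feature subset once, whereas standard Random Forest implementations re-draw the candidate features at every split. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, features):
    """Grow a tree using only the given feature indices."""
    if len(set(y)) == 1:
        return y[0]  # pure leaf
    best = None
    for f in features:
        # Candidate thresholds: every distinct value except the largest.
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no feature separates the rows: majority leaf
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], features),
            build_tree([X[i] for i in right], [y[i] for i in right], features))


def predict_tree(node, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def build_forest(X, y, n_trees=15, seed=0):
    """Train each tree on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    k = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        features = rng.sample(range(n_features), k)
        forest.append(build_tree([X[i] for i in idx],
                                 [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data: four features coded 1 (suspicious value) or -1 (normal value).
X = [[1, 1, 1, 1]] * 6 + [[-1, -1, -1, -1]] * 6
y = ["phishing"] * 6 + ["legitimate"] * 6
forest = build_forest(X, y)
print(predict_forest(forest, [1, 1, 1, 1]))
print(predict_forest(forest, [-1, -1, -1, -1]))
```

Because every tree sees a different sample and a different feature subset, the errors of individual trees tend to differ, and the majority vote smooths them out.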
VII. METHODOLOGY
In our proposal, a learning-based approach is
used to classify websites into three classes:
1. Legitimate URLs
2. Suspicious URLs
3. Phishing URLs
Our methodology analyses only the uniform
resource locator (URL) itself and does not need
to access the website contents. It therefore
reduces the runtime latency and the probability
of exposing users to browser-based
vulnerabilities. By employing learning
algorithms, our methodology gives better
performance on generality and coverage than
blacklisting services.
SYSTEM ARCHITECTURE
Figure 2: Steps to achieve the results.
VIII. CONCLUSION
Any classification algorithm, such as KNN,
naïve Bayes or SVM, can be used to detect
phishing, but in this project I test only the
decision tree C5.0 and random forest
algorithms, because many scholars working on
anti-phishing projects have stated that random
forest is the best fit. Much research has been
done in the field of cyber security and in web
content detection, classification and filtering, as
seen in the papers referred to during the
literature survey. Different approaches and
algorithms have also been developed for solving
this problem. These are currently in use and
provide a decent level of performance in
real-time applications.
REFERENCES
[1] Mayank Pandey and Vadlamani Ravi,
“Detecting phishing e-mails using Text and
Data mining”, IEEE International Conference
on Computational Intelligence and Computing
Research, 2012.
[2] Sunil B. Rathod, Tareek M. Pattewar,
“Content Based Spam Detection in Email
using Bayesian Classifier”, IEEE ICCSP
conference, 2015.
[3] Lew May Form, Kang Leng Chiew,
San Nah Sze and Wei King Tiong,
“Phishing Email Detection Technique by
using Hybrid Features”, IT in Asia (CITA),
9th International Conference, 2015.
[4] Tareek M. Pattewar, Sunil B. Rathod,
“A Comparative Performance Evaluation
of Content Based Spam and Malicious
URL Detection in E-mail”, IEEE
International Conference on Computer
Graphics, Vision and Information Security
(CGVIS), 2015.
[5] Jian Mao, Wenqian Tian, Pei Li, Tao Wei
and Zhenkai Liang, “Phishing-Alarm: Robust
and Efficient Phishing Detection via Page
Component Similarity”.
[6] Zou Futai, Gang Yuxiang, Pei Bei, Pan Li
and Li Linsen, “Web Phishing Detection
Based on Graph Mining”.
[7] Xin Mei Choo, Kang Leng Chiew,
Dayang Hanani Abang Ibrahim, Nadianatra
Musa, San Nah Sze and Wei King Tiong,
“Feature-Based Phishing Detection
Technique”.
[8] Varsharani Ramdas Hawanna, V. Y.
Kulkarni and R. A. Rane, “A Novel Algorithm
to Detect Phishing URLs”.
[9] Jun Hu, Xiangzhu Zhang, Yuchun Ji,
Hanbing Yan, Li Ding, Jia Li and Huiming
Meng, “Detecting Phishing Websites Based
on the Study of the Financial Industry
Webserver Logs”.
[10] Samuel Marchal, Giovanni Armano
and Nidhi Singh, “Off-the-Hook: An Efficient
and Usable”.
