Classification with R
By
M.S. Najima Begum,
Reg.No :15US10
Department of Computer Science,
ANJA College, Sivakasi
Abstract
Classification is one of the mechanisms used to label
data. The tools and methods applied differ
according to the size of the dataset. Here, I have
used two methods: machine learning on a text
dataset (to detect phishing websites) and deep
learning on an image dataset (to label CIFAR-10
images).
System Specifications
Hardware Requirements
• Intel Pentium 2.10 GHz / 500 GB / 2 GB
Software Requirements
• Windows 8.1 / RStudio 3.4.3 / Rtools / Keras / TensorFlow
Keras – Interface between R and Python to implement deep
learning models
TensorFlow – Backend for Keras in R to implement deep learning
models (CPU and GPU compatible)
Literature Review International Journal of Advance Foundation and Research in Computer (IJAFRC), Volume 3, Issue 4,
April – 2016. ISSN : 2348 – 4853, Impact Factor – 1.317, “Link Guard Algorithm approch on
Phishing Detection and Control”.
Abstract:
Phishing is a new type of network attack where the attacker creates a replica of an existing
Web page to fool users (e.g., by using specially designed e-mails or instant messages) into submitting
personal, financial, or password data to what they think is their service provider’s Web site. In this
research paper, we proposed a new end-host based anti-phishing algorithm, which we call Link
Guard, by utilizing the generic characteristics of the hyperlinks in phishing attacks. These
characteristics are derived by analyzing the phishing data archive provided by the Anti-Phishing
Working Group (APWG). Because it is based on the generic characteristics of phishing attacks, Link
Guard can detect not only known but also unknown phishing attacks. We have implemented Link
Guard in Windows XP. Our experiments verified that Link Guard is effective to detect and prevent
both known and unknown phishing attacks with minimal false negatives. Link Guard successfully
detects 195 out of the 203 phishing attacks. Our experiments also showed that Link Guard is
lightweight and can detect and prevent phishing attacks in real time. Index Terms: Hyperlink, Link
Guard algorithm, Network security, Phishing attacks.
• International Journal of Engineering and Techniques, Volume 2, Issue 5, September – October 2016. “Automated
Phishing Website Detection Using URL Features and Machine Learning Technique”
Abstract
Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs
host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims
of scams, and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a
timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists
cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of
malicious URL detectors, machine learning techniques have been explored with increasing attention in recent
years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL
Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as
a machine learning task, and categorize and review the contributions of literature studies that address different
dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely
and comprehensive survey for a range of different audiences, not only for machine learning researchers and
engineers in academia, but also for professionals and practitioners in cybersecurity industry, to help them
understand the state of the art and facilitate their own research and practical applications. We also discuss practical
issues in system design, open research challenges, and point out some important directions for future research.
Index Terms: Malicious URL Detection, Machine Learning, Online Learning, Internet security, Cybersecurity
Proposed Work - Phishing Websites
Phishing is an unlawful activity that tricks gullible people into revealing their
sensitive information on fake websites. The aim of these phishing websites is to
acquire confidential information such as usernames, passwords, banking
credentials, and other personal information. A phishing website looks similar to
a legitimate website, so people cannot easily tell the two apart. Today,
users rely heavily on the internet for online purchasing, ticket booking, bill
payments, etc. As technology advances, phishing approaches also become more
sophisticated, which in turn requires anti-phishing methods to be
upgraded.
Many algorithms used to identify phishing websites use
up to 30 parameters. Here, I have tried to show that a minimal set of effective
parameters is sufficient to detect phishing websites.
Machine Learning
• Machine learning is the practice of using algorithms
to parse data, learn from it, and then make a
determination or prediction about something in the
world.
• Regardless of learning style or function, all
combinations of machine learning algorithms consist
of the following:
R
R is a leading language in data science and
statistics, and it is widely used by data science
professionals across industries and fields.
It is well suited to statistical computing, data analysis, and
machine learning.
Comparison of Link Guard and
Random Forest Algorithms

Random Forest                                   | Link Guard
------------------------------------------------|------------------------------------------------
It is a classification method.                  | It is also a classification method.
The accuracy of this algorithm is 99.7%.        | The accuracy of this algorithm is 99.1%.
It targets both low false negative (FN) and     | It targets a low false negative (FN) rate
low false positive (FP) rates.                  | only.
To train on the dataset, it uses vector         | To train on the dataset, it uses pattern
representation.                                 | matching.
It uses regression.                             | It uses an end-host based approach.
Attributes Used
• @attribute having_IP_Address { -1,1 }
• @attribute URL_Length { 1,0,-1 }
• @attribute Shortining_Service { 1,-1 }
• @attribute having_At_Symbol { 1,-1 }
• @attribute double_slash_redirecting { -1,1 }
• @attribute Prefix_Suffix { -1,1 }
• @attribute having_Sub_Domain { -1,0,1 }
• @attribute SSLfinal_State { -1,1,0 }
• @attribute Domain_registeration_length { -1,1 }
• @attribute Favicon { 1,-1 }
• @attribute port { 1,-1 }
• @attribute HTTPS_token { -1,1 }
• @attribute Request_URL { 1,-1 }
• @attribute URL_of_Anchor { -1,0,1 }
• @attribute Links_in_tags { 1,-1,0 }
• @attribute SFH { -1,1,0 }
• @attribute Submitting_to_email { -1,1 }
• @attribute Abnormal_URL { -1,1 }
• @attribute Redirect { 0,1 }
• @attribute on_mouseover { 1,-1 }
• @attribute RightClick { 1,-1 }
• @attribute popUpWindow { 1,-1 }
• @attribute Iframe { 1,-1 }
• @attribute age_of_domain { -1,1 }
• @attribute DNSRecord { -1,1 }
• @attribute web_traffic { -1,0,1 }
• @attribute Page_Rank { -1,1 }
• @attribute Google_Index { 1,-1 }
• @attribute Links_pointing_to_page { 1,0,-1 }
• @attribute Statistical_report { -1,1 }
The 30 attributes listed above are used to detect
phishing websites. Among these, I have
tried to find the most important ones, that is, the
minimal set of effective parameters needed to classify
phishing websites.
Decision Tree
The decision tree is one of the most
popular algorithms for classification and
prediction.
By applying this algorithm, the most effective
attribute(s) for detecting a
phishing website can be identified.
Dataset
The dataset for this task was collected from
https://archive.ics.uci.edu/ml/datasets.html
https://www.phishtank.com/
Libraries
• rpart
The R package ‘rpart’, which
stands for ‘Recursive Partitioning’, is used to build
decision tree models.
• rpart.plot
The R package ‘rpart.plot’
(‘Recursive Partitioning plot’) is used to
produce the graphical representation of a decision
tree model.
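Code to find the Minimal Effective attributes
# Import packages
library(rpart)
library(rpart.plot)
# Load data (adjust the path to where the CSV is stored)
psite <- read.csv("G:MLDecision TreeDatasetsPhishingweb.csv")
# Treat the target as a class label so that rpart builds a classification tree
psite$Result <- factor(psite$Result)
# Fit the model on the first 1200 rows
mod <- rpart(Result ~ ., data = psite[1:1200, ], method = "class")
summary(mod)
rpart.plot(mod, type = 4, extra = 101)
# Predict classes for every row; all predictor columns must be present
p <- predict(mod, psite, type = "class")
# Confusion matrix of predicted vs. actual labels
table(p, psite$Result)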
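The table() call at the end of the code above produces a confusion matrix. As a minimal sketch of how accuracy and the FN and FP rates could be read from such a matrix, the example below uses small made-up label vectors (not the phishing data) and assumes, purely for illustration, that 1 marks the positive class.

```r
# Hypothetical labels for illustration only; 1 is assumed to be the positive class.
actual    <- factor(c(1, 1, -1, -1, 1, -1, 1, -1), levels = c(-1, 1))
predicted <- factor(c(1, -1, -1, -1, 1, 1, 1, -1), levels = c(-1, 1))

# Rows are predictions, columns are actual labels (same layout as table(p, Result)).
cm <- table(predicted, actual)

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# False negative rate: actual positives that were predicted negative.
fn_rate <- cm["-1", "1"] / sum(cm[, "1"])

# False positive rate: actual negatives that were predicted positive.
fp_rate <- cm["1", "-1"] / sum(cm[, "-1"])

print(cm)
print(c(accuracy = accuracy, fn_rate = fn_rate, fp_rate = fp_rate))
```

Evaluating on rows that were not used for fitting (for example, the rows after the first 1200) would give a less optimistic estimate than predicting on the training rows.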
Output
Variable importance
           SFH    popUpWindow SSLfinal_State
            47             20             19
 URL_of_Anchor  age_of_domain    web_traffic
             5              4              3
   Request_URL     URL_Length
             1              1
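The importance values above sum to 100, which is consistent with each variable's share of the total importance rounded to whole percentages. The sketch below shows how such shares could be computed from a fitted rpart model. It uses a small synthetic data frame, not the phishing dataset, and the labelling rule (the label follows SFH with about 10% random flips) is an assumption made only so that the example is self-contained.

```r
library(rpart)

set.seed(42)
n <- 500

# Synthetic stand-in data with three of the attribute names used in the slides.
toy <- data.frame(
  SFH         = sample(c(-1, 0, 1), n, replace = TRUE),
  popUpWindow = sample(c(-1, 1), n, replace = TRUE),
  URL_Length  = sample(c(-1, 0, 1), n, replace = TRUE)
)

# Hypothetical labelling rule: the label follows SFH, with roughly 10% flipped.
label <- ifelse(toy$SFH == 1, 1, -1)
flip <- runif(n) < 0.10
label[flip] <- -label[flip]
toy$Result <- factor(label)

# Fit a classification tree on all predictors.
fit <- rpart(Result ~ ., data = toy, method = "class")

# Raw importance scores, sorted from most to least important.
imp <- fit$variable.importance

# Express each score as a rounded percentage of the total.
imp_pct <- round(100 * imp / sum(imp))
print(imp_pct)
```

In this toy setting SFH dominates by construction; on the real dataset the ranking depends on the data.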
Decision Tree
Server Form Handler Verification
• In the decision tree, Server Form Handler (SFH) is the
root. This indicates that SFH plays a vital role in detecting
phishing websites.
• The importance of the SFH variable is 47.
• So, I tried to show that SFH is a minimal effective
parameter for identifying phishing websites.
• For that, the SFH is extracted from the link. If SFH is present,
the FP (False Positive) value is set to 1; otherwise it is set to -1. If
SFH is only possibly present in the link, the FP value is set to
0.
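The three-valued encoding described above can be sketched as a small lookup. The function name encode_flag and the status strings "present", "possible", and "absent" are hypothetical and introduced only for illustration; the slides do not show how the value is extracted from a link.

```r
# Hypothetical helper: map a textual status to the coding used in the slides
# (1 = present, 0 = possibly present, -1 = absent).
encode_flag <- function(status) {
  codes <- c(present = 1, possible = 0, absent = -1)
  unname(codes[status])
}

# Example: three links with different SFH findings.
sfh_status <- c("present", "absent", "possible")
sfh_value <- encode_flag(sfh_status)
print(sfh_value)
```

The same helper would apply to any other attribute coded with the values 1, 0, and -1.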
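Code
# rpart() comes from the rpart package
library(rpart)
library(rpart.plot)
# Load data (adjust the path to where the CSV is stored)
sites <- read.csv("G:MLSFHds.csv")
# Treat the target as a class label so that rpart builds a classification tree
sites$Result <- factor(sites$Result)
# Fit the model on the first 100 rows
model <- rpart(Result ~ ., data = sites[1:100, ], method = "class")
summary(model)
rpart.plot(model, type = 4, extra = 101)
# Predict on the same data frame the model was built from
ps <- predict(model, sites, type = "class")
table(ps, sites$Result)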
Output
Decision Tree
Variable importance
SFH
100
PopUp_Window Verification
• In this decision tree, the attribute SFH has an importance of 100.
• The tree above shows that if the link or URL has an SFH (Server Form
Handler), it is classified as a phishing website.
• There are exceptions: phishing websites sometimes do not
have an SFH. To handle those cases, I tried the next
most important variable, PopUp_Window.
• The importance of PopUp_Window is 20.
• For that, the PopUp_Window is extracted from the link. If PopUp_Window
is present, the FP (False Positive) value is set to 1; otherwise it is set to -1. If
PopUp_Window is only possibly present in the link, the FP value is set to 0.
Output
Decision Tree
From the above classification method, I have
identified the minimal effective parameters for
detecting phishing websites. Using fewer parameters
increases the effectiveness of the algorithm and speeds up the
detection process.
Online transaction systems can use this algorithm
to protect their users from phishing sites
while redirecting them to the transaction page.
Deep Learning
Instead of organizing data to run through
predefined equations, deep learning sets up
basic parameters about the data and trains the
computer to learn on its own by recognizing
patterns using many layers of processing.
• Deep learning requires large amounts of labeled
data.
• Deep learning requires substantial computing
power. (High-performance GPUs combined with
clusters or cloud computing are preferable.)
• Most deep learning methods use neural network
architectures, which is why deep learning models
are often referred to as deep neural networks.
Machine learning vs Deep learning
In machine learning, we manually choose
features and a classifier to sort images.
With deep learning, the feature extraction and
modeling steps are automatic.
Classification using Deep Learning
In the previous model, I used a text dataset,
which is relatively small.
Image datasets are normally larger, and deep learning
is well suited to training on such
larger datasets. Here, I have taken the CIFAR-10
dataset to classify images.
Cifar-10 Dataset
The CIFAR-10 dataset consists of 60000 32x32 colour images in
10 classes, with 6000 images per class.
There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test
batch, each with 10000 images. The test batch contains exactly 1000
randomly-selected images from each class. The training batches
contain the remaining images in random order, but some training
batches may contain more images from one class than another.
The Classes are
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
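The counts quoted above fit together arithmetically. As a quick check, the sketch below recomputes them; all numbers come from the description above, and only the variable names are introduced here.

```r
# Figures taken from the dataset description above.
n_classes       <- 10
images_per_class <- 6000
batch_size      <- 10000
train_batches   <- 5
test_batches    <- 1

total_images <- n_classes * images_per_class   # all images in the dataset
train_images <- train_batches * batch_size     # images in the training batches
test_images  <- test_batches * batch_size      # images in the test batch

# The test batch holds the same number of images for every class.
test_per_class <- test_images / n_classes

# The remaining images of each class are spread over the training batches.
train_per_class <- images_per_class - test_per_class

print(c(total = total_images, train = train_images, test = test_images,
        test_per_class = test_per_class, train_per_class = train_per_class))
```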
Dataset Collection
The Cifar-10 dataset is collected from
https://www.cs.toronto.edu/~kriz/cifar.html
Deep Learning Architectures
• RNN – Recurrent Neural Networks
   - Speech recognition, handwriting recognition
• LSTM / GRU
   - Natural language text compression, gesture recognition, image captioning
• CNN – Convolutional Neural Networks
   - Image recognition, video analysis, natural language processing
• DBN – Deep Belief Networks
   - Image recognition, information retrieval, natural language understanding,
     failure prediction
• DSN – Deep Stacking Networks
   - Image recognition, continuous speech recognition
Here I have chosen CNN for CIFAR-10 image recognition.
APIs used with R for CNN
Keras
• Keras provides a high-level neural networks API developed with a focus
on enabling fast experimentation. Keras has the following key features:
• Allows the same code to run on CPU or on GPU.
• User-friendly API which makes it easy to quickly prototype deep learning
models.
• Supports arbitrary network architectures: multi-input or multi-output
models, layer sharing, model sharing, etc.
• Is capable of running on top of multiple back-ends including TensorFlow,
CNTK or Theano.
TensorFlow
• TensorFlow is an open source software library for numerical
computation using data flow graphs. Nodes in the graph represent
mathematical operations, while the graph edges represent the
multidimensional data arrays (tensors) communicated between them.
• The flexible architecture allows you to deploy computation to one or
more CPUs or GPUs in a desktop, server, or mobile device with a single
API.
• The TensorFlow API is composed of a set of Python modules that enable
constructing and executing TensorFlow graphs. The tensorflow
package provides access to the complete TensorFlow API from within R.
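Scaling data
library(keras)
# Assumption: CIFAR-10 is loaded through keras's dataset_cifar10();
# the slides do not show how the cifar object was created
cifar <- dataset_cifar10()
# TRAINING DATA
# Scale pixel values from [0, 255] to [0, 1]
train_x <- cifar$train$x / 255
# Convert the class vector to a binary class matrix: the target variable is
# one-hot encoded with the keras built-in function to_categorical()
train_y <- to_categorical(cifar$train$y, num_classes = 10)
# TEST DATA
test_x <- cifar$test$x / 255
test_y <- to_categorical(cifar$test$y, num_classes = 10)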
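To make the two preprocessing steps above concrete, here is a minimal base R sketch of what scaling and one-hot encoding do to a few values. The helper one_hot is a hypothetical stand-in written for illustration; it is not the keras implementation of to_categorical(), and it assumes labels numbered from 0, as in CIFAR-10.

```r
# Hypothetical stand-in for one-hot encoding of 0-based integer labels.
one_hot <- function(labels, num_classes) {
  m <- matrix(0, nrow = length(labels), ncol = num_classes)
  # Row i gets a 1 in column (label + 1), because R indexes from 1.
  m[cbind(seq_along(labels), labels + 1)] <- 1
  m
}

# Scaling: 8-bit pixel intensities are mapped into the range [0, 1].
pixels <- c(0, 128, 255)
scaled <- pixels / 255

# Encoding: three example labels out of 10 classes.
y <- one_hot(c(0, 3, 9), num_classes = 10)

print(scaled)
print(y)
```

Each row of y contains a single 1 in the column of its class, which is the target format used with a 10-unit softmax output layer.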
CNN Architecture for classifying
CIFAR-10
# A linear stack of layers
model <- keras_model_sequential()
# Configure the model
model %>%
  # First 2-D convolution layer; the input is a 32x32 image with 3 colour channels
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                input_shape = c(32, 32, 3)) %>%
  layer_activation("relu") %>%
  # Another 2-D convolution layer
  layer_conv_2d(filters = 32, kernel_size = c(3, 3)) %>%
  layer_activation("relu") %>%
  # Dropout layer to reduce overfitting
  layer_dropout(0.25) %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same") %>%
  layer_activation("relu") %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(0.25) %>%
  # Flatten the feature maps into a vector
  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  # Output layer: 10 classes, 10 units
  layer_dense(10) %>%
  # Softmax activation on the output layer gives class probabilities,
  # which are used to calculate cross-entropy
  layer_activation("softmax")
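To follow how the image size changes through the layer stack above, the spatial dimensions can be traced by hand. The sketch below does this in base R, assuming stride 1 for the convolutions and a pooling stride equal to the pool size (the Keras defaults when no stride is given). The helper names conv_out and pool_out are hypothetical.

```r
# Output width/height of a stride-1 convolution.
# "same" padding keeps the size; "valid" (no padding) shrinks it by kernel - 1.
conv_out <- function(size, kernel, padding = c("valid", "same")) {
  padding <- match.arg(padding)
  if (padding == "same") size else size - kernel + 1
}

# Output width/height of max pooling with stride equal to the pool size.
pool_out <- function(size, pool) size %/% pool

filters <- 32
s <- 32                          # input images are 32x32
s <- conv_out(s, 3, "same")      # 1st convolution, padding = "same"
s <- conv_out(s, 3, "valid")     # 2nd convolution, default padding
s <- conv_out(s, 3, "same")      # 3rd convolution, padding = "same"
s <- conv_out(s, 3, "valid")     # 4th convolution, default padding
s <- pool_out(s, 2)              # 2x2 max pooling

# Number of values handed to the first dense layer after flattening.
flat_units <- s * s * filters

# Weights plus biases of the 512-unit dense layer.
dense_params <- flat_units * 512 + 512

print(c(final_size = s, flat_units = flat_units, dense_params = dense_params))
```

Most of the model's parameters sit in the first dense layer, which is one reason training such a model is demanding on a machine with limited memory and no GPU.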
Problems Faced
• It was difficult to collect a dataset of phishing
websites.
• It was difficult to extract the elements from the URL.
• The system specification was not sufficient to execute
the deep learning models.
• There were difficulties in installing the Keras and TensorFlow
APIs in R.
Future Enhancement
I am trying to execute the CIFAR-10 image
recognition deep learning model on a GPU
system.
I am also trying to build gold price prediction
and heart disease prediction models using
deep learning.
Bibliography
• http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
• https://rpubs.com/
• https://towardsdatascience.com/deploy-tensorflow-models-9813b5a705d5/
• https://www.datacamp.com/courses/machine-learning-with-tree-based-models-in-r
• https://www.analyticsvidhya.com/blog/2016/02/complete-tutorial-learn-data-science-scratch/
• https://github.com/rishy/phishing-websites
• https://www.tensorflow.org/tutorials/deep_cnn
• https://keras.rstudio.com/articles/
• https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.array.html
• https://www.r-bloggers.com/deep-learning-in-r-2/
• https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
• https://www.quora.com/What-is-a-convolutional-neural-network
• https://in.mathworks.com/discovery/deep-learning.html
Classification with R

More Related Content

What's hot

Classification Model to Detect Malicious URL via Behaviour Analysis
Classification Model to Detect Malicious URL via Behaviour AnalysisClassification Model to Detect Malicious URL via Behaviour Analysis
Classification Model to Detect Malicious URL via Behaviour Analysis
Editor IJCATR
 
IJSRED-V2I4P0
IJSRED-V2I4P0IJSRED-V2I4P0
IJSRED-V2I4P0
IJSRED
 
IRJET- Detecting Malicious URLS using Machine Learning Techniques: A Comp...
IRJET-  	  Detecting Malicious URLS using Machine Learning Techniques: A Comp...IRJET-  	  Detecting Malicious URLS using Machine Learning Techniques: A Comp...
IRJET- Detecting Malicious URLS using Machine Learning Techniques: A Comp...
IRJET Journal
 
Paper id 71201915
Paper id 71201915Paper id 71201915
Paper id 71201915
IJRAT
 
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET Journal
 
Review of the machine learning methods in the classification of phishing attack
Review of the machine learning methods in the classification of phishing attackReview of the machine learning methods in the classification of phishing attack
Review of the machine learning methods in the classification of phishing attack
journalBEEI
 
Network paperthesis2
Network paperthesis2Network paperthesis2
Network paperthesis2Dhara Shah
 
Malicious Url Detection Using Machine Learning
Malicious Url Detection Using Machine LearningMalicious Url Detection Using Machine Learning
Malicious Url Detection Using Machine Learning
securityxploded
 
Phishing
PhishingPhishing
Phishing
Programmer
 
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...
M. Atif Qureshi
 
csmalware_malware
csmalware_malwarecsmalware_malware
csmalware_malwareJoshua Saxe
 
Comparative analysis of efficiency of fibonacci random number generator algor...
Comparative analysis of efficiency of fibonacci random number generator algor...Comparative analysis of efficiency of fibonacci random number generator algor...
Comparative analysis of efficiency of fibonacci random number generator algor...
Alexander Decker
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
Trey Grainger
 
Exploration of gaps in Bitly's spam detection and relevant countermeasures
Exploration of gaps in Bitly's spam detection and relevant countermeasuresExploration of gaps in Bitly's spam detection and relevant countermeasures
Exploration of gaps in Bitly's spam detection and relevant countermeasures
Cybersecurity Education and Research Centre
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODSCOMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS
IJCI JOURNAL
 
Spam Wars
Spam WarsSpam Wars
Spam Wars
Maurice Green
 
Js3616841689
Js3616841689Js3616841689
Js3616841689
IJERA Editor
 

What's hot (18)

Classification Model to Detect Malicious URL via Behaviour Analysis
Classification Model to Detect Malicious URL via Behaviour AnalysisClassification Model to Detect Malicious URL via Behaviour Analysis
Classification Model to Detect Malicious URL via Behaviour Analysis
 
IJSRED-V2I4P0
IJSRED-V2I4P0IJSRED-V2I4P0
IJSRED-V2I4P0
 
IRJET- Detecting Malicious URLS using Machine Learning Techniques: A Comp...
IRJET-  	  Detecting Malicious URLS using Machine Learning Techniques: A Comp...IRJET-  	  Detecting Malicious URLS using Machine Learning Techniques: A Comp...
IRJET- Detecting Malicious URLS using Machine Learning Techniques: A Comp...
 
Paper id 71201915
Paper id 71201915Paper id 71201915
Paper id 71201915
 
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
 
Review of the machine learning methods in the classification of phishing attack
Review of the machine learning methods in the classification of phishing attackReview of the machine learning methods in the classification of phishing attack
Review of the machine learning methods in the classification of phishing attack
 
Network paperthesis2
Network paperthesis2Network paperthesis2
Network paperthesis2
 
Malicious Url Detection Using Machine Learning
Malicious Url Detection Using Machine LearningMalicious Url Detection Using Machine Learning
Malicious Url Detection Using Machine Learning
 
Phishing
PhishingPhishing
Phishing
 
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using...
 
csmalware_malware
csmalware_malwarecsmalware_malware
csmalware_malware
 
Comparative analysis of efficiency of fibonacci random number generator algor...
Comparative analysis of efficiency of fibonacci random number generator algor...Comparative analysis of efficiency of fibonacci random number generator algor...
Comparative analysis of efficiency of fibonacci random number generator algor...
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Exploration of gaps in Bitly's spam detection and relevant countermeasures
Exploration of gaps in Bitly's spam detection and relevant countermeasuresExploration of gaps in Bitly's spam detection and relevant countermeasures
Exploration of gaps in Bitly's spam detection and relevant countermeasures
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODSCOMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS
 
Spam Wars
Spam WarsSpam Wars
Spam Wars
 
Js3616841689
Js3616841689Js3616841689
Js3616841689
 

Similar to Classification with R

IRJET - Chrome Extension for Detecting Phishing Websites
IRJET -  	  Chrome Extension for Detecting Phishing WebsitesIRJET -  	  Chrome Extension for Detecting Phishing Websites
IRJET - Chrome Extension for Detecting Phishing Websites
IRJET Journal
 
A Hybrid Approach For Phishing Website Detection Using Machine Learning.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.A Hybrid Approach For Phishing Website Detection Using Machine Learning.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.
vivatechijri
 
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET- Detecting the Phishing Websites using Enhance Secure AlgorithmIRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET Journal
 
Classification with R
Classification with RClassification with R
Classification with R
Najima Begum
 
Phishing Website Detection Paradigm using XGBoost
Phishing Website Detection Paradigm using XGBoostPhishing Website Detection Paradigm using XGBoost
Phishing Website Detection Paradigm using XGBoost
IRJET Journal
 
HIGH ACCURACY PHISHING DETECTION
HIGH ACCURACY PHISHING DETECTIONHIGH ACCURACY PHISHING DETECTION
HIGH ACCURACY PHISHING DETECTION
IRJET Journal
 
Detecting Phishing using Machine Learning
Detecting Phishing using Machine LearningDetecting Phishing using Machine Learning
Detecting Phishing using Machine Learning
ijtsrd
 
PHISHING URL DETECTION AND MALICIOUS LINK
PHISHING URL DETECTION AND MALICIOUS LINKPHISHING URL DETECTION AND MALICIOUS LINK
PHISHING URL DETECTION AND MALICIOUS LINK
RajeshRavi44
 
Detecting Phishing Websites Using Machine Learning
Detecting Phishing Websites Using Machine LearningDetecting Phishing Websites Using Machine Learning
Detecting Phishing Websites Using Machine Learning
IRJET Journal
 
Phishing Website Detection Using Machine Learning
Phishing Website Detection Using Machine LearningPhishing Website Detection Using Machine Learning
Phishing Website Detection Using Machine Learning
IRJET Journal
 
Phishing Website Detection using Classification Algorithms
Phishing Website Detection using Classification AlgorithmsPhishing Website Detection using Classification Algorithms
Phishing Website Detection using Classification Algorithms
IRJET Journal
 
IRJET- Advanced Phishing Identification Technique using Machine Learning
IRJET-  	  Advanced Phishing Identification Technique using Machine LearningIRJET-  	  Advanced Phishing Identification Technique using Machine Learning
IRJET- Advanced Phishing Identification Technique using Machine Learning
IRJET Journal
 
DETECTION OF PHISHING WEBSITES USING MACHINE LEARNING
DETECTION OF PHISHING WEBSITES USING MACHINE LEARNINGDETECTION OF PHISHING WEBSITES USING MACHINE LEARNING
DETECTION OF PHISHING WEBSITES USING MACHINE LEARNING
IRJET Journal
 
IRJET- Phishing Website Detection System
IRJET- Phishing Website Detection SystemIRJET- Phishing Website Detection System
IRJET- Phishing Website Detection System
IRJET Journal
 
MALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORK
MALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORKMALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORK
MALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORK
ijcseit
 
MALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORK
MALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORKMALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORK
MALICIOUS URL DETECTION USING CONVOLUTIONAL NEURAL NETWORK
ijcseit
 
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
KNOWLEDGE BASE COMPOUND APPROACH AGAINST PHISHING ATTACKS USING SOME PARSING ...
cscpconf
 
Knowledge base compound approach against phishing attacks using some parsing ...
Knowledge base compound approach against phishing attacks using some parsing ...Knowledge base compound approach against phishing attacks using some parsing ...
Knowledge base compound approach against phishing attacks using some parsing ...
csandit
 
A Comparative Analysis of Different Feature Set on the Performance of Differe...
A Comparative Analysis of Different Feature Set on the Performance of Differe...A Comparative Analysis of Different Feature Set on the Performance of Differe...
A Comparative Analysis of Different Feature Set on the Performance of Differe...
gerogepatton
 
Detection of Phishing Websites using machine Learning Algorithm
Detection of Phishing Websites using machine Learning AlgorithmDetection of Phishing Websites using machine Learning Algorithm
Detection of Phishing Websites using machine Learning Algorithm
IRJET Journal
 

Similar to Classification with R (20)

IRJET - Chrome Extension for Detecting Phishing Websites
IRJET -  	  Chrome Extension for Detecting Phishing WebsitesIRJET -  	  Chrome Extension for Detecting Phishing Websites
IRJET - Chrome Extension for Detecting Phishing Websites
 
A Hybrid Approach For Phishing Website Detection Using Machine Learning.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.A Hybrid Approach For Phishing Website Detection Using Machine Learning.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.
 
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET- Detecting the Phishing Websites using Enhance Secure AlgorithmIRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
 
Classification with R
Classification with RClassification with R
Classification with R
 
Phishing Website Detection Paradigm using XGBoost
Phishing Website Detection Paradigm using XGBoostPhishing Website Detection Paradigm using XGBoost
Phishing Website Detection Paradigm using XGBoost
 
HIGH ACCURACY PHISHING DETECTION
HIGH ACCURACY PHISHING DETECTIONHIGH ACCURACY PHISHING DETECTION
HIGH ACCURACY PHISHING DETECTION
 
Detecting Phishing using Machine Learning
Detecting Phishing using Machine LearningDetecting Phishing using Machine Learning
Detecting Phishing using Machine Learning
 
PHISHING URL DETECTION AND MALICIOUS LINK
PHISHING URL DETECTION AND MALICIOUS LINKPHISHING URL DETECTION AND MALICIOUS LINK
PHISHING URL DETECTION AND MALICIOUS LINK
 
Classification with R

  • 1. Classification with R By M.S. Najima Begum, Reg.No :15US10 Department of Computer Science, ANJA College, Sivakasi
  • 2. Abstract Classification is a mechanism for labelling data. The tools and methods applied differ according to the size of the dataset. Here, I have used two approaches: machine learning on a text dataset (to detect phishing websites) and deep learning on an image dataset (to label CIFAR-10 images).
  • 3. System Specifications Hardware Requirements: Intel Pentium 2.10 GHz / 500 GB / 2GB. Software Requirements: Windows 8.1 / Rstudio 3.4.3 / Rtools / Keras / Tensorflow. Keras: interface between R and Python to implement deep learning models. Tensorflow: backend for Keras in R to implement deep learning models (CPU & GPU compatibility).
  • 4. Literature Review International Journal of Advance Foundation and Research in Computer (IJAFRC), Volume 3, Issue 4, April 2016. ISSN: 2348-4853, Impact Factor 1.317, “Link Guard Algorithm Approach on Phishing Detection and Control”. Abstract: Phishing is a new type of network attack where the attacker creates a replica of an existing Web page to fool users (e.g., by using specially designed e-mails or instant messages) into submitting personal, financial, or password data to what they think is their service provider’s Web site. In this research paper, we proposed a new end-host based anti-phishing algorithm, which we call Link Guard, by utilizing the generic characteristics of the hyperlinks in phishing attacks. These characteristics are derived by analyzing the phishing data archive provided by the Anti-Phishing Working Group (APWG). Because it is based on the generic characteristics of phishing attacks, Link Guard can detect not only known but also unknown phishing attacks. We have implemented Link Guard in Windows XP. Our experiments verified that Link Guard is effective to detect and prevent both known and unknown phishing attacks with minimal false negatives. Link Guard successfully detects 195 out of the 203 phishing attacks. Our experiments also showed that Link Guard is lightweight and can detect and prevent phishing attacks in real time. Index Terms: Hyperlink, Link Guard algorithm, Network security, Phishing attacks.
  • 5. International Journal of Engineering and Techniques, Volume 2, Issue 5, Sep-October 2016. “Automated Phishing Website Detection Using URL Features and Machine Learning Technique”. Abstract: Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams, and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that address different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design, open research challenges, and point out some important directions for future research. Index Terms: Malicious URL Detection, Machine Learning, Online Learning, Internet security, Cybersecurity
  • 6. Proposed Work: Phishing Websites Phishing is the unlawful practice of tricking unsuspecting people into revealing sensitive information to fake websites. The aim of these phishing websites is to acquire confidential information such as usernames, passwords, banking credentials and other personal information. A phishing website looks similar to a legitimate website, so people cannot easily tell them apart. Today, users rely heavily on the internet for online purchasing, ticket booking, bill payments, etc. As technology advances, phishing approaches advance as well, which in turn requires anti-phishing methods to be upgraded. Many algorithms used to identify phishing websites use up to 30 parameters. Here, I have tried to show that a minimal set of effective parameters is sufficient to detect phishing websites.
  • 7. Machine Learning Machine learning is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world. Regardless of learning style or function, all combinations of machine learning algorithms consist of the following:
  • 8.
  • 9. R R is rapidly becoming a leading language in data science and statistics. It is widely used by data science professionals across industries and fields, and is well suited to statistics, data analysis and machine learning.
  • 10. Comparison of Link Guard and Random Forest Algorithms
      Criterion         | Random Forest                                                      | Link Guard
      Type              | Classification method                                              | Classification method
      Reported accuracy | 99.7%                                                              | 99.1%
      Error rates       | Low false negative (FN) and low false positive (FP) rates          | Low false negative (FN) only
      Training          | Vector representation                                              | Pattern matching
      Approach          | Regression                                                         | End-host based
  • 11. Attributes Used
      @attribute having_IP_Address { -1,1 }
      @attribute URL_Length { 1,0,-1 }
      @attribute Shortining_Service { 1,-1 }
      @attribute having_At_Symbol { 1,-1 }
      @attribute double_slash_redirecting { -1,1 }
      @attribute Prefix_Suffix { -1,1 }
      @attribute having_Sub_Domain { -1,0,1 }
      @attribute SSLfinal_State { -1,1,0 }
      @attribute Domain_registeration_length { -1,1 }
      @attribute Favicon { 1,-1 }
      @attribute port { 1,-1 }
      @attribute HTTPS_token { -1,1 }
      @attribute Request_URL { 1,-1 }
      @attribute URL_of_Anchor { -1,0,1 }
      @attribute Links_in_tags { 1,-1,0 }
      @attribute SFH { -1,1,0 }
      @attribute Submitting_to_email { -1,1 }
      @attribute Abnormal_URL { -1,1 }
      @attribute Redirect { 0,1 }
      @attribute on_mouseover { 1,-1 }
      @attribute RightClick { 1,-1 }
      @attribute popUpWindow { 1,-1 }
      @attribute Iframe { 1,-1 }
      @attribute age_of_domain { -1,1 }
      @attribute DNSRecord { -1,1 }
      @attribute web_traffic { -1,0,1 }
      @attribute Page_Rank { -1,1 }
      @attribute Google_Index { 1,-1 }
      @attribute Links_pointing_to_page { 1,0,-1 }
      @attribute Statistical_report { -1,1 }
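A few of these attributes can be derived from the URL string alone. The following is a minimal sketch in base R, not the extraction procedure used to build the dataset: the function name `extract_url_features` is hypothetical, and the coding (-1 when the suspicious trait is present, 1 otherwise) is an assumption made for illustration.

```r
# Hypothetical helper: derives three URL-based attributes from a URL string.
# Assumed coding: -1 = suspicious trait present, 1 = trait absent.
extract_url_features <- function(url) {
  # Isolate the host: drop the scheme, then the path/query/fragment,
  # then any "user@" prefix and an explicit port
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
  host <- sub("[/?#].*$", "", host)
  host <- sub("^.*@", "", host)
  host <- sub(":[0-9]+$", "", host)

  # A host made of four dot-separated numbers is treated as an IP address
  is_ip <- grepl("^([0-9]{1,3}\\.){3}[0-9]{1,3}$", host)

  c(
    having_IP_Address = if (is_ip) -1 else 1,
    having_At_Symbol  = if (grepl("@", url, fixed = TRUE)) -1 else 1,
    Prefix_Suffix     = if (grepl("-", host, fixed = TRUE)) -1 else 1
  )
}

f_ip   <- extract_url_features("http://192.168.0.1/login")
f_dash <- extract_url_features("https://secure-bank.example.com/a@b")
print(f_ip)
print(f_dash)
```

Attributes such as age_of_domain or web_traffic cannot be computed this way, since they need external lookups rather than string matching.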
  • 12. A maximum of 30 attributes are used to detect phishing websites. Among these, I have tried to find the most important ones, that is, the minimal set of effective parameters needed to classify phishing websites.
  • 13. Decision Tree The decision tree is one of the most widely used algorithms for classification and prediction. By applying this algorithm, the most effective attribute(s) for detecting a phishing website can be found.
  • 14. Dataset The dataset for this task was collected from https://archive.ics.uci.edu/ml/datasets.html and https://www.phishtank.com/
  • 15. Libraries rpart: R provides a library named ‘rpart’, which stands for ‘Recursive Partitioning’, to perform decision tree operations. rpart.plot: R also provides a library named ‘rpart.plot’, which produces a graphical representation of a decision tree model.
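  • 16. Code to find the Minimal Effective attributes
      # Import packages
      library(rpart)
      library(rpart.plot)
      # Load data (point this at your local copy of the dataset)
      psite <- read.csv("G:MLDecision TreeDatasetsPhishingweb.csv")
      # Fit a classification tree on the first 1200 rows
      mod <- rpart(Result ~ ., data = psite[1:1200, ], method = "class")
      # The summary includes the variable importance scores
      summary(mod)
      rpart.plot(mod, type = 4, extra = 101)
      # Predict class labels using the first nine columns
      p <- predict(mod, psite[, 1:9], type = "class")
      # Cross-tabulate predicted against actual labels
      table(p, psite$Result)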
  • 18. Variable Importance
      SFH              47
      popUpWindow      20
      SSLfinal_State   19
      URL_of_Anchor     5
      age_of_domain     4
      web_traffic       3
      Request_URL       1
      URL_Length        1
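Scores like these can be read directly from a fitted rpart model. The sketch below uses a synthetic data frame, not the phishing dataset: the `toy` data, the `noise` column and the rule that generates `Result` are all assumptions made for illustration. It rescales `variable.importance` so the scores sum to roughly 100, which is how `summary()` reports them.

```r
library(rpart)

set.seed(42)
n <- 600
# Synthetic predictors coded like the dataset's attributes (assumption)
toy <- data.frame(
  SFH         = sample(c(-1, 0, 1), n, replace = TRUE),
  popUpWindow = sample(c(-1, 0, 1), n, replace = TRUE),
  noise       = sample(c(-1, 1), n, replace = TRUE)
)
# Made-up rule: the label depends strongly on SFH, weakly on popUpWindow
toy$Result <- factor(ifelse(2 * toy$SFH + toy$popUpWindow > 0, 1, -1))

fit <- rpart(Result ~ ., data = toy, method = "class")

# Raw importance scores, rescaled to percentages
imp     <- fit$variable.importance
imp_pct <- round(100 * imp / sum(imp))
print(imp_pct)
```

Because the synthetic label is driven mainly by SFH, that variable should come out on top here; on real data the ranking depends entirely on the dataset.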
  • 20. Server Form Handler Verification In the decision tree, Server Form Handler (SFH) is the root node, which indicates that SFH plays a vital role in detecting phishing websites. The importance of the SFH variable is 47. So, I tried to show that SFH is a minimal effective parameter for identifying phishing websites. For that, the SFH is extracted from the link. If SFH occurs, the FP (False Positive) value is set to 1; otherwise it is set to -1. If a possible SFH is found in the link, the FP value is set to 0.
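  • 21. Code
      library(rpart)       # rpart() is provided by the rpart package
      library(rpart.plot)
      # Load data (point this at your local copy of the dataset)
      sites <- read.csv("G:MLSFHds.csv")
      # Fit a classification tree on the first 100 rows
      model <- rpart(Result ~ ., data = sites[1:100, ], method = "class")
      summary(model)
      rpart.plot(model, type = 4, extra = 101)
      # Predict from the same data frame the model was fitted on
      ps <- predict(model, sites[, 1:2], type = "class")
      table(ps, sites$Result)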
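The table of predicted against actual labels can be turned into accuracy, false positive and false negative rates. This is a minimal sketch with made-up label vectors, not results from the phishing data; treating 1 as the phishing (positive) class is an assumption.

```r
# Made-up labels for illustration: 1 = phishing (positive), -1 = legitimate
actual    <- factor(c(1, 1,  1, -1, -1, -1, -1, 1), levels = c(-1, 1))
predicted <- factor(c(1, 1, -1, -1, -1,  1, -1, 1), levels = c(-1, 1))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(predicted = predicted, actual = actual)

tp <- cm["1", "1"]    # phishing flagged as phishing
tn <- cm["-1", "-1"]  # legitimate passed as legitimate
fp <- cm["1", "-1"]   # legitimate flagged as phishing
fn <- cm["-1", "1"]   # phishing missed

accuracy <- (tp + tn) / sum(cm)
fpr      <- fp / (fp + tn)  # false positive rate
fnr      <- fn / (fn + tp)  # false negative rate

print(cm)
cat("accuracy:", accuracy, " FPR:", fpr, " FNR:", fnr, "\n")
```

Rates computed on rows the model was fitted on tend to look better than rates on unseen rows, so a held-out subset gives a fairer estimate.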
  • 24. PopUp_Window Verification In the decision tree, the attribute SFH has an importance of 100. The tree above shows that if the link or URL has the SFH (Server Form Handler), it is definitely a phishing website. There are exceptions, however: phishing websites sometimes do not have SFH. To overcome that problem, I tried the next most important variable, PopUp_Window, whose importance is 20. For that, PopUp_Window is extracted from the link. If PopUp_Window is available, the FP (False Positive) value is set to 1; otherwise it is set to -1. If a possible PopUpWindow is found in the link, the FP value is set to 0.
  • 27. Using the above classification method, I have identified the minimal effective parameters for detecting phishing websites. Using fewer parameters increases the effectiveness of the algorithm and speeds up the detection process. Online transaction systems can use this algorithm to protect their users from phishing sites when redirecting to their transaction page.
  • 28. Deep Learning Instead of organizing data to run through predefined equations, deep learning sets up basic parameters about the data and trains the computer to learn on its own by recognizing patterns using many layers of processing.
  • 29. Deep learning requires large amounts of labeled data. Deep learning requires substantial computing power (high-performance GPUs combined with clusters or cloud computing are preferable). Most deep learning methods use neural network architectures, which is why deep learning models are often referred to as deep neural networks.
  • 30. Machine learning vs Deep learning In machine learning, we manually choose features and a classifier to sort images. With deep learning, the feature extraction and modeling steps are automatic.
  • 31. Classification using Deep Learning In the previous model, I used a text dataset, which is small. Image datasets are normally larger, and deep learning is well suited to training on such larger datasets. Here, I have taken the CIFAR-10 dataset to classify images.
  • 32. Cifar-10 Dataset The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. The Classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
  • 33. Dataset Collection The Cifar-10 dataset is collected from https://www.cs.toronto.edu/~kriz/cifar.html
  • 34. Deep Learning Architectures
      RNN (Recurrent Neural Networks): speech recognition, handwriting recognition
      LSTM / GRU: natural language text compression, gesture recognition, image captioning
      CNN (Convolutional Neural Networks): image recognition, video analysis, natural language processing
      DBN (Deep Belief Networks): image recognition, information retrieval, natural language understanding, failure prediction
      DSN (Deep Stacking Networks): image recognition, continuous speech recognition
      Here I have chosen CNN for CIFAR-10 image recognition.
  • 35. APIs used with R for CNN: Keras Keras provides a high-level neural networks API developed with a focus on enabling fast experimentation. Keras has the following key features:
      Allows the same code to run on CPU or on GPU.
      User-friendly API which makes it easy to quickly prototype deep learning models.
      Supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, etc.
      Is capable of running on top of multiple back-ends including Tensorflow, CNTK or Theano.
  • 36. Tensorflow TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. The TensorFlow API is composed of a set of Python modules that enable constructing and executing TensorFlow graphs. The tensorflow package provides access to the complete TensorFlow API from within R.
  • 37. Scaling data
      library(keras)
      # Load CIFAR-10 (downloaded on first use)
      cifar <- dataset_cifar10()
      # TRAINING DATA: scale pixel values from [0, 255] to [0, 1]
      train_x <- cifar$train$x / 255
      # Convert the target variable to one-hot encoded vectors (a binary class matrix)
      # using the keras built-in function to_categorical()
      train_y <- to_categorical(cifar$train$y, num_classes = 10)
      # TEST DATA
      test_x <- cifar$test$x / 255
      test_y <- to_categorical(cifar$test$y, num_classes = 10)
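To make the two preprocessing steps concrete, here is a minimal base R sketch of what scaling and one-hot encoding do to the values. The helper `one_hot` is a hypothetical stand-in written for illustration; it is not the keras implementation of `to_categorical()`, and the sample labels and pixel values are made up.

```r
# Hypothetical base R stand-in for one-hot encoding of 0-based class labels
one_hot <- function(labels, num_classes) {
  m <- matrix(0, nrow = length(labels), ncol = num_classes)
  # Row i gets a 1 in column (label + 1), since R indexes from 1
  m[cbind(seq_along(labels), labels + 1)] <- 1
  m
}

# Three made-up labels from the ten CIFAR-10 classes (0 to 9)
y   <- c(0, 3, 9)
enc <- one_hot(y, num_classes = 10)
print(enc)

# Scaling: 8-bit pixel intensities mapped into [0, 1]
pixels <- c(0, 128, 255)
scaled <- pixels / 255
print(scaled)
```

Each row of `enc` contains a single 1, in the column of its class, which is the target format a 10-unit softmax output layer is trained against.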
  • 38. CNN Architecture for classifying Cifar-10
      # A linear stack of layers
      model <- keras_model_sequential()
      # Configuring the model
      model %>%
        # First 2-D convolution layer
        layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                      input_shape = c(32, 32, 3)) %>%
        layer_activation("relu") %>%
        # Another 2-D convolution layer
        layer_conv_2d(filters = 32, kernel_size = c(3, 3)) %>%
        layer_activation("relu") %>%
  • 39.
        # Dropout layer to avoid overfitting
        layer_dropout(0.25) %>%
        layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same") %>%
        layer_activation("relu") %>%
        layer_conv_2d(filters = 32, kernel_size = c(3, 3)) %>%
        layer_activation("relu") %>%
        layer_max_pooling_2d(pool_size = c(2, 2)) %>%
        layer_dropout(0.25) %>%
        # Flatten the input
        layer_flatten() %>%
        layer_dense(512) %>%
        layer_activation("relu") %>%
        layer_dropout(0.5) %>%
        # Output layer: 10 classes, 10 units
        layer_dense(10) %>%
        # Apply the softmax nonlinear activation function to the output layer
        # to calculate cross-entropy
        layer_activation("softmax")
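The size of this network can be checked by hand. The sketch below is plain arithmetic in base R and does not call keras; the helper names are hypothetical. It assumes stride 1 and that convolutions without an explicit `padding` argument use "valid" padding, which is the keras default.

```r
# Spatial size after a square convolution with stride 1
conv_out <- function(size, kernel, padding) {
  if (padding == "same") size else size - kernel + 1
}
# Convolution parameters: kernel weights plus one bias per filter
conv_params <- function(kernel, in_ch, filters) {
  kernel * kernel * in_ch * filters + filters
}
# Dense parameters: weights plus one bias per output unit
dense_params <- function(n_in, n_out) n_in * n_out + n_out

size     <- 32   # CIFAR-10 images are 32 x 32
channels <- 3    # RGB
filters  <- 32
paddings <- c("same", "valid", "same", "valid")  # the four conv layers, in order

total <- 0
for (p in paddings) {
  total    <- total + conv_params(3, channels, filters)
  size     <- conv_out(size, 3, p)
  channels <- filters
}

size  <- size %/% 2                  # 2 x 2 max pooling halves each side
flat  <- size * size * channels      # units after layer_flatten()
total <- total + dense_params(flat, 512) + dense_params(512, 10)

cat("flattened units:", flat, "\n")
cat("trainable parameters:", total, "\n")
```

Almost all of the parameters sit in the first dense layer, which is one reason a machine with 2 GB of memory struggles to train this model.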
  • 40. Problems Faced
      Difficult to collect a dataset for phishing websites.
      Difficult to extract the elements from the URL.
      The system specification is not sufficient to execute the deep learning models.
      Difficulties in installing the Keras and Tensorflow APIs in R.
  • 41. Future Enhancement I am trying to run the CIFAR-10 image recognition deep learning model on a GPU system. I also plan to build gold price prediction and heart disease prediction models using deep learning.