This document discusses using machine learning and deep learning techniques for classification tasks in R. It first uses decision trees to identify the minimal effective parameters for detecting phishing websites, finding that server form handler (SFH) and pop-up windows were most important. It then trains a convolutional neural network on the CIFAR-10 dataset for image classification. Some challenges included limited phishing website data and hardware constraints for deep learning models. Future work involves executing deep learning models on a GPU.
Despite the development of prevention strategies, phishing remains a serious risk even with primary countermeasures such as reactive URL blacklisting in place. Blacklisting is insufficient because of the short lifetime of phishing websites, so developing a real-time phishing website detection method is an effective way to overcome this problem. This research introduces PrePhish, an automated machine learning approach that analyzes phishing and non-phishing URLs to produce reliable results. It shows that phishing URLs typically exhibit a few characteristic connections between the registered-domain level and the path or query level of the URL. Using these connections, a URL is characterized by its inter-relatedness, which is estimated from features mined from its attributes. These features are then fed to machine learning techniques to detect phishing URLs in a real dataset. The classification of phishing and non-phishing websites is implemented by finding a range value and a threshold value for each attribute using decision-based classification. The method is also evaluated in Matlab using three major classifiers, SVM, Random Forest, and Naive Bayes, to assess how it performs on the dataset.
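The per-attribute range/threshold classification described above can be sketched as follows. This is a hypothetical illustration, not the PrePhish implementation: the attribute names, threshold values, and voting rule are all invented for the example.

```python
# Illustrative per-attribute threshold classification: each URL attribute has
# a learned threshold, and a URL is flagged as phishing when enough attributes
# exceed their thresholds. All names and values here are made up.

THRESHOLDS = {
    "url_length": 54,        # URLs longer than this are suspicious
    "num_dots": 3,           # many subdomains often indicate phishing
    "num_special_chars": 5,  # '@', '-', '%' etc. in the URL
}

def classify(attributes, min_votes=2):
    """Return 'phishing' if at least min_votes attributes exceed thresholds."""
    votes = sum(1 for name, limit in THRESHOLDS.items()
                if attributes.get(name, 0) > limit)
    return "phishing" if votes >= min_votes else "legitimate"

print(classify({"url_length": 120, "num_dots": 5, "num_special_chars": 1}))
# → phishing
print(classify({"url_length": 20, "num_dots": 1, "num_special_chars": 0}))
# → legitimate
```

A real system would learn the thresholds from labeled data rather than fixing them by hand.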
PDMLP: Phishing Detection Using Multilayer Perceptron (IJNSA Journal)
Phishing websites are a significant problem on the internet: a type of cyber-attack in which attackers try to obtain sensitive information such as usernames, passwords, or credit card details. The recent growth in deploying phishing-URL detection systems on many websites has produced a massive amount of data for predicting phishing websites. In this paper, we propose a new phishing detection system based on a multilayer perceptron (PDMLP), evaluated on two types of datasets. The performance of these mechanisms is evaluated in terms of Accuracy, Precision, Recall, and F-measure. Results show that PDMLP outperforms the KNN, SVM, C4.5 Decision Tree, RF, and RoF classifiers.
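The four evaluation metrics named above are standard and can be computed from a binary confusion matrix as follows. This is a generic sketch, not the PDMLP paper's evaluation code; the example counts are invented.

```python
# Accuracy, Precision, Recall, and F-measure from binary confusion counts
# (tp = true positives, fp = false positives, fn = false negatives,
#  tn = true negatives).

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

acc, prec, rec, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F={f1:.3f}")
# → accuracy=0.850 precision=0.889 recall=0.800 F=0.842
```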
Analyzing the Effectualness of Phishing Algorithms in Web Applications Inques... (Editor, IJMTER)
The first and most effective instrument of deception is trust. A wolf in sheep's clothing is hard to recognize, and such is the scheme of a phishing website. Phishing is a blend of social engineering and technical exploits designed to persuade a victim to provide personal information for the financial gain of the attacker. It is a kind of network attack in which the attacker creates a near-perfect copy of an existing web page to mislead users. In this paper, we study two anti-phishing algorithms: an end-host based algorithm known as the LinkGuard algorithm, and a content-based approach known as CANTINA.
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
Phishing Websites Detection Using Back Propagation Algorithm: A Review (theijes)
Phishing is an illicit practice employing both social engineering and technological subterfuge to steal clients' personal identity data and financial account credentials. Its impact is severe, as it carries the threat of identity theft and financial loss. This paper explains the back-propagation paradigm used to train a neural network for phishing prediction. We perform a root-cause analysis of phishing and of the incentives behind it. The analysis is intended to show developers the effectiveness of neural networks in data mining and to provide grounds for using neural networks in phishing detection.
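As a minimal illustration of the gradient-based training idea behind back-propagation, the sketch below trains a single logistic neuron (a one-layer special case) on invented binary phishing-style features. The features, labels, and hyperparameters are all assumptions for the example, not the reviewed paper's network.

```python
# Train one logistic neuron by gradient descent on toy binary features.
# Feature columns might encode e.g. "has @ in URL", "uses IP address",
# "served over HTTPS"; labels: 1 = phishing, 0 = legitimate. All invented.
import math

data = [([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 0], 1),
        ([0, 0, 1], 0), ([0, 1, 1], 0), ([0, 0, 0], 0)]

w = [0.0, 0.0, 0.0]
b = 0.0
lr = 0.5

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

for _ in range(1000):
    for x, y in data:
        p = predict(x)
        err = p - y                  # gradient of cross-entropy loss w.r.t. z
        for i in range(len(w)):
            w[i] -= lr * err * x[i]  # "backward" step: update each weight
        b -= lr * err

print([round(predict(x)) for x, _ in data])  # → [1, 1, 1, 0, 0, 0]
```

A multilayer network extends this by propagating the same error signal backwards through hidden layers via the chain rule.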
Classification Model to Detect Malicious URL via Behaviour Analysis (Editor, IJCATR)
A challenging task in cyberspace is detecting malicious URLs. The websites such URLs point to inject malicious code into the client machine or steal crucial information. Because detecting a phishing URL is difficult, detection techniques must be strengthened against emerging attacks. Most existing approaches are feature based and cannot detect dynamic attacks. Attackers commonly use input forms, active content, and an embedded @ symbol in the URL to mount an attack. To detect such attacks, a Behaviour-based Malicious URL Finder (BMUF) algorithm is proposed. It analyzes the behaviour of the URL: an FSM-based state transition diagram models the URL's behaviour as a set of states, and the transition from the initial to the final state is used for classification. The approach tests the genuine and malicious behaviour of the URL based on its responses to the user and accurately determines the nature of the URL.
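The FSM idea described above can be sketched as a transition table driven by observed behaviours, with the final state deciding the class. The states and events below are illustrative assumptions, not the BMUF paper's actual model.

```python
# Hypothetical FSM for URL behaviour: each observed event moves the machine
# through states; the state reached at the end classifies the URL.

TRANSITIONS = {
    ("start", "loads_page"): "benign_so_far",
    ("benign_so_far", "requests_credentials"): "suspicious",
    ("benign_so_far", "shows_content"): "benign",
    ("suspicious", "redirects_offsite"): "malicious",
    ("suspicious", "shows_content"): "benign",
}

def classify_url(events):
    state = "start"
    for event in events:
        # unknown (state, event) pairs leave the state unchanged
        state = TRANSITIONS.get((state, event), state)
    return "malicious" if state in ("suspicious", "malicious") else "benign"

print(classify_url(["loads_page", "requests_credentials", "redirects_offsite"]))
# → malicious
print(classify_url(["loads_page", "shows_content"]))
# → benign
```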
Review of the machine learning methods in the classification of phishing attack (journalBEEI)
Computer networks have developed rapidly, as can be seen from the worldwide trend of users connecting their computers to the Internet, whether for work or to access social media accounts. With such widespread use, however, the privacy of users is at risk, especially those who do not install security systems on their computers. This allows hackers to mount network attacks and steal confidential information such as bank or social media login credentials; phishing is one such attack. The goal of this study is to review the types of phishing attacks and the current methods used to prevent them. Based on the literature, machine learning is widely used to prevent phishing attacks, and several algorithms are available for this purpose. This study focuses on one algorithm in depth, and the methods for implementing it are discussed in detail.
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using... (M. Atif Qureshi)
Slides for my Master's thesis defense; the research was conducted under Prof. Kyu-Young Whang and successfully defended in the Computer Science Dept. at KAIST on 16 December 2010.
Self-learned Relevancy with Apache Solr (Trey Grainger)
Search engines are known for "relevancy", but the relevancy models that ship out of the box (BM25, classic tf-idf, etc.) are just scratching the surface of what's needed for a truly insightful application.
What if your search engine could automatically tune its own domain-specific relevancy model based on user interactions? What if it could learn the important phrases and topics within your domain, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain? What if you could further use SQL queries to explore these relationships within your own BI tools and return results in ranked order to deliver relevance-driven analytics visualizations?
In this presentation, we'll walk through how you can leverage the myriad of capabilities in the Apache Solr ecosystem (such as the Solr Text Tagger, Semantic Knowledge Graph, Spark-Solr, Solr SQL, learning to rank, probabilistic query parsing, and Lucidworks Fusion) to build self-learning, relevance-first search, recommendations, and data analytics applications.
Abstract: The existence of spam URLs in emails and on Online Social Media (OSM) has become a growing phenomenon. To counter the dissemination issues associated with long, complex URLs in emails and the character limits imposed by various OSM (like Twitter), URL shortening gained a lot of traction. URL shorteners take a long URL as input and return a short URL with the same landing page. With its immense popularity over time, the technique has become a prime target for attackers, giving them a way to conceal malicious content. Bitly, a leading service in this domain, is heavily exploited to carry out phishing attacks, work-from-home scams, pornographic content propagation, and more. This puts additional pressure on Bitly and other URL shorteners to detect and act on illegitimate content in a timely manner. In this study, we analyzed a dataset marked as suspicious by Bitly in October 2013 to highlight some fundamental issues in their spam detection mechanism. In addition, we identified some short-URL based features, coupled them with two domain-specific features to classify a Bitly URL as malicious or benign, and achieved a maximum accuracy of 86.41%. To the best of our knowledge, this is the first large-scale study to highlight the issues with Bitly's spam detection policies and to propose a suitable countermeasure.
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.
Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with a prominent query language already known by most developers (and which many external systems can use to query Solr directly).
Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.
We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.
COMPARATIVE ANALYSIS OF ANOMALY BASED WEB ATTACK DETECTION METHODS (IJCI Journal)
In the present scenario, protecting websites from web-based attacks is a great challenge due to malicious users on the Internet, and researchers are trying to find the optimal way to prevent these attacks. Several techniques exist for preventing web attacks, such as firewalls, but most firewalls are not designed to prevent attacks against websites and mostly rely on signature-based detection. In this paper, we analyze different anomaly-based methods for detecting web attacks initiated by malicious users. These methods work differently from signature-based detection, which only catches attacks for which a signature has previously been created. We introduce two methods based on attribute values: the Attribute Length Method (ALM) and the Attribute Character Distribution Method (ACDM). We then perform a mathematical analysis of three different web attacks and compare the False Accept Rate (FAR) results of both methods. The analysis reveals that ALM is a more efficient method than ACDM for detecting web-based attacks.
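An attribute-length method of this kind can be sketched as follows: learn the typical length of each request attribute from normal traffic, then bound the probability of a new value's length deviation using Chebyshev's inequality. This is a hedged sketch of the general technique, with invented training data; the paper's exact formulation may differ.

```python
# Learn mean and variance of attribute lengths from (invented) normal traffic,
# then score new values: a small probability bound means the observed length
# is very unlikely under the normal model, i.e. anomalous.
import statistics

normal_lengths = {"username": [5, 7, 6, 8, 6], "search": [10, 14, 12, 11, 13]}

model = {attr: (statistics.mean(v), statistics.pvariance(v))
         for attr, v in normal_lengths.items()}

def anomaly_prob(attr, value):
    """Upper bound on P(|len - mean| >= deviation) via Chebyshev's inequality."""
    mean, var = model[attr]
    deviation = abs(len(value) - mean)
    if deviation == 0:
        return 1.0
    return min(1.0, var / deviation ** 2)

print(anomaly_prob("username", "alice"))    # typical length: bound stays high
print(anomaly_prob("username", "x" * 200))  # injected payload: tiny bound
```

Flagging then reduces to comparing the bound against a chosen cutoff (e.g. 0.01).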
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Hybrid Approach For Phishing Website Detection Using Machine Learning (vivatechijri)
In this technical age there are many ways an attacker can illegitimately access people's sensitive information. One of them is phishing: misleading people into giving their sensitive information to fraudulent websites that look like the real ones. The phisher's aim is to steal personal information, bank details, and so on. Day by day it is getting riskier to enter personal information on websites, for fear that the site may be a phishing attack. That is why phishing website detection is necessary, to alert the user and block the website. Automated detection of phishing attacks is needed, and machine learning is one efficient technique for it, as it removes the drawbacks of existing approaches. An efficient machine learning model with a content-based approach proves very effective for detecting phishing websites.
Our proposed system uses a hybrid approach that combines a machine learning based method with a content based method. URL-based features are extracted and passed to the machine learning model, while in the content based approach a TF-IDF algorithm detects a phishing website using the top keywords of the web page. This hybrid approach is used to achieve a highly efficient result. Finally, the system notifies and alerts the user as to whether the website is phishing or legitimate.
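The content-based step above (scoring page words with TF-IDF and keeping the top keywords) can be sketched in a few lines. The toy "pages" below are invented; a real system would use the crawled text of actual web pages.

```python
# Score each word of a page with TF-IDF against a small corpus and keep the
# top-scoring keywords. The corpus is a made-up stand-in for crawled pages.
import math

corpus = {
    "page_a": "verify your paypal account enter password password",
    "page_b": "weather forecast sunny rain forecast",
    "page_c": "enter account number account balance",
}

docs = {name: text.split() for name, text in corpus.items()}

def tf_idf(word, doc_name):
    words = docs[doc_name]
    tf = words.count(word) / len(words)               # term frequency
    containing = sum(1 for w in docs.values() if word in w)
    idf = math.log(len(docs) / containing)            # inverse doc frequency
    return tf * idf

def top_keywords(doc_name, k=3):
    scores = {w: tf_idf(w, doc_name) for w in set(docs[doc_name])}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]

# "password" occurs twice in page_a and nowhere else, so it scores highest.
print(top_keywords("page_a"))
```

In the hybrid system, such top keywords would then be compared against the brands or pages the site claims to be, to decide phishing versus legitimate.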
Applying different classification technologies to different types of datasets, such as text and image datasets. Here I have used machine learning and deep learning for text and image datasets, respectively.
Phishing is a social engineering technique whose main aim is to target user information such as user IDs, passwords, and credit card details, resulting in financial loss to the user. Detecting phishing is a challenging problem that relates to human vulnerabilities. This paper proposes detecting phishing websites using different machine learning approaches, evaluating different classification models to predict malicious and benign websites. Experiments are performed on a dataset consisting of malicious and benign samples, and the results show that the proposed algorithms achieve high detection accuracy. Nakkala Srinivas Mudiraj, "Detecting Phishing using Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23755.pdf
Paper URL: https://www.ijtsrd.com/computer-science/computer-security/23755/detecting-phishing-using-machine-learning/nakkala-srinivas-mudiraj
2. Abstract
Classification is one of the mechanisms used to label data. The tools and methods applied differ according to the size of the dataset. Here, I have used two methods, machine learning and deep learning, on text (to detect phishing websites) and image datasets (to label CIFAR-10 images) respectively.
3. System Specifications
Hardware Requirements
Intel Pentium 2.10 GHz / 500 GB HDD / 2 GB RAM
Software Requirements
Windows 8.1 / RStudio 3.4.3 / Rtools / Keras / TensorFlow
Keras – interface between R and Python to implement deep learning models
TensorFlow – backend for Keras in R to implement deep learning models (CPU & GPU compatibility)
4. Literature Review
International Journal of Advance Foundation and Research in Computer (IJAFRC), Volume 3, Issue 4, April 2016. ISSN: 2348-4853, Impact Factor 1.317, "Link Guard Algorithm Approach on Phishing Detection and Control".
Abstract:
Phishing is a new type of network attack where the attacker creates a replica of an existing Web page to fool users (e.g., by using specially designed e-mails or instant messages) into submitting personal, financial, or password data to what they think is their service provider's Web site. In this research paper, we proposed a new end-host based anti-phishing algorithm, which we call Link Guard, utilizing the generic characteristics of the hyperlinks in phishing attacks. These characteristics are derived by analyzing the phishing data archive provided by the Anti-Phishing Working Group (APWG). Because it is based on the generic characteristics of phishing attacks, Link Guard can detect not only known but also unknown phishing attacks. We have implemented Link Guard in Windows XP. Our experiments verified that Link Guard is effective in detecting and preventing both known and unknown phishing attacks with minimal false negatives; it successfully detects 195 out of the 203 phishing attacks. Our experiments also showed that Link Guard is lightweight and can detect and prevent phishing attacks in real time.
Index Terms: Hyperlink, Link Guard algorithm, network security, phishing attacks.
5. International Journal of Engineering and Techniques, Volume 2, Issue 5, Sep–Oct 2016. "Automated Phishing Website Detection Using URL Features and Machine Learning Technique"
Abstract
Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs
host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims
of scams, and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a
timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists
cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of
malicious URL detectors, machine learning techniques have been explored with increasing attention in recent
years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL
Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as
a machine learning task, and categorize and review the contributions of literature studies that address different
dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely
and comprehensive survey for a range of different audiences, not only for machine learning researchers and
engineers in academia, but also for professionals and practitioners in cybersecurity industry, to help them
understand the state of the art and facilitate their own research and practical applications. We also discuss
practical issues in system design, open research challenges, and point out some important directions for future research.
Index Terms — Malicious URL Detection, Machine Learning, Online Learning, Internet Security, Cybersecurity
6. Proposed Work – Phishing Websites
Phishing is an unlawful activity of luring gullible people into revealing their sensitive information on fake websites. The aim of these phishing websites is to acquire confidential information such as usernames, passwords, banking credentials and other personal information. A phishing website looks similar to the legitimate website, so people often cannot tell them apart. Today, users rely heavily on the internet for online purchasing, ticket booking, bill payments, etc. As technology advances, the phishing approaches being used are also progressing, which requires anti-phishing methods to be upgraded continuously.
There are many algorithms used to identify phishing websites, and they use a maximum of 30 parameters. Here, I have tried to show that a minimal set of effective parameters is sufficient for the detection of phishing websites. Using those minimal effective parameters, we would still be able to identify phishing websites.
7. Machine Learning
Machine learning is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world.
Regardless of learning style or function, all machine learning algorithms consist of the following components:
8.
9. R
R is rapidly becoming a leading language in data science and statistics. Today, R is a tool of choice for data science professionals across many industries and fields. It is well suited for statistics, data analysis, and machine learning.
10. Comparison of Link Guard and Random Forest Algorithms

Random Forest:
It is a classification method.
The reported accuracy of this algorithm is 99.7%.
It achieves both low false negative (FN) and low false positive (FP) rates.
To train on the dataset, it uses a vector representation.
It uses regression.

Link Guard:
It is also a classification method.
The reported accuracy of this algorithm is 99.1%.
It achieves a low false negative (FN) rate only.
To train on the dataset, it uses pattern matching.
It uses an end-host based approach.
12. I have found that a maximum of 30 attributes are used to detect phishing websites. Among these, I have tried to find the most important, minimal effective parameters for classifying phishing websites.
13. Decision Tree
The decision tree is one of the most popular and powerful algorithms for classification and prediction. By applying this algorithm, the most effective attribute(s) for detecting phishing websites can be found.
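The split-selection idea behind a decision tree can be sketched in a few lines. The toy example below is Python rather than the deck's R, and the six rows and their labels are made up for illustration; it picks the attribute whose split yields the lowest weighted Gini impurity, which is how a tree ends up with the most informative attribute (here SFH) at its root:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return the attribute whose split gives the lowest weighted impurity."""
    n = len(labels)
    best, best_score = None, float("inf")
    for a in attributes:
        score = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            score += len(subset) / n * gini(subset)
        if score < best_score:
            best, best_score = a, score
    return best

# hypothetical toy data: SFH and PopUp_Window encoded as 1 / 0 / -1
rows = [
    {"SFH": 1, "PopUp_Window": -1}, {"SFH": 1, "PopUp_Window": 1},
    {"SFH": 1, "PopUp_Window": -1}, {"SFH": -1, "PopUp_Window": 1},
    {"SFH": -1, "PopUp_Window": -1}, {"SFH": -1, "PopUp_Window": 1},
]
labels = ["phishing", "phishing", "phishing", "legit", "legit", "phishing"]

print(best_attribute(rows, labels, ["SFH", "PopUp_Window"]))  # prints SFH
```

On this toy data, splitting on SFH leaves one pure subset (all SFH=1 rows are phishing), so it beats PopUp_Window; rpart applies the same idea recursively.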
14. Dataset
The dataset for this task was collected from:
https://archive.ics.uci.edu/ml/datasets.html
https://www.phishtank.com/
15. Libraries
rpart
R provides a library named 'rpart' (short for 'Recursive Partitioning') to perform decision tree operations.
rpart.plot
It also provides a library named 'rpart.plot' to produce a graphical representation of a decision tree model.
16. Code to find the Minimal Effective attributes
# import packages
library(rpart)
library(rpart.plot)
# Load data (the path separators were lost in extraction; '/' is assumed)
psite <- read.csv("G:/ML/Decision Tree/Datasets/Phishingweb.csv")
# Fit model on the first 1200 rows
mod <- rpart(Result ~ ., data = psite[1:1200, ])
summary(mod)
# Plot the fitted decision tree
rpart.plot(mod, type = 4, extra = 101)
# Predict on the feature columns and tabulate against the true labels
# (type = "class" assumes Result is a factor)
p <- predict(mod, psite[, 1:9], type = "class")
table(p, psite$Result)
20. Server Form Handler Verification
In the decision tree, Server Form Handler (SFH) is the root node. This indicates that SFH plays a vital role in detecting phishing websites. The importance of the SFH variable is 47.
So, I tried to show that SFH alone is a minimal effective parameter for identifying phishing websites.
For that, the SFH is extracted from the link: if SFH occurs, the FP value is set to 1; if it does not occur, it is set to -1; and if SFH is only possibly present in the link, the FP value is set to 0.
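The 1 / -1 / 0 encoding described above can be sketched as a small feature extractor. This is a Python sketch (the deck uses R), and it reads the form handler out of a page's HTML form tag; the exact heuristic for a merely "possible" handler (an empty or about:blank action) is an assumption, not the deck's stated rule:

```python
import re

def encode_sfh(html):
    """Encode the Server Form Handler feature per the rule above:
    1 if an explicit form action (SFH) is present, -1 if there is no
    form handler at all, and 0 if the handler is only 'possible'
    (assumed here to mean an empty or about:blank action)."""
    m = re.search(r'<form[^>]*\baction\s*=\s*["\']([^"\']*)["\']', html, re.I)
    if m is None:
        return -1                      # no form handler at all
    action = m.group(1).strip().lower()
    if action in ("", "about:blank"):
        return 0                       # suspicious / merely possible handler
    return 1                           # an explicit server form handler exists

print(encode_sfh('<form action="http://evil.example/post.php">'))  # prints 1
print(encode_sfh('<p>no form here</p>'))                           # prints -1
```

The resulting column of 1 / 0 / -1 values is what the SFHds.csv-style dataset would hold for the SFH attribute.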
21. Code
# import packages ('party' was loaded in the original, but only rpart is used)
library(rpart)
library(rpart.plot)
# Load data (the path separators were lost in extraction; '/' is assumed)
sites <- read.csv("G:/ML/SFHds.csv")
# Fit model on the first 100 rows
model <- rpart(Result ~ ., data = sites[1:100, ])
summary(model)
rpart.plot(model, type = 4, extra = 101)
# Predict (the original referenced 'psite' here; 'sites' is the intended data frame)
ps <- predict(model, sites[, 1:2], type = "class")
table(ps, sites$Result)
24. PopUp_Window Verification
In this decision tree, the attribute SFH has an importance of 100. The tree shows that if the link or URL has an SFH (Server Form Handler), then it is definitely a phishing website.
There are, however, exceptions: some phishing websites do not have an SFH. To cover those cases, I tried the next important variable, PopUp_Window, whose importance is 20.
For that, the PopUp_Window is extracted from the link: if a pop-up window is present, the FP value is set to 1; if not, it is set to -1; and if a pop-up window is only possibly present in the link, the FP value is set to 0.
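Taken together, the two slides describe a two-level rule: SFH at the root, with PopUp_Window as the fallback when SFH is absent. A minimal sketch of that rule (Python, not the deck's R; the exact fallback behaviour for the 0 "possible" values is an assumption):

```python
def classify(sfh, popup):
    """Two-level rule sketched from the slides: an explicit SFH (1) marks
    the site as phishing; otherwise fall back to the PopUp_Window value.
    Treating 0 ('possible') the same as -1 here is an assumption."""
    if sfh == 1:
        return "phishing"      # root split: an explicit SFH is decisive
    if popup == 1:
        return "phishing"      # no SFH, but a pop-up window is present
    return "legitimate"

print(classify(1, -1))   # prints phishing
print(classify(-1, -1))  # prints legitimate
```

This mirrors the shape of the fitted rpart tree: one decisive root attribute, and a second attribute that only matters on the branch where the root is inconclusive.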
27. Using the above classification method, I have identified the minimal effective parameters for detecting phishing websites. This increases the effectiveness of the algorithm and speeds up the detection process.
Online transaction systems can use this algorithm to protect their users from phishing sites while redirecting to their transaction page.
28. Deep Learning
Instead of organizing data to run through
predefined equations, deep learning sets up
basic parameters about the data and trains the
computer to learn on its own by recognizing
patterns using many layers of processing.
29. Deep learning requires large amounts of labeled data.
Deep learning also requires substantial computing power; high-performance GPUs combined with clusters or cloud computing are preferable.
Most deep learning methods use neural network
architectures, which is why deep learning models
are often referred to as deep neural networks.
30. Machine learning vs Deep learning
In machine learning, we manually choose features and a classifier to sort images.
With deep learning, the feature extraction and modeling steps are automatic.
31. Classification using Deep Learning
In the previous model, I used a text dataset, which is relatively small. Image datasets are normally much larger, and for such larger datasets the training process is easier with deep learning. Here, I have taken the CIFAR-10 dataset to classify images.
32. Cifar-10 Dataset
The CIFAR-10 dataset consists of 60000 32x32 colour images in
10 classes, with 6000 images per class.
There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test
batch, each with 10000 images. The test batch contains exactly 1000
randomly-selected images from each class. The training batches
contain the remaining images in random order, but some training
batches may contain more images from one class than another.
The Classes are
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
34. Deep Learning Architectures
RNN – Recurrent Neural Networks
Speech Recognition, Handwriting recognition
LSTM / GRU
Natural Language text Compression, Gesture recognition, Image captioning
CNN- Convolutional Neural Networks
Image recognition, Video analysis, Natural Language processing
DBN – Deep Belief Networks
Image recognition, Information retrieval, natural language understanding,
failure prediction
DSN – Deep Stacking Networks
Image recognition, Continuous Speech recognition
Here I have chosen CNN for CIFAR-10 image recognition.
35. APIs used with R for CNN
Keras
Keras provides a high-level neural networks API developed with a focus
on enabling fast experimentation. Keras has the following key features:
Allows the same code to run on CPU or GPU.
User-friendly API which makes it easy to quickly prototype deep learning
models.
Supports arbitrary network architectures: multi-input or multi-output
models, layer sharing, model sharing, etc.
Is capable of running on top of multiple back-ends including Tensorflow,
CNTK or Theano.
36. Tensorflow
TensorFlow is an open source software library for numerical
computation using data flow graphs. Nodes in the graph represent
mathematical operations, while the graph edges represent the
multidimensional data arrays (tensors) communicated between them.
The flexible architecture allows you to deploy computation to one or
more CPUs or GPUs in a desktop, server, or mobile device with a single
API.
The TensorFlow API is composed of a set of Python modules that enable
constructing and executing TensorFlow graphs. The tensorflow
package provides access to the complete TensorFlow API from within R.
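The dataflow idea described above (nodes as operations, edges as tensors flowing between them) can be illustrated with a toy graph in plain Python; this is not the TensorFlow API, just a sketch of the execution model:

```python
class Node:
    """A node in a toy dataflow graph: the node is an operation, and the
    values flowing along its incoming edges are produced by its inputs."""
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = inputs

    def run(self):
        # evaluate input nodes first, then apply this node's operation
        return self.op(*(n.run() for n in self.inputs))

# constants are nodes whose operation takes no inputs
a = Node(lambda: 3.0)
b = Node(lambda: 4.0)
# edges: a and b feed mul; mul and b feed add
mul = Node(lambda x, y: x * y, (a, b))
add = Node(lambda x, y: x + y, (mul, b))

print(add.run())  # 3.0 * 4.0 + 4.0 = 16.0
```

TensorFlow's value is that it builds such graphs at scale and schedules the node operations onto CPUs or GPUs automatically.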
37. Scaling data
library(keras)
# load the CIFAR-10 dataset (this step is assumed; the slide omits it)
cifar <- dataset_cifar10()
# TRAINING DATA: scale pixel values from [0, 255] to [0, 1]
train_x <- cifar$train$x / 255
# convert the target variable to one-hot encoded vectors using
# Keras' built-in function to_categorical()
train_y <- to_categorical(cifar$train$y, num_classes = 10)
# TEST DATA
test_x <- cifar$test$x / 255
test_y <- to_categorical(cifar$test$y, num_classes = 10)
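What to_categorical and the /255 scaling do can be shown in a few lines. This is a pure-Python sketch (the deck uses Keras in R; `to_categorical_sketch` and `scale` are illustrative names, not Keras functions):

```python
def to_categorical_sketch(ys, num_classes):
    """Pure-Python sketch of Keras' to_categorical(): turn integer
    class labels into one-hot rows."""
    return [[1.0 if i == y else 0.0 for i in range(num_classes)] for y in ys]

def scale(pixels):
    """Scale 8-bit pixel values from [0, 255] down to [0, 1]."""
    return [p / 255 for p in pixels]

print(to_categorical_sketch([3, 0], 5))
# [[0.0, 0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0, 0.0]]
```

One-hot targets are what the 10-unit softmax output layer of the CNN below is compared against during training.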
38. CNN Architecture for classifying Cifar-10
# a linear stack of layers
model <- keras_model_sequential()
# configuring the model
model %>%
  # defining a 2-D convolution layer
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                input_shape = c(32, 32, 3)) %>%
  layer_activation("relu") %>%
  # another 2-D convolution layer
  layer_conv_2d(filters = 32, kernel_size = c(3, 3)) %>%
  layer_activation("relu") %>%
39. # dropout layer to avoid overfitting
  layer_dropout(0.25) %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same") %>%
  layer_activation("relu") %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(0.25) %>%
  # flatten the input
  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  # output layer: 10 classes, 10 units
  layer_dense(10) %>%
  # apply the softmax activation to the output layer so that
  # cross-entropy can be computed over class probabilities
  layer_activation("softmax")
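Assuming stride-1 convolutions, the layer stack above can be traced by hand to see what size the flatten layer produces. A small shape-tracing sketch (Python, not the deck's R):

```python
def conv2d(h, w, k=3, padding="valid"):
    """Spatial output size of a stride-1 convolution: 'same' padding keeps
    h x w, while 'valid' shrinks each side by k - 1."""
    return (h, w) if padding == "same" else (h - k + 1, w - k + 1)

def maxpool(h, w, p=2):
    """Spatial output size of a p x p max-pooling layer."""
    return (h // p, w // p)

# trace the architecture above (32x32x3 input, 32 filters throughout)
h, w = 32, 32
h, w = conv2d(h, w, padding="same")   # conv 1: 32x32
h, w = conv2d(h, w)                   # conv 2: 30x30
h, w = conv2d(h, w, padding="same")   # conv 3: 30x30
h, w = conv2d(h, w)                   # conv 4: 28x28
h, w = maxpool(h, w)                  # pool:   14x14
flat = h * w * 32                     # flatten: 14 * 14 * 32
print(flat)                           # prints 6272
```

So the dense(512) layer receives a 6272-dimensional vector, which dominates the model's parameter count; the 0.5 dropout before it helps keep that large dense layer from overfitting.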
40. Problems Faced
It was difficult to collect a dataset of phishing websites.
It was difficult to extract the elements from the URLs.
The system specification was not sufficient to execute the deep learning models.
There were difficulties in installing the Keras and TensorFlow APIs in R.
41. Future Enhancement
I am trying to execute the CIFAR-10 image recognition deep learning model on a GPU system.
I am also trying to build gold price prediction and heart disease prediction models using deep learning.