Genetic Programming Ensemble to improve k-NN based
network intrusion detection system
Authors
Imran Ahmed Malik, Student at Sharda University, Greater Noida
Email: imran409.im@gmail.com
Mrs. Amrita Prasad, Asst. Professor at Sharda University
Email: amrita.prasad@sharda.ac.in
Abstract—From the start of network system, security
threats, usually known as intrusions, have been an important
and critical issue in networks, data and information systems.
As networks have grown drastically, attackers have become
stronger and repeatedly compromise the security of systems.
The Intrusion Detection System (IDS) has therefore become an
essential tool in network security. Detection and prevention
of such attacks depend mainly on the capability and
efficiency of the IDS. Because network scale is increasing at
a high pace, a lightweight IDS with a high detection rate is
required. Many ensemble mechanisms based on different
methodologies have been proposed; each has its own benefits
and shortcomings.
In this paper we propose an ensemble of k-NN classifiers
using genetic programming. The KDD CUP 1999 dataset is used
for intrusion detection. The dataset contains 4,900,000
records. Each record contains 41 features and is labeled as
either normal or an attack. There are 22 types of attacks,
classified into four categories: DoS, Probe, R2L and U2R.
Since the raw dataset contains redundant records, we perform
some preprocessing on it. The first step is removing
redundancy; we then normalize the data to the range [0, 1]
using statistical methods. After normalization we extract
features using PCA; 300 features are extracted using the
Cartesian product. The dataset is then applied to the
ensemble of classifiers, which classifies the input data
into five categories: one normal and four attack categories.
The ensemble of classifiers using genetic programming gives
an accuracy of 99.97%.
Keywords—Intrusion detection; Anomaly detection;
Misuse detection; KDD Cup 99; Ensemble approaches;
Genetic programming; PCA
I. INTRODUCTION
In the past two decades, rapid progress in Internet-based
technology has opened new application areas for computer
networks. These application areas have made the network an
attractive target for abuse and a major vulnerability for
the community. What is a pastime or a challenge to win for
some people has become a nightmare for others, and in many
cases malicious acts have turned this nightmare into reality.
In addition to hacking, new entities such as worms, Trojans
and viruses have introduced more panic into the networked
society. As the current situation is a relatively new
phenomenon, network defenses are weak [1]. However, given
the popularity of computer networks, their connectivity and
our ever-growing dependency on them, realization of the
threat can have devastating consequences. Securing such an
important infrastructure has become a top-priority research
area for many researchers.
The aim of this paper is to study Intrusion Detection Systems
(IDS) and to analyze some current problems in this research
area. Compared with mature and well-established research
areas, IDS is a young field. However, because of its
mission-critical nature, it has attracted significant
attention. The volume of research on this subject is
constantly rising and every day more researchers engage in
this field. The threat of a new wave of cyber or network
attacks is not just a possibility to be considered; it is an
accepted fact that can occur at any time. Current IDS
technology is far from a reliable protective system; the
main idea is instead to make it possible to detect novel
network attacks.
There are two major approaches to detecting intrusions:
signature-based and anomaly-based intrusion detection
[3][4]. In the first approach, attack patterns or the
behavior of the intruder are modeled (the attack signature
is modeled), and the system signals an intrusion once a
match is detected. In the second approach, the normal
behavior of the network is modeled, and the system raises an
alarm once the behavior of the network does not match its
normal behavior. A third Intrusion Detection (ID) approach
is specification-based intrusion detection. In this
approach, the normal (expected) behavior of the host is
specified and then modeled. As a direct price for the
security, the freedom of operation of the host is limited.
In this paper, these approaches are briefly discussed and
compared.
An intruder accessing the system without anyone being able
to notice it is the worst nightmare of any network security
officer. Since current ID technology is not accurate enough
to provide reliable detection, heuristic methodologies can
be a way out. As a last line of defense, and in order to
reduce the number of undetected intrusions, heuristic
methods such as Honey Pots (HP) can be deployed [5][6]. HPs
can be installed on any system and act as a trap or decoy
for a resource.
Another major problem in this research area is the speed of
detection. Computer networks are dynamic in the sense that
the information and data within them change continuously.
Therefore, to detect an intrusion accurately and promptly,
the system has to operate in real time. Operating in real
time means not only performing the detection in real time
but also adapting to the new dynamics of the network.
Real-time IDS is an active research area pursued by many
researchers. Most of this research aims to introduce
time-efficient methodologies so that the implemented methods
are suitable for real-time operation.
The main emphasis of this paper is on the detection part of
intrusion detection. In this work we train the classifiers
using the KDD CUP 1999 dataset. The original dataset
contains approximately 4.9 million records covering 22 types
of attacks. These attacks are categorized into four classes:
DoS, Probe, R2L and U2R. The dataset contains a large number
of redundant values, so we first remove the redundancy. We
then perform value conversion, i.e. from text data to
numeric data, and normalize the resulting dataset using
statistical normalization to the range [0,1]. After
normalization the dataset is sent for PCA feature
extraction; PCA extracts 300 features using the Cartesian
product. The dataset so obtained is used to train the
classifiers. After the classifiers have been trained, they
are tested using the KDD test data. The data is given to
each classifier and the output predictions are combined.
Genetic programming is then applied: based on the fitness
function, it selects the optimized classifiers. The
optimized classifiers so obtained are ensembled using
majority voting. The accuracy of the resulting ensemble is
approximately 99.9%.
II. BACKGROUND
In this section we discuss the mechanisms used in this
research: genetic programming, k-NN classifiers, the dataset
description and the ensemble mechanism.
Genetic Programming
To produce each subsequent population, three GP operators
are used: reproduction, mutation and crossover. These
operators help the search converge to an optimal solution.
The optimal composite classifier is expected at the end of
the GP process.
GPs are heuristic search algorithms designed to simulate
processes in natural systems. GP belongs to the larger class
of evolutionary algorithms, which generate solutions to
optimization problems using techniques inspired by natural
evolution, such as inheritance, mutation, selection and
crossover. They are adaptive heuristic search algorithms
based on the evolutionary ideas of natural selection and
genetics. The basic idea of these evolutionary algorithms is
to simulate the processes in natural systems that are
necessary for evolution. GPs are used for numerical and
computational optimization and for studying the evolutionary
aspects of models of social systems. The GP approach has
also been used to optimize a set of indices derived from
complex network theory. Genetic algorithms are search
algorithms based on the mechanics of natural selection and
natural genetics. They combine survival of the fittest among
string structures with a structured yet randomized
information exchange to form a search algorithm with some of
the innovative flair of human search.
GP performs a balanced search over the nodes: population
diversity must be maintained so that no important
information is lost, while there is also a strong need to
focus on the fit portions of the population. Reproduction in
GP is the process of producing offspring. GPs have also been
used to supplement network-based approaches. The first
requirement of a GP is a set of solutions, represented by
chromosomes, called the population. Solutions taken from one
population are used to form a new population, in the hope
that the new population will be better than the old one. The
best solutions are selected to form new offspring. Solutions
are selected on the basis of their fitness, i.e. the most
suitable ones get more chances to reproduce.
GPs are used for search, optimization and machine learning.
They are a very popular optimization method, are often
successful in real applications, and are of interest to
those working on meta-heuristics. Evolutionary algorithms
are used to solve problems that do not already have a
well-defined efficient solution, and genetic algorithms have
been used to solve optimization problems.
Nearest neighbor Classifier
Among the various methods of supervised statistical pattern
recognition, the Nearest Neighbor rule achieves consistently
high performance, without a priori assumptions about the
distributions from which the training examples are drawn. It
involves a training set of both positive and negative cases.
A new sample is classified by calculating the distance to
the nearest training case; the sign of that point then
determines the classification of the sample [11][12]. The
k-NN classifier extends this idea by taking the k nearest
points and assigning the sign of the majority. It is common
to select k small and odd to avoid ties (typically 1, 3 or
5). Larger k values help reduce the effects of noisy points
within the training data set, and the choice of k is often
made through cross-validation.
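As an illustration of the rule described above, the following is a minimal sketch of a k-NN classifier in plain Python. It is not the implementation used in this work; the function names and the toy two-feature records are hypothetical and chosen only to show the distance computation and the majority vote among the k nearest points.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, sample, k=3):
    """Label `sample` by majority vote among its k nearest training points."""
    # Rank training indices by distance to the sample and keep the k closest.
    nearest = sorted(range(len(train_X)),
                     key=lambda i: euclidean(train_X[i], sample))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two normalized features per connection record.
train_X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
           [0.9, 1.0], [1.0, 0.9], [0.8, 0.8]]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(knn_predict(train_X, train_y, [0.05, 0.05], k=3))
print(knn_predict(train_X, train_y, [0.95, 0.95], k=3))
```

Choosing an odd k, as the text recommends, avoids tied votes in the two-class case.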
Nearest neighbor machine learning algorithms depend on the
position of the instances present in the input data. Newly
encountered samples are classified using the data already
stored in the database: a new sample is classified according
to the closest stored samples, as determined by the
Euclidean distance [13], and the decision is determined by
the k closest examples. To assign the correct class to a
data sample, an optimal mapping function f(x) is used. For
classification problems with only two classes, each sample
is classified as either C1 or C2. The performance of the
Nearest Neighbor classifier is assessed with the receiver
operating characteristic (ROC) curve, a graphical plot that
illustrates the performance of a binary classifier system as
its discrimination threshold is varied. The curve is created
by plotting the true positive rate (TPR) against the false
positive rate (FPR) at various threshold settings. The true
positive rate is also known as sensitivity, or recall in
machine learning. The figure below shows the ROC curve for
the GP-based NN classifier.
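The construction of the ROC curve can be sketched as follows. This is an illustrative example only: the scores and labels are made up, and treating a k-NN score as the fraction of the k neighbors that vote "attack" is an assumption rather than something stated in this work.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first.

    `labels` uses 1 for attack and 0 for normal; both classes must be present.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # A record is flagged as an attack when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Each threshold yields one point on the curve; lowering the threshold moves the operating point toward higher TPR at the cost of higher FPR.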
KDD CUP 99 Data Set Descriptions
KDD’99 is the most widely used data set for evaluating
intrusion detection methods. The data set is built on the
data captured in the DARPA’98 IDS evaluation program.
DARPA’98 consists of about 4 gigabytes of compressed binary
data from 7 weeks of network traffic, which is processed
into about 5 million connection records, each containing
about 100 bytes of traffic data. The two weeks of test data
contain about 2 million connection records. The KDD training
dataset consists of about 4,900,000 single connection
vectors, each of which contains 41 features and is labeled
as normal or as an attack, with exactly one specific attack
type. The attacks fall into one of the following four
groups:
1) DoS: in this attack the attacker keeps some
computing resource busy, or makes the resource
unavailable or too full to handle legitimate
requests, or completely denies users access to a
machine.
2) U2R: in this type of attack the attacker exploits
a vulnerability in order to become the
administrator of the victim computer. Access is
typically gained by password sniffing, a
dictionary attack or social engineering.
3) R2L: occurs when an attacker who has the ability
to send packets to a machine over a network, but
who does not have an account on that machine,
exploits some vulnerability to gain local access
as a user of that machine.
4) Probing attack: in this type of attack the
attacker collects information about a network of
machines, and the information is used to
compromise its security controls.
The KDD’99 CUP dataset features can be grouped into three
main categories:
1) Basic features: this group contains the attributes
that can be extracted from a TCP/IP connection.
Many of these features lead to a delay in
detection.
2) Traffic features: this group contains the features
that are computed over a window interval. It is
divided into two categories:
a) “same host” features: examine the connections
established in the past 2 seconds that have the
same destination host as the current connection,
and calculate statistics on protocol, service,
behavior, etc.
b) “same service” features: examine the connections
established in the past 2 seconds that have the
same service as the current connection.
The two kinds of “traffic” features mentioned above are
time-based. However, there are several slow attacks that
scan the ports using a much larger time interval than 2
seconds, e.g. one scan per minute. Such attacks do not
produce intrusion patterns within a time window of two
seconds. To solve this problem, the “same host” and “same
service” features are recalculated over a connection window
of 100 connections instead of a time window of two seconds.
These are termed connection-based traffic features.
3) Content features: unlike most Probing and DoS
attacks, R2L and U2R attacks do not show
sequential patterns. DoS and Probing attacks
involve many connections to some host(s) in a very
short span of time, whereas R2L and U2R attacks
are embedded in the data portions of the TCP/IP
packets and normally involve only a single
connection. To detect these kinds of attacks,
some features are needed that look for suspicious
behavior in the data portion, e.g. the number of
failed login attempts.
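The difference between the time-based and the connection-based traffic features described above can be sketched as follows. This is a simplified illustration and not the KDD feature extractor: the function name, the `(timestamp, destination host)` record layout and the slow-probe trace are hypothetical, and only a same-host count is computed.

```python
def same_host_counts(connections, time_window=2.0, conn_window=100):
    """For each connection return (time_based, connection_based) counts.

    `connections` is a list of (timestamp_seconds, dst_host) tuples sorted
    by time. Both counts look only at earlier connections to the same host.
    """
    results = []
    for i, (ts, host) in enumerate(connections):
        previous = connections[:i]
        # Time-based: same-host connections within the last `time_window` s.
        time_based = sum(1 for t, h in previous
                         if h == host and ts - t <= time_window)
        # Connection-based: same-host connections among the last
        # `conn_window` records, regardless of how long ago they occurred.
        conn_based = sum(1 for _, h in previous[-conn_window:] if h == host)
        results.append((time_based, conn_based))
    return results

# Hypothetical slow probe: one connection to host "A" per minute.
slow_probe = [(60.0 * i, "A") for i in range(5)]
print(same_host_counts(slow_probe))
```

For the slow probe, the 2-second window never sees an earlier connection, while the 100-connection window accumulates all of them, which is why the connection-based features expose this kind of attack.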
Data Preprocessing
The KDD Cup 1999 dataset contains a large amount of
redundant data, so we first remove the redundancy from the
dataset. After removing the redundancy we convert the text
data into numeric data. We then normalize the resulting data
using statistical normalization, which converts the mean to
zero and the variance to one. Statistical normalization is
defined as:
$$X_i = \frac{v_i - \mu}{\sigma}$$
where µ is the mean of the n values and σ is their standard
deviation. Statistical normalization is well suited to large
amounts of data, under the assumption that the dataset
follows a normal distribution. In this work the normalized
values are scaled to the range [0,1] rather than to the
range -1 to 1. After normalization the data is sent to PCA
for feature extraction.
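The normalization step can be sketched as follows. The `z_score` function follows the formula given above. Z-scores by themselves are not confined to a fixed interval, so the `min_max` rescaling shown here is an assumption about how the stated [0,1] range could be reached; it is not a step confirmed by this work. The sample column is hypothetical.

```python
import math

def z_score(values):
    """Statistical normalization: subtract the mean, divide by the std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mean) / std for v in values]

def min_max(values):
    """Assumed extra step: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

duration = [0, 2, 4, 6, 8]            # hypothetical raw feature column
standardized = z_score(duration)      # mean 0, variance 1
scaled = min_max(standardized)        # values between 0 and 1
print(standardized)
print(scaled)
```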
Principal component analysis (PCA)
Principal component analysis (PCA) is one of the most
valuable results of applied linear algebra. PCA is used
abundantly in all forms of analysis, from neuroscience to
computer graphics, because it is a simple, non-parametric
method of eliminating redundant data from confusing data
sets. With minimal additional effort, PCA provides a roadmap
for reducing a complex data set to a lower dimension to
reveal the sometimes hidden, simplified dynamics that often
underlie it. As some of the network features have a higher
probability of being involved in network intrusions, we have
used PCA to identify these features. PCA was applied to the
training dataset in order to determine the features that
appear most often in the mechanism of an attack.
According to the obtained results, we have selected
three features out of the 41 used to describe each
connection of the KDD99Cup dataset. The goal was to select
the smallest possible number of features while maintaining a
high detection rate of intrusions, so that detection can be
performed in real time. The 41 features of the KDD99Cup
dataset and their explanations are given, together with the
selected features, up to a dimension of 300, that we
obtained using the same PCA method as described above.
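To make the dimensionality reduction concrete, the following is a minimal sketch of PCA in plain Python: it centers the data, builds the covariance matrix and finds the first principal axis by power iteration. Power iteration is used here only to keep the example dependency-free; the function names and the four sample rows are hypothetical, and a library eigen-decomposition would normally be preferred.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered

def first_component(cov, iterations=200):
    """Dominant eigenvector of `cov` by power iteration (first principal axis)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: the second feature is roughly twice the first.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
cov, centered = covariance_matrix(rows)
axis = first_component(cov)
# Project each centered record onto the first principal axis (2-D to 1-D).
projected = [sum(c[j] * axis[j] for j in range(len(axis))) for c in centered]
print(axis)
print(projected)
```

Because the two features are almost perfectly correlated, a single projected value per record retains nearly all of the variance, which is the sense in which PCA removes redundant data.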
Classifier Ensemble
Dietterich described a number of combination methods based
on machine learning. Sharkey pointed out that the limiting
factor in combining classifiers is the lack of awareness of
the full range of available modular structures, because
there is little agreement on how to describe and classify
the various classes of classifiers. A comprehensive
categorization scheme of classifier ensembles is as follows:
1. Voting classifier ensembles
2. Classifier ensembles by manipulating training samples
3. Homogeneous classifier ensembles
4. Recursive partition ensembles
5. Heterogeneous classifier ensembles
The ensemble method used in this work is voting.
Voting classifier ensembles
The three main categories are:
1. Simple voting scheme: each individual classifier
casts an equally weighted vote. The input is
assigned to the class that receives the majority
of the votes.
2. Weighted voting scheme: each vote receives a
weight proportional to the estimated
generalization performance of the corresponding
classifier. This scheme has higher performance
than simple voting.
3. Weighted majority algorithm: similar to weighted
voting; the difference lies in how the weights
are generated.
In this work the majority voting mechanism is used to
combine the classifiers into a composite classifier.
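The first two voting schemes listed above can be sketched as follows. This is an illustrative example: the function names, the three predictions and the weights are hypothetical, and the weights stand in for whatever estimate of classifier performance is available.

```python
from collections import Counter

def majority_vote(predictions):
    """Simple voting: every classifier casts one equally weighted vote."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted voting: each vote counts in proportion to its weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Hypothetical predictions of three k-NN classifiers for one record.
votes = ["DoS", "normal", "DoS"]
print(majority_vote(votes))
print(weighted_vote(votes, [0.2, 0.9, 0.3]))
```

With equal weights the two DoS votes win, whereas a sufficiently trusted single classifier can overturn the result under weighted voting.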
III. PROPOSED WORK
Fig 1 shows the proposed model for intrusion detection using
the KDD Cup 1999 dataset. First, the k-NN classifiers are
trained on the KDD Cup 1999 data. After training, each
classifier is tested by applying KDD Cup 1999 data as test
data, and the output predictions of the classifiers are
combined. Genetic programming is then applied: its fitness
function selects the optimized classifiers, which are then
ensembled using majority voting. The fitness function
defined in genetic programming is the sum of six variables:
F = records + numfolds + K_value + Time + model + accuracy
The proposed approach consists of two phases. In the first,
the training phase, a set of rules for detecting intrusions
is generated offline using network audit data. In the second
phase, the best rules, i.e. the rules with the highest
fitness values, are used for intrusion detection in the
real-time environment.
Fig 1: Diagram of proposed model
The algorithm for generating new rules is as follows. The
first step is the initialization of an initial population,
in which every gene is given a random value. Next, the
parameters of genetic programming (crossover and mutation
rates, population size, stopping criterion for rule
evolution) are specified and the network audit data is
loaded. The working of the algorithm is shown in Fig 2.
The initial population is then evolved for a number of
generations. In every generation the quality of each rule,
i.e. its fitness value, is computed according to the fitness
function; next, a number of rules with the highest fitness
values are selected; finally, the genetic operators
(crossover and mutation) are applied with a certain
probability. The output of the algorithm is a set of rules
for intrusion detection. For the purpose of this work, two
subsets of the KDD99Cup dataset were derived, one for
training and one for testing. Every connection carries a
label stating whether it is a normal connection or a
specific kind of attack. The training subset included
various attacks and normal connections. Most of the selected
connections are normal, as is usually the case in real-world
networks.
Fig 2: Flow chart of Genetic Programming
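The evolutionary loop described above (random initialization, fitness evaluation, selection of the fittest, crossover and mutation) can be sketched as follows. This is a simplified illustration under stated assumptions: individuals are fixed-length lists of numbers rather than GP trees, the toy fitness function is hypothetical and is not the fitness function F of this work, and the default crossover and mutation rates are illustrative values only.

```python
import random

def evolve(fitness, gene_count, pop_size=20, generations=30,
           crossover_rate=0.6, mutation_rate=0.01, seed=0):
    """Evolve a population of gene vectors and return the fittest one."""
    rng = random.Random(seed)
    # Initialization: every gene receives a random value.
    population = [[rng.random() for _ in range(gene_count)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the half of the population with the highest fitness.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = list(a)
            if rng.random() < crossover_rate:       # one-point crossover
                cut = rng.randrange(1, gene_count)
                child = a[:cut] + b[cut:]
            # Mutation: each gene is replaced with a small probability.
            child = [rng.random() if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

def toy_fitness(genes):
    """Hypothetical fitness: highest when every gene is close to 0.5."""
    return -sum((g - 0.5) ** 2 for g in genes)

best = evolve(toy_fitness, gene_count=4)
print(best, toy_fitness(best))
```

Because the fittest half survives unchanged into the next generation, the best fitness in the population never decreases from one generation to the next.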
Genetic programming is used to inductively generate a GP
classifier as a K-Nearest Neighbor based ensemble for the
task of data classification. Nearest Neighbor, in fact, can
be interpreted as a composition of functions, where the
function set is the set of attribute tests and the terminal
set is the set of classes. The function set can be obtained
by converting each attribute into an attribute-test
function. At the beginning, the fitness of each individual
(K) is evaluated. Then, at each generation, every individual
undergoes one of the genetic operators (reproduction,
crossover, mutation) depending on the probability test. If
crossover is applied, the mate of the current individual is
selected as the neighbor having the best fitness, and the
offspring are generated. The current k-NN individual is then
replaced by the best of the two offspring if the fitness of
the latter is better than that of the former. The fitness of
each k-NN classifier is computed on the whole training data.
After the number of generations defined by the user has been
executed, the individual with the best fitness represents
the classifier. After optimizing the classifiers we use the
majority voting mechanism to combine them. With this
mechanism the output with the highest vote is selected for
classification, so that the shortcoming of one classifier is
compensated by the classifiers with higher accuracy.
IV. RELATED WORK
M. S. Hoque et al., in “An Implementation of Intrusion
Detection System Using Genetic Algorithm” [13]: Various
approaches are used in intrusion detection, but
unfortunately none of the systems so far is completely
flawless, so the quest for improvement continues. In this
progression, the authors present an Intrusion Detection
System (IDS) that applies a genetic algorithm (GA) to
efficiently detect various types of network intrusions.
Parameters and evolution processes for the GA are discussed
in detail and implemented. To implement and measure the
performance of the system they used the standard KDD99
benchmark dataset and obtained a reasonable detection rate.
To measure the fitness of a chromosome they used the
standard deviation equation with distance.
S. Owais et al., in
“Survey: Using Genetic Algorithm Approach in Intrusion
Detection Systems Techniques” [14]: Intrusion detection
systems (IDS) offer techniques for modeling and recognizing
normal and abusive system behavior. GAs can be successfully
used to tune the membership functions used by the IDS. The
paper surveys approaches based on IDS and the application of
GAs to IDS. As evolutionary algorithms, GAs have been
successfully used in different types of IDS. Using a GA
returned impressive results: the best fitness value was very
close to the ideal fitness value. GA is a randomized search
method often used for optimization problems. The GA was
successfully able to generate a model with the desired
characteristics of a high correct detection rate and a low
false positive rate for the IDS. It was used successfully in
IDS to discriminate normal behavior from intrusive behavior,
and clustering GAs are a promising method for the detection
of malicious intrusions into computer systems.
Z. Bankovic et al., in
“Improving Network Security Using Genetic Algorithm
Approach” [15]: In this work the authors realized a misuse
detection system based on a genetic algorithm (GA) approach.
The KDD99Cup training and testing datasets were used for
evolving and testing new rules for intrusion detection. To
be able to process network data in real time, principal
component analysis (PCA) was used to extract the most
important features of the data. In that way they were able
to keep a high level of detection rates of attacks while
speeding up the processing of the data. A software
implementation of the proposed approach is presented. The
genetic algorithm was used to derive association rules for
intrusion detection, while principal component analysis was
used to identify the most important features of network
connections.
P. LaRoche et al., in “Genetic Programming Based WiFi Data
Link Layer Attack Detection”
[16]: This paper presents a genetic programming based
detection system for Data Link layer attacks on a WiFi
network. The authors explore the use of two different
fitness functions in order to achieve both a high detection
rate and a low false positive rate. Results show that the
detection system developed can achieve a detection rate
above 90% and a false positive rate below 1%. Future work
will explore the use of larger data sets for training and
testing the L-GP based IDS, which will allow the
effectiveness of the work to be confirmed over larger
networks as well as a varied number and length of DoS
attacks. The authors also plan to apply the same approach to
other WiFi attacks, with the aim of developing an IDS that
can be used to detect a range of attacks.
G. Folino et al., in “GP Ensemble for Distributed
Intrusion Detection Systems” [17]: In this paper an
intrusion detection algorithm based on GP ensembles is
proposed. The algorithm runs on a distributed hybrid
multi-island model-based environment to monitor
security-related activity within a network. Each island
contains a cellular genetic program whose aim is to generate
a decision-tree predictor, trained on the local data stored
in the node. The suitability of genetic programming as a
component learner of the ensemble is investigated.
Experimental results show the applicability of the approach
to this kind of problem. Future research aims at extending
the method to consider not batch data sets but data streams
that change online on each node of the network.
W. Lu et al., in
“Detecting New Forms of Network Intrusion Using Genetic
Programming” [18]: In this paper, a rule evolution approach
based on Genetic Programming (GP) for detecting novel
attacks on networks is presented, and four genetic
operators, namely reproduction, mutation, crossover and
dropping condition operators, are used to evolve new rules.
New rules are used to detect novel or known network attacks.
A training and testing dataset proposed by DARPA is used to
evolve and evaluate these new rules. The proof-of-concept
implementation shows that a rule generated by GP has a low
false positive rate (FPR), a low false negative rate and a
high rate of detecting unknown attacks. Moreover, the rule
base composed of new rules has a high detection rate with a
low FPR. An alternative to the DARPA evaluation approach is
also investigated. The new rules generated by GP show the
potential to detect novel forms of attack. However, the
detection result is not good for some runs because the
selection of crossover and mutation points in the
corresponding operations is random. In addition, the choice
of the probabilities of the genetic operators is experience
based. In their implementation, the probabilities of
mutation and crossover are 0.01 and 0.6, respectively.
Wei Li et al., in “Using Genetic Algorithm for
Network Intrusion Detection” [19]: This paper describes a
technique of applying Genetic Algorithm (GA) to network
Intrusion Detection Systems (IDSs). A brief overview of the
Intrusion Detection System, genetic algorithm and related
detection techniques is presented. Parameters and the
evolution process for the GA are discussed in detail. Unlike
other implementations of the same problem, this
implementation considers both temporal and spatial
information of network connections in encoding the network
connection information into rules in the IDS. Future work
includes creating a standard test data set for the proposed
genetic algorithm and applying it to a test environment. A
detailed specification of the parameters to consider for the
genetic algorithm should be determined during the
experiments. Combining knowledge from different security
sensors into a standard rule base is another promising area
in this work.
Srinivas Mukkamala et al., in “Modeling Intrusion Detection
Systems Using Linear Genetic Programming Approach” [20]:
This paper investigates the suitability of the linear
genetic programming (LGP) technique for modeling efficient
intrusion detection systems, comparing its performance with
artificial neural networks and support vector machines.
Because of the increasing incidence of cyber attacks,
building effective intrusion detection systems (IDSs) is
essential for protecting information systems security, yet
it remains an elusive goal and a great challenge. The
authors also investigate key feature identification for
building efficient and effective IDSs. They note, however,
that the differences in accuracy figures tend to be very
small and may not be statistically significant, especially
in view of the fact that the 5 classes of patterns differ
tremendously in size. More definitive conclusions can only
be made after analyzing more comprehensive sets of network
traffic data.
Mark Crosbie et al., in “Applying Genetic
Programming to Intrusion Detection” [22] This paper
presents a potential solution to the intrusion detection
problem in computer security. It uses a combination of work
in the fields of Artificial Life and computer security.
It shows how an intrusion detection system can be
implemented using autonomous agents, and how these
agents can be built using Genetic Programming. The prototype
development work raised many questions, chief among them
how to make statements about the effectiveness of the
intrusion detector: how can one be sure it will detect a
specific intrusion, and can a probability of its
effectiveness be computed a priori?
V. RESULTS AND ANALYSIS
The algorithm creates a k-NN classifier from the training
data while trying to maximize the probability of detection
and reduce errors for each class in the training data. Models
were created using the KDD data set: one model was built
from the KDD training data subset and a second from the KDD
testing data subset. The model built from the training data
subset was tested on the testing data subset, and vice
versa. After creating the k-NN models for the Probe, DoS,
U2R and R2L attack categories, optimized rules were
extracted using the Nearest Neighbor rules utility.
The total number of KDD CUP 1999 instances used in this
model is around one million because of limited computer
memory. The confusion matrix for the KDD dataset is shown
in the table below.
Table 1: Confusion matrix of the KDD CUP 1999 dataset
(rows: actual class; columns: predicted class)
          Normal   Probe     DoS     R2L     U2R
Normal     43224      30      20       5       0
Probe         90   27135      10       0       0
DoS           10      25   27180       0       0
R2L           30       0      30    2178       0
U2R            0       0       0       0      33
The performance of the ensemble is determined by
analyzing the following terms:
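\[ \text{Accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{positive} + \text{negative}} \]
\[ \text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \]
\[ \text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \]
\[ \text{Specificity} = \frac{\text{true negative}}{\text{true negative} + \text{false positive}} \]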
The table below shows the accuracy, precision, recall and
specificity obtained for each class in this work:
Table 2: Values of different parameters obtained
Class    Precision   Specificity   Recall   Accuracy
Normal   0.997       0.997         0.998    0.432
Probe    0.997       0.999         0.996    0.271
DoS      0.997       0.999         0.998    0.271
R2L      0.997       0.999         0.973    0.022
U2R      1.000       1.000         1.000    0.003
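The per-class figures can be recomputed from the confusion matrix in Table 1 with a short sketch. It assumes that rows are actual classes and columns are predicted classes, and it treats each class one-vs-rest; the function names are hypothetical. Under this standard definition the per-class accuracy computed here is (TP + TN) / total, which differs from the Accuracy column of Table 2, whose values appear to correspond to each class's share of correctly classified records (an interpretation, not something the text states).

```python
LABELS = ["Normal", "Probe", "DoS", "R2L", "U2R"]

# Confusion matrix from Table 1 (assumed: rows = actual, columns = predicted).
MATRIX = [
    [43224, 30, 20, 5, 0],
    [90, 27135, 10, 0, 0],
    [10, 25, 27180, 0, 0],
    [30, 0, 30, 2178, 0],
    [0, 0, 0, 0, 33],
]


def per_class_metrics(matrix):
    """Return one-vs-rest precision, recall, specificity and accuracy for each class."""
    total = sum(sum(row) for row in matrix)
    metrics = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # actual class i, predicted as another class
        fp = sum(row[i] for row in matrix) - tp   # predicted class i, actually another class
        tn = total - tp - fn - fp
        metrics.append({
            "precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
        })
    return metrics


def overall_accuracy(matrix):
    """Fraction of all records that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


if __name__ == "__main__":
    for label, m in zip(LABELS, per_class_metrics(MATRIX)):
        print(label, {name: round(value, 4) for name, value in m.items()})
    print("overall accuracy:", overall_accuracy(MATRIX))
```

The diagonal of Table 1 sums to 99,750 out of 100,000 records, so the overall accuracy computed this way is 0.9975.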
From the two tables above, the overall accuracy is
approximately 99.8%. The accuracies of the individual
k-NN classifiers are shown in the table below:
Table 3: Accuracy of a single k-NN classifier for
different values of k
Value of k   Accuracy (%)
14           96.26225582
10           93.83818346
18           95.29526776
20           98.20442344
40           91.71358899
30           97.92837342
The table above shows that a single classifier always has
lower accuracy than the ensemble. From this comparison we
can say that the ensemble designed in this work gives a
better accuracy rate than the individual classifiers.
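To make the comparison between single classifiers and an ensemble concrete, the sketch below combines several k-NN classifiers, one per value of k, by simple majority vote. Majority voting is a stand-in chosen for brevity; the ensemble in this paper is evolved with genetic programming, which is not reproduced here. The function names, the toy training points, and the values of k are hypothetical.

```python
import math
from collections import Counter

# Hypothetical two-feature training set: three "normal" and three "attack" records.
TRAIN_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
TRAIN_Y = ["normal", "normal", "normal", "attack", "attack", "attack"]


def knn_predict(train_x, train_y, x, k):
    """Classify x by the majority label among its k nearest training points."""
    neighbours = sorted(
        (math.dist(row, x), label) for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def ensemble_predict(train_x, train_y, x, ks):
    """Majority vote over one k-NN classifier per value in ks."""
    votes = Counter(knn_predict(train_x, train_y, x, k) for k in ks)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for query in [(0.15, 0.1), (0.95, 1.0)]:
        print(query, ensemble_predict(TRAIN_X, TRAIN_Y, query, ks=[1, 3, 5]))
```

Using odd values of k and an odd number of member classifiers avoids tied votes in the two-class toy case; a multi-class setting would need an explicit tie-breaking rule.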
As shown in the figure below, our approach obtains an area
under the ROC curve of 0.99976 together with a low false
alarm rate at the point where all intrusion attacks are
correctly classified, which further supports the robust
performance of our approach. The experimental results in
this paper show the successful classification of
sophisticated intrusion attacks and normal network
traffic.
Fig 3: ROC curve for the GP-based classifier, showing an
area under the curve of 0.99976
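For reference, an area under the ROC curve can be computed from classifier scores as sketched below. The sketch assumes a binary attack-versus-normal labelling (1 for attack, 0 for normal) and a numeric score per record; the scores, labels, and function names are hypothetical, and this is not the procedure used to produce Fig 3.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Records with equal scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # hypothetical attack scores
    labels = [1, 1, 0, 1, 0, 0]                # 1 = attack, 0 = normal
    print(auc(roc_points(scores, labels)))
```

An area of 1.0 corresponds to scores that rank every attack above every normal record, so a value such as 0.99976 indicates that almost all attack and normal pairs are ranked correctly.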
VI. CONCLUSION
To address all these issues, we proposed a system, i.e. an
ensemble built using genetic programming, which performs
better than the other approaches considered. In this paper
we ensemble only k-NN classifiers; few studies have been
carried out on ensembling heterogeneous classifier types to
obtain good results. This paper has addressed many issues
that make it difficult to design effective classifiers. It
discussed GP in detail and how a classifier with better
performance can be developed. In short, by building the
ensemble of classifiers with genetic programming, much of
the need for human expertise has been removed and an
automatic system has been developed.
These results demonstrate that classifier
models trained on any four folds acquire sufficient
information to achieve high detection rates when the fifth
fold is used for testing. Algorithms tested in the
literature achieved a detection rate of only approximately
80% for the U2R and R2L attack categories. If the KDD
training and testing data subsets are merged and re-sampled,
with 99.89% of the records placed in the training data
subset and the remaining records in the testing data subset,
the detection rates for the U2R and R2L attack categories
rise to 99%. This clearly indicates that the original KDD
training and testing data subsets represent dissimilar
target hypotheses for the U2R and R2L attack categories.
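The five-fold protocol described above, training on any four folds and testing on the fifth, can be sketched as follows. The shuffling seed, the function names, and the sample count are hypothetical, and the sketch only produces index splits; it does not train a classifier.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle the sample indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def train_test_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices), holding out each fold exactly once."""
    folds = k_fold_indices(n_samples, n_folds, seed)
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [index
                 for fold_number, fold in enumerate(folds)
                 if fold_number != held_out
                 for index in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in train_test_splits(n_samples=23):
        print(len(train), "training records,", len(test), "test records")
```

Merging the original training and testing subsets before splitting, as the text describes, means every fold is drawn from the same pool of records, which is why rare categories such as U2R and R2L appear in both the training and the test portions.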
Intrusion detection based on statistical pattern recognition
approaches has attracted wide interest over the last decade
in response to the growing demand for reliable and
intelligent intrusion detection systems (IDS), which are
required to detect sophisticated and polymorphous intrusion
attacks. In this work we have presented a novel approach
that uses a genetic programming based ensemble to detect
intrusions. The experimental results demonstrate that the
GP-based ensemble classifier is effective in reducing false
alarms, so that widespread IDS deployments can be
implemented using our approach with both accuracy and
interpretability in mind. In future, feature selection can
be used not only to alleviate the curse of dimensionality
and minimize classification errors, but also to improve the
interpretability of ensemble-based classifiers. Our future
work will focus on reducing the features used by the
classifiers through feature selection. The work will also
continue to study the fitness function of the genetic
algorithm so as to manipulate more parameters of the fuzzy
inference module, possibly concentrating on the fuzzy rules
themselves.
VII. REFERENCES
[1]. Kabiri, Peyman, and Ali A. Ghorbani. "Research on
Intrusion Detection and Response: A Survey." IJ
Network Security 1, no. 2 (2005): 84-102.
[2]. Schnackenberg, Dan, Kelly Djahandari, and Dan
Sterne. "Infrastructure for intrusion detection and
response." In DARPA Information Survivability
Conference and Exposition, 2000. DISCEX'00.
Proceedings, vol. 2, pp. 3-11. IEEE, 2000.
[3]. Garcia-Teodoro, Pedro, J. Diaz-Verdejo, Gabriel
Maciá-Fernández, and Enrique Vázquez.
"Anomaly-based network intrusion detection:
Techniques, systems and challenges." Computers &
Security 28, no. 1 (2009): 18-28.
[4]. Wu, Handong, Stephen Schwab, and Robert Lom
Peckham. "Signature based network intrusion
detection system and method." U.S. Patent
7,424,744, issued September 9, 2008.
[5]. Kreibich, Christian, and Jon Crowcroft.
"Honeycomb: creating intrusion detection
signatures using honeypots." ACM SIGCOMM
Computer Communication Review 34, no. 1
(2004): 51-56.
[6]. Roesch, Martin. "Snort: Lightweight Intrusion
Detection for Networks." In LISA, vol. 99, no. 1,
pp. 229-238. 1999.
[7]. Rokach, Lior. "Ensemble-based classifiers."
Artificial Intelligence Review 33, no. 1-2 (2010): 1-
39.
[8]. Ruta, Dymitr, and Bogdan Gabrys. "An overview of
classifier fusion methods." Computing and
Information systems 7, no. 1 (2000): 1-10.
[9]. Polikar, Robi. "Ensemble based systems in decision
making." Circuits and Systems Magazine, IEEE 6,
no. 3 (2006): 21-45.
[10]. Džeroski, Saso, and Bernard Ženko. "Is combining
classifiers with stacking better than selecting the
best one?." Machine learning 54, no. 3 (2004): 255-
273.
[11]. Beyer, Kevin, Jonathan Goldstein, Raghu
Ramakrishnan, and Uri Shaft. "When is “nearest
neighbor” meaningful?." In Database Theory—
ICDT’99, pp. 217-235. Springer Berlin Heidelberg,
1999.
[12]. Liao, Yihua, and V. Rao Vemuri. "Use of k-nearest
neighbor classifier for intrusion detection."
Computers & Security 21, no. 5 (2002): 439-448.
[13]. Gower, John Clifford. "Properties of Euclidean and
non-Euclidean distance matrices." Linear Algebra
and its Applications 67 (1985): 81-97.
[14]. Hoque, Mohammad Sazzadul, Md Mukit, Md
Bikas, and Abu Naser. "An implementation of
intrusion detection system using genetic algorithm."
arXiv preprint arXiv:1204.1336 (2012).
[15]. Owais, Suhail, Vaclav Snasel, Pavel Kromer, and
Ajith Abraham. "Survey: using genetic algorithm
approach in intrusion detection systems
techniques." In Computer Information Systems and
Industrial Management Applications, 2008.
CISIM'08. 7th, pp. 300-307. IEEE, 2008.
[16]. Bankovic, Zorana, Dušan Stepanovic, Slobodan
Bojanic, and Octavio Nieto-Taladriz. "Improving
network security using genetic algorithm
approach." Computers & Electrical Engineering 33,
no. 5 (2007): 438-451.
[17]. LaRoche, Patrick, and A. Nur Zincir-Heywood.
"Genetic programming based wifi data link layer
attack detection." pp. 285-292. IEEE, 2006.
[18]. Folino, Gianluigi, Clara Pizzuti, and Giandomenico
Spezzano. "GP ensemble for distributed intrusion
detection systems." In Pattern Recognition and Data
Mining, pp. 54-62. Springer Berlin Heidelberg,
2005.
[19]. Lu, Wei, and Issa Traore. "Detecting new forms of
network intrusion using genetic programming."
Computational Intelligence 20, no. 3 (2004): 475-
494.
[20]. Li, Wei. "Using genetic algorithm for network
intrusion detection." Proceedings of the United
States Department of Energy Cyber Security Group
(2004): 1-8.
[21]. Mukkamala, Srinivas, Andrew H. Sung, and Ajith
Abraham. "Modeling intrusion detection systems
using linear genetic programming approach." In
Innovations in applied artificial intelligence, pp.
633-642. Springer Berlin Heidelberg, 2004.
[22]. Crosbie, Mark, and Gene Spafford. "Applying
genetic programming to intrusion detection." In
Working Notes for the AAAI Symposium on
Genetic Programming, pp. 1-8. Cambridge, MA:
MIT Press, 1995.

More Related Content

What's hot

Automatic Insider Threat Detection in E-mail System using N-gram Technique
Automatic Insider Threat Detection in E-mail System using N-gram TechniqueAutomatic Insider Threat Detection in E-mail System using N-gram Technique
Automatic Insider Threat Detection in E-mail System using N-gram TechniqueIRJET Journal
 
Measuring Information Security: Understanding And Selecting Appropriate Metrics
Measuring Information Security: Understanding And Selecting Appropriate MetricsMeasuring Information Security: Understanding And Selecting Appropriate Metrics
Measuring Information Security: Understanding And Selecting Appropriate MetricsCSCJournals
 
Data Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage DetectionData Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage DetectionIOSR Journals
 
Internet of things-based photovoltaics parameter monitoring system using Node...
Internet of things-based photovoltaics parameter monitoring system using Node...Internet of things-based photovoltaics parameter monitoring system using Node...
Internet of things-based photovoltaics parameter monitoring system using Node...IJECEIAES
 
Secured Scheduling Technique of Network Resource Management in Vehicular Comm...
Secured Scheduling Technique of Network Resource Management in Vehicular Comm...Secured Scheduling Technique of Network Resource Management in Vehicular Comm...
Secured Scheduling Technique of Network Resource Management in Vehicular Comm...Gagan Bansal
 
An overview of internet of things
An overview of internet of thingsAn overview of internet of things
An overview of internet of thingsTELKOMNIKA JOURNAL
 
Internet service providers responsibilities in botnet mitigation: a Nigerian ...
Internet service providers responsibilities in botnet mitigation: a Nigerian ...Internet service providers responsibilities in botnet mitigation: a Nigerian ...
Internet service providers responsibilities in botnet mitigation: a Nigerian ...IJECEIAES
 
IRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET Journal
 
A FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCES
A FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCESA FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCES
A FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCESijmpict
 
State regulation of the IoT in the Russian Federation: Fundamentals and chall...
State regulation of the IoT in the Russian Federation: Fundamentals and chall...State regulation of the IoT in the Russian Federation: Fundamentals and chall...
State regulation of the IoT in the Russian Federation: Fundamentals and chall...IJECEIAES
 
Machine Learning Algorithms Applied to System Security: A Systematic Review
Machine Learning Algorithms Applied to System Security: A Systematic ReviewMachine Learning Algorithms Applied to System Security: A Systematic Review
Machine Learning Algorithms Applied to System Security: A Systematic ReviewAssociate Professor in VSB Coimbatore
 
Internet of Things: Surveys for Measuring Human Activities from Everywhere
Internet of Things: Surveys for Measuring Human Activities from Everywhere Internet of Things: Surveys for Measuring Human Activities from Everywhere
Internet of Things: Surveys for Measuring Human Activities from Everywhere IJECEIAES
 
Artificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceArtificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceAbhishek Upadhyay
 
IRJET - Fake News Detection: A Survey
IRJET -  	  Fake News Detection: A SurveyIRJET -  	  Fake News Detection: A Survey
IRJET - Fake News Detection: A SurveyIRJET Journal
 
IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...
IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...
IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...IRJET Journal
 
1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...
1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...
1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...Alexander Decker
 

What's hot (18)

Automatic Insider Threat Detection in E-mail System using N-gram Technique
Automatic Insider Threat Detection in E-mail System using N-gram TechniqueAutomatic Insider Threat Detection in E-mail System using N-gram Technique
Automatic Insider Threat Detection in E-mail System using N-gram Technique
 
Measuring Information Security: Understanding And Selecting Appropriate Metrics
Measuring Information Security: Understanding And Selecting Appropriate MetricsMeasuring Information Security: Understanding And Selecting Appropriate Metrics
Measuring Information Security: Understanding And Selecting Appropriate Metrics
 
Data Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage DetectionData Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage Detection
 
DLD_SYNOPSIS
DLD_SYNOPSISDLD_SYNOPSIS
DLD_SYNOPSIS
 
Internet of things-based photovoltaics parameter monitoring system using Node...
Internet of things-based photovoltaics parameter monitoring system using Node...Internet of things-based photovoltaics parameter monitoring system using Node...
Internet of things-based photovoltaics parameter monitoring system using Node...
 
Secured Scheduling Technique of Network Resource Management in Vehicular Comm...
Secured Scheduling Technique of Network Resource Management in Vehicular Comm...Secured Scheduling Technique of Network Resource Management in Vehicular Comm...
Secured Scheduling Technique of Network Resource Management in Vehicular Comm...
 
An overview of internet of things
An overview of internet of thingsAn overview of internet of things
An overview of internet of things
 
Internet service providers responsibilities in botnet mitigation: a Nigerian ...
Internet service providers responsibilities in botnet mitigation: a Nigerian ...Internet service providers responsibilities in botnet mitigation: a Nigerian ...
Internet service providers responsibilities in botnet mitigation: a Nigerian ...
 
IRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation Forest
 
A FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCES
A FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCESA FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCES
A FRAMEWORK TO DEFENSE AGAINST INSIDER ATTACKS ON INFORMATION SOURCES
 
State regulation of the IoT in the Russian Federation: Fundamentals and chall...
State regulation of the IoT in the Russian Federation: Fundamentals and chall...State regulation of the IoT in the Russian Federation: Fundamentals and chall...
State regulation of the IoT in the Russian Federation: Fundamentals and chall...
 
A45010107
A45010107A45010107
A45010107
 
Machine Learning Algorithms Applied to System Security: A Systematic Review
Machine Learning Algorithms Applied to System Security: A Systematic ReviewMachine Learning Algorithms Applied to System Security: A Systematic Review
Machine Learning Algorithms Applied to System Security: A Systematic Review
 
Internet of Things: Surveys for Measuring Human Activities from Everywhere
Internet of Things: Surveys for Measuring Human Activities from Everywhere Internet of Things: Surveys for Measuring Human Activities from Everywhere
Internet of Things: Surveys for Measuring Human Activities from Everywhere
 
Artificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceArtificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of Intelligence
 
IRJET - Fake News Detection: A Survey
IRJET -  	  Fake News Detection: A SurveyIRJET -  	  Fake News Detection: A Survey
IRJET - Fake News Detection: A Survey
 
IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...
IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...
IRJET - Cross-Site Scripting on Banking Application and Mitigating Attack usi...
 
1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...
1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...
1.[1 9]a genetic algorithm based elucidation for improving intrusion detectio...
 

Viewers also liked

Een oriëntatie bouw volgen bij JES Opleidingen
Een oriëntatie bouw volgen bij JES Opleidingen Een oriëntatie bouw volgen bij JES Opleidingen
Een oriëntatie bouw volgen bij JES Opleidingen Duchka Walraet
 
CURRICULUM_VITAE (1) (1)
CURRICULUM_VITAE (1) (1)CURRICULUM_VITAE (1) (1)
CURRICULUM_VITAE (1) (1)ZUBAIR prince
 
Macaguifama Eventos Centenario del Colmado Quilez La Fuente
Macaguifama Eventos Centenario del Colmado Quilez La FuenteMacaguifama Eventos Centenario del Colmado Quilez La Fuente
Macaguifama Eventos Centenario del Colmado Quilez La FuenteSommelier Faustino Muñoz Soria
 
trabajo cambio climatico
trabajo cambio climaticotrabajo cambio climatico
trabajo cambio climaticolaclasedemiriam
 
Os sabores regionais da culinária francesa
Os sabores regionais da culinária francesaOs sabores regionais da culinária francesa
Os sabores regionais da culinária francesaEttoreTedeschi
 
Dinamica clasica de particulas y sistemas marion español
Dinamica clasica de particulas y sistemas  marion españolDinamica clasica de particulas y sistemas  marion español
Dinamica clasica de particulas y sistemas marion españoldieco de souza
 

Viewers also liked (8)

TsunamiEvacuationPresentation
TsunamiEvacuationPresentationTsunamiEvacuationPresentation
TsunamiEvacuationPresentation
 
Een oriëntatie bouw volgen bij JES Opleidingen
Een oriëntatie bouw volgen bij JES Opleidingen Een oriëntatie bouw volgen bij JES Opleidingen
Een oriëntatie bouw volgen bij JES Opleidingen
 
Mix-dBrochure-1
Mix-dBrochure-1Mix-dBrochure-1
Mix-dBrochure-1
 
CURRICULUM_VITAE (1) (1)
CURRICULUM_VITAE (1) (1)CURRICULUM_VITAE (1) (1)
CURRICULUM_VITAE (1) (1)
 
Macaguifama Eventos Centenario del Colmado Quilez La Fuente
Macaguifama Eventos Centenario del Colmado Quilez La FuenteMacaguifama Eventos Centenario del Colmado Quilez La Fuente
Macaguifama Eventos Centenario del Colmado Quilez La Fuente
 
trabajo cambio climatico
trabajo cambio climaticotrabajo cambio climatico
trabajo cambio climatico
 
Os sabores regionais da culinária francesa
Os sabores regionais da culinária francesaOs sabores regionais da culinária francesa
Os sabores regionais da culinária francesa
 
Dinamica clasica de particulas y sistemas marion español
Dinamica clasica de particulas y sistemas  marion españolDinamica clasica de particulas y sistemas  marion español
Dinamica clasica de particulas y sistemas marion español
 

Similar to rpaper

Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...Drjabez
 
A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...
A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...
A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...IJNSA Journal
 
Data Mining Techniques for Providing Network Security through Intrusion Detec...
Data Mining Techniques for Providing Network Security through Intrusion Detec...Data Mining Techniques for Providing Network Security through Intrusion Detec...
Data Mining Techniques for Providing Network Security through Intrusion Detec...IJAAS Team
 
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...ijceronline
 
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...ijceronline
 
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIERATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIERCSEIJJournal
 
Attack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest ClassifierAttack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest ClassifierCSEIJJournal
 
Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project finalCraig Cannon
 
IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...
IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...
IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...IRJET Journal
 
A survey of Network Intrusion Detection using soft computing Technique
A survey of Network Intrusion Detection using soft computing TechniqueA survey of Network Intrusion Detection using soft computing Technique
A survey of Network Intrusion Detection using soft computing Techniqueijsrd.com
 
Real Time Intrusion Detection System Using Computational Intelligence and Neu...
Real Time Intrusion Detection System Using Computational Intelligence and Neu...Real Time Intrusion Detection System Using Computational Intelligence and Neu...
Real Time Intrusion Detection System Using Computational Intelligence and Neu...ijtsrd
 
FORTIFICATION OF HYBRID INTRUSION DETECTION SYSTEM USING VARIANTS OF NEURAL ...
FORTIFICATION OF HYBRID INTRUSION  DETECTION SYSTEM USING VARIANTS OF NEURAL ...FORTIFICATION OF HYBRID INTRUSION  DETECTION SYSTEM USING VARIANTS OF NEURAL ...
FORTIFICATION OF HYBRID INTRUSION DETECTION SYSTEM USING VARIANTS OF NEURAL ...IJNSA Journal
 
An Approach of Automatic Data Mining Algorithm for Intrusion Detection and P...
An Approach of Automatic Data Mining Algorithm for Intrusion  Detection and P...An Approach of Automatic Data Mining Algorithm for Intrusion  Detection and P...
An Approach of Automatic Data Mining Algorithm for Intrusion Detection and P...IOSR Journals
 
Intrusion Detection System Using Machine Learning: An Overview
Intrusion Detection System Using Machine Learning: An OverviewIntrusion Detection System Using Machine Learning: An Overview
Intrusion Detection System Using Machine Learning: An OverviewIRJET Journal
 
Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Editor IJARCET
 
Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Editor IJARCET
 
Yolinda chiramba Survey Paper
Yolinda chiramba Survey PaperYolinda chiramba Survey Paper
Yolinda chiramba Survey PaperYolinda Chiramba
 
Implementation of Secured Network Based Intrusion Detection System Using SVM ...
Implementation of Secured Network Based Intrusion Detection System Using SVM ...Implementation of Secured Network Based Intrusion Detection System Using SVM ...
Implementation of Secured Network Based Intrusion Detection System Using SVM ...IRJET Journal
 

Similar to rpaper (20)

Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
 
A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...
A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...
A PROPOSED MODEL FOR DIMENSIONALITY REDUCTION TO IMPROVE THE CLASSIFICATION C...
 
Data Mining Techniques for Providing Network Security through Intrusion Detec...
Data Mining Techniques for Providing Network Security through Intrusion Detec...Data Mining Techniques for Providing Network Security through Intrusion Detec...
Data Mining Techniques for Providing Network Security through Intrusion Detec...
 
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
 
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
 
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIERATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
ATTACK DETECTION AVAILING FEATURE DISCRETION USING RANDOM FOREST CLASSIFIER
 
Attack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest ClassifierAttack Detection Availing Feature Discretion using Random Forest Classifier
Attack Detection Availing Feature Discretion using Random Forest Classifier
 
A45010107
A45010107A45010107
A45010107
 
Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project final
 
IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...
IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...IRJET- An Intrusion Detection Framework based on Binary Classifiers Optimized...
  • 1. Genetic Programming Ensemble to improve k-NN based network intrusion detection system. Authors: Imran Ahmed Malik, Student at Sharda University, Greater Noida (Email: imran409.im@gmail.com); Mrs. Amrita Prasad, Asst. Professor at Sharda University (Email: amrita.prasad@sharda.ac.in). Abstract—Since the earliest networked systems, security threats, usually known as intrusions, have been a critical issue for networks, data and information systems. The rapid growth of networks has made a detection system necessary to counter these threats. As systems have grown, attackers have become stronger and repeatedly compromise system security. The Intrusion Detection System (IDS) has therefore become an essential tool in network security. Detecting and preventing such attacks depends mainly on the capability and efficiency of the IDS. With networks scaling at a high pace, a lightweight IDS with a high detection rate is required. Many ensemble mechanisms have therefore been proposed using a range of methodologies, each with its own benefits and shortcomings. In this paper we propose an ensemble of k-NN classifiers built using genetic programming. The KDD CUP 1999 dataset is used for intrusion detection. It contains 4,900,000 records. Each record contains 41 features and is labeled as either normal or an attack. There are 22 types of attacks, grouped into four categories: DoS, Probe, R2L and U2R. Since the dataset is not accurate, we perform some preprocessing on it. The first preprocessing step removes redundancy; we then normalize the data to the range [0, 1] using statistical methods. After normalization we extract the features using PCA feature selection.
The PCA feature selection is applied to the dataset and 300 features are extracted using the Cartesian product. The dataset is then passed to the ensemble of classifiers, which classifies the input data into five categories: one normal category and four attack categories. The ensemble of classifiers using genetic programming gives an accuracy of 99.97%. Keywords—Intrusion detection; Anomaly detection; Misuse detection; KDD Cup 99; Ensemble approaches; Genetic programming; PCA. I. INTRODUCTION In the past two decades, with the rapid progress of Internet-based technology, new application areas for computer networks have emerged. These application areas have made the network an attractive target for abuse and a major vulnerability for the community. What is a fun job or a challenge to win for some people has become a nightmare for others, and in many cases malicious acts have made this nightmare a reality. In addition to hacking, new entities such as worms, Trojans and viruses have introduced more panic into the networked society. As the current situation is a relatively new phenomenon, network defenses are weak [1]. However, given the popularity of computer networks, their connectivity and our ever-growing dependency on them, realization of the threat can have devastating consequences. Securing such an important infrastructure has become a top-priority research area for many researchers. The aim of this paper is to work on Intrusion Detection Systems (IDS) and to analyze some current problems in this research area. Compared with mature and well-settled research areas, IDS is a young field. However, because of its mission-critical nature, it has attracted significant attention. The volume of research on this subject is constantly rising and every day more researchers are engaged in this field of work.
The threat of a new wave of cyber or network attacks is not just a possibility to be considered; it is an accepted fact that such attacks can occur at any time. Current IDS technology is far from a reliable protective system; the main idea instead is to make it possible to detect novel network attacks. There are two major approaches to detecting intrusions: signature-based and anomaly-based intrusion detection [3][4]. In the first approach, attack patterns or the behavior
  • 2. of the intruder is modeled (the attack signature is modeled). Here the system signals an intrusion once a match is detected. In the second approach, the normal behavior of the network is modeled, and the system raises an alarm once the behavior of the network does not match its normal behavior. A third Intrusion Detection (ID) approach is called specification-based intrusion detection. In this approach, the normal (expected) behavior of the host is specified and then modeled. As a direct price for the security, the host's freedom of operation is limited. In this paper, these approaches will be briefly discussed and compared. An intruder accessing the system without anyone noticing is the worst nightmare for any network security officer. Since current ID technology is not accurate enough to provide reliable detection, heuristic methodologies can be a way out. As a last line of defense, and to reduce the number of undetected intrusions, heuristic methods such as Honey Pots (HP) can be deployed [5][6]. HPs can be installed on any system and act as a trap or decoy for a resource. Another major problem in this research area is the speed of detection. Computer networks are dynamic in the sense that the information and data within them are continuously changing. Therefore, to detect an intrusion accurately and promptly, the system has to operate in real time. Operating in real time means not only performing detection in real time but also adapting to new dynamics in the network. Real-time IDS is an active research area pursued by many researchers. Most of this research aims to introduce the most time-efficient methodologies, so that the implemented methods are suitable for real-time use. The main emphasis of this paper is on the detection part of intrusion detection.
In this thesis we train the classifiers using the KDD CUP 1999 dataset. The original dataset contains around 4 million records covering 22 types of attacks. These attacks are categorized into four classes: DoS, Probe, R2L and U2R. The dataset contains a large number of redundant values. To correct this, we first remove the redundancy and then convert text values to numeric values. After value conversion we normalize the resulting dataset using statistical normalization, in the range [0, 1]. After normalization the dataset is sent for PCA feature extraction, which extracts 300 features using the Cartesian product. The resulting dataset is treated as correct and accurate and is used to train the classifiers. After the classifiers are trained, testing is carried out using the KDD test data. The classifiers are combined and the data is given to each classifier. The output prediction values are then combined. After combination, genetic programming is applied; based on the fitness function, it selects the optimized classifiers. The optimized classifiers are then ensembled using majority voting. The accuracy of the resulting ensemble is approximately 99.9%. II. BACKGROUND In this section we discuss the mechanisms used in this research: Genetic Programming, k-NN classifiers, the dataset description and the ensemble mechanism. Genetic Programming To produce each subsequent population, three GP operators, namely reproduction (replication), mutation and crossover, are used in the GP process. These operators help the search converge to an optimal solution. The optimal composite classifier is expected at the end of the GP process. GPs are heuristic search algorithms designed to simulate processes in natural systems.
GP belongs to the larger class of evolutionary algorithms, which generate solutions to optimization problems using techniques inspired by natural evolution such as inheritance, mutation, selection and crossover. These are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. The basic idea of these evolutionary algorithms is to simulate the processes in natural systems that are necessary for evolution. GPs are used for numerical and computational optimization and are based on exploring the evolutionary aspects of models of social systems. The GP approach is used to optimize a set of indices derived from complex network theory. Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They combine survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. The GP performs a balanced search over various nodes, and population diversity must be maintained so that important information is not lost, because there is a strong tendency to focus on the fit portions of the population. Reproduction in GP is defined as the process of producing offspring. GPs have also been used to supplement network-based approaches. The initial requirement of a GP is a set of solutions represented by chromosomes, called the population. Solutions taken from one population are used to form a new population, in the hope that the new population will be better than the old one. The best solutions are selected to form new offspring. These solutions are selected on the basis of their fitness, i.e. the
  • 3. most suitable solutions get more chances to reproduce. GPs are used for search, optimization and machine learning. They are a very popular optimization method, are often successful in real applications, and are of interest to those working on meta-heuristics. Evolutionary algorithms are used to solve problems that do not already have a well-defined efficient solution. Genetic algorithms have been used to solve optimization problems. Nearest neighbor Classifier Among the various methods of supervised statistical pattern recognition, the Nearest Neighbor rule achieves consistently high performance without a priori assumptions about the distributions from which the training examples are drawn. It involves a training set of both positive and negative cases. A new sample is classified by calculating the distance to the nearest training case; the sign of that point then determines the classification of the sample [11][12]. The k-NN classifier extends this idea by taking the k nearest points and assigning the sign of the majority. It is common to select k small and odd to break ties (typically 1, 3 or 5). Larger k values help reduce the effects of noisy points within the training data set, and k is often chosen through cross-validation. Nearest neighbor machine learning algorithms depend on the position of the instances in the input data. Newly encountered samples are classified based on the data already stored in the database: each new sample is classified according to the closest samples, as determined by the Euclidean distance [13]. The decision is determined by the k closest examples. To assign the correct class to a data sample, an optimal mapping function f(x) is used. For classification problems with only two classes, the data is assigned to one of two classes, C1 or C2.
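The k-NN rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `knn_predict` and the toy points are hypothetical, and the only ingredients taken from the text are the Euclidean distance, the k nearest stored samples and the majority vote between two classes C1 and C2.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-class data (hypothetical values, not taken from KDD Cup 99).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]

print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 0.95), k=3))
```

Choosing an odd k, as the text suggests, avoids ties when only two classes are present.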
The performance of the Nearest Neighbor classifier is assessed with the receiver operating characteristic (ROC) curve, a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or as recall in machine learning. The figure below shows the ROC curve for the GP-based NN classifier. KDD CUP 99 Data Set Description KDD'99 is the most widely used data set for evaluating intrusion detection methods. The data set was prepared from the data captured in the DARPA'98 IDS evaluation program. DARPA'98 contains about 4 gigabytes of compressed binary data from 7 weeks of network traffic, which is processed into about 5 million connection records, each with about 100 bytes of traffic data. The two weeks of test data contain about 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. The attacks fall into the following four groups: 1) DoS: the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or completely denies users access to a machine. 2) U2R: the attacker exploits a vulnerability in order to become the administrator of the victim computer. This is achieved through password sniffing, a dictionary attack, or social engineering.
3) R2L: occurs when an attacker who can send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine. 4) Probing attack: the attacker collects information about a network of machines, and the information is used to circumvent its security controls. The KDD'99 CUP dataset features can be grouped into three main categories: 1) Basic features: this group contains the attributes that can be extracted from a TCP/IP connection. Many of these features lead to a delay in detection. 2) Traffic features: this group contains the features computed over a window interval, and is divided into two categories: a) "same host" features: these examine the connections established in the past 2 seconds that have the same destination host as the current connection, and compute statistics on protocol, service, behavior, etc. b) "same service" features: these examine the connections established in the past 2 seconds that have the same service as the current connection. Both kinds of "traffic" features are time based. However, there are many slow probing attacks that scan the ports using a much longer interval than 2 seconds, e.g. one scan per minute. Such attacks may not produce intrusion patterns within a time window of two seconds. To solve this problem, the "same host" and "same service" features are also calculated over a connection window of 100 connections instead of a time window of two seconds. These are termed connection-based traffic features.
  • 4. 3) Content features: R2L and U2R attacks do not show the frequent sequential patterns seen in most Probing and DoS attacks. DoS and Probing attacks involve many connections to a small number of hosts in a very short span of time, whereas R2L and U2R attacks are mainly embedded in the data portions of the TCP/IP packets and usually involve only a single connection. To detect these kinds of attacks, some features are needed that look for suspicious behavior in the data portion, e.g. the number of failed login attempts. Data Preprocessing The KDD Cup 1999 dataset contains a large amount of redundant data, so to correct it we first remove the redundancy. We then convert the text data into numeric data. After conversion we normalize the resulting data using statistical normalization, which converts the mean to zero and the variance to one. The statistical normalization is defined as $X_i = \frac{v_i - \mu_i}{\sigma}$, where $\mu$ is the mean of the n values and $\sigma$ is the standard deviation. The standard deviation is well suited to large amounts of data because the dataset should follow a normal distribution. This normalization method is used to scale the values into the range [0, 1] rather than the range -1 to 1. After normalization the data is sent to PCA for feature extraction. Principal component analysis (PCA) Principal component analysis (PCA) is one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis, from neuroscience to computer graphics, because it is a simple, non-parametric method for eliminating redundant data from confusing data sets, which makes it very versatile.
PCA provides a roadmap for reducing a complex data set to a lower dimension to reveal the sometimes hidden, simplified dynamics that often underlie it, with minimal additional effort. As some network features are more likely to be involved in network intrusions, we used PCA to identify these features. PCA was applied to the training dataset to determine the features that appear most often in the mechanism of an attack. According to the results obtained, we selected three features out of the 41 used to describe each connection of the KDD99Cup dataset. The goal was to select the smallest possible number of features while maintaining a high intrusion detection rate, so that detection could be performed in real time. The 41 features of the KDD99Cup dataset and their explanations are given, as well as the selected features, up to a dimension of 300, obtained using the same PCA method described above. Classifier Ensemble Dietterich described a number of combination methods based on machine learning. Sharkey pointed out that the limiting factor in combining classifiers is a lack of awareness of the full range of available modular structures, because there is little agreement on how to describe and classify the various types of classifiers. A comprehensive categorization scheme for classifier ensembles is: 1. Voting classifier ensembles 2. Classifier ensembles by manipulating training samples 3. Homogeneous classifier ensembles 4. Recursive partition ensembles 5. Heterogeneous classifier ensembles The ensemble method used in this thesis is voting. Voting classifier ensembles The three main categories are: 1. Simple voting scheme: each individual classifier casts an equally weighted vote, and the input is assigned to the class that receives the majority of votes. 2.
Weighted voting scheme: each vote receives a weight proportional to the estimated generalization performance of the corresponding classifier. This scheme usually performs better than simple voting. 3. Weighted majority algorithm: similar to weighted voting; the difference lies in how the weights are generated. In this thesis the majority voting mechanism is used to ensemble the classifiers into a composite classifier. III. PROPOSED WORK Fig 1 shows the proposed model for intrusion detection using the KDD Cup 1999 dataset. First, k-NN classifiers are trained on KDD Cup 1999 data. After training, each classifier is tested using KDD Cup 1999 data as test data. The output predictions of the classifiers are then combined, and genetic programming is applied. The fitness function defined in genetic programming selects the optimized classifiers, which are then ensembled using majority voting. The fitness function is the sum of six variables and is defined as: F = records + numfolds + K_value + Time + model + accuracy. The proposed approach comprises two stages. In the first stage, the training phase, a set of rules for detecting intruders is generated offline using network audit data. In the second stage, the best rules, i.e. the rules with the highest fitness values, are used for intrusion detection in the real-time environment.
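The simple and weighted voting schemes described above can be sketched as follows. This is an illustrative sketch only: the function names `majority_vote` and `weighted_vote`, the toy predictions and the weights are assumptions made for the example and do not come from this work.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Simple voting: every classifier casts one equally weighted vote.

    predictions[c][i] is the label classifier c assigns to record i.
    """
    combined = []
    for record_votes in zip(*predictions):
        # most_common breaks ties in favour of the label seen first.
        combined.append(Counter(record_votes).most_common(1)[0][0])
    return combined


def weighted_vote(predictions, weights):
    """Weighted voting: classifier c's vote counts weights[c]."""
    combined = []
    for record_votes in zip(*predictions):
        tally = defaultdict(float)
        for label, weight in zip(record_votes, weights):
            tally[label] += weight
        combined.append(max(tally, key=tally.get))
    return combined


# Hypothetical outputs of three classifiers on three records.
preds = [
    ["normal", "dos", "probe"],
    ["normal", "dos", "dos"],
    ["r2l", "dos", "probe"],
]

print(majority_vote(preds))
print(weighted_vote(preds, [0.2, 0.2, 0.9]))
```

With equal votes the first record is labeled "normal"; giving the third classifier a much larger weight flips that record to "r2l", which illustrates how the two schemes can differ.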
  • 5. Fig 1: Diagram of proposed model The algorithm for generating new rules is as follows. The first step is the initialization of an initial population, in which every gene is given a random value. Next, the parameters of genetic programming (crossover and mutation rates, population size, termination condition for rule evolution) are specified and the network audit data is loaded. The working of the algorithm is shown in Figure 2. The initial population is then evolved for a number of generations. In each generation the quality of each rule, i.e. its fitness value, is computed according to the fitness function; a number of rules with the highest fitness values are then selected; and finally the genetic operators (crossover and mutation) are applied with a given probability. The output of the algorithm is a set of rules for intrusion detection. For the purpose of this work, two subsets of the KDD99Cup dataset are derived, one for training and one for testing. Each connection has a corresponding label that states whether it is a normal connection or a specific kind of attack.
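The evolutionary loop described above (random initialization, fitness evaluation, selection of the fittest, then crossover and mutation with given probabilities) can be sketched generically. The sketch below is a simplified genetic-algorithm-style search over bit strings, used here as a stand-in; it is not the tree-based GP or the fitness function of this work. The name `evolve`, the bit-string encoding and the toy fitness (number of ones) are hypothetical, and the default rates 0.6 and 0.01 are placeholders only.

```python
import random


def evolve(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.6, mutation_rate=0.01, seed=0):
    """Evolve bit strings towards higher `fitness`; returns (best, history)."""
    rng = random.Random(seed)
    # Initialization: every gene of every individual gets a random value.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Evaluate and rank the individuals by fitness.
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]   # selection of the fittest half
        children = [ranked[0][:]]           # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)   # one-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness), history


# Toy fitness: count of ones in the bit string (hypothetical stand-in).
best, history = evolve(sum, n_genes=12)
print(best, history[0], history[-1])
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `history` never decreases from one generation to the next.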
The training subset contained various attacks and normal connections. Most of the selected connections are normal, as is usually the case in real-world networks. Fig 2: Flow chart of Genetic Programming Genetic programming is used to inductively generate a GP classifier as a k-nearest-neighbor based ensemble for the task of data classification. Nearest Neighbor can in fact be interpreted as a composition of functions, where the function set is the set of attribute tests and the terminal set is the set of classes. The function set can be obtained by converting each attribute into an attribute-test function. At the beginning, the fitness of each individual (K) is evaluated. Then, at each generation, each tree undergoes one of the genetic operators (reproduction, crossover, mutation) depending on a probability test. If crossover is applied, the mate of the current individual is selected as the neighbor with the best fitness, and the offspring are generated. The current k-nearest-neighbor individual is then replaced by the best of the two offspring if the fitness of the latter is higher than that of the former. The fitness of each k-classifier is evaluated on the entire training data. After the number of generations specified by the user has been executed, the individual with the best fitness represents the classifier. After optimizing the classifiers we use the majority voting mechanism to combine them. Under this mechanism the output with the highest weighted vote is selected for classification. As a result, the shortcoming of one classifier is compensated by the classifiers with higher accuracy. IV. RELATED WORK M. S. Hoque et al., in "An implementation of intrusion detection system using genetic algorithm" [13]: There are various approaches being used in intrusion
  • 6. detection, but unfortunately none of the systems so far is completely flawless, so the quest for improvement continues. In this progression, the authors present an Intrusion Detection System (IDS) that applies a genetic algorithm (GA) to efficiently detect various types of network intrusions. Parameters and evolution processes for the GA are discussed in detail and implemented. To implement and measure the performance of the system, they used the standard KDD99 benchmark dataset and obtained a reasonable detection rate. To measure the fitness of a chromosome they used the standard deviation equation with distance. S. Owais et al., in "Survey: Using Genetic Algorithm Approach in Intrusion Detection Systems Techniques" [14]: Intrusion detection systems (IDS) offer techniques for modeling and recognizing normal and abusive system behavior. GAs can be successfully used to tune the membership functions used by the IDS. The paper surveys approaches based on IDS and the application of GAs to IDS. GA, as an evolutionary algorithm, was successfully used in different types of IDS. Using GA returned impressive results; the best fitness value was very close to the ideal fitness value. GA is a randomized search method often used for optimization problems. GA was able to generate a model with the desired characteristics of a high correct detection rate and a low false positive rate for IDS. It was used successfully in IDS to differentiate normal behavior from intrusive behavior, and clustering GAs are a promising method for detecting malicious intrusions into computer systems.
Z. Bankovic et al., in "Improving network security using genetic algorithm approach" [15]: The authors realized a misuse detection system based on a genetic algorithm (GA) approach. The KDD99Cup training and testing datasets were used for evolving and testing new rules for intrusion detection. To be able to process network data in real time, they used principal component analysis (PCA) to extract the most important features of the data. In that way they were able to keep a high detection rate for attacks while speeding up the processing of the data. An implementation of the proposed approach is presented. The genetic algorithm was used to obtain rules for intrusion detection, while principal component analysis was used to identify the most important features of network connections. P. LaRoche et al., in "Genetic Programming Based WiFi Data Link Layer Attack Detection" [16]: This paper presents a genetic programming based detection system for Data Link layer attacks on a WiFi network. The authors explore the use of two different fitness functions in order to achieve both a high detection rate and a low false positive rate. Results show that the detection system developed can achieve a detection rate above 90% and a false positive rate below 1%. Their future work is to explore the use of larger data sets for training and testing the L-GP based IDS, which will allow them to confirm the effectiveness of the work on larger networks as well as on a varied number and length of DoS attacks. They also plan to apply the same approach to other WiFi attacks, with the aim of developing an IDS that can detect a range of attacks.
G. Folino et al., in "GP Ensemble for Distributed Intrusion Detection Systems" [17]: In this paper an intrusion detection algorithm based on GP ensembles is proposed. The algorithm runs on a distributed hybrid multi-island model-based environment to monitor security-related activity within a network. Each island contains a cellular genetic program whose aim is to generate a decision-tree predictor, trained on the local data stored in the node. A distributed intrusion detection algorithm based on the ensemble paradigm is proposed, and the suitability of genetic programming as a component learner of the ensemble is investigated. Experimental results show the applicability of the approach to this kind of problem. Future research aims at extending the method by considering not batch data sets but data streams that change online on each node of the network. W. Lu et al., in "Detecting New Forms of Network Intrusion Using Genetic Programming" [18]: In this paper, a rule evolution approach based on Genetic Programming (GP) for detecting novel attacks on networks is presented, and four genetic operators, namely reproduction, mutation, crossover and dropping condition operators, are used to evolve new rules. New rules are used to detect novel or known network attacks. A training and testing dataset proposed by DARPA is used to evolve and evaluate these new rules. The proof-of-concept implementation shows that a rule generated by GP has a low false positive rate (FPR), a low false negative rate and a high rate of detecting unknown attacks. Moreover, the rule base composed of new rules has a high detection rate with a low FPR. An alternative to the DARPA evaluation approach is also investigated.
The proof-of-concept implementation shows that new rules generated by GP have the potential to detect novel forms of attacks. However, the detection result is not good for some runs because the selection of crossover and mutation points in the corresponding procedures is random. In addition, the choice of the probabilities of the genetic operators is experience based. In their implementation, the probabilities of mutation and crossover are 0.01 and 0.6, respectively. Wei Li, in "Using Genetic Algorithm for Network Intrusion Detection" [19]: This paper describes a technique for applying a Genetic Algorithm (GA) to network Intrusion Detection Systems (IDSs). A brief overview of the Intrusion Detection System, the genetic algorithm, and related detection techniques is presented. Parameters and the evolution process for the GA are discussed in detail. Unlike other implementations of the same problem, this
  • 7. implementation considers both temporal and spatial information of network connections in encoding the network connection information into rules in the IDS. Future work includes creating a standard test data set for the genetic algorithm proposed in that paper and applying it to a test environment. A detailed specification of the parameters to consider for the genetic algorithm should be determined during the experiments. Combining knowledge from different security sensors into a standard rule base is another promising area of that work. Srinivas Mukkamala et al., in "Modeling Intrusion Detection Systems Using Linear Genetic Programming Approach" [20]: This paper investigates the suitability of the linear genetic programming (LGP) technique for modeling efficient intrusion detection systems, comparing its performance with artificial neural networks and support vector machines. Because of the increasing incidence of cyber attacks, building effective intrusion detection systems (IDSs) is essential for protecting information systems security, and yet it remains an elusive goal and a great challenge. The authors also investigate key feature identification for building efficient and effective IDSs. They note, however, that the differences in accuracy figures tend to be very small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in size. More definitive conclusions can only be made after analyzing more comprehensive sets of network traffic data. Mark Crosbie et al., in "Applying Genetic Programming to Intrusion Detection" [21]: This paper presents a potential solution to the intrusion detection problem in computer security. It uses a combination of work in the fields of Artificial Life and computer security.
It shows how an intrusion detection system can be implemented using autonomous agents, and how these agents can be built using Genetic Programming. This prototype development work has raised many questions, chief among them how to make statements about the effectiveness of the intrusion detector. How can we be sure it will detect a specific intrusion? Can we compute a probability a priori of its effectiveness?

V. RESULTS AND ANALYSIS

The algorithm creates a k-NN classifier from the training data while trying to maximize the probability of detection and reduce the errors for each class in the training data. The model was created using the KDD data set. One decision tree was created using the KDD training data subset and the second one using the KDD testing data subset. The decision tree created using the KDD training data subset was tested on the KDD testing data subset and vice versa. After creating the k-NN models for the Probe, DoS, U2R and R2L attack categories, optimized rules were extracted using the Nearest Neighbor rules utility. The total number of KDD CUP 1999 instances used in this model is around one million because of limited computer memory. The confusion matrix for the KDD dataset is shown in the table below.
Table 1: Confusion matrix of KDD CUP 1999 Dataset

          Normal   Probe     DoS    R2L   U2R
Normal     43224      30      20      5     0
Probe         90   27135      10      0     0
DoS           10      25   27180      0     0
R2L           30       0      30   2178     0
U2R            0       0       0      0    33

The performance of the ensemble is determined by analyzing the following terms:

$$\text{Accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{positive} + \text{negative}}$$

$$\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$$

$$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$$

$$\text{Specificity} = \frac{\text{true negative}}{\text{true negative} + \text{false positive}}$$

The table below shows the accuracy, precision, recall and specificity of each class obtained in this thesis.

Table 2: Values of different parameters obtained

Class    Precision   Specificity   Recall   Accuracy
Normal   0.997       0.997         0.998    0.432
Probe    0.997       0.999         0.996    0.271
DoS      0.997       0.999         0.998    0.271
R2L      0.997       0.999         0.973    0.022
U2R      1.000       1.000         1.000    0.003

From the two tables above, the overall accuracy is approximately 99.8%. The accuracy of the individual k-NN classifiers is shown in the table below.

Table 3: Accuracy of a single k-NN classifier for different values of k

Value of k   Accuracy (%)
14           96.26225582
10           93.83818346
18           95.29526776
20           98.20442344
40           91.71358899
30           97.92837342

The table above shows that a single classifier always has lower accuracy than the ensemble. From this comparison we can say that the ensemble designed in this thesis gives a better accuracy rate than the individual classifiers.
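As a worked check of the definitions above, the following minimal sketch recomputes the per-class terms from the counts in Table 1. It assumes that the rows of Table 1 are the actual classes and the columns are the predicted classes; the paper does not state the orientation, but this reading reproduces the precision, specificity and recall figures of Table 2 when they are truncated to three decimals. The names CONFUSION, per_class_metrics and overall_accuracy are illustrative and are not taken from the paper.

```python
# Sketch: per-class metrics from the Table 1 confusion matrix.
# Assumption (not stated in the paper): rows = actual class,
# columns = predicted class.

CLASSES = ["Normal", "Probe", "DoS", "R2L", "U2R"]

CONFUSION = [
    [43224, 30, 20, 5, 0],     # actual Normal
    [90, 27135, 10, 0, 0],     # actual Probe
    [10, 25, 27180, 0, 0],     # actual DoS
    [30, 0, 30, 2178, 0],      # actual R2L
    [0, 0, 0, 0, 33],          # actual U2R
]


def per_class_metrics(matrix, index):
    """One-vs-rest metrics for the class at position `index`."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                  # rest of the row
    fp = sum(row[index] for row in matrix) - tp   # rest of the column
    tn = total - tp - fn - fp
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }


def overall_accuracy(matrix):
    """Fraction of all records that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


if __name__ == "__main__":
    for i, name in enumerate(CLASSES):
        metrics = per_class_metrics(CONFUSION, i)
        print(name, {key: round(value, 4) for key, value in metrics.items()})
    print("overall accuracy:", overall_accuracy(CONFUSION))
```

Under this reading the matrix holds 100,000 records, of which 99,750 lie on the diagonal, so the overall accuracy is 0.9975, in line with the figure of approximately 99.8% reported above. The accuracy column of Table 2 appears to report each class's correctly classified records as a fraction of all records rather than (TP + TN) / (P + N), so it is not expected to match the one-vs-rest accuracy computed here.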
As shown in the figure below, our approach obtains the largest area under the ROC curve (0.99976) as well as the lowest false alarm rate when all the intrusion attacks are correctly classified in the ROC graph, which further supports the robust performance of our approach. The extensive experimental results in this paper show the successful classification of sophisticated intrusion attacks and normal network traffic.

Fig 3: ROC curve for GP based classifier showing 0.99976 area under the curve

VI. CONCLUSION

To address all these issues we proposed a system, namely an ensemble built using genetic programming, which performs better than the others. In this paper we ensemble only NN classifiers. Few studies have been carried out on ensembling heterogeneous classifier types to obtain good results. This paper has addressed many issues that cause trouble in designing effective classifiers. The paper discussed GP in detail and how a classifier with better performance can be developed. In short, in an ensemble of classifiers built using genetic programming, much of the need for human expertise has been reduced and an automatic system has been developed. These results demonstrate that classifier models trained using any four folds acquire sufficient information to achieve high detection rates when the fifth fold is employed for testing. Algorithms tested in the literature were able to achieve a detection rate of only approximately 80% for the U2R and the R2L attack categories. If the KDD training and testing data subsets are merged and re-sampled with 99.89% of records in the training data subset and the remainder in the testing data subset, the detection rates for the U2R and R2L attack categories rise to 99%. This clearly indicates that the original KDD training and testing data subsets represent dissimilar target hypotheses for the U2R and the R2L attack categories.
Intrusion detection based on statistical pattern recognition approaches has attracted a wide range of interest over the last decade in response to the growing demand for reliable and intelligent intrusion detection systems (IDS), which are required to detect sophisticated and polymorphous intrusion attacks. In this work we have presented a novel intrusion detection approach that uses a genetic programming based ensemble for detecting intrusions. The experimental results demonstrate that the GP based ensemble classifier is effective at reducing false alarms, so that widespread IDS systems can be implemented using our approach, considering both accuracy and interpretability. In future, feature selection can be used not only to alleviate the curse of dimensionality and minimize classification errors, but also to improve the interpretability of ensemble-based classifiers. Our future work will focus on reducing the features for the classifiers by methods of feature selection. The work will also continue to study the fitness function of the genetic algorithm in order to manipulate more parameters of the fuzzy inference module, even concentrating on the fuzzy rules themselves.

VII. REFERENCES

[1]. Kabiri, Peyman, and Ali A. Ghorbani. "Research on Intrusion Detection and Response: A Survey." IJ Network Security 1, no. 2 (2005): 84-102.
[2]. Schnackenberg, Dan, Kelly Djahandari, and Dan Sterne. "Infrastructure for intrusion detection and response." In DARPA Information Survivability Conference and Exposition, 2000. DISCEX'00. Proceedings, vol. 2, pp. 3-11. IEEE, 2000.
[3]. Garcia-Teodoro, Pedro, J. Diaz-Verdejo, Gabriel Maciá-Fernández, and Enrique Vázquez. "Anomaly-based network intrusion detection: Techniques, systems and challenges." Computers & Security 28, no. 1 (2009): 18-28.
[4]. Wu, Handong, Stephen Schwab, and Robert Lom Peckham. "Signature based network intrusion detection system and method." U.S. Patent 7,424,744, issued September 9, 2008.
[5]. Kreibich, Christian, and Jon Crowcroft. "Honeycomb: creating intrusion detection signatures using honeypots." ACM SIGCOMM Computer Communication Review 34, no. 1 (2004): 51-56.
[6]. Roesch, Martin. "Snort: Lightweight Intrusion Detection for Networks." In LISA, vol. 99, no. 1, pp. 229-238. 1999.
[7]. Rokach, Lior. "Ensemble-based classifiers." Artificial Intelligence Review 33, no. 1-2 (2010): 1-39.
[8]. Ruta, Dymitr, and Bogdan Gabrys. "An overview of classifier fusion methods." Computing and Information Systems 7, no. 1 (2000): 1-10.
[9]. Polikar, Robi. "Ensemble based systems in decision making." Circuits and Systems Magazine, IEEE 6, no. 3 (2006): 21-45.
[10]. Džeroski, Saso, and Bernard Ženko. "Is combining classifiers with stacking better than selecting the best one?" Machine Learning 54, no. 3 (2004): 255-273.
[11]. Beyer, Kevin, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. "When is 'nearest neighbor' meaningful?" In Database Theory—ICDT'99, pp. 217-235. Springer Berlin Heidelberg, 1999.
[12]. Liao, Yihua, and V. Rao Vemuri. "Use of k-nearest neighbor classifier for intrusion detection." Computers & Security 21, no. 5 (2002): 439-448.
[13]. Gower, John Clifford. "Properties of Euclidean and non-Euclidean distance matrices." Linear Algebra and its Applications 67 (1985): 81-97.
[14]. Hoque, Mohammad Sazzadul, Md Mukit, Md Bikas, and Abu Naser. "An implementation of intrusion detection system using genetic algorithm." arXiv preprint arXiv:1204.1336 (2012).
[15]. Owais, Suhail, Vaclav Snasel, Pavel Kromer, and Ajith Abraham. "Survey: using genetic algorithm approach in intrusion detection systems techniques." In Computer Information Systems and Industrial Management Applications, 2008. CISIM'08. 7th, pp. 300-307. IEEE, 2008.
[16]. Bankovic, Zorana, Dušan Stepanovic, Slobodan Bojanic, and Octavio Nieto-Taladriz. "Improving network security using genetic algorithm approach." Computers & Electrical Engineering 33, no. 5 (2007): 438-451.
[17]. LaRoche, Patrick, and A. Nur Zincir-Heywood. "Genetic programming based wifi data link layer attack detection." pp. 285-292. IEEE, 2006.
[18]. Folino, Gianluigi, Clara Pizzuti, and Giandomenico Spezzano. "GP ensemble for distributed intrusion detection systems." In Pattern Recognition and Data Mining, pp. 54-62. Springer Berlin Heidelberg, 2005.
[19]. Lu, Wei, and Issa Traore. "Detecting new forms of network intrusion using genetic programming." Computational Intelligence 20, no. 3 (2004): 475-494.
[20]. Li, Wei. "Using genetic algorithm for network intrusion detection." Proceedings of the United States Department of Energy Cyber Security Group (2004): 1-8.
[21]. Mukkamala, Srinivas, Andrew H. Sung, and Ajith Abraham. "Modeling intrusion detection systems using linear genetic programming approach." In Innovations in Applied Artificial Intelligence, pp. 633-642. Springer Berlin Heidelberg, 2004.
[22]. Crosbie, Mark, and Gene Spafford. "Applying genetic programming to intrusion detection." In Working Notes for the AAAI Symposium on Genetic Programming, pp. 1-8. Cambridge, MA: MIT Press, 1995.