
Genetic Programming Ensemble to improve k-NN based
network intrusion detection system
Authors
Imran Ahmed Malik, Student at Sharda University, Greater Noida. Email: imran409.im@gmail.com
Mrs. Amrita Prasad, Asst. Professor at Sharda University. Email: amrita.prasad@sharda.ac.in
Abstract—Since the earliest networked systems, security threats, commonly known as intrusions, have been a critical issue for networks, data and information systems. As networks have grown drastically, attackers have grown stronger and repeatedly compromise system security, so the Intrusion Detection System (IDS) has become an essential tool in network security. Detecting and preventing such attacks depends mainly on the capability and efficiency of the IDS. With network scale increasing at a high pace, a lightweight IDS with a high detection rate is required. Many ensemble mechanisms have therefore been proposed using a variety of methodologies, each with its own benefits and shortcomings. In this paper we propose an ensemble of k-NN classifiers built using genetic programming. The KDD CUP 1999 dataset is used for intrusion detection. It contains 4,900,000 records; each record has 41 features and is labeled as either normal or an attack. There are 22 types of attacks, classified into four categories: DoS, Probe, R2L and U2R. Because the raw dataset is not clean, we preprocess it. The first step is removing redundancy, after which we normalize the data to the range [0, 1] using statistical methods. After normalization we extract features using PCA; 300 features are extracted using a Cartesian product. The resulting dataset is then passed to the ensemble of classifiers, which classifies the input data into five categories: one normal and four attack categories. The ensemble of classifiers using genetic programming achieves an accuracy of 99.97%.
Keywords—Intrusion detection; Anomaly detection; Misuse detection; KDD Cup 99; Ensemble approaches; Genetic programming; PCA
I. INTRODUCTION
Over the past two decades, rapid progress in Internet-based technology has opened new application areas for computer networks. These applications have made the network an attractive target for abuse and a major vulnerability for the community. What is a pastime or a challenge to win for some people has become a nightmare for others, and in many cases malicious acts have turned this nightmare into reality.
In addition to hacking, new entities such as worms, Trojans and viruses have introduced more panic into the networked society. Because the current situation is a relatively new phenomenon, network defenses are weak [1]. However, given the popularity of computer networks, their connectivity and our ever-growing dependency on them, realization of the threat can have devastating consequences. Securing such an important infrastructure has become a top-priority research area for many researchers.
The aim of this paper is to study Intrusion Detection Systems (IDS) and to analyze some current problems in this research area. Compared with mature, well-settled research areas, IDS is a young field. However, because of its mission-critical nature, it has attracted significant attention. Research on the subject is constantly growing, and more researchers join the field every day. A new wave of cyber or network attacks is not merely a possibility to be considered; it is an accepted fact that can occur at any time. Current IDSs are far from being reliable protective systems; instead, the main idea is to make it possible to detect novel network attacks.
There are two major approaches to detecting intrusions: signature-based and anomaly-based intrusion detection [3][4]. In the first approach, the attack patterns or the behavior of the intruder are modeled (the attack signature is modeled), and the system signals an intrusion once a match is detected. In the second approach, the normal behavior of the network is modeled, and the system raises an alarm once the behavior of the network no longer matches its normal behavior. A third Intrusion Detection (ID) approach is specification-based intrusion detection, in which the normal (expected) behavior of the host is specified and then modeled. As a direct price for this security, the host's freedom of operation is limited. In this paper, these approaches are briefly discussed and compared.
The idea of an intruder accessing the system without anyone even noticing is the worst nightmare for any network security officer. Since current ID technology is not accurate enough to provide reliable detection, heuristic methodologies can be a way out. As a last line of defense, and in order to reduce the number of undetected intrusions, heuristic methods such as Honey Pots (HP) can be deployed [5][6]. HPs can be installed on any system and act as a trap or decoy for a resource.
Another major problem in this research area is the speed of detection. Computer networks are dynamic in the sense that the information and data within them change continuously. Therefore, to detect an intrusion accurately and promptly, the system has to operate in real time. Operating in real time means not only performing detection in real time but also adapting to new dynamics in the network. Real-time IDS is an active research area pursued by many researchers, and most of this work aims to introduce time-efficient methodologies suitable for real-time implementation.
The main emphasis of this paper is on the detection part of intrusion detection. In this work we train the classifiers using the KDD CUP 1999 dataset. The original dataset has around 4 million records covering 22 types of attacks, which are categorized into four classes: DoS, Probe, R2L and U2R. The dataset contains a large number of redundant records. To clean it, we first remove the redundancy and then convert the values from text to numeric data. After value conversion we apply statistical normalization to the resulting dataset, in the range [0, 1]. The normalized dataset is then sent for PCA feature extraction, which extracts 300 features using a Cartesian product. The dataset obtained in this way is used to train the classifiers. After the classifiers have been trained, testing is carried out using the KDD test data. The classifiers are combined so that the data is given to each classifier, and their output predictions are combined. Genetic programming is then applied: based on the fitness function, it selects the optimized classifiers, which are ensembled using majority voting. The accuracy of the resulting ensemble is approximately 99.9%.
II. BACKGROUND
This section discusses the mechanisms used in this research: Genetic Programming, k-NN classifiers, the dataset description and the ensemble mechanism.
Genetic Programming
To produce each subsequent population, three GP operators, namely reproduction, mutation and crossover, are used in the GP process. These operators help the search converge to an optimal solution. The optimal composite classifier is expected at the end of the GP process.
GPs are heuristic search algorithms designed to simulate processes in natural systems. GP belongs to the larger class of evolutionary algorithms, which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection and crossover. They are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics; the basic idea is to simulate the processes in natural systems that are necessary for evolution. GPs are used for numerical and computational optimization and for exploring the evolutionary aspects of models of social systems. Here, the GP approach is used to optimize a set of indices derived from complex network theory. Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They combine survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. The GP performs a balanced search over the various nodes: population diversity must be maintained so that important information is not lost, while at the same time the search must focus on the fit portions of the population. Reproduction in GP is defined as the process of producing offspring. GPs have also been used to complement network-based approaches.
The initial requirement of a GP is a set of solutions, represented by chromosomes, called the population. Solutions taken from one population are used to form a new population, in the hope that the new population will be better than the old one. The best solutions are selected to form new offspring. They are selected on the basis of their fitness: the more suitable they are, the more chances they have to reproduce. GPs are used for search, optimization and machine learning. They are a very popular optimization technique, are often successful in real applications, and are of interest to those working on meta-heuristics. Evolutionary algorithms are used to solve problems that do not already have a well-defined efficient solution.
Nearest neighbor Classifier
Among the various methods of supervised statistical pattern
recognition, the Nearest Neighbor rule achieves consistently
high performance, without a priori assumptions about the
distributions from which the training examples are drawn. It
involves a training set of both positive and negative cases. A
new sample is classified by calculating the distance to the
nearest training case; the sign of that point then determines
the classification of the sample [11][12]. The k-NN
classifier extends this idea by taking the k nearest points and
assigning the sign of the majority. It is common to select k
small and odd to break ties (typically 1, 3 or 5). Larger k
values help reduce the effects of noisy points within the
training data set, and the choice of k is often performed
through cross-validation.
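The k-NN rule described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and plain majority voting; the function name `knn_predict` and the toy two-dimensional points are hypothetical and are not taken from the KDD data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy two-class data (illustrative values, not KDD records).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 1.0), k=3))
```

Choosing an odd k, as the text suggests, avoids ties in the two-class case.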
Nearest neighbor algorithms depend on the positions of the instances in the input data. Newly encountered samples are classified based on the data already stored in the database: a new sample is assigned according to the closest stored samples, as measured by the Euclidean distance [13]. The decision is determined by the k closest examples. To assign the correct class to a data sample, an optimal mapping function f(x) is used. For classification problems with only two classes, each sample is assigned to either C1 or C2. The output of the Nearest Neighbor classifier is evaluated using the receiver operating characteristic (ROC) curve, a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity in signal detection and biomedical informatics, or recall in machine learning. The figure below shows the ROC curve for the GP-based NN classifier.
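As a small illustration of how the points of an ROC curve are obtained, the sketch below computes one (FPR, TPR) pair per threshold. It assumes each sample has a numeric score (for a k-NN classifier this could be the fraction of the k neighbors labeled as attack, which is an assumption of this example) and a binary label; the scores shown are made up.

```python
def roc_points(scores, labels, thresholds):
    """Return (FPR, TPR) pairs, one per threshold; labels are 1 (positive) or 0."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Illustrative scores and ground-truth labels (1 = attack, 0 = normal).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels, [1.0, 0.75, 0.5, 0.0]):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold moves the operating point from (0, 0) toward (1, 1); a better classifier stays closer to the top-left corner along the way.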
KDD CUP 99 Data Set Descriptions
KDD'99 is the most widely used data set for evaluating intrusion detection methods. The data set was built from the data captured in the DARPA'98 IDS evaluation program. DARPA'98 consists of about 4 gigabytes of compressed binary data from 7 weeks of network traffic, which is processed into about 5 million connection records of about 100 bytes each. A further two weeks of test data contain around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. The attacks fall into one of the following four categories:
1) DoS: the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
2) U2R: the attacker exploits a vulnerability in order to gain administrator (root) access to the victim machine, typically starting from a normal user account obtained by password sniffing, a dictionary attack or social engineering.
3) R2L: occurs when an attacker who is able to send packets to a machine over a network, but who does not have an account on that machine, exploits some vulnerability to gain local access as a user of that machine.
4) Probing attack: the attacker gathers information about a network of computers, which is then used to circumvent its security controls.
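The grouping of the 22 attack types into four categories, plus the normal class, can be written as a lookup table. The sketch below uses the attack names that appear as labels in the public KDD Cup 1999 training file and their commonly used category assignment; the helper name `to_category` is hypothetical, and the handling of the trailing period reflects how labels appear in the raw file.

```python
# Commonly used assignment of the 22 KDD'99 training attack labels to categories.
ATTACK_CATEGORY = {
    # Denial of service
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    # Probing
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # Remote to local
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # User to root
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}

def to_category(raw_label):
    """Map a raw record label such as 'smurf.' to one of the five classes."""
    name = raw_label.strip().rstrip(".")
    if name == "normal":
        return "normal"
    return ATTACK_CATEGORY[name]

print(to_category("smurf."), to_category("normal."), to_category("rootkit."))
```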
The features of the KDD'99 CUP dataset can be grouped into three main categories:
1) Basic features: this group contains the attributes that can be extracted from a TCP/IP connection. Many of these features lead to a delay in detection.
2) Traffic features: this group contains the features computed over a window interval. It is divided into two categories:
a) "same host" features: examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.
b) "same service" features: examine only the connections in the past 2 seconds that have the same service as the current connection.
These two kinds of "traffic" features are time-based. However, there are many slow probing attacks that scan the ports using a much larger time interval than 2 seconds, e.g. one scan per minute. Such attacks do not produce intrusion patterns within a time window of 2 seconds. To solve this problem, the "same host" and "same service" features are recalculated over a connection window of 100 connections instead of a time window of 2 seconds. These are called connection-based traffic features.

3) Content features: unlike most DoS and Probing attacks, R2L and U2R attacks do not have frequent sequential patterns. DoS and Probing attacks involve many connections to some host(s) in a very short span of time, whereas R2L and U2R attacks are embedded in the data portions of TCP/IP packets and normally involve only a single connection. To detect these kinds of attacks, some features are needed that look for suspicious behavior in the data portion, e.g. the number of failed login attempts.
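The difference between time-based and connection-based traffic features can be illustrated with a simplified "same host" count. The sketch assumes connections are given as (timestamp in seconds, destination host) pairs sorted by time, and that only earlier connections are counted; both are assumptions of this example, and the function name `same_host_counts` is hypothetical.

```python
def same_host_counts(connections, window_seconds=2.0, window_size=100):
    """For each connection, count earlier connections to the same destination
    host within a time window and within a connection-count window."""
    results = []
    for i, (t, dst) in enumerate(connections):
        # Time-based: earlier connections within the past `window_seconds`.
        time_count = sum(
            1 for pt, pd in connections[:i]
            if pd == dst and t - pt <= window_seconds
        )
        # Connection-based: the last `window_size` earlier connections.
        recent = connections[max(0, i - window_size):i]
        conn_count = sum(1 for _, pd in recent if pd == dst)
        results.append((time_count, conn_count))
    return results

# A slow scan: one connection to host "A" per minute.
slow_scan = [(60.0 * i, "A") for i in range(5)]
print(same_host_counts(slow_scan))
```

For the slow scan the time-based count stays at zero, while the connection-based count keeps growing, which is why the 100-connection window exposes attacks that the 2-second window misses.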
Data Preprocessing
The KDD Cup 1999 dataset contains a large amount of redundant data, so we first remove the redundancy from the dataset. We then convert the text data into numeric data and normalize the result using statistical normalization, which transforms each feature to zero mean and unit variance. Statistical normalization is defined as:
$$X_i = \frac{v_i - \mu}{\sigma}$$
where µ is the mean of the n values of the feature and σ is their standard deviation. Statistical normalization is well suited to large amounts of data, on the assumption that the data follow a normal distribution. On its own, this normalization does not confine values to a fixed interval such as -1 to 1; in this work the normalized values are brought into the range [0, 1]. After normalization the data is sent to PCA for feature extraction.
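The normalization step can be sketched as follows. Statistical (z-score) normalization gives zero mean and unit variance but no fixed range, so the example adds a min-max rescale to reach [0, 1]. The min-max step and the use of the population standard deviation are assumptions of this sketch; the text does not specify how the [0, 1] range is obtained.

```python
import statistics

def z_score(values):
    """Statistical normalization: subtract the mean, divide by the std deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# One illustrative feature column (made-up numbers).
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_score(column)
scaled = min_max(z)
print(z)
print(scaled)
```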
Principal Component Analysis (PCA)
Principal component analysis (PCA) is one of the most valuable results from applied linear algebra. It is used abundantly in all forms of analysis, from neuroscience to computer graphics, because it is a simple, non-parametric method for eliminating redundant data from confusing data sets, which makes it very versatile. With minimal additional effort, PCA provides a roadmap for reducing a complex data set to a lower dimension to reveal the sometimes hidden, simplified dynamics that often underlie it. Because some network features are more likely to be involved in network intrusions, we used PCA to identify these features. PCA was applied to the training dataset in order to determine the features that occur most often in the mechanism of an attack.
Based on the results obtained, three features were selected out of the 41 used to describe each connection in the KDD99Cup dataset. The goal was to select the smallest possible number of features while maintaining a high intrusion detection rate, so that detection can be performed in real time. The 41 features of the KDD99Cup dataset and their explanations are given together with the selected features, extended up to a dimension of 300 using the same PCA method described above.
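To make the dimensionality-reduction idea concrete, the sketch below estimates the first principal component of a small data set and projects the data onto it. It uses power iteration on the covariance matrix purely to stay dependency-free, which is a choice made for this example and not the method stated in the text. The Cartesian-product step that yields 300 features is not specified in the text and is not attempted here.

```python
import math

def first_principal_component(rows, iterations=200):
    """Estimate the leading principal component of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec

def project(rows, means, component):
    """Coordinate of each row along the given component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]

# Toy data lying close to the line y = 2x (illustrative values).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, pc1 = first_principal_component(data)
scores = project(data, means, pc1)
print(pc1)
print(scores)
```

Here two correlated features collapse into a single coordinate per record, which is the kind of redundancy removal the text attributes to PCA.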
Classifier Ensemble
Dietterich described many combination methods based on machine learning. Sharkey pointed out that a limiting factor in combining classifiers is the lack of awareness of the full range of available modular structures, owing to little agreement on how to describe and classify the various classes of classifiers. A comprehensive categorization scheme for classifier ensembles is as follows:
1. Voting classifier ensembles
2. Classifier ensembles by manipulating training samples
3. Homogeneous classifier ensembles
4. Recursive partition ensembles
5. Heterogeneous classifier ensembles
The ensemble method used in this work is voting.
Voting classifier ensembles
The three main categories are:
1. Simple voting scheme: each individual classifier casts an equally weighted vote, and the input is assigned to the class that receives the majority of votes.
2. Weighted voting scheme: each vote receives a weight proportional to the estimated generalization performance of the corresponding classifier. This scheme generally performs better than simple voting.
3. Weighted majority algorithm: similar to weighted voting, but differs in how the weights are generated.
In this work the majority voting mechanism is used to ensemble the classifiers into a composite classifier.
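The first two voting schemes can be contrasted with a short sketch. The function names and the example weights are hypothetical; in practice the weights would be estimates of each classifier's generalization performance, for example validation accuracy.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Simple voting: every classifier's prediction counts equally."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted voting: each prediction counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Predictions of three classifiers for one record, with illustrative weights.
preds = ["DoS", "normal", "DoS"]
weights = [0.6, 0.99, 0.3]
print(majority_vote(preds))
print(weighted_vote(preds, weights))
```

With these numbers the two schemes disagree: two weak classifiers outvote one strong classifier under simple voting, but not under weighted voting.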
III. PROPOSED WORK
Fig 1 shows the proposed model for intrusion detection using the KDD Cup 1999 dataset. First, the k-NN classifiers are trained on KDD Cup 1999 data. After training, each classifier is tested by applying KDD Cup 1999 data as test data, and the outputs of the classifiers are combined. Genetic programming is then applied to the combined output predictions: its fitness function selects the optimized classifiers, which are then ensembled using majority voting. The fitness function is the sum of six variables and is defined as:
F = records + numfolds + K_value + Time + model + accuracy
The proposed approach comprises two stages. In the first, the training stage, a set of rules for detecting intruders is generated offline using network audit data. In the second stage, the best rules, i.e. the rules with the highest fitness values, are used for intrusion detection in the real-time environment.

Fig 1: Diagram of proposed model
The algorithm for generating new rules is as follows. The first step is the initialization of an initial population, in which each gene is given a random value. Next, the parameters of genetic programming (crossover and mutation rates, population size, termination condition for rule evolution) are set and the network audit data is loaded. The working of the algorithm is shown in Fig 2.
The initial population is then evolved for a number of generations. In each generation, the quality of each rule, i.e. its fitness value, is computed according to the fitness function; next, a number of rules with the highest fitness values are selected; finally, the genetic operators (crossover and mutation) are applied with a given probability. The output of the algorithm is a set of rules for intrusion detection. For the purposes of this work, two subsets of the KDD99Cup dataset are derived, one for training and one for testing. Each connection carries a label that states whether it is a normal connection or a specific kind of attack. The subset used for training contains various attacks and normal connections. Most of the selected connections are normal, as is usually the case in real-world networks.
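The evolutionary loop described above (random initialization, fitness evaluation, selection of the fittest, then crossover and mutation with given probabilities) can be sketched as follows. This is a simplified genetic-algorithm-style loop over fixed-length lists of integer genes, not a full tree-based GP. The fitness function is a toy stand-in, and all parameter values, including the elitism count, are illustrative assumptions and do not come from the text.

```python
import random

def evolve(fitness, gene_count=6, gene_range=(1, 15), population_size=20,
           generations=30, crossover_rate=0.6, mutation_rate=0.01,
           elite=2, seed=0):
    rng = random.Random(seed)
    low, high = gene_range
    # Initialization: every gene is given a random value.
    population = [[rng.randint(low, high) for _ in range(gene_count)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Evaluate and rank individuals by fitness (higher is better).
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Selection: the fittest half become parents; the elite survive as-is.
        parents = ranked[:population_size // 2]
        next_population = [ind[:] for ind in ranked[:elite]]
        while len(next_population) < population_size:
            a, b = rng.sample(parents, 2)
            child = a[:]
            if rng.random() < crossover_rate:
                # One-point crossover between the two parents.
                cut = rng.randint(1, gene_count - 1)
                child = a[:cut] + b[cut:]
            # Mutation: each gene is resampled with a small probability.
            child = [rng.randint(low, high) if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), history

def toy_fitness(individual):
    """Toy objective: genes close to 5 score higher (stand-in for real fitness)."""
    return -sum((g - 5) ** 2 for g in individual)

best, history = evolve(toy_fitness)
print(best, history[0], history[-1])
```

Because the elite individuals are copied unchanged, the best fitness recorded in `history` never decreases from one generation to the next.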
Fig 2: Flow chart of Genetic Programming
Genetic programming is used to inductively generate a GP classifier as a K-Nearest Neighbor based ensemble for the task of data classification. Nearest Neighbor can, in fact, be interpreted as a composition of functions in which the function set is the set of attribute tests and the terminal set is the set of classes. The function set can be obtained by converting each attribute into an attribute-test function. At the beginning, the fitness of each individual (K) is evaluated. Then, at each generation, every tree undergoes one of the genetic operators (reproduction, crossover, mutation), depending on a probability test. If crossover is applied, the mate of the current individual is selected as the neighbor with the best fitness, and the offspring are generated. The current K-nearest-neighbor individual is then replaced by the best of the two offspring if the fitness of the latter is better than that of the former. The fitness of each k-NN classifier is evaluated on the whole training data. After the user-defined number of generations has been executed, the individual with the best fitness represents the classifier. After optimizing the classifiers, we combine them using the majority voting mechanism. Under this mechanism, the prediction with the highest vote is selected as the classification, so that the shortcoming of one classifier is compensated by those with higher accuracy.
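The final combination step can be sketched as an ensemble of k-NN classifiers that differ only in k and are combined by majority voting. The k values below are placeholders for whatever the optimization selects, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_label(train, query, k):
    """Label of `query` by majority vote among its k nearest (point, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def ensemble_predict(train, query, k_values):
    """Majority vote over k-NN classifiers that differ only in k."""
    votes = [knn_label(train, query, k) for k in k_values]
    return Counter(votes).most_common(1)[0][0]

# Toy training set (illustrative points, not KDD records).
train = [((0.0, 0.0), "normal"), ((0.1, 0.1), "normal"), ((0.2, 0.0), "normal"),
         ((1.0, 1.0), "DoS"), ((1.1, 0.9), "DoS"), ((0.9, 1.1), "DoS")]
selected_k = [1, 3, 5]  # assumed output of the optimization step
print(ensemble_predict(train, (0.05, 0.05), selected_k))
print(ensemble_predict(train, (1.0, 0.95), selected_k))
```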
IV. RELATED WORK
MS Hoque et al., in “An Implementation of Intrusion Detection System Using Genetic Algorithm” [13], note that various approaches are used in intrusion detection but that, unfortunately, none of the systems so far is completely flawless, so the quest for improvement continues. The authors present an Intrusion Detection System (IDS) that applies a genetic algorithm (GA) to efficiently detect various types of network intrusions. Parameters and evolution processes for the GA are discussed in detail and implemented. To implement the system and measure its performance, they used the standard KDD99 benchmark dataset and obtained a reasonable detection rate. To measure the fitness of a chromosome they used the standard deviation equation with distance.
S Owais et al., in
“Survey: Using Genetic Algorithm Approach in Intrusion Detection Systems Techniques” [14], observe that intrusion detection systems (IDS) offer techniques for modeling and recognizing normal and abusive system behavior, and that GAs can be successfully used to tune the membership functions used by the IDS. The paper surveys approaches to IDS and the application of genetic algorithms (GAs) to IDS. GAs, as evolutionary algorithms, have been successfully used in different types of IDS. Using a GA returned impressive results, with the best fitness value very close to the ideal fitness value. GA is a randomized search method often used for optimization problems. It was able to generate a model with the desired characteristics of a high correct detection rate and a low false positive rate for IDS, and it has been used successfully in IDS to differentiate normal behavior from intrusive behavior. Clustering GAs are a promising method for detecting malicious intrusions into computer systems.
Z Bankovic et al., in
“Improving Network Security Using Genetic Algorithm Approach” [15], implement a misuse detection system based on a genetic algorithm (GA) approach. The KDD99Cup training and testing datasets were used for evolving and testing new rules for intrusion detection. To be able to process network data in real time, the authors used principal component analysis (PCA) to extract the most important features of the data. In this way they were able to maintain a high level of attack detection rates while speeding up the processing of the data. An implementation of the proposed approach is presented. The genetic algorithm was used to derive association rules for intrusion detection, while principal component analysis was used to identify the most important features of network connections.
P LaRoche et al., in “Genetic Programming Based WiFi Data Link Layer Attack Detection”
[16], present a genetic programming based detection system for Data Link layer attacks on a WiFi network. They explore the use of two different fitness functions in order to achieve both a high detection rate and a low false positive rate. The results show that the detection system developed can achieve a detection rate above 90% and a false positive rate below 1%. Their planned future work is to explore the use of larger data sets for training and testing the L-GP based IDS, which would confirm the effectiveness of the work on larger networks as well as over a varied number and duration of DoS attacks. They also plan to apply the same approach to other WiFi attacks, with the aim of developing an IDS that can detect a range of attacks.
G Folino et al., in “GP Ensemble for Distributed
Intrusion Detection Systems” [17], propose an intrusion detection algorithm based on GP ensembles. The algorithm runs on a distributed hybrid multi-island model-based environment to monitor security-related activity within a network. Each island contains a cellular genetic program whose aim is to generate a decision-tree predictor, trained on the local data stored in the node. The suitability of genetic programming as a component learner of the ensemble is investigated, and experimental results show the applicability of the approach to this kind of problem. Future research aims at extending the method to consider not batch data sets but data streams that change online on each node of the network.
W Lu et al., in “Detecting New Forms of Network Intrusion Using Genetic
Programming” [18], present a rule evolution approach based on Genetic Programming (GP) for detecting novel attacks on networks. Four genetic operators, namely reproduction, mutation, crossover and dropping condition operators, are used to evolve new rules, which are then used to detect novel or known network attacks. A training and testing dataset proposed by DARPA is used to evolve and evaluate these new rules. The proof-of-concept implementation shows that a rule generated by GP has a low false positive rate (FPR), a low false negative rate and a high rate of detecting unknown attacks. Moreover, the rule base composed of new rules has a high detection rate with a low FPR. An alternative to the DARPA evaluation approach is also investigated. The results show that new rules generated by GP have the potential to detect novel forms of attack. However, the detection result is not good for some runs because the selection of crossover and mutation points in the corresponding procedures is random. In addition, the choice of the probabilities of the genetic operators is based on experience. In their implementation, the probabilities of mutation and crossover are 0.01 and 0.6, respectively.
Wei Li et al., in “Using Genetic Algorithm for
Network Intrusion Detection” [19], describe a technique for applying a Genetic Algorithm (GA) to network Intrusion Detection Systems (IDSs). A brief overview of the Intrusion Detection System, the genetic algorithm and related detection techniques is presented. Parameters and the evolution process for the GA are discussed in detail. Unlike other implementations of the same problem, this implementation considers both temporal and spatial information of network connections when encoding the network connection information into rules in the IDS. Future work includes creating a standard test data set for the genetic algorithm proposed in the paper and applying it to a test environment. A detailed specification of the parameters to consider for the genetic algorithm should be determined during the experiments. Combining knowledge from different security sensors into a standard rule base is another promising line of work.
Srinivas Mukkamala et al., in “Modeling Intrusion Detection Systems Using Linear Genetic Programming
Approach” [20], investigate the suitability of the linear genetic programming (LGP) technique for modeling efficient intrusion detection systems, comparing its performance with artificial neural networks and support vector machines. Owing to the increasing incidence of cyber attacks, building effective intrusion detection systems (IDSs) is essential for protecting information systems security, yet it remains an elusive goal and a great challenge. The authors also investigate key feature identification for building efficient and effective IDSs. They note, however, that the differences in accuracy figures tend to be very small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in their sizes. More definitive conclusions can only be made after analyzing more comprehensive sets of network traffic data.
Mark Crosbie et al., in “Applying Genetic
Programming to Intrusion Detection” [22]: This paper
presents a potential solution to the intrusion detection
problem in computer security. It uses a combination of work
in the fields of Artificial Life and computer security. It
shows how an intrusion detection system can be implemented
using autonomous agents, and how these agents can be built
using Genetic Programming. This prototype development work
has raised many questions, chief among which is how to make
statements concerning the effectiveness of the intrusion
detector. How can one be sure it will detect a specific
intrusion? Can a probability of its effectiveness be
computed a priori?
V. RESULTS AND ANALYSIS
The algorithm creates a k-NN classifier from training data
while trying to maximize the probability of detection and
reduce errors for each class in the training data. The
models were created using the KDD data set: one model was
built from the KDD training data subset and a second from
the KDD testing data subset. The model built from the
training subset was tested on the testing subset and vice
versa. After creating the k-NN models for the Probe, DoS,
U2R and R2L attack categories, optimized rules were
extracted using the Nearest Neighbor rules utility.
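The two-way evaluation described above (train on one subset,
test on the other, then swap) can be sketched as follows.
This is a minimal illustration rather than the authors'
implementation: the function names knn_predict and accuracy,
the two-feature toy records and the choice k = 3 are all
assumptions made for the example, and plain Euclidean
distance with an unweighted majority vote is used.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    # train is a list of (feature_vector, label) pairs.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test records whose predicted label equals the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Hypothetical two-feature records standing in for preprocessed KDD rows.
subset_a = [((0.10, 0.20), "normal"), ((0.15, 0.25), "normal"),
            ((0.20, 0.10), "normal"), ((0.80, 0.90), "dos"),
            ((0.85, 0.80), "dos"), ((0.90, 0.85), "dos")]
subset_b = [((0.12, 0.22), "normal"), ((0.18, 0.15), "normal"),
            ((0.82, 0.88), "dos"), ((0.88, 0.82), "dos")]

k = 3  # assumed value, chosen only for the toy data
acc_a_to_b = accuracy(subset_a, subset_b, k)  # train on A, test on B
acc_b_to_a = accuracy(subset_b, subset_a, k)  # train on B, test on A
print(acc_a_to_b, acc_b_to_a)
```

Sorting all training points for every query is quadratic and
would be far too slow for the full data set; it is used here
only to keep the sketch short.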
The total number of instances of KDD CUP 1999 taken in
this model is around one million because of the limited
availability of computer memory. The confusion matrix for
the KDD dataset is shown in the table below.
Table 1: Confusion matrix of the KDD CUP 1999 dataset
         Normal   Probe     DoS     R2L     U2R
Normal    43224      30      20       5       0
Probe        90   27135      10       0       0
DoS          10      25   27180       0       0
R2L          30       0      30    2178       0
U2R           0       0       0       0      33
The performance of the ensemble is measured using the
following quantities:
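$$\text{Accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{positive} + \text{negative}}$$
$$\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$$
$$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$$
$$\text{Specificity} = \frac{\text{true negative}}{\text{true negative} + \text{false positive}}$$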
The table below shows the precision, specificity, recall and
accuracy obtained for each class in this work:
Table 2: Values of different parameters obtained
Class    Precision   Specificity   Recall   Accuracy
Normal   0.997       0.997         0.998    0.432
Probe    0.997       0.999         0.996    0.271
DoS      0.997       0.999         0.998    0.271
R2L      0.997       0.999         0.973    0.022
U2R      1.000       1.000         1.000    0.003
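As a worked check, the measures defined above can be
recomputed directly from the confusion matrix in Table 1.
The sketch below is illustrative only. It assumes that rows
are actual classes and columns are predicted classes (the
text does not state the orientation), and the function name
per_class_metrics and the extra share_of_total field are
hypothetical. Under that reading the precision, specificity
and recall agree with Table 2 to within about 0.001, and for
the larger classes the Accuracy column of Table 2 appears to
match the correctly classified count of the class divided by
the total number of records rather than (TP + TN) / total.

```python
LABELS = ["Normal", "Probe", "DoS", "R2L", "U2R"]

# Counts copied from Table 1. Assumption: rows = actual, columns = predicted.
CONFUSION = [
    [43224, 30, 20, 5, 0],
    [90, 27135, 10, 0, 0],
    [10, 25, 27180, 0, 0],
    [30, 0, 30, 2178, 0],
    [0, 0, 0, 0, 33],
]


def per_class_metrics(matrix, index):
    """One-vs-rest metrics for the class at position index."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                # actual class, predicted other
    fp = sum(row[index] for row in matrix) - tp  # other class, predicted this
    tn = total - tp - fn - fp
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "share_of_total": tp / total,  # hypothetical extra field
    }


total = sum(sum(row) for row in CONFUSION)
overall_accuracy = sum(CONFUSION[i][i] for i in range(len(LABELS))) / total
metrics = {label: per_class_metrics(CONFUSION, i)
           for i, label in enumerate(LABELS)}

for label in LABELS:
    row = " ".join(f"{name}={value:.4f}"
                   for name, value in metrics[label].items())
    print(f"{label:7s} {row}")
print(f"overall accuracy = {overall_accuracy:.4f}")
```

The diagonal of Table 1 sums to 99,750 out of 100,000
records, which gives the overall accuracy of 0.9975.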
From the two tables above, the overall accuracy is
approximately 99.8%. The accuracies of individual k-NN
classifiers are shown in the table below:
Table 3: Accuracy of a single k-NN classifier for
different values of k
Value of k   Accuracy (%)
14           96.26225582
10           93.83818346
18           95.29526776
20           98.20442344
40           91.71358899
30           97.92837342
The table above shows that every single classifier has lower
accuracy than the ensemble. From this comparison we can say
that the ensemble designed in this work gives a better
accuracy rate than the individual classifiers.
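The reason an ensemble can beat each of its members is that
members which err on different records can outvote one
another. The sketch below illustrates this with a plain
majority vote, which is a deliberately simpler stand-in for
the GP-evolved combination used in this paper. The labels,
the three member prediction lists and the k values in the
comments are made up for the example, and the data are
contrived so that the members' errors do not overlap.

```python
from collections import Counter


def majority_vote(member_predictions):
    """Combine per-member label lists into one label per record."""
    combined = []
    for labels in zip(*member_predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def accuracy(predicted, actual):
    """Fraction of records whose predicted label equals the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical ground truth and member outputs (not taken from the paper).
actual = ["normal", "dos", "probe", "normal", "dos", "r2l"]
member_predictions = [
    ["normal", "dos", "probe", "normal", "normal", "r2l"],  # e.g. k = 10
    ["normal", "dos", "normal", "normal", "dos", "r2l"],    # e.g. k = 14
    ["dos", "dos", "probe", "normal", "dos", "normal"],     # e.g. k = 20
]

member_accuracies = [accuracy(p, actual) for p in member_predictions]
ensemble_accuracy = accuracy(majority_vote(member_predictions), actual)
print(member_accuracies, ensemble_accuracy)
```

When members make the same mistakes on the same records, a
vote brings no gain, which is one motivation for learning the
combination rule instead of fixing it in advance.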

As shown in the figure below, our approach obtains the
largest area under the ROC curve (0.99976) as well as the
lowest false alarm rate at the point where all the intrusion
attacks are correctly classified, which further supports the
robust performance of our approach. The experimental results
in this paper show the successful classification of
sophisticated intrusion attacks and normal network traffic.
Fig 3: ROC curve for the GP-based classifier, showing an
area under the curve of 0.99976
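For readers unfamiliar with the measure, the area under the
ROC curve can be obtained by sweeping a decision threshold
over the classifier scores and integrating the resulting
(false positive rate, true positive rate) points with the
trapezoidal rule. The sketch below is a generic illustration
under assumed inputs: the function names roc_points and auc
and the eight toy scores are hypothetical, and it does not
reproduce the 0.99976 reported here.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points from sweeping the threshold downwards.

    scores: higher means more likely to be an attack.
    labels: 1 for attack (positive), 0 for normal (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0],
                    reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores and labels for eight records.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
area = auc(roc_points(scores, labels))
print(area)
```

An area of 1.0 means every attack record is scored above
every normal record; 0.5 corresponds to random ranking.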
VI. CONCLUSION
To address all of these issues we proposed a system, i.e. an
ensemble built using genetic programming, which performs
better than the single classifiers it was compared with. In
this paper we ensemble only k-NN classifiers. Few studies
have been carried out on ensembling heterogeneous classifier
types to obtain good results. This paper has addressed many
issues that make it difficult to design effective
classifiers. It discussed GP in detail and how a classifier
with better performance can be developed. In short, building
the ensemble of classifiers with genetic programming reduces
the need for human expertise and yields an automatic system.
These results demonstrate that classifier models trained
using any four folds acquire sufficient information to
achieve high detection rates when the fifth fold is used for
testing. Algorithms tested in the literature were able to
achieve a detection rate of only approximately 80% for the
U2R and R2L attack categories. If the KDD training and
testing data subsets are merged and re-sampled with 99.89%
of records in the training data subset and the remaining
records in the testing data subset, the detection rates for
the U2R and R2L attack categories rise to 99%. This clearly
indicates that the original KDD training and testing data
subsets represent dissimilar target hypotheses for the U2R
and R2L attack categories.
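The four-folds-for-training, one-fold-for-testing protocol
mentioned above can be sketched as below. This is an
illustration only: the function name k_fold_indices and the
record count of 12 are assumptions, and the folds are taken
as contiguous index ranges, whereas in practice the records
would normally be shuffled (and often stratified by class)
before splitting.

```python
def k_fold_indices(n_records, n_folds=5):
    """Yield (train_indices, test_indices) with each fold held out once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_records, n_folds)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = (list(range(0, start))
                 + list(range(start + size, n_records)))
        yield train, test
        start += size


# Hypothetical data set of 12 records split into five folds.
folds = list(k_fold_indices(12, n_folds=5))
for train, test in folds:
    print(len(train), test)
```

Each record is used for testing exactly once and for
training in the other four rounds.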
Intrusion detection based on statistical pattern recognition
approaches has attracted a wide range of interest over the
last decade in response to the growing demand for reliable
and intelligent intrusion detection systems (IDS), which are
required to detect sophisticated and polymorphous intrusion
attacks. In this work we have presented a novel intrusion
detection approach that uses a genetic programming based
ensemble to detect intrusions. The experimental results
demonstrate that the GP-based ensemble classifier is
effective in reducing false alarms, so that widely deployed
IDSs can be implemented using our approach with both
accuracy and interpretability in mind. In future, feature
selection can be used not only to alleviate the curse of
dimensionality and minimize classification errors, but also
to improve the interpretability of ensemble-based
classifiers. Our future work will focus on reducing the
features used by the classifiers through feature selection.
The work will also continue to study the fitness function of
the genetic algorithm in order to manipulate more parameters
of the fuzzy inference module, even concentrating on the
fuzzy rules themselves.
VII. REFERENCES
[1]. Kabiri, Peyman, and Ali A. Ghorbani. "Research on
Intrusion Detection and Response: A Survey." IJ
Network Security 1, no. 2 (2005): 84-102.
[2]. Schnackenberg, Dan, Kelly Djahandari, and Dan
Sterne. "Infrastructure for intrusion detection and
response." In DARPA Information Survivability
Conference and Exposition, 2000. DISCEX'00.
Proceedings, vol. 2, pp. 3-11. IEEE, 2000.
[3]. Garcia-Teodoro, Pedro, J. Diaz-Verdejo, Gabriel
Maciá-Fernández, and Enrique Vázquez.
"Anomaly-based network intrusion detection:
Techniques, systems and challenges." Computers &
Security 28, no. 1 (2009): 18-28.
[4]. Wu, Handong, Stephen Schwab, and Robert Lom
Peckham. "Signature based network intrusion
detection system and method." U.S. Patent
7,424,744, issued September 9, 2008.
[5]. Kreibich, Christian, and Jon Crowcroft.
"Honeycomb: creating intrusion detection
signatures using honeypots." ACM SIGCOMM
Computer Communication Review 34, no. 1
(2004): 51-56.
[6]. Roesch, Martin. "Snort: Lightweight Intrusion
Detection for Networks." In LISA, vol. 99, no. 1,
pp. 229-238. 1999.
[7]. Rokach, Lior. "Ensemble-based classifiers."
Artificial Intelligence Review 33, no. 1-2 (2010): 1-
39.
[8]. Ruta, Dymitr, and Bogdan Gabrys. "An overview of
classifier fusion methods." Computing and
Information systems 7, no. 1 (2000): 1-10.
[9]. Polikar, Robi. "Ensemble based systems in decision
making." Circuits and Systems Magazine, IEEE 6,
no. 3 (2006): 21-45.
[10]. Džeroski, Saso, and Bernard Ženko. "Is combining
classifiers with stacking better than selecting the
best one?." Machine learning 54, no. 3 (2004): 255-
273.
[11]. Beyer, Kevin, Jonathan Goldstein, Raghu
Ramakrishnan, and Uri Shaft. "When is “nearest
neighbor” meaningful?." In Database Theory—
ICDT’99, pp. 217-235. Springer Berlin Heidelberg,
1999.
[12]. Liao, Yihua, and V. Rao Vemuri. "Use of k-nearest
neighbor classifier for intrusion detection."
Computers & Security 21, no. 5 (2002): 439-448.
[13]. Gower, John Clifford. "Properties of Euclidean and
non-Euclidean distance matrices." Linear Algebra
and its Applications 67 (1985): 81-97.
[14]. Hoque, Mohammad Sazzadul, Md Mukit, Md
Bikas, and Abu Naser. "An implementation of
intrusion detection system using genetic algorithm."
arXiv preprint arXiv:1204.1336 (2012).
[15]. Owais, Suhail, Vaclav Snasel, Pavel Kromer, and
Ajith Abraham. "Survey: using genetic algorithm
approach in intrusion detection systems
techniques." In Computer Information Systems and
Industrial Management Applications, 2008.
CISIM'08. 7th, pp. 300-307. IEEE, 2008.
[16]. Bankovic, Zorana, Dušan Stepanovic, Slobodan
Bojanic, and Octavio Nieto-Taladriz. "Improving
network security using genetic algorithm
approach." Computers & Electrical Engineering 33,
no. 5 (2007): 438-451.
[17]. LaRoche, Patrick, and A. Nur Zincir-Heywood.
"Genetic programming based wifi data link layer
attack detection." pp. 285-292. IEEE, 2006.
[18]. Folino, Gianluigi, Clara Pizzuti, and Giandomenico
Spezzano. "GP ensemble for distributed intrusion
detection systems." In Pattern Recognition and Data
Mining, pp. 54-62. Springer Berlin Heidelberg,
2005.
[19]. Lu, Wei, and Issa Traore. "Detecting new forms of
network intrusion using genetic programming."
Computational Intelligence 20, no. 3 (2004): 475-
494.
[20]. Li, Wei. "Using genetic algorithm for network
intrusion detection." Proceedings of the United
States Department of Energy Cyber Security Group
(2004): 1-8.
[21]. Mukkamala, Srinivas, Andrew H. Sung, and Ajith
Abraham. "Modeling intrusion detection systems
using linear genetic programming approach." In
Innovations in applied artificial intelligence, pp.
633-642. Springer Berlin Heidelberg, 2004.
[22]. Crosbie, Mark, and Gene Spafford. "Applying
genetic programming to intrusion detection." In
Working Notes for the AAAI Symposium on
Genetic Programming, pp. 1-8. Cambridge, MA:
MIT Press, 1995.
