ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 2, April 2012

Attack Detection by Clustering and Classification Approach

Ms. Priyanka J. Pathak, Asst. Prof. Snehlata S. Dongre
Abstract—Intrusion detection is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a management station. Security is becoming a big issue for all networks: hackers and intruders have made many successful attempts to bring down high-profile company networks and web services. An Intrusion Detection System (IDS) is an important countermeasure used to preserve data integrity and system availability in the face of attacks. The work is implemented in two phases: in the first phase, clustering is done with K-Means; in the next step, classification is done with K-Nearest Neighbours and decision trees. The objects are clustered, or grouped, on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. This paper proposes an approach which forms clusters of similar attacks and then, in the classification step with K-Nearest Neighbours and decision trees, detects the attack types. This method is advantageous over a single classifier, as it detects classes better than a single-classifier system.

Index Terms—Data mining, Decision Trees, K-Means, K-Nearest Neighbours.

I. INTRODUCTION

Nowadays everyone gets connected to the network, but there are many issues in the network environment, so there is a need to secure information, because many security threats are present in that environment. Intrusion detection [9] is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a management station. Intrusion detection systems are usually classified as host based or network based in terms of target domain. A number of techniques are available for intrusion detection [8], [11]. Data mining is one of the efficient techniques available for intrusion detection; it refers to extracting, or "mining", knowledge from large amounts of data. Data mining techniques may be supervised or unsupervised. Clustering [10] and classification are the two important techniques used in data mining for extracting valuable information, and this paper proposes a combination of the two. Classification is the process of finding a set of models which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.

The paper is organized as follows. Section 2 gives the details of the clustering algorithms. Sections 3 and 4 give details about strategies for classification for novel-class detection in data streams for intrusion detection. Clustering is done by the K-Means algorithm, which generates clusters of similar attack types falling into the same category. The centroids generated by this algorithm are used in the next step, classification, where the K-Nearest Neighbours and decision tree algorithms detect the different types of attacks.

II. CLUSTERING

A cluster [Han & Kamber] is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Clustering [8], [10] is an unsupervised learning technique which divides a data set into subparts that share common properties. For the clustered data points there should be high intra-cluster similarity and low inter-cluster similarity; a clustering method which produces such clusters is considered a good clustering algorithm.

A. K-Means Algorithm

K-Means [Han & Kamber] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step.
All Rights Reserved © 2012 IJARCSEE
After we have these k new centroids, a new binding has to be done between the same data-set points and the nearest new centroid; a loop has thereby been generated. As a result of this loop, we may notice that the k centroids change their location step by step until no more changes occur.

The algorithm is composed of the following steps:

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers;
(2) Repeat;
(3) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) Until no change.
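As an illustrative sketch of steps (1)–(5) above (not the paper's implementation: the toy 2-D points below stand in for the 42-feature KDD records):

```python
import random

def kmeans(points, k, max_iter=100):
    """Sketch of the k-means steps (1)-(5) above."""
    # (1) Arbitrarily choose k objects as the initial cluster centers.
    centroids = random.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # (3) (Re)assign each object to the most similar cluster,
        #     i.e. the one whose mean is nearest (squared distance).
        new_assignments = [
            min(range(k),
                key=lambda j: sum((p_i - c_i) ** 2
                                  for p_i, c_i in zip(p, centroids[j])))
            for p in points
        ]
        # (5) Stop when no object changes its cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # (4) Update each cluster mean (the new centroid).
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments
```

Each returned centroid is the barycenter of its final cluster, matching the loop described in the prose.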
Fig. 1: Clustering of a set of objects based on the k-means method

This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Fig. 2: Output of the K-Means algorithm

III. CLASSIFICATION

Classification [4], [13] is a process of finding a model that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data. This paper introduces a method combining clustering and classification: clustering is done by the K-Means algorithm explained in the earlier section, and a classification method is then proposed through which the novel class can be detected. The following figure shows the overall system architecture; clustering and classification are the two parts of the system.

The input to the system is the KDD data set. The first step is to preprocess the input, and this preprocessed input is applied to the K-Means clustering algorithm. The preprocessing step contains:
Data cleaning
Feature selection
Data transformation, etc.

Fig. 3: System architecture

The K-Means creates the clusters of similar attacks based on a similarity measure. Each record of the data set contains 42 different features, whose values are already pre-processed in the training data. The calculated centroid values are used in the subsequent K-Nearest Neighbours classification.

Fig. 4: Preprocessed data

A. K-Nearest Neighbour Classification

In this method, one simply finds, in the N-dimensional feature space, the closest object from the training set to the object being classified. Since the neighbour is nearby, it is likely to be similar to the object being classified and so is likely to be of the same class. Nearest-neighbour methods have the advantage that they are easy to implement, and they can give quite good results if the features are chosen carefully. A new sample is classified by calculating the distance to the nearest training case; the sign of that point then determines the classification of the sample.
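As a minimal, illustrative sketch of this nearest-neighbour rule (the training points and labels below are invented for the example, not taken from the paper's KDD experiments):

```python
import math
from collections import Counter

def knn_classify(sample, training_set, k=1):
    """Assign `sample` the most common class among its k nearest
    training cases, closeness measured by Euclidean distance."""
    # Sort the (point, label) training cases by distance to the sample.
    nearest = sorted(training_set,
                     key=lambda case: math.dist(sample, case[0]))[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k = 1 this is exactly the "closest object from the training set" rule; larger k gives the majority vote described below.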
"Closeness" is defined in terms of Euclidean distance, where the Euclidean distance between two points X = (x1, ..., xn) and Y = (y1, ..., yn) is

d(X, Y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)

Fig. 5: Flow chart of the K-Nearest Neighbour algorithm

The unknown sample is assigned the most common class among its k nearest neighbours. The following figure shows the attack types and the number of nodes detected by the K-Nearest Neighbour algorithm.

Fig. 6: Output of the K-Nearest Neighbour algorithm

B. Decision Trees

Decision tree learning [5], [6] is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute-value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.

C. ID3 Algorithm

In decision tree learning, ID3 (Iterative Dichotomiser 3) [6] is an algorithm used to generate a decision tree. The ID3 algorithm can be summarized as follows:
1. Take all unused attributes and count their entropy concerning the test samples.
2. Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a node containing that attribute.

The algorithm is as follows:
Input: a data set, S
Output: a decision tree
If all the instances have the same value for the target attribute, then return a decision tree that is simply this value.
Else:
◦ Compute the Gain values for all attributes, select the attribute with the highest value, and create a node for that attribute.
◦ Make a branch from this node for every value of the attribute.
◦ Assign all possible values of the attribute to the branches.
◦ Follow each branch by partitioning the data set to only those instances where the value of the branch is present, and then go back to step 1.

This algorithm usually produces small trees, but it does not always produce the smallest possible tree. The optimization step makes use of information entropy.

Entropy:

E(S) = − Σ fS(j) · log2 fS(j), summed over the n values j

Where:
E(S) is the information entropy of the set S;
n is the number of different values of the attribute in S (entropy is computed for one chosen attribute);
fS(j) is the frequency (proportion) of the value j in the set S;
log2 is the binary logarithm.

An entropy of 0 identifies a perfectly classified set. Entropy is used to determine which node to split next in the algorithm: the higher the entropy, the higher the potential to improve the classification.

Gain:

Gain is computed to estimate the gain produced by a split over an attribute:

G(S, A) = E(S) − Σ fS(Ai) · E(SAi), summed over the m values Ai of A

Where:
G(S, A) is the gain of the set S after a split over attribute A;
E(S) is the information entropy of the set S;
m is the number of different values of attribute A in S;
fS(Ai) is the frequency (proportion) of the items possessing Ai as the value for A in S;
Ai is the i-th possible value of A;
SAi is the subset of S containing all items where the value of A is Ai.
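The two formulas can be sketched directly in Python (an illustration only; the record layout, with attributes and the class label as dictionary keys, is an assumption made for the example):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum of fS(j) * log2 fS(j) over the values j in S."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def gain(rows, attr, target):
    """G(S, A) = E(S) - sum of fS(Ai) * E(SAi) over the values Ai of A."""
    g = entropy([row[target] for row in rows])
    for value in {row[attr] for row in rows}:
        # SAi: the subset of rows where attribute A takes the value Ai.
        subset = [row[target] for row in rows if row[attr] == value]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g
```

ID3 would call `gain` once per unused attribute at each node and split on the attribute with the highest value.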
Gain quantifies the entropy improvement obtained by splitting over an attribute.

Fig. 7: Output of the ID3 algorithm

IV. CONCLUSION

In this paper, work is presented through which the novel class is detected. The system uses the K-Means clustering algorithm, which produces different clusters of similar types of attacks and the total nodes per cluster in the input data set; it also shows the updated centroids of each parameter in the input data set. The classification stage gives details about the detection of the different types of attacks and the number of nodes in the data set.

REFERENCES

[1] Kapil K. Wankhade, Snehlata S. Dongre, Prakash S. Prasad, Mrudula M. Gudadhe, Kalpana A. Mankar, "Intrusion Detection System Using New Ensemble Boosting Approach", 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[2] Kapil K. Wankhade, Snehlata S. Dongre, Kalpana A. Mankar, Prashant K. Adakane, "A New Adaptive Ensemble Boosting Classifier for Concept Drifting Stream Data", 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[3] Hongbo Zhu, Yaqiang Wang, Zhonghua Yu, "Clustering of Evolving Data Stream with Multiple Adaptive Sliding Window", International Conference on Data Storage and Data Engineering, 2010.
[4] Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo, "Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams", IEEE International Conference on Data Mining, 2010.
[5] T. Jyothirmayi, Suresh Reddy, "An Algorithm for Better Decision Tree", (IJCSE) International Journal on Computer Science and Engineering, 2010.
[6] Yongjin Liu, Na Li, Leina Shi, Fangping Li, "An Intrusion Detection Method Based on Decision Tree", International Conference on E-Health Networking, Digital Ecosystems and Technologies, 2010.
[7] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby and R. Gavalda, "New ensemble methods for evolving data streams", in KDD'09, ACM, Paris, 2009, pp. 139-148.
[8] LIU Li-xiong, KANG Jing, GUO Yun-fei, HUANG Hai, "A Three-Step Clustering Algorithm over an Evolving Data Stream", sponsored by the National High Technology Research and Development Program ("863" Program) of China, Fund 2008 AAOII002.
[9] Mrutyunjaya Panda, Manas Ranjan Patra, "A Comparative Study of Data Mining Algorithms for Network Intrusion Detection", First International Conference on Emerging Trends in Engineering and Technology, 2008.
[10] Shuang Wu, Chunyu Yang and Jie Zhou, "Clustering-training for Data Stream Mining", Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).
[11] Sang-Hyun Oh, Jin-Suk Kang, Yung-Cheol Byun, Gyung-Leen Park and Sang-Yong Byun, "Intrusion Detection based on Clustering a Data Stream", Third ACIS Int'l Conference on Software Engineering Research, Management and Applications (SERA'05).
[12] Srilatha Chebrolu, Ajith Abraham, Johnson P. Thomas, "Feature deduction and ensemble design of intrusion detection systems", Elsevier Ltd, 2004.
[13] Sandhya Peddabachigari, Ajith Abraham, Crina Grosan, Johnson Thomas, "Modeling intrusion detection system using hybrid intelligent systems", Elsevier Ltd, 2005.

First Author:
Priyanka J. Pathak, IV sem M.Tech [CSE], G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U., Nagpur.

Second Author:
Asst. Prof. Snehlata Dongre, CSE Department, G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U., Nagpur.