ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 2, April 2012
All Rights Reserved © 2012 IJARCSEE

Attack Detection By Clustering And Classification Approach

Ms. Priyanka J. Pathak, Asst. Prof. Snehlata S. Dongre

Abstract: An intrusion detection system is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a Management Station. Security is becoming a big issue for all networks. Hackers and intruders have made many successful attempts to bring down high-profile company networks and web services. An Intrusion Detection System (IDS) is an important detection mechanism that is used as a countermeasure to preserve data integrity and system availability against attacks. The work is implemented in two phases: in the first phase clustering is done by K-means, and in the next phase classification is done with k-nearest neighbours and decision trees. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. This paper proposes an approach which forms clusters of similar attacks and then, in the classification step, detects the attack types with K-nearest neighbours and decision trees. This method is advantageous over a single classifier, as it detects classes better than a single-classifier system.

Index Terms: Data mining, Decision Trees, K-Means, K-Nearest Neighbours.

I. INTRODUCTION

Nowadays everyone is connected to a network, but the network environment raises many issues. Information needs to be secured because many security threats are present in the network environment. Intrusion detection [9] is a software application that monitors network and/or system activities for malicious activities or policy violations and produces reports to a Management Station. Intrusion detection systems are usually classified as host based or network based in terms of target domain. A number of techniques are available for intrusion detection [8],[11]. Data mining is one of the efficient techniques available for intrusion detection. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining techniques may be supervised or unsupervised.

Clustering [10] and classification are the two important techniques used in data mining for extracting valuable information. This paper proposes a combination of clustering and classification. Classification is the process of finding a set of models which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.

The paper is organized as follows. Section II gives the details of the clustering algorithm. Section III details the strategies for classification for novel class detection in data streams for intrusion detection, and Section IV concludes. Clustering is done by the K-means algorithm, which generates clusters of similar attack types that fall into the same category. The centroids generated by this algorithm are used in the next step, classification. In the classification step, the K-nearest neighbours and decision tree algorithms are used to detect the different types of attacks.

II. CLUSTERING

A cluster [Han & Kamber] is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Clustering [8],[10] is an unsupervised learning technique which divides a dataset into subparts that share common properties. For clustering data points, there should be high intra-cluster similarity and low inter-cluster similarity. A clustering method which results in such clusters is considered a good clustering algorithm.

A. K-Means Algorithm

K-means [Han & Kamber] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results. The better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point, the k centroids need to be re-calculated.
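The procedure outlined in this subsection (place k centroids far apart, attach every point to its nearest centroid, recompute the centroids, and repeat) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes Euclidean distance as the similarity measure, and the helper `farthest_first` is a hypothetical heuristic chosen here only as one way to spread the initial centroids. The sample points are invented for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first(points, k):
    """Assumed heuristic: start from the first point, then repeatedly add
    the point whose nearest already-chosen centroid is farthest away."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    centroids = farthest_first(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        # Recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no point changed cluster: stop
            break
        labels = new_labels
    return centroids, labels


# Hypothetical two-dimensional records forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On real intrusion data each tuple would hold the numeric features of one pre-processed record, and the returned centroids are the values that a later classification step could consume.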
The k new centroids are the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop, the k centroids change their location step by step until no more changes occur.

The algorithm is composed of the following steps:

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input: k: the number of clusters; D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers;
(2) Repeat
(3) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) Until no change;

Fig. 1: Clustering of a set of objects based on the k-means method

This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Fig. 2: Output of K-means algorithm

III. CLASSIFICATION

Classification [4],[13] is the process of finding a model that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data. This paper introduces a method that combines clustering and classification. The clustering is done by the K-Means algorithm, which is explained in the earlier section. A classification method is proposed through which a novel class can be detected. The following figure shows the overall system architecture. Clustering and classification are the two parts of the system.

The input to the system is the KDD data set. The first step is to pre-process the input, and this pre-processed input is applied to the K-Means clustering algorithm. The pre-processing steps include:
• Data cleaning
• Feature selection
• Data transformation, etc.

Fig. 3: System Architecture

K-Means creates clusters of similar attacks based on a similarity measure. Each record of the data set contains 42 different features. The values of these features are already pre-processed in the training data. The calculated centroid values are used in the subsequent K-Nearest Neighbours classification.

Fig. 4: Preprocessed Data

A. K-Nearest Neighbour Classification

In this method, one simply finds, in the N-dimensional feature space, the closest object from the training set to the object being classified. Since the neighbour is nearby, it is likely to be similar to the object being classified and so is likely to belong to the same class as that object. Nearest neighbour methods have the advantage that they are easy to implement. They can also give quite good results if the features are chosen carefully. A new sample is classified by calculating the distance to the nearest training case; the sign of that point then determines the classification of the sample.
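The nearest-neighbour rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the system's actual code: it assumes Euclidean distance as the closeness measure, and the function name `knn_classify`, the feature vectors, and the labels "normal" and "dos" are all hypothetical. With k = 1 it returns the class of the single closest training case; a larger k takes a majority vote among the k closest cases.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(sample, training, k=1):
    """Classify `sample` from `training`, a list of (features, label) pairs.

    The k closest training cases are found and the most common label
    among them is returned.
    """
    neighbours = sorted(training, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled records (two numeric features each).
training = [
    ((0.0, 0.1), "normal"),
    ((0.2, 0.0), "normal"),
    ((0.1, 0.3), "normal"),
    ((5.0, 5.1), "dos"),
    ((5.2, 4.9), "dos"),
    ((4.8, 5.3), "dos"),
]

label_1nn = knn_classify((4.9, 5.0), training, k=1)
label_3nn = knn_classify((0.3, 0.2), training, k=3)
```

Sorting the whole training set is the simplest formulation; for a data set the size of KDD, a partial selection or a spatial index would be the more practical choice.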
Closeness is defined in terms of Euclidean distance, where the Euclidean distance between two points $X = (x_1, x_2, \ldots, x_N)$ and $Y = (y_1, y_2, \ldots, y_N)$ is

$$d(X, Y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$

Fig. 5: Flow chart of K-Nearest Neighbour algorithm

The unknown sample is assigned the most common class among its k nearest neighbours. The following figure shows the attack types and the number of nodes of each type detected by the K-Nearest Neighbour algorithm.

Fig. 6: Output of K-Nearest Neighbour algorithm

B. Decision Trees

Decision tree learning [5],[6] is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.

C. ID3 Algorithm

In decision tree learning, ID3 (Iterative Dichotomiser 3) [6] is an algorithm used to generate a decision tree. The ID3 algorithm can be summarized as follows:
1. Take all unused attributes and compute their entropy with respect to the test samples.
2. Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a node containing that attribute.

The algorithm is as follows:
Input: A data set, S
Output: A decision tree
• If all the instances have the same value for the target attribute, return a decision tree that is simply this value.
• Else
  ◦ Compute Gain values for all attributes, select the attribute with the highest value, and create a node for that attribute.
  ◦ Make a branch from this node for every value of the attribute.
  ◦ Assign all possible values of the attribute to branches.
  ◦ Follow each branch by partitioning the dataset to only those instances in which the value of the branch is present, and then go back to the first step.

This algorithm usually produces small trees, but it does not always produce the smallest possible tree. The optimization step makes use of information entropy.

Entropy:

$$E(S) = -\sum_{j=1}^{n} f_S(j) \log_2 f_S(j)$$

where:
• $E(S)$ is the information entropy of the set $S$;
• $n$ is the number of different values of the attribute in $S$ (entropy is computed for one chosen attribute);
• $f_S(j)$ is the frequency (proportion) of the value $j$ in the set $S$;
• $\log_2$ is the binary logarithm.

An entropy of 0 identifies a perfectly classified set. Entropy is used to determine which node to split next in the algorithm. The higher the entropy, the higher the potential to improve the classification.

Gain:

Gain is computed to estimate the gain produced by a split over an attribute:

$$G(S, A) = E(S) - \sum_{i=1}^{m} f_S(A_i)\, E(S_{A_i})$$

where:
• $G(S, A)$ is the gain of the set $S$ after a split over the attribute $A$;
• $E(S)$ is the information entropy of the set $S$;
• $m$ is the number of different values of the attribute $A$ in $S$;
• $f_S(A_i)$ is the frequency (proportion) of the items possessing $A_i$ as value for $A$ in $S$;
• $A_i$ is the $i$-th possible value of $A$.
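The entropy and gain computations, and the recursive tree construction built on them, can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the system described in the paper: records are assumed to be dictionaries of categorical values, the field names `protocol`, `flag` and `label` and the four sample records are hypothetical, and ties between attributes are broken by list order.

```python
import math
from collections import Counter


def entropy(rows, target):
    """E(S): entropy of the `target` attribute over the records in `rows`."""
    total = len(rows)
    counts = Counter(row[target] for row in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain(rows, attribute, target):
    """G(S, A): entropy of S minus the weighted entropy of each subset of S
    on which `attribute` takes a single value."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row for row in rows if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder


def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree_or_label}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # all instances agree: leaf
        return labels[0]
    if not attributes:                 # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    node = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        node[best][value] = id3(subset, remaining, target)
    return node


# Hypothetical connection records; the label depends only on `flag`.
rows = [
    {"protocol": "tcp", "flag": "SF", "label": "normal"},
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "udp", "flag": "SF", "label": "normal"},
    {"protocol": "udp", "flag": "S0", "label": "attack"},
]

tree = id3(rows, ["protocol", "flag"], "label")
```

In this toy data the two classes are evenly split, so the entropy of the full set is 1 bit; splitting on `flag` removes all of it while splitting on `protocol` removes none, so `flag` becomes the root.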
Here $S_{A_i}$ is the subset of $S$ containing all items where the value of $A$ is $A_i$. Gain quantifies the entropy improvement obtained by splitting over an attribute.

Fig. 7: Output of ID3 algorithm

IV. CONCLUSION

This paper presents work through which a novel class is detected. The system uses the K-Means clustering algorithm, which produces different clusters of similar types of attacks and the total number of nodes per cluster in the input dataset. It also shows the updated centroids of each parameter in the input dataset. The classification stage gives details about the detection of different types of attacks and the number of nodes in the dataset.

REFERENCES

[1] Kapil K. Wankhade, Snehlata S. Dongre, Prakash S. Prasad, Mrudula M. Gudadhe, Kalpana A. Mankar, "Intrusion Detection System Using New Ensemble Boosting Approach", 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[2] Kapil K. Wankhade, Snehlata S. Dongre, Kalpana A. Mankar, Prashant K. Adakane, "A New Adaptive Ensemble Boosting Classifier for Concept Drifting Stream Data", 2011 3rd International Conference on Computer Modeling and Simulation (ICCMS 2011).
[3] Hongbo Zhu, Yaqiang Wang, Zhonghua Yu, "Clustering of Evolving Data Stream with Multiple Adaptive Sliding Window", International Conference on Data Storage and Data Engineering, 2010.
[4] Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo, "Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams", IEEE International Conference on Data Mining, 2010.
[5] T. Jyothirmayi, Suresh Reddy, "An Algorithm for Better Decision Tree", (IJCSE) International Journal on Computer Science and Engineering, 2010.
[6] Yongjin Liu, Na Li, Leina Shi, Fangping Li, "An Intrusion Detection Method Based on Decision Tree", International Conference on E-Health Networking, Digital Ecosystems and Technologies, 2010.
[7] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby and R. Gavalda, "New ensemble methods for evolving data streams", In KDD'09, ACM, Paris, 2009, pp. 139-148.
[8] LIU Li-xiong, KANG Jing, GUO Yun-fei, HUANG Hai, "A Three-Step Clustering Algorithm over an Evolving Data Stream", The National High Technology Research and Development Program ("863" Program) of China, Fund 2008 AAOII002 sponsors.
[9] Mrutyunjaya Panda, Manas Ranjan Patra, "A Comparative Study of Data Mining Algorithms for Network Intrusion Detection", First International Conference on Emerging Trends in Engineering and Technology, 2008.
[10] Shuang Wu, Chunyu Yang and Jie Zhou, "Clustering-training for Data Stream Mining", Sixth IEEE International Conference on Data Mining - Workshops (ICDMW06).
[11] Sang-Hyun Oh, Jin-Suk Kang, Yung-Cheol Byun, Gyung-Leen Park and Sang-Yong Byun, "Intrusion Detection based on Clustering a Data Stream", Third ACIS Intl Conference on Software Engineering Research, Management and Applications (SERA'05).
[12] Srilatha Chebrolu, Ajith Abraham, Johnson P. Thomas, "Feature deduction and ensemble design of intrusion detection systems", 2004, Elsevier Ltd.
[13] Sandhya Peddabachigari, Ajith Abraham, Crina Grosan, Johnson Thomas, "Modeling intrusion detection system using hybrid intelligent systems", 2005, Elsevier Ltd.

First Author: Priyanka J. Pathak, IV sem MTech [CSE], G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U, Nagpur

Second Author: Asst. Prof. Snehlata Dongre, CSE Department, G.H. Raisoni College of Engineering, Nagpur, R.T.M.N.U, Nagpur