Intrusion Detection using C4.5: Performance Enhancement by Classifier Combination


Data security has become a critical part of any
organizational information system. An Intrusion Detection
System (IDS) is used as a security measure to preserve data
integrity and system availability against various attacks. This
paper evaluates the performance of the C4.5 classifier and its
combination with bagging, boosting and stacking on the NSL-KDD
dataset for IDS. This dataset consists of selected
records of the complete KDD dataset.

ACEEE Int. J. on Signal & Image Processing, Vol. 01, No. 03, Dec 2010
© 2010 ACEEE, DOI: 01.IJSIP.01.03.45

Manasi Gyanchandani (1), R. N. Yadav (2), J. L. Rana (3)
(1) Dept. of Information Technology, MANIT, Bhopal
(2) Dept. of Electronics and Communication, MANIT, Bhopal
(3) Ex-HOD, Dept. of CS/IT, MANIT, Bhopal

I. INTRODUCTION

Information technology has drastically changed our lives, and at the same time we have become completely dependent on technology that is vulnerable to attacks. Because of these attacks, the confidentiality, integrity and availability of information may be lost. It is estimated that these attacks cost tens or even hundreds of millions of dollars each year, and the number of attacks is doubling each year. These attacks may pose a serious threat to national security. IDSs were designed to monitor attacks and generate alarms whenever certain abnormal activities are detected.

IDSs can be categorized by the events they monitor and by the way they collect information that an intrusion has occurred. IDSs that analyze data circulating on the network are called network-based IDSs (NIDSs), and IDSs that reside on the host and collect logs of operating-system-related events are called host-based IDSs (HIDSs) [3] [8].

Two types of intrusion detection techniques exist, based on the method of inspecting the traffic:
• Signature based IDS
• Statistical anomaly based IDS

In signature based IDS, also known as misuse detection, signatures of known attacks are stored and events are matched against the stored signatures. The system signals an intrusion if a match is found. The main drawback of this method is that it cannot detect new attacks whose signatures are unknown. An IDS using misuse detection will therefore only detect known attacks, or attacks that are similar enough to a known attack to match its signature [3]. Statistical anomaly based intrusion detection has attracted many academic researchers because of its potential for addressing novel attacks. Researchers have found that several machine learning algorithms achieve a very high detection rate while keeping a low false alarm rate. Anomaly detection applied to intrusion detection and computer security has been an active area of research since it was originally proposed in [8].

Initially the KDDCUP ’99 dataset was used for IDS, but it has some inherent problems. The most important one is a huge number of redundant records: about 78% and 75% of the records are duplicated in the train and test set, respectively. The large number of redundant records in the train set causes learning algorithms to be biased towards the more frequent records, and thus prevents them from learning infrequent records, which are usually more harmful to networks, such as U2R and R2L attacks. Duplicated records in the test set, in turn, bias the evaluation results in favor of methods that have better detection rates on the frequent records. To remove these problems a new data set, NSL-KDD, was proposed [1].

C4.5, an extension of the ID3 algorithm [11], is an algorithm used to generate a decision tree. Because the decision trees generated by C4.5 can be used for classification, it is often referred to as a statistical classifier. Its results can vary significantly if the training data is changed. This variation is known as error due to variance, and it can be reduced by using classifier combinations.

Section II describes the dataset used. Section III describes the classifier combination techniques. Section IV provides the experimental results and discussion. Section V concludes the paper.

II. DATA SET DESCRIPTION

Most experiments on intrusion detection are done on the KDDCUP ’99 dataset, which is a subset of the 1998 DARPA Intrusion Detection Evaluation data set, processed to extract 41 features from the raw DARPA 98 data. The authors of [4] defined higher-level features that help in distinguishing “good” normal connections from “bad” connections (attacks). This data can be used to test both host based and network based systems, and both signature and anomaly detection systems. A connection is a sequence of Transmission Control Protocol (TCP) packets starting and ending at well defined times, between which data flows from a source IP address to a target IP address under some well defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes [9]. Some of the basic features of an individual TCP connection are listed in Table I.
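As a rough illustration of the connection records described above, the following Python sketch parses one record into the nine basic features named in Table I plus its label. It is a sketch under stated assumptions, not the dataset's official loader: the comma-separated layout, the field order in `BASIC_FIELDS`, the trailing-dot handling of labels, and the names `Connection` and `parse_record` are all assumptions made for illustration, and the sample records are made up and shortened (a real record carries 41 features).

```python
from dataclasses import dataclass

# Assumed order of the nine basic features within a record (names follow Table I).
BASIC_FIELDS = (
    "duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "wrong_fragment", "urgent",
)


@dataclass(frozen=True)
class Connection:
    """One connection record: basic features plus its class label (hypothetical type)."""
    duration: float        # continuous
    protocol_type: str     # discrete, e.g. "tcp"
    service: str           # discrete, e.g. "http"
    flag: str              # discrete status code
    src_bytes: float       # continuous
    dst_bytes: float       # continuous
    land: int              # discrete, 1 or 0
    wrong_fragment: float  # continuous
    urgent: float          # continuous
    label: str             # "normal" or one specific attack type

    @property
    def is_attack(self) -> bool:
        # Every record is labeled either normal or exactly one attack type.
        return self.label != "normal"


def parse_record(line: str) -> Connection:
    """Parse a comma-separated record: basic features first, label last (assumed layout)."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < len(BASIC_FIELDS) + 1:
        raise ValueError(
            f"expected at least {len(BASIC_FIELDS) + 1} fields, got {len(fields)}"
        )
    raw = dict(zip(BASIC_FIELDS, fields))  # only the first nine fields are used
    return Connection(
        duration=float(raw["duration"]),
        protocol_type=raw["protocol_type"],
        service=raw["service"],
        flag=raw["flag"],
        src_bytes=float(raw["src_bytes"]),
        dst_bytes=float(raw["dst_bytes"]),
        land=int(raw["land"]),
        wrong_fragment=float(raw["wrong_fragment"]),
        urgent=float(raw["urgent"]),
        label=fields[-1].rstrip("."),  # tolerate a trailing dot on the label
    )


if __name__ == "__main__":
    # Made-up, shortened sample records for illustration only.
    for sample in ("0,tcp,http,SF,181,5450,0,0,0,normal",
                   "0,icmp,ecr_i,SF,1032,0,0,0,0,smurf."):
        rec = parse_record(sample)
        print(rec.protocol_type, rec.service, rec.label, rec.is_attack)
```

Collapsing the label to a boolean via `is_attack` gives the two-class (normal versus attack) view; keeping `label` as a string preserves the specific attack type for a multi-class setup.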
TABLE I
BASIC FEATURES OF INDIVIDUAL TCP CONNECTIONS

Feature name    Description                                                  Type
Duration        length (number of seconds) of the connection                 Continuous
Protocol_type   type of the protocol, e.g. tcp, udp, etc.                    Discrete
Service         network service on the destination, e.g. http, telnet, etc.  Discrete
src_bytes       number of data bytes from source to destination              Continuous
dst_bytes       number of data bytes from destination to source              Continuous
Flag            normal or error status of the connection                     Discrete
Land            1 if connection is from/to the same host/port; 0 otherwise   Discrete
wrong_fragment  number of "wrong" fragments                                  Continuous
Urgent          number of urgent packets                                     Continuous

A. NSL-KDD

The KDD train and test sets contain a huge number of redundant records: about 78% and 75% of the records are duplicated in the train and test set, respectively. This may bias classification algorithms towards the redundant records and thus prevent them from correctly classifying the other (non-duplicate) records. To solve this problem, a new dataset, NSL-KDD, was developed. All repeated records in the entire KDD train and test sets were removed, and only one copy of each record was kept. Tables II and III show the statistical analysis of the reduction of repeated records in the KDD train and test sets, respectively [1].

TABLE II
STATISTICAL ANALYSIS OF REDUNDANT RECORDS IN THE KDD TRAIN SET

           Original records    Distinct records    Reduction rate
Attacks    3,925,650           262,178             93.32%
Normal     972,781             812,814             16.44%
Total      4,898,431           1,074,992           78.05%

TABLE III
STATISTICAL ANALYSIS OF REDUNDANT RECORDS IN THE KDD TEST SET

           Original records    Distinct records    Reduction rate
Attacks    250,436             29,378              88.26%
Normal     60,591              47,911              20.92%
Total      311,027             77,289              75.15%

B. Evaluation Metrics

The metrics mainly used to evaluate the performance of a classifier are presented in [6] [2] and are given here for ready reference.

• True positives (TP) and true negatives (TN) are correct classifications. The true positive rate (TPR) is the probability that there is an alert when there is an intrusion. It is calculated as below.
  TPR = TP / (TP + FN)
• A false positive (FP) occurs when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). The false positive rate (FPR) is calculated as below.
  FPR = FP / (TN + FP)
• A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.
• Recall: the percentage of the total relevant documents in a database that are retrieved by a search. If it is known that there are 1000 relevant documents in a database and a search retrieves 100 of them, the recall is 10%. It is calculated as below.
  Recall = TP / (TP + FN)
• Precision: the percentage of relevant documents in relation to the number of documents retrieved. If a search retrieves 100 documents and 20 of these are relevant, the precision is 20%. It is calculated as below.
  Precision = TP / (TP + FP)
• The overall success rate is the number of correct classifications divided by the total number of classifications.
  Success rate = (TP + TN) / (TP + TN + FP + FN)
  Error rate = 1 - Success rate
• In a multiclass prediction, the result on a test set is often displayed as a two dimensional confusion matrix with a row and a column for each class. Each matrix element shows the number of test examples for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the main diagonal and small, ideally zero, off-diagonal elements. For the two classes used here, the confusion matrix has the form shown in Table IV.

TABLE IV
CONFUSION MATRIX

                        Predicted class
                        Attack    Normal
Actual class   Attack   TP        FN
               Normal   FP        TN

III. CLASSIFIER COMBINATION TECHNIQUES

Classifier combination techniques can be used to reduce the error due to variance. To make decisions in intrusion detection more reliable, the outputs of different models can be combined. Several machine learning techniques do this by learning an ensemble of models and using them in combination; bagging, boosting and stacking are the most efficient among them. These techniques can increase predictive performance over a single model and can be applied to both numeric prediction problems and classification tasks. The performance of these three models
is good. An ensemble of classifiers is a set of classifiers whose individual decisions are combined to classify new examples. The purpose of combining classifiers is to improve on the accuracy of a single classifier [10].

A. Bagging

The bootstrap aggregating (bagging) algorithm generates different classifiers from different bootstrap samples and combines the decisions of these classifiers into a single prediction by voting (the class that gets the most votes from the classifiers wins).

B. Boosting

Another method for constructing an ensemble of classifiers is known as boosting, which is used to boost the performance of a weak learner. A weak learner is a simple classifier whose error is less than 50% on the training instances. Models that are more successful are assigned more weight than the other models, and each new model is influenced by the performance of the previously built models.

Thus boosting can build a powerful combined classifier from very simple learning methods, converting these weak learners into strong ones. It often produces classifiers that are more accurate on fresh data than those generated by bagging, but it sometimes fails in practical situations: it can generate a classifier that is less accurate than a single classifier built from the same data [7].

C. Stacking

Stacking is short for stacked generalization. Unlike bagging and boosting, it uses different learning algorithms to generate the ensemble of classifiers. The main idea of stacking is to combine classifiers from different learners, such as decision trees, instance-based learners, etc.

Since each learner uses a different knowledge representation and different learning biases, the hypothesis space is explored differently and different classifiers are obtained. It is therefore expected that their errors will not be correlated.

Once the classifiers have been generated, they must be combined. Unlike bagging and boosting, stacking does not use a voting system because, for example, if the majority of the classifiers make bad predictions, voting leads to a bad final classification. To solve this problem, stacking uses the concept of a meta learner [10]. The meta learner (or level-1 model) tries to learn, using a learning algorithm, how the decisions of the base classifiers (or level-0 models) should be combined.

IV. RESULTS AND DISCUSSION

Classifier combinations are used to reduce the error due to variance. Initially the C4.5 classifier is applied to the NSL-KDD dataset, which contains 125973 records in the train set and 22544 records in the test set. To improve the performance of the C4.5 classifier on the NSL-KDD dataset, the classifier combination techniques bagging, boosting and stacking are used.

For the normal class, as shown in Table V, bagging gives the better result: the recall is 0.719 for bagging and 0.708 for C4.5, with both having the same precision (0.973). For the anomaly class, as shown in Table VI, both recall and precision are higher for bagging.

TABLE V
PERFORMANCE METRICS FOR NORMAL CLASS

             Bagging    Boosting    Stacking    C4.5
TP           0.973      0.957       0.974       0.973
FP           0.288      0.346       0.326       0.304
Recall       0.719      0.677       0.693       0.708
Precision    0.973      0.957       0.974       0.973

TABLE VI
PERFORMANCE METRICS FOR ANOMALY CLASS

             Bagging    Boosting    Stacking    C4.5
TP           0.712      0.654       0.674       0.696
FP           0.027      0.043       0.026       0.027
Recall       0.972      0.953       0.971       0.971
Precision    0.712      0.654       0.674       0.696

V. CONCLUSIONS

Error due to variance has been reduced by using classifier combinations, thus increasing classification performance on the NSL-KDD dataset. Of the three classifier combinations, bagging provides the best results. The NSL-KDD dataset can also be used for performance evaluation with five classes (normal, dos, probe, u2r and r2l) instead of two. Performance can be improved further by reducing the features as given in [12]. Different sets of features are used for different classes. More classification algorithms and their combinations can be applied to the NSL-KDD dataset.

REFERENCES

[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set”, Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.
[2] M. Shyu, S. Chen, K. Sarinnapakorn, and L. Chang, “A novel anomaly detection scheme based on principal component classifier”, Proceedings of the IEEE Foundation & New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM03), pp. 172-179, 2003.
[3] D. E. Denning, “An Intrusion Detection Model”, IEEE Transactions on Software Engineering, SE-13, pp. 222-232, 1987.
[4] Stolfo J., Fan W., Lee W., Prodromidis A., and Chan P. K., “Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection”, DARPA Information Survivability Conference, 2000.
[5]
[6] P. Srinivasulu, D. Nagaraju, P. Ramesh Kumar, and K. Nagerwara Rao, “Classifying the Network Intrusion Attacks using Data Mining Classification Methods and their Performance Comparison”, International Journal of Computer Science and Network Security, Vol. 9, No. 6, pp. 11-18, June 2009.
[7] Ian H. Witten and Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Second Edition, Elsevier, 2005.
[8] Srilatha Chebrolu, Ajith Abraham, and Johnson P. Thomas, “Feature Deduction and ensemble design of Intrusion Detection”, Elsevier, Computer and Security, 24, pp. 295-307, 2005.
[9] The KDD Archive. KDD99 cup dataset, 1999.
[10] Ricardo Aler, Daniel Borrajo, and Agapito Ledezma, “Heuristic Search Based Stacking of Classifiers”, Universidad Carlos III, Avda. Universidad, 30, 28911 Leganés (Madrid), 2002.
[11]
[12] Anazida Zainal, Mohd Aizaini Maarof, and Siti Mariyam Shamsuddin, “Ensemble Classifiers for Network Intrusion Detection System”, Journal of Information Assurance and Security, 4, pp. 217-225, 2009.
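
The evaluation metrics defined in Section II.B can be computed directly from the four counts of the confusion matrix in Table IV. The Python sketch below is a minimal illustration under stated assumptions: attacks are taken as the positive class, the names `ConfusionCounts` and `evaluate` are hypothetical, a zero denominator is mapped to 0.0 by convention, and the example counts are made up rather than taken from the experiments.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Two-class confusion matrix laid out as in Table IV (attack = positive class)."""
    tp: int  # attacks predicted as attacks
    fn: int  # attacks predicted as normal
    fp: int  # normal connections predicted as attacks
    tn: int  # normal connections predicted as normal


def _ratio(numerator: int, denominator: int) -> float:
    # Assumed convention: an undefined ratio (empty denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def evaluate(c: ConfusionCounts) -> Dict[str, float]:
    """Compute the metrics of Section II.B from the confusion-matrix counts."""
    success_rate = _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)
    tpr = _ratio(c.tp, c.tp + c.fn)
    return {
        "tpr": tpr,                               # TP / (TP + FN)
        "fpr": _ratio(c.fp, c.tn + c.fp),         # FP / (TN + FP)
        "recall": tpr,                            # same formula as TPR
        "precision": _ratio(c.tp, c.tp + c.fp),   # TP / (TP + FP)
        "success_rate": success_rate,             # (TP + TN) / all
        "error_rate": 1.0 - success_rate,         # 1 - success rate
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = ConfusionCounts(tp=80, fn=20, fp=10, tn=90)
    for name, value in evaluate(example).items():
        print(f"{name}: {value:.3f}")
```

Because recall and the true positive rate share the formula TP / (TP + FN), the sketch computes the value once and reports it under both names.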