ACEEE Int. J. on Signal & Image Processing, Vol. 01, No. 03, Dec 2010
Intrusion Detection using C4.5: Performance
Enhancement by Classifier Combination
Manasi Gyanchandani (1), R. N. Yadav (2), J. L. Rana (3)
(1) Dept. of Information Technology, MANIT, Bhopal
manasi_gyanchandani@yahoo.co.in
(2) Dept. of Electronics and Communication, MANIT, Bhopal
myadav@gmail.com
(3) Ex-HOD, Dept. of CS/IT, MANIT, Bhopal
jl_rana@yahoo.com
Abstract: Data security has become a critical part of any organizational information system. An Intrusion Detection System (IDS) is used as a security measure to preserve data integrity and system availability against various attacks. This paper evaluates the performance of the C4.5 classifier and its combinations using bagging, boosting and stacking over the NSL-KDD dataset for IDS. This dataset consists of selected records of the complete KDD dataset.

I. INTRODUCTION

Information technology has drastically changed our lives; at the same time, we have become completely dependent on technology that is vulnerable to attacks. Because of these attacks, the confidentiality, integrity and availability of information may be lost. It is estimated that these attacks cost tens or even hundreds of millions of dollars each year. The number of attacks is even doubling each year. These attacks may pose a serious threat to national security. IDSs were designed to monitor attacks and generate alarms whenever certain abnormal activities are detected.

IDSs can be categorized by the events they monitor and by the way they collect information indicating that an intrusion has occurred. IDSs that analyze data circulating on the network are called network based IDSs (NIDSs), and IDSs that reside on the host and collect logs of operating system-related events are called host based IDSs (HIDSs) [3] [8].

Two types of intrusion detection techniques exist, based on the method of inspecting the traffic:
• Signature based IDS
• Statistical anomaly based IDS.

In signature based IDS, also known as misuse detection, signatures of known attacks are stored and events are matched against the stored signatures. An intrusion is signaled if a match is found. The main drawback of this method is that it cannot detect new attacks whose signatures are unknown. This means that an IDS using misuse detection will only detect known attacks, or attacks that are similar enough to a known attack to match its signature [3]. Statistical anomaly based intrusion detection has attracted many academic researchers due to its potential for addressing novel attacks. Researchers have found that several machine learning algorithms have a very high detection rate while keeping a low false alarm rate. Anomaly detection applied to intrusion detection and computer security has been an active area of research since it was originally proposed in [8].

Initially, the KDDCUP ’99 dataset was used for IDS evaluation, but it has some inherent problems. The most important one is the huge number of redundant records: about 78% and 75% of the records are duplicated in the train and test sets, respectively. The large number of redundant records in the train set causes learning algorithms to be biased towards the more frequent records, and thus prevents them from learning infrequent records, such as U2R and R2L attacks, which are usually more harmful to networks. The duplicated records in the test set, in turn, cause the evaluation results to be biased in favor of methods that have better detection rates on the frequent records. To remove these problems, a new dataset, NSL-KDD, was proposed [1].

C4.5, an extension of the ID3 algorithm [11], is an algorithm used to generate a decision tree. The decision trees generated by C4.5 can be used for classification, and for this reason it is often referred to as a statistical classifier. The results vary significantly if the training data is changed. This variation is known as error due to variance, and it can be minimized using various classifier combinations.

Section II presents the description of the dataset used. Section III describes the various classifier combination techniques. Section IV provides the experimental results and discussion. Section V concludes the paper.

II. DATA SET DESCRIPTION

Most experiments on intrusion detection are done on the KDDCUP ’99 dataset, which is a subset of the 1998 DARPA Intrusion Detection Evaluation data set, processed to extract 41 features from the raw data of the DARPA 98 data set. [4] defined higher-level features that help in distinguishing “good” normal connections from “bad” connections (attacks). This data can be used to test both host based and network based systems, and both signature and anomaly detection systems. A connection is a sequence of Transmission Control Protocol (TCP) packets starting and ending at well defined times, between which data flows from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes [9]. Some of the basic features of individual TCP connections are listed in Table I.
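To make the redundancy problem raised in the Introduction concrete, the following is a minimal Python sketch of exact-duplicate removal over connection records. It assumes each record is one comma-separated line whose last field is the label; that layout, the toy records, and the function names `deduplicate`, `reduction_rate` and `label_counts` are illustrative assumptions, not the procedure used to build NSL-KDD.

```python
from collections import Counter


def deduplicate(records):
    """Keep the first copy of each record, preserving the original order."""
    seen = set()
    distinct = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            distinct.append(rec)
    return distinct


def reduction_rate(n_original, n_distinct):
    """Fraction of records removed as duplicates."""
    return 1.0 - n_distinct / n_original


def label_counts(records):
    """Count records per label (assumed to be the last comma-separated field)."""
    return Counter(rec.rsplit(",", 1)[-1] for rec in records)


# Hypothetical toy records:
# duration, protocol_type, service, src_bytes, dst_bytes, label
sample = [
    "0,tcp,http,181,5450,normal",
    "0,icmp,ecr_i,1032,0,attack",
    "0,icmp,ecr_i,1032,0,attack",
    "0,icmp,ecr_i,1032,0,attack",
    "2,tcp,telnet,260,1837,normal",
]

distinct = deduplicate(sample)
print(len(sample), len(distinct))   # 5 3
print(label_counts(sample))         # the repeated attack record dominates
print(label_counts(distinct))       # one copy of each record remains
```

In the toy sample, the repeated attack record accounts for three of the five lines before deduplication and one of three afterwards, which is the kind of frequency bias the text describes. Applied to the train-set totals reported in [1] (4,898,431 original and 1,074,992 distinct records), `reduction_rate` gives about 78.05%, consistent with the roughly 78% quoted in the Introduction.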
© 2010 ACEEE 46
DOI: 01.IJSIP.01.03.45
TABLE I
BASIC FEATURES OF INDIVIDUAL TCP CONNECTIONS

Feature name     Description                                                     Type
Duration         length (number of seconds) of the connection                    Continuous
Protocol_Type    type of the protocol, e.g. tcp, udp, etc.                       Discrete
Service          network service on the destination, e.g., http, telnet, etc.   Discrete
src_bytes        number of data bytes from source to destination                 Continuous
dst_bytes        number of data bytes from destination to source                 Continuous
Flag             normal or error status of the connection                        Discrete
Land             1 if connection is from/to the same host/port; 0 otherwise      Discrete
wrong_fragment   number of "wrong" fragments                                     Continuous
Urgent           number of urgent packets                                        Continuous

A. NSL-KDD

The KDD train and test sets contain a huge number of redundant records: about 78% and 75% of the records are duplicated in the train and test sets, respectively. This may cause classification algorithms to be biased towards these redundant records and thus prevent them from classifying the other (non-duplicate) records. To solve this problem, a new dataset, NSL-KDD, was developed. All the repeated records in the entire KDD train and test sets were removed, and only one copy of each record was kept. Tables II and III show the statistical analysis of the reduction of repeated records in the KDD train and test sets, respectively [1].

TABLE II
STATISTICAL ANALYSIS OF REDUNDANT RECORDS IN THE KDD TRAIN SET

          Original Records   Distinct Records   Reduction rate
Attacks   3,925,650          262,178            93.32%
Normal    972,781            812,814            16.44%
Total     4,898,431          1,074,992          78.05%

TABLE III
STATISTICAL ANALYSIS OF REDUNDANT RECORDS IN THE KDD TEST SET

          Original Records   Distinct Records   Reduction rate
Attacks   250,436            29,378             88.26%
Normal    60,591             47,911             20.92%
Total     311,027            77,289             75.15%

B. Evaluation Metrics

The metrics mainly used to evaluate the performance of a classifier are presented in [6] [2] and are given here for ready reference.

• The true positives (TP) and true negatives (TN) are correct classifications. The true positive rate (TPR) is the probability that there is an alert when there is an intrusion. It is calculated as below.
TPR = TP / (TP+FN)
• A false positive (FP) occurs when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). The false positive rate (FPR) is calculated as below.
FPR = FP / (TN+FP)
• A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.
• Recall: The percentage of the total relevant documents in a database retrieved by a search. If it is known that there are 1000 relevant documents in a database and a search retrieves 100 of these relevant documents, the recall is 10%. It is calculated as below.
Recall = TP / (TP+FN)
• Precision: The percentage of relevant documents in relation to the number of documents retrieved. If a search retrieves 100 documents and 20 of these are relevant, the precision is 20%. It is calculated as below.
Precision = TP / (TP+FP)
• The overall success rate is the number of correct classifications divided by the total number of classifications.
Success rate = (TP+TN) / (TP+TN+FP+FN)
Error rate = 1 - Success rate
• In a multiclass prediction, the result on a test set is often displayed as a two dimensional confusion matrix with a row and column for each class. Each matrix element shows the number of test examples for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the main diagonal and small, ideally zero, off-diagonal elements. The confusion matrix is formed as shown in Table IV.

TABLE IV
CONFUSION MATRIX

                         Predicted Class
                         Attack   Normal
Actual Class   Attack    TP       FN
               Normal    FP       TN

III. CLASSIFIER COMBINATION TECHNIQUES

Classifier combination techniques can be used to reduce the error due to variance. To make decisions in intrusion detection more reliable, the outputs of different models can be combined. Several machine learning techniques do this by learning an ensemble of models and using them in combination; bagging, boosting, and stacking are the most efficient among them. These models can increase the predictive performance over a single model and can be applied to both numeric prediction problems and classification tasks. The performance of these three models
is good. An ensemble of classifiers is a set of classifiers whose individual decisions are combined to classify new examples. The purpose of combining classifiers is to improve on the accuracy of a single classifier [10].

A. Bagging:

The bootstrap aggregating (bagging) algorithm generates different classifiers from different bootstrap samples and combines the decisions of the different classifiers into a single prediction by voting (the class that gets the most votes from the classifiers wins).

B. Boosting:

Another method to construct an ensemble of classifiers is known as boosting, which is used to boost the performance of a weak learner. A weak learner is a simple classifier whose error is less than 50% on the training instances. The models that are more successful are assigned more weight than the other models. Here each new model is influenced by the performance of the previously built models.

Thus boosting can build a powerful combined classifier from very simple learning methods. It can convert these simple learning methods, called weak learners, into strong ones. It often produces classifiers that are more accurate on fresh data than ones generated by bagging. But it sometimes fails in practical situations: it can generate a classifier that is less accurate than a single classifier built from the same data [7].

C. Stacking:

Stacking is short for stacked generalization. Unlike bagging and boosting, it uses different learning algorithms to generate the ensemble of classifiers. The main idea of stacking is to combine classifiers from different learners such as decision trees, instance-based learners, etc.

Since each learner uses a different knowledge representation and different learning biases, the hypothesis space will be explored differently, and different classifiers will be obtained. Thus, it is expected that they will not be correlated.

Once the classifiers have been generated, they must be combined. Unlike bagging and boosting, stacking does not use a voting system because, for example, if the majority of the classifiers make bad predictions, this will lead to a bad final classification. To solve this problem, stacking uses the concept of a meta learner [10]. The meta learner (or level-1 model) tries to learn, using a learning algorithm, how the decisions of the base classifiers (or level-0 models) should be combined.

IV. RESULTS AND DISCUSSION

Classifier combinations are used to reduce the error due to variance. Initially, the C4.5 classifier is applied to the NSL-KDD dataset. NSL-KDD contains 125973 records in the train set and 22544 records in the test set. To improve the performance of the C4.5 classifier on the NSL-KDD dataset, the classifier combination techniques bagging, boosting and stacking are used.

It was found that for the normal class, as shown in Table V, bagging gives the better result. The recall was 0.719 for bagging and 0.708 for C4.5, both having the same precision value (0.973). For the anomaly class, as shown in Table VI, both recall and precision have higher values for bagging.

TABLE V
PERFORMANCE METRICS FOR NORMAL CLASS

            Bagging   Boosting   Stacking   C4.5
TP          0.973     0.957      0.974      0.973
FP          0.288     0.346      0.326      0.304
Recall      0.719     0.677      0.693      0.708
Precision   0.973     0.957      0.974      0.973

TABLE VI
PERFORMANCE METRICS FOR ANOMALY CLASS

            Bagging   Boosting   Stacking   C4.5
TP          0.712     0.654      0.674      0.696
FP          0.027     0.043      0.026      0.027
Recall      0.972     0.953      0.971      0.971
Precision   0.712     0.654      0.674      0.696

V. CONCLUSIONS

Error due to variance has been reduced using classifier combinations, thus increasing the performance of classification on the NSL-KDD dataset. Of the three classifier combinations, bagging provides the better results. The NSL-KDD dataset can also be used for performance evaluation with 5 classes (normal, dos, probe, u2r and r2l) instead of 2 classes. Performance can be further improved by reducing the features as given in [12]. Different sets of features are used for different classes. More classification algorithms and their combinations can be used on the NSL-KDD dataset.

REFERENCES

[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.
[2] M. Shyu, S. Chen, K. Sarinnapakorn, and L. Chang, “A novel anomaly detection scheme based on principal component classifier,” Proceedings of the IEEE Foundation & New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM03), pp. 172-179, 2003.
[3] D. E. Denning, “An Intrusion Detection Model,” IEEE Transactions on Software Engineering, SE-13, pp. 222-232, 1987.
[4] Stolfo J., Fan W., Lee W., Prodromidis A., and Chan P.K., “Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection,” DARPA Information Survivability Conference, 2000.
[5] http://weka.sourceforge.net/wekadoc/index.php/en:
[6] P Srinivasulu, D Nagaraju, P Ramesh Kumar, and K Nagerwara Rao, “Classifying the Network Intrusion Attacks
using Data Mining Classification Methods and their Performance Comparison,” International Journal of Computer Science and Network Security, Vol. 9, No. 6, pp. 11-18, June 2009.
[7] Ian H. Witten and Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques,” Second Edition, Elsevier, 2005.
[8] Srilatha Chebrolu, Ajith Abraham, and Johnson P. Thomas, “Feature Deduction and ensemble design of Intrusion Detection,” Elsevier, Computer and Security, 24, pp. 295-307, 2005.
[9] The KDD Archive. KDD99 cup dataset, 1999. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[10] Ricardo Aler, Daniel Borrajo, and Agapito Ledezma, “Heuristic Search Based Stacking of Classifiers,” Universidad Carlos III, Avda. Universidad, 30, 28911 Leganés (Madrid), 2002.
[11] http://en.wikipedia.org/wiki/C4.5_algorithm
[12] Anazida Zainal, Mohd Aizaini Maarof, and Siti Mariyam Shamsuddin, “Ensemble Classifiers for Network Intrusion Detection System,” Journal of Information Assurance and Security 4, pp. 217-225, 2009.
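As a supplementary illustration of the evaluation metrics in Section II.B, the following is a minimal Python sketch that computes them from the four counts of the confusion matrix in Table IV. The function names `confusion_metrics` and `_ratio` and the sample counts are hypothetical; the counts are chosen only to exercise the formulas and are not taken from the experiments reported in this paper.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def confusion_metrics(tp, fn, fp, tn):
    """Compute the Section II.B metrics from counts laid out as in Table IV."""
    success = _ratio(tp + tn, tp + tn + fp + fn)
    return {
        "TPR": _ratio(tp, tp + fn),        # TP / (TP+FN)
        "FPR": _ratio(fp, tn + fp),        # FP / (TN+FP)
        "recall": _ratio(tp, tp + fn),     # same formula as TPR
        "precision": _ratio(tp, tp + fp),  # TP / (TP+FP)
        "success_rate": success,           # (TP+TN) / (TP+TN+FP+FN)
        "error_rate": 1 - success,
    }


# Hypothetical counts, used only to exercise the formulas.
m = confusion_metrics(tp=90, fn=10, fp=30, tn=70)
for name, value in m.items():
    print(f"{name}: {value:.3f}")

# The retrieval examples from the text: 100 of 1000 relevant items found,
# and 20 relevant items among 100 retrieved.
print(confusion_metrics(tp=100, fn=900, fp=0, tn=0)["recall"])    # 0.1
print(confusion_metrics(tp=20, fn=0, fp=80, tn=0)["precision"])   # 0.2
```

Because TPR and recall share the formula TP / (TP+FN), the sketch returns the same value under both names; undefined ratios (zero denominator) are reported as NaN rather than raising an error, which is a design choice of this sketch.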