Transcript

  • 1. IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 1, JANUARY-MARCH 2010, p. 35

    Layered Approach Using Conditional Random Fields for Intrusion Detection

    Kapil Kumar Gupta, Baikunth Nath, Senior Member, IEEE, and Ramamohanarao Kotagiri, Member, IEEE

    Abstract: Intrusion detection faces a number of challenges; an intrusion detection system must reliably detect malicious activities in a network and must perform efficiently to cope with the large amount of network traffic. In this paper, we address these two issues of Accuracy and Efficiency using Conditional Random Fields and Layered Approach. We demonstrate that high attack detection accuracy can be achieved by using Conditional Random Fields and high efficiency by implementing the Layered Approach. Experimental results on the benchmark KDD ’99 intrusion data set show that our proposed system based on Layered Conditional Random Fields outperforms other well-known methods such as the decision trees and the naive Bayes. The improvement in attack detection accuracy is very high, particularly, for the U2R attacks (34.8 percent improvement) and the R2L attacks (34.5 percent improvement). Statistical Tests also demonstrate higher confidence in detection accuracy for our method. Finally, we show that our system is robust and is able to handle noisy data without compromising performance.

    Index Terms: Intrusion detection, Layered Approach, Conditional Random Fields, network security, decision trees, naive Bayes.

    The authors are with the Department of Computer Science and Software Engineering, and NICTA Victoria Research Laboratory, The University of Melbourne, Parkville 3010, Australia. E-mail: {kgupta, bnath, rao}@csse.unimelb.edu.au.
    Manuscript received 6 Mar. 2007; revised 11 Dec. 2007; accepted 28 Jan. 2008; published online 12 Mar. 2008.
    For information on obtaining reprints of this article, please send e-mail to: tdsc@computer.org, and reference IEEECS Log Number TDSC-2007-03-0031.
    Digital Object Identifier no. 10.1109/TDSC.2008.20.
    1545-5971/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.
    Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on July 19, 2010 at 00:12:39 UTC from IEEE Xplore. Restrictions apply.

    1 INTRODUCTION

    Intrusion detection as defined by the SysAdmin, Audit, Networking, and Security (SANS) Institute is the art of detecting inappropriate, inaccurate, or anomalous activity [6]. Today, intrusion detection is one of the high priority and challenging tasks for network administrators and security professionals. More sophisticated security tools mean that the attackers come up with newer and more advanced penetration methods to defeat the installed security systems [4] and [24]. Thus, there is a need to safeguard the networks from known vulnerabilities and at the same time take steps to detect new and unseen, but possible, system abuses by developing more reliable and efficient intrusion detection systems. Any intrusion detection system has some inherent requirements. Its prime purpose is to detect as many attacks as possible with a minimum number of false alarms, i.e., the system must be accurate in detecting attacks. However, an accurate system that cannot handle a large amount of network traffic and is slow in decision making will not fulfill the purpose of an intrusion detection system. We desire a system that detects most of the attacks, gives very few false alarms, copes with a large amount of data, and is fast enough to make real-time decisions.

    Intrusion detection started around the 1980s, after the influential paper by Anderson [10]. Intrusion detection systems are classified as network based, host based, or application based depending on their mode of deployment and data used for analysis [11]. Additionally, intrusion detection systems can also be classified as signature based or anomaly based depending upon the attack detection method. The signature-based systems are trained by extracting specific patterns (or signatures) from previously known attacks while the anomaly-based systems learn from the normal data collected when there is no anomalous activity [11].

    Another approach for detecting intrusions is to consider both the normal and the known anomalous patterns for training a system and then performing classification on the test data. Such a system incorporates the advantages of both the signature-based and the anomaly-based systems and is known as the Hybrid System. Hybrid systems can be very efficient, subject to the classification method used, and can also be used to label unseen or new instances as they assign one of the known classes to every test instance. This is possible because during training the system learns features from all the classes. The only concern with the hybrid method is the availability of labeled data. However, data requirement is also a concern for the signature- and the anomaly-based systems as they require completely anomalous and attack-free data, respectively, which are not easy to ensure.

    The rest of this paper is organized as follows: In Section 2, we discuss the related work with emphasis on various methods and frameworks used for intrusion detection. We describe the use of Conditional Random Fields (CRFs) for intrusion detection [23] in Section 3 and the Layered Approach [22] in Section 4. We then describe how to integrate the Layered Approach and the CRFs in Section 5. In Section 6, we give our experimental results and compare our method with other approaches that are known to perform well. We observe that our proposed system, Layered CRFs, performs significantly better than other systems. We study the robustness of our method in Section 7 by introducing noise in the system. We discuss feature selection in Section 8 and draw conclusions in Section 9.

    2 RELATED WORK

    The field of intrusion detection and network security has been around since the late 1980s. Since then, a number of
  • 2. methods and frameworks have been proposed and many systems have been built to detect intrusions. Various techniques such as association rules, clustering, naive Bayes classifier, support vector machines, genetic algorithms, artificial neural networks, and others have been applied to detect intrusions. In this section, we briefly discuss these techniques and frameworks.

    Lee et al. introduced data mining approaches for detecting intrusions in [30], [31], and [32]. Data mining approaches for intrusion detection include association rules and frequent episodes, which are based on building classifiers by discovering relevant patterns of program and user behavior. Association rules [8] and frequent episodes are used to learn the record patterns that describe user behavior. These methods can deal with symbolic data, and the features can be defined in the form of packet and connection details. However, mining of features is limited to entry level of the packet and requires the number of records to be large and sparsely populated; otherwise, they tend to produce a large number of rules that increase the complexity of the system [7].

    Data clustering methods such as the k-means and the fuzzy c-means have also been applied extensively for intrusion detection [36] and [39]. One of the main drawbacks of the clustering technique is that it is based on calculating numeric distance between the observations, and hence, the observations must be numeric. Observations with symbolic features cannot be easily used for clustering, resulting in inaccuracy. In addition, the clustering methods consider the features independently and are unable to capture the relationship between different features of a single record, which further degrades attack detection accuracy.

    Naive Bayes classifiers have also been used for intrusion detection [9]. However, they make a strict independence assumption between the features in an observation, resulting in lower attack detection accuracy when the features are correlated, which is often the case for intrusion detection. Bayesian networks can also be used for intrusion detection [28]. However, they tend to be attack specific and build a decision network based on special characteristics of individual attacks. Thus, the size of a Bayesian network increases rapidly as the number of features and the type of attacks modeled by a Bayesian network increases.

    To detect anomalous traces of system calls in privileged processes [20], hidden Markov models (HMMs) have been applied in [17], [42], and [43]. However, modeling the system calls alone may not always provide accurate classification as in such cases various connection level features are ignored. Further, HMMs are generative systems and fail to model long-range dependencies between the observations [29]. We further discuss this in detail in Section 3.

    Decision trees have also been used for intrusion detection [9]. The decision trees select the best features for each decision node during the construction of the tree based on some well-defined criteria. One such criterion is to use the information gain ratio, which is used in C4.5. Decision trees generally have very high speed of operation and high attack detection accuracy.

    Debar et al. [14] and Zhang et al. [46] discuss the use of artificial neural networks for network intrusion detection. Though the neural networks can work effectively with noisy data, they require a large amount of data for training and it is often hard to select the best possible architecture for a neural network. Support vector machines have also been used for detecting intrusions [26]. Support vector machines map a real-valued input feature vector to a higher dimensional feature space through nonlinear mapping and can provide real-time detection capability, deal with large dimensionality of data, and can be used for binary-class as well as multiclass classification. Other approaches for detecting intrusion include the use of genetic algorithm and autonomous and probabilistic agents for intrusion detection [1] and [5]. These methods are generally aimed at developing a distributed intrusion detection system.

    To overcome the weakness of a single intrusion detection system, a number of frameworks have been proposed, which describe the collaborative use of network-based and host-based systems [45]. Systems that employ both signature-based and behavior-based techniques are discussed in [19] and [41]. In [32], the authors describe a data mining framework for building adaptive intrusion detection models. A distributed intrusion detection framework based on mobile agents is discussed in [12].

    The work most closely related to ours is that of Lee et al. [30], [31], and [32]. They, however, consider a data mining approach for mining association rules and finding frequent episodes in order to calculate the support and confidence of the rules separately. Instead, in our work, we define features from the observations as well as from the observations and the previous labels and perform sequence labeling via the CRFs to label every feature in the observation. This setting is sufficient for modeling the correlation between different features of an observation. We also compare our work with [21], which describes the use of maximum entropy principle for detecting anomalies in the network traffic. The key difference between [21] and our work is that the authors in [21] use only the normal data during training and build a baseline system, i.e., a behavior-based system, while we train our system with both the normal and the anomalous data, i.e., we build a hybrid system. Second, the system in [21] fails to model long-range dependencies in the observations, which can be easily represented in our model. We also integrate the Layered Approach with the CRFs to gain the benefits of computational efficiency and high accuracy of detection in a single system.

    We compare the Layered Approach with the works in [18], [25], and [41]. The authors in [18] describe the combination of “strong” classifiers using stacking, where the decision trees, naive Bayes, and a number of other classification methods are used as base classifiers. The authors show that the output from these classifiers can be combined to generate a better classifier rather than selecting the best one. In [25], the authors use a combination of “weak” classifiers. The individual classification power of weak classifiers is slightly better than random guessing. The authors show that a number of such classifiers, when combined using a simple majority voting mechanism, provide good classification. In [41], the authors apply a combination of anomaly and misuse detectors for better qualification of analyzed events. However, our work is not based upon classifier combination. Combination of classifiers is expensive with regard to the processing time and decision making. The purpose of classifier combination is to improve accuracy. Rather, our system is based upon serial
  • 3. layering of multiple hybrid detectors. From our experiments in Section 6, we show that the Layered CRFs perform better than individual classifiers and they are more efficient and accurate than a system based on classifier combination. The results from individual classifiers at a layer are not combined at any later stage in the Layered Approach, and hence, an attack can be blocked at the layer where it is detected. There is no communication overhead among the layers and the central decision-maker. In addition, since the layers are independent, they can be trained separately and deployed at critical locations in a network depending upon the specific requirements of a network. Using a stacked system will not give us the advantage of reduced processing when an attack is detected at the initial layers in the sequential model.

    In this paper, we show the effectiveness of CRFs for intrusion detection. Motivated by our results in [23], we perform detailed analysis and show that CRFs are a strong candidate for building robust intrusion detection systems. We then show that high efficiency can be achieved by implementing the Layered Approach. Finally, we integrate the Layered Approach and the CRFs to develop a system that is accurate and performs efficiently.

    3 CONDITIONAL RANDOM FIELDS FOR INTRUSION DETECTION

    Conditional models are probabilistic systems that are used to model the conditional distribution over a set of random variables. Such models have been extensively used in natural language processing tasks. Conditional models offer a better framework as they do not make any unwarranted assumptions on the observations and can be used to model rich overlapping features among the visible observations. Maxent classifiers [37], maximum entropy Markov models [34], and CRFs [29] are such conditional models. The advantage of CRFs is that they are undirected and are, thus, free from the Label Bias and the Observation Bias [27]. The simplest conditional classifier is the Maxent classifier based upon maximum entropy classification, which estimates the conditional distribution of every class given the observations [37]. The training data is used to constrain this conditional distribution while ensuring maximum entropy and hence maximum uniformity. We now give a brief description of the CRFs, which is motivated by the work in [29].

    Let $X$ be the random variable over the data sequence to be labeled and $Y$ the corresponding label sequence. In addition, let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then, $(X, Y)$ is a CRF if, when conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$, i.e., a CRF is a random field globally conditioned on $X$. For a simple sequence (or chain) modeling, as in our case, the joint distribution over the label sequence $Y$ given $X$ has the following form:

    $$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big), \qquad (1)$$

    where $x$ is the data sequence, $y$ is a label sequence, and $y|_S$ is the set of components of $y$ associated with the vertices or edges in subgraph $S$. In addition, the features $f_k$ and $g_k$ are assumed to be given and fixed. For example, a Boolean edge feature $f_k$ might be true if the observation $X_i$ is “protocol = tcp,” tag $Y_{i-1}$ is “normal,” and tag $Y_i$ is “normal.” Similarly, a Boolean vertex feature $g_k$ might be true if the observation $X_i$ is “service = ftp” and tag $Y_i$ is “attack.” Further, the parameter estimation problem is to find the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from the training data $D = (x^i, y^i)_{i=1}^{N}$ with the empirical distribution $\tilde{p}(x, y)$ [29].

    CRFs are undirected graphical models used for sequence tagging. The prime difference between CRFs and other graphical models such as the HMM is that the HMM, being generative, models the joint distribution $p(y, x)$, whereas CRFs are discriminative models and directly model the conditional distribution $p(y \mid x)$, which is the distribution of interest for the task of classification and sequence labeling. Similar to HMM, the naive Bayes is also generative and models the joint distribution. Modeling the joint distribution has two disadvantages. First, it is not the distribution of interest, since the observations are completely visible and the interest is in finding the correct class for the observations, which is the conditional distribution $p(y \mid x)$. Second, inferring the conditional probability $p(y \mid x)$ from the modeled joint distribution, using the Bayes rule, requires the marginal distribution $p(x)$. Estimating this marginal distribution is difficult since the amount of training data is often limited and the observation $x$ contains highly dependent features that are difficult to model, and therefore strong independence assumptions are made among the features of an observation. This results in reduced accuracy [40]. CRFs, however, predict the label sequence $y$ given the observation sequence $x$. This allows them to model arbitrary relationships among different features in an observation $x$ [15]. CRFs also avoid the observation bias and the label bias problem, which are present in other discriminative models, such as the maximum entropy Markov models. This is because the maximum entropy Markov models have a per-state exponential model for the conditional probabilities of the next state given the current state and the observation, whereas the CRFs have a single exponential model for the joint probability of the entire sequence of labels given the observation sequence [29].

    The task of intrusion detection can be compared to many problems in machine learning, natural language processing, and bioinformatics. The CRFs have proven to be very successful in such tasks, as they do not make any unwarranted assumptions about the data. Hence, we explore the suitability of CRFs for intrusion detection.

    3.1 Motivating Example

    The data analyzed by the intrusion detection system for classification often has a number of features that are highly correlated and complex relationships exist between them. For example, when classifying network connections as either normal or as attack, a system may consider features such as “logged in” and “number of file creations.” When these features are analyzed individually, they do not provide any information that can aid in detecting attacks. However, when these features are analyzed together, they can provide meaningful information, which can be helpful for the classification task. Taking another example, the connection level feature such as the “service invoked” at the
  • 4. destination provides some information about the class label (in case an attacker sends request to a service that is not available). This information becomes more concrete and aids in classification when analyzed with other features such as “protocol type” and “amount of data transferred” between source and destination (in case the client connects to an available service such as the ftp and performs data transfer). These relationships between different features in the observed data, if considered during classification, can significantly decrease classification error. The CRFs do not consider features to be independent and hence perform better when compared with other methods.

    Fig. 1. Graphical representation of a CRF.

    The data set used in our experiments represents features of every session in relational form with only one label for the entire record. In this case, using a conditional model would result in a simple maximum entropy classifier [40]. However, we represent the data in the form of a sequence and assign a label to every feature in the sequence using the first-order Markov assumption instead of assigning a single label to the entire observation. Though this increases the complexity, it also increases the attack detection accuracy.

    Each record represents a separate connection, and hence, we consider every record as a separate sequence. We aim to model the relationships among features of individual connections using a CRF, as shown in Fig. 1. In the figure, features such as duration, protocol, service, flag, and src_bytes take some possible value for every connection. During training, feature weights are learnt, and during testing, features are evaluated for the given observation, which is then labeled accordingly.

    As is evident from the figure, every label is connected to every input feature, which indicates that all the features in an observation help in labeling, and thus, a CRF can model dependencies among the features in an observation. Present intrusion detection systems do not consider such relationships among the features in the observations. They either consider only one feature, such as in the case of system call modeling, or assume conditional independence among different features in the observation as in the case of a naive Bayes classifier. As we will show from our experimental results, the CRFs can effectively model such relationships among different features of an observation, resulting in higher attack detection accuracy. Another advantage of using CRFs is that every element in the sequence is labeled such that the probability of the entire labeling is maximized, i.e., all the features in the observation collectively determine the final labels. Hence, even if some data is missing, the observation sequence can still be labeled with fewer features.

    Our first goal is to improve the attack detection accuracy. We first compare the accuracy of CRFs for detecting attacks with other methods in Section 6. We consider all the 41 features in the data set for each of the four attack groups separately. As we shall observe, the CRFs outperform other methods for detecting “Unauthorized access to Root” (U2R) attacks. They are also effective in detecting the Probe, “Remote to Local” (R2L), and “Denial of Service” (DoS) attacks. However, CRFs can be expensive during training and testing. For a simple linear chain structure, the time complexity for training a CRF is $O(T L^2 N I)$, where $T$ is the length of the sequence, $L$ is the number of labels, $N$ is the number of training instances, and $I$ is the number of iterations. During inference, the Viterbi algorithm is employed, which has a complexity of $O(T L^2)$. The quadratic complexity is significant when the number of labels is large, as in language tasks. However, for intrusion detection, there are only two labels, “normal” and “attack,” and thus, the system is very efficient. We further improve the overall system performance by using the Layered Approach, which decreases $T$, i.e., the length of the sequence. The Layered Approach is described next.

    4 LAYERED APPROACH FOR INTRUSION DETECTION

    We now describe the Layer-based Intrusion Detection System (LIDS) in detail. The LIDS draws its motivation from what we call the Airport Security model, where a number of security checks are performed one after the other in a sequence. Similar to this model, the LIDS represents a sequential Layered Approach and is based on ensuring availability, confidentiality, and integrity of data and (or) services over a network. Fig. 2 gives a generic representation of the framework.

    Fig. 2. Layered representation.

    The goal of using a layered model is to reduce computation and the overall time required to detect anomalous events. The time required to detect an intrusive event is significant and can be reduced by eliminating the communication overhead among different layers. This can be achieved by making the layers autonomous and self-sufficient to block an attack without the need of a central decision-maker. Every layer in the LIDS framework is trained separately and then deployed sequentially. We define four layers that correspond to the four attack groups mentioned in the data set. They are Probe layer, DoS layer, R2L layer, and U2R layer. Each layer is then separately trained with a small set of relevant features. Feature selection is significant for the Layered Approach and is discussed in the next section. In order to make the layers independent, some features may be present in more than one layer. The layers essentially act as filters that block any anomalous connection, thereby eliminating the need of further processing at subsequent layers enabling quick
  • 5. GUPTA ET AL.: LAYERED APPROACH USING CONDITIONAL RANDOM FIELDS FOR INTRUSION DETECTION 39response to intrusion. The effect of such a sequence of layers is illegitimate requests. Hence, for the DoS layer, trafficthat the anomalous events are identified and blocked as soon features such as the “percentage of connections having sameas they are detected. destination host and same service” and packet level features Our second goal is to improve the speed of operation of such as the “source bytes” and “percentage of packets withthe system. Hence, we implement the LIDS and select a errors” are significant. To detect DoS attacks, it may not besmall set of features for every layer rather than using all the important to know whether a user is “logged in or not.”41 features. This results in significant performance im-provement during both the training and the testing of the 5.1.3 R2L Layersystem. In many situations, there is a trade-off between The R2L attacks are one of the most difficult to detect asefficiency and accuracy of the system and there can be they involve the network level and the host level features.various avenues to improve system performance. Methods We therefore selected both the network level features suchsuch as naive Bayes assume independence among the as the “duration of connection” and “service requested”observed data. This certainly increases system efficiency, and the host level features such as the “number of failedbut it may severely affect the accuracy. To balance this login attempts” among others for detecting R2L attacks.trade-off, we use the CRFs that are more accurate, thoughexpensive, but we implement the Layered Approach to 5.1.4 U2R Layerimprove overall system performance. The performance of The U2R attacks involve the semantic details that are veryour proposed system, Layered CRFs, is comparable to that difficult to capture at an early stage. 
of the decision trees and the naive Bayes, and our system has higher attack detection accuracy.

5 INTEGRATING LAYERED APPROACH WITH CONDITIONAL RANDOM FIELD
In Section 1, we discussed two main requirements for an intrusion detection system: accuracy of detection and efficiency in operation. As discussed in Sections 3 and 4, respectively, the CRFs can be effective in improving the attack detection accuracy by reducing the number of false alarms, while the Layered Approach can be implemented to improve the overall system efficiency. Hence, a natural choice is to integrate them to build a single system that is accurate in detecting attacks and efficient in operation. Given the data, we first select four layers corresponding to the four attack groups (Probe, DoS, R2L, and U2R) and perform feature selection for each layer, which is described next.

5.1 Feature Selection
Ideally, we would like to perform feature selection automatically. However, as will be discussed later in Section 8, the methods for automatic feature selection were not found to be effective. In this section, we describe our approach for selecting features for every layer and why some features were chosen over others. In our system, every layer is separately trained to detect a single attack category. We observe that the attack groups differ in their impact, and hence, it becomes necessary to treat them differently. We therefore select features for each layer based upon the type of attacks that the layer is trained to detect.

5.1.1 Probe Layer
The probe attacks are aimed at acquiring information about the target network from a source that is often external to the network. Hence, basic connection level features such as the “duration of connection” and “source bytes” are significant, while features like “number of file creations” and “number of files accessed” are not expected to provide information for detecting probes.

5.1.2 DoS Layer
The DoS attacks are meant to force the target to stop the service(s) that is (are) provided by flooding it with

Such attacks are often content based and target an application. Hence, for U2R attacks, we selected features such as “number of file creations” and “number of shell prompts invoked,” while we ignored features such as “protocol” and “source bytes.”

We used domain knowledge together with the practical significance and the feasibility of each feature before selecting it for a particular layer. Thus, from the total 41 features, we selected only 5 features for the Probe layer, 9 features for the DoS layer, 14 features for the R2L layer, and 8 features for the U2R layer. Since each layer is independent of every other layer, the feature sets for the layers are not disjoint. The selected features for all the four layers are presented in Appendix A. We then use the CRFs for attack detection as discussed in Section 3. However, the difference is that we use only the selected features for each layer rather than using all the 41 features. We now give the algorithm for integrating CRFs with the Layered Approach.

Algorithm
Training
Step 1: Select the number of layers, n, for the complete system.
Step 2: Separately perform feature selection for each layer.
Step 3: Train a separate model with CRFs for each layer using the features selected from Step 2.
Step 4: Plug in the trained models sequentially such that only the connections labeled as normal are passed to the next layer.
Testing
Step 5: For each (next) test instance perform Steps 6 through 9.
Step 6: Test the instance and label it either as attack or normal.
Step 7: If the instance is labeled as attack, block it and identify it as an attack represented by the layer name at which it is detected and go to Step 5. Else pass the sequence to the next layer.
Step 8: If the current layer is not the last layer in the system, test the instance and go to Step 7. Else go to Step 9.
Step 9: Test the instance and label it either as normal or as an attack. If the instance is labeled as an attack, block it and identify it as an attack corresponding to the layer name.
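The testing steps of the algorithm (Steps 5 through 9) amount to a cascade in which each layer sees only its own selected features and only the instances that earlier layers labeled as normal. The following Python sketch illustrates that control flow under stated assumptions; it is not the authors' implementation. The per-layer classifiers are toy threshold rules standing in for trained CRF models, and the names `classify_layered` and `toy_layers`, the feature indices, and the thresholds are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A layer is (layer name, indices of its selected features, binary classifier).
# The classifier returns True when it labels the instance as an attack.
Layer = Tuple[str, Sequence[int], Callable[[List[float]], bool]]


def classify_layered(instance: Sequence[float], layers: Sequence[Layer]) -> str:
    """Pass one instance through the layers in order (Steps 6 to 9)."""
    for name, feature_idx, is_attack in layers:
        # Each layer only looks at the features selected for it.
        selected = [instance[i] for i in feature_idx]
        if is_attack(selected):
            # Block the instance and report the name of the detecting layer.
            return name
    # No layer raised an alarm, so the instance is treated as normal.
    return "normal"


# Toy stand-ins for the four trained models (hypothetical rules, not CRFs).
toy_layers: List[Layer] = [
    ("Probe", [0, 1], lambda f: f[0] == 0 and f[1] < 10),
    ("DoS", [1, 2], lambda f: f[0] > 500),
    ("R2L", [2, 3], lambda f: f[0] > 3),
    ("U2R", [3, 4], lambda f: f[1] > 0),
]

if __name__ == "__main__":
    for record in ([0, 5, 0, 0, 0], [2, 900, 0, 0, 0], [2, 100, 0, 0, 0]):
        print(record, "->", classify_layered(record, toy_layers))
```

Because the loop returns at the first layer that raises an alarm, later layers never process blocked instances, which is where the efficiency gain described in the text would come from. Reordering the list changes which layer sees the traffic first.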
  • 6. Our final goal is to improve both the attack detection accuracy and the efficiency of the system. Hence, we integrate the CRFs and the Layered Approach to build a single system. We perform detailed experiments and show that our integrated system has a dual advantage. First, as expected, the efficiency of the system increases significantly. Second, since we select significant features for each layer, the accuracy of the system further increases. This is because not all of the 41 features are required for detecting attacks belonging to a particular attack group. Using more features than required can result in fitting irregularities in the data, which has a negative effect on the attack detection accuracy of the system.

6 EXPERIMENTS
For our experiments, we use the benchmark KDD ’99 intrusion data set [3]. This data set is a version of the original 1998 DARPA intrusion detection evaluation program, which is prepared and managed by the MIT Lincoln Laboratory. The data set contains about five million connection records as the training data and about two million connection records as the test data. In our experiments, we use 10 percent of the total training data and 10 percent of the test data (with corrected labels), which are provided separately. This leads to 494,020 training and 311,029 test instances. Each record in the data set represents a connection between two IP addresses, starting and ending at some well-defined times with a well-defined protocol. Further, every record is represented by 41 different features. Each record represents a separate connection and is hence considered to be independent of any other record.

The training data is either labeled as normal or as one of the 24 different kinds of attack. These 24 attacks can be grouped into four classes: Probing, DoS, R2L, and U2R. Similarly, the test data is also labeled as either normal or as one of the attacks belonging to the four attack groups. It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not present in the training data. This makes the intrusion detection task more realistic [3]. Table 1 gives the number of instances for each group of attack in the data set.

TABLE 1. Data Set

For our experiments with CRFs, we use the CRF toolkit, CRF++ [2]. We use the Weka tool [44] to perform experiments with the decision trees and the naive Bayes classifier. We develop Python and shell scripts for data formatting and implementing the Layered Approach. For all our experiments, we perform hybrid detection, as discussed in Section 1, and use both the normal and the anomalous connections for training the model. We perform our experiments on a desktop running with Intel(R) Core(TM) 2, CPU 2.4 GHz, and 2-Gbyte RAM under exactly the same conditions. We are mainly interested in the test time efficiency and not in the time required for training of the model, as the real-time performance of the system depends upon the test efficiency alone. We note that our system is very efficient during testing. When we considered all the 41 features, the time taken to test all the 250,436 attacks was 57 seconds, which reduced to 17 seconds when we performed feature selection and implemented the Layered Approach. More details will be presented when we give the detailed results for the experiments.

For our results, we give the Precision, Recall, and F-Value and not the accuracy alone since, with the given data set, it is easy to achieve very high accuracy by carefully selecting the sample size. From Table 1, we note that the number of instances for the U2R, Probes, and R2L attacks is very low. Hence, if we use accuracy as a measure for testing the performance of the system, the system can be biased and can attain an accuracy of more than 99 percent for U2R attacks [16]. However, Precision, Recall, and F-Value are not dependent on the size of the training and the test samples. They are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F\text{-Value} = \frac{(1 + \beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot (\text{Recall} + \text{Precision})},$$

where TP, FP, and FN are the number of True Positives, False Positives, and False Negatives, respectively, and $\beta$ corresponds to the relative importance of precision versus recall and is usually set to 1.

We divide the training data into different groups: Normal, Probe, DoS, R2L, and U2R. Similarly, we divide the test data. We perform 10 experiments for each attack class by randomly selecting data corresponding to that attack class and normal data only. For example, to detect Probe attacks, we train and test the system with Probe attacks and normal data only. We do not add the DoS, R2L, and U2R data when detecting Probes. Not including these attacks while training allows the system to better learn the features for Probe attacks and normal events. When such a system is deployed online, other attacks such as DoS can either be seen as normal or as Probes. If DoS attacks are detected as normal, we expect them to be detected as attack at other layers in the system. However, if the DoS attacks are detected as Probe, it must be considered as an advantage since the attack is detected at an early stage. Similarly, if some Probe attacks are not detected at the Probe layer, they may be detected at subsequent layers. Hence, for four attack classes, we have four independent models, which are trained separately with specific features to detect attacks belonging to that particular group. For our experiments, we report the best, the average, and the worst cases.

We represent a single layer, for example the Probe layer, in Fig. 3. Other layers can be constructed similarly.

In Section 6.1, we perform experiments with individual layers in the system. In Section 6.2, we show how to
  • 10. implement the system in a real scenario and compare our results with other systems in Section 6.3. In Section 6.4, we discuss the significance of our results.

Fig. 3. Representation of a single layer (e.g., probe layer).

6.1 Building Individual Layers of the System
We perform two sets of experiments. From the first experiment, we wish to examine the accuracy of CRFs for intrusion detection. The objective is to see how CRFs compare with other techniques, which are known to perform well. We do not consider feature selection, and the systems are trained using all the 41 features. From this experiment, we observe that CRFs perform much better for U2R attacks while the decision trees achieve higher attack detection for Probes and R2L. The difference in attack detection accuracy for DoS is not significant. We note that the reason for better performance of decision trees is that they perform feature selection. This motivates us to perform our second experiment where we perform feature selection by selecting a small set of features for every attack group instead of using all the 41 features. We perform the same experiment with decision trees and naive Bayes and compare the results. We call the integrated models Layered CRFs, layered decision trees, and layered naive Bayes, respectively. For better comparison and readability, we give the results for both the experiments together.

6.1.1 Detecting Probe Attacks with All 41 Features
We randomly select about 10,000 normal records and all the Probe records from the training data as the training data for detecting Probe attacks. We then use all the normal and Probe records from the test data for testing. Hence, we have 15,000 training instances and 64,759 test instances. Table 2 gives the results for the experiments.

TABLE 2. Normal and Probes (All 41 Features)

In Table 2, the testing time of 14.53 seconds represents the total time taken to label 64,759 test instances. The results show that the decision trees are more efficient than the CRFs and the naive Bayes. This is because they have a small tree structure, often with very few decision nodes, which is very efficient. The attack detection accuracy is also higher for the decision trees as they select the best possible features during tree construction. However, as we show next, once we perform feature selection, our system achieves much higher accuracy and there is significant improvement in efficiency.

6.1.2 Detecting Probe Attacks with Feature Selection
We used the same set of instances for this experiment as used in the previous experiment. However, we perform feature selection for this experiment. Table 3 gives the results for this experiment.

TABLE 3. Normal and Probes (with Feature Selection)

We observe that the system takes only 2.04 seconds to label all the 64,759 test instances. The Layered CRFs perform better and faster than in our previous experiment and are the best choice for detecting Probes. We also note that there is no significant advantage with respect to time for the layered decision trees, as the number of features used in normal decision trees and in the layered decision trees is approximately the same, resulting in similar efficiency. We further note that the Recall and hence the F-Value for the layered naive Bayes decreases drastically. This can be explained as follows: The classification accuracy with naive Bayes generally improves as the number of features increases. However, if the number of features increases to a very large extent, the estimation tends to become unreliable. As a result, when we use all the 41 features, the naive Bayes performs well, but when we decrease the number of features to five, its classification accuracy decreases. From this experiment, we conclude that the Layered CRFs are a better choice for detecting Probe attacks.

6.1.3 Detecting DoS Attacks with All 41 Features
We randomly select about 20,000 normal records and about 4,000 DoS records from the training data as the training data for detecting DoS attacks. We then use all the normal and DoS records from the test data for testing. Hence, we have 24,000 training instances and 290,446 test instances. Table 4 gives the results for the experiments.

In Table 4, the testing time of 64.42 seconds represents the time taken to label all the 290,446 test instances. The results show that all the three methods considered have similar attack detection accuracy; however, decision trees give a slight advantage with regard to test time efficiency.

6.1.4 Detecting DoS Attacks with Feature Selection
We used the same data for this experiment as used in the previous experiment. However, we perform feature selection. Table 5 gives the results.

We observe that the system now takes only 15.17 seconds to label all the 290,446 test instances. The results follow the
  • 11. same trend as in the previous experiment with only a slight improvement. However, if we consider the testing time, we find that layered decision trees are a better choice. We also note that there is a slight increase in the detection accuracy when we perform feature selection, but this increase is not significant. The real advantage is seen in the reduced time for testing, which decreases fourfold.

TABLE 4. Normal and DoS (All 41 Features)

TABLE 5. Normal and DoS (with Feature Selection)

6.1.5 Detecting R2L Attacks with All 41 Features
We randomly select about 1,000 normal records and all the R2L records from the training data as the training data for detecting R2L attacks. We then use all the normal and R2L records from the test data for testing. Hence, we have 2,000 training instances and 76,942 test instances. Table 6 gives the results.

TABLE 6. Normal and R2L (All 41 Features)

In Table 6, the testing time of 17.16 seconds represents the time taken to label all the 76,942 test instances. We observe that the decision trees have a higher F-Value, but if we look at the number of false alarms, we find that the CRFs perform better and have high Precision compared to the decision trees and the naive Bayes.

6.1.6 Detecting R2L Attacks with Feature Selection
Table 7 gives the results when we performed feature selection for detecting R2L attacks. We observe that the time taken to test all the 76,942 instances is only 5.96 seconds. Further, the Layered CRFs perform much better than the CRFs (an increase of about 60 percent), layered decision trees (an increase of about 125 percent), decision trees (an increase of about 17 percent), layered naive Bayes (an increase of about 250 percent), and naive Bayes (an increase of about 250 percent) and are the best choice for detecting the R2L attacks. The Layered CRFs take slightly more time, which is acceptable since we achieve much higher detection accuracy.

TABLE 7. Normal and R2L (with Feature Selection)

6.1.7 Detecting U2R Attacks with All 41 Features
We randomly select about 1,000 normal records and all the U2R records from the training data as the training data for detecting the User to Root attacks. We then use all the normal and U2R records from the test data for testing. Hence, we have 1,000 training instances and 60,661 test instances. Table 8 gives the results.

In Table 8, the testing time of 13.45 seconds represents the time taken to label all the 60,661 test instances. In this experiment, we find that the CRFs are far better than the other two methods. The F-Value for CRFs is more than 150 percent with respect to the decision trees and more than 600 percent with respect to the naive Bayes. The U2R attacks are very difficult to detect and most of the present intrusion detection systems fail to detect such attacks with acceptable reliability. We find that the CRFs can be used to reliably detect such attacks.

TABLE 8. Normal and U2R (All 41 Features)
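The comparisons in Sections 6.1.5 through 6.1.7 rest on the Precision, Recall, and F-Value measures defined in Section 6. As a hedged illustration, the sketch below computes them from raw counts and expresses one F-Value as a percentage increase over another. The function names are hypothetical, the counts in the usage lines are made up for illustration and are not taken from the tables, and reading "an increase of about N percent" as a relative F-Value increase is an assumption.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); returns 0.0 when nothing was flagged."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); returns 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_value(p: float, r: float, beta: float = 1.0) -> float:
    """F-Value in the form given in the text:
    (1 + beta^2) * R * P / (beta^2 * (R + P)).
    With beta = 1 this is the usual harmonic mean of P and R."""
    denom = beta ** 2 * (p + r)
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


def relative_increase(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old` (assumed reading of 'increase')."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    # Made-up counts, purely to show the calls.
    p = precision(tp=8, fp=2)
    r = recall(tp=8, fn=8)
    print(p, r, f_value(p, r))
    print(relative_increase(0.6, 0.5))
```

Unlike accuracy, none of these quantities uses the count of true negatives, which is why they are less sensitive to how many normal records are sampled.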
  • 12. 6.1.8 Detecting U2R Attacks with Feature Selection
In this experiment, we used exactly the same set of instances as we used in the previous experiment. We also perform feature selection. Table 9 gives the results for this experiment.

TABLE 9. Normal and U2R (with Feature Selection)

We observe that the system takes only 2.67 seconds to label all the 60,661 test instances. The Layered CRFs are the best choice for detecting the U2R attacks and are far better than CRFs (an increase of about 8 percent), layered decision trees (an increase of about 30 percent), decision trees (an increase of about 184 percent), layered naive Bayes (an increase of about 38 percent), and naive Bayes (an increase of about 675 percent). We observe that the attack detection capability also increases for the decision trees and the naive Bayes.

It is evident from the results that the accuracy of Layered CRFs is significantly higher for the U2R, R2L, and the Probe attacks. The difference in accuracy is, however, not significant for the DoS attacks. Further, regardless of the method considered and particularly for the CRFs, the time required for training and testing the system is drastically reduced once we perform feature selection. We also note that the increase in detection accuracy is not significant for the layered decision trees and the layered naive Bayes for the DoS group of attack. Their accuracy of detection decreases for the Probe and R2L attacks while it increases for the U2R attacks. However, we find that in all the cases, the Layered CRFs perform significantly better and can better learn a model when we use a small set of specific features for training.

6.2 Implementing the System in Real Life
In a real scenario, we are not aware of the category of an attack. Rather, we are interested in identifying the attack category once the system detects an event as anomalous. The Layered Approach not only improves the attack detection, but it also helps identify the type of attack once detected because every layer is trained to detect only a particular category of attack. Hence, if an attack is detected at the U2R layer, it is very likely that the attack is of “U2R” type. This enables quick recovery and helps prevent similar attacks. Fig. 4 gives the real-time system representation.

Fig. 4. Real-time representation of the system.

We integrate the four models (with feature selection) from Section 6.1 to develop the final system. In this experiment, we use the same data for training the individual models as used in our previous experiments. However, the data in the test set is relabeled either as normal or as attack and all the data from the test set is passed through the system starting from the first layer. If layer 1 detects any connection as an attack, it is blocked and labeled as “Probe.” Only the events labeled as “Normal” are allowed to go to the next layer. The same process is repeated at the next layers where an attack is blocked and labeled as “DoS,” “R2L,” or “U2R” at layer 2, layer 3, and layer 4, respectively. We perform all the experiments 10 times and report their average. We give the results for this experiment in Tables 10, 11, and 12.

TABLE 10. Confusion Matrix

TABLE 11. Attack Detection at Each Layer (Case 1)

TABLE 12. Attack Detection at Each Layer (Case 2)

Table 10 gives the confusion matrix where the values represent the percent detection with respect to each of the five classes. From the table, we observe that our system can detect most of the “Probe” (98.62 percent), “DoS” (97.40 percent), and “U2R” (86.33 percent) attacks while giving very few false alarms at each layer. The system can also detect “R2L” attacks with much higher reliability (29.62 percent) when compared with the previously reported systems, as we will discuss later in Section 6.3. The confusion matrix shows that
  • 13. only 71.90 percent of DoS attacks are labeled as DoS during testing. However, it is very important to note that the accuracy for detecting DoS attacks is not 71.90 percent, but it is 25.50 + 71.90 + 0.00 + 0.00 = 97.40 percent. This is because 25.50 percent of the DoS attacks are already detected at the first layer, though our system identifies them as probes since they are detected at the first layer. This is acceptable because in the real environment it is critical to detect an attack as early as possible to minimize its impact.

It is also important to note that most of the “U2R” attacks are detected in the third layer itself and hence labeled as “R2L.” However, if we remove the third layer, the fourth layer can detect these attacks with similar accuracy. Further, looking at the “R2L” and “U2R” columns in Table 10, it is natural to think that the two layers can be merged. However, this has two disadvantages. First, merging the two layers results in increasing the number of features, which reduces efficiency. The merged layer performs poorly with regard to the total time taken when compared with both the unmerged layers together. Second, when the layers are merged, the “U2R” attacks are not detected effectively and their individual attack detection accuracy decreases. This is because the number of “U2R” instances is very low in the training data and the system simply learns the features that are specific to the “R2L” attacks. Hence, we prefer separate layers for the two attack groups. Using our approach, we can hope that any attack, even though its category is unknown, can be detected at any one of the layers in the system. We can also increase or decrease the number of layers depending upon the environment where the system is deployed.

We evaluate the performance of each layer in the system in Table 11. From the table, we observe that out of all the 250,436 attack instances in the test data set, more than 25 percent of the attacks are blocked at layer 1, and more than 90 percent of all the attacks have been blocked by the end of layer 2. Thus, the Layered Approach is very effective in reducing the attack traffic at each layer in the system. The configuration takes 21 seconds to classify all the 250,436 attacks.

We can do further optimization by putting the DoS layer before the Probe layer. We can do this because the data is relational and each layer in our system is independent. Putting the DoS layer before the Probe layer serves a dual advantage, as most of the attacks are detected at the first layer itself and the overall system performs efficiently. This optimization becomes significant in severe attack situations when the target is overwhelmed with illegitimate connections. The results are presented in Table 12.

We observe that the Layered Approach can be very effective in restricting the attack traffic to the initial layers in the system. We also performed experiments when we do not implement the Layered Approach, i.e., we consider only a single system that is trained with two classes (normal and attack). In this system, all the Probes, DoS, R2L, and U2R attacks are labeled as “attack.” We perform experiments both with and without feature selection. For feature selection, we consider 21 features, which are selected by applying the union operation on the feature sets of all the four attack types. We compare these results with the Layered Approach in Table 13.

TABLE 13. Layered versus Nonlayered Approach

We observe that a system implementing the Layered Approach with feature selection is more efficient and more accurate in detecting attacks, particularly the U2R, the R2L, and the Probes.

It is important to note that the time should be read in relative terms rather than absolute, as for ease of experiments we used scripts for implementation. In a real environment, high speed can be achieved by implementing the complete system in languages with efficient compilers such as the “C Language.” Further, pipelining can be implemented in multicore processors, where each core may represent a single layer, and due to pipelining, multiple I/O operations can be replaced by a single I/O operation providing very high speed of operation.

6.3 Comparison of Results
In this section, we compare our work with other well-known methods based on the anomaly intrusion detection principle. The anomaly-based systems primarily detect deviations from the learnt normal data by using statistical methods, machine learning, or data mining approaches. Standard techniques such as the decision trees and naive Bayes are known to perform well. However, our experiments show that the Layered CRFs perform far better than these techniques. The main reason for this is that the CRFs do not consider the observation features to be independent. In [38], the authors present a comparative study of various classifiers when applied to the KDD ’99 data set, and in [13], the authors propose the use of Principal Component Analysis (PCA) before applying a machine learning algorithm. Use of support vector machines is discussed in [26]. We compare our results with the results presented in these papers in Table 14. The table represents the Probability of Detection (PD) and False Alarm Rate (FAR) in percent for various methods including the KDD ’99 cup winners.

From the table, we observe that the Layered CRFs perform significantly better than the previously reported results including the winner of the KDD ’99 cup and various other methods applied to this data set. The most impressive part of the Layered CRFs is the margin of improvement as compared with other methods. Layered CRFs have very high attack detection of 98.6 percent for Probes (5.8 percent improvement) and 97.40 percent detection for DoS. They outperform by a significant percentage for the R2L (34.5 percent improvement) and the U2R (34.8 percent improvement) attacks.

6.4 Discussion and Issues
From our experiments and the comparison in Table 14, we conclude that the Layered CRFs can be very effective in detecting the Probe, the U2R, and the R2L attacks as well as
  • 14. the DoS attacks. However, if we consider all the 41 features given in the data set, we find that the time required to train and test the model is high. To address this, we performed experiments with our integrated system by implementing a four-layer system. The four layers correspond to Probe, DoS, R2L, and U2R. For each layer, we then selected a set of features that is sufficient to detect attacks at that particular layer. Feature selection for each layer enhances the performance of the entire system. The runtime (testing) performance of our model is comparable with other methods; however, the time required to train the model is slightly higher. We also observe that feature selection not only decreases the time required to test an instance, but it also increases the accuracy of attack detection. This is because using more features than required can generate superfluous rules often resulting in fitting irregularities in the data, which can misguide classification. From our experimental results, we conclude that the main strength of our method lies in detecting the R2L and the U2R attacks, which are not satisfactorily detected by other methods. Our method gives slight improvement for detecting Probe attacks and was similar in accuracy when compared with other methods for detecting the DoS attacks.

TABLE 14. Comparison of Results

The prime reason for better detection accuracy for the CRFs is that they do not consider the observation features to be independent. CRFs evaluate all the rules together, which are applicable for a given observation. This results in capturing the correlation among different features of the observation resulting in higher accuracy. Considering both the accuracy and the time required for testing, our system scores better. Our integrated system also has the advantage that any method can be used in the layers of the system. This gives flexibility to the user to decide between the time and accuracy trade-off. Furthermore, we can increase or decrease the number of layers in the system depending upon the task requirement. Finally, our system can be used for performing analysis on attacks because the attack category can be inferred from the layer at which the attack is detected.

To determine the statistical significance of our results, we compare our proposed method (Layered CRFs) with others for detecting Probes, DoS, R2L, and U2R attacks. We use the Wilcoxon rank-sum test at the 95 percent confidence level to discriminate the performance of these methods. Table 15 gives the ranking for various methods compared, where a system with rank 1 is the best. The results of the Wilcoxon test indicate that the Layered CRFs are much better (or equal) for detecting attacks. Thus, we conclude that the Layered CRFs are a strong candidate for building robust and efficient intrusion detection systems.

TABLE 15. Ranking for the Six Methods

7 EFFECT OF NOISE
Ideally, we would like to perform similar experiments with a large number of data sets. However, given the domain of the problem, there are no other freely available data sets that can be used for our experiments. To ameliorate this problem to some extent, we add a substantial amount of noise to the training data and perform similar experiments to study the robustness of these systems. By experimenting with noisy data, we want to determine the sensitivity of the proposed scheme with respect to noise. If the system performs poorly with noisy data, the results could be an artifact of the data set.

7.1 Addition of Noise to Data
Addition of noise was controlled by two parameters, the probability of adding noise to a feature, p, and the scaling factor, s, for a feature. We performed four sets of experiments with noisy data, separately, one for each layer. For each layer, we varied the parameter p between 0 and 0.95 (by keeping it at values 0.10, 0.20, 0.33, 0.50, 0.75, 0.90, and 0.95) and varied the parameter s between −1,000 and +1,000. In case the original feature was “0,” noise was added to the feature by using an additive function (random value between −1,000 and +1,000) instead of scaling. Figs. 5, 6, 7, and 8 represent the effect of noise on each layer separately. We find that our integrated system is robust to noise in the training data and performs better than other methods for all of the four attack groups.
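The noise procedure of Section 7.1 can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' script: the text gives the range of s but not how it is sampled, so a uniform draw is assumed here; only numeric features are handled; and the function name `add_noise` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def add_noise(record: Sequence[float], p: float,
              s_min: float = -1000.0, s_max: float = 1000.0,
              rng: Optional[random.Random] = None) -> List[float]:
    """Return a noisy copy of one record of numeric features.

    Each feature is perturbed with probability p. A nonzero feature is
    multiplied by a scaling factor s; a zero feature gets s added instead,
    since scaling zero would leave it unchanged. s is drawn uniformly from
    [s_min, s_max] (assumed distribution).
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for value in record:
        if rng.random() < p:
            s = rng.uniform(s_min, s_max)
            value = value + s if value == 0 else value * s
        noisy.append(value)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same noise
    clean = [0.0, 181.0, 5450.0, 0.0, 1.0]
    for p in (0.10, 0.50, 0.95):
        print(p, add_noise(clean, p, rng=rng))
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible across the four per-layer experiments, which helps when comparing methods on identically corrupted training data.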
  • 15. Fig. 5. Effect of noise on Probe layer.

Fig. 6. Effect of noise on DoS layer.

Fig. 7. Effect of noise on R2L layer.

Fig. 8. Effect of noise on U2R layer.

8 AUTOMATIC FEATURE SELECTION
From our experiments in the previous sections, we showed the advantages of performing feature selection and implementing the Layered Approach for attack detection. We performed our experiments by manually selecting features for different layers. However, we want to compare the results of manual feature selection with the results of automatic feature selection (and no feature selection) for all the layers. Hence, we investigated various methods for automatic feature selection.

We experimented with a feed forward neural network to determine the weights for all the 41 features. Features with weights close to zero were discarded. As a result, only a small set of features was selected for each layer. However, when we performed the experiments on the reduced set of features and compared the results, there was no significant improvement in the detection accuracy, though there was reduction in training and testing time.

We then used PCA for dimensionality reduction [13]. However, the main drawback of using PCA in our task is that PCA transforms a large number of possibly correlated features into a small number of uncorrelated features known as the principal components. Hence, when we applied PCA followed by CRFs in the newly transformed feature space, the method did not provide significant advantage, as the strength of our approach is to model correlation among features and the features in the new space are independent.

We also note that, to construct a decision tree, the C4.5 algorithm performs feature selection. We selected the same features as selected by the C4.5 algorithm and then performed experiments with only those features. However, there was no significant improvement in the results.

We also performed experiments with the method proposed in [33] for efficiently inducing features for a CRF. The method is based upon iteratively constructing feature conjunctions that would significantly increase the conditional log-likelihood if added to the model. We used the Mallet tool [35] for performing these experiments and compare the results with our previous results based on Layered CRFs with manual feature selection in Table 16. We observed that both the systems, with automatic and manual feature selection, had similar test time performance, but the accuracy of detection when features were induced automatically was significantly lower than that of our system based upon manual feature selection.

TABLE 16. Feature Selection

It was not surprising that manual feature selection performed better than automatic feature selection. However, we note that automatic feature selection for Layered CRFs performed better than the decision trees, particularly for the R2L and the U2R attacks. This suggests that Layered
  • 16. GUPTA ET AL.: LAYERED APPROACH USING CONDITIONAL RANDOM FIELDS FOR INTRUSION DETECTION 47Approach using CRFs with automated feature selection is a A.2 Features Selected for DoS Layerfeasible scheme for building reliable intrusion detectionsystems.9 CONCLUSIONSIn this paper, we have addressed the dual problem ofAccuracy and Efficiency for building robust and efficientintrusion detection systems. Our experimental results inSection 6 show that CRFs are very effective in improvingthe attack detection rate and decreasing the FAR. Having alow FAR is very important for any intrusion detectionsystem. Further, feature selection and implementing theLayered Approach significantly reduce the time required to A.3 Features Selected for R2L Layertrain and test the model. Even though we used a relationaldata set for our experiments, we showed that the sequencelabeling methods such as the CRFs can be very effective indetecting attacks and they outperform other methods thatare known to work well with the relational data. Wecompared our approach with some well-known methodsand found that most of the present methods for intrusiondetection fail to reliably detect R2L and U2R attacks, whileour integrated system can effectively and efficiently detectsuch attacks giving an improvement of 34.5 percent for theR2L and 34.8 percent for the U2R attacks. We also discussedhow our system is implemented in real life. Our system canhelp in identifying an attack once it is detected at aparticular layer, which expedites the intrusion responsemechanism, thus minimizing the impact of an attack. Weshowed that our system is robust to noise and performsbetter than any other compared system even when thetraining data is noisy. Finally, our system has the advantage A.4 Features Selected for U2R Layerthat the number of layers can be increased or decreaseddepending upon the environment in which the system isdeployed, giving flexibility to the network administrators. 
The areas for future research include the use of our method for extracting features that can aid in the development of signatures for signature-based systems. The signature-based systems can be deployed at the periphery of a network to filter out attacks that are frequent and previously known, leaving the detection of new unknown attacks for anomaly and hybrid systems. Sequence analysis methods such as the CRFs when applied to relational data give us the opportunity to employ the Layered Approach, as shown in this paper. This can further be extended to implement pipelining of layers in multicore processors, which is likely to result in very high performance.

ACKNOWLEDGMENTS
The authors sincerely thank the anonymous reviewers whose comments have greatly helped clarify and improve this paper.

APPENDIX A
FEATURE SELECTION
A.1 Features Selected for Probe Layer
A.2 Features Selected for DoS Layer
A.3 Features Selected for R2L Layer
A.4 Features Selected for U2R Layer

REFERENCES
[1] Autonomous Agents for Intrusion Detection, http://www.cerias.purdue.edu/research/aafid/, 2010.
[2] CRF++: Yet Another CRF Toolkit, http://crfpp.sourceforge.net/, 2010.
[3] KDD Cup 1999 Intrusion Detection Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2010.
[4] Overview of Attack Trends, http://www.cert.org/archive/pdf/attack_trends.pdf, 2002.
[5] Probabilistic Agent Based Intrusion Detection, http://www.cse.sc.edu/research/isl/agentIDS.shtml, 2010.
  • 17.
[6] SANS Institute - Intrusion Detection FAQ, http://www.sans.org/resources/idfaq/, 2010.
[7] T. Abraham, IDDM: Intrusion Detection Using Data Mining Techniques, http://www.dsto.defence.gov.au/publications/2345/DSTO-GD-0286.pdf, 2008.
[8] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. ACM SIGMOD, vol. 22, no. 2, pp. 207-216, 1993.
[9] N.B. Amor, S. Benferhat, and Z. Elouedi, “Naive Bayes vs. Decision Trees in Intrusion Detection Systems,” Proc. ACM Symp. Applied Computing (SAC ’04), pp. 420-424, 2004.
[10] J.P. Anderson, Computer Security Threat Monitoring and Surveillance, http://csrc.nist.gov/publications/history/ande80.pdf, 2010.
[11] R. Bace and P. Mell, Intrusion Detection Systems, Computer Security Division, Information Technology Laboratory, Nat’l Inst. of Standards and Technology, 2001.
[12] D. Boughaci, H. Drias, A. Bendib, Y. Bouznit, and B. Benhamou, “Distributed Intrusion Detection Framework Based on Mobile Agents,” Proc. Int’l Conf. Dependability of Computer Systems (DepCoS-RELCOMEX ’06), pp. 248-255, 2006.
[13] Y. Bouzida and S. Gombault, “Eigenconnections to Intrusion Detection,” Security and Protection in Information Processing Systems, pp. 241-258, 2004.
[14] H. Debar, M. Becke, and D. Siboni, “A Neural Network Component for an Intrusion Detection System,” Proc. IEEE Symp. Research in Security and Privacy (RSP ’92), pp. 240-250, 1992.
[15] T.G. Dietterich, “Machine Learning for Sequential Data: A Review,” Proc. Joint IAPR Int’l Workshop Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR ’02), LNCS 2396, pp. 15-30, 2002.
[16] P. Dokas, L. Ertoz, A. Lazarevic, J. Srivastava, and P.-N. Tan, “Data Mining for Network Intrusion Detection,” Proc. NSF Workshop Next Generation Data Mining (NGDM ’02), pp. 21-30, 2002.
[17] Y. Du, H. Wang, and Y. Pang, “A Hidden Markov Models-Based Anomaly Intrusion Detection Method,” Proc. Fifth World Congress on Intelligent Control and Automation (WCICA ’04), vol. 5, pp. 4348-4351, 2004.
[18] S. Dzeroski and B. Zenko, “Is Combining Classifiers Better than Selecting the Best One,” Proc. 19th Int’l Conf. Machine Learning (ICML ’02), pp. 123-129, 2002.
[19] L. Ertoz, A. Lazarevic, E. Eilertson, P.-N. Tan, P. Dokas, V. Kumar, and J. Srivastava, “Protecting against Cyber Threats in Networked Information Systems,” Proc. SPIE Battlespace Digitization and Network Centric Systems III, pp. 51-56, 2003.
[20] S. Forrest, S.A. Hofmeyr, A. Somayaji, and T.A. Longstaff, “A Sense of Self for Unix Processes,” Proc. IEEE Symp. Research in Security and Privacy (RSP ’96), pp. 120-128, 1996.
[21] Y. Gu, A. McCallum, and D. Towsley, “Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation,” Proc. Internet Measurement Conf. (IMC ’05), pp. 345-350, USENIX Assoc., 2005.
[22] K.K. Gupta, B. Nath, and R. Kotagiri, “Network Security Framework,” Int’l J. Computer Science and Network Security, vol. 6, no. 7B, pp. 151-157, 2006.
[23] K.K. Gupta, B. Nath, and R. Kotagiri, “Conditional Random Fields for Intrusion Detection,” Proc. 21st Int’l Conf. Advanced Information Networking and Applications Workshops (AINAW ’07), pp. 203-208, 2007.
[24] K.K. Gupta, B. Nath, R. Kotagiri, and A. Kazi, “Attacking Confidentiality: An Agent Based Approach,” Proc. IEEE Int’l Conf. Intelligence and Security Informatics (ISI ’06), vol. 3975, pp. 285-296, 2006.
[25] C. Ji and S. Ma, “Combinations of Weak Classifiers,” IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 32-42, 1997.
[26] D.S. Kim and J.S. Park, “Network-Based Intrusion Detection with Support Vector Machines,” Proc. Information Networking, Networking Technologies for Enhanced Internet Services Int’l Conf. (ICOIN ’03), pp. 747-756, 2003.
[27] D. Klein and C.D. Manning, “Conditional Structure versus Conditional Estimation in NLP Models,” Proc. ACL Conf. Empirical Methods in Natural Language Processing (EMNLP ’02), vol. 10, pp. 9-16, Assoc. for Computational Linguistics, 2002.
[28] C. Kruegel, D. Mutz, W. Robertson, and F. Valeur, “Bayesian Event Classification for Intrusion Detection,” Proc. 19th Ann. Computer Security Applications Conf. (ACSAC ’03), pp. 14-23, 2003.
[29] J. Lafferty, A. McCallum, and F. Pereira, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” Proc. 18th Int’l Conf. Machine Learning (ICML ’01), pp. 282-289, 2001.
[30] W. Lee and S. Stolfo, “Data Mining Approaches for Intrusion Detection,” Proc. Seventh USENIX Security Symp. (Security ’98), pp. 79-94, 1998.
[31] W. Lee, S. Stolfo, and K. Mok, “Mining Audit Data to Build Intrusion Detection Models,” Proc. Fourth Int’l Conf. Knowledge Discovery and Data Mining (KDD ’98), pp. 66-72, 1998.
[32] W. Lee, S. Stolfo, and K. Mok, “A Data Mining Framework for Building Intrusion Detection Model,” Proc. IEEE Symp. Security and Privacy (SP ’99), pp. 120-132, 1999.
[33] A. McCallum, “Efficiently Inducing Features of Conditional Random Fields,” Proc. 19th Ann. Conf. Uncertainty in Artificial Intelligence (UAI ’03), pp. 403-410, 2003.
[34] A. McCallum, D. Freitag, and F. Pereira, “Maximum Entropy Markov Models for Information Extraction and Segmentation,” Proc. 17th Int’l Conf. Machine Learning (ICML ’00), pp. 591-598, 2000.
[35] A.K. McCallum, MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu, 2010.
[36] L. Portnoy, E. Eskin, and S. Stolfo, “Intrusion Detection with Unlabeled Data Using Clustering,” Proc. ACM Workshop Data Mining Applied to Security (DMSA), 2001.
[37] A. Ratnaparkhi, “A Maximum Entropy Model for Part-of-Speech Tagging,” Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP ’96), pp. 133-142, Assoc. for Computational Linguistics, 1996.
[38] M. Sabhnani and G. Serpen, “Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context,” Proc. Int’l Conf. Machine Learning, Models, Technologies and Applications (MLMTA ’03), pp. 209-215, 2003.
[39] H. Shah, J. Undercoffer, and A. Joshi, “Fuzzy Clustering for Intrusion Detection,” Proc. 12th IEEE Int’l Conf. Fuzzy Systems (FUZZ-IEEE ’03), vol. 2, pp. 1274-1278, 2003.
[40] C. Sutton and A. McCallum, “An Introduction to Conditional Random Fields for Relational Learning,” Introduction to Statistical Relational Learning, 2006.
[41] E. Tombini, H. Debar, L. Me, and M. Ducasse, “A Serial Combination of Anomaly and Misuse IDSes Applied to HTTP Traffic,” Proc. 20th Ann. Computer Security Applications Conf. (ACSAC ’04), pp. 428-437, 2004.
[42] W. Wang, X.H. Guan, and X.L. Zhang, “Modeling Program Behaviors by Hidden Markov Models for Intrusion Detection,” Proc. Int’l Conf. Machine Learning and Cybernetics (ICMLC ’04), vol. 5, pp. 2830-2835, 2004.
[43] C. Warrender, S. Forrest, and B. Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” Proc. IEEE Symp. Security and Privacy (SP ’99), pp. 133-145, 1999.
[44] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[45] Y.-S. Wu, B. Foo, Y. Mei, and S. Bagchi, “Collaborative Intrusion Detection System (CIDS): A Framework for Accurate and Efficient IDS,” Proc. 19th Ann. Computer Security Applications Conf. (ACSAC ’03), pp. 234-244, 2003.
[46] Z. Zhang, J. Li, C.N. Manikopoulos, J. Jorgenson, and J. Ucles, “HIDE: A Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification,” Proc. IEEE Workshop Information Assurance and Security (IAW ’01), pp. 85-90, 2001.

Kapil Kumar Gupta received the BTech degree in computer science and engineering from the Guru Gobind Singh Indraprastha (GGSIP) University, Delhi, India, in 2004. He worked for a year at HCL Technologies, Noida, India. He is currently a PhD student in the Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Australia. His research interests include intrusion detection, network security, data security and data privacy, machine learning, data mining, and artificial intelligence.
  • 18.
Baikunth Nath received the MA degree from Punjab University, Chandigarh, India and the PhD degree from the University of Queensland, Brisbane, Australia. He was with Monash University for more than 25 years in various senior positions including the director of research in the Gippsland School of IT. In 2001, he joined the Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Australia, as an associate professor and the director of postgraduate studies. His research interests include image processing, intrusion detection, scheduling, optimization, data mining, evolutionary computing, neural networks, financial forecasting, and operations research. He is the author of numerous research publications in various well-reputed international journals and conference proceedings. He is a senior member of the IEEE.

Ramamohanarao (Rao) Kotagiri received the BE degree from Andhra University, the ME degree from the Indian Institute of Science (IISc), Bangalore, India, and the PhD degree from Monash University. He was awarded the Alexander von Humboldt Fellowship in 1983. He joined the University of Melbourne in 1980 and was appointed as a professor in computer science in 1989. He has held several senior positions including head of Computer Science and Software Engineering, head of the School of Electrical Engineering and Computer Science, deputy director of the Centre for Ultra Broadband Information Networks, codirector of the Key Centre for Knowledge-Based Systems, and research director for the Cooperative Research Centre for Intelligent Decision Systems, The University of Melbourne. He served as a member of the ARC Information Technology Panel. He also served on the Prime Minister’s Science, Engineering and Innovation Council Working Party on Data for Scientists. He is currently the associate dean for research in the Faculty of Engineering, The University of Melbourne. He is on the editorial boards of the Universal Computer Science, Journal of Knowledge and Information Systems, IEEE Transactions on Knowledge and Data Engineering (TKDE), Journal of Statistical Analysis and Data Mining, and Very Large Data Bases (VLDB) Journal. He served as a program committee member of numerous international conferences including the International Conference on Management of Data (SIGMOD), International Conference on Very Large Data Bases (VLDB), International Conference on Logic Programming (ICLP), and International Conference on Data Engineering (ICDE). He is a steering committee member of the IEEE International Conference on Data Mining (ICDM), Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), and International Conference on Database Systems for Advanced Applications (DASFAA). He was the program cochair for VLDB, PAKDD, DASFAA, and the International Conference on Deductive and Object-Oriented Databases (DOOD). His research interests include database systems, logic-based systems, agent-oriented systems, information retrieval, data mining, intrusion detection, and machine learning. He has published widely in conference proceedings and international journals. He is a fellow of the Institute of Engineers Australia, Australian Academy of Technological Sciences and Engineering, and Australian Academy of Science. He is a member of the IEEE.