Survey Reports on Four Selected Research Papers on
Data Mining Based Intrusion Detection System
Authors: Zillur Rahman, S S Ahmedur Rahman, Lawangeen Khan
Abstract--- Security is a big issue nowadays, and a lot of effort and finance is being invested into this sector. Improving network intrusion detection is one of the most prominent fields in this area. Applying data mining to network intrusion detection is not a very old concept, and it has now been proved that data mining can automate the network intrusion detection field with greater efficiency. This paper tries to give a detailed view of four papers in this area. Three of them are very current papers from 2005 and one is from 2001. The paper from 2001 discusses ADAM, which was one of the leading data mining based intrusion detection systems. One of the other three papers discusses various types of data mining based intrusion detection system models and tries to draw out the advantages of data mining based network intrusion detection systems over misuse detection systems. The other two papers discuss two different models: one discusses packet-based vs. session-based modeling for intrusion detection, and the other presents a model named SCAN, which can work with good accuracy even with an incomplete dataset. This paper is based on these four papers, and we hope that it can give the reader a good insight into this promising area.

1. INTRODUCTION:

Intrusion Detection System (IDS) is an important countermeasure used to preserve data integrity and system availability from attacks. An IDS is a combination of software and hardware that attempts to perform intrusion detection. Intrusion detection is the process of gathering intrusion-related knowledge by monitoring events and analyzing them for signs of intrusion. It raises an alarm when a possible intrusion occurs in the system.

Many IDS can be described with three fundamental functional components: Information Source, Analysis, and Response. Different sources of information, and events based on that information, are gathered to decide whether an intrusion has taken place. This information is gathered at various levels like system, host, application, etc. Based on analysis of this data, they can detect intrusions using two common practices: Misuse detection and Anomaly detection.

The main motivation behind using data mining in intrusion detection is automation. Patterns of normal behavior and patterns of intrusion can be computed using data mining. To apply data mining techniques in intrusion detection, first, the collected monitoring data needs to be preprocessed and converted to a format suitable for mining. Next, the reformatted data will be used to develop a clustering or classification model. Data mining techniques can be applied to gain insightful knowledge of intrusion prevention mechanisms. They can help detect new vulnerabilities and intrusions, discover previously unknown patterns of attacker behavior, and provide decision support for intrusion management. The papers also identify several research issues, such as performance and delay, that are involved in incorporating data mining in intrusion detection systems.

One of the most pervasive issues in the computing world today is the problem of security: how to protect networks and systems from intrusions, disruptions, and other malicious activities by unwanted attackers. In one of the surveyed papers, the authors report the findings of their research in the area of anomaly-based intrusion detection systems, using data-mining techniques to create a decision tree model of their network with the 1999 DARPA Intrusion Detection Evaluation data set.

In another paper, the authors have described some existing ID systems in the first part, and in the second part they have shown how data mining can be useful in this field. To do so, they have introduced ADAM (Audit Data Analysis and Mining). In the first section they have identified two kinds of ID systems currently in use. One kind is signature-based IDS (examples include P-BEST, USTAT and NSTAT). The other kind is statistics and data mining-based IDS (examples include IDES, NIDES, EMERALD, Haystack and JAM). As the main part of that paper is ADAM, we do not describe the other IDS in this summary but will explain the authors' design of and experiences with ADAM.

Currently most of the systems deployed assume the availability of complete and clean data for the purpose of intrusion detection. In the fourth paper, the authors have shown that this assumption is not valid, and they presented SCAN (Stochastic Clustering Algorithm for Network Anomaly detection), which has the ability to detect intrusions with high accuracy even when audit data is incomplete. They have used the Expectation-Maximization algorithm to cluster the incoming audit data and compute the missing values of the audit data. They have validated their tests using the 1999 DARPA/Lincoln Laboratory intrusion detection evaluation dataset.

2. EXPLOITING EFFICIENT DATA MINING TECHNIQUES TO ENHANCE INTRUSION DETECTION SYSTEMS

Data Mining and Intrusion Detection:
Data mining is assisting various applications that require data analysis. Recently, data mining has become an important component in intrusion detection systems. Different data mining approaches like classification, clustering, association rules, and outlier detection are frequently used to analyze network data to gain intrusion-related knowledge.

Clustering:
Clustering is the process of labeling data and assigning it into groups. Clustering algorithms can group new data instances into similar groups. These groups can be used to increase the performance of existing classifiers. High-quality clusters can also assist a human expert with labeling. Clustering techniques can be categorized into the following classes:
1) Pair-wise clustering.
2) Central clustering.
Pair-wise clustering (similarity-based clustering) unifies similar data instances based on a pair-wise distance measure. On the other hand, central clustering, also called centroid-based or model-based clustering, models each cluster by its "centroid". In terms of runtime complexity, centroid-based clustering algorithms are more efficient than similarity-based clustering algorithms. Clustering discovers complex intrusions that occur over extended periods of time and different spaces, correlating independent network events. It is also used to detect hybrids of attacks in a cluster. Clustering is an unsupervised machine learning mechanism for finding patterns in unlabeled data with many dimensions. The basic steps involved in identifying intrusions with clustering are as follows:
1) Find the largest cluster, i.e., the one with the most instances, and label it normal.
2) Sort the remaining clusters in ascending order of their distances to the largest cluster.
3) Select the first K1 clusters so that the number of data instances in these clusters sums up to η·N, and label them as normal, where η is the percentage of normal instances and N is the total number of instances.
4) Label all the other clusters as attacks.
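The labeling steps above can be sketched in a few lines of plain Python. The clusters, points, and the normal-instance fraction eta are hypothetical examples, not data from the surveyed papers:

```python
# Sketch of the cluster-labeling heuristic described above (hypothetical data).
# Clusters are lists of 2-D points; eta is the assumed fraction of normal traffic.

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(2))

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def label_clusters(clusters, eta):
    """Label the largest cluster normal, then the nearest clusters until
    eta * N instances are covered; everything else is labeled attack."""
    total = sum(len(c) for c in clusters)
    order = sorted(range(len(clusters)), key=lambda i: len(clusters[i]), reverse=True)
    largest = order[0]
    # Sort the remaining clusters by distance to the largest cluster's centroid.
    rest = sorted(order[1:],
                  key=lambda i: euclid(centroid(clusters[i]), centroid(clusters[largest])))
    labels = {largest: "normal"}
    covered = len(clusters[largest])
    for i in rest:
        if covered < eta * total:
            labels[i] = "normal"
            covered += len(clusters[i])
        else:
            labels[i] = "attack"
    return labels
```

Note the heuristic's core assumption, stated in the text: attack traffic is rare, so small clusters far from the dominant cluster are the suspicious ones.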
Classification:
Classification is similar to clustering in that it also partitions customer records into distinct segments called classes. But unlike clustering, classification analysis requires that the end-user/analyst know ahead of time how the classes are defined. It is necessary that each record in the dataset used to build the classifier already have a value for the attribute used to define the classes. Because each record has a value for that attribute, and because the end-user decides on the attribute to use, classification is much less exploratory than clustering. The objective of a classifier is not to explore the data to discover interesting segments, but to decide how new records should be classified. Classification algorithms can be divided into three types:
1) Extensions to linear discrimination (e.g., multilayer perceptron, logistic discrimination).
2) Decision tree.
3) Rule-based methods.
As compared to the clustering technique, the classification technique is less popular in the domain of intrusion detection. The main reason for this is the large amount of data that needs to be collected to apply classification. To build the traces and form the normal and abnormal groups, a significant amount of data needs to be analyzed to ensure its proximity. Using the collected data as empirical models, the false alarm rate in such a case is significantly lower when compared to clustering. The classification approach can be useful for both misuse detection and anomaly detection, but it is more commonly used for misuse detection.

Outlier Detection:
Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion. The existence of outliers indicates individuals or groups that have very different behavior from most of the individuals in the dataset. Many times, outliers are removed to improve the accuracy of the estimators. Most anomaly detection algorithms require a set of purely normal data to train the model. Commonly used outlier techniques in intrusion detection are the Mahalanobis distance, detection of outliers using Partitioning Around Medoids (PAM), and Bay's algorithm for distance-based outliers. Outlier detection is very useful in anomaly-based intrusion detection, and outlier detection approaches can be useful for detecting unknown attacks. This is the primary reason that makes outlier detection a popular approach for intrusion detection systems.

Association Rule:
The association rule is specifically designed for use in data analyses. The association rule considers each attribute/value pair as an item. An item set is a combination of items in a single network request. The algorithm scans through the dataset trying to find item sets that tend to appear in many network records. The objective behind using association rule based data mining is to derive multi-feature (attribute) correlations from a database table. A typical and widely-used example of association rule mining is Market Basket Analysis. Association rules provide information of this type in the form of "if-then" statements. In association analysis the antecedent and consequent are sets of items (called item sets) that are disjoint. The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The other number is known as the confidence of the rule. Many association rule algorithms have been developed in the last decades, and they can be classified into two categories:
1) candidate-generation-and-test approaches, such as Apriori.
2) pattern-growth approaches.
The challenging issues for association rule algorithms are the multiple scans of transaction databases and the large number of candidates. Apriori was the first scalable algorithm designed for association-rule mining.
The basic steps for incorporating association rules into intrusion detection are as follows:
1) First, network data need to be formatted into a database table where each row is an audit record and each column is a field of the audit records.
2) There is evidence that intrusions and user activities show frequent correlations among network data.
3) Rules based on network data can continuously merge the rules from a new run into the aggregate rule set of all previous runs.
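As a concrete sketch, the support and confidence numbers described above can be counted directly from a transaction table. The transactions below are hypothetical toy data:

```python
# Toy support/confidence computation for a rule "antecedent => consequent"
# over a list of transactions (each a set of items). Hypothetical data.

def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent: support(X union Y) / support(X)."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)
```

A real miner such as Apriori avoids scanning the whole table for every candidate itemset; this sketch only shows what the two numbers mean.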
Data Mining Based IDS:
Besides expert systems, state transition analysis, and statistical analysis, data mining is becoming one of the popular techniques for detecting intrusion. They explore several available IDS which use data mining techniques for intrusion detection. IDS can be classified on the basis of their strategy of detection. There are two categories under this classification: Misuse detection and Anomaly detection.

Misuse Detection Based IDS:
Misuse detection searches for known patterns of attack. One disadvantage of this strategy is that it can only detect intrusions which are based on known patterns. Example misuse detection systems that use data mining include JAM (Java Agents for Meta learning), MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection), and Automated Discovery of Concise Predictive Rules for Intrusion Detection. JAM (developed at Columbia University) uses data mining techniques to discover patterns of intrusions. It then applies a meta-learning classifier to learn the signature of attacks. The association rules algorithm determines relationships between fields in the audit trail records, and the frequent episodes algorithm models sequential patterns of audit events. Features are then extracted from both algorithms and used to compute models of intrusion behavior. The classifiers build the signature of attacks. Thus, data mining in JAM builds misuse detection models.

Anomaly Detection Based IDS:
Misuse detection cannot detect attacks whose signatures have not been defined. Anomaly detection addresses this shortcoming. In anomaly detection, the system defines the expected network behavior in advance. Any significant deviations from this expected behavior are then reported as possible attacks. Such deviations are not necessarily actual attacks. Data mining techniques are used to extract implicit, unknown, and useful information from data. Applications of data mining to anomaly detection include ADAM (Audit Data Analysis and Mining) and IDDM (Intrusion Detection using Data Mining), among others.

IDS Using both Misuse and Anomaly Detection:
Following are the IDS's that use both misuse and anomaly intrusion detection techniques; thus they are capable of detecting both known and unknown intrusions.

IIDS (Intelligent Intrusion Detection System Architecture):
It is a distributed and network-based modular architecture to monitor activities across the whole network. It is based on both anomaly and misuse detection. In IIDS multiple sensors, both anomaly and misuse detection sensors serving as experts, monitor activities on individual workstations, activities of users, and network traffic, detecting intrusion from and within network traffic and at individual client levels. Currently, the IIDS architecture runs in a high-speed cluster environment.

RIDS-100 (Rising Intrusion Detection System):
RIDS makes use of both intrusion detection techniques, misuse and anomaly detection. A distance-based outlier detection algorithm is used for detecting deviational behavior among collected network data. For misuse detection, it has a very vast set of collected data patterns which can be matched against scanned network data. This large set of data patterns is scanned using the data mining classification Decision Tree algorithm.

3. PACKET VS SESSION BASED MODELING FOR INTRUSION DETECTION SYSTEM

In this survey they report the findings of their research in the area of anomaly-based intrusion detection systems using data-mining techniques to create a decision tree model of their network using the 1999 DARPA Intrusion Detection Evaluation data set.

Misuse Detectors:
Misuse detectors are similar to modern anti-virus systems where an attack's signature is stored in the IDS for later use if needed.

Anomaly Detectors:
Anomaly detectors determine what is "normal" in a particular system or network; if an abnormal or anomalous event occurs, it is flagged for further analysis.
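The two detector styles just described can be contrasted in a few lines. The signatures and the "normal" profile below are hypothetical examples, not rules from any of the surveyed systems:

```python
# Toy contrast of the two detector styles described above.
# The signature set and the normal profile are hypothetical examples.

SIGNATURES = {"GET /etc/passwd", "rm -rf /"}        # misuse: known-bad patterns
NORMAL_EVENTS = {"GET /index.html", "GET /about"}   # anomaly: expected behavior

def misuse_alert(event):
    # Flags only events that match a stored attack signature.
    return event in SIGNATURES

def anomaly_alert(event):
    # Flags any deviation from the normal profile (may be a false positive).
    return event not in NORMAL_EVENTS
```

The sketch makes the trade-off visible: a novel attack misses every signature but still deviates from the profile, while a harmless new event trips only the anomaly detector.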
The technique used is classification-tree modeling to accurately predict probable attack sessions as well as probable attack packets. Overall, the goal is to model network traffic into network sessions and packets and identify those instances that have a high probability of being an attack and can be labeled as a "suspect session" or "suspect packet." Then, they compare and contrast the two techniques at the end to see which methodology is more accurate and suitable to their needs. The ultimate goal would be to create a layered system of network filters at the LAN gateway so all incoming and outgoing traffic must pass through these signature- and anomaly-based filters before being allowed to successfully pass through the gateway.

Problems with IDS:
(1). False Negative.
(2). False Positive.

False Negative:
False negatives occur when intrusive activities that are not anomalous go unflagged.

False Positive:
False positives occur when anomalous activities that are not intrusive are flagged as intrusive.

The authors contended that it is better to focus on anomaly detection techniques in the IDS field, modeling normal behavior in order to detect novel attacks. The authors described two methods for learning anomaly detectors: rule learning (LERAD) and clustering (CLAD). Both methods present interesting advantages and possibilities for future research.

They started their research towards developing a robust network modeler through the use of data-mining techniques on the MIT Lincoln Lab data sets from their work in 1999. The MIT researchers created a synthetic network environment test bed to create their data sets. These data sets, which contained both attack-filled and attack-free data, allowed them to look at network traffic stored in tcpdump format where the attack types, durations, and times were known and published. When looking at the data sets, the individual rows represent network packets while the columns represent the associated variables found in each particular network packet.

Data Preparation:
They studied the data sets' patterns and modeled the traffic patterns around a generated target variable, "TGT". This variable is a Boolean value (true/false) where a true value is returned for TGT if the packet is a known attack packet as designated by the studies done in the Lincoln Labs' tests. They used this variable as a predictor target variable, setting the stage for when they applied a new, separate data set to the model and scored the new data set against this model.

Eventually, they made two separate data sets, one for the packet-based modeling and the other for the session-based modeling, and used the models in separate decision trees to see the results attained. For the session-based data set, they molded it to become "shorter" and "wider" by adding more variables (columns) and reducing the number of observations (rows) through the use of Structured Query Language (SQL) constructs to mesh the associated packets into sessions, thereby reducing the size and complexity of the data while maintaining the essential components of the data set. Since the victim computer was involved in an attack session, its sending "TGT" value was reset to "1" to describe that it was receiving an attack from an outside source.

Classification Tree Modeling:
They used a predictive, classification tree method for their model. The advantage of using this method over traditional pattern-recognition methods is that the classification tree is an intuitive tool and relatively easy to interpret.

Data Modeling and Assessment:
They strived to accomplish their data-mining goals through the use of supervised learning techniques on the binary target variable, "TGT". They also used the concept of over-sampling, which allowed them to highlight the infrequent and unusual attacks that occurred during the 1999 IDEVAL. From the initial data set they over-sampled the labeled attack sessions and created a new data set.
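The over-sampling step can be sketched as simple replication of the rare attack records. The record counts and replication factor below are hypothetical, chosen only so the resulting attack share lands near the 43.6% mix reported later in this survey:

```python
# Sketch of the over-sampling step above: replicate the rare attack
# sessions so the model sees them more often. Field name is hypothetical.

def oversample_attacks(records, factor):
    """Return a new data set with each attack session repeated `factor` times."""
    attacks = [r for r in records if r["TGT"]]
    normals = [r for r in records if not r["TGT"]]
    return normals + attacks * factor
```

With 10 attack and 90 normal sessions, a factor of 7 yields 70 attack sessions out of 160, roughly the 43.6%/56.4% mix described in the testing section.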
From these revised data sets they further deleted a few extraneous variables: the date/time, source IP address, and destination IP address variables. All other header variables were kept, including the source and destination port numbers, in order to allow the system to model the important facets of the network better. They then partitioned both data sets using stratified random sampling and allocated 67% of the observations (network sessions) to the training data set and 33% to the validation data set. The TGT variable was then specified to form subsets of the original data to improve the classification precision of their model. They created a decision tree using the Chi-Square splitting criterion and modeled the data from the IDEVAL. Their decision tree model determined that the number of leaves required to maximize the "profit" of the model was around four or five. Their model also shows that after five leaves of depth, the tree maximizes its profit.

4. ADAM: DETECTING INTRUSIONS BY DATA MINING:

ADAM uses a combination of association rules and classification to detect attacks in TCPdump audit trails. First, ADAM collects normal, known frequent datasets by mining into this model. Secondly, it runs an on-line algorithm to find the last frequent connections, compares them with the known mined data, and discards those which seem to be normal. For the suspicious ones it then uses a previously trained classifier to classify the suspicious connections as a known type of attack, an unknown type of attack, or a false alarm.

As this ADAM model uses association rules over audit data to improve the classification task, it will be better to describe the association rule before entering into the main design.

Suppose there are two sets of attributes X and Y. Association rules use some notation, described below:
X⇒Y: means that Y happens if X happens.
X∩Y = Ø: means that X and Y don't contain the same item.
||Y||: means that Y contains at least one item.

Now let's give an example. Suppose that in a library, many customers who buy a book also buy a pencil, which leads to an association rule book⇒pencil. There are also two parameters involved with this: support and confidence. The association rule book⇒pencil having 10% support means that 10% of all customers buy both book and pencil. The rule having 30% confidence means that 30% of the customers who buy a book also buy a pencil. The most difficult part of the association rule discovery algorithm is to find the item sets X ∪ Y (in this example, book ∪ pencil) that have strong support.

There are two phases in this experimental model. In the first phase they train the classifier. This phase takes place only once, offline, before using the system. In the second phase they use the trained classifier to detect intrusions.

Phase 1:

In the first step, a database of attack-free normal frequent itemsets (those having a minimum support) is created. This database serves as a profile, which is later compared against the frequent datasets found. The profile database is populated with frequent itemsets with a specific format for the attack-free portion of the data. The algorithm used in this step can be a conventional association data mining algorithm, although they used a customized algorithm for better speed. So, in the first step they are creating a profile of normal behavior.
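The profile-building step can be sketched as a plain frequent-itemset count over attack-free records. The connection-record items below are hypothetical, and the paper's customized, faster algorithm is not reproduced here:

```python
from itertools import combinations

# Sketch of building a profile of frequent itemsets from attack-free
# connection records (hypothetical data; minimum support is a raw count).

def frequent_itemsets(records, min_support):
    """Return every itemset (up to size 2) appearing in at least
    min_support attack-free records."""
    counts = {}
    for rec in records:
        items = sorted(rec)
        for size in (1, 2):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {s for s, n in counts.items() if n >= min_support}
```

Itemsets that later show strong support in live traffic but are absent from this profile are exactly the "suspicious" itemsets the second step hands to the classifier.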
In the second step they again use the training data, the normal profile, and an on-line algorithm for association rules, whose output consists of frequent itemsets that may be attacks. These suspicious itemsets, along with a set of features extracted from the data by a feature selection module, are used as training data for the classifier, which is based on a decision tree. Now let's look at how the dynamic on-line association rule algorithm works. This algorithm is driven by a sliding window of tunable size. The algorithm outputs the itemsets that have received strong support within the specific window of time. They compare any dataset against the profile database; if there is a match then it is normal. Otherwise they keep a counter that tracks the support of the itemset. If the support exceeds a certain threshold, the itemset is flagged as suspicious. Then the classifier labels that suspicious dataset as a known attack, a false alarm, or an unknown attack.

Phase 2:

In this phase the classifier is already trained and can categorize any attack as known, unknown, or a false alarm. Here also the same dynamic on-line algorithm is used to produce suspicious data, and along with the feature selection module and the profile, those suspicious data are sent to the trained classifier. The classifier then outputs which kind of attack it is. If it is a false alarm, the classifier excludes it from the attack list and does not send it to the system monitoring officer.

5. DETECTING DENIAL-OF-SERVICE ATTACKS WITH INCOMPLETE AUDIT DATA:

SCAN (Stochastic Clustering Algorithm for Network anomaly detection) is an unsupervised network anomaly detection scheme. At the core of SCAN is a stochastic clustering algorithm, which is based on an improved version of the Expectation-Maximization (EM) algorithm. The EM algorithm's speed of convergence was accelerated by using a combination of Bloom filters, data summaries and associated arrays. The clustering approach that they have proposed can be used to process incomplete and unlabelled data. The salient features of SCAN are:

Improvements in speed of convergence: They use a combination of data summaries, Bloom filters and arrays to accelerate the rate of convergence of the EM algorithm.

Ability to detect anomalies in the absence of complete audit data: They use SCAN to detect anomalies in network traffic in the absence of complete audit data. The improved EM clustering algorithm, which forms the core of SCAN, is used to compute the missing values in the audit dataset. The advantage of using the EM algorithm is that unlike other imputation techniques, EM computes the maximum likelihood estimates in parametric models based on prior information.

System Model and Design:
SCAN has three major components:
1) Online Sampling and Flow Aggregation
2) Clustering and Data Reduction
3) Anomaly Detection
These three parts are described in detail below.

Figure: SCAN System Model
Online Sampling and Flow Aggregation:
In this module, network traffic is sampled and classified into flows. In the context of SCAN, a flow is all the connections with a particular destination IP address and port number combination. The measurement and characterization of the sampled data proceeds in time slices. They processed the sampled packets into connection records. Because of their compact size as compared to other types of records (e.g., packet logs), connection records are appropriate for data analysis. Connection records provide the following fields:
(SrcIP, SrcPort, DstIP, DstPort, ConnStatus, Proto, Duration)

Figure: Online Sampling and Flow Aggregation

Each field can be used to correlate network traffic data and extract intrinsic features of each connection. They employed a very simple flow aggregation algorithm. Bloom filters and associated arrays are used to store, retrieve and run membership queries on incoming flow data in a given time slice. When a packet is sampled, they first identify the flow the packet belongs to by executing a membership query in the Bloom filter based on the flow ID. If an existing flow ID cannot be found, a new flow ID is generated based on the information from the connection record; this is called the insertion phase. If the flow ID exists, the corresponding flow array is updated. If the flow ID does not exist, they run the flow ID through each of the k hash functions (that form a part of the Bloom filter) and use each result as an offset into the bit vector, turning on the bit found at that position. If the bit is already set, they leave it on.

Clustering and Data Reduction:
Clustering is a data mining technique which can group similar types of data in the same cluster. The clustering technique in intrusion detection works as follows: when incoming data arrive, it is likely that most of them will be normal traffic, so the bigger a cluster is, the less likely it is to be an intrusion. The goal of clustering is to divide flows into natural groups. The instances contained in a cluster are considered to be similar to one another according to some metric based on the underlying domain from which the instances are drawn. A stochastic clustering method (such as the EM algorithm) assigns a data point to each group with a certain probability. Such statistical approaches make sense in practical situations where no amount of training data is sufficient to make a completely firm decision about cluster membership.

Stochastic clustering algorithms aim to find the most likely set of clusters given the training data and prior expectations. The underlying model used in the stochastic clustering algorithm is called a finite mixture model. A mixture is a set of probability distributions, one for each cluster, that models the attribute values of the members of that cluster.

EM Algorithm based Clustering Algorithm:
The input to the EM algorithm is the dataset and initial estimates, and it outputs the clustered set of points. It starts with an initial guess for the parameters of the mixture model for each cluster and then iteratively applies a two-step process:
1. Expectation step
2. Maximization step
In the E-step, the EM algorithm first finds the expected value of the complete-data log-likelihood, given the observed data and the parameter estimates.

In the M-step, the EM algorithm maximizes the expectation computed in the E-step.

The two steps of the EM algorithm are repeated until the increase in the log-likelihood of the data, given the current model, is less than the accuracy threshold ε.
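A minimal one-dimensional, two-component Gaussian version of these E and M steps can be sketched as follows. The data, initial guesses, and fixed unit variances are hypothetical simplifications, not SCAN's actual model:

```python
import math

# Minimal EM sketch for a two-component 1-D Gaussian mixture with unit
# variances: the E-step computes responsibilities, the M-step updates the
# means and mixing weight, looping until the log-likelihood gain < eps.

def em_two_gaussians(xs, mu=(-1.0, 1.0), w=0.5, eps=1e-6, max_iter=200):
    mu0, mu1 = mu
    def pdf(x, m):
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
    def loglik():
        return sum(math.log(w * pdf(x, mu0) + (1 - w) * pdf(x, mu1)) for x in xs)
    prev = loglik()
    for _ in range(max_iter):
        # E-step: responsibility of component 0 for each point.
        r = [w * pdf(x, mu0) / (w * pdf(x, mu0) + (1 - w) * pdf(x, mu1)) for x in xs]
        # M-step: re-estimate the means and mixing weight from responsibilities.
        n0 = sum(r)
        mu0 = sum(ri * x for ri, x in zip(r, xs)) / n0
        mu1 = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n0)
        w = n0 / len(xs)
        cur = loglik()
        if cur - prev < eps:  # stop once the likelihood gain drops below eps
            break
        prev = cur
    return mu0, mu1, w
```

The soft responsibilities are what let the same machinery impute missing attribute values, as SCAN does: an unobserved value is replaced by its expectation under the current mixture parameters.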
converge to a local maximum, which may or may
not be the same as the global maximum. Handling Missing Data in a Dataset:
Typically, the EM algorithm is executed multiple There are three ways to handle missing data in
times with different initial settings for the connection records:
parameter values. Then a given data point is Missing Completely At Random (MCAR)
assigned to the cluster with the largest log- Missing At Random (MAR)
likelihood score. Non-ignorable.
The technique of using EM algorithm to handle
missing data has been discussed in another paper
in details and is out of scope of this paper.
However the technique in summary is: In the E
step their system computes the expected value of
the complete data‘s log likelihood value based on
the complete data cases and in M step the system
inserts that value into that incomplete dataset.

Figure: EM Algorithm

Data Summaries for Data Reduction:
After the EM algorithm outputs a clustered set of data points, this step is executed to build a summary for each time slice and reduce the data set. Basically, this is an enhancement they have presented in their report. Their enhancement consists of three stages:
1. At the end of each time slice they made a pass over the connection dataset and built data summaries. Data summaries are important in the EM algorithm framework as they reduce the I/O time by avoiding repeated scans over the data. This enhances the speed of convergence of the EM algorithm.
2. In this stage, they made another pass over the dataset to build cluster candidates using the data summaries collected in the previous step. After the frequently appearing data have been identified, the algorithm builds all cluster candidates during this pass. The clusters are built and hashed into a Bloom filter.
3. In this stage, clusters are selected from the set of candidates.

Anomaly Detection:
In this model, elements of the network flow that are anomalous are generated from the anomalous distribution (A), and normal traffic elements are generated from the normal distribution (N). Therefore, the process of detecting anomalies involves determining whether the incoming flow belongs to distribution A or distribution N. They calculated the likelihood of the two cases to determine the distribution to which the incoming flow belongs. They assumed that for normal traffic there exists a function FN which takes as parameters the set of normal elements Nt at time t and outputs a probability distribution PNt over the data generated by distribution N. Similarly, they have a function FA for anomalous elements which takes as parameters the set of anomalous elements At and outputs a probability distribution PAt. Assuming that p is the probability with which an element xt belongs to N, the likelihood Lt of the distribution T at time t is defined as

Lt(T) = p^|Nt| · ∏_{xi∈Nt} PNt(xi) · (1 − p)^|At| · ∏_{xj∈At} PAt(xj)

The likelihood values are often calculated by taking the natural logarithm of the likelihood rather than the likelihood itself, because likelihood values are often very small numbers that are close to zero. They calculated the log-likelihood as follows:

LLt(T) = |Nt| ln p + Σ_{xi∈Nt} ln PNt(xi) + |At| ln(1 − p) + Σ_{xj∈At} ln PAt(xj)

They measured the likelihood of each flow, xt, belonging to a particular group by comparing the difference LLt(T) − LLt−1(T). If this difference is greater than some value α, the flow is declared anomalous. They repeat this process for every flow in the time interval.
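The likelihood-difference test described above can be sketched in a few lines. This is an illustration rather than the authors' implementation: PNt and PAt are modeled here as simple categorical distributions re-estimated from the current partition, and the values chosen for p and α are arbitrary.

```python
import math
from collections import Counter


def log_likelihood(normal, anomalous, p=0.95):
    """LLt(T) = |Nt| ln p + sum ln PNt(xi) + |At| ln(1-p) + sum ln PAt(xj).

    PNt and PAt are stand-in categorical distributions estimated by
    counting element frequencies within each set.
    """
    def cat_logprob(elems):
        counts = Counter(elems)
        total = len(elems)
        return sum(c * math.log(c / total) for c in counts.values())

    return (len(normal) * math.log(p) + cat_logprob(normal)
            + len(anomalous) * math.log(1 - p) + cat_logprob(anomalous))


def detect(flows, alpha, p=0.95):
    """Flag a flow as anomalous if moving it from N to A raises the
    log-likelihood by more than alpha (LLt(T) - LLt-1(T) > alpha)."""
    normal, anomalous, flagged = list(flows), [], []
    for x in flows:
        ll_before = log_likelihood(normal, anomalous, p)
        normal.remove(x)          # hypothesis: x was generated by A
        anomalous.append(x)
        ll_after = log_likelihood(normal, anomalous, p)
        if ll_after - ll_before > alpha:
            flagged.append(x)     # keep x in A
        else:
            anomalous.pop()       # revert: x stays in N
            normal.append(x)
    return flagged
```

In practice SCAN derives these probabilities from its EM-based cluster model; the categorical stand-in above only shows the mechanics of moving an element from N to A and comparing log-likelihoods.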
6. TESTING METHODOLOGY:

Paper I:
In this paper the authors did not use any testing methodology. They described different kinds of data mining techniques and rules to implement various kinds of data mining based IDS.

Paper II:
The authors of this paper used the MIT Lincoln Lab 1999 intrusion detection evaluation (IDEVAL) data sets. From this data set they over-sampled the labeled attack sessions and created a new data set that contained a mix of 43.6% attack sessions and 56.4% non-attack sessions. After refining their network model using this data, they also collected hours of traffic from the University of Central Florida (UCF) network, looked at the TCP/IP header information, and ran that information against their newly created model. Each day of their UCF data set contained 20-30 million packets, averaging 300 packets captured per second.

Packet-Based Results
They scored the UCF data with the packet-based network model and found that approximately 2.5% of the packets were identified as having a probability of 1.0000 of being an attack packet.

Conversely, a packet that scored a probability of 0.0000 does not necessarily mean that packet is a "good" packet and poses no threat to their campus networks. The following figure shows that more than seventy percent of the packets captured have an attack probability of 0.0000 and ninety-seven percent of the packets have an attack probability of 0.5000 or less.

Overall, out of the approximately 500,000 packets with a 1.0000 probability, there are at least 50,000 packets that require further study. Retraining of their model and readjusting the model's prior probabilities will allow them to see if those remaining packets are truly attack packets or just simply false alarms.

Session-Based Results
They also scored the UCF data with the session-based network model and found that approximately 32.9% of the sessions were identified as having a probability of 1.0000 of being an attack session.

Conversely, a session that scored a probability lower than 1.0000 does not necessarily mean that session is a "good" session and poses no threat to their campus networks. The vast majority of the sessions captured had a low or non-existent probability of being an attack session. Their studies showed that more than sixty-six percent of the sessions captured have an attack probability of 0.0129.

Most of the anomalous sessions that were tagged in the high-probability area were web-based port 80 traffic. Overall, out of the approximately 30,000 sessions that were highly probable attack sessions, there are at least 10,000 sessions that require further study. Retraining of their model and readjusting the model's prior probabilities will allow them to see if those remaining sessions are truly attack sessions or just simply false alarms.
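The packet-scoring idea can be illustrated with a small sketch. The authors used a Bayesian-network model trained on TCP/IP header information; the naive Bayes classifier below is a simplified stand-in, and the header fields and training data are invented for the example.

```python
from collections import Counter, defaultdict


class NaiveBayesPacketScorer:
    """Toy naive Bayes scorer over categorical packet-header fields.

    Stand-in for the paper's Bayesian-network model: it learns
    P(field=value | attack) and P(field=value | normal) from labeled
    packets and returns an attack probability per packet.
    """

    def fit(self, packets, labels):
        # packets: list of dicts of header fields; labels: 1 attack, 0 normal
        self.priors = Counter(labels)
        self.n = len(labels)
        self.cond = defaultdict(Counter)  # (label, field) -> value counts
        for pkt, y in zip(packets, labels):
            for field, value in pkt.items():
                self.cond[(y, field)][value] += 1
        return self

    def attack_probability(self, pkt):
        # P(y | pkt) proportional to P(y) * product of P(field=value | y),
        # with add-one smoothing to avoid zero probabilities
        score = {}
        for y in (0, 1):
            p = self.priors[y] / self.n
            for field, value in pkt.items():
                counts = self.cond[(y, field)]
                p *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
            score[y] = p
        return score[1] / (score[0] + score[1])
```

Retraining the model, as the authors suggest, amounts to calling `fit` again with the newly labeled packets folded into the training set, which also updates the prior probabilities.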
Paper III:
The authors of this paper evaluated SCAN using the 1999 DARPA intrusion detection evaluation data. The dataset consists of five weeks of TCPDump data. Data from weeks 1 and 3 consist of normal attack-free network traffic. Week 2 data consists of network traffic with labeled attacks. The week 4 and 5 data are the "Test Data" and contain 201 instances of 58 different unlabelled attacks, 177 of which are visible in the inside TCPDump data. They trained SCAN on the DARPA dataset using week 1 and 3 data, then evaluated the detector on weeks 4 and 5. Evaluation was done in an off-line manner.
Simulation Results with Complete Audit Data
An IDS is evaluated on the basis of accuracy and efficiency. To judge the efficiency and accuracy of SCAN, they used Receiver Operating Characteristic (ROC) curves. An ROC curve, which graphs the rate of detection versus the false positive rate, captures the trade-off between detecting real attacks and raising false alarms.
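As a sketch of how such a curve is traced from a detector's anomaly scores (an illustration, not SCAN's code): sweep a threshold over the scores and, at each setting, record the detection rate against the false positive rate.

```python
def roc_points(scores, labels):
    """Return (false positive rate, detection rate) pairs for an ROC curve.

    scores: higher means more anomalous; labels: 1 = attack, 0 = normal.
    Each distinct score is tried as a threshold, from strictest to loosest.
    """
    pos = sum(labels)            # number of attack instances
    neg = len(labels) - pos      # number of normal instances
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```

A detector whose curve rises toward the top-left corner (high detection rate at a low false positive rate) dominates one whose curve hugs the diagonal.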
The performance of SCAN at detecting an SSH Process Table attack is shown in Fig. 5. This attack is similar to the Process Table attack in that
the goal of the attacker is to cause the SSHD
daemon to spawn the maximum number of
processes that the operating system will allow.
In Fig. 6 the performance of SCAN at detecting a SYN flood attack is evaluated. A SYN flood attack is a type of DoS attack. It occurs when an attacker makes a large number of half-open connections to a given port of a server during a short period of time. This causes the server's queue of half-open connections to overflow. Due to this overflow, the system will be unable to accept any new incoming connections till the queue empties. They compared the performance of SCAN with that of the K-Nearest Neighbors (KNN) clustering algorithm. Comparison of the ROC curves seen in Figs. 5 and 6 suggests that SCAN outperforms the KNN algorithm. The two techniques are similar to each other in that both are unsupervised techniques that work with unlabelled data.

Paper IV:
The authors of this paper discussed that ADAM participated in the DARPA 1999 intrusion detection evaluation. It focused on detecting DOS and PROBE attacks from tcpdump data and performed quite well. The following Figures 3 and 4 show the results on the DARPA 1999 test data.
Figures 5 and 6 show the results of the 1999 DARPA Intrusion Detection Evaluation, administered by MIT Lincoln Labs. ADAM entered the competition and performed extremely well. The authors aimed to detect intrusions of the type Denial of Service (DOS) and Probe attacks. In those categories, as shown in Figure 5, ADAM ranked third, after EMERALD and the University of California Santa Barbara's STAT system. The attacks detected by those systems and missed by ADAM were those that fell below ADAM's thresholds, being attacks that usually involved only one connection and that can best be identified by signature-based systems. ADAM came very close to EMERALD in overall attack identification, as shown in Figure 6.

7. CONCLUSION:

In this report we have studied the details of four papers in this area. We tried to summarize those four papers, their system models, their technologies, and their validation methods. We did not go through all the cross-references given in those papers; rather, we kept the scope of this paper limited to these four papers only. We strongly believe that this paper will give the reader a view of what is currently developing in this area and how data mining is evolving in the field of network intrusion detection.
REFERENCES:
 Chang-Tien Lu, Arnold P. Boedihardjo, Prajwal Manalwar, "Exploiting Efficient Data Mining Techniques to Enhance Intrusion Detection Systems", Proceedings of the 2005 IEEE International Conference on Information Reuse and Integration (IRI-2005).
 Caulkins, B.D.; Joohan Lee; Wang, M., "Packet- vs. session-based modeling for intrusion detection systems", Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC 2005), 4-6 April 2005, Vol. 1, pp. 116-121. Digital Object Identifier 10.1109/ITCC.2005.222.
 Patcha, A.; Park, J.-M., "Detecting denial-of-service attacks with incomplete audit data", Proceedings of the 14th International Conference on Computer Communications and Networks (ICCCN 2005), 17-19 Oct. 2005, pp. 263-268.
 Daniel Barbara, Julia Couto, Sushil Jajodia, Leonard Popyack, Ningning Wu, "ADAM: Detecting Intrusions by Data Mining", Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001.