Using Analytics to Enhance Intrusion Detection Systems
by
Jamie Sullivan
Final-Year Project - BSc in Computer Science
Supervisor: Prof. Gregory Provan
Second Reader: Dr. Derek Bridge
Department of Computer Science
University College Cork
April 2016
Abstract
Network Intrusion Detection Systems suffer from high false-alarm rates; this
project uses a predictive analytics approach to study how to reduce false
alarms. Predictive analytics combines system models with data to predict the
most likely actions of intruders, in order to distinguish real intrusion
signatures (triggering of alarms) from random signatures. The project combines
learning from data with simulation models for signature analysis.
The project provides both an experimental study of how predictive analytics
can be combined with machine learning algorithms/techniques and a software
simulation of how the system is intended to operate, built using open-source
software.
Predictive analysis is carried out on the KDD Cup '99 dataset using both
supervised learning (classification) and unsupervised learning (clustering)
algorithms to create simulation models, with the aim of reducing false alarms
(false positives) by increasing the predictive accuracy of these algorithms.
Declaration of Originality
In signing this declaration, you are confirming, in writing, that the submitted
work is entirely your own original work, except where clearly attributed
otherwise, and that it has not been submitted partly or wholly for any other
educational award.
I hereby declare that:
- this is all my own work, unless clearly indicated otherwise, with full and
proper accreditation;
- with respect to my own work: none of it has been submitted at any
educational institution contributing in any way towards an educational
award;
- with respect to another’s work: all text, diagrams, code, or ideas, whether
verbatim, paraphrased or otherwise modified or adapted, have been duly
attributed to the source in a scholarly manner, whether from books,
papers, lecture notes or any other student’s work, whether published or
unpublished, electronically or in print.
Name: Jamie Sullivan
Signed: _______________________________
Acknowledgements
I am using this opportunity to express my gratitude to everyone who supported
me throughout the course of this project. I am thankful for their inspiring
guidance, invaluably constructive criticism and friendly advice during the project
work. I am sincerely grateful to them for sharing their truthful and illuminating
views on a number of ideas related to the project.
I would like to express my sincere gratitude to my supervisor Prof. Gregory
Provan for the continuous support during the course of this project, for his
patience, motivation, enthusiasm, and immense knowledge. His guidance helped
me in all aspects of the project. I could not have imagined having a better advisor
and mentor for my project.
Thank you to everyone in lab 1.09, for the stimulating discussions, fantastic
atmosphere and work environment over the past several months.
Last but not least, I would like to thank my friends and family for all the
support, guidance and advice throughout the entire process of this project, and
especially over my four years here in University College Cork.
1 Introduction
With a growing level of intrusions on the internet and on local networks, a
vast amount of work is being invested in intrusion detection. Intrusion
detection systems coupled with intrusion prevention systems work to stop these
attacks. Unfortunately, Intrusion Detection Systems (IDSs) suffer from high
false-alarm rates; these false alarms, known as false positives, result in
genuine packets being flagged as attacks, which is not ideal as it drowns out
legitimate attack detections [1]. Another problem in IDSs is false negatives,
where the system fails to identify a packet as an attack; this project will
focus on reducing false positives.
This project will attempt to use predictive analytics to study how to reduce
false alarms. Predictive analytics is an analytical method used to make
predictions about unknown future events, using techniques like machine
learning, data mining, modelling and statistics to analyse stored data and
make accurate predictions about future data [2]. Machine learning algorithms
and data mining techniques, combined with statistical methods, will be applied
to the training dataset to create predictive models, improving the accuracy of
the models and thus reducing false positives. Both misuse-based and
anomaly-based intrusion detection techniques will be implemented to
investigate the effects of the false-positive reduction method researched.
The predictive analytics approach researched in this project will be
implemented on the free-to-use KDD Cup 99 dataset, and research will also be
done on the dataset itself to see whether it is an effective dataset for
intrusion detection. Free-to-use data is always of interest when performing a
research project or developing a system, as it is cost effective and the
resulting model/system can be implemented by others using the same
free-to-use data.
The project was split into two sections: the first an experiment to research
the capabilities and effects of predictive analytics for intrusion detection,
the second the development of a system for intrusion detection. Both the
experiment and development sections were implemented using the open source
Weka data mining tool, which will be discussed in later sections. The
experiment was carried out first, followed by the development of the system
using the model that returned the best results. The goal of the developed
system is an easy-to-use and easy-to-interpret IDS; it will not have intrusion
prevention functionality, but the system should be able to handle
real-time/near real-time intrusion detection and be scalable to all network
types.
In the end the experiment showed that predictive analytics can be implemented
to improve detection accuracy and reduce false positives. This was done by
using predictive machine learning/data mining algorithms combined with
statistical methods to filter the data in the KDD Cup 99 dataset. Although the
system developed did not have real-time/near real-time intrusion detection
functionality, a proposed method of implementation is discussed as future
work; the system did provide a simulation of how intrusion detection is
performed using the predictive analytic methods researched in the experiment.
The experiment also revealed the downfalls of the KDD Cup 99 dataset, which
have been investigated in other papers, but showed the dataset still has merit
as an effective dataset for use in research, especially in cases where the
data is skewed.
2 Analysis
Based on the project abstract and introduction, the objectives of this project
are to complete adequate research in the field of Intrusion Detection Systems,
use predictive analytics to increase classification accuracy and reduce false
positives, and implement the method researched in an IDS. The research
experiment and development will be carried out on the UCC Computer Science
department lab machines, which have 8GB RAM and an Intel Duo 3.00 GHz CPU.
Before research and development can begin for this project, research must be
done on the background ideologies:
Machine Learning
o Methods of pattern detection/recognition which implement
computational learning based on the data patterns [3].
Data Mining
o The process of gaining insightful knowledge from data in a
knowledgebase/dataset [4].
Predictive Analytics
o Analytical method which is used to make predictions on unknown
future events [2].
False-Positives
o Packet data that is flagged as an attack intrusion but is actually a
legitimate normal packet.
False-Negatives
o Packet data that is flagged as a normal packet but was actually a
malicious attack packet.
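These two error counts can be read directly off a binary confusion matrix. A minimal sketch, using invented labels and counts rather than figures from the experiment:

```python
# Count true/false positives and negatives from labelled predictions.
# 'attack' is treated as the positive class; all data here is illustrative.

def confusion_counts(actual, predicted, positive="attack"):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1  # false positive: normal packet flagged as an attack
        else:
            if a == positive:
                fn += 1  # false negative: attack packet flagged as normal
            else:
                tn += 1
    return tp, fp, tn, fn

actual    = ["attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "attack", "normal", "normal", "normal"]
print(confusion_counts(actual, predicted))  # (1, 1, 2, 1)
```

Reducing the `fp` count without letting `fn` grow is exactly the trade-off the experiment measures.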
The research experiment will implement algorithms such as Naïve Bayes, J48,
Random Forest, EM and K-Means; these algorithms were chosen because together
they cover a large range of functionality and different machine learning
approaches. The individual algorithms will be discussed further in sections
3.1.2 and 3.1.3. Data mining and statistical methods such as Discretization
and Standardization will be applied to the dataset to optimise the data
retrieved from it. The IDS being developed will use a Java architecture and
requirements similar to those discussed in [5], but will also use the
open-source data mining/machine learning tools discussed in later sections.
This project aims to discover and implement an effective means of improved
network intrusion detection that, when implemented, will reduce false
positives and improve classification accuracy while not creating massive
overhead that would cause network congestion, in turn making the proposed
improvement method scalable to both small and large networks.
3 Design
3.1 Experimental Design
The largest portion of this project was researching how to improve Intrusion
Detection using analytics, with research carried out on two machine learning
techniques: Classification algorithms and Clustering algorithms, discussed below.
3.1.1 Knowledgebase/Dataset
The knowledgebase/dataset that will be used in this experiment is the KDD Cup
99 dataset, originally used for "The Third International Knowledge Discovery
and Data Mining Tools Competition". The competition's task was to build a
network intrusion detector: a predictive model capable of distinguishing
between intrusions (attacks) and normal connections [6]. With this task being
very similar to the task set out in this project, the KDD Cup 99 dataset is a
good fit for the experiment.
For the experiment a pre-made 10% section of the entire KDD Cup 99 dataset
will be used. Due to the memory limitations of the machines used in the
experiment, a subsection of this 10% split was taken: an 8% version of the
KDD Cup 99 dataset with a total of 298,497 instances, 42 attributes and 23
distinct class values. The 10% training dataset has 19.69% normal and 80.31%
attack connections [7]; the breakdown of attacks is shown below:
Denial of Service Attack (DOS)
o Attacks that try to block legitimate connection requests by making
computing or memory resources too busy to serve them.
o ‘back’, ‘land’, ‘neptune’, ‘pod’, ‘smurf’ and ‘teardrop’
User to Root Attack (U2R)
o The attacker has access to a normal user account and uses these
attacks to try to gain root-level access.
o ‘buffer_overflow’, ‘loadmodule’, ’perl’, and ‘rootkit’
Remote to Local Attack (R2L)
o The attacker does not have access to an account on the machine;
these attacks exploit a vulnerability over the network to gain
access to the machine.
o ‘ftp_write’, ‘guess_passwd’, ‘imap’, ‘multihop’, ‘phf’, ‘spy’,
‘warezclient’ and ‘warezmaster’
Probing Attack (PROBE)
o Attacks try to gain information about the network in order to use
this information to get past its security.
o ‘ipsweep’, ‘nmap’, ‘portsweep’ and ‘satan’
As stated above, the ratio of normal to attack connections in the dataset is
19.69% to 80.31% respectively; the dataset is thus heavily skewed towards
attack connections. The dataset also favours records of DOS attack instances
such as 'neptune' and 'smurf' over more harmful attacks such as U2R and R2L
(which in practical use are more desirable in a training dataset), which can
bias the results towards frequent-record detection methods. This is not ideal
for practical use, but for this experiment the focus is more on the analytical
methods which can be implemented to improve the baseline result (the result
obtained without the analytical method applied). The downfalls of the KDD Cup
99 dataset are discussed in [8]. Despite these downfalls, the KDD Cup 99
dataset is still an effective dataset for the purpose of this project.
Finally the KDD Cup 99 dataset contains the following connection protocols [7]:
TCP: Reliable connection-oriented protocol
UDP: Unreliable and connection-less protocol
ICMP: Control-message protocol used for error reporting and diagnostics between networked computers
3.1.2 Classification Algorithms Approach
Classification is a supervised machine learning technique that assigns
instances in a collection to target classes; the end goal of a classifier is
to accurately predict the target class for each case in the data. The
following sections discuss the classification algorithms that will be used and
the analytical approach to reduce false positives.
3.1.2.1 Analytical approach – Discretization
Discretization is the process of converting continuous attribute values into
nominal attribute intervals, giving a smaller set of attribute values.
Discretization has been shown to improve prediction accuracy in previous works
[9], as intervals are a more concise representation of the data and so are
easier to use and comprehend than continuous values. Discretization is the
planned analytical approach for attempting to improve the results of the
algorithms discussed below.
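As a rough sketch of the underlying idea (equal-width binning is shown here for simplicity; the supervised filter used later in the experiment instead chooses cut-points using the class labels):

```python
# Equal-width discretization: map each continuous value to a nominal bin label.
def discretize(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1       # guard against all-equal values
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append(f"bin{idx}")
    return labels

durations = [0.0, 1.2, 3.5, 7.9, 10.0]    # e.g. a continuous 'duration' attribute
print(discretize(durations, 2))           # ['bin0', 'bin0', 'bin0', 'bin1', 'bin1']
```

The classifier then sees a handful of interval labels instead of an unbounded range of numbers.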
3.1.2.2 Naïve Bayes
The Naïve Bayes algorithm is based on conditional probabilities and uses
Bayes' Theorem: it finds the probability of an event occurring given the
probability of another event that has already occurred. In the case of this
experiment, it finds the probability of a connection being either a normal or
an attack connection based on the previously calculated probabilities of the
connections in the knowledgebase/dataset.
Figure 1: Bayes' Theorem, P(A|B) = P(B|A) * P(A) / P(B).
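A small worked illustration of the theorem on one nominal attribute; the counts are invented for the example, not drawn from the KDD Cup 99 dataset:

```python
from fractions import Fraction as F

# Toy training counts for one attribute (illustrative numbers only):
# 8 of 10 connections are attacks; 6 of the 8 attacks use protocol 'icmp',
# and 1 of the 2 normal connections uses 'icmp'.
p_attack = F(8, 10)
p_icmp_given_attack = F(6, 8)
p_icmp_given_normal = F(1, 2)

# Total probability of seeing 'icmp' at all.
p_icmp = p_icmp_given_attack * p_attack + p_icmp_given_normal * (1 - p_attack)

# Bayes' Theorem: probability the connection is an attack, given 'icmp'.
p_attack_given_icmp = p_icmp_given_attack * p_attack / p_icmp
print(p_attack_given_icmp)  # 6/7
```

The full Naïve Bayes classifier simply multiplies such per-attribute likelihoods together under the 'naïve' assumption that attributes are independent given the class.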
In this experiment, Naïve Bayes will be the main algorithm of focus: it works
surprisingly well as a classification algorithm given its 'naïve' independence
assumptions, remaining competitive against more elaborate classifiers, yet it
performs worse than the two decision tree algorithms discussed in the
following sections. For these two reasons, the results of training/testing the
Naïve Bayes classifier can be used as the baseline for the experiment, with
the analytical approach discussed in section 3.1.2.1 then applied to try to
bring the classifier's results as close as possible to the superior results of
the decision tree algorithms.
3.1.2.3 J48
The J48 algorithm is a decision tree algorithm; it is an open source Java
implementation of the C4.5 algorithm, which builds a classification tree based
on the principle of information entropy. In this experiment, research will be
carried out to see how the J48 algorithm compares to the Naïve Bayes
classifier and to its fellow decision tree algorithm Random Forest. The
analytical method used to try to improve the results of the Naïve Bayes
algorithm will also be applied to J48 to see whether it has the same effect.
3.1.2.4 Random Forest
Random Forest is another decision tree algorithm; it uses an ensemble learning
method, building a classifier by combining tree predictors constructed from
random vectors [10]. Like J48, the results of the Random Forest classifier
will be compared with those of Naïve Bayes and J48, and as with Naïve Bayes
and J48, the Random Forest algorithm will also be implemented with the
analytical method discussed above and the outcomes compared.
3.1.3 Clustering Algorithms Approach
Clustering is an unsupervised machine learning technique that groups data into
'clusters' based on likeness. Unlike classification, clustering algorithms can
work on unlabelled datasets, i.e. datasets without a class attribute value.
Research on the performance of clustering algorithms will be carried out on
the K-Means and EM algorithms; comparisons will be made between the results of
both algorithms, with and without the planned analytical approach for improved
performance applied.
3.1.3.1 Analytical Improvement – Standardize & Discretization
Standardization is a technique that transforms the dataset to have a mean of 0
and unit variance of 1 [11]. As with the classification algorithms,
Discretization will also be applied to the clustering algorithms so as to
observe the effects it has on their results, although in the case of the
clustering algorithms an unsupervised discretization filter will be used.
Both these analytical approaches will be implemented on the two algorithms
below to investigate if there is an improvement of their results.
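A minimal sketch of the standardization step (population variance is used here for simplicity; this illustrates the technique itself, not Weka's exact filter implementation):

```python
import math

# Z-score standardization: rescale an attribute to mean 0 and unit variance.
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    std = math.sqrt(var) or 1                       # guard against constant attributes
    return [(v - mean) / std for v in values]

scores = standardize([2.0, 4.0, 6.0])
print([round(s, 3) for s in scores])  # [-1.225, 0.0, 1.225]
```

This keeps attributes measured on very different scales (e.g. byte counts versus durations) from dominating the distance calculations inside the clustering algorithms.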
3.1.3.2 K-Means
The K-Means clustering algorithm aims to partition n objects/observations from
the dataset into k clusters, where each object/observation belongs to the
cluster with the nearest mean [12]. In simple terms, K-Means assigns k
centroids (centre points of clusters) that are used to define the clusters; an
instance belongs to a particular cluster if it is closer to that cluster's
centroid than to any other centroid, much like nearest neighbour [13].
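The assign-then-update loop described above can be sketched as follows (a one-dimensional toy with invented points, not the Weka SimpleKMeans implementation):

```python
import random

# Minimal 1-D K-Means: assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points.
def kmeans(points, k, iterations=10, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)   # crude random initialisation
    for _ in range(iterations):
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans(points, 2))  # centroids converge near 1.0 and 9.5
```

On real connection records the distance would be computed over many standardized attributes instead of a single number, but the loop is the same.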
3.1.3.3 EM
The EM (Expectation Maximization) algorithm evaluates clusters in two stages:
first calculating the expectation of the log-likelihood, then computing the
parameters that maximize the log-likelihood calculated in the previous step,
which are used to estimate the distribution. This algorithm has proven useful
on real-world datasets and has been applied after K-Means in other work [12],
so the comparison of results will be of interest.
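The two stages can be sketched for a one-dimensional mixture of two Gaussians (an illustrative toy, far simpler than Weka's EM, which handles many attributes and can select the number of clusters itself):

```python
import math

# Compact EM sketch for a 1-D mixture of two Gaussians (illustrative only).
def em(points, iterations=25):
    mu = [min(points), max(points)]          # crude initialisation
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]

    def pdf(x, m, s):
        return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in points:
            p = [weight[k] * pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate parameters from the responsibilities.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, points)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, points)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
            weight[k] = nk / len(points)
    return mu

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
print(em(points))  # component means converge near 1.0 and 8.0
```

Where K-Means makes a hard nearest-centroid assignment, EM keeps soft per-point membership probabilities, which is why the two are often compared on the same data.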
3.2 Software Design
For this project, a simulation of how the proposed Intrusion Detection System
would operate using the improvements researched was needed; the following
sections go into greater detail on the high-level architecture.
Figure 2: IDS Software Architecture Overview [14]
Figure 3: Software Architecture Overview of IDS Simulator
3.2.1 Packet Sniffer
The packet sniffer captures the network traffic, filters it for the particular
traffic of interest, then stores the data in a buffer. The captured packets
are then analysed/decoded in real time or near real time [15]. Possible packet
sniffer tools are discussed below:
SNORT [16]
o An open source intrusion prevention system capable of real-time
traffic analysis and packet logging; it is libpcap-based and
rule-based. In this project the full set of services Snort offers
would not be implemented, only a subsection: Snort's packet
sniffing/capture capabilities in Packet Logger Mode [17].
TCPDUMP [18]
o A powerful command-line tool that allows you to sniff/capture
network packets, much like Snort it is libpcap-based.
Scapy [19]
o An interactive packet manipulation program. The features of
interest in this project are the ability to decode packets of a
wide range of protocols and to capture them.
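Whichever tool performs the capture, the raw bytes must afterwards be decoded field by field. The decoding step can be illustrated by parsing the fixed 20-byte IPv4 header from a hand-built byte string with the standard library (no live capture involved; the addresses and field values are invented):

```python
import socket
import struct

# Decode the fixed 20-byte part of an IPv4 header.
def parse_ipv4_header(raw):
    fields = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": fields[0] >> 4,
        "header_len": (fields[0] & 0x0F) * 4,  # in bytes
        "total_len": fields[2],
        "ttl": fields[5],
        "protocol": fields[6],                 # 6 = TCP, 17 = UDP, 1 = ICMP
        "src": socket.inet_ntoa(fields[8]),
        "dst": socket.inet_ntoa(fields[9]),
    }

# A hand-crafted sample header: version 4, TTL 64, protocol TCP,
# 10.0.0.1 -> 10.0.0.2 (illustrative values, not captured traffic).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
print(parse_ipv4_header(sample))
```

Fields such as the protocol number and endpoint addresses are exactly the kind of per-connection attributes the later stages of the pipeline consume.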
3.2.2 Data Pre-Processor
Data in the real world needs to be pre-processed, as it can be incomplete
(lacking attribute values or attributes of interest), inconsistent (containing
discrepancies) and error-prone. The data therefore needs to be passed through
a data pre-processor to 'clean up' the data; otherwise poor quality data would
lead to poor quality results in the later stages of the project.
The open source machine learning software tool Weka will be used to pre-
process the data for use in the IDS Simulation.
3.2.3 Knowledgebase/Dataset
The knowledgebase/dataset being used in this simulation for training is the
KDD Cup 99 dataset; the same number of instances and attributes will be used
as discussed in section 3.1 Experimental Design.
3.2.4 Machine Learning Algorithm
The machine learning algorithm used in the IDS simulation will be selected
from the classification algorithms discussed in section 3.1.2, i.e. Naive
Bayes, J48 and Random Forest. The classification approach was chosen over a
clustering approach because the KDD Cup 99 dataset has predefined labels
(classes), which suits supervised learning. A clustering algorithm could also
be implemented with the KDD Cup 99 dataset and give results of similar
quality, but for this IDS simulation preference was given to classification.
3.2.5 Trained Model
Once training on the data is complete, the trained model will be deployed on
the network and used for comparison against captured incoming packets using
one of the packet sniffer techniques discussed in section 3.2.1; this allows
detection to be carried out in real time/near real time as the captured
packets are compared to the trained model. In the case of an unsupervised
trained model, packet values not seen before would retrain the model so that
similar packets can be detected in future.
3.2.6 Network
The domain of the IDS simulation will be Network based intrusion detection
rather than Host based intrusion detection. The network that will be used for the
simulation is a small scale network (Lab Machine connected to UCC CS network).
Even though testing of the system will be carried out in a small-scale
environment, the system will be scalable to larger networks, as the
technologies used are open source and tested by their development communities.
4 Implementation
This section of the report discusses in detail the implementation of the work
set out in the project brief. Prior to commencing work on this project, the
work environment and tools needed to be set up and installed: first the Java
environment was downloaded and installed (Java is used with Weka and for
development of the IDS simulation); next Python 3 was installed and the Scapy
library downloaded for the implementation of the packet sniffer discussed in
section 3.2.1. Weka was downloaded and installed for use in both the
experiment and development sections of this project.
The dataset was prepared by first downloading the KDD Cup 99 dataset from the
KDD website [6]. The dataset comes in a plaintext format; this was converted
to .arff format by adding the dataset attributes and features, which can be
found on the KDD website.
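The conversion essentially prepends an @relation line and one @attribute declaration per column to the comma-separated records. A small sketch of that step, with invented placeholder attributes rather than the full 42-attribute KDD header:

```python
# Build a minimal ARFF document from plain comma-separated records.
def to_arff(relation, attributes, rows):
    lines = [f"@relation {relation}", ""]
    for name, typ in attributes:          # typ is 'numeric' or a nominal value set
        lines.append(f"@attribute {name} {typ}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

# Illustrative placeholder columns, not the real KDD attribute list.
attrs = [("duration", "numeric"),
         ("protocol_type", "{tcp,udp,icmp}"),
         ("class", "{normal,attack}")]
rows = [(0, "tcp", "normal"), (2, "icmp", "attack")]
print(to_arff("kddcup_sample", attrs, rows))
```

The real conversion is the same idea applied once to the downloaded plaintext file, using the attribute names and types published on the KDD website.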
The work environment and dataset were now ready for work to be carried out
on implementing the research experiment and development stages of the
project.
4.1 Weka
This project was implemented using the Weka open source data mining software
tool, a collection of machine learning algorithms implemented in Java. Weka
contains tools for data pre-processing, classification, clustering and
visualization, making it well suited to the task set out by this project [20].
Weka can be implemented in multiple ways via:
Command Line
Imported weka.jar library
Weka GUI
The Weka algorithm collection contains versions of each of the algorithms
discussed in section 3.1, named in Weka: NaiveBayes, J48, RandomForest, EM and
SimpleKMeans. Weka also contains the pre-processing tools necessary for the
analytical approaches discussed in sections 3.1.2.1 and 3.1.3.1.
The research side of the project was implemented using the Weka GUI for its
ease of use and the tool's ability to display and store created models. The
development side of the project, on the other hand, was implemented using the
imported weka.jar library, which was used in the simulator Java code.
4.2 Experiment
In the Weka GUI, the KDD Cup 99 dataset was loaded into the pre-processor
section of the GUI and prepared for use in the classification algorithm
implementations followed by the clustering algorithm implementations.
4.2.1 Classification Implementation
As discussed in section 3, the results of the Naïve Bayes algorithm will be
used as a baseline for the other algorithms and for comparing the rate of
improvement, if any, from implementing the algorithms again with
Discretization applied.
In the classification tab of the Weka GUI, the NaiveBayes algorithm is
selected in the 'Classifier' section under the folder 'Bayes', which contains
all Bayes-style algorithms. First, a simple test was carried out to see
whether the KDD Cup 99 dataset could be classified without errors, by checking
that a training model could be created without testing it. This was done by
selecting 'Use training set' in the test options field; the model was created
successfully, so the experiment moved on to building and testing a Naïve Bayes
model using 10-fold cross-validation. Cross-validation is another option in
the test options field: it splits the dataset 90/10 (90% for training the
model and 10% for testing it) and does this for 10 iterations, each time
selecting a different 90/10 split [21]. Naïve Bayes with 10-fold
cross-validation results in a classification accuracy of 95.411%; this is now
the benchmark.
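The fold mechanics can be sketched as follows (a plain sequential split for illustration only; Weka also randomises the instance order and stratifies the folds before splitting):

```python
# Split n instances into k disjoint (train, test) folds for cross-validation.
def cross_validation_folds(n, k):
    indices = list(range(n))
    folds = []
    for i in range(k):
        test = indices[i::k]                  # every k-th instance goes to this test fold
        test_set = set(test)
        train = [j for j in indices if j not in test_set]
        folds.append((train, test))
    return folds

folds = cross_validation_folds(10, 5)
train, test = folds[0]
print(test, len(train))  # [0, 5] 8
```

Each instance lands in exactly one test fold, so every record is used for testing once and for training k-1 times, which is what makes the averaged accuracy a fair estimate.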
Figure 4: Naive Bayes Model
Figure 5: Naive Bayes Confusion Matrix
Following on from Naïve Bayes, the J48 algorithm was run under the same
conditions to keep the results fair; 10-fold cross-validation was performed on
the dataset and returned a classification accuracy of 99.9407%.
Figure 6: J48 Model
Figure 7: J48 Confusion Matrix
Like J48 and Naïve Bayes, the Random Forest classification algorithm was run
under the same conditions; 10-fold cross-validation was carried out on the
dataset and the result saved for comparison. The resulting accuracy of the
Random Forest algorithm was 99.9719%, the best of the three classifier
algorithms investigated in this experiment. The goal is to see whether either
of the other algorithms can achieve a competitive result when implemented with
Discretization.
Figure 8: Random Forest Model
Figure 9: Random Forest Confusion Matrix
The results of implementing the classification algorithms are summarised in
the figure below:
Figure 10: Classification Algorithms Accuracy without Discretization (Naïve Bayes 95.411%, J48 99.9407%, Random Forest 99.9719%)
Figure 11: Classification Algorithms False Positive Rate without Discretization (Naïve Bayes 174, J48 81, Random Forest 63 false positives)
Figure 12: Algorithms Classification Time without Discretization (Naïve Bayes 3.2 s, J48 47.93 s, Random Forest 380.87 s)
After all three of the chosen classification algorithms had been implemented,
the next stage was implementing them again, this time on the KDD Cup 99
dataset with Discretization applied. This was done by going back to the
pre-process tab in the Weka GUI, selecting the 'filter' button, choosing the
discretization filter under the supervised filters folder and applying it.
The Naïve Bayes algorithm was then run again with Discretization applied; this
filter is the only change made to the run of the classifier, and all other
conditions remained the same to ensure that any improvement in results is due
to discretization.
This time, however, the results were significantly improved, with a
classification accuracy of 99.3966% compared to 95.411% when Discretization
was not applied to the Naïve Bayes algorithm, and a reduction of 141 in the
number of false positives.
Figure 13: Naive Bayes with Discretization Model
Figure 14: Naive Bayes with Discretization Confusion Matrix
As with Naïve Bayes, the J48 algorithm was also implemented with
Discretization. Unlike Naïve Bayes, the J48 algorithm showed a reduction in
performance, with a classification accuracy of 99.932% compared to 99.9407%
for the original J48 algorithm without Discretization.
Figure 15: J48 Model with Discretization
Figure 16: J48 Confusion Matrix with Discretization
Given that the J48 decision tree algorithm returned a reduction in
classification accuracy when Discretization was applied, it was of interest to
see whether the Random Forest algorithm would show a similar reduction, as
both algorithms belong to the decision tree family. The Random Forest results,
however, did not mirror those of J48: the accuracy of the classifier changed
only marginally, to 99.9752%, but there was a massive reduction in
classification time, almost halving from 380.87 seconds without Discretization
to 156.11 seconds with it.
Figure 17: Random Forest Model with Discretization
Figure 18: Random Forest Confusion Matrix with Discretization
Figure 19: Classification Algorithms Accuracy using Discretization (Naïve Bayes 99.3966%, J48 99.932%, Random Forest 99.9752%)
Figure 20: Classification Algorithms False Positive Rate using Discretization (Naïve Bayes 33, J48 121, Random Forest 90 false positives)
In the next section, clustering algorithms are explored; their results will be
compared with the results of the classification algorithms discussed in this
section. The overall results of the experiment are discussed in section 5.1
Results.
4.2.2 Clustering Implementation
The results of the previous classification algorithms will be compared with
the results of the K-Means and EM algorithms explored in this section, each
implemented with and without Standardization and Discretization applied. Both
the K-Means and EM algorithms will be run under the same conditions and
environment as the classification algorithms explored in section 4.2.1. Unlike
the classification algorithms, however, cross-validation cannot be done
directly in the clustering tab of Weka; to implement cross-validation for a
clustering algorithm, select the classification tab and, under the meta
classifier folder, select 'ClassificationViaClustering'. This gives the option
of selecting a clustering algorithm to implement; classification is then
performed on the clusters so that an evaluation can be made.
To begin, the KDD Cup 99 dataset was reloaded into the Weka GUI to remove the
Discretization filter applied in the previous section. After the dataset was
loaded, the Cluster tab was selected and the first clustering algorithm,
K-Means, was chosen. The K-Means algorithm was run using
'ClassificationViaClustering' with 12 clusters, resulting in 77.4976%
correctly classified instances.
Figure 21: Algorithms Classification Time using Discretization (Naïve Bayes 1.93 s, J48 5.68 s, Random Forest 148.93 s)
Figure 22: K-Means Model
Figure 23: K-Means Confusion Matrix
Following on from the K-Means algorithm, EM was also implemented using
'ClassificationViaClustering' with 12 clusters, to ensure test fairness.
Unfortunately, limitations of memory and CPU caused this run to crash Weka;
this was attempted several more times with the same result. By lowering the
number of clusters to 2, a result was achievable. As the number of clusters
used by the two algorithms does not match, the results of the EM run will not
be compared with the other algorithms, so as to ensure test-result fairness.
However, the EM algorithm can still be implemented with the analytical filters
to investigate their effect on the algorithm itself. The result of the EM run
was 83.7499% accuracy.
Figure 24: EM Model
Figure 25: EM Confusion Matrix
After the two clustering algorithms had been implemented without
Standardization or Discretization, they were implemented again, this time with
Standardization applied. With Standardization applied to the KDD Cup 99
dataset, the K-Means algorithm achieved an accuracy of 77.496%, practically
the same as the original result. Next the EM algorithm was implemented on the
standardized KDD Cup 99 dataset, giving a result of 83.7509%, which was
marginally better.
Figure 26: K-Means Model with Standardization
Figure 27: EM Model with Standardization
Next, the KDD Cup 99 dataset was implemented with Discretization applied; this
is the same filter used for the classification algorithms, but in this case the
unsupervised variant. The same implementation process carried out for
Standardization was performed for Discretization. K-Means achieved a new
result of 81.4082% accuracy, an improvement over the original 77.4976%, while
EM returned 73.7629%, a reduction of 9.987 percentage points.
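Unsupervised discretization, as used here, replaces each numeric attribute with a binned nominal one without consulting the class labels; equal-width binning is the simplest form (and the default of Weka's unsupervised Discretize filter, which uses 10 bins). A minimal sketch on hypothetical values:

```python
def discretize_equal_width(values, bins=10):
    """Equal-width binning: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    # values at the top edge fall into the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Example: a duration-like feature collapsed into 4 coarse bins
durations = [0, 1, 2, 50, 99, 100]
print(discretize_equal_width(durations, bins=4))  # → [0, 0, 0, 2, 3, 3]
```

Collapsing a continuous range into a handful of bins discards fine detail, which is one plausible reason the filter helps some algorithms (fewer, cleaner splits) while hurting others.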
Figure 28: K-Means Model with Discretization
Figure 29: K-Means Confusion Matrix with Discretization
Figure 30: EM Model with Discretization
As mentioned at the end of section 4.2.1, the overall results of the
experiment will be discussed in section 5.1, Results.
4.3 Development
Continuing on from the proposed software design discussed in section 3.2,
development of the system began by creating a packet sniffer, the packet sniffer
was implemented using the Scapy approached, this choice was made because if
the proposed system were to be implemented in an open-source environment it
would be easier to incorporate Scapy rather than the likes of Snort which would
need a separate install.
The next stage of development was implementing the pre-processing and a
machine learning algorithm for the simulator. This was done with the Weka
framework by importing the weka.jar library into a Java class file, in which
the IDS training model would be created and evaluated. In the Java project
created when setting up the system environment at the start of section 4, a
class file named ‘IDS’ was added. First, the necessary Java libraries were
imported: those for the buffered reader, file reader and file-not-found
exceptions. Next, the Weka libraries for classification and evaluation were
imported, since classification was the prediction method used in this
simulator; the results in section 4.2 showed that classification algorithms
obtained greater prediction accuracy. As Random Forest showed the best
classification results without filtering, it was chosen as the machine
learning algorithm for this implementation, so the Weka rules and trees
libraries needed for the algorithm were imported. The final Weka imports were
those needed to handle instances and make predictions.
The IDS class file contains methods for reading in the dataset, classifying it
(using cross-validation), checking for an intrusion, and a main method that
runs the simulator and outputs its results. Once this class was tested
successfully, a GUI was created using the NetBeans GUI builder to make the
simulator more user friendly. The GUI can be seen in figure 31.
Figure 31: IDS Simulation GUI
When the simulator is run by clicking the ‘Monitor’ button, it classifies the
dataset using 10-fold cross-validation and checks whether any classified
instance differs from its predicted class; in this simulation, such a mismatch
is treated as an intrusion. In the actual system, an intrusion would
correspond to an incoming connection not matching the prediction; future work
to achieve this functionality is discussed in section 5.3.
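The monitor's logic, independent of Weka, can be sketched as: obtain a cross-validated prediction for every instance, then flag any instance whose actual class differs from the prediction. The sketch below uses a trivial majority-class learner as a stand-in for Random Forest, on hypothetical labels:

```python
def cross_val_predict(instances, labels, trainer, folds=10):
    """k-fold cross-validation: for each fold, train on the other
    folds and predict the held-out instances."""
    preds = [None] * len(instances)
    for f in range(folds):
        train_x = [x for i, x in enumerate(instances) if i % folds != f]
        train_y = [y for i, y in enumerate(labels) if i % folds != f]
        predict = trainer(train_x, train_y)
        for i in range(len(instances)):
            if i % folds == f:
                preds[i] = predict(instances[i])
    return preds

def majority_trainer(xs, ys):
    """Stand-in learner (the real simulator uses Weka's Random Forest):
    always predicts the training set's majority class."""
    maj = max(set(ys), key=ys.count)
    return lambda x: maj

# Hypothetical data: 18 normal connections and 2 DOS attacks
instances = list(range(20))
labels = ["normal"] * 18 + ["dos"] * 2
preds = cross_val_predict(instances, labels, majority_trainer)
# A mismatch between actual class and prediction is flagged as an intrusion
intrusions = [i for i in range(20) if preds[i] != labels[i]]
print(intrusions)  # → [18, 19]
```

The two rare attack instances are flagged because the cross-validated model never predicts them, mirroring how the simulator surfaces disagreements between the classifier and the data.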
5 Evaluation
5.1 Results
In this section the results of the experiment in section 4.2 are discussed and
explored. The algorithm that yielded the best result before any analytical
filter was applied was the classification algorithm Random Forest; the lowest
result came from the clustering algorithm K-Means. The algorithm with the
biggest improvement from an applied filter was Naïve Bayes with Discretization,
gaining 3.9856%, narrowly beating the 3.5844% improvement of K-Means with
Discretization. Although Discretization decreased performance for the decision
tree algorithms and for the clustering algorithm EM, it showed promising
results with the Naïve Bayes and K-Means algorithms. Applying an analytical
filter also reduced model classification time for both classification and
clustering algorithms; in the case of Discretization, this reduction in
classification time is a possible cause of the reduced performance seen in the
J48, Random Forest and EM algorithms, which have greater complexity than Naïve
Bayes and K-Means.
Figure 32: Classification accuracy (%) of Naive Bayes and K-Means with and
without Discretization applied (Naïve Bayes: 95.411 without filter, 99.3966
with Discretization; K-Means: 77.4976 without filter, 81.4082 with
Discretization)
In the results there was a massive reduction in false positives (false alarms)
when Discretization was applied to Naïve Bayes, correlating directly with the
increase in classification accuracy. What is interesting is that Naïve Bayes
with Discretization recorded fewer false positives than Random Forest without
Discretization, even though Random Forest still has the higher classification
accuracy; the discretised Naïve Bayes must therefore have more false negatives
than Random Forest. False negatives are also a concern in intrusion detection,
but the focus of this experiment is on classification accuracy and the number
of false positives. A big surprise in the results was that although the
classification accuracy of the K-Means algorithm improved with Discretization,
the number of false positives rose by a massive 986, implying an equal
reduction in false negatives. This could be due to the number of clusters used
in this experiment, or it may simply be the normal behaviour of K-Means with
Discretization; future research could explore this question in an environment
with the memory and CPU power to handle more clusters.
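The quantities discussed here fall directly out of a two-class confusion matrix: false positives are normal connections flagged as attacks, false negatives are attacks classified as normal. A small sketch with hypothetical counts (not taken from the thesis figures):

```python
def fp_fn(confusion, positive="attack"):
    """False positives / negatives from a 2x2 confusion matrix laid out
    as {actual_class: {predicted_class: count}}."""
    fp = confusion["normal"][positive]   # normal traffic flagged as attack
    fn = confusion[positive]["normal"]   # attack classified as normal
    return fp, fn

# Hypothetical confusion matrix
m = {"normal": {"normal": 9500, "attack": 120},
     "attack": {"normal": 45, "attack": 4300}}
print(fp_fn(m))  # → (120, 45)

# Accuracy counts only the diagonal, so FP can rise while accuracy
# still improves if FN falls by more.
acc = (m["normal"]["normal"] + m["attack"]["attack"]) / \
      sum(sum(row.values()) for row in m.values())
print(round(acc, 4))
```

This is why accuracy alone does not tell the whole story for an IDS: two classifiers with the same accuracy can trade false positives against false negatives very differently.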
Figure 33: Number of False Positives, Naive Bayes/K-Means Comparison (Naïve
Bayes: 174 without filter, 33 with Discretization; K-Means: 816 without
filter, 1802 with Discretization)
The overall top three classification/clustering algorithms, both with and
without a filter, ranked by classification accuracy are:
1. Random Forest – No Filter
2. J48 – No Filter
3. Naïve Bayes – Discretization
This ranking does not take into account the number of false positives or
classification time; the algorithms ranked by lowest number of false positives
are shown below:
1. Naïve Bayes – Discretization
2. Random Forest – Discretization
3. J48 - Discretization
The development portion of the project resulted in a functional IDS
Simulation. Its code, which can easily be adapted to other classification
algorithms, produced a system that can take in a dataset, pre-process the
data, classify the data instances using Random Forest with 10-fold
cross-validation, check the resulting classifications for a detected intrusion
and, if one is found, output a detection message to the GUI. Real-time/near
real-time intrusion detection could not be implemented in this project; it is
discussed in section 5.3.
5.2 Conclusion
From the results shown in section 5.1, a conclusion can be drawn on the
effects of using analytics to improve intrusion detection: there is no single
supreme machine learning algorithm for Intrusion Detection Systems. For
example, the classification algorithm Random Forest had the best classifier
accuracy (resulting in the lowest combined number of false positives and false
negatives) but a high classification time. This approach would be highly
effective on a network with ample CPU and memory resources, but it would not
scale to a network with fewer resources, where a decision tree algorithm would
impose overhead in the form of a large classification time. The results show
that this overhead can be reduced by applying Discretization to the decision
tree algorithms, at the cost of a small loss in accuracy; even with this
reduction, the decision tree algorithms remain ahead of every other algorithm
explored in this project in accuracy. If a network needs good accuracy, low
classification time and a low number of false positives, then Naïve Bayes with
Discretization is the best choice, returning the lowest number of false
positives. In conclusion, the best machine learning algorithm for an IDS is
the one that best fits the requirements and restrictions of the network on
which the system will be deployed, and whether the approach taken is misuse-
or anomaly-based.
Furthermore, in the majority of cases (the clustering algorithms being the
minority), implementing an algorithm with an analytical filter such as
Discretization or Standardization brought one or more benefits: an increase in
accuracy, a reduction in classification time, or a reduced number of false
positives (all three in the case of Naïve Bayes). Research into analytical
filters for a dataset is therefore an invaluable use of time to ensure the IDS
gets the best results possible. Finally, these results show that the
classification algorithms outperformed the clustering algorithms experimented
on in this project. This does not mean classification algorithms are
inherently better than clustering algorithms; rather, it shows that the KDD
Cup 99 dataset suits classification better than clustering. Clustering
algorithms also work on data with no class labels, and this unsupervised form
of machine learning is much more difficult and complex than classification.
In the end, this project showed that most analytical approaches to machine
learning algorithms have both benefits and downfalls, but the benefits
outweigh the downfalls; selecting the right machine learning algorithm
implementation with a suitable analytical filter can indeed enhance intrusion
detection.
5.3 Future Work
This section discusses work that could not be developed or implemented, and
changes that would be made in hindsight given the results of the project.
The main functionality that could not be implemented was real-time/near
real-time intrusion detection: the IDS Simulator lacks the functionality to
classify incoming network packets, which would be done by deploying a trained
machine learning model to predict the type of each incoming packet. The packet
sniffer implemented (Scapy) outputs the incoming packet's IP address rather
than connection information; if real-time/near real-time intrusion detection
is to be performed against the KDD Cup 99 dataset, incoming packets must first
be converted to connection-level data. This leads on to the downfalls of the
KDD Cup 99 dataset itself: it is regarded as a poor match for real-world data,
being over 15 years old and outdated. If this project were undertaken again, a
dataset would be collected over a few months on the environment in which the
system is intended to run, favouring rare attacks such as R2L and U2R rather
than the DOS attacks that dominate the KDD Cup 99 dataset. That dataset would
also be skewed towards normal packets, rather than heavily attack-skewed as
KDD Cup 99 is.
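Converting raw packets into connection-level records, as KDD Cup 99 expects, amounts to grouping packets by their (source, destination, source port, destination port, protocol) 5-tuple and aggregating features such as byte counts and duration. A sketch of that first step, using hypothetical packet records with fields a sniffer like Scapy can provide:

```python
from collections import defaultdict

def to_connections(packets):
    """Group per-packet records into connection-level summaries keyed by
    the (src, dst, sport, dport, proto) 5-tuple; a first step towards
    KDD-style connection features such as duration and src_bytes."""
    conns = defaultdict(lambda: {"packets": 0, "src_bytes": 0,
                                 "first": None, "last": None})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        c = conns[key]
        c["packets"] += 1
        c["src_bytes"] += p["size"]
        c["first"] = p["time"] if c["first"] is None else c["first"]
        c["last"] = p["time"]
    # derive duration from the first and last packet timestamps
    return {k: dict(v, duration=v["last"] - v["first"])
            for k, v in conns.items()}

# Hypothetical sniffed packets (field names are illustrative)
pkts = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 4040, "dport": 80,
     "proto": "tcp", "size": 60, "time": 0.0},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 4040, "dport": 80,
     "proto": "tcp", "size": 1500, "time": 0.4},
]
conns = to_connections(pkts)
for key, c in conns.items():
    print(key, c["packets"], c["src_bytes"], c["duration"])
```

A full conversion to KDD Cup 99 features would also need content- and traffic-based attributes, but grouping by 5-tuple is the core of moving from packet level to connection level.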
The IDS Simulator would better model how Intrusion Detection Systems are
actually implemented if it used a clustering algorithm: since attacks these
days are often unknown, an anomaly-based approach would be more beneficial.
Finally, the environment used in this project did not have the CPU processing
power and RAM needed for complex algorithms to classify large datasets (as
seen with the EM clustering algorithm), which meant some results could not be
recorded. If the experiment were run in an environment with more CPU power and
RAM, more accurate results could have been generated, possibly on a bigger
dataset.
References
[1] SANS™ Institute, “What is a false positive and why are false positives a problem?,”
[Online]. Available: https://www.sans.org/security-resources/idfaq/what-is-a-false-
positive-and-why-are-false-positives-a-problem/2/8.
[2] Predictive Analytics Today, “What is Predictive Analytics,” [Online]. Available:
http://www.predictiveanalyticstoday.com/what-is-predictive-analytics/.
[3] A. Smola and S. Vishwanathan, “Introduction to Machine Learning”.
[4] M. J. Zaki and W. Meira Jr., “Data mining and Analysis: Fundamental Concepts and
Algorithms,” 2014.
[5] A. A. Rao, P. Srinivas, B. Chakravarthy, K. Marx and P. Kiran, “A Java Based Network
Intrusion Detection System (IDS)”.
[6] MIT Lincoln Labs, “KDD Cup 1999 Data,” Information and Computer Science
University of California, Irvine, 1999. [Online]. Available:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[7] M. K. Siddiqui and S. Naahid, “Analysis of KDD CUP 99 Dataset using Clustering
based Data,” International Journal of Database Theory and Application, vol. 6,
no. 5, pp. 24-25, 2013.
[8] M. Tavallaee, E. Bagheri, W. Lu and A. A. Ghorbani, “A Detailed Analysis of the KDD
CUP 99 Data Set,” 2009.
[9] H. Liu, F. Hussain, C. L. Tan and M. Dash, “Discretization: An Enabling Technique,”
Data Mining and Knowledge Discovery, 2000.
[10] M. Walker, “Random Forests Algorithm,” 2013. [Online]. Available:
http://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm.
[11] I. B. Mohamad and D. Usman, “Standardization and Its Effects on K-Means
Clustering Algorithm,” p. 3300, 2013.
[12] N. Sharma, A. Bajpai and R. Litoriya, “Comparison the various clustering algorithms
of weka tools,” International Journal of Emerging Technology and Advanced
Engineering, vol. 2, no. 5, pp. 76-79, 2012.
[13] C. Piech and A. Ng, “K Means,” [Online]. Available:
http://stanford.edu/~cpiech/cs221/handouts/kmeans.html.
[14] M. B. and M. B. , “An overview to Software Architecture in Intrusion Detection
System,” International Journal of Soft Computing And Software Engineering (JSCSE),
p. 4, 2011.
[15] D. Magers, “Packet Sniffing: An Integral Part of Network Defense,” 2002.
[16] M. Roesch. [Online]. Available: https://www.snort.org/.
[17] Penn State Berks, “Introduction to Snort,” [Online]. Available:
http://istinfo.bk.psu.edu/labs/Snort.pdf.
[18] M. Richardson and B. Fenner. [Online]. Available: http://www.tcpdump.org/.
[19] P. Biondi. [Online]. Available: http://www.secdev.org/projects/scapy/.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, The
WEKA Data Mining Software: An Update, vol. 11, 2009.
[21] P. Refaeilzadeh, L. Tang and H. Liu, Cross-Validation, Arizona State University, 2008.
[22] S. K. Patro and K. K. Sahu, “Normalization: A Preprocessing Stage”.
Figure 1: Bayes Theorem 5
Figure 2: IDS Software Architecture Overview [12] 7
Figure 3: Software Architecture Overview of IDS Simulator 8
Figure 4: Naive Bayes Model 11
Figure 5: Naive Bayes Confusion Matrix 11
Figure 6: J48 Model 12
Figure 7: J48 Confusion Matrix 12
Figure 8: Random Forest Model 13
Figure 9: Random Forest Confusion Matrix 13
Figure 10: Classification Algorithms Accuracy 13
Figure 11: Classification Algorithms False Positive Rate 14
Figure 12: Algorithms Classification Time 14
Figure 13: Naive Bayes with Discretization Model 15
Figure 14: Naive Bayes with Discretization Confusion Matrix 15
Figure 15: J48 Model with Discretization 15
Figure 16: J48 Confusion Matrix with Discretization 16
Figure 17: Random Forest Model with Discretization 16
Figure 18: Random Forest Confusion Matrix with Discretization 16
Figure 19: Classification Algorithms Accuracy using Discretization 17
Figure 20: Classification Algorithms False Positives Rate using Discretization 17
Figure 21: Algorithms Classification Time using Discretization 18
Figure 22: K-Means Model 19
Figure 23: K-Means Confusion Matrix 19
Figure 24: EM Model 20
Figure 25: EM Confusion Matrix 20
Figure 26: K-Means Model with Standardization 20
Figure 27: EM Model with Standardization 21
Figure 28: K-Means Model with Discretization 21
Figure 29: K-Means Confusion Matrix with Discretization 21
Figure 30: EM Model with Discretization 22
Figure 31: IDS Simulation GUI 23
Figure 32: Table showing the classification accuracy of Naive Bayes and K-Means
algorithms with and without Discretization applied. Classification accuracy
shown in percentages 24
Figure 33: Number of False Positives, Naive Bayes/ K-Means Comparison 25