Using Analytics to Enhance
Intrusion Detection Systems
by
Jamie Sullivan
Final-Year Project - BSc in Computer Science
Supervisor: Prof. Gregory Provan
Second Reader: Dr. Derek Bridge
Department of Computer Science
University College Cork
April 2016
Abstract
By Jamie Sullivan
Network Intrusion Detection Systems suffer from high false-alarm rates; this
project uses a predictive analytics approach to study how to reduce false alarms.
Predictive analytics combines system models with data to predict the most likely
actions of intruders, in order to distinguish real intrusion signatures
(triggering of alarms) from random signatures. The project combines learning
from data with simulation models for signature analysis.
The project provides both an experimental study of how predictive analytics can
be combined with machine learning algorithms and techniques, and a software
simulation of how the system is intended to operate, built using open source
software.
Predictive analysis is carried out on the KDD Cup '99 dataset using both
supervised learning (classification) and unsupervised learning (clustering)
algorithms to create simulation models, with the aim of reducing false alarms
(false positives) by increasing the predictive accuracy of these algorithms.
Declaration of Originality
In signing this declaration, you are confirming, in writing, that the submitted
work is entirely your own original work, except where clearly attributed
otherwise, and that it has not been submitted partly or wholly for any other
educational award.
I hereby declare that:
- this is all my own work, unless clearly indicated otherwise, with full and
proper accreditation;
- with respect to my own work: none of it has been submitted at any
educational institution contributing in any way towards an educational
award;
- with respect to another’s work: all text, diagrams, code, or ideas, whether
verbatim, paraphrased or otherwise modified or adapted, have been duly
attributed to the source in a scholarly manner, whether from books,
papers, lecture notes or any other student’s work, whether published or
unpublished, electronically or in print.
Name: Jamie Sullivan
Signed: _______________________________
Acknowledgements
I am using this opportunity to express my gratitude to everyone who supported
me throughout the course of this project. I am thankful for their aspiring
guidance, invaluably constructive criticism and friendly advice during the project
work. I am sincerely grateful to them for sharing their truthful and illuminating
views on a number of ideas related to the project.
I would like to express my sincere gratitude to my supervisor Prof. Gregory
Provan for the continuous support during the course of this project, for his
patience, motivation, enthusiasm, and immense knowledge. His guidance helped
me in all aspects of the project. I could not have imagined having a better advisor
and mentor for my project.
Thank you to everyone in lab 1.09, for the stimulating discussions, fantastic
atmosphere and work environment over the past several months.
Last but not least, I would like to thank my friends and family for all the
support, guidance and advice throughout the entire process of this project, and
especially over my four years here in University College Cork.
Table of Contents
1 Introduction .....................................................................................................1
2 Analysis.............................................................................................................3
3 Design...............................................................................................................4
3.1 Experimental Design .................................................................................4
3.1.1 Knowledgebase/Dataset....................................................................4
3.1.2 Classification Algorithms Approach...................................................5
3.1.2.1 Analytical approach – Discretization..............................................5
3.1.2.2 Naïve Bayes ....................................................................................5
3.1.2.3 J48 ..................................................................................................6
3.1.2.4 Random Forest...............................................................................6
3.1.3 Clustering Algorithms Approach........................................................6
3.1.3.1 Analytical Improvement – Standardize & Discretization...............6
3.1.3.2 K-Means .........................................................................................7
3.1.3.3 EM ..................................................................................................7
3.2 Software Design ........................................................................................7
3.2.1 Packet Sniffer.....................................................................................8
3.2.2 Data Pre-Processor ............................................................................9
3.2.3 Knowledgebase/Dataset....................................................................9
3.2.4 Machine Learning Algorithm .............................................................9
3.2.5 Trained Model....................................................................................9
3.2.6 Network .............................................................................................9
4 Implementation .............................................................................................10
4.1 Weka .......................................................................................................10
4.2 Experiment..............................................................................................10
4.2.1 Classification Implementation.........................................................11
4.2.2 Clustering Implementation..............................................................18
4.3 Development...........................................................................................22
5 Evaluation.......................................................................................................24
5.1 Results.....................................................................................................24
5.2 Conclusion...............................................................................................26
5.3 Future Work ............................................................................................27
References
1 Introduction
With a growing level of intrusions on the internet and on local networks, a vast
amount of work is being invested in intrusion detection. Intrusion detection
systems, coupled with intrusion prevention systems, work to stop these attacks.
Unfortunately, Intrusion Detection Systems (IDS) suffer from high false-alarm
rates, known as false-positives: genuine packets are flagged as attacks, which
is not ideal as it drowns out legitimate attack detections with false ones [1].
Another problem in IDSs is false-negatives, where the system fails to identify a
packet as an attack; this project will focus on reducing false-positives.
This project will attempt to use predictive analytics to study how to reduce
false alarms. Predictive analytics is an analytical method used to make
predictions about unknown future events, using techniques such as machine
learning, data mining, modelling and statistics to analyse stored data and make
accurate predictions about future data [2]. Machine learning algorithms and
data mining techniques, combined with statistical methods, will be applied to
the training dataset to create predictive models, improving the accuracy of the
models and thus reducing false-positives. Both misuse-based and anomaly-based
intrusion detection techniques will be implemented to investigate the effects of
the false-positive reduction method researched.
The predictive analytics approach researched in this project will be implemented
on the free-to-use KDD Cup 99 dataset, so research will also be done on whether
it is an effective dataset for intrusion detection. Free-to-use data is always
of interest when performing a research project or developing a system, as it is
cost effective and the resulting model/system can be reproduced by others using
the same free data.
The project was split into two sections: the first an experiment to research the
capabilities and effects of predictive analytics for intrusion detection, the
second the development of a system for intrusion detection. Both sections were
implemented using the open source Weka data mining tool, which is discussed in
later sections. The experiment was carried out first, followed by the
development of the system using the model that returned the best results. The
goal was to develop an easy-to-use and easy-to-interpret IDS; it will not have
intrusion prevention functionality, but should be able to handle real-time/near
real-time intrusion detection and be scalable to all network types.
In the end, the experiment showed that predictive analytics can be used to
improve detection accuracy and reduce false-positives. This was done by using
predictive machine learning/data mining algorithms combined with statistical
methods to filter the data in the KDD Cup 99 dataset. Although the system
developed did not have real-time/near real-time intrusion detection
functionality, a proposed method of implementing it is discussed as future work,
and the system does provide a simulation of how intrusion detection is performed
using the predictive analytic methods researched in the experiment. The
experiment also revealed the downfalls of the KDD Cup 99 dataset, which have
been investigated in other papers, but showed that the dataset still has merit
as an effective dataset for research, especially in cases where the data is
skewed.
2 Analysis
Based on the project abstract and introduction, the objectives of this project
are to complete adequate research in the field of Intrusion Detection Systems,
to use predictive analytics to increase classification accuracy and reduce
false-positives, and to implement the researched method in an IDS. The
environment in which this research experiment and development will be carried
out is the UCC Computer Science department lab machines, which have 8 GB of RAM
and an Intel Duo 3.00 GHz CPU.
Before research and development can begin for this project, research must be
done on the background ideologies:
 Machine Learning
o Methods of pattern detection/recognition which implement
computational learning based on the data patterns [3].
 Data Mining
o The process of gaining insightful knowledge from data in a
knowledgebase/dataset [4].
 Predictive Analytics
o Analytical method which is used to make predictions on unknown
future events [2].
 False-Positives
o Packet data that is flagged as an attack intrusion but is actually a
legitimate normal packet.
 False-Negatives
o Packet data that is flagged as a normal packet but was actually a
malicious attack packet.
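The quantities defined above, together with the classification accuracy reported later in the experiment, can all be derived from a binary confusion matrix. A minimal illustrative sketch (the counts here are made up for demonstration, not taken from the experiment):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive accuracy and error rates from raw confusion-matrix counts.

    tp/fn: attack packets classified correctly / missed (false-negatives);
    fp:    normal packets wrongly flagged as attacks (false alarms);
    tn:    normal packets correctly passed through.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total   # fraction of packets classified correctly
    fp_rate = fp / (fp + tn)       # false alarms per normal packet
    fn_rate = fn / (fn + tp)       # missed attacks per attack packet
    return accuracy, fp_rate, fn_rate

# Hypothetical counts for illustration only
acc, fpr, fnr = confusion_metrics(tp=950, fp=30, tn=1000, fn=20)
print(round(acc, 4), round(fpr, 4), round(fnr, 4))
```

Reducing false-positives means driving `fp` (and hence `fp_rate`) down without letting `fn` grow, which is exactly the trade-off the experiment measures.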
The research experiment will implement algorithms such as Naïve Bayes, J48,
Random Forest, EM and K-Means; these algorithms were chosen because together
they cover a large range of functionality and different machine learning
approaches. The individual algorithms are discussed further in sections 3.1.2
and 3.1.3. Data mining and statistical methods will be applied to the dataset
to optimise the data retrieved from it; methods such as Discretization and
Standardization will be implemented. The IDS being developed will use a Java
architecture and requirements similar to those discussed in [5], but will also
use the open-source data mining/machine learning tools discussed in later
sections.
This project hopes to discover and implement an effective means of improved
network intrusion detection which, when implemented, will reduce false-positives
and improve classification accuracy, while not creating massive overhead that
would cause network congestion, in turn making the proposed improvement method
scalable to both small and large networks.
3 Design
3.1 Experimental Design
The largest portion of this project was researching how to improve intrusion
detection using analytics, with research carried out on two machine learning
techniques: classification algorithms and clustering algorithms, discussed
below.
3.1.1 Knowledgebase/Dataset
The knowledgebase/dataset that will be used in this experiment is the KDD Cup
99 dataset, originally used for "The Third International Knowledge Discovery
and Data Mining Tools Competition". The competition's task was to build a
network intrusion detector: a predictive model capable of distinguishing
between intrusions (attacks) and normal connections [6]. As this task is very
similar to the task set out in this project, the KDD Cup 99 dataset is a good
fit for the experiment.
For the experiment a pre-made 10% section of the entire KDD Cup 99 dataset will
be used; due to the memory limitations of the machines being used, a subsection
of this 10% set was taken, giving an 8% version of the KDD Cup 99 dataset with
a total of 298,497 instances, 42 attributes and 23 distinct class values. The
10% training dataset has 19.69% normal and 80.31% attack connections [7]; the
breakdown of attacks is shown below:
 Denial of Service Attack (DOS)
o Attacks that try to block legitimate connection requests by making
computing or memory resources too busy to serve them.
o ‘back’, ‘land’, ‘neptune’, ‘pod’, ‘smurf’ and ‘teardrop’
 Users to Root Attack (U2R)
o The attacker has access to a normal user account and uses these
attacks to try to gain root-level access.
o ‘buffer_overflow’, ‘loadmodule’, ’perl’, and ‘rootkit’
 Remote to Local Attack (R2L)
o The attacker does not have an account on the machine and uses these
attacks to exploit a vulnerability over the network in order to gain
access to it.
o ‘ftp_write’, ‘guess_passwd’, ‘imap’, ‘multihop’, ‘phf’, ‘spy’,
‘warezclient’ and ‘warezmaster’
 Probing Attack (PROBE)
o Attacks try to gain information about the network in order to use
this information to get past its security.
o ‘ipsweep’, ‘nmap’, ‘portsweep’ and ‘satan’
As stated above, the ratio of normal to attack connections in the dataset is
19.69% to 80.31% respectively, so the dataset is heavily skewed towards attack
connections. The dataset also favours records of DOS attack instances such as
'neptune' and 'smurf' over more harmful attacks such as U2R and R2L (which in
practical use are more desirable in a training dataset). This can bias the
results towards frequent-record detection methods, which is not ideal for
practical use; for this experiment, however, the focus is on the analytical
methods which can be implemented to improve the baseline result (the result
obtained without the analytical method applied). The downfalls of the KDD Cup
99 dataset are discussed in [8]. Despite these downfalls, the KDD Cup 99
dataset is still an effective dataset for the purpose of this project.
Finally, the KDD Cup 99 dataset contains the following connection protocols [7]:
 TCP: Reliable connection-oriented protocol
 UDP: Unreliable and connection-less protocol
 ICMP: Communication over networked computers
3.1.2 Classification Algorithms Approach
Classification is a supervised machine learning technique that assigns instances in
a collection to target class, the end goal of a classifier is to accurately predict the
target class for each case in the data. In the following sections the classification
algorithms that will be used and the analytical approach to reduce false positives
will be discussed.
3.1.2.1 Analytical approach – Discretization
Discretization is the process of converting continuous attribute values into
nominal attribute intervals, giving a smaller attribute value space.
Discretization has been shown to improve prediction accuracy in previous work
[9], as intervals are a more concise representation of the data and are
therefore easier to use and comprehend than continuous values. Discretization
is the planned analytical approach for improving the results of the algorithms
discussed below.
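As an illustration of the idea (this is not Weka's supervised Discretize filter, which chooses cut-points using the class labels), a minimal equal-width binning sketch:

```python
def discretize_equal_width(values, n_bins):
    """Map continuous values to nominal interval indices 0..n_bins-1
    using equal-width bins over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against a constant column
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        bins.append(min(idx, n_bins - 1))  # clamp the maximum into the top bin
    return bins

# e.g. a continuous feature such as connection duration, in seconds
durations = [0.0, 0.4, 1.2, 3.5, 7.9, 8.0]
print(discretize_equal_width(durations, 4))  # → [0, 0, 0, 1, 3, 3]
```

Each continuous value is replaced by an interval label, which is the more concise representation the section above refers to.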
3.1.2.2 Naïve Bayes
The Naïve Bayes algorithm is based on conditional probabilities and uses Bayes'
Theorem: it finds the probability of an event occurring given the probability
of another event that has already occurred. In the case of this experiment, it
finds the probability of a connection being either a normal or an attack
connection based on the previously calculated probabilities of the connections
in the knowledgebase/dataset.
Figure 1: Bayes' Theorem: P(A|B) = P(B|A) · P(A) / P(B)
In this experiment, Naïve Bayes will be the main algorithm focused on: it works
surprisingly well as a classification algorithm given that it relies on 'naïve'
independence assumptions, remaining competitive with more elaborate
classifiers, yet it has lower performance than the two decision tree algorithms
discussed in the following sections. For these two reasons, the results of
training/testing the Naïve Bayes classifier can be used as the baseline for the
experiment, with the analytical approach discussed in section 3.1.2.1 then used
to try to bring the classifier's results as close as possible to the superior
results of the decision tree algorithms.
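To make the idea concrete, here is a toy count-based Naïve Bayes over nominal features, with add-one smoothing. The connection records, feature names and values below are hypothetical, and this is a sketch of the technique rather than Weka's NaiveBayes implementation:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(i, y)][v] += 1
    return class_counts, feat_counts

def predict_nb(model, row):
    """Pick the class maximizing P(class) * prod_i P(feature_i = v | class),
    with add-one (Laplace) smoothing to avoid zero probabilities."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / total
        for i, v in enumerate(row):
            counts = feat_counts[(i, y)]
            p *= (counts[v] + 1) / (cy + len(counts) + 1)  # smoothed likelihood
        if p > best_p:
            best, best_p = y, p
    return best

# Hypothetical connection records: (protocol, flag) -> label
rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"), ("tcp", "S0")]
labels = ["normal", "attack", "normal", "attack"]
model = train_nb(rows, labels)
print(predict_nb(model, ("tcp", "SF")))  # → normal
```

The 'naïve' assumption is visible in the inner loop: each feature's likelihood is multiplied in independently of the others.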
3.1.2.3 J48
The J48 algorithm is a decision tree algorithm; it is an open source Java
implementation of the C4.5 algorithm, which builds a classification tree based
on the principle of information entropy. In this experiment, research will be
carried out to see how the J48 algorithm compares with the Naïve Bayes
classifier and with its fellow decision tree algorithm, Random Forest. The
analytical method used to try to improve the results of the Naïve Bayes
algorithm will also be applied to J48 to see whether it has the same effect.
3.1.2.4 Random Forest
Random Forest is another decision tree algorithm; it uses an ensemble learning
method, building a classifier by combining tree predictors with random vectors
to create the decision trees [10]. Like J48, the results of the Random Forest
classifier will be compared with those of Naïve Bayes and J48, and, as with
Naïve Bayes and J48, the Random Forest algorithm will also be run with the
analytical method discussed above applied and the outcomes compared.
3.1.3 Clustering Algorithms Approach
Clustering is an unsupervised machine learning technique that groups data into
'clusters' based on likeness; unlike classification, clustering algorithms can
work on unlabelled datasets, i.e. datasets without a class attribute value.
Research on the performance of clustering algorithms will be carried out on the
K-Means and EM algorithms, and the results of both algorithms will be compared
with and without the planned analytical approaches for improved performance
applied.
3.1.3.1 Analytical Improvement – Standardize & Discretization
Standardization is a technique that transforms the dataset to have a mean of 0
and unit variance of 1 [11]. As with the classification algorithms,
Discretization will also be applied to the clustering algorithms to see what
effect it has on their results, although in the case of the clustering
algorithms an unsupervised discretize filter will be used.
Both of these analytical approaches will be applied to the two algorithms below
to investigate whether they improve the results.
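A minimal sketch of z-score standardization, the idea behind the Standardize filter described above (Weka applies it per attribute; this operates on one column):

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0   # guard against constant columns
    return [(v - mu) / sigma for v in values]

column = [10.0, 20.0, 30.0, 40.0]
z = standardize(column)
print([round(v, 3) for v in z])
# the transformed column now has mean 0 and unit variance
```

Standardization matters for distance-based clustering such as K-Means, since otherwise attributes with large numeric ranges dominate the distance calculation.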
3.1.3.2 K-Means
The K-Means clustering algorithm aims to partition the n objects/observations
in the dataset into k clusters, where each object/observation belongs to the
cluster with the nearest mean [12]. In simple terms, K-Means assigns k
centroids (the centre points of the clusters) that are used to define the
clusters; an instance belongs to a particular cluster if it is closer to that
cluster's centroid than to any other centroid, much like nearest neighbour [13].
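The assign-then-update cycle can be sketched on 1-D data as follows; a real run (and Weka's SimpleKMeans) iterates until the centroids stop moving and handles multi-dimensional and nominal attributes:

```python
def kmeans_step(points, centroids):
    """One K-Means iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # keep the old centroid if a cluster ends up empty
    return [sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)]

points = [1.0, 2.0, 9.0, 10.0, 11.0]
centroids = [0.0, 5.0]
for _ in range(5):          # a few iterations are enough on this toy data
    centroids = kmeans_step(points, centroids)
print(centroids)  # → [1.5, 10.0]
```

Once the centroids returned by `kmeans_step` equal its inputs, the clustering has converged.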
3.1.3.3 EM
The EM (Expectation Maximization) algorithm evaluates clusters in two stages:
first calculating the expectation of the log-likelihood, then computing the
parameters that maximize the log-likelihood calculated in the previous step,
which are used to calculate the distribution. This algorithm has been shown to
be useful on real world datasets, and has been applied after K-Means in other
work [12], so the comparison of results will be of interest.
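The two stages can be sketched for a two-component 1-D Gaussian mixture. This is a simplification for illustration: the weights and variance are held fixed here, whereas Weka's EM estimates all parameters (and can choose the number of clusters):

```python
import math

def em_1d(xs, mu, sigma=1.0, iters=20):
    """Two-component 1-D Gaussian EM with fixed equal weights and variance:
    the E-step computes each point's responsibility for component 1,
    the M-step re-estimates the two means from the weighted points."""
    m1, m2 = mu
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            p1 = math.exp(-((x - m1) ** 2) / (2 * sigma ** 2))
            p2 = math.exp(-((x - m2) ** 2) / (2 * sigma ** 2))
            r.append(p1 / (p1 + p2))
        # M-step: means that maximize the expected log-likelihood
        m1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        m2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return m1, m2

xs = [0.9, 1.1, 1.0, 8.9, 9.1, 9.0]
print([round(m, 2) for m in em_1d(xs, (2.0, 7.0))])  # → [1.0, 9.0]
```

Unlike K-Means' hard assignments, the responsibilities in the E-step are soft: every point contributes to every mean, weighted by probability.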
3.2 Software Design
For this project a simulation was needed of how the proposed Intrusion
Detection System would use the improvements researched; the following sections
go into greater detail on the high-level architecture.
Figure 2: IDS Software Architecture Overview [14]
Figure 3: Software Architecture Overview of IDS Simulator
3.2.1 Packet Sniffer
The packet sniffer captures the network traffic, filters it for the particular
traffic of interest, then stores the data in a buffer. The captured packets are
then analysed/decoded in real time or near real time [15]. Possible packet
sniffer tools are discussed below:
 SNORT [16]
o An open source intrusion prevention system capable of real-time
traffic analysis and packet logging; libpcap-based and rule-based. In
this project the full set of services Snort offers would not be
implemented, only a subsection: Snort's packet sniffing/capture
capabilities in Packet Logger Mode [17].
 TCPDUMP [18]
o A powerful command-line tool that allows you to sniff/capture
network packets, much like Snort it is libpcap-based.
 Scapy [19]
o An interactive packet manipulation program. The features of
interest in this project being the ability to decode packets of a
wide number of protocols and capture them.
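The capture-filter-buffer flow described above can be sketched generically. This is not a real capture API (Snort, tcpdump or Scapy would do the actual sniffing and deliver packets via a callback); the packets here are hypothetical (protocol, payload) tuples:

```python
from collections import deque

class PacketBuffer:
    """Filters incoming packets by protocol and keeps the most recent
    `maxlen` matches in a bounded buffer for later decoding/analysis."""
    def __init__(self, wanted_protocols, maxlen=1024):
        self.wanted = set(wanted_protocols)
        self.buffer = deque(maxlen=maxlen)

    def on_packet(self, packet):
        protocol, payload = packet
        if protocol in self.wanted:   # drop traffic we are not interested in
            self.buffer.append(packet)

# Simulated traffic stream; a real sniffer would invoke on_packet per capture
stream = [("tcp", b"GET /"), ("icmp", b"ping"), ("udp", b"dns"), ("tcp", b"SYN")]
buf = PacketBuffer(wanted_protocols={"tcp"})
for pkt in stream:
    buf.on_packet(pkt)
print(len(buf.buffer))  # → 2
```

The bounded `deque` models the buffer stage: when traffic arrives faster than the analyser drains it, the oldest packets are discarded rather than exhausting memory.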
3.2.2 Data Pre-Processor
Data in the real world needs to be pre-processed, as it can be incomplete
(lacking attribute values or attributes of interest), inconsistent (containing
discrepancies) and contain errors. The data therefore needs to be passed
through a data pre-processor to 'clean it up'; otherwise poor quality data
would lead to poor quality results in the later stages of the project.
The open source machine learning software tool Weka will be used to pre-
process the data for use in the IDS Simulation.
3.2.3 Knowledgebase/Dataset
The knowledgebase/dataset used in this simulation for training is the KDD Cup
99 dataset; the same number of instances and attributes will be used as
discussed in section 3.1 Experimental Design.
3.2.4 Machine Learning Algorithm
The machine learning algorithm used in the IDS simulation will be selected from
the classification algorithms discussed in section 3.1.2, i.e. Naïve Bayes, J48
and Random Forest. A classification approach was chosen over a clustering
approach because the KDD Cup 99 dataset has predefined labels (classes), which
suits supervised learning. A clustering algorithm could also be used with the
KDD Cup 99 dataset and give results of the same quality as classification, but
for this IDS simulation classification was preferred.
3.2.5 Trained Model
Once training on the data is complete, the trained model will be deployed on
the network and used to classify captured incoming packets using one of the
packet sniffer techniques discussed in section 3.2.1. This allows detection to
be carried out in real time/near real time, as the captured packets are
compared against the trained model. In the case of an unsupervised trained
model, new packet values not seen before will be used to retrain the model so
that packets of similar instances can be detected in future.
3.2.6 Network
The domain of the IDS simulation will be network-based intrusion detection
rather than host-based intrusion detection. The network used for the simulation
is a small scale network (a lab machine connected to the UCC CS network). Even
though testing of the system will be carried out in a small scale environment,
the technologies used are open source and tested by their development
communities, so the system should be scalable to larger networks.
4 Implementation
This section of the report discusses in detail the implementation of the work
set out in the project brief. Prior to commencing work, the work environment
and tools needed to be set up and installed: first the Java environment was
downloaded and installed (Java is used by Weka and for the development of the
IDS simulation); next Python 3 was installed and the Scapy library downloaded
for the implementation of the packet sniffer discussed in section 3.2.1;
finally, Weka was downloaded and installed for use in both the experiment and
development sections of this project.
The dataset was prepared by first downloading the KDD Cup 99 dataset from the
KDD website [6]. The dataset comes in a plaintext format; it was converted to
.arff format by adding the dataset attributes and features, which can be found
on the KDD website.
The work environment and dataset were then ready for the research experiment
and development stages of the project.
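The plaintext-to-.arff conversion amounts to prepending an ARFF header (a relation name and one attribute declaration per column) to the comma-separated records. A minimal sketch, using a made-up subset of columns rather than the real 42-attribute KDD header taken from the KDD website:

```python
def to_arff(relation, attributes, csv_lines):
    """Build an ARFF document: @relation, one @attribute line per column
    (nominal attributes list their values in braces), then the data rows."""
    header = ["@relation " + relation]
    for name, kind in attributes:
        header.append("@attribute {} {}".format(name, kind))
    return "\n".join(header + ["@data"] + csv_lines) + "\n"

# Illustrative subset of columns, not the full KDD Cup 99 attribute list
attrs = [
    ("duration", "numeric"),
    ("protocol_type", "{tcp,udp,icmp}"),
    ("class", "{normal,attack}"),
]
records = ["0,tcp,normal", "5,udp,attack"]
print(to_arff("kddcup99_sample", attrs, records))
```

Weka then reads the `@attribute` declarations to know which columns are numeric and which are nominal, which is what the filters and classifiers rely on.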
4.1 Weka
This project was implemented using the Weka open source data mining software
tool, a collection of machine learning algorithms implemented in Java. Weka
contains tools for data pre-processing, classification, clustering and
visualization, making it a perfect fit for the task set out by this project
[20].
Weka can be implemented in multiple ways via:
 Command Line
 Imported weka.jar library
 Weka GUI
The Weka algorithm collection contains versions of each of the algorithms
discussed in section 3.1; in Weka these are NaiveBayes, J48, RandomForest, EM
and SimpleKMeans. Weka also contains the pre-processing tools necessary for the
analytical approaches discussed in sections 3.1.2.1 and 3.1.3.1.
The research side of the project was implemented using the Weka GUI, for its
ease of use and its ability to display and store the created models. The
development side of the project, on the other hand, was implemented using the
imported weka.jar library, which was used in the simulator's Java code.
4.2 Experiment
In the Weka GUI, the KDD Cup 99 dataset was loaded into the Preprocess section
and prepared for use in the classification algorithm implementations, followed
by the clustering algorithm implementations.
4.2.1 Classification Implementation
As discussed in section 3, the results of the Naïve Bayes algorithm will be
used as a baseline for the other algorithms and for comparing the rate of
improvement, if any, from re-running the algorithms with Discretization
applied.
In the Classify tab of the Weka GUI, the NaiveBayes algorithm is selected in
the 'Classifier' section under the 'bayes' folder, which contains all
Bayes-style algorithms. First, a simple test was carried out to check that the
KDD Cup 99 dataset could be classified without errors, by checking that a
training model could be created without testing it. This was done by selecting
'Use training set' in the test options field; the model was created
successfully, so the experiment moved on to building and testing a Naïve Bayes
model using 10-fold cross-validation. Cross-validation is another option in the
test options field: it splits the dataset into a 90/10 split (90% for training
the model and 10% for testing it) and does this for 10 iterations, each time
selecting a different 90/10 split [21]. Naïve Bayes with 10-fold
cross-validation gives a classification accuracy of 95.411%; this is now the
benchmark.
Figure 4: Naive Bayes Model
Figure 5: Naive Bayes Confusion Matrix
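The 90/10 splitting performed by 10-fold cross-validation can be sketched as follows (Weka additionally stratifies the folds so class proportions are preserved in each split; that step is omitted here):

```python
def cross_validation_folds(n_instances, n_folds=10):
    """Yield (train_indices, test_indices) pairs: each fold serves as the
    test set exactly once, with the remaining folds used for training."""
    indices = list(range(n_instances))
    fold_size = n_instances // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # the last fold absorbs any remainder
        end = start + fold_size if k < n_folds - 1 else n_instances
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

folds = list(cross_validation_folds(100, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # → 10 90 10
```

Because every instance is tested exactly once, the reported accuracy is an average over the whole dataset rather than over a single held-out split.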
Following on from Naïve Bayes, J48 was run under the same conditions to keep
the comparison fair: 10-fold cross-validation was performed on the dataset and
returned a classification accuracy of 99.9407%.
Figure 6: J48 Model
Figure 7: J48 Confusion Matrix
Like J48 and Naïve Bayes, the Random Forest classification algorithm was run
under the same conditions: 10-fold cross-validation was carried out on the
dataset and the result saved for comparison. The resulting accuracy of the
Random Forest algorithm was 99.9719%, the best of the three classifiers
investigated in this experiment. The goal is now to see whether either of the
other algorithms can achieve a competitive result when implemented with
Discretization.
Figure 8: Random Forest Model
Figure 9: Random Forest Confusion Matrix
The results of implementing the classification algorithms are shown in the
figures below.
Figure 10: Classification Algorithms Accuracy — Naïve Bayes: 95.411%, J48:
99.9407%, Random Forest: 99.9719%
Figure 11: Classification Algorithms False Positive Rate — Naïve Bayes: 174,
J48: 81, Random Forest: 63 false positives
Figure 12: Algorithms Classification Time — Naïve Bayes: 3.2 s, J48: 47.93 s,
Random Forest: 380.87 s
After all three of the chosen classification algorithms had been implemented,
the next stage was to run them again, this time on the KDD Cup 99 dataset with
Discretization applied. This was done by returning to the Preprocess tab in the
Weka GUI, clicking the 'Filter' button, then selecting and applying the
Discretize filter under the supervised filters folder.
The Naïve Bayes algorithm was then run again with Discretization applied. This
filter was the only change made to the run of the classifier; all other
conditions remained the same, to ensure that any improvement in the results was
due to Discretization being applied. This time the results were significantly
improved, with a classification accuracy of 99.3966% compared with the 95.411%
accuracy when Discretization was not applied, and a reduction of 141 in the
number of false positives.
Figure 13: Naive Bayes with Discretization Model
Figure 14: Naive Bayes with Discretization Confusion Matrix
As with Naïve Bayes, the J48 algorithm was also run with Discretization
applied. Unlike Naïve Bayes, however, J48 showed a reduction in performance,
with a classification accuracy of 99.932% compared with the 99.9407% accuracy
of the original J48 run without Discretization.
Figure 15: J48 Model with Discretization
Figure 16: J48 Confusion Matrix with Discretization
Given that the J48 decision tree algorithm returned a reduction in
classification accuracy when Discretization was applied, it was of interest to
see whether the Random Forest algorithm would show a similar reduction, as both
algorithms belong to the decision tree family. The Random Forest results,
however, showed only a marginal change in accuracy, to 99.9752%, together with
a massive reduction in classification time, almost halving from 380.87 seconds
without Discretization to 156.11 seconds with it.
Figure 17: Random Forest Model with Discretization
Figure 18: Random Forest Confusion Matrix with Discretization
Figure 19: Classification Algorithms Accuracy using Discretization — Naïve
Bayes: 99.3966%, J48: 99.932%, Random Forest: 99.9752%
Figure 20: Classification Algorithms False Positives Rate using Discretization
— Naïve Bayes: 33, J48: 121, Random Forest: 90 false positives
18
In the next section, clustering algorithms will be explored, and their results
will be compared with the results of the classification algorithms discussed in
this section. The overall results of the experiment will be discussed in
section 5.1 Results.
4.2.2 Clustering Implementation
The results of the previous classification algorithms will be compared with the
results of the K-Means and EM algorithms explored in this section, each
implemented with and without Normalization and Standardization applied. Both
K-Means and EM will be run under the same conditions and environment as the
classification algorithms explored in section 4.2.1. Unlike the classification
algorithms, however, cross-validation cannot be performed directly from the
Cluster tab in Weka. To cross-validate a clustering algorithm in Weka, select
the Classify tab and, under the meta classifier folder, select
'ClassificationViaClustering'; this gives the option to select a clustering
algorithm, after which classification is performed on the clusters so that an
evaluation can be made.
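Under the hood, ClassificationViaClustering assigns each cluster the majority class of the training instances that fall into it, and an instance is then classified with the class of its cluster. A plain-Java sketch of the majority-mapping step (an illustration of the idea, not Weka's source; cluster ids and class labels are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class ClusterToClass {
    /**
     * Maps each cluster id to the majority class label among the
     * training instances assigned to it.
     */
    public static Map<Integer, String> majorityClassPerCluster(int[] clusterOf, String[] classOf) {
        // Count class occurrences per cluster.
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusterOf.length; i++) {
            counts.computeIfAbsent(clusterOf[i], k -> new HashMap<>())
                  .merge(classOf[i], 1, Integer::sum);
        }
        // Pick the most frequent class in each cluster.
        Map<Integer, String> mapping = new HashMap<>();
        for (Map.Entry<Integer, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            mapping.put(e.getKey(), best);
        }
        return mapping;
    }

    public static void main(String[] args) {
        int[] cluster = {0, 0, 0, 1, 1};
        String[] label = {"normal", "normal", "dos", "dos", "dos"};
        // Cluster 0 is mostly "normal", cluster 1 is all "dos".
        System.out.println(majorityClassPerCluster(cluster, label));
    }
}
```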
To begin, the KDD Cup 99 dataset was reloaded into the Weka GUI to remove the
Discretization filter applied in the previous section. Once the dataset was
loaded, the Cluster tab was selected and K-Means was chosen as the first
clustering algorithm. K-Means was run using 'ClassificationViaClustering' with
12 clusters, giving 77.4976% correctly classified instances.
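K-Means itself alternates two steps: assign every point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A one-dimensional sketch of those two steps (illustrative only; the experiment above uses Weka's K-Means on full 41-attribute connection records):

```java
import java.util.Arrays;

public class KMeans1D {
    /** Runs k-means on 1-D data for a fixed number of iterations; returns final centroids. */
    public static double[] kMeans(double[] data, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double x : data) {
                int nearest = 0; // assignment step: find the nearest centroid
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sum[nearest] += x;
                count[nearest]++;
            }
            for (int c = 0; c < centroids.length; c++) { // update step: mean of members
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] result = kMeans(data, new double[]{0.0, 10.0}, 5);
        System.out.println(Arrays.toString(result)); // two centroids, near 1.0 and 9.0
    }
}
```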
Figure 21: Algorithms Classification Time using Discretization
[Bar chart, classification time in seconds: Naïve Bayes 1.93, J48 5.68, Random Forest 148.93]
Figure 22: K-Means Model
Figure 23: K-Means Confusion Matrix
Following on from K-Means, the EM algorithm was also implemented using
'ClassificationViaClustering' with 12 clusters, to ensure test fairness.
Unfortunately, memory and CPU limitations caused this run to crash Weka, and
several further attempts gave the same result. By lowering the number of
clusters to 2, a result was achievable. As the cluster counts of the two
algorithms no longer match, the EM results will not be compared against the
other algorithms, so as to ensure test fairness. The EM algorithm can, however,
still be implemented with the analytical filters to investigate their effect on
the algorithm itself. The EM run returned 83.7499% accuracy.
Figure 24: EM Model
Figure 25: EM Confusion Matrix
After the two clustering algorithms were implemented without Standardization or
Discretization, they were implemented again with Standardization applied. With
Standardization applied to the KDD Cup 99 dataset, the K-Means algorithm
achieved an accuracy of 77.496%, essentially the same as the original result.
Next, the EM algorithm was run on the standardized KDD Cup 99 dataset,
returning 83.7509%, which is marginally better.
Figure 26: K-Means Model with Standardization
Figure 27: EM Model with Standardization
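Standardization rescales each numeric attribute to zero mean and unit variance, which matters for distance-based algorithms such as K-Means. A sketch of the transform on a single attribute column (population standard deviation is used here; implementations differ on sample vs population):

```java
public class Standardize {
    /** Rescales values to zero mean and unit standard deviation (z-scores). */
    public static double[] standardize(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;

        double variance = 0;
        for (double v : values) variance += (v - mean) * (v - mean);
        double std = Math.sqrt(variance / values.length); // population std

        double[] z = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            z[i] = std == 0 ? 0 : (values[i] - mean) / std; // constant column maps to 0
        }
        return z;
    }

    public static void main(String[] args) {
        double[] z = standardize(new double[]{2, 4, 6, 8});
        // mean 5, population std = sqrt(5); z-scores are symmetric about 0
        System.out.println(java.util.Arrays.toString(z));
    }
}
```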
Next, the KDD Cup 99 dataset was implemented with Discretization applied, the
same filter used for the classification algorithms, though in this case
unsupervised Discretization. The same implementation process carried out for
Standardization was performed for Discretization. The K-Means algorithm
produced a new result of 81.4082% accuracy compared to the original 77.4976%,
an improvement, while the EM algorithm returned 73.7629%, a reduction of 9.987
percentage points.
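Unsupervised Discretization of this kind typically means equal-width binning: an attribute's range is split into a fixed number of intervals and each value is replaced by its interval index. A sketch with an illustrative bin count:

```java
public class EqualWidthDiscretize {
    /** Maps each numeric value to an equal-width bin index in [0, bins-1]. */
    public static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int bin = width == 0 ? 0 : (int) ((values[i] - min) / width);
            out[i] = Math.min(bin, bins - 1); // the maximum value falls in the last bin
        }
        return out;
    }

    public static void main(String[] args) {
        int[] bins = discretize(new double[]{0, 2.5, 5, 7.5, 10}, 4);
        System.out.println(java.util.Arrays.toString(bins)); // prints [0, 1, 2, 3, 3]
    }
}
```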
Figure 28: K-Means Model with Discretization
Figure 29: K-means Confusion Matrix with Discretization
Figure 30: EM Model with Discretization
As mentioned at the end of section 4.2.1, the overall results of the experiment
will be discussed in section 5.1 Results.
4.3 Development
Continuing on from the proposed software design discussed in section 3.2,
development of the system began by creating a packet sniffer, the packet sniffer
was implemented using the Scapy approached, this choice was made because if
the proposed system were to be implemented in an open-source environment it
would be easier to incorporate Scapy rather than the likes of Snort which would
need a separate install.
The next stage of development was implementing the pre-processing and a machine
learning algorithm for the simulator. This was done with the Weka framework by
importing the weka.jar library into a Java class file, in which the IDS
training model is created and evaluated. In the Java project created when
setting up the system environment at the start of section 4, a class file
called 'IDS' was added. To start, the necessary Java libraries were imported:
the buffered reader, the file reader and the file-not-found exception. Next,
the Weka libraries needed for classification and evaluation were imported, as
the method of prediction used in this simulator is classification; the results
in section 4.2 showed that the classification algorithms obtained greater
prediction accuracy. Since the Random Forest algorithm showed the best
classification results without filtering, it was chosen as the machine learning
algorithm for this implementation, and the Weka rules and trees libraries
needed to implement it were imported. The final Weka libraries imported were
those needed to handle instances and make predictions.
The IDS class file contains methods for reading in the dataset, classifying it
(using cross-validation) and checking for an intrusion, plus a main method
which runs the simulator and outputs its results. Once this class was tested
successfully, a GUI was created using the NetBeans GUI builder to make the
simulator more user-friendly. The GUI can be seen in figure 31.
Figure 31: IDS Simulation GUI
When the simulator is run by clicking the 'Monitor' button, it classifies the
dataset using 10-fold cross-validation and checks whether any classified
instance differs from its predicted class; in this simulation, such a mismatch
is treated as an intrusion. In the actual system, an intrusion would correspond
to an incoming connection not matching the prediction; future work to achieve
this functionality is discussed in section 5.3.
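That intrusion check reduces to comparing each instance's predicted class against its labelled class and flagging any mismatch. A plain-Java stand-in for the check (the actual IDS class obtains the predictions via Weka's evaluation output; the labels here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class IntrusionCheck {
    /** Returns the indices of instances whose prediction disagrees with the label. */
    public static List<Integer> flaggedInstances(String[] predicted, String[] actual) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < predicted.length; i++) {
            if (!predicted[i].equals(actual[i])) {
                flagged.add(i); // a mismatch is treated as an intrusion by the simulator
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String[] predicted = {"normal", "dos", "normal", "probe"};
        String[] actual    = {"normal", "normal", "normal", "probe"};
        System.out.println(flaggedInstances(predicted, actual)); // prints [1]
    }
}
```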
5 Evaluation
5.1 Results
In this section the results of the experiment in section 4.2 are discussed and
explored. The algorithm that yielded the best result prior to any analytical
filter was the classification algorithm Random Forest; the lowest result came
from the clustering algorithm K-Means. The algorithm with the greatest
improvement from an applied filter was Naïve Bayes with the Discretization
filter, at 3.9856 percentage points, narrowly beating the improvement of
K-Means with Discretization at 3.5844. Although Discretization decreased the
performance of the decision tree algorithms and of the clustering algorithm EM,
it showed promising results with the Naïve Bayes and K-Means algorithms.
Applying an analytical filter reduced model classification time for both the
classification and the clustering algorithms; in the case of Discretization,
this reduction in classification time is a possible cause of the reduced
performance seen in the J48, Random Forest and EM algorithms, which have a
greater complexity than Naïve Bayes and K-Means.
Figure 32: Table showing the classification accuracy of Naive Bayes and K-Means algorithms with and
without Discretization applied. Classification accuracy shown in percentages

Algorithm        No Filter    Discretization
Naïve Bayes      95.411       99.3966
K-Means          77.4976      81.4082

In the results there was a massive reduction in false positives (false alarms)
when Discretization was applied to Naïve Bayes; this correlates directly with
the increase in classification accuracy. What is interesting in the results is
that Naïve Bayes implemented with Discretization recorded fewer false positives
than Random Forest without Discretization, even though Random Forest still has
the higher classification accuracy; the discretised Naïve Bayes algorithm must
therefore have more false negatives than Random Forest. False negatives are
also a concern in intrusion detection, but the focus of this experiment is
classification accuracy and the number of false positives. A big surprise in
the results was that although the classification accuracy of the K-Means
algorithm improved with Discretization, the number of false positives rose by a
massive 986, with an equal reduction in false negatives. This could be due to
the number of clusters used in this experiment, or it may simply be the regular
result of running K-Means with Discretization; future research could explore
this problem in an environment with the memory and CPU power to handle more
clusters.
Figure 33: Number of False Positives, Naive Bayes/ K-Means Comparison
The overall top-3 ranking of the classification/clustering algorithms, both
with and without a filter, based on classification accuracy is:
1. Random Forest – No Filter
2. J48 – No Filter
3. Naïve Bayes – Discretization
This ranking does not take into account the number of false positives or
classification time; the ranking of algorithms by lowest number of false
positives is shown below:
1. Naïve Bayes – Discretization
2. Random Forest – Discretization
3. J48 - Discretization
[Figure 33 bar chart, number of false positives: Naïve Bayes 174 (no filter) / 33 (Discretization); K-Means 816 (no filter) / 1802 (Discretization)]
The development portion of the project resulted in a functional IDS Simulation.
The code, which can easily be augmented for other classification algorithms,
produced a system that can take in a dataset, pre-process the data, classify
the data instances using Random Forest with 10-fold cross-validation, check the
resulting classifier for a detected intrusion and, if one is found, output a
detection message to the GUI. Real-time/near real-time intrusion detection
could not be implemented in this project, but is discussed in section 5.3.
5.2 Conclusion
From the results shown in section 5.1, a conclusion can be drawn on the effects
of using analytics to improve intrusion detection: there is no single supreme
machine learning algorithm for Intrusion Detection Systems. For example, the
classification algorithm Random Forest had the best classifier accuracy
(resulting in the lowest combined number of false positives and false
negatives) but a high classification time. This approach would be highly
effective on a network with high CPU and memory resources, but it would not
scale to a network with lower CPU and memory resources, as a decision tree
algorithm would place overhead (a large classification time) on the network.
The results show that this overhead can be reduced by applying Discretization
to the decision tree algorithms at the cost of a small loss in accuracy; even
with this reduction, the decision tree algorithms remain ahead of every other
algorithm explored in this project in accuracy. If good accuracy, low
classification time and a low number of false positives are what the network
needs, then the Naïve Bayes algorithm with Discretization is the best choice,
as it returned the lowest number of false positives. In conclusion, the best
machine learning algorithm for an IDS is the algorithm that best fits the
requirements and restrictions of the network on which the system will be
implemented, and whether the approach is Misuse or Anomaly based.
Furthermore, in the majority of cases (the minority being the clustering
algorithms), implementing an algorithm with an analytical filter such as
Discretization or Standardization resulted in one or more benefits to the
machine learning algorithm, whether an increase in accuracy, a reduction in
classification time or a reduced number of false positives (all three in the
case of Naïve Bayes). Research into analytical filters for use on a dataset is
therefore an invaluable use of time to ensure the IDS gets the best results
possible.
Finally, these results show that the classification algorithms achieved greater
performance than the clustering algorithms experimented on in this project.
This does not mean that classification algorithms are better than clustering
algorithms; rather, it shows that the KDD Cup 99 dataset works well with
classification compared to clustering. Clustering algorithms also work on data
with no class labels, and this unsupervised type of machine learning is much
more difficult and complex than classification.
In the end, this project showed that there are benefits and downfalls to most
analytical approaches for machine learning algorithms, but the benefits
outweigh the downfalls; selecting the right machine learning algorithm
implementation with a suitable analytical filter can indeed enhance intrusion
detection.
5.3 Future Work
This section discusses work that could not be developed or implemented, and
changes that would be made in hindsight given the results of the project.
The main functionality that could not be implemented was real-time/near
real-time intrusion detection; the IDS Simulator did not have the functionality
needed to classify incoming network packets, which is done by deploying a
trained machine learning model to predict the type of each incoming packet. The
packet sniffer implemented (Scapy) outputs each incoming packet's IP address
rather than its connection information; if real-time/near real-time intrusion
detection is to be performed against the KDD Cup 99 dataset, incoming packets
must first be converted to connection-level data. This leads on to the
downfalls of the KDD Cup 99 dataset itself: it is seen as a poor proxy for
real-world data, being over 15 years old and therefore outdated. If this
project were undertaken again, the dataset would be collected over a few months
on the environment the system is intended for, favouring rare attacks such as
R2L and U2R rather than the DOS attacks that dominate the KDD Cup 99 dataset.
The dataset would also be skewed towards normal packets, rather than heavily
attack-skewed as KDD Cup 99 is.
The IDS Simulator would better model how Intrusion Detection Systems are
actually implemented if it used a clustering algorithm, since many attacks
these days are unknown and an anomaly-based approach would be more beneficial.
Finally, the environment used in this project did not have the CPU processing
power and RAM needed for complex algorithms classifying large datasets (as seen
with the EM clustering algorithm), which meant some results could not be
recorded. If the experiment were run in an environment with high CPU processing
power and RAM, more accurate results could have been generated, possibly on a
bigger dataset.
References
[1] SANS™ Institute, “What is a false positive and why are false positives a problem?,”
[Online]. Available: https://www.sans.org/security-resources/idfaq/what-is-a-false-
positive-and-why-are-false-positives-a-problem/2/8.
[2] Predictive Analytics Today, “What is Predictive Analytics,” [Online]. Available:
http://www.predictiveanalyticstoday.com/what-is-predictive-analytics/.
[3] A. Smola and S. Vishwanathan, “Introduction to Machine Learning”.
[4] M. J. Zaki and W. Meira Jr., “Data mining and Analysis: Fundamental Concepts and
Algorithms,” 2014.
[5] A. A. Rao, P. Srinivas, B. Chakravarthy, K. Marx and P. Kiran , “A Java Based Network
Intrusion Detection System (IDS)”.
[6] MIT Lincoln Labs, “KDD Cup 1999 Data,” Information and Computer Science
University of California, Irvine, 1999. [Online]. Available:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[7] M. K. Siddiqui and S. Naahid, “Analysis of KDD CUP 99 Dataset using Clustering
based Data,” International Journal of Database Theory and Application, vol. 6,
no. 5, pp. 24-25, 2013.
[8] M. Tavallaee, E. Bagheri, W. Lu and A. A. Ghorbani, “A Detailed Analysis of the KDD
CUP 99 Data Set,” 2009.
[9] H. Liu, F. Hussain, C. L. Tan and M. Dash, “Discretization: An Enabling Technique,”
Data Mining and Knowledge Discovery, 2000.
[10] M. Walker, “Random Forests Algorithm,” 2013. [Online]. Available:
http://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm.
[11] I. B. Mohamad and D. Usman, “Standardization and Its Effects on K-Means
Clustering Algorithm,” p. 3300, 2013.
[12] N. Sharma, A. Bajpai and R. Litoriya, “Comparison the various clustering algorithms
of weka tools,” International Journal of Emerging Technology and Advanced
Engineering, vol. 2, no. 5, pp. 76-79, 2012.
[13] C. Piech and A. Ng, “K Means,” [Online]. Available:
http://stanford.edu/~cpiech/cs221/handouts/kmeans.html.
[14] M. B. and M. B. , “An overview to Software Architecture in Intrusion Detection
System,” International Journal of Soft Computing And Software Engineering (JSCSE),
p. 4, 2011.
[15] D. Magers, “Packet Sniffing: An Integral Part of Network Defense,” 2002.
[16] M. Roesch. [Online]. Available: https://www.snort.org/.
[17] Penn State Berks, “Introduction to Snort,” [Online]. Available:
http://istinfo.bk.psu.edu/labs/Snort.pdf.
[18] M. Richardson and B. Fenner. [Online]. Available: http://www.tcpdump.org/.
[19] P. Biondi. [Online]. Available: http://www.secdev.org/projects/scapy/.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, The
WEKA Data Mining Software: An Update, vol. 11, 2009.
[21] P. Refaeilzadeh, L. Tang and H. Liu, Cross-Validation, Arizona State University, 2008.
[22] S. K. Patro and K. K. Sahu, “Normalization: A Preprocessing Stage”.
Figure 1: Bayes Theorem
Figure 2: IDS Software Architecture Overview [12]
Figure 3: Software Architecture Overview of IDS Simulator
Figure 4: Naive Bayes Model
Figure 5: Naive Bayes Confusion Matrix
Figure 6: J48 Model
Figure 7: J48 Confusion Matrix
Figure 8: Random Forest Model
Figure 9: Random Forest Confusion Matrix
Figure 10: Classification Algorithms Accuracy
Figure 11: Classification Algorithms False Positive Rate
Figure 12: Algorithms Classification Time
Figure 13: Naive Bayes with Discretization Model
Figure 14: Naive Bayes with Discretization Confusion Matrix
Figure 15: J48 Model with Discretization
Figure 16: J48 Confusion Matrix with Discretization
Figure 17: Random Forest Model with Discretization
Figure 18: Random Forest Confusion Matrix with Discretization
Figure 19: Classification Algorithms Accuracy using Discretization
Figure 20: Classification Algorithms False Positives Rate using Discretization
Figure 21: Algorithms Classification Time using Discretization
Figure 22: K-Means Model
Figure 23: K-Means Confusion Matrix
Figure 24: EM Model
Figure 25: EM Confusion Matrix
Figure 26: K-Means Model with Standardization
Figure 27: EM Model with Standardization
Figure 28: K-Means Model with Discretization
Figure 29: K-Means Confusion Matrix with Discretization
Figure 30: EM Model with Discretization
Figure 31: IDS Simulation GUI
Figure 32: Table showing the classification accuracy of Naive Bayes and K-Means
algorithms with and without Discretization applied. Classification accuracy
shown in percentages
Figure 33: Number of False Positives, Naive Bayes/K-Means Comparison
Project Report on Cloud Storage
 
CLINICAL_MANAGEMENT_SYSTEM_PROJECT_DOCUM.docx
CLINICAL_MANAGEMENT_SYSTEM_PROJECT_DOCUM.docxCLINICAL_MANAGEMENT_SYSTEM_PROJECT_DOCUM.docx
CLINICAL_MANAGEMENT_SYSTEM_PROJECT_DOCUM.docx
 
Accident detection and notification system
Accident detection and notification systemAccident detection and notification system
Accident detection and notification system
 
Final_Thesis
Final_ThesisFinal_Thesis
Final_Thesis
 
TOGETHER: TOpology GEneration THrough HEuRistics
TOGETHER: TOpology GEneration THrough HEuRisticsTOGETHER: TOpology GEneration THrough HEuRistics
TOGETHER: TOpology GEneration THrough HEuRistics
 
SAP Development Object Testing
SAP Development Object TestingSAP Development Object Testing
SAP Development Object Testing
 
Dissertacao_ARCA_Alerts_Root_Cause_Analysis-versao-impressao-20150408
Dissertacao_ARCA_Alerts_Root_Cause_Analysis-versao-impressao-20150408Dissertacao_ARCA_Alerts_Root_Cause_Analysis-versao-impressao-20150408
Dissertacao_ARCA_Alerts_Root_Cause_Analysis-versao-impressao-20150408
 
Accuracy and time_costs_of_web_app_scanners
Accuracy and time_costs_of_web_app_scannersAccuracy and time_costs_of_web_app_scanners
Accuracy and time_costs_of_web_app_scanners
 
Assignment2 nguyen tankhoi
Assignment2 nguyen tankhoiAssignment2 nguyen tankhoi
Assignment2 nguyen tankhoi
 
Thesis
ThesisThesis
Thesis
 
A.R.C. Usability Evaluation
A.R.C. Usability EvaluationA.R.C. Usability Evaluation
A.R.C. Usability Evaluation
 
Analyzing and implementing of network penetration testing
Analyzing and implementing of network penetration testingAnalyzing and implementing of network penetration testing
Analyzing and implementing of network penetration testing
 
smartwatch-user-identification
smartwatch-user-identificationsmartwatch-user-identification
smartwatch-user-identification
 
Report of Previous Project by Yifan Guo
Report of Previous Project by Yifan GuoReport of Previous Project by Yifan Guo
Report of Previous Project by Yifan Guo
 

FYP Thesis

  • 1. Using Analytics to Enhance Intrusion Detection Systems by Jamie Sullivan Final-Year Project - BSc in Computer Science Supervisor: Prof. Gregory Provan Second Reader: Dr. Derek Bridge Department of Computer Science University College Cork April 2016
  • 2. Abstract By Jamie Sullivan Network Intrusion Detection Systems suffer from high false-alarm rates; this project uses a predictive analytics approach to study how to reduce false alarms. Predictive analytics will combine system models together with data to predict the most likely actions of intruders, in order to distinguish real intrusion signatures (triggering of alarms) from random signatures. The project will combine learning from data with simulation models for signature analysis. The project will provide both an experimental approach to study how predictive analytics combines with machine learning algorithms/techniques, and a software simulation of how the system is intended to operate using open source software. Predictive analysis is carried out on the KDD Cup ’99 dataset, implemented using both supervised learning (Classification) and unsupervised learning (Clustering) algorithms to create simulation models, with the approach aiming to reduce false alarms (false positives) by increasing the predictive accuracy of these algorithms.
  • 3. Declaration of Originality In signing this declaration, you are confirming, in writing, that the submitted work is entirely your own original work, except where clearly attributed otherwise, and that it has not been submitted partly or wholly for any other educational award. I hereby declare that: - this is all my own work, unless clearly indicated otherwise, with full and proper accreditation; - with respect to my own work: none of it has been submitted at any educational institution contributing in any way towards an educational award; - with respect to another’s work: all text, diagrams, code, or ideas, whether verbatim, paraphrased or otherwise modified or adapted, have been duly attributed to the source in a scholarly manner, whether from books, papers, lecture notes or any other student’s work, whether published or unpublished, electronically or in print. Name: Jamie Sullivan Signed: _______________________________
  • 4. Acknowledgements I am using this opportunity to express my gratitude to everyone who supported me throughout the course of this project. I am thankful for their aspiring guidance, invaluably constructive criticism and friendly advice during the project work. I am sincerely grateful to them for sharing their truthful and illuminating views on a number of ideas related to the project. I would like to express my sincere gratitude to my supervisor Prof. Gregory Provan for the continuous support during the course of this project, for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me in all aspects of the project. I could not have imagined having a better advisor and mentor for my project. Thank you to everyone in lab 1.09, for the stimulating discussions, fantastic atmosphere and work environment over the past several months. Last but not least, I would like to thank my friends and family for all the support, guidance and advice throughout the entire process of this project, and especially over my four years here in University College Cork.
  • 5. Table of Contents 1 Introduction .....................................................................................................1 2 Analysis.............................................................................................................3 3 Design...............................................................................................................4 3.1 Experimental Design .................................................................................4 3.1.1 Knowledgebase/Dataset....................................................................4 3.1.2 Classification Algorithms Approach...................................................5 3.1.2.1 Analytical approach – Discretization..............................................5 3.1.2.2 Naïve Bayes ....................................................................................5 3.1.2.3 J48 ..................................................................................................6 3.1.2.4 Random Forest...............................................................................6 3.1.3 Clustering Algorithms Approach........................................................6 3.1.3.1 Analytical Improvement – Standardize & Discretization...............6 3.1.3.2 K-Means .........................................................................................7 3.1.3.3 EM ..................................................................................................7 3.2 Software Design ........................................................................................7 3.2.1 Packet Sniffer.....................................................................................8 3.2.2 Data Pre-Processor ............................................................................9 3.2.3 Knowledgebase/Dataset....................................................................9 3.2.4 Machine Learning Algorithm .............................................................9 
3.2.5 Trained Model....................................................................................9 3.2.6 Network .............................................................................................9 4 Implementation .............................................................................................10 4.1 Weka .......................................................................................................10 4.2 Experiment..............................................................................................10 4.2.1 Classification Implementation.........................................................11 4.2.2 Clustering Implementation..............................................................18 4.3 Development...........................................................................................22 5 Evaluation.......................................................................................................24 5.1 Results.....................................................................................................24 5.2 Conclusion...............................................................................................26 5.3 Future Work ............................................................................................27 References
  • 6. 1 1 Introduction With a growing level of intrusions on the internet and local networks, a vast amount of work is being invested in intrusion detection. Intrusion detection systems coupled with intrusion prevention systems work to stop these attacks. Unfortunately, Intrusion Detection Systems (IDS) suffer from high false-alarm rates, known as false-positives; these false-positives result in genuine packets being flagged as attacks, which is not ideal as it drowns out legitimate attack detections with false ones [1]. Another problem in IDSs is false-negatives, which fail to identify a packet as an attack; this project will focus on reducing false-positives. This project will attempt to use predictive analytics to study how to reduce false alarms. Predictive analytics is an analytical method used to make predictions about unknown future events, using techniques like machine learning, data mining, modelling and statistics to analyse stored data and make accurate predictions about future data [2]. Machine learning algorithms and data mining techniques, combined with statistical methods, will be implemented on the training dataset to create predictive models, which will improve the accuracy of the models and thus reduce false-positives. Both misuse and anomaly based intrusion techniques will be implemented to investigate the effects of the false-positive reduction method researched. The predictive analytics approach researched in this project will be implemented on the free to use KDD Cup 99 dataset; research will therefore be done on the dataset to see if it is an effective dataset for intrusion detection. Free to use data is always of interest when performing a research project or developing a system, as it is cost effective and the resulting model/system can be implemented by others using the same free to use data.
The project was split into two sections: the first section is an experiment to research the capabilities and effects of predictive analytics for intrusion detection; the second is the development of a system for intrusion detection. Both the experiment and development sections were implemented using the open source Weka data mining tool, which will be discussed in later sections. The experiment was carried out first, followed by the development of the system using the model that returned the best results. The goal of the developed system is an easy to use and easy to interpret IDS; it will not have intrusion prevention functionality, but the system should be able to handle real-time/near real-time intrusion detection and be scalable to all network types. In the end the experiment showed that predictive analytics can be implemented to improve detection accuracy and reduce false-positives. This was done by using predictive machine learning/data mining algorithms combined with statistical
  • 7. 2 methods to filter the data in the KDD Cup 99 dataset. Although the system developed did not have real-time/near real-time intrusion detection functionality, a proposed method of implementation was discussed for future work; the system did provide a simulation of how intrusion detection is performed using the predictive analytic methods researched in the experiment. The experiment revealed the downfalls of the KDD Cup 99 dataset, which have been investigated in other papers, but showed the dataset still has merit as an effective dataset for use in research, especially in cases where the data is skewed.
  • 8. 3 2 Analysis Based on the project abstract and introduction, the objectives of this project are to complete adequate research in the field of Intrusion Detection Systems, use predictive analytics to increase classification accuracy and reduce false-positives, and implement the method researched in an IDS. The environment this research experiment and development will be carried out on is the UCC Computer Science department lab machines, which have 8GB RAM and an Intel Duo CPU 3.00 GHz processor. Before research and development can begin for this project, research must be done on the background ideologies:  Machine Learning o Methods of pattern detection/recognition which implement computational learning based on the data patterns [3].  Data Mining o The process of gaining insightful knowledge from data in a knowledgebase/dataset [4].  Predictive Analytics o Analytical method which is used to make predictions on unknown future events [2].  False-Positives o Packet data that is flagged as an attack intrusion but is actually a legitimate normal packet.  False-Negatives o Packet data that is flagged as a normal packet but was actually a malicious attack packet. The research experiment will implement algorithms such as Naïve Bayes, J48, Random Forest, EM and K-Means; these algorithms were chosen as together they cover a large range of functionality and different machine learning approaches. The individual algorithms will be discussed more in sections 3.1.2 and 3.1.3. Data mining and statistical methods will be applied to the dataset to optimise the data retrieved from it; methods such as Discretization and Standardization will be implemented. The IDS being developed will use a Java architecture and requirements similar to those discussed in [5], but will also use the open-source data mining/machine learning tools discussed in later sections.
The results of this project aim to discover and implement an effective means of improved network intrusion detection that, when implemented, will reduce false-positives and improve classification accuracy, while also not creating massive overhead that would cause network congestion, in turn making the proposed improvement method scalable to both small and large networks.
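The false-positive and false-negative definitions above come straight off a binary confusion matrix. As a minimal illustration (with invented counts, not the thesis's results), where 'positive' means a packet flagged as an attack:

```python
# Illustrative sketch with invented numbers: reading false-positive and
# false-negative rates off a binary confusion matrix, where a 'positive'
# is a packet flagged as an attack.

def rates(tp, fp, tn, fn):
    fpr = fp / (fp + tn)                  # normal packets wrongly flagged as attacks
    fnr = fn / (fn + tp)                  # attack packets wrongly passed as normal
    acc = (tp + tn) / (tp + fp + tn + fn) # overall classification accuracy
    return fpr, fnr, acc

fpr, fnr, acc = rates(tp=80, fp=5, tn=95, fn=20)
print(round(fpr, 2), round(fnr, 2), round(acc, 3))   # -> 0.05 0.2 0.875
```

Reducing the false-positive rate while keeping accuracy high is exactly the trade-off the experiment measures.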
  • 9. 4 3 Design 3.1 Experimental Design The largest portion of this project was researching how to improve Intrusion Detection using Analytics, with research being carried out on two machine learning techniques, Classification algorithms and Clustering algorithms, discussed below. 3.1.1 Knowledgebase/Dataset The Knowledgebase/Dataset that will be used in this experiment is the KDD Cup 99 dataset, a dataset originally used for “The Third International Knowledge Discovery and Data Mining Tools Competition”, the competition's task being to build a network intrusion detector, a predictive model capable of distinguishing between intrusions (attacks) and normal connections [6]. With this task being very similar to the task set out in this project, the KDD Cup 99 dataset is a good fit for the experiment. For the experiment a pre-made 10% section of the entire KDD Cup 99 dataset will be used; due to the limitations of the system memory being used in the experiment, a subsection of that 10% was taken, this being an 8% version of the KDD Cup 99 dataset with a total of 298,497 instances, 42 attributes and 23 distinct class values. The 10% training dataset has 19.69% normal and 80.31% attack connections [7]; the breakdown of attacks is shown below:  Denial of Service Attack (DOS) o Attacks that try to block legitimate connection requests by making computing or memory resources too busy for requests. o ‘back’, ‘land’, ‘neptune’, ‘pod’, ‘smurf’ and ‘teardrop’  Users to Root Attack (U2R) o Attacker has access to a normal user account, and uses these attacks to try to gain root level access. o ‘buffer_overflow’, ‘loadmodule’, ’perl’, and ‘rootkit’  Remote to Local Attack (R2L) o Attacker does not have access to an account on a machine; the attacker tries to expose a vulnerability on the network using these attacks to gain access to the machine.
o ‘ftp_write’, ‘guess_passwd’, ‘imap’, ‘multihop’, ‘phf’, ‘spy’, ‘warezclient’ and ‘warezmaster’  Probing Attack (PROBE) o Attacks try to gain information about the network in order to use this information to get past its security. o ‘ipsweep’, ‘nmap’, ‘portsweep’ and ‘satan’ As stated above, the ratio of normal to attack connections in the dataset is 19.69% to 80.31% respectively; thus this is a dataset heavily skewed towards attack connections. The dataset seems to favour more records of DOS attack instances such as ‘neptune’ and ‘smurf’ rather than more harmful attacks such as U2R and
  • 10. 5 R2L (which in practical use are more desirable in a training dataset), which can lead the results to be biased towards frequent record detection methods. This is not ideal for practical use, but for this experiment the focus is more on the analytical methods which can be implemented to improve the baseline result (the result received without the analytical method implemented). The downfalls of the KDD Cup 99 dataset are discussed in [8]. Despite these downfalls, the KDD Cup 99 dataset is still an effective dataset for the purpose of this project. Finally, the KDD Cup 99 dataset contains the following connection protocols [7]:  TCP: Reliable connection-oriented protocol  UDP: Unreliable and connection-less protocol  ICMP: Communication over networked computers 3.1.2 Classification Algorithms Approach Classification is a supervised machine learning technique that assigns instances in a collection to a target class; the end goal of a classifier is to accurately predict the target class for each case in the data. In the following sections the classification algorithms that will be used and the analytical approach to reduce false positives will be discussed. 3.1.2.1 Analytical approach – Discretization Discretization is the process of converting continuous attribute values to nominal attribute intervals, giving a smaller attribute value space. Discretization has been shown to improve prediction accuracy in previous works [9], due to intervals being a more concise representation of the data that is easier to use and comprehend than continuous values. Discretization is the planned analytical approach to attempt to make an improvement on the results of the algorithms discussed below. 3.1.2.2 Naïve Bayes The Naïve Bayes algorithm is based on conditional probabilities and uses Bayes' Theorem; it finds the probability of an event occurring given the probability of another event that has already occurred.
In the case of this experiment it finds the probability of a connection being either a normal or attack connection based on the previously calculated probability of the connections in the knowledgebase/dataset. Figure 1: Bayes Theorem.
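For reference, Bayes' Theorem (Figure 1) states P(C|x) = P(x|C)P(C) / P(x), where C is a class (normal or attack) and x a connection's attribute values. As a rough sketch of the idea only (not the Weka NaiveBayes implementation used in the thesis, and with invented stand-in attribute values), a categorical Naïve Bayes classifier looks like this:

```python
from collections import Counter, defaultdict

# Illustrative sketch (not the thesis's code): a minimal categorical Naive
# Bayes, the same idea Weka's NaiveBayes applies to KDD Cup 99 records.
# The feature values below are invented stand-ins for KDD attributes.

def train(rows, labels):
    """Count P(class) and P(feature=value | class) evidence from the data."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # (feature_index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(i, y)][v] += 1
    return class_counts, feat_counts

def predict(row, class_counts, feat_counts):
    """Pick the class maximising P(C) * prod_i P(x_i | C), with smoothing."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, n in class_counts.items():
        p = n / total
        for i, v in enumerate(row):
            counts = feat_counts[(i, c)]
            # add-one style smoothing so unseen values never zero out p
            p *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if p > best_p:
            best, best_p = c, p
    return best

# Toy connection records: (protocol, service)
rows = [("tcp", "http"), ("tcp", "http"), ("icmp", "ecr_i"), ("icmp", "ecr_i")]
labels = ["normal", "normal", "smurf", "smurf"]
cc, fc = train(rows, labels)
print(predict(("icmp", "ecr_i"), cc, fc))   # -> smurf
```

The 'naïve' part is the product over features, which assumes attribute values are conditionally independent given the class.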
  • 11. 6 In this experiment Naïve Bayes will be the main algorithm being focused on, because it works surprisingly well as a classification algorithm given that it works on ‘naïve’ assumptions and yet is still competitive against other more elegant classifiers, and because it has a lower performance compared to the two decision tree algorithms discussed in the following sections. For these two reasons, the results of training/testing the Naïve Bayes classifier can be used as the baseline for the experiment, and the analytical approach discussed in section 3.1.2.1 can then be used to try and bring the results of the classifier as close as possible to the superior results of the decision tree algorithms. 3.1.2.3 J48 The J48 algorithm is a decision tree algorithm; it is an open source Java implementation of the C4.5 algorithm, which builds a classification tree based on the principle of information entropy. In this experiment, research will be carried out to see how the J48 algorithm compares to the results of the Naïve Bayes classifier and its fellow decision tree algorithm Random Forest; the analytical method used to try to improve the results of the Naïve Bayes algorithm will also be applied to J48 to see if the results have the same outcome. 3.1.2.4 Random Forest Random Forest is another decision tree algorithm that uses an ensemble learning method, building a classifier by combining tree predictors with random vectors to create the decision trees [10]. Like J48, the results of the Random Forest classifier will be used in comparison with Naïve Bayes and J48; as in the cases of Naïve Bayes and J48, the Random Forest algorithm will also have the analytical method discussed above applied and the outcomes compared. 3.1.3 Clustering Algorithms Approach Clustering is an unsupervised machine learning technique that groups data together into ‘clusters’ based on their likeness; unlike classification, clustering algorithms can work off unlabelled datasets, i.e. data that does not have a class attribute value.
Research on the performance of clustering algorithms will be carried out on the K-Means and EM algorithms; comparisons will be made between the results of both these algorithms, with and without the planned analytical approach for improved performance applied. 3.1.3.1 Analytical Improvement – Standardize & Discretization Standardization is a technique that transforms the dataset to have a mean of 0 and unit variance of 1 [11]. Like the classification algorithms, Discretization will also be implemented on the Clustering algorithms so as to see the effects it has on their results, although in the case of the clustering algorithms an unsupervised discretization filter will be used. Both these analytical approaches will be implemented on the two algorithms below to investigate if there is an improvement in their results.
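The two pre-processing steps named above can be sketched on a single numeric attribute column. This is an illustration of the transformations only, not Weka's Standardize or unsupervised Discretize filters (the bin strategy and column values are assumptions):

```python
# Illustrative sketch of the two analytical approaches: z-score
# standardization (mean 0, unit variance) and equal-width discretization
# (continuous values replaced by interval indices).

def standardize(xs):
    """Shift and scale a column so it has mean 0 and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5 or 1.0          # guard against a constant column
    return [(x - mean) / sd for x in xs]

def discretize(xs, bins=3):
    """Replace each value with the index of its equal-width interval."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    return [min(int((x - lo) / width), bins - 1) for x in xs]

col = [0.0, 1.0, 2.0, 3.0]          # e.g. a 'duration'-style attribute
print(discretize(col, bins=2))      # -> [0, 0, 1, 1]
```

After discretization the attribute is nominal (bin indices), which is the more concise representation the classification experiment exploits.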
  • 12. 7 3.1.3.2 K-Means The K-Means clustering algorithm aims to partition n objects/observations from the dataset into k clusters, where each object/observation belongs to the cluster with the nearest mean [12]. In simple terms, K-means assigns k centroids (the centre point of each cluster) that are used to define the clusters; an instance is assigned to a particular cluster if it is closer to that cluster's centroid than to any other centroid, much like nearest neighbour [13]. 3.1.3.3 EM The EM (Expectation Maximization) algorithm evaluates clusters using two stages, first calculating the expectation of the log-likelihood, followed by computing the parameters maximizing the log-likelihood (calculated in the previous step), which is used to calculate the distribution. This algorithm has been shown to be useful on real-world datasets, and has been used after implementing K-Means in other work [12], so the comparison of results will be of interest. 3.2 Software Design For this project a simulation of how the proposed Intrusion Detection System would use the improvements researched was needed; the following sections go into greater detail on the high-level architecture. Figure 2: IDS Software Architecture Overview [14]
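The assign-then-update loop described for K-Means can be sketched in a few lines on 1-D points. This is the textbook loop, not Weka's SimpleKMeans, and the point and centroid values are invented toy data:

```python
# Illustrative sketch (not Weka's SimpleKMeans): the two-step k-means loop
# on 1-D points with invented toy values.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans([1.0, 2.0, 9.0, 10.0], centroids=[0.0, 5.0]))  # -> [1.5, 9.5]
```

EM follows the same alternating shape, but replaces the hard nearest-centroid assignment with probabilistic memberships under the log-likelihood.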
  • 13. 8 Figure 3: Software Architecture Overview of IDS Simulator 3.2.1 Packet Sniffer Captures the network traffic, filters it for the particular traffic you want, then stores the data in a buffer. The captured packets are then analysed/decoded in real-time or near real-time [15]. Possible packet sniffer tools are discussed below:  SNORT [16] o An open source intrusion prevention system capable of real-time traffic analysis and packet logging, libpcap-based and rule based. In this project the full services Snort offers would not be implemented, instead only a subsection, in this case Snort's packet sniffing/capture capabilities in Packet Logger Mode [17].  TCPDUMP [18] o A powerful command-line tool that allows you to sniff/capture network packets; much like Snort it is libpcap-based.  Scapy [19] o An interactive packet manipulation program. The features of interest in this project being the ability to decode packets of a wide number of protocols and capture them.
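The decoding step all three tools perform boils down to reading fields out of raw packet bytes. A minimal sketch of that idea (not Snort, tcpdump or Scapy code; the header bytes are hand-crafted test data):

```python
# Illustrative sketch: decoding the protocol field of a captured IPv4
# header, the kind of per-packet step a sniffer performs on its buffer.

PROTOCOLS = {1: "icmp", 6: "tcp", 17: "udp"}

def protocol_of(ip_header):
    """Return the transport protocol name from a raw 20-byte IPv4 header."""
    # Byte at offset 9 of the IPv4 header is the protocol number.
    proto = ip_header[9]
    return PROTOCOLS.get(proto, "other")

# Minimal fake IPv4 header: version/IHL first, protocol=6 (TCP) at offset 9.
header = bytes([0x45, 0, 0, 20, 0, 0, 0, 0, 64, 6] + [0] * 10)
print(protocol_of(header))   # -> tcp
```

The tcp/udp/icmp values recovered this way are exactly the protocol attribute the KDD Cup 99 records carry.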
  • 14. 9 3.2.2 Data Pre-Processor Data in the real world needs to be pre-processed as it can be incomplete (lacking attribute values and attributes of interest), inconsistent (containing discrepancies) and contain errors. Thus the data needs to be passed through a Data Pre-Processor to ‘clean up’ the data, otherwise poor quality data would lead to poor quality results in the later stages of the project. The open source machine learning software tool Weka will be used to pre-process the data for use in the IDS simulation. 3.2.3 Knowledgebase/Dataset The knowledgebase/dataset being used in this simulation for training is the KDD Cup 99 dataset; for the IDS simulation, the same number of instances and attributes will be used as discussed in section 3.1 Experimental Design. 3.2.4 Machine Learning Algorithm The machine learning algorithm used in the IDS simulation will be selected from the classification algorithms discussed in section 3.1.2, i.e. Naive Bayes, J48 and Random Forest. The Classification algorithm approach was chosen over a Clustering algorithm approach based on the fact that the KDD Cup 99 dataset has predefined labels (classes), which suits supervised learning. A Clustering algorithm can also be implemented with the KDD Cup 99 dataset and give the same quality results as Classification, but for this IDS simulation Classification was given preference. 3.2.5 Trained Model Once training on the data is complete, the trained model will be deployed on the network and used for comparison against captured incoming packets using one of the Packet Sniffer techniques discussed in section 3.2.1; this model allows detection to be carried out in real-time/near real-time as the captured packets are compared to the training model. In the case of an unsupervised trained model, new packet values not seen before in the trained model will retrain the model so that in future packets of similar instances can be detected.
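The clean-up performed by the data pre-processor described above can be sketched on one record. This is an illustration only, not the Weka pre-processing the thesis uses; the field names and defaults are invented, and '?' stands in for Weka's missing-value marker:

```python
# Illustrative sketch (not the thesis's Weka pipeline): the kind of
# clean-up a data pre-processor applies to an incomplete record before
# it reaches the learning algorithm.

def clean(record, defaults):
    """Fill missing or Weka-style '?' fields with sensible defaults."""
    out = {}
    for field, default in defaults.items():
        value = record.get(field)
        if value in (None, "", "?"):    # treat blank and '?' as missing
            value = default
        out[field] = value
    return out

defaults = {"protocol": "tcp", "duration": 0}
raw = {"protocol": "?", "duration": 12}
print(clean(raw, defaults))   # -> {'protocol': 'tcp', 'duration': 12}
```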
3.2.6 Network The domain of the IDS simulation will be Network based intrusion detection rather than Host based intrusion detection. The network that will be used for the simulation is a small scale network (a lab machine connected to the UCC CS network). Even though testing of the system will be carried out in a small scale environment, due to the technologies being open source and tested by their development communities, the system will be scalable to larger networks.
  • 15. 10 4 Implementation This section of the report will discuss in detail the process of implementing the work set out in the project brief. Prior to commencing work on this project, the work environment and tools needed to be set up and installed: first the Java environment was downloaded and installed (Java is used with Weka and for development of the IDS simulation), then Python 3 was installed and the Scapy library downloaded for implementation of the Packet Sniffer discussed in section 3.2.1. Weka was downloaded and installed for use in both the experiment and development sections of this project. The dataset was prepared by first downloading the KDD Cup 99 dataset from the KDD website [6]; the dataset comes in a plaintext format, and this was converted to .arff format by adding the dataset attributes and features, which can be found on the KDD website. The work environment and dataset were now ready for work to be carried out on implementing the research experiment and development stages of the project. 4.1 Weka This project was implemented using the Weka open source data mining software tool, a collection of machine learning algorithms implemented in Java. Weka contains tools for data pre-processing, classification, clustering and visualization, making it the perfect tool for the task set out by this project [20]. Weka can be used in multiple ways via:  Command Line  Imported weka.jar library  Weka GUI In its algorithm collection, Weka contains versions of each of the algorithms discussed in section 3.1, the algorithms in Weka being: NaiveBayes, J48, RandomForest, EM and SimpleKMeans. Weka also contains the pre-processing tools necessary for the analytical approaches discussed in sections 3.1.2.1 and 3.1.3.1. The research side of the project was implemented using the Weka GUI for its ease of use and for the tool's ability to display and store created models.
The development side of the project, on the other hand, was implemented using the imported weka.jar library, which was used in the simulator's Java code. 4.2 Experiment In the Weka GUI, the KDD Cup 99 dataset was loaded into the pre-processor section of the GUI and prepared for use in the classification algorithm implementations, followed by the clustering algorithm implementations.
  • 16. 11 4.2.1 Classification Implementation As discussed in section 3, the results of the Naïve Bayes algorithm will be used as a baseline for the rest of the algorithms and for a comparison of the rate of improvement, if any, from implementing the algorithms again with discretization applied. In the classification tab of the Weka GUI, the NaïveBayes algorithm is selected in the ‘Classifier’ section under the folder ‘Bayes’, which contains all Bayes style algorithms. First, a simple test was carried out to see if the KDD Cup 99 dataset could be classified correctly without errors, by checking if a training model could be created without testing the model. This was done by selecting ‘Use training set’ in the test options field; the model was created successfully, so the experiment moved on to building and testing a Naïve Bayes model using 10 fold Cross-validation. Cross-validation is another testing option in the test options field: it splits the dataset into a 90/10 split (90% for training the model and 10% for testing the model), and it does this for 10 iterations, each time selecting a different 90/10 split [21]. Naïve Bayes with 10 fold cross-validation results in 95.411% classification accuracy; this is now the benchmark. Figure 4: Naive Bayes Model Figure 5: Naive Bayes Confusion Matrix
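The 10-fold scheme described above can be sketched as index arithmetic: ten different 90/10 partitions of the instance list, each instance appearing in exactly one test fold. A minimal illustration (not Weka's implementation, which also stratifies and shuffles):

```python
# Illustrative sketch (not Weka's implementation): how 10-fold
# cross-validation carves a dataset into ten different 90/10
# train/test splits.

def folds(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_instances))
    size = n_instances // k
    for f in range(k):
        test = idx[f * size:(f + 1) * size]
        train = idx[:f * size] + idx[(f + 1) * size:]
        yield train, test

splits = list(folds(100, k=10))
print(len(splits))                            # -> 10
print(len(splits[0][0]), len(splits[0][1]))   # -> 90 10
```

The reported accuracy is then the average over the ten test folds, which is why it is a fairer benchmark than testing on the training set.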
Following Naïve Bayes, the J48 algorithm was run under the same conditions to keep the results fair; 10-fold cross-validation on the dataset returned a classification accuracy of 99.9407%.

Figure 6: J48 Model
Figure 7: J48 Confusion Matrix

Like J48 and Naïve Bayes, the Random Forest classification algorithm was run under the same conditions: 10-fold cross-validation was carried out on the dataset and the result saved for comparison. The resulting accuracy of Random Forest was 99.9719%, the best of the three classification algorithms investigated in this experiment. The goal was then to see whether either of the other algorithms could produce a competitive result when implemented with Discretization.
Figure 8: Random Forest Model
Figure 9: Random Forest Confusion Matrix

The results of implementing the classification algorithms are shown below.

Figure 10: Classification Algorithms Accuracy (Naïve Bayes 95.411%, J48 99.9407%, Random Forest 99.9719%)
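The accuracy and false-positive counts reported in these figures come directly from each run's confusion matrix. Taking 'attack' as the positive class, so that a false positive means normal traffic flagged as an attack, the bookkeeping can be sketched as:

```python
def confusion_counts(actual, predicted, positive="attack"):
    """Tally the four confusion-matrix cells for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive: tp += 1
            else:             fp += 1   # false alarm
        else:
            if a == positive: fn += 1   # missed intrusion
            else:             tn += 1
    return tp, fp, tn, fn

actual    = ["attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "attack", "normal", "normal", "normal"]
tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
print(fp, accuracy)   # 1 0.6
```

The KDD Cup 99 problem is multi-class, so Weka's confusion matrices are larger, but the accuracy and false-positive arithmetic is the same.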
Figure 11: Classification Algorithms False Positive Rate (Naïve Bayes 174, J48 81, Random Forest 63)
Figure 12: Algorithms Classification Time (Naïve Bayes 3.2 s, J48 47.93 s, Random Forest 380.87 s)

After all three chosen classification algorithms had been implemented, the next stage was implementing them again, this time on the KDD Cup 99 dataset with Discretization applied. This was done by returning to the pre-process tab in the Weka GUI, selecting the 'Filter' button, choosing the discretization filter under the supervised filters folder, and applying it.

The Naïve Bayes algorithm was then run again with Discretization applied. This filter is the only change made to the run of the classifier; all other conditions remain the same, to ensure that any improvement in results is due to Discretization. This time the results were significantly improved, with a classification accuracy of 99.3966% compared to 95.411% when
Discretization was not applied to the Naïve Bayes algorithm, along with a reduction of 141 in the number of false positives.

Figure 13: Naive Bayes with Discretization Model
Figure 14: Naive Bayes with Discretization Confusion Matrix

As with Naïve Bayes, the J48 algorithm was also implemented with Discretization. Unlike Naïve Bayes, however, J48 showed a reduction in performance, with a classification accuracy of 99.932% compared to 99.9407% for the original J48 run without Discretization.

Figure 15: J48 Model with Discretization
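Weka's supervised Discretize filter chooses its cut-points using class information (MDL-based binning); as a simplified, unsupervised stand-in, the equal-width binning below illustrates what discretization does to a numeric attribute:

```python
def equal_width_discretize(values, bins=4):
    """Replace each numeric value with a bin index.  Weka's supervised
    Discretize picks cut-points from the class labels instead; this
    equal-width version only illustrates the idea."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1   # guard against a constant attribute
    return [min(int((v - lo) / width), bins - 1) for v in values]

durations = [0, 1, 2, 50, 51, 99, 100]
print(equal_width_discretize(durations))   # [0, 0, 0, 2, 2, 3, 3]
```

After discretization every attribute is nominal, which is why a classifier like Naïve Bayes, which otherwise assumes a Gaussian distribution for numeric attributes, can benefit so markedly.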
Figure 16: J48 Confusion Matrix with Discretization

Given that the J48 decision tree algorithm returned a reduced classification accuracy when Discretization was applied, it was of interest to see whether Random Forest would show a similar reduction, as both algorithms belong to the decision tree family. The Random Forest results did echo the J48 result, with the accuracy of the classifier changing slightly to 99.9752%, but with a massive reduction in classification time of almost half: 380.87 seconds without Discretization versus 156.11 seconds with it.

Figure 17: Random Forest Model with Discretization
Figure 18: Random Forest Confusion Matrix with Discretization
Figure 19: Classification Algorithms Accuracy using Discretization
Figure 20: Classification Algorithms False Positives Rate using Discretization (Naïve Bayes 33, J48 121, Random Forest 90)
Figure 21: Algorithms Classification Time using Discretization (Naïve Bayes 1.93 s, J48 5.68 s, Random Forest 148.93 s)

In the next section, clustering algorithms are explored; their results will be compared with those of the classification algorithms discussed in this section. The overall results of the experiment are discussed in section 5.1.

4.2.2 Clustering Implementation

The results of the preceding classification algorithms will be compared with the results of the K-Means and EM algorithms explored in this section, each implemented with and without Standardization and Discretization applied individually. Both K-Means and EM were run under the same conditions and environment as the classification algorithms in section 4.2.1. Unlike the classification algorithms, however, cross-validation cannot be performed directly in the clustering tab of Weka. To cross-validate a clustering algorithm in Weka, select the classification tab and, under the meta classifier folder, select 'ClassificationViaClustering'; this gives the option to choose a clustering algorithm, after which classification is performed on the clusters so that an evaluation can be made.

To begin, the KDD Cup 99 dataset was reloaded into the Weka GUI to remove the Discretization filter applied in the previous section. The Cluster tab was then selected, and the first clustering algorithm chosen was K-Means. The K-Means algorithm was run using 'ClassificationViaClustering' with 12 clusters, resulting in 77.4976% correctly classified instances.
Figure 22: K-Means Model
Figure 23: K-Means Confusion Matrix

Following on from K-Means, the EM algorithm was also implemented using 'ClassificationViaClustering' with 12 clusters, to ensure test fairness. Unfortunately, memory and CPU limitations caused this run to crash Weka; it was attempted several more times with the same result. By lowering the number of clusters to 2, a result was achievable. As the number of clusters used by the two algorithms does not match, the EM results are not used in comparison with the other algorithms, so as to ensure fairness. The EM algorithm can, however, still be implemented with the analytical filters to investigate their effect on the algorithm itself. The result of the EM run was 83.7499% accuracy.
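Weka's SimpleKMeans performs this clustering over all 41 KDD attributes using Euclidean distance; the underlying assign-then-update loop can be sketched in one dimension as follows (a toy illustration, not the Weka implementation):

```python
import random

def k_means(points, k, iterations=20, seed=1):
    """Minimal 1-D K-Means: assign each point to the nearest
    centroid, then move each centroid to the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups: 'normal'-like values near 0, 'attack'-like near 100.
points = [1, 2, 3, 100, 101, 102]
centroids, clusters = k_means(points, k=2)
print(sorted(round(c) for c in centroids))   # [2, 101]
```

'ClassificationViaClustering' then maps each resulting cluster to its majority class so that a classification accuracy can be reported.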
Figure 24: EM Model
Figure 25: EM Confusion Matrix

After the two clustering algorithms had been implemented without Standardization or Discretization, they were implemented again, this time with Standardization applied. With Standardization applied to the KDD Cup 99 dataset, the K-Means algorithm achieved an accuracy of 77.496%, practically identical to the original result. Next, the EM algorithm was run on the standardized dataset, giving 83.7509%, marginally better than before.

Figure 26: K-Means Model with Standardization
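The Standardize filter rescales each numeric attribute to zero mean and unit variance so that no single attribute dominates the distance calculation. A one-attribute sketch (using the population standard deviation, which may differ slightly from Weka's variant):

```python
def standardize(values):
    """Z-score standardization: rescale an attribute so that it
    has mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1
    return [(v - mean) / std for v in values]

scaled = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print([round(v, 2) for v in scaled])
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Because K-Means relies on raw distances, standardization mainly changes which attributes drive the clustering rather than the amount of information available, which is consistent with the near-identical accuracy observed above.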
Figure 27: EM Model with Standardization

Next, the KDD Cup 99 dataset was discretized, using the same filter applied to the classification algorithms, except in this case the unsupervised version of Discretization. The same implementation process carried out for Standardization was performed for Discretization. The K-Means algorithm produced a new accuracy of 81.4082% compared to the original 77.4976%, a clear improvement, while the EM algorithm returned 73.7629%, a reduction of 9.987 percentage points.

Figure 28: K-Means Model with Discretization
Figure 29: K-Means Confusion Matrix with Discretization
Figure 30: EM Model with Discretization

As mentioned at the end of section 4.2.1, the overall results of the experiment are discussed in section 5.1.

4.3 Development

Continuing from the proposed software design discussed in section 3.2, development of the system began with the packet sniffer. The sniffer was implemented using the Scapy approach; this choice was made because, if the proposed system were implemented in an open-source environment, it would be easier to incorporate Scapy than something like Snort, which would need a separate install.

The next stage of development was implementing the pre-processing and a machine learning algorithm for the simulator. This was done with the Weka framework by importing the weka.jar library into a Java class file in which the IDS training model is created and evaluated. In the Java project created when setting up the system environment at the start of section 4, a class file named 'IDS' was created. The necessary Java libraries were imported first: BufferedReader, FileReader and FileNotFoundException. Next, the Weka libraries needed for classification and evaluation were imported, as the method of prediction used in this simulator is classification; the results in section 4.2 showed that the classification algorithms obtained greater prediction accuracy. Since Random Forest showed the best classification results without filtering, it was chosen as the machine learning algorithm for this implementation, so the Weka rules and trees libraries needed by the algorithm were imported. The final Weka libraries imported were those needed to handle instances and make predictions.
In the IDS class file, methods were written for reading in the dataset, classifying it (using cross-validation), checking for intrusions, and a main method used to run the simulator and output its results. Once this class was tested successfully, a GUI was created using the NetBeans GUI builder to make the simulator more user-friendly. The GUI can be seen in figure 31.
Figure 31: IDS Simulation GUI

When the simulator is run by clicking the 'Monitor' button, it classifies the dataset using 10-fold cross-validation and checks whether any classified instance differs from its predicted class; in this simulation, such a mismatch is treated as an intrusion. In the actual system, an intrusion would correspond to an incoming connection not matching the prediction; future work towards this functionality is discussed in section 5.3.
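The 'Monitor' flow above (classify each instance, compare it with its recorded class, flag mismatches as intrusions) can be sketched with a trivial stand-in classifier. The real simulator uses Weka's RandomForest; the 1-NN `nearest_neighbour` below is only a hypothetical placeholder:

```python
def nearest_neighbour(train, features):
    """Hypothetical stand-in for the Weka RandomForest classifier:
    predict the label of the closest training instance (1-NN)."""
    return min(train, key=lambda t: sum((a - b) ** 2
               for a, b in zip(t[0], features)))[1]

def monitor(train, incoming):
    """Flag any instance whose predicted class differs from its
    recorded class, as the simulator treats such mismatches."""
    alerts = []
    for features, label in incoming:
        predicted = nearest_neighbour(train, features)
        if predicted != label:
            alerts.append((features, label, predicted))
    return alerts

train = [((0, 0), "normal"), ((9, 9), "attack")]
incoming = [((1, 1), "normal"), ((8, 8), "normal")]  # second looks suspicious
print(monitor(train, incoming))   # [((8, 8), 'normal', 'attack')]
```

In the deployed system, the `incoming` records would come from the packet sniffer, converted to connection-level features, rather than from a labelled dataset.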
5 Evaluation

5.1 Results

In this section the results of the experiment in section 4.2 are discussed and explored. The algorithm that yielded the best result before being exposed to an analytical filter was the classification algorithm Random Forest; the lowest result came from the clustering algorithm K-Means. The algorithm with the best improvement from an applied filter was Naïve Bayes with Discretization, at 3.9856%, narrowly beating the improvement of K-Means with Discretization (3.9106%). Although Discretization decreased performance for the decision tree algorithms and for the clustering algorithm EM, it showed promising results with the Naïve Bayes and K-Means algorithms.

Applying an analytical filter to an algorithm reduced model classification time for both classification and clustering algorithms. In the case of Discretization, this reduction in classification time is a possible cause of the reduced performance seen in the J48, Random Forest and EM algorithms, which have greater complexity than Naïve Bayes and K-Means.

Figure 32: Table showing the classification accuracy (%) of the Naive Bayes and K-Means algorithms with and without Discretization applied:

              No Filter   Discretization
Naïve Bayes    95.411       99.3966
K-Means        77.4976      81.4082

In the results there was a massive reduction in false positives (false alarms) when Discretization was applied to Naïve Bayes, correlating directly with the increase in classification accuracy. What is interesting is that Naïve Bayes with Discretization recorded fewer false positives than Random Forest without Discretization, even though Random Forest still has a higher
classification accuracy; the discretized Naïve Bayes algorithm must therefore have more false negatives than Random Forest. False negatives are also a concern in intrusion detection, but in this experiment the focus is on classification accuracy and the number of false positives.

A big surprise in the results was that, although the classification accuracy of K-Means improved with Discretization, the number of false positives rose by a massive 986; there was thus an equal reduction in false negatives. This result could be due to the number of clusters used in this experiment, or it may simply be the normal behaviour of K-Means with Discretization; future research could explore this problem in an environment with the memory and CPU power to handle more clusters.

Figure 33: Number of false positives, Naive Bayes / K-Means comparison:

              No Filter   Discretization
Naïve Bayes      174           33
K-Means          816         1802

The overall top three classification/clustering algorithms, with or without a filter, ranked by classification accuracy are:

1. Random Forest - No Filter
2. J48 - No Filter
3. Naïve Bayes - Discretization

This ranking does not take into account the number of false positives or classification time. Ranked by lowest number of false positives, the algorithms are:

1. Naïve Bayes - Discretization
2. Random Forest - Discretization
3. J48 - Discretization

The development portion of the project resulted in a functional IDS simulation. The code, which can easily be adapted for other classification
algorithms, resulted in a system that can take in a dataset, pre-process the data, classify the data instances using Random Forest with 10-fold cross-validation, check the resulting classifications for an intrusion and, if one is detected, output a detection message to the GUI. Real-time/near real-time intrusion detection could not be implemented in this project, but is discussed in section 5.3.

5.2 Conclusion

From the results shown in section 5.1, a conclusion can be drawn about the effects of using analytics to improve intrusion detection: there is no single supreme machine learning algorithm for Intrusion Detection Systems. For example, the classification algorithm Random Forest had the best classifier accuracy (resulting in the lowest combined number of false positives and false negatives) but a high classification time. This approach would be highly effective on a network with generous CPU and memory resources, but would not scale to a network with lower resources, as a decision tree algorithm causes overhead in the form of a large classification time. The results show that this overhead can be reduced by applying Discretization to the decision tree algorithms at the cost of a small loss in accuracy, and even with this reduction the decision tree algorithms remain ahead of every other algorithm explored in this project on accuracy. If accuracy, low classification time and a low number of false positives are what the network needs, then Naïve Bayes with Discretization is the best choice, as it returned the lowest number of false positives. In conclusion, the best machine learning algorithm for an IDS is the one that best fits the requirements and restrictions of the network the system will be deployed on, and whether the approach is Misuse- or Anomaly-based.
Furthermore, in the majority of cases (the minority being the clustering algorithms), implementing an algorithm with an analytical filter such as Discretization or Standardization brought one or more benefits to the machine learning algorithm, whether an increase in accuracy, a reduction in classification time or a reduced number of false positives (all three in the case of Naïve Bayes). Research into analytical filters for a dataset is therefore an invaluable use of time to ensure the IDS gets the best results possible.

Finally, these results show that the classification algorithms performed better than the clustering algorithms experimented on in this project. This does not mean classification algorithms are better than clustering algorithms; rather, it shows that the KDD Cup 99 dataset works well with classification. Clustering algorithms also operate on data with no class labels, and this unsupervised type of machine learning is much more difficult and complex than classification. In the end, this project showed that there are benefits and drawbacks to most analytical approaches for machine learning algorithms, but the benefits outweigh the drawbacks; selecting the right machine learning algorithm
implementation with a suitable analytical filter can indeed enhance intrusion detection.

5.3 Future Work

This section discusses work that could not be developed or implemented, and changes that would be made in hindsight given the results of the project. The main functionality that could not be implemented was real-time/near real-time intrusion detection: the IDS simulator lacks the functionality needed to classify incoming network packets, which would be done by deploying a trained machine learning model to predict the type of each incoming packet. The packet sniffer implemented (Scapy) outputs the incoming packet's IP address rather than connection information; if real-time/near real-time intrusion detection is to be performed against the KDD Cup 99 dataset, incoming packets must be converted to connection-level data.

This leads to the downfalls of the KDD Cup 99 dataset itself. It is considered a poor dataset for real-world data, as it is over 15 years old and consequently outdated. If this project were undertaken again, a dataset collected over a few months on the target environment would be used, favouring rare attacks like R2L and U2R rather than the DOS attacks that dominate the KDD Cup 99 dataset. The dataset would also be skewed towards normal packets rather than being heavily attack-skewed like KDD Cup 99. The IDS simulator would better model how Intrusion Detection Systems are actually implemented if it used a clustering algorithm, as attacks these days are often unknown, so an anomaly-based approach would be more beneficial. Finally, the environment used in this project did not have the CPU processing power and RAM needed for complex algorithms classifying large datasets (as seen with the EM clustering algorithm), which meant some results could not be recorded.
If the experiment were run in an environment with more CPU processing power and RAM, more accurate results could be generated, possibly on a bigger dataset.
References

[1] SANS™ Institute, “What is a false positive and why are false positives a problem?”. [Online]. Available: https://www.sans.org/security-resources/idfaq/what-is-a-false-positive-and-why-are-false-positives-a-problem/2/8.
[2] Predictive Analytics Today, “What is Predictive Analytics”. [Online]. Available: http://www.predictiveanalyticstoday.com/what-is-predictive-analytics/.
[3] A. Smola and S. Vishwanathan, “Introduction to Machine Learning”.
[4] M. J. Zaki and W. Meira Jr., “Data Mining and Analysis: Fundamental Concepts and Algorithms,” 2014.
[5] A. A. Rao, P. Srinivas, B. Chakravarthy, K. Marx and P. Kiran, “A Java Based Network Intrusion Detection System (IDS)”.
[6] MIT Lincoln Labs, “KDD Cup 1999 Data,” Information and Computer Science, University of California, Irvine, 1999. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[7] M. K. Siddiqui and S. Naahid, “Analysis of KDD CUP 99 Dataset using Clustering based Data,” International Journal of Database Theory and Application, vol. 6, no. 5, pp. 24-25, 2013.
[8] M. Tavallaee, E. Bagheri, W. Lu and A. A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” 2009.
[9] H. Liu, F. Hussain, C. L. Tan and M. Dash, “Discretization: An Enabling Technique,” Data Mining and Knowledge Discovery, 2000.
[10] M. Walker, “Random Forests Algorithm,” 2013. [Online]. Available: http://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm.
[11] I. B. Mohamad and D. Usman, “Standardization and Its Effects on K-Means Clustering Algorithm,” p. 3300, 2013.
[12] N. Sharma, A. Bajpai and R. Litoriya, “Comparison the various clustering algorithms of weka tools,” International Journal of Emerging Technology and Advanced Engineering, vol. 2, no. 5, pp. 76-79, 2012.
[13] C. Piech and A. Ng, “K Means”. [Online]. Available: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html.
[14] M. B. and M. B., “An overview to Software Architecture in Intrusion Detection System,” International Journal of Soft Computing And Software Engineering (JSCSE), p. 4, 2011.
[15] D. Magers, “Packet Sniffing: An Integral Part of Network Defense,” 2002.
[16] M. Roesch. [Online]. Available: https://www.snort.org/.
[17] Penn State Berks, “Introduction to Snort”. [Online]. Available: http://istinfo.bk.psu.edu/labs/Snort.pdf.
[18] M. Richardson and B. Fenner. [Online]. Available: http://www.tcpdump.org/.
[19] P. Biondi. [Online]. Available: http://www.secdev.org/projects/scapy/.
[20] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, “The WEKA Data Mining Software: An Update,” vol. 11, 2009.
[21] P. Refaeilzadeh, L. Tang and H. Liu, “Cross-Validation,” Arizona State University, 2008.
[22] S. K. Patro and K. K. Sahu, “Normalization: A Preprocessing Stage”.

List of Figures

Figure 1: Bayes Theorem
Figure 2: IDS Software Architecture Overview [12]
Figure 3: Software Architecture Overview of IDS Simulator
Figure 4: Naive Bayes Model
Figure 5: Naive Bayes Confusion Matrix
Figure 6: J48 Model
Figure 7: J48 Confusion Matrix
Figure 8: Random Forest Model
Figure 9: Random Forest Confusion Matrix
Figure 10: Classification Algorithms Accuracy
Figure 11: Classification Algorithms False Positive Rate
Figure 12: Algorithms Classification Time
Figure 13: Naive Bayes with Discretization Model
Figure 14: Naive Bayes with Discretization Confusion Matrix
Figure 15: J48 Model with Discretization
Figure 16: J48 Confusion Matrix with Discretization
Figure 17: Random Forest Model with Discretization
Figure 18: Random Forest Confusion Matrix with Discretization
Figure 19: Classification Algorithms Accuracy using Discretization
Figure 20: Classification Algorithms False Positives Rate using Discretization
Figure 21: Algorithms Classification Time using Discretization
Figure 22: K-Means Model
Figure 23: K-Means Confusion Matrix
Figure 24: EM Model
Figure 25: EM Confusion Matrix
Figure 26: K-Means Model with Standardization
Figure 27: EM Model with Standardization
Figure 28: K-Means Model with Discretization
Figure 29: K-Means Confusion Matrix with Discretization
Figure 30: EM Model with Discretization
Figure 31: IDS Simulation GUI
Figure 32: Table showing the classification accuracy of Naive Bayes and K-Means algorithms with and without Discretization applied
Figure 33: Number of False Positives, Naive Bayes / K-Means Comparison