NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique

An Experiment and Evaluation Using Decision Tree
Under the Guidance of Dr. Kalpana Thakre
National Conference on Recent Trends and Advances in Computing, Communication and Security
Presented by Sujeet Raosaheb Suryawanshi
ME IT SEM III; Roll No. 613012

Agenda
 Anomaly Detection
 Machine Learning
 IDPS
 Survey of algorithms
 Decision Tree
 Experiment with NSL KDD Cup 99
 Result
 Future research roadmap

Anomaly Detection
 Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) are
used to protect trusted networks from untrusted networks
 One such threat is the Denial of Service (DoS) attack
 Approaches to detecting DoS attacks
1. Signature based
2. Anomaly based
 Signature-based detection deals with a limited, fixed set of known threats
 Anomaly-based detection centres on the concept of a baseline for
network behaviour; any deviation from this baseline is considered
an anomaly.

Machine Learning
 A scientific discipline concerned with the design and
development of algorithms that allow computers to learn
from data. A major focus of machine learning research is
to automatically learn to recognize complex patterns.
 This is loosely analogous to human learning: people make
decisions based on what they have learned or experienced.

Motivation & Objective
 To understand techniques available to support the vision
envisaged for “Anomaly Detection using Machine Learning
Technique”
 To experiment and evaluate NSL KDD Cup 99 dataset using
Decision Tree Classifier
 To understand various anomaly detection and machine
learning techniques
 To identify requirements for building a platform for an anomaly
detection system

Classification of IDPS
Intrusion Detection System
  Data collection techniques
    HIDS
    NIDS
  Data analysis techniques
    Specification based
    Anomaly based
      Nearest neighbor based
      Clustering based
        K-Means
      Statistical based
      Classification based
        SVM
        Fuzzy Logic
        Genetic Algo
        Decision Tree
        Naive Bayesian
        Neural Network
        Others
    Signature based

TECHNIQUES
Nearest neighbor based detection techniques
 Assumption: normal data instances are present in dense neighbourhoods;
anomalies occur far from their closest neighbours
 Advantages: unsupervised/semi-supervised mode; simplest approach
 Disadvantages: high computational cost in the testing phase; difficult where
several regions have widely differing densities; difficult to identify
anomalies that are present in groups; dependent on the proximity measure used
Clustering-based anomaly detection techniques
 Assumption: normal data instances belong to a cluster in the data, lie close
to their closest cluster centroid, and belong to large and dense clusters;
anomalies do not belong to any cluster, are far away from their closest
cluster centroid, or form clusters that are too small or too sparse
 Advantages: unsupervised; fast comparison
 Disadvantages: high computation cost in the cluster formation phase; a data
object not belonging to any cluster may be noise rather than an anomaly; not
suited for large datasets; fails to label anomalies in certain cases
Statistical techniques
 Assumption: normal data instances occur in high probability regions of a
stochastic model; anomalies occur in the low probability regions of the
stochastic model
 Advantages: unsupervised and simple; a confidence interval is provided with
the anomaly score
 Disadvantages: fails to label the anomalies correctly in certain cases;
difficult to find the best statistic; for multivariate data it fails to
capture the interactions between different …
Classification techniques
 Assumption: a classifier that can distinguish between normal and anomalous
classes can be learnt in the given feature space
 Advantages: fast testing phase; improved efficiency with ensemble methods
 Disadvantages: heavy dependency and reliance on training data; class
imbalance problem

Decision Tree
 Technique: Classification
 Computation cost: High
 High dimensional data: Yes
 Advantages: easy to understand for smaller trees; handles irrelevant and
missing data; compact after pruning
 Disadvantages: fails to classify scattered data; uses a greedy algorithm,
hence may not find the best tree
SVM
 Technique: Classification & Regression
 Computation cost: High
 High dimensional data: Yes
 Advantages: high detection accuracy; learning ability for a small set of
samples; high training rate and decision rate, insensitiveness to the
dimension of input data
 Disadvantages: positive and negative examples required; high dependency on
selecting a good kernel function; training takes a long time
Naive Bayes
 Technique: Classification
 Computation cost: Less
 High dimensional data: Yes
 Advantages: easy construction; takes short computation time; works
efficiently with large datasets
 Disadvantages: difficult to handle continuous features; highly dependent on
prior knowledge
ANN
 Technique: Classification
 Computation cost: -
 High dimensional data: Yes
 Advantages: ability to generalize from limited, noisy and incomplete data;
ease of use; detects unknown intrusions; supports multiclass detection
 Disadvantages: training required; needs to be emulated; longer training
process; over-fitting issue
Fuzzy Logic
 Technique: Classification
 Computation cost: High
 High dimensional data: -
 Advantages: permits a data point to be in more than one cluster; it has a
more natural representation of the behavior of genes; it is effective,
especially against port scans and probes
 Disadvantages: need to determine the membership cutoff value; clusters are
sensitive to the initial assignment of centroids
GA
 Technique: Classification
 Computation cost: -
 High dimensional data: -
 Advantages: derives the best classification rules; selects optimal parameters
 Disadvantages: can’t assure constant optimization response times;
over-fitting issue
K-Means
 Technique: Clustering
 Computation cost: -
 High dimensional data: -
 Advantages: simple to use
 Disadvantages: necessity of specifying k; sensitive to noise; clusters are
sensitive to the initial assignment of centroids

Decision Tree Classifier
Algorithm: Decision tree
1. Split(node, {examples}):
2. A ← the best attribute for splitting the {examples}
3. Decision attribute for this node ← A
4. For each value of A, create a new child node
5. Split the training {examples} among the child nodes
6. For each child node/subset:
If the subset is pure: STOP
Else: Split(child node, {subset})
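The recursive procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiment: it assumes categorical attributes, picks the best attribute by lowest weighted entropy, and falls back to a majority vote when no attributes remain. The function names and the four toy rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return the attribute whose split leaves the lowest weighted entropy."""
    def weighted_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=weighted_entropy)

def split(rows, labels, attributes):
    """Build the tree recursively: leaves are labels, inner nodes are dicts."""
    if len(set(labels)) == 1:            # subset is pure: STOP
        return labels[0]
    if not attributes:                   # assumption: majority vote when stuck
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {"attribute": a, "children": {}}
    remaining = [x for x in attributes if x != a]
    for value in sorted({row[a] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[a] == value]
        node["children"][value] = split(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return node

def predict(tree, row):
    """Walk from the root to a leaf following the row's attribute values."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Hypothetical toy data: two categorical attributes and a Y/N label.
rows = [
    {"employed": "Y", "interned": "N"},
    {"employed": "N", "interned": "Y"},
    {"employed": "N", "interned": "N"},
    {"employed": "Y", "interned": "Y"},
]
labels = ["Y", "Y", "N", "Y"]
tree = split(rows, labels, ["employed", "interned"])
print(predict(tree, {"employed": "N", "interned": "N"}))
```

Because each split is chosen greedily, the resulting tree is not guaranteed to be the smallest possible one.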

Entropy
 For selecting best attribute:
 At each step, find the attribute that can be used to partition the
dataset to minimise the entropy of the data
 A completely homogeneous sample has entropy of 0.
 An equally divided sample has entropy of 1.
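 For a sample S with a proportion $p_{+}$ of positive and $p_{-}$ of negative elements:
$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$
 The general formula for entropy over $N$ classes with proportions $p_i$ is:
$\mathrm{Entropy}(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$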
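As a worked illustration of the two-class entropy, consider a sample with 9 positive and 4 negative elements (counts chosen here for illustration), together with the two boundary cases named above. The convention $0 \log_2 0 = 0$ is assumed for the pure sample.

```latex
\begin{align*}
\text{9 positive, 4 negative:}\quad
  & -\tfrac{9}{13}\log_2\tfrac{9}{13} - \tfrac{4}{13}\log_2\tfrac{4}{13}
    \approx 0.367 + 0.523 = 0.890 \\
\text{equally divided:}\quad
  & -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \\
\text{completely homogeneous:}\quad
  & -1\cdot\log_2 1 - 0\cdot\log_2 0 = 0
\end{align*}
```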

Decision Tree – Sample Dataset
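Years Experience | Employed? | Previous employers | Level of Education | Top-tier school | Interned | Hired
10               | Y         | 4                  | BS                 | N               | N        | Y
0                | N         | 0                  | BS                 | Y               | Y        | Y
7                | N         | 6                  | BS                 | N               | N        | N
2                | Y         | 1                  | MS                 | Y               | N        | Y
20               | N         | 2                  | PhD                | Y               | N        | N
0                | N         | 0                  | PhD                | Y               | Y        | Y
5                | Y         | 2                  | MS                 | N               | Y        | Y
3                | N         | 1                  | BS                 | N               | Y        | Y
15               | Y         | 5                  | BS                 | N               | N        | Y
0                | N         | 0                  | BS                 | N               | N        | N
1                | N         | 1                  | PhD                | Y               | N        | N
4                | Y         | 1                  | BS                 | N               | Y        | Y
0                | N         | 0                  | PhD                | Y               | N        | Y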

Decision Tree – Sample Dataset Explained
[Figure: decision tree for the sample dataset, with labels 1 to 5]

Steps for creating and evaluating Model
 1) Import data
 2) Edit Metadata
 3) Convert Indicator Values
 4) Select Columns in dataset
 5) Feature selection
 6) “Decision Tree” on separate partitions
 7) Score Model by adding scored labels and scored probabilities
 10) Evaluate model using Precision, Recall and False positive rate
 11) Compare performance and conclude which model to use
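Two of the preparation steps listed above, converting values to indicators and partitioning the data, can be sketched in plain Python. This is only an illustration under stated assumptions: the five records are made up, `protocol_type` and `src_bytes` are used as example column names, and the helper names `one_hot`, `to_indicator` and `split_60_40` are hypothetical. The 60/40 ratio follows the split reported in the results.

```python
import random

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = int(row[column] == value)
        encoded.append(new_row)
    return encoded

def to_indicator(rows, column, positive):
    """Replace a two-valued column with 1 for the positive value, else 0."""
    return [{**row, column: int(row[column] == positive)} for row in rows]

def split_60_40(rows, seed=42):
    """Shuffle reproducibly, then cut into 60% model building and 40% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]

# Made-up records for illustration only (not taken from the dataset).
rows = [
    {"protocol_type": "tcp", "src_bytes": 491, "class": "normal"},
    {"protocol_type": "udp", "src_bytes": 146, "class": "normal"},
    {"protocol_type": "tcp", "src_bytes": 0, "class": "anomaly"},
    {"protocol_type": "icmp", "src_bytes": 1032, "class": "anomaly"},
    {"protocol_type": "tcp", "src_bytes": 232, "class": "normal"},
]

prepared = to_indicator(one_hot(rows, "protocol_type"), "class", "anomaly")
build_set, test_set = split_60_40(prepared)
print(len(build_set), len(test_set))
```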

Activity Diagram
[Diagram: the flow below is shown twice, once for model creation and tuning and once for model testing]
 Import Data: read training set
 Convert to Indicator Values: replace Class column with indicator values
 Select Columns in Dataset: remove difficulty level column along with other unnecessary columns
 Feature Selection: select 15 most important features
 Two-Class Decision Tree
 Tune Model
 Score Model
 Evaluate Model: generate and compare scores
 Generate table that summarises result

Results
 Total Records = ~1.25 lakh (125973)
 Model Building = ~75K (60%)
 Test Model = ~50K (40%)
Precision (positive predictive value) = TP/(TP + FP)
Recall (true positive rate) = TP/(TP + FN)
False positive rate (FPR, fall-out, probability of false alarm) = FP/(FP + TN), i.e. FP over total negatives
              | All Features                    | Selected Features
Depth of tree | Precision | Recall   | FPR      | Precision | Recall   | FPR
5             | 0.986469  | 0.986458 | 0.014073 | 0.969968  | 0.969788 | 0.029288
10            | 0.996714  | 0.996713 | 0.003258 | 0.98458   | 0.984557 | 0.01519
15            | 0.998297  | 0.998297 | 0.00173  | 0.98616   | 0.986121 | 0.01346
20            | 0.998258  | 0.998258 | 0.001764 | 0.986866  | 0.986814 | 0.012658
25            | 0.998258  | 0.998258 | 0.001764 | 0.98705   | 0.986992 | 0.012443

Future research roadmap
 Work with other algorithms - Random Forest, SVM, K-Means,
Logistic Regression and observe if ensemble methodology can
further enhance the model
 Build real time anomaly detection using the same approach
and methodology


System Gini = 0.21, Gini(Employed) = 0.15, Gini(Interned) = 0.15
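These Gini values can be reproduced from the sample hiring dataset with a short script. The sketch below assumes the two-class impurity is computed as p(1 - p), where p is the fraction of hired candidates, since that convention matches the figures above (the more common 1 - Σ p_i² form gives exactly twice these values). The function names are hypothetical.

```python
# (Employed?, Interned, Hired) columns of the 13 sample records.
records = [
    ("Y", "N", "Y"), ("N", "Y", "Y"), ("N", "N", "N"), ("Y", "N", "Y"),
    ("N", "N", "N"), ("N", "Y", "Y"), ("Y", "Y", "Y"), ("N", "Y", "Y"),
    ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"), ("Y", "Y", "Y"),
    ("N", "N", "Y"),
]

def gini(labels):
    """Two-class impurity p * (1 - p); assumed convention, see the note above."""
    p = labels.count("Y") / len(labels)
    return p * (1 - p)

def gini_after_split(records, column):
    """Weighted impurity of the Hired label after splitting on one column."""
    total = 0.0
    for value in sorted({r[column] for r in records}):
        subset = [r[-1] for r in records if r[column] == value]
        total += len(subset) / len(records) * gini(subset)
    return total

system_gini = gini([r[-1] for r in records])
gini_employed = gini_after_split(records, 0)
gini_interned = gini_after_split(records, 1)
print(round(system_gini, 2), round(gini_employed, 2), round(gini_interned, 2))
```

Both attributes reduce the impurity by the same amount here, because each isolates five candidates who were all hired and leaves eight candidates split evenly.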

Results
 Total Records = ~1.25 Lacs (125973)
 Model Building = ~75K (60%)
 Test Model = ~50K (40%)
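Confusion matrix, all features (rows: actual, columns: predicted)
        | Anomaly    | Normal             | Total
Anomaly | 26887 (TP) | 111 (FN) (Type II) | 26998
Normal  | 554 (FP)   | 22957 (TN)         | 23511
Total   | 27441      | 23068              | 50509
Accuracy = (TP + TN)/Total = 49844/50509 = 0.9868
Precision = TP/(TP + FP) = 26887/(26887 + 554) = 0.9798
Confusion matrix, selected features (rows: actual, columns: predicted)
        | Anomaly    | Normal             | Total
Anomaly | 26366 (TP) | 632 (FN) (Type II) | 26998
Normal  | 698 (FP)   | 22813 (TN)         | 23511
Total   | 27064      | 23445              | 50509
Accuracy = (TP + TN)/Total = 49179/50509 = 0.9737
Precision = TP/(TP + FP) = 26366/(26366 + 698) = 0.9742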
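The metrics can be recomputed directly from the four confusion-matrix counts. The following is a small sketch using the counts reported above; the function name `metrics` and the dictionary keys are illustrative choices, not part of the original experiment.

```python
def metrics(tp, fn, fp, tn):
    """Derive the evaluation metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # positive predictive value
        "recall": tp / (tp + fn),               # true positive rate
        "false_positive_rate": fp / (fp + tn),  # fall-out
    }

# Counts taken from the two confusion matrices on this slide.
all_features = metrics(tp=26887, fn=111, fp=554, tn=22957)
selected_features = metrics(tp=26366, fn=632, fp=698, tn=22813)

for name, result in (("all", all_features), ("selected", selected_features)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```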

Results
Description                                      | Precision | Recall | Area Under ROC
1. Decision Tree, Full Data Accuracy             | 0.9951    | 0.9764 | 98.62%
2. Decision Tree, Selected Feature Data Accuracy | 0.9730    | 0.9703 | 97.35%
• Precision (Positive Predictive Value): PPV = TP/(TP + FP)
• Recall (True Positive Rate): TPR = TP/(TP + FN)
• Area under ROC: the area under the Receiver Operating Characteristic (ROC)
curve, which plots the true positive rate (TPR, or sensitivity) against the
false positive rate (FPR, or 1 - specificity).

Output of Anomaly Detection
 Scores
 Labels

Decision Trees
 Supervised technique
 Entropy
 A measure of a dataset’s disorder: how similar or different its elements are
 If we classify the dataset into N different classes
 0 = all elements belong to the same class
 1 = elements are evenly split (two-class case)
 At each step, find the attribute that can be used to partition the data set to
minimise the entropy of the data
 A completely homogeneous sample has entropy of 0.
 An equally divided sample has entropy of 1.
 For a sample S with a proportion $p_{+}$ of positive and $p_{-}$ of negative elements:
$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$
 The general formula for entropy over $N$ classes with proportions $p_i$ is:
$\mathrm{Entropy}(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$
 Greedy algorithm is used
 Demo : Refer Excel

Support Vector Machines
 Supervised technique
 Works well for classifying higher-dimensional data
 Finds the hyperplane that divides the data, defined by the support vectors
(the training points closest to it)
 Kernels can be used to represent data in higher dimensional spaces to find
hyperplanes that might not be apparent in lower dimensions
 Types:
 Linear
 Polynomial (Curves)
 RBF
 Kernel functions take a low-dimensional input space and transform it into a
higher-dimensional space, i.e. they convert a non-separable problem into a separable one.
 Useful in non-linear separation problems. Simply put, the kernel performs a complex
data transformation, and the SVM then finds how to separate the data based on the labels
or outputs you’ve defined.
 Computationally expensive
 Plot each data item as a point in n-dimensional space (where n is the number of
features you have) with the value of each feature being the value of a particular
coordinate.
 Perform classification by finding the hyperplane that differentiates the two classes
 Use a train/test split to select the model
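The idea of making a non-separable problem separable by moving to a higher-dimensional space can be shown with a tiny example. This sketch does not train an SVM; it only illustrates the transformation step, using a hypothetical explicit feature map x → (x, x²) on made-up one-dimensional points and a hand-picked threshold.

```python
# Made-up 1-D points: class 1 lies on both sides of class 0.
samples = [(-2.0, 1), (-1.0, 0), (-0.5, 0), (0.5, 0), (1.0, 0), (2.0, 1)]

def feature_map(x):
    """Hypothetical explicit map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

def separable_by_threshold(points):
    """True if a single threshold on the value separates the two classes."""
    ordered = [label for _, label in sorted(points)]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1  # labels may switch at most once along the axis

def classify(x, threshold=2.5):
    """Linear rule in the mapped space: the hyperplane x^2 = threshold."""
    _, x_squared = feature_map(x)
    return 1 if x_squared > threshold else 0

original_ok = separable_by_threshold(samples)
mapped_ok = separable_by_threshold([(feature_map(x)[1], y) for x, y in samples])
print(original_ok, mapped_ok)
```

In the original space no threshold works because the labels switch twice along the axis; along the added x² coordinate they switch only once, so a straight line separates them.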

Support Vector Machines
 ADV:
 Works well when clear
separation exists
 It uses a subset of training
points in the decision function
(called support vectors), so it is
also memory efficient.
 Works well for high dimensional
data
 DISADV
 It doesn’t perform well
 when the data set is large,
because the required training time
is higher
 when the data set is noisy,
i.e. target classes overlap
 SVM doesn’t directly provide
probability estimates; these are
calculated using an expensive
five-fold cross-validation.
 Noise may create issues

Naïve Bayes
 Classification technique based on Bayes Theorem
 Bayes Theorem
 P(A|B)=P(A)P(B|A)/P(B)
 Computationally efficient compared to decision trees
 A Naïve Bayesian network can be represented as a directed acyclic graph (DAG):
 Each node represents an attribute
 Each link represents the influence of one node on another
 Combine the per-attribute probabilities and predict according to a threshold.
 Demo: Spam Classifier
 P(spam|free)=P(spam)P(free|spam) / P(free)
 (Probability that a message is spam and contains the word ‘free’) / (overall
probability that a message contains the word ‘free’)
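The spam calculation can be worked through numerically. The probabilities below are invented for illustration only (they are not from the demo): 40% of messages are spam, ‘free’ appears in 50% of spam messages and in 5% of non-spam messages.

```python
# Assumed illustrative probabilities (not measured values).
p_spam = 0.4
p_free_given_spam = 0.5
p_free_given_ham = 0.05

# Total probability of seeing the word 'free' in any message.
p_free = p_spam * p_free_given_spam + (1 - p_spam) * p_free_given_ham

# Bayes' theorem: P(spam | free) = P(spam) * P(free | spam) / P(free)
p_spam_given_free = p_spam * p_free_given_spam / p_free

# A simple threshold turns the probability into a predicted label.
predicted = "spam" if p_spam_given_free > 0.5 else "not spam"
print(round(p_spam_given_free, 3), predicted)
```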

Naïve Bayes
 ADV:
 Construction is easy and also
takes short computation time;
 It can be applied to large
datasets since it does not
involve complicated
parameters;
 Interpretation of knowledge
representation; &
 Encodes probabilistic
relationships among the
variables of interest. Ability to
incorporate both Prior
knowledge and data.
 DISADV
 Harder to handle continuous
features. May not contain any
good classifiers if prior
knowledge is wrong.

K-Means Clustering
 Iterative clustering technique that splits the data into K
groups, each closest to one of K centroids
 Unsupervised learning based on the position of each element
 Can uncover interesting groupings
 Randomly pick K centroids
 Assign each data point to its closest centroid
 Recompute the centroids as the average position of their assigned points
 Iterate until points stop changing assignment
 To predict the cluster for a new data point, check which centroid it is
closest to
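The iteration described above can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance, a seeded random choice of initial centroids from the data points, and six made-up 2-D points; the names `kmeans` and `predict` are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) after iterating until assignments settle."""
    centroids = random.Random(seed).sample(points, k)  # randomly pick K centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # points stopped changing: done
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignment

def predict(centroids, point):
    """Cluster index of the centroid closest to a new point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

# Made-up points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels, predict(centroids, (9, 9)))
```

The cluster indices themselves carry no meaning, so labelling a cluster as normal or anomalous still has to be done separately.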

K-Means Clustering
 ADV:
 Less complex
 DISADV
 Choosing right value of K
 Labelling of cluster to be
done manually
 Sensitive to noise

Nature of Input Data
 binary, categorical or continuous.
 Univariate/multivariate
 Nature of attributes determines the applicability of anomaly
detection techniques
 E.g., Statistical techniques to be used for continuous and categorical
data.

Data Labels
 Based on the extent to which the labels are available, anomaly
detection techniques can operate in one of the following three
modes:
 Supervised
 Semi-Supervised
 Unsupervised

Types of Anomalies
 Point
 Contextual
 Collective

Challenges
 Defining a normal region
 Anomalous observations can appear normal
 Notion of an anomaly
 Availability of labeled data
 Noise

Key components
Research Areas: Machine Learning, Data Mining, Information Theory, Spectral Theory, …
Anomaly Detection Technique
Problem Characteristics: Nature of Data, Labels, Anomaly Type, Output, …
Application Domains: Intrusion Detection, Fraud Detection, …

Methodology
Monitored Environment
Parameterisation
Training
Model
Detection
Intrusion Reporting

Architecture
 Source of input data (real-time stream)
 Real-time stream processing engine: Apache Spark Streaming and MLlib, with the anomaly detection model
 Detected anomalies (outliers), as a real-time stream
