Ensemble of K-Nearest Neighbour Classifiers for Intrusion
Detection
Presented By
Imran Ahmed Malik
M.Tech CSE Networking Final Year
Sys ID 2014016942
Under the Guidance
of
Mrs. Amrita
Asst. Professor
SHARDA UNIVERSITY, GREATER NOIDA
Contents
• Objective
• Problem Statement
• Proposed system
• Introduction to the implemented algorithm
• Results and Graphs
• Conclusion
Objective
• Can a GP-based numeric classifier show better performance than
individual K-NN classifiers?
• Can a GP-based combination technique produce a higher-performance
OCC (optimal composite classifier) compared to its K-NN component
classifiers?
Problem Statement
OPTIMIZATION AND COMBINATION OF KNN CLASSIFIERS USING
GENETIC PROGRAMMING FOR INTRUSION DETECTION SYSTEM
Proposed Model
[Flowchart (Figure 3): Import KDD Dataset (KDD CUP 1999 data set) → Select Initial K-Nearest Neighbors (K-NN classifiers) → Optimization Possible? → Set GA Parameters → Generate initial random population → Evaluate fitness of each classifier → Parent selection for next generation → Crossover → Is optimization met? (Yes → End, No → iterate)]
Figure 3 shows the operations of a general genetic algorithm, according to which the GA is implemented in our system.
GP Based Learning Algorithm
Training Pseudo Code
 Stst, St: the test and training data, respectively
 C(x): class of instance x
 OCC: optimal composite classifier
 Ck: kth component classifier
 Ck(x): prediction of Ck for instance x
Train-Composite-Classifier (St, OCC)
Step 1: Give every input example x ∈ St to the K component
classifiers.
Step 2: Collect [C1(x), C2(x), …, Ck(x)] for all x ∈ St to form a set of
class predictions.
Step 3: Start the GP combining method, using the predictions as unary
functions in the GP tree. A threshold T is used as a variable to
compute the ROC curve.
GP Based Learning Algorithm (continued)
Pseudo Code for Classification
1. Apply the composite classifier OCC(x) to data examples x
taken from Stst.
2. Stack the component predictions X = [C1(x), C2(x), …, Ck(x)] to
form the new derived data.
3. Compute OCC(x) from X.
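As a rough sketch of the two pseudo-code routines above (in Python rather than the MATLAB toolkit used here, on toy data rather than KDD CUP 1999, and with a simple majority vote standing in for the evolved GP tree), the component K-NN classifiers and the stacked predictions amount to:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Predict the majority class among the k nearest training points."""
    d = np.linalg.norm(train_X - query, axis=1)
    nearest = train_y[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

def stack_predictions(classifiers, X):
    """Step 2: collect [C1(x), ..., Ck(x)] for every example x."""
    return np.array([[clf(x) for clf in classifiers] for x in X])

# Toy two-class data standing in for the (preprocessed) KDD records.
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_tr = np.array([0] * 20 + [1] * 20)

# Three K-NN component classifiers that differ only in k.
components = [lambda x, k=k: knn_predict(X_tr, y_tr, x, k) for k in (1, 3, 5)]

# Derived data on which the GP combiner would be trained; here a
# majority vote stands in for the evolved OCC tree.
derived = stack_predictions(components, X_tr)
occ = lambda preds: int(np.mean(preds) >= 0.5)
train_acc = np.mean([occ(row) == y for row, y in zip(derived, y_tr)])
```

In the actual system, GP would evolve the combining function applied to `derived` instead of the fixed vote used here.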
Working Of Genetic Programming
1. The algorithm begins by creating a random initial population.
2. The algorithm then creates a sequence of new populations. At each step, the
algorithm uses the individuals in the current generation to create the next
population. To create the new population, the algorithm performs the
following steps:
I. Scores each member of the current population by computing its fitness
value.
II. Scales the raw fitness scores to convert them into a more usable range of
values.
III. Selects members, called parents, based on their fitness.
IV. Some of the individuals in the current population that have the best
fitness values are chosen as elite. These elite individuals are passed to the next population.
V. Produces children from the parents. Children are produced either by making
random changes to a single parent—mutation—or by combining the vector
entries of a pair of parents—crossover.
VI. Replaces the current population with the children to form the next
generation.
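The six steps above can be sketched as a minimal genetic algorithm. This is an illustrative toy (bit-string maximisation in Python, with assumed population size, elitism count, and mutation rate), not the deck's actual classifier-optimisation GA:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(pop):
    """Toy fitness: the sum of each bit-string (higher is better)."""
    return pop.sum(axis=1)

def evolve(pop, n_elite=2, mut_rate=0.1):
    """One generation, following steps I-VI above."""
    f = fitness(pop)                                    # I.  score members
    probs = f - f.min() + 1e-9                          # II. scale scores
    probs = probs / probs.sum()
    n, d = pop.shape
    elite = pop[np.argsort(f)[-n_elite:]]               # IV. keep the best
    children = []
    while len(children) < n - n_elite:
        pa, pb = pop[rng.choice(n, 2, p=probs)]         # III. select parents
        cut = rng.integers(1, d)
        child = np.concatenate([pa[:cut], pb[cut:]])    # V.  crossover
        flip = rng.random(d) < mut_rate                 # V.  mutation
        child = np.where(flip, 1 - child, child)
        children.append(child)
    return np.vstack([elite, children])                 # VI. next generation

pop = rng.integers(0, 2, (20, 10)).astype(float)
for _ in range(30):
    pop = evolve(pop)
best = fitness(pop).max()
```

Because the elite individuals are copied unchanged, the best fitness in the population never decreases between generations.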
Dataset And Operations on Dataset
• KDD CUP 1999 dataset
• Remove Redundancy
• Conversion of values
• Normalization
• PCA
• Final Corrected data
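The preprocessing pipeline above can be sketched in Python. The specifics are assumptions (min-max normalisation, duplicate-row removal for redundancy, and covariance-based PCA); the deck does not state which variants were used:

```python
import numpy as np

def normalize(X):
    """Min-max normalisation of each feature to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def pca(X, n_components):
    """Project the data onto its top principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigh returns eigenvalues in ascending order; take the largest.
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]

# Toy records standing in for the converted KDD CUP 1999 features.
rng = np.random.default_rng(2)
records = rng.normal(size=(100, 5)) * np.array([1, 10, 100, 1, 1])
records = np.unique(records, axis=0)   # remove redundant (duplicate) rows
scaled = normalize(records)
reduced = pca(scaled, n_components=3)
```

Normalising before PCA keeps the large-range features (such as byte counts in KDD) from dominating the principal components.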
Tools Used
• Genetic Programming Toolkit
• Windows operating system
• 4 GB RAM
• Intel i5 processor
• MATLAB
RESULTS, GRAPHS AND ANALYSIS
Fitness Function
• Records: the number of records handled should be as large as possible
• Num_folds: the number of cross-validation folds should be as small as possible
• K_value: k should be as close as possible to the optimal k
• Time: training time should be as small as possible (it contributes negatively)
• Model: the strongest model is preferred
• Accuracy: the most accurate model is preferred
f = records + num_folds + K_value + Time + model + accuracy;
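The slide gives only the six-term sum, so the signs, scaling, and the assumed optimum `k_opt` in this Python sketch are illustrative assumptions about how the criteria might be combined:

```python
def fitness(records, num_folds, k_value, time_s, model_score, accuracy,
            k_opt=5):
    """Combine the six criteria into one score (higher is better).

    Signs are assumptions: maximised terms enter positively,
    minimised terms negatively, and k is penalised by its distance
    from an assumed optimum k_opt.
    """
    return (records                 # more records handled is better
            - num_folds             # fewer folds is better
            - abs(k_value - k_opt)  # k close to the assumed optimum
            - time_s                # less training time is better
            + model_score           # stronger model is better
            + accuracy)             # higher accuracy is better

f = fitness(records=100, num_folds=10, k_value=5, time_s=2.0,
            model_score=1.0, accuracy=0.95)
```

With these signs, a candidate that takes longer to train scores strictly lower, all else being equal.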
Current Best Individual
[Bar chart of the current best individual's variables: records, Num_folds, model, time, K_value, accuracy]
GP Stopping Criteria
GP Selection Function
Confusion Matrix For Normal Class
Confusion Matrix For DoS Class
Confusion Matrix For R2L Class
Confusion Matrix For U2R Class
Confusion Matrix For Probe Class
Confusion matrix
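The five slides above show per-class confusion matrices (the actual figures are not reproduced here); a minimal Python sketch of how such a matrix is computed from toy labels for the five KDD classes:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of class-i examples predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels; classes 0-4 stand for Normal, DoS, R2L, U2R, Probe.
y_true = np.array([0, 0, 1, 1, 2, 3, 4, 4])
y_pred = np.array([0, 1, 1, 1, 2, 3, 4, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=5)
```

The diagonal counts the correctly classified examples of each class; row sums give the true class totals.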
• Scatter Plot of src_bytes with Count for Class using KNN
• Scatter Plot of src_bytes versus dst_host_same_src_port_rate for Class using KNN
• ROC Curve
• ROC curve for the GP-based classifier showing 0.99976 area under the curve
• Classification Results using Ensemble of Classifiers
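The area under the ROC curve reported above can be computed by sweeping a threshold over the classifier's scores, as in Step 3 of the training pseudo-code. A minimal Python sketch on toy scores (ignoring tied scores, which the deck's data may contain):

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the trapezoidal rule over the threshold-swept ROC curve."""
    order = np.argsort(-scores)          # sweep threshold from high to low
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    fpr = np.r_[0.0, fpr]                # prepend the (0, 0) point
    tpr = np.r_[0.0, tpr]
    # Trapezoidal rule: strip widths times mean strip heights.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# A perfectly separating score gives AUC = 1.0.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
auc = roc_auc(scores, labels)
```

An AUC near 1, such as the 0.99976 reported for the GP-based classifier, means almost every intrusion example is scored above almost every normal example.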
Conclusion
• Ensembling increases the performance
• It reduces the error rates
• The GP-based ensemble provides better results than an individual
classifier
References
• Gianluigi Folino, Giandomenico Spezzano and Clara Pizzuti. Ensemble
Techniques for Parallel Genetic Programming Based Classifiers.
• Michał Woźniak, Manuel Graña, Emilio Corchado. 2014. A survey of
multiple classifier systems as hybrid systems. Elsevier.
• Urvesh Bhowan, Mark Johnston, Mengjie Zhang, and Xin Yao. June 2013.
Evolving Diverse Ensembles Using Genetic Programming for Classification
With Unbalanced Data. IEEE Transactions on Evolutionary Computation,
Vol. 17, No. 3.
• H. Nguyen, K. Franke, S. Petrovic. 2010. Improving Effectiveness of
Intrusion Detection by Correlation Feature Selection. 2010 International
Conference on Availability, Reliability and Security, IEEE.
• Shelly Xiaonan Wu, Wolfgang Banzhaf. 2010. The use of computational
intelligence in intrusion detection systems: A review. Applied Soft
Computing 10, 1-35.
• Ahmad Taher Azar, Hanaa Ismail Elshazly, Aboul Ella Hassanien, Abeer
Mohamed Elkorany. 2013. A random forest classifier for lymph diseases.
Computer Methods and Programs in Biomedicine.
Thank You
