Synthesis of an intrusion detection algorithm based on deep learning and reassignment of training labels
1. UNIBA: http://www.uniba.it DIB: http://www.di.uniba.it KDDE: http://kdde.di.uniba.it
SYNTHESIS OF AN INTRUSION DETECTION
ALGORITHM BASED ON DEEP LEARNING AND
REASSIGNMENT OF TRAINING LABELS
Advisor
Prof.ssa Annalisa Appice
Co-Advisor
Dott.ssa Giuseppina Andresini
Department of Computer Science, University of Bari Aldo Moro
Student
Francesco Paolo Caforio
Via Orabona, 4 - 70125 Bari - Italy
2. Motivations
o Today's computer systems are complex and prone to vulnerabilities
o New types of attacks are designed to evade sophisticated prevention and detection mechanisms
o Attackers design new attacks whose behavior is as similar as possible to normal network traffic
3. Thesis Objective
o Synthesis of an Intrusion Detection System
o Data segmentation
o Identifying examples on the segment boundary and changing
their labels
o Deep learning
4. CD-IDS
5. CD-IDS: Pre-processing
o One-hot encoding: transforms each categorical attribute into a vector of numeric attributes
o Scaling: rescales each attribute to mean 0 and standard deviation 1
o Missing value replacement: removes undefined values
  o Null → 0
  o Infinity → Max
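The three pre-processing steps can be illustrated with a minimal sketch in plain Python. The function names are hypothetical, the column-wise list layout is an assumption, and treating "Max" as the largest finite value of the same attribute is also an assumption; the thesis code may differ.

```python
import math

def replace_missing(column):
    """Replace None/NaN with 0 and infinite values with the column's largest finite value."""
    finite = [v for v in column if v is not None and math.isfinite(v)]
    col_max = max(finite) if finite else 0.0  # assumption: "Max" = largest finite value
    cleaned = []
    for v in column:
        if v is None or math.isnan(v):
            cleaned.append(0.0)
        elif math.isinf(v):
            cleaned.append(float(col_max))
        else:
            cleaned.append(float(v))
    return cleaned

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - mean) / std for v in column]

def one_hot(column):
    """Map each categorical value to a binary indicator vector (categories sorted)."""
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]

# Illustrative values only, not taken from any of the datasets.
protocol = one_hot(["tcp", "udp", "tcp"])
duration = standardize(replace_missing([1.0, None, float("inf"), 3.0]))
```

Missing values are replaced before scaling here so that the mean and standard deviation are computed on finite numbers only.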
6. CD-IDS: Dimensionality reduction
o Multi-level autoencoder (2+1+2 layers)
o Encoder + Decoder
o Central layer with 10 neurons
Figure: autoencoder, from input data to reconstructed data
7. CD-IDS: Segmentation
o Training a Support Vector Machine
o Estimation of the reliability (distance to support vectors) of classifying an example into a class (normal/attack)
o Segment creation (normal vs attack)
  o If c_0(x) > 0.50, x ∈ C_attack
  o Otherwise x ∈ C_normal
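The segment-creation rule can be sketched as follows. The function name `segment` is hypothetical, and the sketch assumes the SVM stage has already produced, for each example, a confidence c_0(x) in [0, 1] of belonging to the attack class; how that confidence is derived from the SVM is not shown here.

```python
def segment(confidences, threshold=0.50):
    """Split example indices into (attack_segment, normal_segment).

    An example goes to the attack segment when its confidence is strictly
    greater than the threshold, otherwise to the normal segment.
    """
    attack_segment, normal_segment = [], []
    for index, c0 in enumerate(confidences):
        if c0 > threshold:
            attack_segment.append(index)
        else:
            normal_segment.append(index)
    return attack_segment, normal_segment

# Illustrative confidences only.
attack_idx, normal_idx = segment([0.90, 0.50, 0.10, 0.51])
```

Note that an example with confidence exactly 0.50 falls in the normal segment, because the rule uses a strict inequality.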
8. CD-IDS: Class Reassignment
o Identification of the examples least reliably assigned to class C_normal and change of their class
Figure: NSL-KDD, training set
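One possible reading of the re-labeling step is sketched below. It assumes that "least reliably assigned" means the examples in the normal segment whose attack confidence is closest to the 0.50 threshold, that re-labeled examples receive the attack label, and that the number of examples to re-label is a parameter. The function name and the string labels are hypothetical.

```python
def reassign_labels(labels, confidences, n_relabel, threshold=0.50):
    """Return a copy of labels where the n_relabel least reliable examples
    of the normal segment are re-labeled as "attack"."""
    # Normal segment: examples whose attack confidence does not exceed the threshold.
    normal_segment = [i for i, c in enumerate(confidences) if c <= threshold]
    # Least reliable first: smallest gap between the threshold and the confidence.
    ranked = sorted(normal_segment, key=lambda i: threshold - confidences[i])
    new_labels = list(labels)
    for i in ranked[:n_relabel]:
        new_labels[i] = "attack"
    return new_labels

# Illustrative values only.
labels = ["normal", "normal", "normal", "attack"]
confidences = [0.05, 0.45, 0.30, 0.95]
relabeled = reassign_labels(labels, confidences, n_relabel=1)
```

In this toy input only the second example (confidence 0.45) is re-labeled, since it lies closest to the boundary between the two segments.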
10. CD-IDS: CNN
o A Convolutional Neural Network learns a classifier from labeled data in the form of images (matrices)
o Each connection is represented as a 3 × 10 grayscale image + class (normal/attack)
Figure labels: training example; closest training normal example; closest training attack example
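The figure labels suggest that each 3 × 10 image stacks an example's 10-value encoding with its closest normal and closest attack training examples. A minimal sketch under that reading follows; the row order, the use of Euclidean distance, and all function names are assumptions, and the conversion of values to grayscale intensities is left out.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest(example, candidates):
    """Return the candidate vector nearest to the example."""
    return min(candidates, key=lambda c: euclidean(example, c))

def build_image(example, normal_train, attack_train):
    """Stack the example with its nearest normal and nearest attack neighbours.

    With 10-dimensional encodings the result is a 3 x 10 matrix.
    """
    return [
        list(example),
        list(closest(example, normal_train)),
        list(closest(example, attack_train)),
    ]

# Illustrative 10-dimensional encodings only.
x = [0.0] * 10
normal_train = [[0.1] * 10, [1.0] * 10]
attack_train = [[0.9] * 10, [0.4] * 10]
image = build_image(x, normal_train, attack_train)
```

When the example itself belongs to one of the training sets, it would have to be excluded from the candidates to avoid matching itself; the sketch does not handle that case.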
12. Empirical evaluation: Data

Dataset          Training set   Training set   Training set   Testing set   Testing set   Testing set
                 (Normal)       (Attack)       (Total)        (Normal)      (Attack)      (Total)
NSL-KDDTest+     67343          58630          125973         9711          12833         22544
NSL-KDDTest-21   67343          58630          125973         2152          9698          11850
UNSW-NB15        56000          119341         175341         37000         45332         82332
CICIDS2017       80000          20000          100000         80000         20000         100000

o Datasets
  o NSL-KDD
  o UNSW-NB15
  o CICIDS2017
    o Dataset organized in 10 folders
    o Every folder contains one training set and nine different testing sets
    o Each file contains 100,000 examples (80,000 normal and 20,000 attacks)
13. Empirical evaluation: Autoencoder
o Number of neurons per layer

Layer (AE)   NSL-KDD   UNSW-NB15   CICIDS2017
1            80        100         55
2            30        40          30
3            10        10          10
4            30        40          30
5            80        100         55
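To give a sense of model size, the layer widths listed for the autoencoder can be turned into a parameter count. This is a rough sketch that assumes fully connected layers with biases and an output layer as wide as the input; the input dimension of 120 is a hypothetical placeholder, not a figure from the thesis.

```python
# Hidden-layer widths as listed for each dataset.
HIDDEN_LAYERS = {
    "NSL-KDD": [80, 30, 10, 30, 80],
    "UNSW-NB15": [100, 40, 10, 40, 100],
    "CICIDS2017": [55, 30, 10, 30, 55],
}

def dense_parameter_count(input_dim, hidden):
    """Count weights and biases of a dense autoencoder input -> hidden -> input."""
    sizes = [input_dim] + list(hidden) + [input_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Hypothetical input dimension, chosen only for illustration.
n_params = dense_parameter_count(120, HIDDEN_LAYERS["NSL-KDD"])
```

All three configurations are symmetric around the 10-neuron central layer, so the encoder and decoder mirror each other.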
14. Empirical evaluation: Segmentation
o Which algorithm best separates normal examples from attacks?
  o Fuzzy C-Means (F C-M) - unsupervised
  o Gaussian Mixture Model (GMM) - unsupervised
  o Support Vector Machine (SVM) - supervised
o Purity Index
  o $\Omega = \{w_1, w_2, \dots, w_K\}$ - set of clusters
  o $C = \{c_1, c_2, \dots, c_J\}$ - set of classes
  o $N$ - total number of examples

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert w_k \cap c_j \rvert$$
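The purity index can be computed directly from cluster assignments and class labels. The sketch below is a plain-Python illustration; the function name and the toy data are not from the thesis.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Purity: for each cluster, count its most frequent class, sum, divide by N."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)

# Toy example: two clusters of three examples, each with one misplaced example.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["normal", "normal", "attack", "attack", "attack", "normal"]
score = purity(clusters, classes)  # (2 + 2) / 6
```

A purity of 1 means every cluster contains examples of a single class; lower values indicate clusters that mix normal examples and attacks.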
15. Empirical evaluation: Segmentation (NSL-KDD)
o Purity index as the segmentation algorithm varies

Method                    Well-clustered examples   Training set examples   Purity index
Fuzzy C-Means             111966                    125973                  0.8888
Gaussian Mixture Model    110672                    125973                  0.8785
Support Vector Machine    123582                    125973                  0.9810
16. Empirical evaluation: Classification
o Classifier training after the re-labeling step as the number of re-labeled examples varies
  o Evaluation of accuracy on the training set to see if there is a threshold
  o Evaluation of accuracy on the testing set
o Metrics
  o True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)
  o Overall Accuracy (OA)
  o Precision (P)
  o Recall (R)
  o F-Measure (F1-Score)
  o True Positive Rate (TPR)
  o False Positive Rate (FPR)
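The listed metrics follow from the four confusion-matrix counts using their standard definitions. The sketch below assumes that "attack" is the positive class and that no denominator is zero; the function name and the counts are illustrative only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute OA, precision, recall (= TPR), F1 and FPR from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall and true positive rate coincide
    return {
        "OA": (tp + tn) / (tp + fp + tn + fn),
        "P": precision,
        "R": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "TPR": recall,
        "FPR": fp / (fp + tn),
    }

# Illustrative counts only.
m = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

A good detector combines a high TPR (few missed attacks) with a low FPR (few false alarms on normal traffic).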
17. Empirical evaluation: Classification (TP, training set)

Dataset          Configuration   OA       F1-Score   FPR      TPR
NSL-KDDTest+     4500            0.9128   0.9271     0.1683   0.9741
NSL-KDDTest-21   8500            0.8716   0.9256     0.5999   0.9761
UNSW-NB15        500             0.9163   0.9411     0.2229   0.9817
CICIDS2017       100             0.9813   0.9523     0.0071   0.9349
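The rates in this table and the test-set sizes from the Data slide are linked by simple arithmetic, which gives a quick consistency check. The sketch below reads 0.1683 as the false positive rate and 0.9741 as the true positive rate of the NSL-KDDTest+ row, takes 9711 normal and 12833 attack test examples, and treats attack as the positive class; the function name is hypothetical.

```python
def scores_from_rates(n_normal, n_attack, fpr, tpr):
    """Rebuild OA and F1 from class sizes and the two error rates."""
    fp = fpr * n_normal          # normal examples flagged as attacks
    tn = n_normal - fp
    tp = tpr * n_attack          # attacks correctly detected
    fn = n_attack - tp
    oa = (tp + tn) / (n_normal + n_attack)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return oa, f1

# NSL-KDDTest+ row.
oa, f1 = scores_from_rates(9711, 12833, fpr=0.1683, tpr=0.9741)
```

Under these assumptions both values come out close to the reported 0.9128 (OA) and 0.9271 (F1-Score), up to rounding.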
20. Empirical evaluation: Classification (TP, training set)
o NSL-KDD
  o Configuration with the highest number of TP in the training set
o UNSW-NB15 and CICIDS2017
  o Configuration with the first local maximum in the TP values on the training set
23. Empirical evaluation: Classification
o Results

Dataset          OA           OA                  F1-Score     F1-Score
                 (T.L.E.)     (without T.L.E.)    (T.L.E.)     (without T.L.E.)
NSL-KDDTest+     0.9128       0.8875              0.9271       0.8987
NSL-KDDTest-21   0.8716       0.7911              0.9256       0.8677
UNSW-NB15        0.9163       0.9092              0.9411       0.9349
CICIDS2017       0.9813       0.9785              0.9523       0.9445
24. Empirical evaluation: State of the Art

NSL-KDDTest+
Method                      OA       F1-Score
CD-IDS                      0.9128   0.9271
Li et al. [12]              0.7914   0.7912
Kim et al. [10]             -        0.8100
Know et al. [11]            -        0.89
Naaser et al. [18]          0.85     -
Yan et al. [26]             0.793    -
Kherlenchimeg et al. [26]   0.80     -

NSL-KDDTest-21
Method                      OA       F1-Score
CD-IDS                      0.8716   0.9256
Li et al. [12]              0.8184   0.9001
Kim et al. [10]             -        0.79
Know et al. [11]            -        0.62
Naaser et al. [18]          0.70     -

UNSW-NB15
Method                      OA       F1-Score
CD-IDS                      0.9163   0.9411
Kim et al. [10]             -        0.90
Yan et al. [26]             0.8825   -

CICIDS2017
Method                      OA       F1-Score
CD-IDS                      0.9813   0.9523
Kim et al. [10]             -        0.89
25. Conclusions and future developments
o Segmentation followed by re-labeling of training examples yields a more accurate intrusion detection model than training without re-labeling on all four test sets
o Future Developments
  o Methodological: use of a Generative Adversarial Network (GAN), in which a generator creates "synthetic" data similar to real data and a discriminator distinguishes the generated data from real data
  o Technological: use of Apache Spark for image construction