Synthesis of an intrusion detection algorithm based on deep learning and reassignment of training labels

UNIBA: http://www.uniba.it DIB: http://www.di.uniba.it KDDE: http://kdde.di.uniba.it
Advisor: Prof.ssa Annalisa Appice
Co-Advisor: Dott.ssa Giuseppina Andresini
Student: Francesco Paolo Caforio
Department of Computer Science, University of Bari Aldo Moro
Via Orabona, 4 - 70125 Bari - Italy

Motivations
o Today's computer systems are complex and prone to
vulnerabilities
o New types of attacks are designed and built to traverse
sophisticated prevention and detection mechanisms
o Hackers design new attacks whose behavior is as similar as
possible to normal network traffic

Thesis Objective
o Synthesis of an Intrusion Detection System
o Data segmentation
o Identifying examples on the segment boundary and changing
their labels
o Deep learning

CD-IDS

CD-IDS: Pre-processing
o One-hot encoding: transforms each categorical attribute into a
vector of numeric (0/1) attributes
o Scaling: standardizes each attribute to mean 0 and standard
deviation 1
o Missing-value replacement: replaces undefined values
  o Null → 0
  o Infinity → Max
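The three pre-processing steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names and the toy columns are hypothetical, and "Max" is read here as the largest finite value of the same column, which the slide does not spell out.

```python
import math
from statistics import mean, pstdev


def one_hot(values):
    """Encode a categorical column as one 0/1 vector per row."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def replace_missing(column):
    """Null (None/NaN) -> 0; infinite values -> column maximum (assumed meaning of 'Max')."""
    finite = [v for v in column if v is not None and math.isfinite(v)]
    col_max = float(max(finite)) if finite else 0.0
    cleaned = []
    for v in column:
        if v is None or math.isnan(v):
            cleaned.append(0.0)
        elif math.isinf(v):
            cleaned.append(col_max)
        else:
            cleaned.append(float(v))
    return cleaned


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in column]


# Hypothetical toy attributes, not taken from any of the datasets.
protocol = ["tcp", "udp", "tcp", "icmp"]
duration = [1.0, None, float("inf"), 3.0]

encoded = one_hot(protocol)          # categories sorted as icmp, tcp, udp
cleaned = replace_missing(duration)  # missing values replaced before scaling
scaled = standardize(cleaned)
```

Replacing missing values before scaling keeps the mean and standard deviation finite; the order of the steps is an assumption of this sketch.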

CD-IDS: Dimensionality reduction
o Multi-level autoencoder (2+1+2 layers)
o Encoder + Decoder
o Central layer with 10 neurons
Figure: autoencoder mapping the data to the reconstructed data

CD-IDS: Segmentation
o Training a Support Vector Machine
o Estimation of the reliability (distance to the support vectors) of
classifying an example into a class (normal/attack)
o Segment creation (normal vs attack)
  o If $c_0(x) > 0.50$, then $x \in C_{attack}$
  o Otherwise $x \in C_{normal}$
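The segment-creation rule can be illustrated with a few lines of Python. The sketch assumes the score c0(x) is already available as a number in [0, 1] for each example; how it is derived from the trained SVM is not shown, and the function name and toy scores are hypothetical.

```python
def segment(scores, threshold=0.50):
    """Assign each example to the attack segment if its score exceeds the threshold."""
    return ["attack" if s > threshold else "normal" for s in scores]


# Hypothetical scores c0(x) for five examples.
scores = [0.05, 0.48, 0.51, 0.90, 0.50]
segments = segment(scores)
```

Because the rule uses a strict inequality, an example scoring exactly 0.50 falls into the normal segment.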

CD-IDS: Class Reassignment
o Identify the examples that are less reliably assigned to class
$C_{normal}$ and change their class
Figure: NSL-KDD training set
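A minimal sketch of the re-labeling step is given below. It rests on several assumptions not stated on the slide: reliability is taken as the distance of the score from the 0.50 boundary, the re-labeled examples receive the attack label, and k (the number of examples to re-label) plays the role of the configuration value. All names are hypothetical.

```python
def relabel_least_reliable(labels, scores, k, threshold=0.50):
    """Give the attack label to the k normal-segment examples closest to the boundary."""
    # Examples that the segmentation step placed in the normal segment.
    normal_segment = [i for i, s in enumerate(scores) if s <= threshold]
    # Smaller distance to the threshold means a less reliable assignment (assumption).
    by_reliability = sorted(normal_segment, key=lambda i: threshold - scores[i])
    new_labels = list(labels)
    for i in by_reliability[:k]:
        new_labels[i] = "attack"
    return new_labels


# Hypothetical toy data: scores and original training labels.
scores = [0.05, 0.48, 0.45, 0.90, 0.30]
labels = ["normal", "normal", "normal", "attack", "normal"]
relabeled = relabel_least_reliable(labels, scores, k=2)
```

Here the two examples scoring 0.48 and 0.45 sit nearest to the boundary, so they are the ones whose label changes.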


CD-IDS: CNN
o A Convolutional Neural Network learns a classifier from labeled
data in the form of images (matrices)
o Each connection → 3 × 10 grayscale image + class (normal/attack)
Figure labels: training example, closest training normal example,
closest training attack example
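One way to read the 3 × 10 image is as three stacked 10-value vectors: the example itself, its closest normal training example, and its closest attack training example. The sketch below follows that reading; the use of Euclidean distance, the row order, and all names are assumptions, and the candidate lists are assumed not to contain the example itself.

```python
import math


def nearest(x, candidates):
    """Return the candidate vector with the smallest Euclidean distance to x."""
    return min(candidates, key=lambda c: math.dist(x, c))


def build_image(x, normal_examples, attack_examples):
    """Stack x, its closest normal example, and its closest attack example as rows."""
    return [
        list(x),
        list(nearest(x, normal_examples)),
        list(nearest(x, attack_examples)),
    ]


# Hypothetical 10-dimensional encodings (e.g. outputs of a 10-neuron central layer).
x = [0.1] * 10
normal_examples = [[0.0] * 10, [1.0] * 10]
attack_examples = [[0.9] * 10, [0.5] * 10]

image = build_image(x, normal_examples, attack_examples)  # 3 rows of 10 values
```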

CD-IDS: CNN
Layer      Hyperparameters                                          Output shape
input      -                                                        (None, 3, 10, 1)
conv0      Conv2D layer, 32 filters, size (2 × 2), activation relu  (None, 2, 9, 32)
dropout_0  dropout = 0.3                                            (None, 2, 9, 32)
conv1      Conv2D layer, 16 filters, size (2 × 4), activation relu  (None, 1, 6, 16)
dropout_1  dropout = 0.3                                            (None, 1, 6, 16)
flatten_1  -                                                        (None, 96)
dense_1    n_classes = 2, activation softmax                        (None, 2)
output
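The output shapes listed for the network can be checked with simple arithmetic. The sketch assumes stride 1 and no padding, which is the reading consistent with the listed shapes; the helper name is hypothetical.

```python
def conv2d_valid_shape(height, width, kernel, filters):
    """Output (height, width, channels) of a stride-1 convolution without padding."""
    kernel_h, kernel_w = kernel
    return (height - kernel_h + 1, width - kernel_w + 1, filters)


# Input image: 3 x 10 with one grayscale channel.
conv0 = conv2d_valid_shape(3, 10, kernel=(2, 2), filters=32)
conv1 = conv2d_valid_shape(conv0[0], conv0[1], kernel=(2, 4), filters=16)
flattened = conv1[0] * conv1[1] * conv1[2]  # number of inputs to the dense layer
```

Dropout layers leave the shape unchanged, so they are omitted from the calculation.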

Empirical evaluation: Data
Dataset         Training  Training  Training  Testing  Testing  Testing
                normal    attack    total     normal   attack   total
NSL-KDDTest+    67343     58630     125973    9711     12833    22544
NSL-KDDTest-21  67343     58630     125973    2152     9698     11850
UNSW-NB15       56000     119341    175341    37000    45332    82332
CICIDS2017      80000     20000     100000    80000    20000    100000
o Datasets
  o NSL-KDD
  o UNSW-NB15
  o CICIDS2017
    o Dataset organized in 10 folders
    o Every folder contains one training set and nine different testing sets
    o Each file contains 100,000 examples (80,000 normal and 20,000 attacks)

Empirical evaluation: Autoencoder
o Number of neurons per autoencoder (AE) layer
AE layer  NSL-KDD  UNSW-NB15  CICIDS2017
1         80       100        55
2         30       40         30
3         10       10         10
4         30       40         30
5         80       100        55

Empirical evaluation: Segmentation
o Which algorithm best separates normal examples from
attacks?
  o Fuzzy C-Means (F C-M) - unsupervised
  o Gaussian Mixture Model (GMM) - unsupervised
  o Support Vector Machine (SVM) - supervised
o Purity Index
  o $\Omega = \{w_1, w_2, \ldots, w_K\}$ - set of clusters
  o $C = \{c_1, c_2, \ldots, c_J\}$ - set of classes
$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|$$
where $N$ is the total number of examples.

Empirical evaluation: Segmentation (NSL-KDD)
o Purity index as the segmentation algorithm varies
Method                  Well-clustered examples  Training-set examples  Purity index
Fuzzy C-Means           111966                   125973                 0.8888
Gaussian Mixture Model  110672                   125973                 0.8785
Support Vector Machine  123582                   125973                 0.9810
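The purity index can be computed directly from cluster assignments and class labels, as in the sketch below (the function name and toy data are hypothetical). The last lines also recompute the reported purity values as the ratio of well-clustered examples to training-set examples, using the counts from the table above.

```python
from collections import Counter


def purity(clusters, classes):
    """Fraction of examples that belong to the majority class of their cluster."""
    per_cluster = {}
    for cluster_id, class_label in zip(clusters, classes):
        per_cluster.setdefault(cluster_id, Counter())[class_label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(classes)


# Hypothetical toy example: two clusters, six examples.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["normal", "normal", "attack", "attack", "attack", "normal"]
toy_purity = purity(clusters, classes)  # (2 + 2) / 6

# Counts reported for the training set (125973 examples).
reported = {
    "Fuzzy C-Means": round(111966 / 125973, 4),
    "Gaussian Mixture Model": round(110672 / 125973, 4),
    "Support Vector Machine": round(123582 / 125973, 4),
}
```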

Empirical evaluation: Classification
o Classifier training after the re-labeling step as the number
of re-labeled examples varies
o Evaluation of accuracy on the training set to see if there is a
threshold
o Evaluation of accuracy on the testing set
o Metrics
  o True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)
  o Overall Accuracy (OA)
  o Precision (P)
  o Recall (R)
  o F-Measure (F1-Score)
  o True Positive Rate (TPR)
  o False Positive Rate (FPR)
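The listed metrics follow from the four confusion-matrix counts using their standard definitions. The sketch below assumes that attacks are the positive class; the function name and the counts in the example are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall and TPR are the same quantity
    return {
        "OA": (tp + tn) / (tp + fp + tn + fn),
        "P": precision,
        "R": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "TPR": recall,
        "FPR": fp / (fp + tn),
    }


# Hypothetical counts, not taken from the experiments.
m = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

A good detector combines a high TPR (few missed attacks) with a low FPR (few false alarms).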

Figure: TP on the training set
Dataset         Configuration  OA      F1-Score  FPR     TPR
NSL-KDDTest+    4500           0.9128  0.9271    0.1683  0.9741
NSL-KDDTest-21  8500           0.8716  0.9256    0.5999  0.9761
UNSW-NB15       500            0.9163  0.9411    0.2229  0.9817
CICIDS2017      100            0.9813  0.9523    0.0071  0.9349
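As a rough consistency check, overall accuracy can be recomputed from the two rate values in a row and the testing-set class counts listed in the data table. This sketch reads the larger rate in each row (about 0.97 for NSL-KDDTest+) as the TPR and the smaller one (about 0.17) as the FPR, and treats attacks as the positive class; the function name is hypothetical.

```python
def accuracy_from_rates(tpr, fpr, n_attack, n_normal):
    """Overall accuracy implied by TPR, FPR, and the class sizes of the testing set."""
    true_positives = tpr * n_attack
    true_negatives = (1 - fpr) * n_normal
    return (true_positives + true_negatives) / (n_attack + n_normal)


# NSL-KDDTest+: 12833 attacks and 9711 normal examples in the testing set.
oa_nsl = accuracy_from_rates(tpr=0.9741, fpr=0.1683, n_attack=12833, n_normal=9711)

# CICIDS2017: 20000 attacks and 80000 normal examples per testing file.
oa_cic = accuracy_from_rates(tpr=0.9349, fpr=0.0071, n_attack=20000, n_normal=80000)
```

Under this reading both values agree with the reported OA of 0.9128 and 0.9813 to four decimal places.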


o NSL-KDD
  o Configuration with the highest number of TP in the training set
o UNSW-NB15 and CICIDS2017
  o Configuration with the first local maximum of the TP values in the training set

NSL-KDDTest+ (testing set)
Configuration  OA        F1-Score  FPR       TPR
100            0.901215  0.912697  0.10658   0.907114
500            0.888396  0.900095  0.104727  0.883192
1000           0.893985  0.905913  0.109463  0.896595
1500           0.903078  0.915129  0.116569  0.917946
2000           0.90228   0.914906  0.12491   0.922855
2500           0.900816  0.913821  0.129544  0.92379
3000           0.902369  0.915427  0.131809  0.928232
3500           0.905696  0.918966  0.138812  0.939375
4000           0.901127  0.914745  0.13943   0.931816
4500*          0.912793  0.927099  0.168263  0.974129
5000           0.912349  0.926999  0.173926  0.977636
5500           0.90849   0.92287   0.161878  0.961739
6000           0.904143  0.919877  0.178457  0.966648
6500           0.90157   0.917812  0.182885  0.96548
7000           0.90228   0.919147  0.194831  0.975766
7500           0.898066  0.916369  0.211616  0.981064
8000           0.89709   0.914762  0.199362  0.970077
8500           0.895538  0.914547  0.218721  0.982
9000           0.895271  0.914286  0.218309  0.98122
9500           0.883783  0.905843  0.246113  0.982077
10000          0.886622  0.907812  0.237669  0.980675
* selected configuration

NSL-KDDTest-21 (testing set)
Configuration  OA        F1-Score  FPR       TPR
100            0.820759  0.889005  0.433086  0.877088
500            0.795865  0.871446  0.427509  0.845432
1000           0.808523  0.880648  0.437732  0.863168
1500           0.829283  0.895252  0.450743  0.891421
2000           0.832068  0.897454  0.464684  0.897917
2500           0.830211  0.896566  0.480483  0.899154
3000           0.834768  0.899651  0.481877  0.905032
3500           0.842025  0.905032  0.508364  0.919777
4000           0.832911  0.899113  0.513476  0.909775
4500           0.86616   0.921941  0.582714  0.965766
5000           0.868692  0.923643  0.589684  0.970406
5500           0.855359  0.914845  0.568309  0.949371
6000           0.858059  0.916823  0.582714  0.955867
6500           0.857468  0.916382  0.578996  0.95432
7000           0.866835  0.922465  0.588755  0.967932
7500           0.870211  0.924785  0.601766  0.974943
8000           0.861013  0.918767  0.586896  0.960404
8500*          0.871561  0.925596  0.599907  0.976181
9000           0.870633  0.925026  0.600372  0.97515
9500           0.870127  0.924835  0.608271  0.976284
10000          0.869114  0.92416   0.605483  0.974428
* selected configuration

UNSW-NB15 (testing set)
Configuration  OA        F1-Score  FPR       TPR
100            0.912907  0.938727  0.230518  0.980208
500            0.91634   0.941082  0.222875  0.981666
1000           0.906559  0.933416  0.212196  0.962285
1500           0.914355  0.939302  0.211982  0.973639
2000           0.912673  0.938682  0.235214  0.982068
2500           0.907021  0.935274  0.263393  0.986987
3000           0.913523  0.93972   0.250214  0.990355
3500           0.907865  0.935232  0.240196  0.977342
4000           0.907238  0.935232  0.256357  0.984004
4500           0.907455  0.935188  0.24925   0.980987
5000           0.903993  0.933535  0.280607  0.990615
5500           0.898563  0.928939  0.262464  0.974125
6000           0.889763  0.921867  0.250286  0.955481
6500           0.885497  0.919927  0.286875  0.966382
7000           0.890368  0.924324  0.308554  0.983711
7500           0.890699  0.924224  0.298214  0.979345
8000           0.879857  0.918621  0.368286  0.996296
8500           0.885572  0.921033  0.316625  0.980451
9000           0.884294  0.920315  0.323304  0.981708
9500           0.881602  0.918665  0.333214  0.982403
10000          0.88      0.91795   0.346429  0.986249

CICIDS2017 (testing set)
Configuration  OA        F1-Score  FPR       TPR
100            0.981319  0.952298  0.007089  0.934952
500            0.975459  0.937567  0.011868  0.924763
1000           0.969574  0.922991  0.016626  0.914375
1500           0.966377  0.916227  0.023496  0.925869
2000           0.963825  0.912081  0.031327  0.944437

o Results
Dataset         OA with T.L.E.  OA without T.L.E.  F1-Score with T.L.E.  F1-Score without T.L.E.
NSL-KDDTest+    0.9128          0.8875             0.9271                0.8987
NSL-KDDTest-21  0.8716          0.7911             0.9256                0.8677
UNSW-NB15       0.9163          0.9092             0.9411                0.9349
CICIDS2017      0.9813          0.9785             0.9523                0.9445

Empirical evaluation: State of the Art

NSL-KDDTest+
Method                     OA      F1-Score
CD-IDS                     0.9128  0.9271
Li et al. [12]             0.7914  0.7912
Kim et al. [10]            -       0.8100
Know et al. [11]           -       0.89
Naaser et al. [18]         0.85    -
Yan et al. [26]            0.793   -
Kherlenchimeg et al. [26]  0.80    -

NSL-KDDTest-21
Method                     OA      F1-Score
CD-IDS                     0.8716  0.9256
Li et al. [12]             0.8184  0.9001
Kim et al. [10]            -       0.79
Know et al. [11]           -       0.62
Naaser et al. [18]         0.70    -

UNSW-NB15
Method                     OA      F1-Score
CD-IDS                     0.9163  0.9411
Kim et al. [10]            -       0.90
Yan et al. [26]            0.8825  -

CICIDS2017
Method                     OA      F1-Score
CD-IDS                     0.9813  0.9523
Kim et al. [10]            -       0.89

Conclusions and future developments
o Segmentation followed by re-labeling of training examples
yields a more accurate intrusion detection model
o Future Developments
  o Methodological: use of a Generative Adversarial Network (GAN), in which a generator creates "synthetic" data similar to real data and a discriminator distinguishes the constructed data from real data
  o Technological: use of Apache Spark for image construction
