Learning Neural Representations for Network
Anomaly Detection
Van Loi Cao, Miguel Nicolau and James McDermott
Abstract—This paper proposes latent representation models for
improving network anomaly detection. Well-known anomaly de-
tection algorithms often suffer from challenges posed by network
data, such as high dimension and sparsity, and a lack of anomaly
data for training, model selection, and hyperparameter tuning.
Our approach is to introduce new regularizers to a classical
Autoencoder (AE) and a Variational Autoencoder (VAE), which
force normal data into a very tight area centered at the origin in
the non-saturating area of the bottleneck unit activations. These
AEs, trained on normal data, will push normal points towards
the origin, whereas anomalies, which differ from normal data,
will be put far away from the normal region. The models are
very different from common regularized AEs, such as the Sparse AE and
the Contractive AE, which make their latent representations less
sensitive to changes in the input data. The bottleneck feature
space is then used as a new data
representation. A number of one-class learning algorithms are
used for evaluating the proposed models. The experiments testify
that our models help these classifiers to perform efficiently and
consistently on high-dimensional and sparse network datasets,
even with relatively few training points. More importantly, the
models can minimize the effect of model selection on these
classifiers since their performance is insensitive to a wide range
of hyperparameter settings.
Index Terms—Anomaly detection, latent representation, high
dimension, one-class classification, autoencoders.
I. INTRODUCTION
THE rapid growth of computer networks has enabled them
to function as a central information system in modern
life. The increase in the size, services and applications, and
infrastructure of computer networks such as the Internet of
Things (IoT), has made them complex and heterogeneous.
Thus, they confront various critical threats such as malicious
activities, network intruders and cyber criminals. Identifying
and preventing these detrimental cyber activities have high pri-
ority these days [1]. Analyzing and monitoring network traffic
to identify such malicious actions in large-scale networks are
crucial tasks, and ideally should be carried out automatically
with little supervision by network administrators [2]. Anomaly
detection is a data analysis task where the goal is to detect
patterns deviating greatly from normal data. It is suitable for
automatically identifying illegal, malicious activities and other
forms of network abuse from the normal behaviors of network
systems [3], [4]. Many machine learning algorithms have been
Manuscript received December 22, 2017; revised March 13, 2018. This
work is funded by Vietnam International Education Development (VIED) and
by agreement with the Irish Universities Association.
VL. Cao is with the School of Computer Science, University College
Dublin, Dublin, Ireland (e-mail: loi.cao@ucdconnect.ie).
J. McDermott and M. Nicolau are with University College Dublin, Dublin,
Ireland (e-mail: james.mcdermott2@ucd.ie and miguel.nicolau@ucd.ie).
employed for developing anomaly detection models [1], [2],
[3]. However, several issues, such as the high dimension and
complex types of network data, the lack of labelled anomalous
traffic, and the rapid evolution of intrusion methods, make
network anomaly detection a challenging task. In this work,
we aim to cope with these issues by proposing latent repre-
sentation models which compress normal data into a specific
region of a latent feature space. This is expected to facilitate
modelling of normal data.
As stated, one of the major issues is that labelled anomalous
data tends not to be available for constructing network anomaly
detection models [3]. Collecting anomalies is extremely dif-
ficult due to privacy and security concerns of computer net-
works, and the shortage of intrusion network traffic and events
in host logs [5], [6]. Network administrators tend to avoid
divulging data that could compromise the privacy of their
clients or privileged information of their networks. Labeling a
huge volume of anomalous data covering all possible kinds of
attacks from a real-world network would be a challenging and
time-consuming task. Moreover, malicious actions or intrusive
methods are evolving over time. Thus, it may require a
significant amount of time to gather and label these data after
the awareness of the detailed information and behavior of new
attacks becomes available. Furthermore, new anomalies, such
as zero-day vulnerabilities, often cause serious damage to net-
work systems. Thus anomaly detection models are required to
cope with new anomalous actions efficiently. Most supervised
learning algorithms using knowledge of previous anomalies
are unable to detect novelties [1]. These issues strongly suggest
that the training process should be as independent as possible
from the availability of anomalous data, and anomaly detection
models should be able to respond in a flexible and timely way
to any new anomalous actions.
However, the absence of anomalies implies the crucial issue
that no validation set is available for estimating hyperparam-
eters. Most well-known anomaly detection algorithms, such
as one-class Support Vector Machine (OCSVM) [7] or Local
Outlier Factor (LOF) [8], are highly dependent on the choice of
parameters [8], [9] (more details will be discussed in Section II
and III). Supposing a small proportion of anomalies are
available for estimating parameters, this may damage the per-
formance of anomaly detection models since new, completely
different anomalies may appear in the future. Therefore, it
is desirable that network anomaly detection models should
provide good predictions on unseen data over a wide range
of parameter settings, and have the ability to detect any new
forms of anomalies instantly as they appear.
The high dimension and complexity of network data is
another challenge to network anomaly detection. Network
traffic is typically described by a huge number of features,
such as in CISCO NetFlow data, and in different data types,
such as hierarchies (IP addresses), categories (protocols and
services) or continuous attributes [3], [10]. Anomaly detection
techniques often require some preprocessing on input data,
which may result in a higher-dimensional and sparser version
of the data. The curse of dimensionality is a problem for
anomaly detection algorithms [11]. This leads to a high pro-
portion of irrelevant features effectively producing noise that
conceals true anomalies in network data. If enough subspaces
that contain a subset of features are given, at least one
subspace (mostly relevant features) can be found in which
anomalies appear far from normal data. However, the search
for such subspaces is systematically difficult in high dimension
since the number of subspaces increases exponentially with
the dimensionality, which is called the exponential search
space problem. The curse of dimensionality also results in
concentration of distances. The relative difference between
the pairwise distance of any two datapoints and that of others
vanishes with increasing dimensionality. This is a challenge
to distance-based anomaly detection algorithms. Therefore,
network anomaly detection algorithms are required to deal
with high-dimensional and sparse data¹, by discovering more
robust and relevant features.
Unsupervised learning techniques, such as Support Vector
Data Description (SVDD), OCSVM and LOF, have been
widely used for anomaly detection [3]. These techniques
have successfully addressed the task of modeling normal
data without any assumption about its underlying distribu-
tion. LOF [8] is an advanced technique for high-dimensional
anomaly detection, which uses the local density deviation of a
given datapoint from its neighbors as an anomaly score. When
LOF is trained on only normal data, it can be used as a one-
class classifier. Recently, Kernel Density Estimation (KDE)
has been employed for building anomaly detection models,
and proven to efficiently model normal data with unknown
underlying distributions [12], [13]. In practice however, these
anomaly detection algorithms have some drawbacks: less
generalization ability in high dimension due to the curse of
dimensionality phenomenon [11], [14], and the difficulty of
tuning hyperparameters. These algorithms are non-parametric
methods, thus their query time is potentially high (more details
in Sections II and III).
Autoencoders (AEs) [15], [16] are a neural network archi-
tecture which have emerged as a suitable approach to anomaly
detection [5], [17], [18], [19] and as building blocks in deep
learning [20], [21], [22] in recent years. An AE is a feed-
forward neural network which attempts to reconstruct the
original input data at the output layer. The middle hidden
layer, sometimes called the bottleneck layer, like a nonlin-
ear PCA, compresses the redundancies while preserving and
differentiating non-redundant information in the input [17].
¹A dataset with a majority of zero elements is considered a sparse dataset.
Sparsity is a term used to represent the ratio of the number of zero entries
to the total number of entries in a dataset, and it is in the range of [0, 1]. In
this paper, a dataset with a sparsity above 0.5 is regarded as a sparse one.
In the anomaly detection context, an AE trained on normal
data will behave well on normal instances and will result
in small reconstruction errors (REs), but poorly reconstruct
anomalies giving large REs. Thus, RE is commonly used
as a measure of anomaly score. Alternatively, the middle
hidden layer of a trained AE can be used as a new feature
representation (called a latent representation) for improving
the performance of density-based anomaly detection [13] or
anomaly detection based on self-organizing maps [23]. The
central idea is that the latent representation which is lower-
dimension, and more robust to capture normal behaviors,
would help simple classifiers to identify anomalies. However,
the normal data is allowed to be freely distributed in the latent
feature space. The AE encoder could learn to map points from
the normal class into very different regions of the latent feature
space. Thus, the distribution of normal data in the latent feature
space may have an arbitrary shape which may not encourage
the stability of anomaly detection algorithms.
In order to overcome the limitations of the well-known
anomaly detection algorithms, we aim to find a new data
representation for facilitating simple anomaly detection al-
gorithms. The new representation is aimed to have useful
characteristics: lower dimension, straightforward to capture the
structure of normal data, a similar shape of normal data in
the new representation for different input distributions, and
normal data to be distributed in a small region in the feature
space and anomalies to be expected to appear in the rest
of the space. This will potentially improve the performance
of anomaly detection algorithms, and may make them less
sensitive to parameter settings. Our approach is to develop two
AEs, a classical AE and a Variational Autoencoder (VAE),
for constructing such a data representation by introducing
some constraints on the distribution of normal data in the
bottleneck layer. The new regularizers will encourage these
AEs to learn to represent latent data in a more meaningful
way - training data (which is assumed to be normal) appears
close together, and is distributed in a specific region in the
latent feature space. The bottleneck layers of these trained
AEs will then be used as the new data representation. Fig. 1
gives an example of data representation in the original space
(a), in the latent feature space of AEs (b), and in the latent
feature space of our models (c). The normal data shown in
Fig 1(b) is closer together than that in Fig 1(a), and has an
arbitrary shape. In Fig 1(c), the normal data is constrained to
be distributed in a good shape close to the origin. A number
of one-class classification algorithms are then employed to
capture the region representing normal behavior in the latent
feature space, and identify any datapoint not belonging to
this region as anomalies. More details will be presented in
Section IV.
The remainder of the paper is organized as follows. In
Section II and III, we briefly describe several anomaly detec-
tion algorithms, and highlight some related work in anomaly
detection. Our methods are presented in Section IV. This is
followed by Section V showing the evaluation and discussion
of our models. Section VI draws some conclusions and sug-
gests future work.
Fig. 1. Illustrations of data in the original feature space (a), the latent feature
space of AEs (b), and the latent feature space of our models (c).
II. MATHEMATICS OF ONE-CLASS CLASSIFICATION
ALGORITHMS
This section briefly describes the anomaly detection algorithms
used in this paper: Centroid, Mean distance, KDE, LOF and
OCSVM, as well as autoencoders.
A. Anomaly detection algorithms
Centroid (CEN): This is a parametric method which uses
a single Gaussian to model training data. The distance (i.e.
radius) from the centroid (the origin) to an observation reflects
the degree of abnormality of the observation. A larger value
implies a higher probability that the datapoint is an anomaly.
By imposing a threshold on the distance, a query datapoint
can be classified as either normal or an anomaly. This method
has no hyperparameters, and works under the assumption that
the training data has a Gaussian distribution.
Mean Distance (MDIS): The mean of the Euclidean distances
from a datapoint to the points of the normal training set can be used as
an anomaly score. By imposing a threshold on the mean distance,
a datapoint whose anomaly score is above the threshold is
flagged as an anomaly. MDIS has no hyperparameters, and is a
non-parametric method.
Kernel Density Estimation (KDE): KDE estimates the probability
density function underlying a data sample [24], and can be used
for constructing an anomaly detection model as presented in [12].
However, the main drawback of such a model is its computational
cost at the querying stage, especially on large datasets. The
classification accuracy of KDE-based classifiers depends on the
choice of the bandwidth h of the kernel function [12].
Local Outlier Factor (LOF): LOF [8] considers the data-
points that have a considerably lower local density than their
neighbors as anomalies. It estimates a density deviation score,
called local outlier factor, of a given datapoint with respect to
its neighbors. The larger the LOF score a given datapoint
has, the higher the probability the datapoint is anomalous.
The algorithm has shown its power on network anomaly
detection [25]. In practice however, it has some limitations
when dealing with high-dimensional data [2], and the choice
of the number of neighbors k is still an open question.
One-class Support Vector Machine (OCSVM): OCSVM [7]
first maps the normal data into a feature space via a kernel
function, and searches for a hyperplane with maximum margin
between the region containing most of normal data (normal
region) and the origin in the feature space. The idea behind this
is to allocate the region encompassing the origin for anomalies
to appear. That is to say, the OCSVM decision function returns
a positive value in the normal region far from the origin, and
a negative value in the anomaly region near the origin.
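For concreteness, the sketch below shows how such one-class classifiers can be instantiated with scikit-learn, the library used for the OCC implementations in Section V. The toy data and the hyperparameter values (nu, gamma, bandwidth, number of neighbors) are illustrative placeholders, not the settings used in our experiments.

# Minimal sketch of the one-class classifiers discussed above, using scikit-learn.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor, KernelDensity

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 10))                        # normal (one-class) training data
X_test = np.vstack([rng.normal(size=(50, 10)),              # normal queries
                    rng.normal(loc=4.0, size=(50, 10))])    # anomalous queries

# OCSVM: decision_function is positive on the normal side of the hyperplane.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=1.0 / X_train.shape[1])
ocsvm.fit(X_train)
ocsvm_scores = -ocsvm.decision_function(X_test)             # larger = more anomalous

# LOF in novelty mode: fit on normal data only, then score unseen queries.
lof = LocalOutlierFactor(n_neighbors=50, novelty=True)
lof.fit(X_train)
lof_scores = -lof.score_samples(X_test)                     # larger = more anomalous

# KDE: low log-density under the normal model indicates an anomaly.
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(X_train)
kde_scores = -kde.score_samples(X_test)                     # negative log-density

# MDIS and CEN: mean distance to the training points, and distance to the centroid.
mdis_scores = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2).mean(axis=1)
cen_scores = np.linalg.norm(X_test - X_train.mean(axis=0), axis=1)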
B. Autoencoder
An autoencoder [15], [16] is a neural network which con-
sists of two parts: encoder and decoder as shown in Fig. 2(a).
The encoder is defined as a feature extractor that allows the
explicit representation of an input x in a feature space. Let
$f_\theta$ denote the encoder, and $X = \{x_1, x_2, \dots, x_n\}$ be a dataset. The
encoder $f_\theta$ maps an input $x_i \in X$ into a latent vector
$z_i = f_\theta(x_i)$, where $z_i$ is the code or latent representation. The
decoder $g_\theta$ maps the latent representation $z_i$ back into the
input space, which forms a reconstruction $\hat{x}_i = g_\theta(z_i)$. The
encoder and decoder are commonly represented as single-layer
neural networks in the form of non-linear functions of affine
mappings as follows:

$$f_\theta(x) = s_f(Wx + b) \quad (1)$$
$$g_\theta(z) = s_g(W'z + b') \quad (2)$$

where $W$ and $W'$ are the weight matrices of the encoder and
decoder, and $b$ and $b'$ are the bias vectors of the encoder and
decoder. $s_f$ and $s_g$ are the activation functions of the encoder
and decoder, such as a logistic sigmoid or hyperbolic tangent
non-linear function, or a linear identity function.
Autoencoders learn to minimize the loss function in (3)
with respect to the parameters $\theta = \{W, W', b, b'\}$, using a
learning algorithm such as Stochastic Gradient Descent (SGD)
with back-propagation. The reconstruction loss function over
training instances can be written as:

$$L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) = \frac{1}{n}\sum_{i=1}^{n} l\big(x_i, g_\theta(f_\theta(x_i))\big) \quad (3)$$
where $l(x_i, \hat{x}_i)$ is the discrepancy between the input $x_i$ and
its reconstruction $\hat{x}_i$. The choice of the reconstruction loss
depends largely on the appropriate distributional assumptions
on the given data. The mean squared error (MSE)² is commonly
used for real-valued data, whereas a cross-entropy loss³ can be
used for binary data. By compressing input data into a lower
dimensional space, the classical autoencoder avoids simply
learning the identity, and removes redundant information [17].
Denoising autoencoders (DAEs) [26], [27] are regularized
autoencoders that are trained to reconstruct the original input
from a corrupted version of the input. This will allow DAEs
to capture the structure of the input distribution, and again
prevent them from learning the identity. The loss function of
AEs in (3) is rewritten for DAEs as follows:
$$L_{DAE}(\theta; x) = \sum_{i=1}^{n} \mathbb{E}_{p(\tilde{x}|x_i)}\big[l(x_i, g_\theta(f_\theta(\tilde{x})))\big] \quad (4)$$

where $\tilde{x}$ is the corrupted version of $x_i$ drawn from $p(\tilde{x}|x_i)$,
and $\mathbb{E}_{p(\tilde{x}|x_i)}[\cdot]$ is the expectation of the reconstruction loss at $x_i$
over a number of samples $\tilde{x}$ drawn from $p(\tilde{x}|x_i)$.

²$L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2$
³$L_{AE}(\theta; x) = -\frac{1}{n}\sum_{i=1}^{n}\big[x_i \log(\hat{x}_i) + (1 - x_i)\log(1 - \hat{x}_i)\big]$

This is because the
corruption process is performed stochastically on the original
input each time a point xi
is considered. There are many ways
to corrupt the input, such as Gaussian noise or salt and pepper
noise, but randomly masking features of the input to zero is
the most commonly used. This loss function can be optimized
by SGD, as in optimizing the AE loss function.
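The following is a minimal sketch of an AE with the reconstruction loss in (3) and of RE used as an anomaly score. It is written in PyTorch purely for illustration; the framework, the single-hidden-layer architecture and the toy data are assumptions here, not the deeper networks and training scheme described in Section V.

# Minimal autoencoder sketch: MSE reconstruction loss as in (3), and the
# per-point reconstruction error (RE) used as an anomaly score.
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, n_features, n_bottleneck):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_bottleneck), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(n_bottleneck, n_features), nn.Tanh())

    def forward(self, x):
        z = self.encoder(x)          # latent representation z = f_theta(x)
        return self.decoder(z), z    # reconstruction x_hat = g_theta(z)

x_normal = torch.randn(512, 20)                  # stand-in for normal training data
ae = TinyAE(n_features=20, n_bottleneck=5)
opt = torch.optim.Adadelta(ae.parameters())

for _ in range(50):                              # train on normal data only
    x_hat, _ = ae(x_normal)
    loss = nn.functional.mse_loss(x_hat, x_normal)
    opt.zero_grad(); loss.backward(); opt.step()

x_query = torch.randn(8, 20)
x_hat, _ = ae(x_query)
re_score = ((x_query - x_hat) ** 2).mean(dim=1)  # per-point RE anomaly score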
C. Variational Autoencoder
The Variational Autoencoder (VAE) [28] is a neural network
that consists of two parts: a probabilistic encoder representing
the approximate posterior $q_\phi(z|x)$ to the intractable true posterior
$p_\theta(z|x)$, and a probabilistic decoder that represents the
generative model $p_\theta(x|z)$, as shown in Fig. 2(b). The objective
of the VAE is to optimize the variational lower bound on the
marginal likelihood of the data w.r.t. the variational parameters $\phi$ and
the generative parameters $\theta$. Since it is intractable, the marginal
likelihood is computed as a sum over the marginal likelihoods of the
individual datapoints, $\log p_\theta(x_1, \dots, x_n) = \sum_{i=1}^{n} \log p_\theta(x_i)$,
where $\log p_\theta(x_i)$ can be written as:

$$\log p_\theta(x_i) = D_{KL}\big(q_\phi(z|x_i) \,\|\, p_\theta(z|x_i)\big) + \mathcal{L}(\theta, \phi; x_i) \quad (5)$$

The term $\mathcal{L}(\theta, \phi; x_i)$ is the lower bound on the marginal likelihood
of datapoint $x_i$ since the first term, the Kullback-Leibler
divergence (KL-divergence) of the approximate posterior from
the true posterior, is non-negative. The lower bound can be
written as follows:

$$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z|x_i)}\big[-\log q_\phi(z|x_i) + \log p_\theta(x_i, z)\big]$$
$$= -D_{KL}\big(q_\phi(z|x_i) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x_i)}\big[\log p_\theta(x_i|z)\big] \quad (6)$$

where $p_\theta(x_i|z)$ is the likelihood of $x_i$ given the latent variable
$z$, and $p_\theta(z)$ is the prior over the latent variables.
However, the second term in (6) requires a random latent
variable $z$ sampled from the approximate posterior $q_\phi(z|x)$.
This is problematic since back-propagation cannot flow
through a random node $z$. When $q_\phi(z|x)$ is restricted to
certain kinds of parametric distributions, e.g. Gaussian, the
random variable $z$ can be reparameterized as a deterministic
function $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary variable with
independent marginal $p(\epsilon)$. This yields a lower-variance lower
bound estimator called SGVB (Stochastic Gradient Variational
Bayes):

$$\tilde{\mathcal{L}}(\theta, \phi; x_i) = -D_{KL}\big(q_\phi(z|x_i) \,\|\, p_\theta(z)\big) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x_i | z_{i,l}) \quad (7)$$

where $z_{i,l} = g_\phi(\epsilon_{i,l}, x_i)$ and $\epsilon_l \sim p(\epsilon)$. In (7), the KL-divergence
term forces $q_\phi(z|x)$ to be as close as possible to
$p_\theta(z)$ and works as a regularizer, whereas the second term is
an expected negative reconstruction error.
For analytically integrating the KL-divergence in (7), the
true posterior $p_\theta(z|x)$ is assumed to be an approximate Gaussian
with approximately diagonal covariance. Let the prior be
$p_\theta(z) = N(0, I)$, and the approximate posterior be a multivariate
Gaussian with a diagonal covariance structure, $q_\phi(z|x_i) = N(\mu^i, (\sigma^i)^2)$,
where $\mu^i$ and $\sigma^i$ are the mean and s.d. evaluated
at datapoint $i$. Let $\mu^i_j$ and $\sigma^i_j$ denote the $j$-th elements of $\mu^i$
and $\sigma^i$ respectively, and $J$ be the dimensionality of $z$. The
KL-divergence in (7) is written as follows:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = D_{KL}\big(N(\mu^i, (\sigma^i)^2)\,\|\,N(0, I)\big)
= \frac{1}{2}\sum_{j=1}^{J}\Big((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log((\sigma^i_j)^2)\Big) \quad (8)$$

Fig. 2. The architectures of AEs (a), VAEs (b), and the hybrids of the latent
representation models and one-class classifiers (c).
Taking $D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big)$ from (8) into (7), we get the objective
function of the VAE at datapoint $i$ as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{2}\sum_{j=1}^{J}\Big((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log((\sigma^i_j)^2)\Big) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x_i|z_{i,l}) \quad (9)$$
where $z_{i,l} = \mu^i + \sigma^i \odot \epsilon_l$ and $\epsilon_l \sim N(0, I)$. $L$ is the number of
samples per datapoint; in practice, it can be set to 1 as in [28].
When optimizing (maximizing) the objective function in (9)
by Stochastic Gradient Ascent, VAEs learn the recognition
model parameters $\phi$ jointly with the generative model parameters
$\theta$. Given a datapoint $x_i$, the probabilistic encoder outputs
the parameters of the approximate posterior at this datapoint,
$\mu^i$ and $\sigma^i$. An actual value $z_{i,l} \sim q_\phi(z|x_i)$, obtained through
$z_{i,l} = \mu^i + \sigma^i \odot \epsilon_l$, is the input for the probabilistic decoder. The
output of the decoder is the reconstruction $\hat{x}_i$. The distribution
of the encoder output is Gaussian, whereas that of the decoder
depends on the type of data (Gaussian for real-valued data or
Bernoulli for binary).
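A short sketch may help make (7)-(9) concrete: the encoder outputs $\mu$ and $\log\sigma^2$, a latent sample is drawn via the reparameterization $z = \mu + \sigma\odot\epsilon$, and the KL term of (8) is computed in closed form. PyTorch and the tensor shapes below are assumptions for illustration only.

# Sketch of the VAE bottleneck: reparameterization and the closed-form KL of (8).
import torch

def reparameterize(mu, logvar):
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)          # eps ~ N(0, I)
    return mu + sigma * eps                # z_{i,l} = mu_i + sigma_i * eps_l

def kl_to_standard_normal(mu, logvar):
    # D_KL(N(mu, sigma^2) || N(0, I)) = 0.5 * sum_j (sigma_j^2 + mu_j^2 - 1 - log sigma_j^2)
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1)

mu = torch.zeros(4, 3, requires_grad=True)      # stand-in encoder outputs
logvar = torch.zeros(4, 3, requires_grad=True)
z = reparameterize(mu, logvar)                   # gradients flow through mu and logvar
kl = kl_to_standard_normal(mu, logvar)           # one KL value per datapoint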
III. RELATED WORK
In this section, we discuss recent trends and some state-of-
the-art anomaly detection algorithms. This includes Support
Vector Machines [7], [29], [30], and autoencoder-based meth-
ods [5], [14], [17], [18], [19], [31].
Schölkopf et al. [7] and Campbell et al. [30] presented
hyperplane-based one-class SVM approaches as already dis-
cussed. In [7], their aim is to map the input data into the
feature space via a kernel function, and then find a hyperplane
with a maximum margin between the region containing normal
data and the origin in the feature space. The half space
containing the origin is identified as the anomalous region.
The trade-off between the two objectives, maximizing the
margin and minimizing the number of target vectors falling
into the anomalous region, is controlled by the outlier fraction
$\nu \in (0, 1)$. The larger the value of $\nu$, the more normal vectors
are rejected as outliers and the more normal vectors become
support vectors. When $\nu$ approaches 1, almost all normal
vectors become support vectors. The method was evaluated
on the US postal service database of handwritten digits, and
the results show that the classifier performed well. However,
how to choose values for the hyperparameter $\nu$ and kernel
parameters such as $\gamma$ (related to the bandwidth h in KDE)
is still an open question. Instead of allocating the origin region
for anomalies, Campbell et al. [30] proposed a model that
learns to capture the region containing normal instances in
feature space. They attempted to find a hyperplane with respect
to the center of the distribution of normal data, and anomalies
were assumed to appear in the other side. Linear programming
techniques are employed instead of the quadratic programming
in Schölkopf’s approach, which allows their model to learn from large
datasets rapidly.
Tax and Duin [29] proposed a method called Support Vector
Data Description for anomaly detection. In this approach,
normal data is again first mapped into a feature space corre-
sponding to a kernel function. It then finds a hypersphere with
minimum radius which encompasses almost all normal vectors
in the feature space. Any query datapoints lying inside the
hypersphere are considered as normal and others as anomalies.
In order to achieve good classification accuracy, it is desir-
able to reduce the volume of the hypersphere by rejecting
some fraction of training data (the outlier fraction known
as parameter C) when training this model. This illustrates
a theme present in all one-class classification research, the
trade-off between false positive and false negative rates. They
introduced different kernel functions to SVDD that make the
method more flexible, and the Gaussian kernel was found to be
the most suitable for many datasets. When using the Gaussian
kernel, the method is comparable to OCSVM [7]. However,
the technique requires a large number of normal examples,
and extra outlier objects for training in order to improve the
classification accuracy [29]. Both SVDD and OCSVM have
demonstrated their effectiveness on anomaly detection, but
they are limited in their ability to model large-scale and high-
dimensional data due to their time and space complexity [32].
The approach of using stand-alone AEs to build anomaly de-
tection systems was proposed in [5], [18], [19], in which AEs
act as either anomaly detection methods or feature reduction
techniques. Hawkins et al. [18] trained an AE (also known as a
replicator neural network) with three narrow hidden layers on
normal data. Its RE was used as an “outlier score”: an outlier
score above a predetermined threshold indicated an anomaly.
A step-wise activation function was used for the neurons in the
middle hidden layer, which mapped input data into a number
of possible clusters. Each of these clusters was associated with
an active state of these neurons. These neurons were active
with specific steps on a particular class of data (normal or
anomaly). Thus, the labels of these clusters can be used as
an alternative approach for indicating anomalies. The model
was evaluated on the Wisconsin Breast Cancer (WBC) and
the KDD’99 datasets, and both of these models (RE-based
and cluster-based) produced high accuracy. Furthermore, Fiore
et al. [5] constructed an AE using Discriminative Restricted
Boltzmann Machines to test the hypothesis that there is a
deep similarity among normal behaviors. They expected that
their model can describe all the characteristics of normal
traffic when comparing it against unseen anomalous traffic.
Their experiments involving real-world network traces and
the KDD’99 datasets confirmed that its performance suffered
when testing in a network greatly different from that where
training data was collected. In contrast, Sakurada et al. [19]
employed an AE as a nonlinear feature reduction technique for
anomaly detection. They attempted to clarify the properties
of AEs by comparing a classical AE and a DAE to linear
PCA and Kernel PCA. These techniques were evaluated on
an artificial dataset and on spacecraft telemetry data. They
concluded that DAEs not only outperform linear PCA and
Kernel PCA in terms of accuracy, but also can avoid the high
computation costs of kernel PCA.
Hybrid approaches or extensions of AEs have been recently
proposed for anomaly detection [14], [31]. Veeramachaneni
et al. [31] proposed an ensemble learner to combine three
single one-class classifiers: AE-based, density-based, and ma-
trix decomposition-based techniques. They also used a human
expert to provide ongoing correct labels from which the
algorithms can learn. The models were tested on a large
network log file dataset, and achieved promising results. Erfani
et al. [14] introduced a hybrid of a Deep Belief Network
(DBN) and OCCs, such as OCSVM and SVDD, for solving
the problem of high-dimensional anomaly detection. The DBN
was pre-trained in the greedy layer-wise fashion, that is unsu-
pervised training of each Restricted Boltzmann Machine one-
by-one. OCSVM [7] and SVDD [29] were then built on top of
the pre-trained DBN. This structure takes advantage of the high
classification accuracy of these OCCs and the nonlinear
feature reduction of DBNs. The model was evaluated
on eight high-dimensional UCI datasets. The results showed
that the performance of the hybrid models was comparable to
AEs and better than stand-alone OCSVM and SVDD, and the
training and testing times improved significantly.
IV. PROPOSED MODEL
We aim to find a new data representation that facilitates
simple anomaly detection algorithms. This section clarifies
how to construct the data representation by introducing new
regularizers to an AE and a VAE. The new regularizers
together with reconstruction loss will help these AEs to give
a robust representation of normal behavior. The regularizers
will encourage the encoders of these AEs to condense normal
data as close together as possible at a particular region in the
latent feature space, while reconstruction loss promotes these
AEs to keep normal points from overlapping each other. In
order to separate the normal region from anomalies, normal
points will be “pushed” towards the origin at the non-saturating
area of the bottleneck unit outputs by the regularizers. That
is, each coordinate (given by the output of the bottleneck unit
activation) of an encoded point will tend to be pushed closer
to the non-saturating value (zero) of the activation function.
Thus, a trained AE on normal data can keep normal datapoints
close to the origin, whereas anomalous datapoints, if they
differ from normal datapoints, will therefore tend to differ
greatly, and appear in other regions. A number of one-class
classifiers are employed for evaluating the proposed models.
Fig. 2(c) illustrates the hybrid of the data representation
models and one-class classifiers. More details are shown in
Subsections IV-A and IV-B.
Our models are very different from other common
regularized AEs, including Sparse AEs and Contractive AEs.
Sparse AEs attempt to construct a sparse representation in
an overcomplete setting in which a few of the outputs of
the hidden unit activations can vary at a time, and others
are set to a saturating value [33]. Thus, the latent data is
penalized close to the saturating value at zero [34], or the
hidden bias vectors are controlled [35]. Contractive AEs seek
a latent representation that is as insensitive as possible w.r.t the
variances in the input data [36]. Thus, the outputs of the hidden
units are constrained to be close to their marginal values (e.g.
0 or 1 in sigmoid function).
A. Shrink Autoencoder
A new regularizer is added to the loss function of an AE
which encourages the AE to construct a representation of
normal data which will be easy for one-class classification
algorithms. The regularizer is designed to penalize normal
datapoints whose vectors in the latent space are of large
magnitude, that is it will restrict the normal data to lie close
to the origin. Hence, this is called a shrink regularizer, and
the AE is named Shrink AE (SAE). The loss function in (3)
can be redefined for this situation as follows:
$$L_{SAE}(\theta; x_i, z) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) + \lambda\,\frac{1}{n}\sum_{i=1}^{n} \| z_i \|^2 \quad (10)$$

where $\hat{x}_i$ and $z_i$ are the reconstruction and the latent vector
of the observation $x_i$ respectively. The first term is the
reconstruction error, $\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2$, and the second term is
the shrink regularizer. The parameter $\lambda$ controls the trade-off
between the two terms in the loss function.
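A minimal sketch of the SAE loss in (10) is given below; only the composition of the loss (mean squared RE plus $\lambda$ times the mean squared latent norm) follows the text, while the tiny encoder/decoder, the toy mini-batch and the PyTorch framework are illustrative assumptions. The value $\lambda = 10$ anticipates the setting selected in Section V.

# Sketch of the Shrink AE loss in (10): MSE reconstruction term plus the
# shrink regularizer lambda * (1/n) * sum ||z_i||^2.
import torch
import torch.nn as nn

def sae_loss(x, x_hat, z, lam=10.0):
    re = ((x - x_hat) ** 2).sum(dim=1).mean()    # reconstruction error term
    shrink = (z ** 2).sum(dim=1).mean()          # mean squared latent norm
    return re + lam * shrink

encoder = nn.Sequential(nn.Linear(20, 5), nn.Tanh())   # stand-in encoder/decoder
decoder = nn.Sequential(nn.Linear(5, 20), nn.Tanh())
x = torch.randn(64, 20)                                 # stand-in normal mini-batch
z = encoder(x)
loss = sae_loss(x, decoder(z), z, lam=10.0)
loss.backward()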
B. Dirac delta Variational Autoencoder
VAEs attempt to encode data so that it is distributed as a
standard Gaussian in the latent space. Thus, normal data will
reside in a large area centered at the origin. Our strategy is
to compress normal data into a smaller area near the origin.
Therefore, we redesign the KL-divergence at (8) by forcing
the approximate posterior $q_\phi(z|x)$ to be as close as possible
to a new prior $p_\theta(z)$ with a very small standard deviation.
Let us recall the KL-divergence between two multivariate
Gaussian distributions in $\mathbb{R}^n$, $P_1 = N(\mu_1, \Sigma_1)$ and $P_2 = N(\mu_2, \Sigma_2)$,
defined in [37] as:

$$D_{KL}(P_1 \,\|\, P_2) = \frac{1}{2}\Big[\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1) - n + \log\Big(\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\Big)\Big] \quad (11)$$
Let $\mu^i$ and $\Sigma^i$ denote the variational mean and the covariance
matrix evaluated at datapoint $i$, $q_\phi(z|x_i) = N(\mu^i, \Sigma^i)$, and $J$
be the dimensionality of $z$. Consider a constant $\alpha$ ($\alpha \ll 1.0$)
to be the variance of the prior probability, $p_\theta(z) = N(0, \alpha I)$, where $I$
is an identity matrix. Applying these to (11), the KL-divergence
between $q_\phi(z|x_i)$ and $p_\theta(z)$ can be written as follows:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = \frac{1}{2}\Big[\mathrm{tr}\big((\alpha I)^{-1}\Sigma^i\big) + (\mu^i)^T(\alpha I)^{-1}(\mu^i) - J + \log\Big(\frac{\det(\alpha I)}{\det(\Sigma^i)}\Big)\Big] \quad (12)$$
Taking $I$ and $\alpha$ in (12), we get:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = \frac{1}{2}\Big[\alpha^{-1}\,\mathrm{tr}(\Sigma^i) + \alpha^{-1}(\mu^i)^T(\mu^i) - J + \log\Big(\frac{\alpha^J}{\det(\Sigma^i)}\Big)\Big]$$
$$= \frac{1}{2\alpha}\Big[\mathrm{tr}(\Sigma^i) + (\mu^i)^T(\mu^i) - \alpha J + \alpha J\log\alpha - \alpha\log(\det(\Sigma^i))\Big] \quad (13)$$
Because $\Sigma^i$ is a diagonal matrix of size $J \times J$, $\Sigma^i$ can be
treated as a vector of its $J$ diagonal elements. Let $\mu^i_j$ and $(\sigma^i_j)^2$
denote the $j$-th elements of $\mu^i$ and $\Sigma^i$ respectively.
Taking $\mathrm{tr}(\Sigma^i)$ and $\det(\Sigma^i)$, we get:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = \frac{1}{2\alpha}\Big[\sum_{j=1}^{J}(\sigma^i_j)^2 + \sum_{j=1}^{J}(\mu^i_j)^2 - \alpha\sum_{j=1}^{J}1 + \alpha\sum_{j=1}^{J}\log\alpha - \alpha\log\Big(\prod_{j=1}^{J}(\sigma^i_j)^2\Big)\Big]$$
$$= \frac{1}{2\alpha}\sum_{j=1}^{J}\Big[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log((\sigma^i_j)^2)\Big] \quad (14)$$
Now we apply the KL-divergence in (14) to (7). The
negative log likelihood loss in (7) is replaced by the MSE between
$x_i$ and its reconstruction $\hat{x}_i$ since we will apply our models
only on real-valued datasets. The objective function given
at (7) can be rewritten as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 - \frac{1}{2\alpha}\sum_{j=1}^{J}\Big[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log((\sigma^i_j)^2)\Big] \quad (15)$$
The prior can be seen as a Dirac delta distribution because
$\alpha$ is very small. Thus, this VAE is named the Dirac delta Variational
Autoencoder (DVAE). Maximizing (15) is equivalent
to minimizing its KL-divergence and RE components. We
introduce a parameter $\lambda$ to control the trade-off between
the two components in (15). The objective function can be
rewritten in the form of the loss function of DVAE as follows:

$$L_{DVAE}(\theta, \phi; x_i) = \frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 + \lambda\,\frac{1}{2\alpha}\sum_{j=1}^{J}\Big[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log((\sigma^i_j)^2)\Big] \quad (16)$$
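The sketch below assembles the DVAE loss in (16) from the KL term in (14) and the MSE term, with $\lambda = 0.05$ and $\alpha = 10^{-8}$ as used later in Section V. The stand-in tensors for the encoder/decoder outputs and the PyTorch framework are assumptions for illustration only; the practical $\log_{10}$ scaling of the KL term discussed in Section V is omitted.

# Sketch of the DVAE loss in (16): MSE reconstruction plus lambda times the
# KL-divergence of (14) against the narrow prior N(0, alpha*I).
import math
import torch

def dvae_kl(mu, logvar, alpha=1e-8):
    # (1/(2*alpha)) * sum_j [sigma_j^2 + mu_j^2 - alpha + alpha*log(alpha) - alpha*log(sigma_j^2)]
    var = logvar.exp()
    per_dim = var + mu.pow(2) - alpha + alpha * math.log(alpha) - alpha * logvar
    return per_dim.sum(dim=1) / (2.0 * alpha)

def dvae_loss(x, x_hat, mu, logvar, lam=0.05, alpha=1e-8):
    re = ((x - x_hat) ** 2).sum(dim=1).mean()
    kl = dvae_kl(mu, logvar, alpha).mean()
    return re + lam * kl                          # loss to be minimized

x = torch.randn(64, 20)                           # stand-in mini-batch
x_hat = torch.randn(64, 20, requires_grad=True)   # stand-in decoder output
mu = torch.zeros(64, 5, requires_grad=True)       # stand-in encoder outputs
logvar = torch.zeros(64, 5, requires_grad=True)
loss = dvae_loss(x, x_hat, mu, logvar)
loss.backward()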
V. EVALUATION AND DISCUSSION
This section evaluates the SAE and DVAE algorithms for
constructing data representations that improve the performance
of anomaly detection algorithms. This is demonstrated
by the experimental results produced from five simple one-
class classification (OCC) algorithms LOF, CEN, KDE, MDIS,
OCSVM using the latent representations of SAE and DVAE on
fourteen problems. In order to highlight the strengths of SAE
and DVAE, the results are also compared to those from: (1)
the stand-alone OCCs (without any AE latent representation),
(2) the OCCs using the latent representations of a denoising
AE (DAE) and a VAE, and (3) the RE-based OCC. For
measuring the accuracy of the models, we evaluate the area
under the resulting ROC curve (AUC) by trying many different
thresholds, and create a confusion matrix by choosing only one
threshold. A number of experiments and analysis for exploring
different aspects of the latent representations of SAE and
DVAE are carried out as follows:
• Evaluate the effect of dimensionality and sparsity on
the classification accuracy of the OCCs using the latent
representations given by SAE and DVAE.
• Explore the effect on classification accuracy of OCSVM
and LOF of their parameters $\nu$, $\gamma$, and k. Investigate the
distribution of latent vectors on normal and anomaly data.
• Measure the effect of training size on the AUCs and query
time created by SAE-OCCs and DVAE-OCCs.
• Evaluate the AUCs from the OCCs on specific categories
of attack types in NSL-KDD and UNSW-NB15.
A. Experiments
1) Datasets: The experiments are conducted on fourteen
datasets including network problems as shown in Table I. The
eight network datasets are mostly well-known problems in the
domain of network security. Although the main objective is
to cope with the challenges arising in high-dimensional net-
work data, the models are also evaluated on six non-network
datasets from the UCI Machine Learning Repository [38].
This is because we intend to evaluate the performance of
our models on a diversity of data, and expect to emphasize
their strength on high-dimensional network-related datasets.
The normal traffic in CTU13, UNSW-NB15 and NSL-KDD
is considered as normal data, whereas all the attacks are
treated as anomalies. In PenDigits, the digits ‘0’ and ‘2’ are
chosen as the normal and anomalous classes respectively. For
GLASS, window glass is considered as the normal class, and
other classes as the anomalous class. In the other datasets, the
normal and anomalous classes are indicated following [39].
The CTU13 is a publicly available botnet dataset provided
in 2011 [40]. The data covers a wide range of real-world
botnet traffic mixed with normal traffic and background traf-
fic (unlabeled data). The CTU13 consists of thirteen botnet
scenarios, and each of them involves a specific type of
malware. We choose four scenarios in CTU13, and split each
of them into 40% for training (normal traffic) and 60% for
evaluating (normal and botnet traffic) following [41]. We use
most of the 14 features in CTU13 except source/destination
IP addresses. Three categorical features, protocol, sTos and
dTos, are encoded by one-hot-encoding, which results in higher
dimensional versions of these scenarios.
TABLE I
FOURTEEN DATASETS FOR EVALUATING THE PROPOSED MODELS

Dataset              Dimension⁴  Training set  Normal Test  Anomaly Test
PageBlocks                   10          3930          983           112
WPBC                         32           118           30            10
PenDigits                    16           780          363           364
GLASS                         9           130           33            11
Shuttle                       9          3410        11478          3022
Arrhythmia                  259           189           48            37
Rbot (CTU13-10)              38          6338         9509         63812
Murlo (CTU13-8)              40         29128        43694          3677
Neris (CTU13-9)              41         11986        17981        110993
Virut (CTU13-13)             40         12775        19164         24002
Spambase                     57          2230          558           363
UNSW-NB15⁵                  196         56000        37000         45332
NSL-KDD⁵                    122         67343         9711         12833
InternetAds                1558          1582          396            77
NSL-KDD is a filtered version of the KDD’99 dataset [42],
which was suggested to address the inherent issues mentioned
in [43]. Although NSL-KDD still suffers from some problems
discussed in [44], it can be reasonable to use the data as
an effective benchmark for comparing anomaly detection
algorithms in this work due to the shortage of public intrusion
data. Each 41-feature record in NSL-KDD is labeled as either
normal or a specific attack group in the four main categories:
Denial of Service (DoS), Remote to Local (R2L), User to
Local (U2R) and Probe. NSL-KDD consists of two parts:
KDDTrain+ and KDDTest+, which are drawn from different
distributions (an additional 14 types of attacks appear in KDDTest+
only). Three categorical features, protocol type, service and
flag, are preprocessed by one-hot-encoding which increases
the number of features to 122.
UNSW-NB15 has been recently provided and is expected to
address the inherent issues in the KDD’99 dataset and NSL-
KDD [45]. Each record comprising 47 features is labeled
either as realistic normal traffic or one of the nine modern
attack categories: Fuzzers, Analysis, Backdoor, DoS, Exploit,
Generic, Reconnaissance, Shellcode and Worm. The dataset
is decomposed into two sets, UNSW NB15 training-set and
UNSW NB15 testing-set, for training and evaluating. The
categorical attributes, such as protocol, service and state, are
preprocessed by one-hot-encoding which increases the number
of features to 196. The labelled anomalies in the training parts
of NSL-KDD and UNSW-NB15 are discarded.
PenDigits and Shuttle are already partitioned into training
and testing parts, thus we simply delete labelled anomalies
in the training parts to form training sets. For Spambase,
InternetAds, PageBlocks, WPBC, GLASS and Arrhythmia, we
take 80% of normal data for training and 20% of normal and
anomalies for testing. All datasets are normalized into [-1, 1]
since the activation function of the output layer of these AEs
is the tanh function, and missing values are discarded.
⁴The dimensions of the four CTU13 datasets, UNSW-NB15 and NSL-KDD
are those after preprocessing by one-hot-encoding.
⁵The training sets of UNSW-NB15 and NSL-KDD are much larger than
other datasets, thus we will sample a small proportion (10%) for training.
2) Parameter Settings: Anomalies are not available during
training, so cross-validation can not be used to tune hyperpa-
rameters. This is one of the major difficulties for this task.
We configure the hyperparameters of AEs and OCCs using
common values and rules of thumb, and then confirm that
performance is not sensitive to these values.
OCC Parameters: The Gaussian kernel is used for KDE and
OCSVM. The scaling parameter $\gamma$, related to the bandwidth h
by $\gamma = \frac{1}{2h^2}$, is set to a default value, $\gamma = \frac{1}{n_f}$ as in [46], where
$n_f$ is the number of input features. The trade-off parameter
$\nu$ is set to two separate values⁶, 0.1 and 0.5, referred to as
OCSVM$_{\nu=0.1}$ and OCSVM$_{\nu=0.5}$. In LOF, the number of
nearest neighbors k is chosen as 10% of the training size.
AE Parameters: The architectures of SAE and DVAE are
configured as follows: the number of hidden layers is equal to
5 as in [14], the size of the bottleneck layer m is chosen by
the rule of thumb presented in [13], $m = [1 + \sqrt{n}]$, where n is
the number of input features. The choice of mini-batch size is
dependent on the size of training sets. This is needed because
the sizes of the datasets vary by a factor of 500. For small
training sets (< 2000), we split into 20 batches; for large ones, we
set the mini-batch size to 100. We also want to provide a similar
number of batches for each iteration in training processes
which will help early-stopping work efficiently. In order to
eliminate learning rate and the number of training iterations,
we employ the Adadelta algorithm [47] together with early-
stopping techniques [48] for training these networks, which
enables the training processes to operate automatically and
avoid over-fitting. The hyperbolic tangent function is chosen
as the activation function for these AEs. Weights are initialized
following the scheme in [49].
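The rules of thumb above can be collected in a small helper, sketched below; the function names and the exact rounding choices are ours and merely illustrate the configuration described in this subsection.

# Helper sketch collecting the rules of thumb described above; names are
# illustrative, not from the paper's code.
import math

def bottleneck_size(n_features):
    # m = [1 + sqrt(n)], the rule of thumb from [13]
    return int(1 + math.sqrt(n_features))

def minibatch_size(n_train):
    # small training sets are split into 20 batches; large ones use batches of 100
    return max(1, n_train // 20) if n_train < 2000 else 100

def occ_parameters(n_features, n_train):
    return {
        "gamma": 1.0 / n_features,             # OCSVM / KDE kernel scale, gamma = 1/n_f
        "nu": (0.1, 0.5),                      # the two OCSVM settings compared
        "lof_k": max(1, int(0.1 * n_train)),   # k = 10% of the training size
    }

print(bottleneck_size(122), minibatch_size(67343), occ_parameters(122, 6734))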
In practice, the KL-divergence in the DVAE loss function
is scaled by log10 since its value is extremely large in early
epochs. The distribution of latent data before training seems to
be very similar to the standard Gaussian distribution. The prior
$p_\theta(z)$ is a Dirac delta distribution, thus the KL-divergence is
very large, especially at early iterations of the training process.
Fig. 3 (also Fig. 5 in the supplementary material) illustrates
the distribution of latent data (the first feature z0) during the
training process. Therefore, the log10 scaling is expected to
reduce the domination of this term on the loss function.
Fig. 3. Histogram of latent data (the first feature z0) during the training of
DVAE ($\alpha = 10^{-8}$) on Spambase.
SAE and DVAE are trained to minimize the loss functions
in (10) and (16) by an adaptive SGD algorithm (Adadelta) as in
the training of MLPs. We do not apply a pretraining procedure
for these networks since modern back-propagation methods
(weight initialization [49] and Adadelta [47]), together with
⁶This is expected to show the influence of $\nu$ on the performance of OCSVM.
the new regularization terms, are expected to encourage the
networks to learn the parameters in hidden layers effectively.
Early stopping is controlled by two parameters. Training will
terminate when the loss does not improve by an absolute value
of $10^{-3}$ for t iterations, where t is calculated as 2000 / number
of batches (where number of batches is already defined in
this section). Note that only normal data is employed for the
training process.
We use the same model selection for setting up a five hidden
layer DAE and a five hidden layer VAE⁷. However, the DAE
is trained in greedy layer-wise fashion following the original
scheme proposed in [20], [21]. In the pretraining procedure,
each single denoising autoencoder is trained to minimize MSE
between the reconstruction formed from a corrupted version⁸
of the input, and the original input. This is optimized by
SGD with a common value for the learning rate, $10^{-2}$, and 200
iterations⁹ to initialize weights and biases for the DAE. The
DAE and VAE are then fine-tuned (end-to-end) as in the
training of SAE and DVAE.
Estimating $\lambda$: This is carried out for estimating the parameter
$\lambda$ in the loss functions of SAE (10) and DVAE (16). The
regularizers (shrink in SAE and KL-divergence in DVAE)
force normal datapoints as close together as possible at the
origin, whereas the reconstruction loss attempts to keep them
from overlapping in order to reconstruct them at the output
layer. The two components tend to conflict with each other.
Thus, an appropriate value of $\lambda$ should be chosen to
balance the two components. However, anomalous data is not
available for tuning $\lambda$ or determining the number of training
iterations in order to avoid overfitting. According to [50], there
are three phases in the training process of a feed-forward
network. The generalization error includes two components
called approximation error and complexity error. In the first
phase, the approximation error dominates the complexity error,
and the generalization error decreases gradually. In phase 2,
these components are approximately balanced, and the gener-
alization error continues to decrease further. The complexity
error is increasingly large after phase 2, and dominates the
approximation error due to large network weights, which can
lead to oscillation and high generalization errors (phase 3).
Thus, the training process should be stopped in phase 2.
Therefore, we investigate these loss functions and their two
components on five values, $\lambda_{SAE} \in \{0.1, 1, 5, 10, 50\}$ and
$\lambda_{DVAE} \in \{0.001, 0.01, 0.05, 0.1, 0.5\}$, on four datasets over
1000 epochs. Firstly, we observe three phases on the SAE
training error curves. The larger the value of $\lambda$, the longer
phase 2 will last, which makes it easy to choose early stopping
parameters. When $\lambda$ is large (about 10), phase 2 is longer, but
$\lambda = 50$ makes the training error less stable in phase 2. $\lambda = 10$
seems to be a good value which allows us to choose common
values for the early stopping parameters. When we apply early
stopping with $\lambda_{SAE} = 10$, we see that the stopping point is
⁷Equation (9) is rewritten in the form of the VAE loss function since the
VAE is trained under the same training scheme as DVAE: $L_{VAE}(\theta, \phi; x_i) =
\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 + \lambda\,\frac{1}{2}\sum_{j=1}^{J}\big[(\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log((\sigma^i_j)^2)\big]$.
⁸It is obtained by randomly setting 10% of the input features to zero.
⁹There is no need for using early-stopping here since this is aimed to
initialize weights and biases to be close to a good solution.
mostly in phase 2. We also observe AUC curves, and the early
stopping appears to perform well. Even though the AUCs are very good
at the first few epochs on some datasets, we are not using
AUCs to choose $\lambda$. Similarly, we choose $\lambda_{DVAE} = 0.05$. For
brevity we present only the curves of SAE on CTU13-10 with
$\lambda_{SAE} = 10$ in Fig. 4, and on the four datasets in Figs. 1–4 in
the supplementary material.
Fig. 4. SAE loss function and its components (RE and Shrink losses) (w.r.t the
left y-axis), and the AUCs created by SAE-LOF, SAE-CEN and SAE-OCSVM
(w.r.t the right y-axis) during the training process of SAE on CTU13-10.
3) Main experiments: The bottleneck layers of the trained
DAE, VAE, SAE and DVAE are used as latent representa-
tions for six one-class classifiers LOF, CEN, MDIS, KDE,
OCSVM$_{\nu=0.1}$ and OCSVM$_{\nu=0.5}$. We use the terms DAE-
OCCs, VAE-OCCs, SAE-OCCs, and DVAE-OCCs to refer to
the six one-class classifiers when using the latent representa-
tions of DAE, VAE, SAE and DVAE respectively. The REs of
these AEs are also used as anomaly scores, producing four
further RE-based classifiers. The performance of these stand-
alone one-class classifiers on original data are considered as
baselines. All experiments are implemented in Python 2.7
and run on a machine with an Intel Core 2 Duo i5-3360M
CPU at 2.8 GHz, 8 GB RAM and RAM frequency of 1600
MHz, and the implementation of our algorithms is available on
GitHub (https://github.com/vanloicao/SAEDVAE). The OCCs
provided by scikit-learn are employed [46]. The main results
are shown in Table II.
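As an illustration of the evaluation pipeline, the sketch below encodes data with a trained bottleneck, fits one OCC on the normal latent vectors and computes the AUC with scikit-learn's roc_auc_score; the encode placeholder, the toy data and the identity encoder in the usage example are assumptions, not our trained SAE/DVAE models.

# Sketch of the hybrid evaluation: encode with a trained AE bottleneck, fit a
# one-class classifier on the (normal) training latents, and compute the AUC.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

def evaluate_hybrid(encode, X_train, X_test, y_test, nu=0.1):
    z_train = encode(X_train)                    # latent representation of normal data
    z_test = encode(X_test)
    occ = OneClassSVM(kernel="rbf", nu=nu, gamma=1.0 / z_train.shape[1])
    occ.fit(z_train)
    scores = -occ.decision_function(z_test)      # larger = more anomalous
    return roc_auc_score(y_test, scores)         # y_test: 1 = anomaly, 0 = normal

# toy usage with an identity "encoder" as a stand-in
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 5))
X_test = np.vstack([rng.normal(size=(50, 5)), rng.normal(loc=4.0, size=(50, 5))])
y_test = np.array([0] * 50 + [1] * 50)
print(evaluate_hybrid(lambda x: x, X_train, X_test, y_test))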
B. Analysis and discussion
Discussion: Table II presents the AUCs achieved by DAE-
OCCs, VAE-OCCs, SAE-OCCs and DVAE-OCCs, and their
corresponding RE-based classifiers from the 2nd to the 5th
rows respectively. The results created by the six stand-alone
one-class classifiers are shown in the first row. Each column
represents the AUCs created by a number of classifiers on the
same problem. We use gray-scale to present the performance
of these classifiers on each dataset. In each column, the highest
AUC is highlighted by the lightest gray. The fourteen datasets
are arranged in ascending sparsity order.
Table II shows that when working on the latent repre-
sentations produced by SAE and DVAE, the six one-class
classifiers perform better in terms of classification accuracy
than those using DAE, VAE or stand-alone OCCs on the eight
network-related datasets. These datasets are typically very
high-dimensional and sparse, such as InternetAds with 1558
features. This suggests that the latent representations produced
by SAE and DVAE facilitate these one-class classifiers in deal-
ing with high-dimensional and sparse network-related datasets.
However, VAE-OCCs produce relatively poor performance.
This can be explained as follows: the VAE regularizer has less
influence on learning the representation since the latent data
is already in a good shape before training (see Fig.3). Thus,
most of the representation power of the VAE may be used
for reconstruction. Moreover, normal data resides in a large
region that may give more “room” for anomalies to appear
inside the region. The normal data is also not forced on the
non-saturated part of the activation function.
The hybrid SAE-OCCs and DVAE-OCCs also yield very
similar AUCs on each network-related dataset, even though
these one-class classifiers originate from different algorithms,
and their parameters (e.g. ⌫) are set to different values. This
is clear to see in the 4th and 5th rows, where sparsity > 0.50.
This implies that SAE and DVAE may constrain normal data
in their latent representations in a well-shaped distribution that
is straightforward for these classification algorithms to capture
normal behaviors, and less sensitive to parameter settings.
Moreover, SAE-OCCs and DVAE-OCCs produce comparable
or superior AUCs in comparison to the RE-based DAE classi-
fier on the network-related datasets, especially for high sparsity
and dimensionality. The influence of OCC parameters and the
distribution of latent vectors are explored later.
The influence of dimensionality and sparsity: We next inves-
tigate the influence of sparsity and dimensionality of data on
the classification accuracy produced from hybrid DAE-OCCs,
SAE-OCCs and DVAE-OCCs. We use the term AUC-DIFF to
refer to the difference in AUC between a classifier (e.g. LOF)
on the original data and on the data encoded by an AE. A
positive value of AUC-DIFF indicates an improvement due to
the AE encoding. AUC-DIFF is plotted against sparsity and
dimensionality of datasets shown in Fig. 5(a) and Fig. 5(b).
It can be seen from Fig. 5(a) that there is a clear increasing
trend in the AUC-DIFF lines of SAE-OCCs and DVAE-OCCs,
while the AUC-DIFF graph of DAE-OCCs tends to decrease.
Similar patterns can also be found when investigating the
influence of dimensionality, shown in Fig. 5(b). The ranking of
datasets by sparsity is similar to the ranking by dimensionality,
therefore these two pieces of evidence are partly overlapping.
The conclusion is that the benefit of the new AE encodings
is greater for sparse, high-dimension datasets, whereas the
benefit of the existing DAE encoding is greater for small, non-
sparse datasets.
The influence of OCC parameters: This is to assess the
influence of the OCC parameters $\nu$, $\gamma$, and k on the
classification accuracy of OCSVM and
LOF when using the latent representations of DAE, SAE
and DVAE. The parameter $\gamma$ is fixed at $\frac{1}{n_f}$ when
investigating $\nu$, whereas $\nu$ is set to 0.1 when examining $\gamma$.
Each of these parameters is examined on fifty different values,
$\nu \in [0.01, 0.5]$ and $\gamma \in [2\times10^{-4}, 2\times10^{4}]$. We plot AUCs from
DAE-OCSVM, SAE-OCSVM and DVAE-OCSVM against $\nu$
in Fig. 6(a), and against $\gamma$ in Fig. 6(b). The figures show that
the AUC curves of SAE-OCSVM and DVAE-OCSVM tend to
be stable while those of DAE-OCSVM vary according to the
values of $\nu$ or $\gamma$. This implies that the latent representations
TABLE II
AUCS FROM THE STAND-ALONE ONE-CLASS CLASSIFIERS, HYBRID DAE-OCCS, SAE-OCCS AND DVAE-OCCS, AND THE RE-BASED CLASSIFIERS.
Represen-
-tation
Methods
One-class
Classifiers
Datasets (Sparsity)
P
a
g
e
B
lo
c
k
s
(0
.0
0
)
W
P
B
C
(0
.0
2
)
P
e
n
D
ig
it
s
(0
.1
3
)
G
L
A
S
S
(0
.1
8
)
S
h
u
tt
le
(0
.2
2
)
A
rr
h
y
th
m
ia
(0
.5
0
)
C
T
U
1
3
-1
0
(0
.7
1
)
C
T
U
1
3
-0
8
(0
.7
3
)
C
T
U
1
3
-0
9
(0
.7
3
)
C
T
U
1
3
-1
3
(0
.7
3
)
S
p
a
m
b
a
s
e
(0
.8
1
)
U
N
S
W
-N
B
1
5
(0
.8
4
)
N
S
L
-K
D
D
(0
.8
8
)
In
te
rn
e
tA
d
s
(0
.9
9
)
Stand-alone
LOF 0.971 0.600 0.995 0.972 0.984 0.788 0.902 0.899 0.955 0.963 0.751 0.745 0.793 0.762
CEN 0.944 0.580 0.966 0.961 0.881 0.816 0.996 0.971 0.915 0.916 0.816 0.738 0.955 0.816
MDIS 0.927 0.640 0.962 0.970 0.898 0.786 0.998 0.966 0.734 0.891 0.731 0.801 0.929 0.694
KDE 0.928 0.637 0.961 0.967 0.883 0.787 0.998 0.958 0.720 0.889 0.731 0.800 0.924 0.693
OCSVM⌫=0.5 0.934 0.610 0.961 0.961 0.863 0.794 0.998 0.958 0.851 0.925 0.736 0.807 0.935 0.704
OCSVM⌫=0.1 0.934 0.557 0.968 0.832 0.760 0.807 0.983 0.797 0.852 0.898 0.736 0.792 0.890 0.710
DAE
LOF 0.933 0.553 0.997 0.931 0.985 0.654 0.751 0.896 0.891 0.793 0.392 0.736 0.662 0.476
CEN 0.922 0.693 0.964 0.959 0.931 0.738 0.972 0.949 0.628 0.730 0.476 0.743 0.881 0.337
MDIS 0.905 0.700 0.950 0.994 0.901 0.707 0.981 0.960 0.653 0.855 0.466 0.765 0.888 0.342
KDE 0.903 0.690 0.954 0.992 0.892 0.706 0.980 0.939 0.616 0.857 0.460 0.756 0.861 0.335
OCSVM⌫=0.5 0.912 0.630 0.958 0.989 0.885 0.665 0.981 0.938 0.655 0.711 0.454 0.690 0.854 0.325
OCSVM⌫=0.1 0.920 0.557 0.976 0.606 0.762 0.668 0.937 0.775 0.702 0.332 0.578 0.536 0.697 0.314
RE-Based 0.969 0.540 0.997 0.986 0.821 0.824 0.998 0.988 0.943 0.972 0.805 0.873 0.959 0.842
VAE
LOF 0.512 0.480 0.549 0.444 0.489 0.479 0.490 0.499 0.507 0.500 0.509 0.505 0.501 0.474
CEN 0.514 0.497 0.549 0.526 0.489 0.461 0.490 0.500 0.507 0.499 0.507 0.504 0.501 0.472
MDIS 0.509 0.517 0.553 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467
KDE 0.509 0.527 0.554 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467
OCSVM⌫=0.5 0.510 0.517 0.555 0.521 0.490 0.484 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.466
OCSVM⌫=0.1 0.515 0.537 0.553 0.537 0.491 0.466 0.490 0.498 0.507 0.499 0.505 0.505 0.501 0.463
RE-Based 0.928 0.657 0.959 0.961 0.883 0.784 0.998 0.957 0.698 0.881 0.734 0.801 0.923 0.694
SAE
= 10
LOF 0.954 0.607 0.996 0.959 0.817 0.762 1.000 0.983 0.960 0.975 0.813 0.894 0.937 0.943
CEN 0.964 0.610 0.995 0.915 0.800 0.754 0.999 0.991 0.950 0.969 0.835 0.886 0.963 0.935
MDIS 0.967 0.603 0.996 0.898 0.794 0.757 0.999 0.990 0.950 0.968 0.826 0.887 0.964 0.936
KDE 0.967 0.607 0.996 0.884 0.783 0.756 0.999 0.990 0.949 0.968 0.825 0.886 0.964 0.934
OCSVM⌫=0.5 0.967 0.610 0.996 0.876 0.773 0.756 0.999 0.990 0.950 0.970 0.823 0.891 0.964 0.935
OCSVM⌫=0.1 0.956 0.600 0.996 0.890 0.781 0.740 0.999 0.988 0.944 0.971 0.825 0.893 0.961 0.933
RE-Based 0.929 0.637 0.959 0.959 0.884 0.787 0.997 0.958 0.720 0.888 0.734 0.800 0.925 0.690
DVAE (λ = 0.05, α = 10⁻⁸)
LOF 0.908 0.327 0.987 0.705 0.841 0.807 0.999 0.978 0.954 0.973 0.810 0.876 0.958 0.900
CEN 0.906 0.450 0.988 0.774 0.849 0.777 0.999 0.982 0.956 0.963 0.809 0.879 0.960 0.892
MDIS 0.914 0.437 0.987 0.749 0.810 0.794 0.999 0.984 0.957 0.964 0.806 0.873 0.961 0.883
KDE 0.917 0.430 0.987 0.749 0.802 0.796 0.999 0.985 0.957 0.964 0.806 0.872 0.961 0.882
OCSVM ν=0.5 0.920 0.450 0.988 0.769 0.802 0.797 0.999 0.987 0.957 0.974 0.808 0.872 0.961 0.882
OCSVM ν=0.1 0.922 0.460 0.988 0.791 0.804 0.780 0.999 0.988 0.956 0.973 0.817 0.872 0.959 0.881
RE-Based 0.928 0.640 0.958 0.953 0.880 0.785 0.998 0.922 0.715 0.836 0.734 0.803 0.924 0.694
Fig. 5. The influence of sparsity (a) and dimensionality (b) on the AUCs produced by six one-class classifiers using latent representations of DAE, SAE and
DVAE. The visualization of the latent data (the first two features z0 and z1) created by DAE, SAE and DVAE (c) on CTU13-10.
This implies that the latent representations of SAE and DVAE make OCSVM perform consistently over
a wide range of ν and γ values.
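As a rough guide to reproducing this kind of sensitivity sweep, the following is a minimal sketch (not the authors' exact script) using scikit-learn's OneClassSVM with γ fixed while ν is varied over fifty values; the arrays z_train, z_test and y_test are hypothetical stand-ins for encoded latent data and 0/1 anomaly labels.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical stand-ins for encoded data: normal latent codes near the origin,
# anomalies further away.
z_train = rng.normal(0.0, 0.1, size=(500, 2))
z_test = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
                    rng.normal(1.5, 0.3, size=(50, 2))])
y_test = np.r_[np.zeros(200), np.ones(50)]            # 1 = anomaly

aucs = []
for nu in np.linspace(0.01, 0.5, 50):                 # fifty nu values, as in the experiment
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=0.1).fit(z_train)
    scores = -clf.decision_function(z_test)           # larger score = more anomalous
    aucs.append(roc_auc_score(y_test, scores))
print(min(aucs), max(aucs))                           # a narrow range indicates insensitivity to nu

A flat AUC curve over the swept values is what the stable SAE and DVAE curves in Fig. 6(a) correspond to.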
The number of neighbors k is chosen in the range from
1% to 50% of training size. For example, if k is 10% of a
training dataset of size 200 samples, k is equal to 20. The
AUCs of hybrid DAE-LOF, SAE-LOF and DVAE-LOF are
computed, and plotted against 50 values of k as shown in
Fig. 6(c). The AUC curves of the hybrid SAE-LOF and DVAE-
LOF seem to level off within the range of k while there is
no clear trend for the AUC curve of DAE-LOF. Thus, the latent representations of SAE and DVAE make
LOF insensitive to the choice of k. More results are shown in
Fig. 6 of the supplementary material.
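The LOF sweep can be sketched in the same hedged way, with k taken as a fraction of the training size; again the data arrays below are hypothetical stand-ins rather than the paper's datasets.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
z_train = rng.normal(0.0, 0.1, size=(500, 2))          # hypothetical encoded normal data
z_test = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
                    rng.normal(1.5, 0.3, size=(50, 2))])
y_test = np.r_[np.zeros(200), np.ones(50)]

for frac in np.linspace(0.01, 0.5, 50):                # k from 1% to 50% of training size
    k = max(1, int(frac * len(z_train)))
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True).fit(z_train)
    scores = -lof.decision_function(z_test)            # larger score = more anomalous
    print(f"k={k}: AUC={roc_auc_score(y_test, scores):.3f}")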
These experiments confirm that the one-class classifiers,
such as OCSVM and LOF, perform consistently on wide
ranges of parameter settings when using the latent represen-
tations of SAE and DVAE. This can be explained by: (1)
normal data is represented by well-shaped (Gaussian) distributions, confined to a small region highly
isolated from the regions where anomalies are expected to appear; (2)
the normal data from different sources will have a similar
representation. Fig. 5(c) is a typical example (also Fig. 7 in
the supplementary material). Therefore, OCSVM and LOF can
model normal data very well even though these classifiers use
few datapoints for support vectors in OCSVM (e.g. ⌫ = 0.01)
or for nearest neighbors in LOF (e.g. k = 1% training size).
This happens on several datasets.
The influence of training size: We investigate the influence
of training size on the latent representations of SAE and
DVAE for anomaly detection tasks. Four datasets of more than
10000 training instances are chosen for this experiment, namely CTU13-09, CTU13-13, NSL-KDD and UNSW-NB15. Each
dataset is sub-sampled multiple times (sizes ranging from 500
to 10000) to give smaller training set sizes for this experiment.
Model selection is used as described in Subsection V-A2. The
AUCs and query times produced from the hybrid SAE-OCCs
and DVAE-OCCs are plotted against these training sizes as
shown in Fig. 8 and Fig. 9 in the supplementary material. The
results clearly show that the six one-class classifiers produce
very similar AUCs amongst the five sizes on the same dataset.
This suggests that the representation models, SAE and DVAE,
tend to be consistent on a wide range of training sizes, and
are less sensitive to training size than the hybrid DBN-OCCs
in [14, see Fig. 5]. This is a positive result because it appears
that excessive amounts of data are not required to make this
method perform well. In terms of the complexity at query time,
CEN outperforms the other OCCs, and its query time does not scale with training size.
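A minimal sketch of this sub-sampling experiment is given below, assuming OCSVM as the one-class classifier and synthetic latent arrays as hypothetical stand-ins; the loop records both AUC and query time for each training size.

import time
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
z_normal = rng.normal(0.0, 0.1, size=(10000, 2))       # hypothetical encoded normal data
z_test = np.vstack([rng.normal(0.0, 0.1, size=(2000, 2)),
                    rng.normal(1.5, 0.3, size=(500, 2))])
y_test = np.r_[np.zeros(2000), np.ones(500)]

for size in (500, 1000, 2000, 5000, 10000):            # sub-sampled training sizes
    idx = rng.choice(len(z_normal), size=size, replace=False)
    clf = OneClassSVM(nu=0.1, gamma=0.1).fit(z_normal[idx])
    t0 = time.perf_counter()
    scores = -clf.decision_function(z_test)            # query stage
    query_time = time.perf_counter() - t0
    print(size, round(roc_auc_score(y_test, scores), 3), f"{query_time:.4f}s")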
Specific kinds of attacks: Our representation models are also
examined on the thirteen specific attack groups in NSL-KDD
and UNSW-NB15 as shown in Table III. This table has a
similar structure to Table II, without arrangement according to
sparsity. In general, the hybrid SAE-OCCs and DVAE-OCCs
produce large improvements in classification accuracy compared to their baselines on most of the attack
groups, especially on the groups where the baseline is already good. This reflects a common theme in
classification methods.
Moreover, the performance of SAE-CEN is evaluated on
NSL-KDD by a confusion matrix as shown in Table IV.
The confusion matrix is not the same as in the multi-class
classification problem. This is because the classifiers built from
only normal data use a threshold to classify unseen data into
either the normal or anomalous class. This means that we cannot measure the misclassification of a
normal datapoint as a specific attack group, or of one attack group as another. Therefore, precision
values are only computed for normal and anomaly in the table. In this work, the threshold is set so that
90% of the normal training data is correctly classified.
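This thresholding rule can be sketched in a few lines; the score arrays below are hypothetical stand-ins for the anomaly scores produced by the trained model on normal training data and on unseen data.

import numpy as np

rng = np.random.default_rng(0)
train_scores = rng.gamma(2.0, 1.0, size=5000)    # hypothetical scores of normal training data
test_scores = rng.gamma(2.0, 1.0, size=1000)     # hypothetical scores of unseen data

threshold = np.percentile(train_scores, 90)      # 90% of normal training scores fall below this
is_anomaly = test_scores > threshold             # points above the threshold are flagged as anomalies
print(threshold, is_anomaly.mean())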
TABLE IV
CONFUSION MATRIX OF THE HYBRID SAE-CEN ON NSL-KDD

                         Actual class
Prediction     Normal   Probe     DoS     R2L    U2R   Precision
Normal           8658       3     601     848     10       85.6%
Anomaly          1053    2418    6857    2039     57       91.5%
Recall          89.2%   99.9%   91.9%   70.6%  85.1%       88.8%

Note: the values in bold are correctly classified.
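As a check on how the summary columns of Table IV follow from the counts, for example:

Precision(Normal)  = 8658 / (8658 + 3 + 601 + 848 + 10) = 8658 / 10120 ≈ 85.6%
Recall(Normal)     = 8658 / (8658 + 1053) ≈ 89.2%
Overall accuracy   = (8658 + 2418 + 6857 + 2039 + 57) / 22544 = 20029 / 22544 ≈ 88.8%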
In terms of classification accuracy, the performance of these one-class classification algorithms is
comparable when the encoding is good (e.g. the encoding of SAE and DVAE). When
considering computational complexity, CEN, which is a sim-
ple method without hyperparameters, is very computationally
efficient at both modeling and querying. Thus, it is nominated
as the best model in our experiments.
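To illustrate why CEN is so cheap at query time, a centroid-style scorer over latent codes needs only a norm computation per query. The sketch below is an interpretation rather than the authors' exact implementation: it scores a point by its distance from the origin, where the proposed regularizers place normal data, and reuses the 90% thresholding rule described above.

import numpy as np

def cen_score(z):
    # Distance of each latent point from the origin; normal codes are pushed
    # towards the origin by the SAE/DVAE regularizers, so larger distances
    # indicate more anomalous points.
    return np.linalg.norm(z, axis=1)

rng = np.random.default_rng(0)
z_train = rng.normal(0.0, 0.1, size=(500, 2))          # hypothetical encoded normal data
threshold = np.percentile(cen_score(z_train), 90)      # 90% of normal scores below threshold
z_query = rng.normal(1.5, 0.3, size=(5, 2))            # hypothetical encoded query points
print(cen_score(z_query) > threshold)                  # True => predicted anomaly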
VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed latent representation models,
SAE and DVAE, which help anomaly detection methods
to cope with high-dimensional and sparse network datasets.
Classical AEs do not bring data to a “nice” distribution by
themselves, and the distribution they create is arbitrary. In the
tasks where we rely on good behavior of the encoding, we have
to control the distribution. Even the standard VAE regularization, which does control the distribution,
does not put the network "under pressure" to use all of its representational
power to represent normal data. Our approaches do so, forcing
normal data into a very tight area centered at the origin in
the non-saturating area of the bottleneck unit activations. This
helps AEs trained on normal data to keep normal datapoints
close to the origin and push anomalies far away.
We have demonstrated that the latent representation created by
our models helps well-known anomaly detection algorithms
to perform efficiently and consistently on high-dimensional
and sparse network data, even with relatively few training
points. Amongst these algorithms, CEN is very computationally efficient and can feasibly be run in real time.
More importantly, the representation reduces the difficulty of
model selection for these algorithms since their performance
is insensitive to a wide range of hyperparameter settings.
In future work, we propose to investigate latent representations using Gaussian mixture models. We also
plan to propose an alternative method for estimating the hyperparameter λ in the loss functions of SAE
and DVAE, possibly using multi-objective optimization.
Fig. 6. The influence of ν (a) and γ (b), and k (c) on the performance of OCSVM and LOF respectively when using the latent representations of DAE, SAE
and DVAE on CTU13-13.
TABLE III
AUCS FROM THE CLASSIFIERS MENTIONED IN TABLE II ON SPECIFIC ATTACK GROUPS OF NSL-KDD AND UNSW-NB15.
Rows: representation method and one-class classifier. Columns: NSL-KDD — Probe, DoS, R2L, U2R;
UNSW-NB15 — Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, Worms.
Stand-alone
LOF 0.752 0.796 0.821 0.703 0.455 0.635 0.597 0.614 0.670 0.984 0.436 0.354 0.614
CEN 0.974 0.957 0.933 0.934 0.576 0.732 0.748 0.723 0.633 0.895 0.555 0.508 0.676
MDIS 0.986 0.949 0.831 0.885 0.596 0.890 0.900 0.843 0.660 0.969 0.636 0.583 0.679
KDE 0.985 0.945 0.820 0.871 0.601 0.883 0.893 0.840 0.658 0.969 0.639 0.591 0.684
OCSVM ν=0.5 0.986 0.957 0.838 0.905 0.652 0.855 0.876 0.845 0.733 0.920 0.658 0.603 0.784
OCSVM ν=0.1 0.958 0.936 0.714 0.789 0.576 0.712 0.733 0.746 0.731 0.961 0.555 0.469 0.853
DAE
LOF 0.620 0.666 0.690 0.509 0.473 0.609 0.560 0.588 0.626 0.985 0.462 0.420 0.561
CEN 0.984 0.926 0.680 0.755 0.551 0.788 0.799 0.744 0.571 0.927 0.626 0.608 0.606
MDIS 0.966 0.912 0.761 0.746 0.565 0.818 0.828 0.770 0.588 0.955 0.644 0.606 0.651
KDE 0.964 0.904 0.666 0.743 0.563 0.799 0.809 0.751 0.571 0.949 0.646 0.614 0.642
OCSVM ν=0.5 0.982 0.917 0.584 0.795 0.580 0.770 0.798 0.732 0.499 0.827 0.671 0.618 0.732
OCSVM ν=0.1 0.734 0.834 0.323 0.308 0.391 0.289 0.305 0.417 0.420 0.694 0.527 0.468 0.722
RE-Based 0.981 0.971 0.911 0.930 0.632 0.992 0.957 0.940 0.888 0.979 0.592 0.476 0.816
VAE
LOF 0.489 0.504 0.511 0.488 0.503 0.487 0.522 0.494 0.505 0.501 0.489 0.500 0.464
CEN 0.488 0.504 0.511 0.489 0.504 0.487 0.522 0.494 0.506 0.502 0.488 0.501 0.468
MDIS 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465
KDE 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465
OCSVM ν=0.5 0.489 0.503 0.512 0.489 0.504 0.487 0.523 0.494 0.504 0.501 0.489 0.499 0.464
OCSVM ν=0.1 0.489 0.504 0.511 0.490 0.504 0.487 0.522 0.494 0.505 0.501 0.489 0.499 0.462
RE-Based 0.985 0.945 0.818 0.871 0.605 0.882 0.893 0.840 0.660 0.968 0.642 0.598 0.686
SAE (λ = 10)
LOF 0.964 0.952 0.877 0.920 0.683 0.993 0.963 0.942 0.884 0.992 0.706 0.645 0.909
CEN 0.985 0.971 0.925 0.953 0.646 0.984 0.961 0.952 0.902 0.989 0.625 0.567 0.910
MDIS 0.988 0.971 0.926 0.950 0.629 0.994 0.961 0.952 0.909 0.988 0.646 0.573 0.909
KDE 0.988 0.971 0.925 0.949 0.623 0.993 0.961 0.952 0.909 0.988 0.642 0.559 0.906
OCSVM ν=0.5 0.987 0.972 0.923 0.948 0.632 0.994 0.965 0.956 0.917 0.988 0.656 0.579 0.907
OCSVM ν=0.1 0.987 0.973 0.912 0.908 0.648 0.994 0.967 0.957 0.921 0.988 0.642 0.554 0.902
RE-Based 0.985 0.946 0.822 0.872 0.601 0.881 0.891 0.838 0.657 0.969 0.640 0.592 0.685
DVAE (λ = 0.05, α = 10⁻⁸)
LOF 0.977 0.974 0.896 0.934 0.635 0.996 0.956 0.949 0.898 0.990 0.537 0.457 0.895
CEN 0.983 0.971 0.915 0.929 0.605 0.995 0.958 0.941 0.882 0.990 0.666 0.603 0.881
MDIS 0.982 0.972 0.915 0.927 0.616 0.994 0.955 0.940 0.866 0.990 0.653 0.572 0.854
KDE 0.982 0.972 0.915 0.927 0.608 0.993 0.956 0.939 0.864 0.990 0.658 0.578 0.852
OCSVM ν=0.5 0.982 0.973 0.914 0.926 0.601 0.993 0.960 0.942 0.869 0.990 0.661 0.584 0.860
OCSVM ν=0.1 0.981 0.972 0.908 0.908 0.599 0.994 0.961 0.942 0.871 0.990 0.659 0.586 0.860
RE-Based 0.985 0.945 0.820 0.872 0.602 0.888 0.898 0.843 0.660 0.971 0.642 0.593 0.682
REFERENCES
[1] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly
detection techniques,” Journal of Network and Computer Applications,
vol. 60, pp. 19–31, 2016.
[2] M. Usama, J. Qadir, A. Raza, H. Arif, K.-L. A. Yau, Y. Elkhatib,
A. Hussain, and A. Al-Fuqaha, “Unsupervised machine learning for
networking: Techniques, applications and research challenges,” arXiv
preprint arXiv:1709.06599, 2017.
[3] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,”
ACM computing surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[4] V. V. Phoha, Internet security dictionary. Springer Science & Business
Media, 2007.
[5] U. Fiore, F. Palmieri, A. Castiglione, and A. De Santis, “Network
anomaly detection with the Restricted Boltzmann Machine,” Neurocom-
puting, vol. 122, pp. 13–23, 2013.
[6] K. Shafi and H. A. Abbass, “Evaluation of an adaptive genetic-based
signature extraction system for network intrusion detection,” Pattern
Analysis and Applications, vol. 16, no. 4, pp. 549–566, 2013.
[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C.
Williamson, “Estimating the support of a high-dimensional distribution,”
Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[8] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying
density-based local outliers,” in ACM SIGMOD record, vol. 29, no. 2.
ACM, 2000, pp. 93–104.
[9] S. S. Khan and M. G. Madden, “One-class classification: taxonomy of
study and review of techniques,” The Knowledge Engineering Review,
vol. 29, no. 3, pp. 345–374, 2014.
[10] A. N. Mahmood, C. Leckie, and P. Udaya, “An efficient clustering
scheme to exploit hierarchical data in network traffic analysis,” TKDE,
vol. 20, no. 6, pp. 752–767, 2008.
[11] A. Zimek, E. Schubert, and H.-P. Kriegel, “A survey on unsupervised
outlier detection in high-dimensional numerical data,” Statistical Analy-
sis and Data Mining: The ASA Data Science Journal, vol. 5, no. 5, pp.
363–387, 2012.
[12] V. L. Cao, M. Nicolau, and J. McDermott, “One-class classification for
anomaly detection with kernel density estimation and genetic program-
ming,” in EuroGP, Portugal, vol. 9594. Springer, 2016, pp. 3–18.
[13] V. L. Cao, M. Nicolau, J. McDermott et al., “A hybrid autoencoder and
density estimation model for anomaly detection,” in Parallel Problem
Solving from Nature. Springer, 2016, pp. 717–726.
[14] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, “High-
dimensional and large-scale anomaly detection using a linear one-class
SVM with deep learning,” Pattern Recognition, vol. 58, pp. 121–134,
2016.
[15] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
length and Helmholtz free energy,” in Advances in neural information
processing systems, 1994, pp. 3–10.
[16] H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptrons
and singular value decomposition,” Biological cybernetics, vol. 59, no. 4,
pp. 291–294, 1988.
[17] N. Japkowicz, C. Myers, and M. Gluck, “A novelty detection approach
to classification,” in IJCAI, 1995, pp. 518–523.
[18] S. Hawkins, H. He, G. Williams, and R. Baxter, “Outlier detection
using replicator neural networks,” in Data warehousing and knowledge
discovery. Springer, 2002, pp. 170–180.
[19] M. Sakurada and T. Yairi, “Anomaly detection using autoencoders with
nonlinear dimensionality reduction,” in Proc MLSDA. ACM, 2014, p. 4.
[20] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
2006.
[21] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for
deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554,
2006.
[22] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-
wise training of deep networks,” in Advances in neural information
processing systems, 2007, pp. 153–160.
[23] D. Rajashekar, A. N. Zincir-Heywood, and M. I. Heywood, “Smart
phone user behaviour characterization based on autoencoders and self
organizing maps,” in ICDMW. IEEE, 2016, pp. 319–326.
[24] M. P. Wand and M. C. Jones, Kernel smoothing. CRC Press, 1994.
[25] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A
comparative study of anomaly detection schemes in network intrusion
detection,” in Proc SIAM International Conference on Data Mining.
SIAM, 2003, pp. 25–36.
[26] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting
and composing robust features with denoising autoencoders,” in Proc
ICML. ACM, 2008, pp. 1096–1103.
[27] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol,
“Stacked denoising autoencoders: Learning useful representations in a
deep network with a local denoising criterion,” JMLR, vol. 11, no. 11,
pp. 3371–3408, 2010.
[28] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
preprint arXiv:1312.6114, 2013.
[29] D. M. Tax and R. P. Duin, “Support vector data description,” Machine
learning, vol. 54, no. 1, pp. 45–66, 2004.
[30] C. Bennett and K. Campbell, “A linear programming approach to novelty
detection,” Advances in neural information processing systems, vol. 13,
no. 13, p. 395, 2001.
[31] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li,
“AI2: Training a big data machine to defend,” in Proc BigDataSecurity,
HPSC, and IDS. IEEE, 2016, pp. 49–54.
[32] S. M. Erfani, M. Baktashmotlagh, S. Rajasegarar, S. Karunasekera, and
C. Leckie, “R1SVM: A randomised nonlinear approach to large-scale
anomaly detection,” in AAAI Conference on Artificial Intelligence, 2015.
[33] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A
review and new perspectives,” PAMI, vol. 35, no. 8, pp. 1798–1828,
2013.
[34] M. Ranzato, Y.-l. Boureau, and Y. L. Cun, “Sparse feature learning for
deep belief networks,” in Advances in neural information processing
systems, 2008, pp. 1185–1192.
[35] M. Ranzato, C. Poultney, S. Chopra, and Y. L. Cun, “Efficient learning
of sparse representations with an energy-based model,” in Advances in
neural information processing systems, 2007, pp. 1137–1144.
[36] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive
auto-encoders: Explicit invariance during feature extraction,” in Proc
ICML, 2011, pp. 833–840.
[37] J. Duchi, “Derivations for linear algebra and optimization,” Berkeley,
California, 2007.
[38] M. Lichman, “UCI machine learning repository,” 2013. [Online].
Available: http://archive.ics.uci.edu/ml
[39] G. O. Campos, A. Zimek, J. Sander, R. J. Campello, B. Micenková,
E. Schubert, I. Assent, and M. E. Houle, “On the evaluation of unsu-
pervised outlier detection: measures, datasets, and an empirical study,”
Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 891–927,
2016.
[40] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, “An empirical compar-
ison of botnet detection methods,” Computers & Security, vol. 45, pp.
100–123, 2014.
[41] D. C. Le, A. N. Zincir-Heywood, and M. I. Heywood, “Data analytics
on network traffic flows for botnet behaviour detection,” in SSCI. IEEE,
2016, pp. 1–7.
[42] “KDD Cup Dataset,” 1999, available at the following website
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[43] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed
analysis of the KDD CUP 99 data set,” in CISDA. IEEE, 2009, pp.
1–6.
[44] J. McHugh, “Testing intrusion detection systems: a critique of the 1998
and 1999 DARPA intrusion detection system evaluations as performed
by Lincoln laboratory,” TISSEC, vol. 3, no. 4, pp. 262–294, 2000.
[45] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for
network intrusion detection systems (UNSW-NB15 network data set),”
in MilCIS. IEEE, 2015, pp. 1–6.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,” JMLR, vol. 12, pp.
2825–2830, 2011.
[47] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv
preprint arXiv:1212.5701, 2012.
[48] L. Prechelt, “Early stopping-but when?” Neural Networks: Tricks of the
trade, pp. 553–553, 1998.
[49] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proc International Conference
on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[50] C. Wang, S. S. Venkatesh, and J. S. Judd, “Optimal stopping and effec-
tive machine complexity in learning,” in Advances in neural information
processing systems, 1994, pp. 303–310.
Van Loi Cao received a BSc and an MSc in Computer Science from Le Quy Don Technical University,
Vietnam. He worked for the university as an assistant lecturer. In 2015, he moved to Ireland to study
for a PhD at University College Dublin under the supervision of Assoc. Prof. James McDermott and
Assoc. Prof. Miguel Nicolau, funded by VIED, Vietnam. His main research interests are neural networks,
machine learning, evolutionary computation, and information security.
Miguel Nicolau is an Associate Professor at UCD. He received a BSc in Belgium, followed by a BSc, MSc
and PhD from the University of Limerick. He then worked as an Expert Engineer at INRIA in Paris,
France. In 2010 he moved back to Ireland and worked as a Research Fellow and Lecturer at UCD. His
teaching experience spans over 15 years and includes positions at the University of Limerick, Fudan
University in Shanghai, and UCD.
James McDermott holds a BSc in Computer Science with Mathematics from the National University of
Ireland, Galway. His PhD was from the University of Limerick. His post-doctoral research was at UCD
and the Massachusetts Institute of Technology. He is now an Associate Professor at University College
Dublin. His main research interests are in evolutionary computation, machine learning, and computer
music.
  • 3. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 3 x x 0 1 (a) z z 0 1 (b) Normal Anomaly z0 z1 (c) Fig. 1. Illustrations of data in the original feature space (a), the latent feature space of AEs (b), and the latent feature space of our models (c). II. MATHEMATICS OF ONE-CLASS CLASSIFICATION ALGORITHMS This section is to briefly describe anomaly detection al- gorithms used in this paper. This includes Centroid, Mean distance, KDE, LOF and OCSVM as well as autoencoders. A. Anomaly detection algorithms Centroid (CEN): This is a parametric method which uses a single Gaussian to model training data. The distance (i.e. radius) from the centroid (the origin) to an observation reflects the degree of abnormality of the observation. A larger value implies a higher probability that the datapoint is an anomaly. By imposing a threshold on the distance, a query datapoint can be classified as either normal or an anomaly. This method has no hyperparameters, and works under the assumption that the training data has a Gaussian distribution. Mean Distance (MDIS): The mean of the Euclidean distance from a datapoint to normal training set can be used as anomaly score. By imposing a threshold on the mean distance, the anomaly score of a given datapoint above the threshold indicates an anomaly. MDIS has no hyperparameters, and is a non-parametric method. Kernel Density Estimation (KDE): KDE is used for estimat- ing the probability density function of a sample in data [24]. KDE can be used for constructing an anomaly detection model as presented in [12]. However, the main drawback of the model is its computational cost at querying stage, especially on large datasets. The performance in terms of classification accuracy of KDE-based classifiers will depend on the choice of the bandwidth h of a kernel function [12]. Local Outlier Factor (LOF): LOF [8] considers the data- points that have a considerably lower local density than their neighbors as anomalies. It estimates a density deviation score, called local outlier factor, of a given datapoint with respect to its neighbors. The larger the LOF score a given datapoint has, the higher the probability the datapoint is anomalous. The algorithm has shown its power on network anomaly detection [25]. In practice however, it has some limitations when dealing with high-dimensional data [2], and the choice of the number of neighbors k is still an open question. One-class Support Vector Machine (OCSVM): OCSVM [7] first maps the normal data into a feature space via a kernel function, and searches for a hyperplane with maximum margin between the region containing most of normal data (normal region) and the origin in the feature space. The idea behind this is to allocate the region encompassing the origin for anomalies to appear. That is to say, the OCSVM decision function returns a positive value in the normal region far from the origin, and a negative value in the anomaly region near the origin. B. Autoencoder An autoencoder [15], [16] is a neural network which con- sists of two parts: encoder and decoder as shown in Fig. 2(a). The encoder is defined as a feature extractor that allows the explicit representation of an input x in a feature space. Let f✓ denote the encoder, and X = x1 , x2 , ...xn be a dataset. The encoder f✓ will map the input xi 2 X into a latent vector zi = f✓(xi ), where zi is the code or latent representation. 
The decoder $g_\theta$ maps the latent representation $z_i$ back into the input space, which forms a reconstruction $\hat{x}_i = g_\theta(z_i)$. The encoder and decoder are commonly represented as single-layer neural networks in the form of non-linear functions of affine mappings as follows:

$$f_\theta(x) = s_f(Wx + b) \quad (1)$$

$$g_\theta(z) = s_g(W'z + b') \quad (2)$$

where $W$ and $W'$ are the weight matrices of the encoder and decoder, and $b$ and $b'$ are the bias vectors of the encoder and decoder. $s_f$ and $s_g$ are the activation functions of the encoder and decoder, such as a logistic sigmoid or hyperbolic tangent non-linear function, or a linear identity function. Autoencoders learn to minimize the loss function in (3) with respect to the parameters $\theta = \{W, W', b, b'\}$, using a learning algorithm such as Stochastic Gradient Descent (SGD) with back-propagation. The reconstruction loss function over training instances can be written as:

$$L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) = \frac{1}{n}\sum_{i=1}^{n} l\bigl(x_i, g_\theta(f_\theta(x_i))\bigr) \quad (3)$$

where $l(x_i, \hat{x}_i)$ is the discrepancy between the input $x_i$ and its reconstruction $\hat{x}_i$. The choice of the reconstruction loss depends largely on the appropriate distributional assumptions on the given data. The mean squared error (MSE)2 is commonly used for real-valued data, whereas a cross-entropy loss3 can be used for binary data. By compressing input data into a lower-dimensional space, the classical autoencoder avoids simply learning the identity, and removes redundant information [17]. Denoising autoencoders (DAEs) [26], [27] are regularized autoencoders that are trained to reconstruct the original input from a corrupted version of the input. This allows DAEs to capture the structure of the input distribution, and again prevents them from learning the identity. The loss function of AEs in (3) is rewritten for DAEs as follows:

$$L_{DAE}(\theta; x) = \sum_{i=1}^{n} \mathbb{E}_{p(\tilde{x}|x_i)}\bigl[\, l\bigl(x_i, g_\theta(f_\theta(\tilde{x}))\bigr) \,\bigr] \quad (4)$$

where $\tilde{x}$ is the corrupted version of $x_i$ drawn from $p(\tilde{x}|x_i)$, and $\mathbb{E}_{p(\tilde{x}|x_i)}$ is the expectation of the reconstruction loss at $x_i$ over a number of samples $\tilde{x}$ drawn from $p(\tilde{x}|x_i)$.

2 $L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2$
3 $L_{AE}(\theta; x) = -\frac{1}{n}\sum_{i=1}^{n}\bigl[ x_i \log(\hat{x}_i) + (1 - x_i)\log(1 - \hat{x}_i) \bigr]$
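As a concrete illustration of (1)-(3), and of reconstruction error used as an anomaly score (the RE-based detectors referred to later in this paper), the following PyTorch sketch trains a deliberately small AE on normal data only and scores query points by their per-point MSE. It is an illustrative sketch rather than the authors' released implementation; the layer sizes, optimizer and the percentile-based threshold are placeholder choices.

```python
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    """Single-hidden-layer encoder/decoder in the form of eqs. (1)-(2), tanh activations."""
    def __init__(self, n_features, n_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_latent), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_features), nn.Tanh())

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def reconstruction_error(model, x):
    """Per-point MSE, used directly as an anomaly score (larger = more anomalous)."""
    with torch.no_grad():
        x_hat, _ = model(x)
        return ((x - x_hat) ** 2).mean(dim=1)

# Train on normal data only, minimising the MSE form of eq. (3).
X_train = torch.rand(512, 40) * 2 - 1      # placeholder for normal traffic scaled to [-1, 1]
model = SimpleAE(n_features=40, n_latent=7)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):
    opt.zero_grad()
    x_hat, _ = model(X_train)
    loss = nn.functional.mse_loss(x_hat, X_train)
    loss.backward()
    opt.step()

scores = reconstruction_error(model, X_train)
threshold = torch.quantile(scores, 0.90)   # e.g. flag the highest-scoring 10% as anomalous
```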
This is because the corruption process is performed stochastically on the original input each time a point $x_i$ is considered. There are many ways to corrupt the input, such as Gaussian noise or salt-and-pepper noise, but randomly masking features of the input to zero is the most commonly used. This loss function can be optimized by SGD, as when optimizing the AE loss function.

C. Variational Autoencoder

The Variational Autoencoder (VAE) [28] is a neural network that consists of two parts: a probabilistic encoder representing the approximate posterior $q_\phi(z|x)$ of the intractable true posterior $p_\theta(z|x)$, and a probabilistic decoder that refers to the generative model $p_\theta(x|z)$, as shown in Fig. 2(b). The objective of the VAE is to optimize the variational lower bound on the marginal likelihood of the data w.r.t. the variational parameters $\phi$ and the generative parameters $\theta$. The intractable marginal likelihood is computed as a sum over the marginal likelihoods of the individual datapoints, $\log p_\theta(x_1, \ldots, x_n) = \sum_{i=1}^{n} \log p_\theta(x_i)$, where $\log p_\theta(x_i)$ can be written as:

$$\log p_\theta(x_i) = D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z|x_i)\bigr) + \mathcal{L}(\theta, \phi; x_i) \quad (5)$$

The term $\mathcal{L}(\theta, \phi; x_i)$ is the lower bound on the marginal likelihood of datapoint $x_i$, since the first term, the Kullback-Leibler divergence (KL-divergence) of the approximate posterior from the true posterior, is non-negative. The lower bound can be written as follows:

$$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z|x)}\bigl[-\log q_\phi(z|x) + \log p_\theta(x, z)\bigr] = -D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) + \mathbb{E}_{q_\phi(z|x_i)}\bigl[\log p_\theta(x_i|z)\bigr] \quad (6)$$

where $p_\theta(x_i|z)$ is the likelihood of $x_i$ given the latent variable $z$, and $p_\theta(z)$ is the prior over latent variables. However, the second term in (6) requires a random latent variable $z$ sampled from the approximate posterior $q_\phi(z|x)$. This is problematic since back-propagation cannot flow through a random node $z$. When $q_\phi(z|x)$ is restricted to certain kinds of parametric distributions, e.g. Gaussian, the random variable $z$ can be reparameterized as a deterministic function $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$. This yields a lower-variance lower bound estimator called SGVB (Stochastic Gradient Variational Bayes):

$$\tilde{\mathcal{L}}(\theta, \phi; x_i) = -D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x_i|z^{i,l}) \quad (7)$$

where $z^{i,l} = g_\phi(\epsilon^{i,l}, x_i)$ and $\epsilon^{l} \sim p(\epsilon)$. In (7), the KL-divergence term forces $q_\phi(z|x)$ to be as close as possible to $p_\theta(z)$ and works as a regularizer, whereas the second term is an expected negative reconstruction error. In order to integrate the KL-divergence in (7) analytically, the true posterior $p_\theta(z|x)$ is assumed to be approximately Gaussian with approximately diagonal covariance. Let the prior be $p_\theta(z) = \mathcal{N}(0, I)$, and the approximate posterior be a multivariate Gaussian with a diagonal covariance structure, $q_\phi(z|x_i) = \mathcal{N}(\mu^i, (\sigma^i)^2)$, where $\mu^i$ and $\sigma^i$ are the mean and s.d. evaluated at datapoint $i$. Let $\mu^i_j$ and $\sigma^i_j$ denote the $j$-th elements of $\mu^i$ and $\sigma^i$ respectively, where $J$ is the dimensionality of $z$.

Fig. 2. The architectures of AEs (a), VAEs (b), and the hybrids of the latent representation models and one-class classifiers (c).
The KL-divergence in (7) is written as follows:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = D_{KL}\bigl(\mathcal{N}(\mu^i, (\sigma^i)^2)\,\|\,\mathcal{N}(0, I)\bigr) = \frac{1}{2}\sum_{j=1}^{J}\Bigl((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log\bigl((\sigma^i_j)^2\bigr)\Bigr) \quad (8)$$

Taking $D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr)$ into (7), we get the objective function of the VAE at datapoint $i$ as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{2}\sum_{j=1}^{J}\Bigl((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log\bigl((\sigma^i_j)^2\bigr)\Bigr) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x_i|z^{i,l}) \quad (9)$$

where $z^{i,l} = \mu^i + \sigma^i \odot \epsilon^l$ and $\epsilon^l \sim \mathcal{N}(0, I)$. $L$ is the number of samples per datapoint; in practice, it can be set to 1 as in [28]. When optimizing (maximizing) the objective function in (9) by Stochastic Gradient Ascent, VAEs learn the recognition model parameters $\phi$ jointly with the generative model parameters $\theta$. Given a datapoint $x_i$, the probabilistic encoder outputs the parameters of the approximate posterior at this datapoint, $\mu^i$ and $\sigma^i$. An actual value $z^{i,l} \sim q_\phi(z|x_i)$, obtained through $z^{i,l} = \mu^i + \sigma^i \odot \epsilon^l$, is the input to the probabilistic decoder. The output of the decoder is the reconstruction $\hat{x}_i$. The distribution of the encoder output is Gaussian, whereas that of the decoder depends on the type of data (Gaussian for real-valued data or Bernoulli for binary data).

III. RELATED WORK

In this section, we discuss recent trends and some state-of-the-art anomaly detection algorithms. This includes Support Vector Machines [7], [29], [30], and autoencoder-based methods [5], [14], [17], [18], [19], [31]. Schölkopf et al. [7] and Campbell et al. [30] presented hyperplane-based one-class SVM approaches, as already discussed. In [7], the aim is to map the input data into the feature space via a kernel function, and then find a hyperplane with a maximum margin between the region containing normal data and the origin in the feature space. The half space
  • 5. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 5 containing the origin is identified as the anomalous region. The trade-off between the two objectives, maximizing the margin and minimizing the number of target vectors falling into the anomalous region, is controlled by the outlier fraction ⌫ 2 (0, 1). The larger the value of ⌫, the more normal vectors are rejected as outliers and the more normal vectors become support vectors. When ⌫ approaches 1 almost all normal vectors become support vectors. The method was evaluated on the US postal service database of handwritten digits, and the results show that the classifier performed well. However, how to choose values for the hyperparameter ⌫ and kernel parameters such as gamma (related to bandwidth h in KDE) is still an open question. Instead of allocating the origin region for anomalies, Campbell et al. [30] proposed a model that learns to capture the region containing normal instances in feature space. They attempted to find a hyperplane with respect to the center of the distribution of normal data, and anomalies were assumed to appear in the other side. Linear programming techniques are employed instead of the quadratic programming in Schölkopf’s approach, that can make their model learn large datasets rapidly. Tax and Duin [29] proposed a method called Support Vector Data Description for anomaly detection. In this approach, normal data is again first mapped into a feature space corre- sponding to a kernel function. It then finds a hypersphere with minimum radius which encompasses almost all normal vectors in the feature space. Any query datapoints lying inside the hypersphere are considered as normal and others as anomalies. In order to achieve good classification accuracy, it is desir- able to reduce the volume of the hypersphere by rejecting some fraction of training data (the outlier fraction known as parameter C) when training this model. This illustrates a theme present in all one-class classification research, the trade-off between false positive and false negative rates. They introduced different kernel functions to SVDD that make the method more flexible, and the Gaussian kernel was found to be the most suitable for many datasets. When using the Gaussian kernel, the method is comparable to OCSVM [7]. However, the technique requires a large number of normal examples, and extra outlier objects for training in order to improve the classification accuracy [29]. Both SVDD and OCSVM have demonstrated their effectiveness on anomaly detection, but their limitations are the ability to model large-scale and high- dimensional data due to their time and space complexity [32]. The approach of using stand-alone AEs to build anomaly de- tection systems was proposed in [5], [18], [19], in which AEs act as either anomaly detection methods or feature reduction techniques. Hawkins et al. [18] trained an AE (also known as a replicator neural network) with three narrow hidden layers on normal data. Its RE was used as an “outlier score”: an outlier score above a predetermined threshold indicated an anomaly. A step-wise activation function was used for the neurons in the middle hidden layer, which mapped input data into a number of possible clusters. Each of these clusters was associated with an active state of these neurons. These neurons were active with specific steps on a particular class of data (normal or anomaly). Thus, the labels of these clusters can be used as an alternative approach for indicating anomalies. 
The model was evaluated on the Wisconsin Breast Cancer (WBC) and the KDD’99 datasets, and both of these models (RE-based and cluster-based) produced high accuracy. Furthermore, Fiore et al. [5] constructed an AE using Discriminative Restricted Boltzmann Machines to test the hypothesis that there is a deep similarity among normal behaviors. They expected that their model can describe all the characteristics of normal traffic when comparing it against unseen anomalous traffic. Their experiments involving real-world network traces and the KDD’99 datasets confirmed that its performance suffered when testing in a network greatly different from that where training data was collected. In contrast, Sakurada et al. [19] employed an AE as a nonlinear feature reduction technique for anomaly detection. They attempted to clarify the properties of AEs by comparing a classical AE and a DAE to linear PCA and Kernel PCA. These techniques were evaluated on an artificial dataset and on spacecraft telemetry data. They concluded that DAEs not only outperform linear PCA and Kernel PCA in terms of accuracy, but also can avoid the high computation costs of kernel PCA. Hybrid approaches or extensions of AEs have been recently proposed for anomaly detection [14], [31]. Veeramachaneni et al. [31] proposed an ensemble learner to combine three single one-class classifiers: AE-based, density-based, and ma- trix decomposition-based techniques. They also used a human expert to provide ongoing correct labels from which the algorithms can learn. The models were tested on a large network log file dataset, and achieved promising results. Erfani et al. [14] introduced a hybrid of a Deep Belief Network (DBN) and OCCs, such as OCSVM and SVDD, for solving the problem of high-dimensional anomaly detection. The DBN was pre-trained in the greedy layer-wise fashion, that is unsu- pervised training of each Restricted Boltzmann Machine one- by-one. OCSVM [7] and SVDD [29] were then built on top of the pre-trained DBN. This structure takes advantages of high decision classification accuracy from these OCCs and nonlin- ear feature reduction from DBNs. The model was evaluated on eight high-dimensional UCI datasets. The results showed that the performance of the hybrid models was comparable to AEs and better than stand-alone OCSVM and SVDD, and the training and testing times improved significantly. IV. PROPOSED MODEL We aim to find a new data representation that facilitates simple anomaly detection algorithms. This section clarifies how to construct the data representation by introducing new regularizers to an AE and a VAE. The new regularizers together with reconstruction loss will help these AEs to give a robust representation of normal behavior. The regularizers will encourage the encoders of these AEs to condense normal data as close together as possible at a particular region in the latent feature space, while reconstruction loss promotes these AEs to keep normal points from overlapping each other. In order to separate the normal region from anomalies, normal points will be “pushed” towards the origin at the non-saturating area of the bottleneck unit outputs by the regularizers. That is, each coordinate (given by the output of the bottleneck unit
activation) of an encoded point will tend to be pushed closer to the non-saturating value (zero) of the activation function. Thus, an AE trained on normal data can keep normal datapoints close to the origin, whereas anomalous datapoints, if they differ from normal datapoints, will therefore tend to differ greatly, and appear in other regions. A number of one-class classifiers are employed for evaluating the proposed models. Fig. 2(c) illustrates the hybrid of the data representation models and one-class classifiers. More details are shown in Subsections IV-A and IV-B.

Our models are very different from other common regularized AEs, including Sparse AEs and Contractive AEs. Sparse AEs attempt to construct a sparse representation in an overcomplete setting, in which only a few of the hidden unit activations can vary at a time while the others are set to a saturating value [33]. Thus, the latent data is penalized towards the saturating value at zero [34], or the hidden bias vectors are controlled [35]. Contractive AEs seek a latent representation that is as insensitive as possible with respect to variations in the input data [36]. Thus, the outputs of the hidden units are constrained to be close to their marginal values (e.g. 0 or 1 for the sigmoid function).

A. Shrink Autoencoder

A new regularizer is added to the loss function of an AE, which encourages the AE to construct a representation of normal data that will be easy for one-class classification algorithms. The regularizer is designed to penalize normal datapoints whose vectors in the latent space are of large magnitude; that is, it restricts the normal data to lie close to the origin. Hence, this is called a shrink regularizer, and the AE is named Shrink AE (SAE). The loss function in (3) can be redefined for this situation as follows:

$$L_{SAE}(\theta; x, z) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) + \lambda\,\frac{1}{n}\sum_{i=1}^{n} \| z_i \|^2 \quad (10)$$

where $\hat{x}_i$ and $z_i$ are the reconstruction and the latent vector of the observation $x_i$ respectively. The first term is the reconstruction error, $\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2$, and the second term is the shrink regularizer. The parameter $\lambda$ controls the trade-off between the two terms in the loss function.

B. Dirac delta Variational Autoencoder

VAEs attempt to encode data so that it is distributed as a standard Gaussian in the latent space. Thus, normal data will reside in a large area centered at the origin. Our strategy is to compress normal data into a smaller area near the origin. Therefore, we redesign the KL-divergence in (8) by forcing the approximate posterior $q_\phi(z|x)$ to be as close as possible to a new prior $p_\theta(z)$ with very small standard deviation. Let us recall the KL-divergence between two multivariate Gaussian distributions in $\mathbb{R}^n$, $P_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $P_2 = \mathcal{N}(\mu_2, \Sigma_2)$, defined in [37] as:

$$D_{KL}(P_1 \,\|\, P_2) = \frac{1}{2}\Bigl[\operatorname{tr}\bigl(\Sigma_2^{-1}\Sigma_1\bigr) + (\mu_2 - \mu_1)^{T}\Sigma_2^{-1}(\mu_2 - \mu_1) - n + \log\Bigl(\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\Bigr)\Bigr] \quad (11)$$

Let $\mu^i$ and $\Sigma^i$ denote the variational mean and the covariance matrix evaluated at datapoint $i$, $q_\phi(z|x_i) = \mathcal{N}(\mu^i, \Sigma^i)$, and let $J$ be the dimensionality of $z$. Consider a constant $\alpha$ ($\alpha \ll 1.0$) to be the variance of the prior probability, $p_\theta(z) = \mathcal{N}(0, \alpha I)$, where $I$ is the identity matrix.
Applying these to (11), the KL-divergence between $q_\phi(z|x_i)$ and $p_\theta(z)$ can be written as follows:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = \frac{1}{2}\Bigl[\operatorname{tr}\bigl((\alpha I)^{-1}\Sigma^i\bigr) + (\mu^i)^{T}(\alpha I)^{-1}(\mu^i) - J + \log\Bigl(\frac{\det(\alpha I)}{\det(\Sigma^i)}\Bigr)\Bigr] \quad (12)$$

Substituting $(\alpha I)^{-1} = \alpha^{-1} I$ and $\det(\alpha I) = \alpha^{J}$ into (12), we get:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = \frac{1}{2}\Bigl[\alpha^{-1}\operatorname{tr}(\Sigma^i) + \alpha^{-1}(\mu^i)^{T}(\mu^i) - J + \log\Bigl(\frac{\alpha^{J}}{\det(\Sigma^i)}\Bigr)\Bigr] = \frac{1}{2\alpha}\Bigl[\operatorname{tr}(\Sigma^i) + (\mu^i)^{T}(\mu^i) - \alpha J + \alpha J\log\alpha - \alpha\log\bigl(\det(\Sigma^i)\bigr)\Bigr] \quad (13)$$

Because $\Sigma^i$ is a diagonal matrix of size $J \times J$, it can be represented as the vector of its $J$ diagonal elements. Let $\mu^i_j$ and $(\sigma^i_j)^2$ denote the $j$-th elements of $\mu^i$ and $\Sigma^i$ respectively. Expanding $\operatorname{tr}(\Sigma^i)$ and $\det(\Sigma^i)$, we get:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = \frac{1}{2\alpha}\Bigl[\sum_{j=1}^{J}(\sigma^i_j)^2 + \sum_{j=1}^{J}(\mu^i_j)^2 - \alpha\sum_{j=1}^{J} 1 + \alpha\sum_{j=1}^{J}\log\alpha - \alpha\log\Bigl(\prod_{j=1}^{J}(\sigma^i_j)^2\Bigr)\Bigr] = \frac{1}{2\alpha}\sum_{j=1}^{J}\Bigl[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\bigl((\sigma^i_j)^2\bigr)\Bigr] \quad (14)$$

Now we apply the KL-divergence in (14) to (7). The log-likelihood term in (7) is replaced by the negative MSE between $x_i$ and its reconstruction $\hat{x}_i$, since we apply our models only to real-valued datasets. The objective function in (7) can be rewritten as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 - \frac{1}{2\alpha}\sum_{j=1}^{J}\Bigl[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\bigl((\sigma^i_j)^2\bigr)\Bigr] \quad (15)$$

The prior can be seen as a Dirac delta distribution because $\alpha$ is very small. Thus, this VAE is named Dirac delta Variational Autoencoder (DVAE). Maximizing (15) is equivalent to minimizing the sum of its RE and KL-divergence components. We introduce a parameter $\lambda$ to control the trade-off between the two components in (15). The objective function can be rewritten in the form of the loss function of the DVAE as follows:

$$L_{DVAE}(\theta, \phi; x_i) = \frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 + \lambda\,\frac{1}{2\alpha}\sum_{j=1}^{J}\Bigl[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\bigl((\sigma^i_j)^2\bigr)\Bigr] \quad (16)$$
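The two loss functions above map directly onto code. The following PyTorch-style sketch shows the SAE loss of (10) and the DVAE loss of (16), together with the reparameterization step $z = \mu + \sigma \epsilon$ used by the probabilistic encoder. It is a minimal illustration under the paper's notation rather than the authors' implementation; the default values of $\lambda$ and $\alpha$ simply anticipate the settings reported in Section V ($\lambda_{SAE}=10$, $\lambda_{DVAE}=0.05$, $\alpha=10^{-8}$), and the batch averaging is one reasonable reading of the per-datapoint notation.

```python
import math
import torch

def sae_loss(x, x_hat, z, lam=10.0):
    """Shrink AE loss, eq. (10): mean squared reconstruction error + lam * mean ||z||^2."""
    re = ((x - x_hat) ** 2).sum(dim=1).mean()       # (1/n) sum ||x_i - x_hat_i||^2
    shrink = (z ** 2).sum(dim=1).mean()             # (1/n) sum ||z_i||^2
    return re + lam * shrink

def dvae_loss(x, x_hat, mu, log_var, lam=0.05, alpha=1e-8):
    """Dirac delta VAE loss, eq. (16): MSE + lam * KL(q(z|x) || N(0, alpha * I))."""
    re = ((x - x_hat) ** 2).sum(dim=1).mean()
    var = log_var.exp()                             # (sigma_j)^2 per latent dimension
    # Per-dimension bracket of eq. (14), summed over the J latent dimensions.
    kl = (var + mu ** 2 - alpha + alpha * math.log(alpha) - alpha * log_var).sum(dim=1)
    kl = kl.mean() / (2.0 * alpha)                  # average over the batch, scale by 1/(2 alpha)
    return re + lam * kl

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the sampling step of the (D)VAE encoder."""
    eps = torch.randn_like(mu)
    return mu + (0.5 * log_var).exp() * eps
```

In training, `mu` and `log_var` would come from the two output heads of the probabilistic encoder, and the decoder reconstructs `x_hat` from the reparameterized `z`.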
  • 7. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 7 V. EVALUATION AND DISCUSSION This section is to evaluate the SAE and DVAE algorithms on constructing the data representation for improving the perfor- mance of anomaly detection algorithms. This is demonstrated by the experimental results produced from five simple one- class classification (OCC) algorithms LOF, CEN, KDE, MDIS, OCSVM using the latent representations of SAE and DVAE on fourteen problems. In order to highlight the strengths of SAE and DVAE, the results are also compared to those from: (1) the stand-alone OCCs (without any AE latent representation), (2) the OCCs using the latent representations of a denoising AE (DAE) and a VAE, and (3) the RE-based OCC. For measuring the accuracy of the models, we evaluate the area under the resulting ROC curve (AUC) by trying many different thresholds, and create a confusion matrix by choosing only one threshold. A number of experiments and analysis for exploring different aspects of the latent representations of SAE and DVAE are carried out as follows: • Evaluate the effect of dimensionality and sparsity on the classification accuracy of the OCCs using the latent representations given by SAE and DVAE. • Explore the effect on classification accuracy of OCSVM and LOF of their parameters ⌫, , and k. Investigate the distribution of latent vectors on normal and anomaly data. • Measure the effect of training size on the AUCs and query time created by SAE-OCCs and DVAE-OCCs. • Evaluate the AUCs from the OCCs on specific categories of attack types in NSL-KDD and UNSW-NB15. A. Experiments 1) Datasets: The experiments are conducted on fourteen datasets including network problems as shown in Table I. The eight network datasets are mostly well-known problems in the domain of network security. Although the main objective is to cope with the challenges arising in high-dimensional net- work data, the models are also evaluated on six non-network datasets from the UCI Machine Learning Repository [38]. This is because we intend to evaluate the performance of our models on a diversity of data, and expect to emphasize their strength on high-dimensional network-related datasets. The normal traffic in CTU13, UNSW-NB15 and NSL-KDD is considered as normal data, whereas all the attacks are treated as anomalies. In PenDigits, the digits ‘0’ and ‘2’ are chosen as the normal and anomalous classes respectively. For GLASS, window glass is considered as the normal class, and other classes as the anomalous class. In the other datasets, the normal and anomalous classes are indicated following [39]. The CTU13 is a publicly available botnet dataset provided in 2011 [40]. The data covers a wide range of real-world botnet traffic mixed with normal traffic and background traf- fic (unlabeled data). The CTU13 consists of thirteen botnet scenarios, and each of them involves a specific type of malware. We choose four scenarios in CTU13, and split each of them into 40% for training (normal traffic) and 60% for evaluating (normal and botnet traffic) following [41]. We use most of the 14 features in CTU13 except source/destination IP addresses. Three categorical features, protocol, sTos and dTos, are encoded by one-hot-encoding, which results in higher dimensional versions of these scenarios. 
TABLE I FOURTEEN DATASETS FOR EVALUATING THE PROPOSED MODELS Dataset Dimension4 Training set Normal Test Anomaly Test PageBlocks 10 3930 983 112 WPBC 32 118 30 10 PenDigits 16 780 363 364 GLASS 9 130 33 11 Shuttle 9 3410 11478 3022 Arrhythmia 259 189 48 37 Rbot (CTU13-10) 38 6338 9509 63812 Murlo (CTU13-8) 40 29128 43694 3677 Neris (CTU13-9) 41 11986 17981 110993 Virut (CTU13-13) 40 12775 19164 24002 Spambase 57 2230 558 363 UNSW-NB155 196 56000 37000 45332 NSL-KDD5 122 67343 9711 12833 InternetAds 1558 1582 396 77 NSL-KDD is a filtered version of the KDD’99 dataset [42], which was suggested to address the inherent issues mentioned in [43]. Although NSL-KDD still suffers from some problems discussed in [44], it can be reasonable to use the data as an effective benchmark for comparing anomaly detection algorithms in this work due to the shortage of public intrusion data. Each 41-feature record in NSL-KDD is labeled as either normal or a specific attack group in the four main categories: Denial of Service (DoS), Remote to Local (R2L), User to Local (U2R) and Probe. NSL-KDD consists of two parts: KDDTrain+ and KDDTest+ which are drawn from differ- ent distributions (additional 14 types of attacks in KDDTest+ only). Three categorical features, protocol type, service and flag, are preprocessed by one-hot-encoding which increases the number of features to 122. UNSW-NB15 has been recently provided and is expected to address the inherent issues in the KDD’99 dataset and NSL- KDD [45]. Each record comprising 47 features is labeled either as realistic normal traffic or one of the nine modern attack categories: Fuzzers, Analysis, Backdoor, DoS, Exploit, Generic, Reconnaissance, Shellcode and Worm. The dataset is decomposed into two sets, UNSW NB15 training-set and UNSW NB15 testing-set, for training and evaluating. The categorical attributes, such as protocol, service and state, are preprocessed by one-hot-encoding which increases the number of features to 196. The labelled anomalies in the training parts of NSL-KDD and UNSW-NB15 are discarded. PenDigits and Shuttle are already partitioned into training and testing parts, thus we simply delete labelled anomalies in the training parts to form training sets. For Spambase, InternetAds, PageBlocks, WPBC, GLASS and Arrhythmia, we take 80% of normal data for training and 20% of normal and anomalies for testing. All datasets are normalized into [-1, 1] since the activation function of the output layer of these AEs is the tanh function, and missing values are discarded. 4The dimensions of the four CTU13 datasets, UNSW-NB15 and NSL-KDD are preprocessing by on-hot-encoding. 5The training sets of UNSW-NB15 and NSL-KDD are much larger than other datasets, thus we will sample a small proportion (10%) for training.
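The preprocessing described for these datasets (one-hot encoding of categorical attributes, scaling to [-1, 1] for the tanh output layer, and training on normal records only) can be sketched as follows. The column names are placeholders, since the exact field names differ between CTU13, NSL-KDD and UNSW-NB15; this is an assumption-laden illustration, not the authors' preprocessing script.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Placeholder names; e.g. protocol/service/flag in NSL-KDD, protocol/sTos/dTos in CTU13.
CATEGORICAL = ["protocol", "service", "flag"]
LABEL = "label"                      # assumed: 0 = normal, 1 = attack/anomaly

def preprocess(train_df, test_df):
    # One-hot encode categorical attributes on the concatenated frame so that the
    # train and test splits end up with identical (higher-dimensional) feature columns.
    full = pd.concat([train_df, test_df], keys=["train", "test"])
    full = pd.get_dummies(full, columns=CATEGORICAL)
    train, test = full.loc["train"], full.loc["test"]

    # Keep only normal records for training; labelled anomalies are used at test time only.
    train = train[train[LABEL] == 0]
    y_test = test[LABEL].values
    X_train = train.drop(columns=[LABEL]).values
    X_test = test.drop(columns=[LABEL]).values

    # Scale to [-1, 1] because the output activation of the AEs is the tanh function.
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_test
```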
  • 8. 8 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX 2) Parameter Settings: Anomalies are not available during training, so cross-validation can not be used to tune hyperpa- rameters. This is one of the major difficulties for this task. We configure the hyperparameters of AEs and OCCs using common values and rules of thumb, and then confirm that performance is not sensitive to these values. OCC Parameters: The Gaussian kernel is used for KDE and OCSVM. The scaling parameter related to the bandwidth h by = 1 2h2 is set by a default value, = 1 nf as in [46], where nf is the number of input features. The trade-off parameter ⌫ is set to two separate values6 , 0.1 and 0.5, which refers to OCSVM⌫=0.1 and OCSVM⌫=0.5. In LOF, the number of nearest neighbors k is chosen as 10% of the training size. AE Parameters: The architectures of SAE and DVAE are configured as follows: the number of hidden layers is equal to 5 as in [14], the size of the bottleneck layer m is chosen by the rule of thumb presented in [13], m = [1 + p n], where n is the number of input features. The choice of mini-batch size is dependent on the size of training sets. This is needed because the sizes of the datasets vary by a factor of 500. For small training sets (< 2000), we split into 20 batches. For large, we set mini-batch size to 100. We also want to provide a similar number of batches for each iteration in training processes which will help early-stopping work efficiently. In order to eliminate learning rate and the number of training iterations, we employ the Adadelta algorithm [47] together with early- stopping techniques [48] for training these networks, which enables the training processes to operate automatically and avoid over-fitting. The hyperbolic tangent function is chosen as the activation function for these AEs. Weights are initialized following the scheme in [49]. In practice, the KL-divergence in the DVAE loss function is scaled by log10 since its value is extremely large in early epochs. The distribution of latent data before training seems to be very similar to the standard Gaussian distribution. The prior p✓(z) is a Dirac delta distribution, thus the KL-divergence is very large, especially at early iterations of the training process. Fig. 3 (also Fig. 5 in the supplementary material) illustrates the distribution of latent data (the first feature z0) during the training process. Therefore, the log10 scaling is expected to reduce the domination of this term on the loss function. Fig. 3. Histogram of latent data (the first feature z0) during the training of DVAE (↵ = 10 8) on Spambase. SAE and DVAE are trained to minimize the loss functions in (10) and (16) by an adaptive SGD algorithm (Adadelta) as in the training of MLPs. We do not apply a pretraining procedure for these networks since modern back-propagation methods (weight initialization [49] and Adadelta [47]), together with 6This is expected to show the influence of ⌫ on the performance of OCSVM. the new regularization terms, are expected to encourage the networks to learn the parameters in hidden layers effectively. Early stopping is controlled by two parameters. Training will terminate when the loss does not improve by an absolute value of 10 3 for t iterations. t is calculated as 2000 / number of batches (where number of batches is already defined in this section). Note that only normal data is employed for the training process. We use the same model selection for setting up a five hidden layer DAE and a five hidden layer VAE7 . 
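Following the rules of thumb above ($\gamma = 1/n_f$ for the Gaussian kernel, $\nu \in \{0.1, 0.5\}$, $k$ equal to 10% of the training size, and the bandwidth $h$ recovered from $\gamma = 1/(2h^2)$), the one-class classifiers can be instantiated roughly as in the sketch below using scikit-learn; CEN and MDIS have no hyperparameters and reduce to simple distance computations. This is an illustrative configuration under those rules, not the authors' exact code, and it assumes a recent scikit-learn (novelty=True for LOF).

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor, KernelDensity
from sklearn.metrics import pairwise_distances

def build_occs(X_train):
    n, nf = X_train.shape
    gamma = 1.0 / nf                      # default scaling parameter, gamma = 1 / n_f
    h = np.sqrt(1.0 / (2.0 * gamma))      # bandwidth from the relation gamma = 1 / (2 h^2)
    k = max(1, int(0.1 * n))              # LOF neighbours: 10% of the training size
    return {
        "OCSVM_nu0.1": OneClassSVM(kernel="rbf", gamma=gamma, nu=0.1).fit(X_train),
        "OCSVM_nu0.5": OneClassSVM(kernel="rbf", gamma=gamma, nu=0.5).fit(X_train),
        "LOF": LocalOutlierFactor(n_neighbors=k, novelty=True).fit(X_train),
        "KDE": KernelDensity(kernel="gaussian", bandwidth=h).fit(X_train),
    }

def cen_score(X_train, X_test):
    """CEN: distance to the centroid of the training data (larger = more anomalous)."""
    centroid = X_train.mean(axis=0)       # approximately the origin in the SAE/DVAE latent space
    return np.linalg.norm(X_test - centroid, axis=1)

def mdis_score(X_train, X_test):
    """MDIS: mean Euclidean distance from a query point to the normal training set."""
    return pairwise_distances(X_test, X_train).mean(axis=1)
```

Anomaly scores can then be taken as, for example, the negated `decision_function` of OCSVM and LOF or the negated `score_samples` (log-density) of KDE, so that larger values indicate anomalies across all classifiers.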
However, the DAE is trained in greedy layer-wise fashion following the original scheme proposed in [20], [21]. In the pretraining procedure, each single denoising autoencoder is trained to minimize MSE between the reconstruction formed from a corrupted version8 of the input, and the original input. This is optimized by SGD with a common value for learning rate, 10 2 , and 200 iterations9 to initialize weights and biases for the DAE. The DAE and VAE are then fine-tuned (end-to-end) as in the training of SAE and DVAE. Estimating : This is carried out for estimating the param- eter in the loss functions of SAE (10) and DVAE (16). The regularizers (shrink in SAE and KL-divergence in DVAE), force normal datapoints as close together as possible at the origin, whereas the reconstruction loss attempts to keep them from overlapping in order to reconstruct them at the output layer. The two components tend to conflict with each other. Thus, an appropriate value of should be chosen to bal- ance the two components. However, anomalous data is not available for tuning or determining the number of training iterations in order to avoid overfitting. According to [50], there are three phases in the training process of a feed-forward network. The generalization error includes two components called approximation error and complexity error. In the first phase, the approximation error dominates the complexity error, and the generalization error decreases gradually. In phase 2, these components are approximately balanced, and the gener- alization error continues to decrease further. The complexity error is increasingly large after phase 2, and dominates the approximation error due to large network weights, which can lead to oscillation and high generalization errors (phase 3). Thus, the training process should be stopped in phase 2. Therefore, we investigate these loss functions and their two components on five values, SAE 2 {0.1, 1, 5, 10, 50} and DVAE 2 {0.001, 0.01, 0.05, 0.1, 0.5} on four datasets over 1000 epochs. Firstly, we observe three phases on the SAE training error curves. The larger the value of , the longer phase 2 will last, which makes it easy to choose early stopping parameters. When is large (about 10) phase 2 is longer, but = 50 makes the training error less stable on phase 2. = 10 seems to be a good value which allows us to choose common values for early stopping parameters. When we apply early stopping with SAE = 10, we see that the stopping point is 7The equation (9) is rewritten in a form of the VAE loss function since the VAE is trained under the same training scheme in DVAE: LVAE(✓, ; xi) = 1 n Pn 1 k xi x̂i k 2 + 1 2 PJ j=1[( i j)2 + (µi j)2 1 log(( i j)2)]. 8It is obtained by randomly setting 10% of the input features to zero. 9There is no need for using early-stopping here since this is aimed to initialize weights and biases to be close to a good solution.
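The training procedure used for these networks (Adadelta, so no manual learning rate, plus early stopping once the loss fails to improve by $10^{-3}$ for $t = 2000 / \text{number of batches}$ iterations) can be expressed roughly as the loop below. This is a PyTorch-style sketch under the stated rules, interpreting the patience $t$ in epochs; it is not the released implementation.

```python
import torch

def train_with_early_stopping(model, loss_fn, loader, max_epochs=1000,
                              min_delta=1e-3, budget=2000):
    """Adadelta plus the early-stopping rule: patience = budget / number of batches."""
    opt = torch.optim.Adadelta(model.parameters())
    patience = max(1, budget // len(loader))
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for x_batch in loader:                 # loader yields mini-batches of normal data only
            opt.zero_grad()
            loss = loss_fn(model, x_batch)     # e.g. the SAE or DVAE loss defined earlier
            loss.backward()
            opt.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if best - epoch_loss > min_delta:      # improved by at least 1e-3
            best, wait = epoch_loss, 0
        else:
            wait += 1
            if wait >= patience:               # no sufficient improvement for `patience` epochs
                break
    return model
```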
  • 9. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 9 mostly in phase 2. We also observe AUC curves, and the early stopping appears to perform well. Even AUCs are very good at the first few epochs on some datasets, but we are not using AUCs to choose . Similarly, we choose DVAE = 0.05. For brevity we present only the curves of SAE on CTU13-10 with SAE = 10 in Fig. 4, and on the four datasets in Figs. 1–4 in the supplementary material. Fig. 4. SAE loss function and its components (RE and Shrink losses) (w.r.t the left y-axis), and the AUCs created by SAE-LOF, SAE-CEN and SAE-OCSVM (w.r.t the right y-axis) during the training process of SAE on CTU13-10. 3) Main experiments: The bottleneck layers of the trained DAE, VAE, SAE and DVAE are used as latent representa- tions for six one-class classifiers LOF, CEN, MDIS, KDE, OCSVM⌫=0.1 and OCSVM⌫=0.5. We use the terms DAE- OCCs, VAE-OCCs, SAE-OCCs, and DVAE-OCCs to refer to the six one-class classifiers when using the latent representa- tions of DAE, VAE, SAE and DVAE respectively. The REs of these AEs are also used as anomaly score that produces four further RE-based classifiers. The performance of these stand- alone one-class classifiers on original data are considered as baselines. All experiments are implemented in Python 2.7 and run on a machine with an Intel Core 2 Duo i5-3360M CPU at 2.8 GHz, 8 GB RAM and RAM frequency of 1600 MHz, and the implementation of our algorithms is available on GitHub (https://github.com/vanloicao/SAEDVAE). The OCCs provided by scikit-learn are employed [46]. The main results are shown in Table II. B. Analysis and discussion Discussion: Table II presents the AUCs achieved by DAE- OCCs, VAE-OCCs, SAE-OCCs and DVAE-OCCs, and their corresponding RE-based classifiers from the 2nd to the 5th rows respectively. The results created by the six stand-alone one-class classifiers are shown in the first row. Each column represents the AUCs created by a number of classifiers on the same problem. We use gray-scale to present the performance of these classifiers on each dataset. In each column, the highest AUC is highlighted by the lightest gray. The fourteen datasets are arranged in ascending sparsity order. Table II shows that when working on the latent repre- sentations produced by SAE and DVAE, the six one-class classifiers perform better in terms of classification accuracy than those using DAE, VAE or stand-alone OCCs on the eight network-related datasets. These datasets are typically very high-dimensional and sparse, such as InternetAds with 1558 features. This suggests that the latent representations produced by SAE and DVAE facilitate these one-class classifiers in deal- ing with high-dimensional and sparse network-related datasets. However, VAE-OCCs produces relatively poor performance. This can be explained as follows: the VAE regularizer has less influence on learning the representation since the latent data is already in a good shape before training (see Fig.3). Thus, most of the representation power of the VAE may be used for reconstruction. Moreover, normal data resides in a large region that may give more “room” for anomalies to appear inside the region. The normal data is also not forced on the non-saturated part of the activation function. The hybrid SAE-OCCs and DVAE-OCCs also yield very similar AUCs on each network-related dataset, even though these one-class classifiers originate from different algorithms, and their parameters (e.g. ⌫) are set to different values. 
This is clear to see in the 4th and 5th rows where sparsity > 0.50. This implies that SAE and DVAE may constrain normal data in their latent representations in a well-shaped distribution that is straightforward for these classification algorithms to capture normal behaviors, and less sensitive to parameter settings. Moreover, SAE-OCCs and DVAE-OCCs produce comparable or superior AUCs in comparison to the RE-based DAE classi- fier on the network-related datasets, especially for high sparsity and dimensionality. The influence of OCC parameters and the distribution of latent vectors are explored later. The influence of dimensionality and sparsity: We next inves- tigate the influence of sparsity and dimensionality of data on the classification accuracy produced from hybrid DAE-OCCs, SAE-OCCs and DVAE-OCCs. We use the term AUC-DIFF to refer to the difference in AUC between a classifier (e.g. LOF) on the original data and on the data encoded by an AE. A positive value of AUC-DIFF indicates an improvement due to the AE encoding. AUC-DIFF is plotted against sparsity and dimensionality of datasets shown in Fig. 5(a) and Fig. 5(b). It can be seen from Fig. 5(a) that there is a clear increasing trend in the AUC-DIFF lines of SAE-OCCs and DVAE-OCCs, while the AUC-DIFF graph of DAE-OCCs tends to decrease. Similar patterns can also be found when investigating the influence of dimensionality, shown in Fig. 5(b). The ranking of datasets by sparsity is similar to the ranking by dimensionality, therefore these two pieces of evidence are partly overlapping. The conclusion is that the benefit of the new AE encodings is greater for sparse, high-dimension datasets, whereas the benefit of the existing DAE encoding is greater for small, non- sparse datasets. The influence of OCC parameters: This is to assess the influence of OCC parameters, ⌫, and k, on the perfor- mance in terms of classification accuracy of OCSVM and LOF when using the latent representations of DAE, SAE and DVAE. The parameter is fixed being equal to 1 nf for investigating ⌫, whereas ⌫ is set to 0.1 when examining . Each of these parameters is examined on fifty different values, ⌫ 2 [0.01, 0.5] and 2 [2⇥10 4 , 2⇥104 ]. We plot AUCs from DAE-OCSVM, SAE-OCSVM and DVAE-OCSVM against ⌫ in Fig. 6(a), and against in Fig. 6(b). The figures show that the AUC curves of SAE-OCSVM and DVAE-OCSVM tend to be stable while those of DAE-OCSVM vary according to the values of ⌫ or . This implies that the latent representations
  • 10. 10 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX TABLE II AUCS FROM THE STAND-ALONE ONE-CLASS CLASSIFIERS, HYBRID DAE-OCCS, SAE-OCCS AND DVAE-OCCS, AND THE RE-BASED CLASSIFIERS. Represen- -tation Methods One-class Classifiers Datasets (Sparsity) P a g e B lo c k s (0 .0 0 ) W P B C (0 .0 2 ) P e n D ig it s (0 .1 3 ) G L A S S (0 .1 8 ) S h u tt le (0 .2 2 ) A rr h y th m ia (0 .5 0 ) C T U 1 3 -1 0 (0 .7 1 ) C T U 1 3 -0 8 (0 .7 3 ) C T U 1 3 -0 9 (0 .7 3 ) C T U 1 3 -1 3 (0 .7 3 ) S p a m b a s e (0 .8 1 ) U N S W -N B 1 5 (0 .8 4 ) N S L -K D D (0 .8 8 ) In te rn e tA d s (0 .9 9 ) Stand-alone LOF 0.971 0.600 0.995 0.972 0.984 0.788 0.902 0.899 0.955 0.963 0.751 0.745 0.793 0.762 CEN 0.944 0.580 0.966 0.961 0.881 0.816 0.996 0.971 0.915 0.916 0.816 0.738 0.955 0.816 MDIS 0.927 0.640 0.962 0.970 0.898 0.786 0.998 0.966 0.734 0.891 0.731 0.801 0.929 0.694 KDE 0.928 0.637 0.961 0.967 0.883 0.787 0.998 0.958 0.720 0.889 0.731 0.800 0.924 0.693 OCSVM⌫=0.5 0.934 0.610 0.961 0.961 0.863 0.794 0.998 0.958 0.851 0.925 0.736 0.807 0.935 0.704 OCSVM⌫=0.1 0.934 0.557 0.968 0.832 0.760 0.807 0.983 0.797 0.852 0.898 0.736 0.792 0.890 0.710 DAE LOF 0.933 0.553 0.997 0.931 0.985 0.654 0.751 0.896 0.891 0.793 0.392 0.736 0.662 0.476 CEN 0.922 0.693 0.964 0.959 0.931 0.738 0.972 0.949 0.628 0.730 0.476 0.743 0.881 0.337 MDIS 0.905 0.700 0.950 0.994 0.901 0.707 0.981 0.960 0.653 0.855 0.466 0.765 0.888 0.342 KDE 0.903 0.690 0.954 0.992 0.892 0.706 0.980 0.939 0.616 0.857 0.460 0.756 0.861 0.335 OCSVM⌫=0.5 0.912 0.630 0.958 0.989 0.885 0.665 0.981 0.938 0.655 0.711 0.454 0.690 0.854 0.325 OCSVM⌫=0.1 0.920 0.557 0.976 0.606 0.762 0.668 0.937 0.775 0.702 0.332 0.578 0.536 0.697 0.314 RE-Based 0.969 0.540 0.997 0.986 0.821 0.824 0.998 0.988 0.943 0.972 0.805 0.873 0.959 0.842 VAE LOF 0.512 0.480 0.549 0.444 0.489 0.479 0.490 0.499 0.507 0.500 0.509 0.505 0.501 0.474 CEN 0.514 0.497 0.549 0.526 0.489 0.461 0.490 0.500 0.507 0.499 0.507 0.504 0.501 0.472 MDIS 0.509 0.517 0.553 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467 KDE 0.509 0.527 0.554 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467 OCSVM⌫=0.5 0.510 0.517 0.555 0.521 0.490 0.484 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.466 OCSVM⌫=0.1 0.515 0.537 0.553 0.537 0.491 0.466 0.490 0.498 0.507 0.499 0.505 0.505 0.501 0.463 RE-Based 0.928 0.657 0.959 0.961 0.883 0.784 0.998 0.957 0.698 0.881 0.734 0.801 0.923 0.694 SAE = 10 LOF 0.954 0.607 0.996 0.959 0.817 0.762 1.000 0.983 0.960 0.975 0.813 0.894 0.937 0.943 CEN 0.964 0.610 0.995 0.915 0.800 0.754 0.999 0.991 0.950 0.969 0.835 0.886 0.963 0.935 MDIS 0.967 0.603 0.996 0.898 0.794 0.757 0.999 0.990 0.950 0.968 0.826 0.887 0.964 0.936 KDE 0.967 0.607 0.996 0.884 0.783 0.756 0.999 0.990 0.949 0.968 0.825 0.886 0.964 0.934 OCSVM⌫=0.5 0.967 0.610 0.996 0.876 0.773 0.756 0.999 0.990 0.950 0.970 0.823 0.891 0.964 0.935 OCSVM⌫=0.1 0.956 0.600 0.996 0.890 0.781 0.740 0.999 0.988 0.944 0.971 0.825 0.893 0.961 0.933 RE-Based 0.929 0.637 0.959 0.959 0.884 0.787 0.997 0.958 0.720 0.888 0.734 0.800 0.925 0.690 DVAE = 0.05 ↵ = 10 8 LOF 0.908 0.327 0.987 0.705 0.841 0.807 0.999 0.978 0.954 0.973 0.810 0.876 0.958 0.900 CEN 0.906 0.450 0.988 0.774 0.849 0.777 0.999 0.982 0.956 0.963 0.809 0.879 0.960 0.892 MDIS 0.914 0.437 0.987 0.749 0.810 0.794 0.999 0.984 0.957 0.964 0.806 0.873 0.961 0.883 KDE 0.917 0.430 0.987 0.749 0.802 0.796 0.999 0.985 0.957 0.964 0.806 0.872 0.961 0.882 OCSVM⌫=0.5 0.920 0.450 0.988 0.769 0.802 0.797 0.999 0.987 0.957 0.974 0.808 0.872 
0.961 0.882 OCSVM⌫=0.1 0.922 0.460 0.988 0.791 0.804 0.780 0.999 0.988 0.956 0.973 0.817 0.872 0.959 0.881 RE-Based 0.928 0.640 0.958 0.953 0.880 0.785 0.998 0.922 0.715 0.836 0.734 0.803 0.924 0.694
Fig. 5. The influence of sparsity (a) and dimensionality (b) on the AUCs produced by six one-class classifiers using latent representations of DAE, SAE and DVAE. The visualization of the latent data (the first two features z0 and z1) created by DAE, SAE and DVAE (c) on CTU13-10.
  • 11. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 11 of SAE and DVAE make OCSVM perform consistently over a wide range of ⌫ and values. The number of neighbors k is chosen in the range from 1% to 50% of training size. For example, if k is 10% of a training dataset of size 200 samples, k is equal to 20. The AUCs of hybrid DAE-LOF, SAE-LOF and DVAE-LOF are computed, and plotted against 50 values of k as shown in Fig. 6(c). The AUC curves of the hybrid SAE-LOF and DVAE- LOF seem to level off within the range of k while there is no clear trend for the AUC curve of DAE-LOF. Thus, the latent representations of SAE and DVAE strengthen LOF to be insensitive to the choice of k. More results are shown in Fig. 6 of the supplementary material. These experiments confirm that the one-class classifiers, such as OCSVM and LOF, perform consistently on wide ranges of parameter settings when using the latent represen- tations of SAE and DVAE. This can be explained by: (1) normal data is represented in very well-shaped (Gaussian) distributions, and allocated in a small region highly isolated from the regions where anomalies are expected to appear; (2) the normal data from different sources will have a similar representation. Fig. 5(c) is a typical example (also Fig. 7 in the supplementary material). Therefore, OCSVM and LOF can model normal data very well even though these classifiers use few datapoints for support vectors in OCSVM (e.g. ⌫ = 0.01) or for nearest neighbors in LOF (e.g. k = 1% training size). This happens on several datasets. The influence of training size: We investigate the influence of training size on the latent representations of SAE and DVAE for anomaly detection tasks. Four datasets of more than 10000 training instances are chosen for this experiment, that is CTU13-09, CTU13-13, NSL-KDD and UNSW-NB15. Each dataset is sub-sampled multiple times (sizes ranging from 500 to 10000) to give smaller training set sizes for this experiment. Model selection is used as described in Subsection V-A2. The AUCs and query times produced from the hybrid SAE-OCCs and DVAE-OCCs are plotted against these training sizes as shown in Fig. 8 and Fig. 9 in the supplementary material. The results clearly show that the six one-class classifiers produce very similar AUCs amongst the five sizes on the same dataset. This suggests that the representation models, SAE and DVAE, tend to be consistent on a wide range of training sizes, and are less sensitive to training size than the hybrid DBN-OCCs in [14, see Fig. 5]. This is a positive result because it appears that excessive amounts of data are not required to make this method perform well. In terms of the complexity at query time, CEN out-performs other OCCs, and its query time does not scale with training size. Specific kinds of attacks: Our representation models are also examined on the thirteen specific attack groups in NSL-KDD and UNSW-NB15 as shown in Table III. This table has a similar structure to Table II, without arrangement according to sparsity. In general, the hybrid SAE-OCCs and DVAE-OCCs produce big improvements in the classification accuracy in comparison to their baselines on most of the attack groups, especially on the attack groups where the baseline is already good. This presents a common theme in classification methods. Moreover, the performance of SAE-CEN is evaluated on NSL-KDD by a confusion matrix as shown in Table IV. The confusion matrix is not the same as in the multi-class classification problem. 
This is because the classifiers built from only normal data use a threshold to classify unseen data into either the normal or anomalous class. This means that we can not measure the incorrect classification of a normal datapoint to a specific attack group, or an attack group to other attack groups. Therefore, precision values are only computed for normal and anomaly in the table. In this work, the threshold is set to correctly classify 90% on normal training data. TABLE IV CONFUSION MATRIX OF THE HYBRID SAE-CEN ON NSL-KDD Actual class Precision N o r m a l P r o b l e D o S R 2 L U 2 R Prediction Normal 8658 3 601 848 10 85.6% Anomaly 1053 2418 6857 2039 57 91.5% Recall 89.2% 99.9% 91.9% 70.6% 85.1% 88.8% Note: the values in bold are correctly classified. In terms of classification accuracy, the performance of these one-class classification algorithms are comparable, when the encoding is good (e.g. the encoding of SAE and DVAE). When considering computational complexity, CEN, which is a sim- ple method without hyperparameters, is very computationally efficient at both modeling and querying. Thus, it is nominated as the best model in our experiments. VI. CONCLUSION AND FUTURE WORK In this paper, we proposed latent representation models, SAE and DVAE, which help anomaly detection methods to cope with high-dimensional and sparse network datasets. Classical AEs do not bring data to a “nice” distribution by themselves, and the distribution they create is arbitrary. In the tasks where we rely on good behavior of the encoding, we have to control the distribution. Even with the standard VAE regu- larization which does control the distribution, it does not put the network “under pressure” to use all of its representational power to represent normal data. Our approaches do so, forcing normal data into a very tight area centered at the origin in the non-saturating area of the bottleneck unit activations. This helps AEs trained on normal data to keep normal datapoints close to the origin and push anomalies far away. We have demonstrated the latent representation created by our models helps well-known anomaly detection algorithms to perform efficiently and consistently on high-dimensional and sparse network data, even with relatively few training points. Amongst these algorithms, CEN is very computation- ally efficient and is easily feasible to perform in real-time. More importantly, the representation reduces the difficulty of model selection for these algorithms since their performance is insensitive to a wide range of hyperparameter settings. In future we propose to investigate latent representations using Gaussian mixture models. We also plan to propose an alternative method for estimating the hyperparameter in the loss functions of SAE and DVAE, possibly using multi- objective optimization.
  • 12. 12 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX (a) (b) (c) Fig. 6. The influence of ⌫ (a) and (b), and k (c) on the performance of OCSVM and LOF respectively when using the latent representations of DAE, SAE and DVAE on CTU13-13. TABLE III AUCS FROM THE CLASSIFIERS MENTIONED IN TABLE II ON SPECIFIC ATTACK GROUPS OF NSL-KDD AND UNSW-NB15. Representation Methods One-class Classifiers NSL-KDD UNSW-NB15 P ro b e D o S R 2 L U 2 R F u z z e rs A n a ly s is B a c k d o o r D o S E x p lo it s G e n e ri c R e c o n n - -a is s a n c e S h e ll c o d e W o rm s Stand-alone LOF 0.752 0.796 0.821 0.703 0.455 0.635 0.597 0.614 0.670 0.984 0.436 0.354 0.614 CEN 0.974 0.957 0.933 0.934 0.576 0.732 0.748 0.723 0.633 0.895 0.555 0.508 0.676 MDIS 0.986 0.949 0.831 0.885 0.596 0.890 0.900 0.843 0.660 0.969 0.636 0.583 0.679 KDE 0.985 0.945 0.820 0.871 0.601 0.883 0.893 0.840 0.658 0.969 0.639 0.591 0.684 OCSVM⌫=0.5 0.986 0.957 0.838 0.905 0.652 0.855 0.876 0.845 0.733 0.920 0.658 0.603 0.784 OCSVM⌫=0.1 0.958 0.936 0.714 0.789 0.576 0.712 0.733 0.746 0.731 0.961 0.555 0.469 0.853 DAE LOF 0.620 0.666 0.690 0.509 0.473 0.609 0.560 0.588 0.626 0.985 0.462 0.420 0.561 CEN 0.984 0.926 0.680 0.755 0.551 0.788 0.799 0.744 0.571 0.927 0.626 0.608 0.606 MDIS 0.966 0.912 0.761 0.746 0.565 0.818 0.828 0.770 0.588 0.955 0.644 0.606 0.651 KDE 0.964 0.904 0.666 0.743 0.563 0.799 0.809 0.751 0.571 0.949 0.646 0.614 0.642 OCSVM⌫=0.5 0.982 0.917 0.584 0.795 0.580 0.770 0.798 0.732 0.499 0.827 0.671 0.618 0.732 OCSVM⌫=0.1 0.734 0.834 0.323 0.308 0.391 0.289 0.305 0.417 0.420 0.694 0.527 0.468 0.722 RE-Based 0.981 0.971 0.911 0.930 0.632 0.992 0.957 0.940 0.888 0.979 0.592 0.476 0.816 VAE LOF 0.489 0.504 0.511 0.488 0.503 0.487 0.522 0.494 0.505 0.501 0.489 0.500 0.464 CEN 0.488 0.504 0.511 0.489 0.504 0.487 0.522 0.494 0.506 0.502 0.488 0.501 0.468 MDIS 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465 KDE 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465 OCSVM⌫=0.5 0.489 0.503 0.512 0.489 0.504 0.487 0.523 0.494 0.504 0.501 0.489 0.499 0.464 OCSVM⌫=0.1 0.489 0.504 0.511 0.490 0.504 0.487 0.522 0.494 0.505 0.501 0.489 0.499 0.462 RE-Based 0.985 0.945 0.818 0.871 0.605 0.882 0.893 0.840 0.660 0.968 0.642 0.598 0.686 SAE = 10 LOF 0.964 0.952 0.877 0.920 0.683 0.993 0.963 0.942 0.884 0.992 0.706 0.645 0.909 CEN 0.985 0.971 0.925 0.953 0.646 0.984 0.961 0.952 0.902 0.989 0.625 0.567 0.910 MDIS 0.988 0.971 0.926 0.950 0.629 0.994 0.961 0.952 0.909 0.988 0.646 0.573 0.909 KDE 0.988 0.971 0.925 0.949 0.623 0.993 0.961 0.952 0.909 0.988 0.642 0.559 0.906 OCSVM⌫=0.5 0.987 0.972 0.923 0.948 0.632 0.994 0.965 0.956 0.917 0.988 0.656 0.579 0.907 OCSVM⌫=0.1 0.987 0.973 0.912 0.908 0.648 0.994 0.967 0.957 0.921 0.988 0.642 0.554 0.902 RE-Based 0.985 0.946 0.822 0.872 0.601 0.881 0.891 0.838 0.657 0.969 0.640 0.592 0.685 DVAE = 0.05 ↵ = 10 8 LOF 0.977 0.974 0.896 0.934 0.635 0.996 0.956 0.949 0.898 0.990 0.537 0.457 0.895 CEN 0.983 0.971 0.915 0.929 0.605 0.995 0.958 0.941 0.882 0.990 0.666 0.603 0.881 MDIS 0.982 0.972 0.915 0.927 0.616 0.994 0.955 0.940 0.866 0.990 0.653 0.572 0.854 KDE 0.982 0.972 0.915 0.927 0.608 0.993 0.956 0.939 0.864 0.990 0.658 0.578 0.852 OCSVM⌫=0.5 0.982 0.973 0.914 0.926 0.601 0.993 0.960 0.942 0.869 0.990 0.661 0.584 0.860 OCSVM⌫=0.1 0.981 0.972 0.908 0.908 0.599 0.994 0.961 0.942 0.871 0.990 0.659 0.586 0.860 RE-Based 0.985 0.945 0.820 0.872 0.602 0.888 0.898 0.843 0.660 0.971 0.642 0.593 0.682 REFERENCES [1] M. Ahmed, A. N. 
Van Loi Cao received a BSc and an MSc in Computer Science from Le Quy Don Technical University, Vietnam, where he worked as an assistant lecturer. In 2015, he moved to Ireland to pursue a PhD at University College Dublin under the supervision of Assoc. Prof. James McDermott and Assoc. Prof. Miguel Nicolau, funded by VIED, Vietnam. His main research interests are neural networks, machine learning, evolutionary computation, and information security.

Miguel Nicolau is an Associate Professor at UCD. He received a BSc in Belgium, followed by a BSc, an MSc and a PhD from the University of Limerick. He then worked as an Expert Engineer at the INRIA institute in Paris, France. In 2010 he moved back to Ireland and worked as a Research Fellow and Lecturer at UCD. His teaching experience spans over 15 years and includes positions at the University of Limerick, Fudan University in Shanghai, and UCD.

James McDermott holds a BSc in Computer Science with Mathematics from the National University of Ireland, Galway. He received his PhD from the University of Limerick. His post-doctoral research was carried out at UCD and the Massachusetts Institute of Technology. He is now an Associate Professor at University College Dublin. His main research interests are evolutionary computation, machine learning, and computer music.