Learning Neural Representations for Network
Anomaly Detection
Van Loi Cao, Miguel Nicolau and James McDermott
Abstract—This paper proposes latent representation models for
improving network anomaly detection. Well-known anomaly de-
tection algorithms often suffer from challenges posed by network
data, such as high dimension and sparsity, and a lack of anomaly
data for training, model selection, and hyperparameter tuning.
Our approach is to introduce new regularizers to a classical
Autoencoder (AE) and a Variational Autoencoder (VAE), which
force normal data into a very tight area centered at the origin in
the non-saturating area of the bottleneck unit activations. These
AEs, trained on normal data, will push normal points towards
the origin, whereas anomalies, which differ from normal data,
will be put far away from the normal region. The models are
very different from common regularized AEs, such as the Sparse AE and
the Contractive AE, which make their latent representations less
sensitive to changes in the input data. The bottleneck feature
space is then used as a new data
representation. A number of one-class learning algorithms are
used for evaluating the proposed models. The experiments testify
that our models help these classifiers to perform efficiently and
consistently on high-dimensional and sparse network datasets,
even with relatively few training points. More importantly, the
models can minimize the effect of model selection on these
classifiers since their performance is insensitive to a wide range
of hyperparameter settings.
Index Terms—Anomaly detection, latent representation, high
dimension, one-class classification, autoencoders.
I. INTRODUCTION
THE rapid growth of computer networks has enabled them
to function as a central information system in modern
life. The increase in the size, services and applications, and
infrastructure of computer networks such as the Internet of
Things (IoT), has made them complex and heterogeneous.
Thus, they confront various critical threats such as malicious
activities, network intruders and cyber criminals. Identifying
and preventing these detrimental cyber activities have high pri-
ority these days [1]. Analyzing and monitoring network traffic
to identify such malicious actions in large-scale networks are
crucial tasks, and ideally should be carried out automatically
with little supervision by network administrators [2]. Anomaly
detection is a data analysis task where the goal is to detect
patterns deviating greatly from normal data. It is suitable for
automatically identifying illegal, malicious activities and other
forms of network abuse from the normal behaviors of network
systems [3], [4]. Many machine learning algorithms have been
Manuscript received December 22, 2017; revised March 13, 2018. This
work is funded by Vietnam International Education Development (VIED) and
by agreement with the Irish Universities Association.
VL. Cao is with the School of Computer Science, University College
Dublin, Dublin, Ireland (e-mail: loi.cao@ucdconnect.ie).
J. McDermott and M. Nicolau are with University College Dublin, Dublin,
Ireland (e-mail: james.mcdermott2@ucd.ie and miguel.nicolau@ucd.ie).
employed for developing anomaly detection models [1], [2],
[3]. However, several issues, such as the high dimension and
complex types of network data, the lack of labelled anomalous
traffic, and the rapid evolution of intrusion methods, make
network anomaly detection a challenging task. In this work,
we aim to cope with these issues by proposing latent repre-
sentation models which compress normal data into a specific
region of a latent feature space. This is expected to facilitate
modelling of normal data.
As stated, one of the major issues is that labelled anomalous
data tends not to be available for constructing network anomaly
detection models [3]. Collecting anomalies is extremely dif-
ficult due to privacy and security concerns of computer net-
works, and the shortage of intrusion network traffic and events
in host logs [5], [6]. Network administrators tend to avoid
divulging data that could compromise the privacy of their
clients or privileged information of their networks. Labeling a
huge volume of anomalous data covering all possible kinds of
attacks from a real-world network would be a challenging and
time-consuming task. Moreover, malicious actions or intrusive
methods are evolving over time. Thus, it may require a
significant amount of time to gather and label these data after
the awareness of the detailed information and behavior of new
attacks becomes available. Furthermore, new anomalies, such
as zero-day vulnerabilities, often cause serious damage to net-
work systems. Thus anomaly detection models are required to
cope with new anomalous actions efficiently. Most supervised
learning algorithms using knowledge of previous anomalies
are unable to detect novelties [1]. These issues strongly suggest
that the training process should be as independent as possible
from the availability of anomalous data, and anomaly detection
models should be able to respond in a flexible and timely way
to any new anomalous actions.
However, the absence of anomalies implies the crucial issue
that no validation set is available for estimating hyperparam-
eters. Most well-known anomaly detection algorithms, such
as one-class Support Vector Machine (OCSVM) [7] or Local
Outlier Factor (LOF) [8], are highly dependent on the choice of
parameters [8], [9] (more details will be discussed in Section II
and III). Supposing a small proportion of anomalies are
available for estimating parameters, this may damage the per-
formance of anomaly detection models since new, completely
different anomalies may appear in the future. Therefore, it
is desirable that network anomaly detection models should
provide good predictions on unseen data over a wide range
of parameter settings, and have the ability to detect any new
forms of anomalies instantly as they appear.
The high dimension and complexity of network data is
another challenge to network anomaly detection. Network
traffic is typically described by a huge number of features,
such as in CISCO NetFlow data, and in different data types,
such as hierarchies (IP addresses), categories (protocols and
services) or continuous attributes [3], [10]. Anomaly detection
techniques often require some preprocessing on input data,
which may result in a higher-dimensional and sparser version
of the data. The curse of dimensionality is a problem for
anomaly detection algorithms [11]. This leads to a high pro-
portion of irrelevant features effectively producing noise that
conceals true anomalies in network data. If enough subspaces
that contain a subset of features are given, at least one
subspace (mostly relevant features) can be found in which
anomalies appear far from normal data. However, the search
for such subspaces is systematically difficult in high dimension
since the number of subspaces increases exponentially with
the dimensionality, which is called the exponential search
space problem. The curse of dimensionality also results in
concentration of distances. The relative difference between
the pairwise distance of any two datapoints and that of others
vanishes with increasing dimensionality. This is a challenge
to distance-based anomaly detection algorithms. Therefore,
network anomaly detection algorithms are required to deal
with high-dimensional and sparse data¹, by discovering more
robust and relevant features.
Unsupervised learning techniques, such as Support Vector
Data Description (SVDD), OCSVM and LOF, have been
widely used for anomaly detection [3]. These techniques
have successfully addressed the task of modeling normal
data without any assumption about its underlying distribu-
tion. LOF [8] is an advanced technique for high-dimensional
anomaly detection, which uses the local density deviation of a
given datapoint from its neighbors as an anomaly score. When
LOF is trained on only normal data, it can be used as a one-
class classifier. Recently, Kernel Density Estimation (KDE)
has been employed for building anomaly detection models,
and proven to efficiently model normal data with unknown
underlying distributions [12], [13]. In practice however, these
anomaly detection algorithms have some drawbacks: less
generalization ability in high dimension due to the curse of
dimensionality phenomenon [11], [14], and the difficulty of
tuning hyperparameters. These algorithms are non-parametric
methods, thus their query time is potentially high (more details
in Sections II and III).
Autoencoders (AEs) [15], [16] are a neural network archi-
tecture which have emerged as a suitable approach to anomaly
detection [5], [17], [18], [19] and as building blocks in deep
learning [20], [21], [22] in recent years. An AE is a feed-
forward neural network which attempts to reconstruct the
original input data at the output layer. The middle hidden
layer, sometimes called the bottleneck layer, like a nonlin-
ear PCA, compresses the redundancies while preserving and
differentiating non-redundant information in the input [17].
¹A dataset with a majority of zero elements is considered a sparse dataset.
Sparsity is a term used to represent the ratio of the number of zero entries
to the total number of entries in a dataset, and it is in the range of [0, 1]. In
this paper, a dataset with a sparsity above 0.5 is regarded as a sparse one.
In the anomaly detection context, an AE trained on normal
data will behave well on normal instances and will result
in small reconstruction errors (REs), but poorly reconstruct
anomalies giving large REs. Thus, RE is commonly used
as a measure of anomaly score. Alternatively, the middle
hidden layer of a trained AE can be used as a new feature
representation (called a latent representation) for improving
the performance of density-based anomaly detection [13] or
anomaly detection based on self-organizing maps [23]. The
central idea is that the latent representation which is lower-
dimension, and more robust to capture normal behaviors,
would help simple classifiers to identify anomalies. However,
the normal data is allowed to be freely distributed in the latent
feature space. The AE encoder could learn to map points from
the normal class into very different regions of the latent feature
space. Thus, the distribution of normal data in the latent feature
space may have an arbitrary shape which may not encourage
the stability of anomaly detection algorithms.
In order to overcome the limitations of the well-known
anomaly detection algorithms, we aim to find a new data
representation for facilitating simple anomaly detection al-
gorithms. The new representation is aimed to have useful
characteristics: lower dimension, straightforward to capture the
structure of normal data, a similar shape of normal data in
the new representation for different input distributions, and
normal data to be distributed in a small region in the feature
space and anomalies to be expected to appear in the rest
of the space. This will potentially improve the performance
of anomaly detection algorithms, and may make them less
sensitive to parameter settings. Our approach is to develop two
AEs, a classical AE and a Variational Autoencoder (VAE),
for constructing such a data representation by introducing
some constraints on the distribution of normal data in the
bottleneck layer. The new regularizers will encourage these
AEs to learn to represent latent data in a more meaningful
way - training data (which is assumed to be normal) appears
close together, and is distributed in a specific region in the
latent feature space. The bottleneck layers of these trained
AEs will then be used as the new data representation. Fig. 1
gives an example of data representation in the original space
(a), in the latent feature space of AEs (b), and in the latent
feature space of our models (c). The normal data shown in
Fig 1(b) is closer together than that in Fig 1(a), and has an
arbitrary shape. In Fig 1(c), the normal data is constrained to
be distributed in a good shape close to the origin. A number
of one-class classification algorithms are then employed to
capture the region representing normal behavior in the latent
feature space, and identify any datapoint not belonging to
this region as anomalies. More details will be presented in
Section IV.
The remainder of the paper is organized as follows. In
Section II and III, we briefly describe several anomaly detec-
tion algorithms, and highlight some related work in anomaly
detection. Our methods are presented in Section IV. This is
followed by Section V showing the evaluation and discussion
of our models. Section VI draws some conclusions and sug-
gests future work.
Fig. 1. Illustrations of data in the original feature space (a), the latent feature
space of AEs (b), and the latent feature space of our models (c).
II. MATHEMATICS OF ONE-CLASS CLASSIFICATION
ALGORITHMS
This section briefly describes the anomaly detection algorithms
used in this paper: Centroid, Mean distance, KDE, LOF and
OCSVM, as well as autoencoders.
A. Anomaly detection algorithms
Centroid (CEN): This is a parametric method which uses
a single Gaussian to model training data. The distance (i.e.
radius) from the centroid (the origin) to an observation reflects
the degree of abnormality of the observation. A larger value
implies a higher probability that the datapoint is an anomaly.
By imposing a threshold on the distance, a query datapoint
can be classified as either normal or an anomaly. This method
has no hyperparameters, and works under the assumption that
the training data has a Gaussian distribution.
Mean Distance (MDIS): The mean of the Euclidean distances
from a datapoint to the points of the normal training set can be used as
an anomaly score. By imposing a threshold on the mean distance,
a datapoint whose anomaly score is above the threshold is
flagged as an anomaly. MDIS has no hyperparameters, and is a
non-parametric method.
Kernel Density Estimation (KDE): KDE estimates the probability
density function underlying a data sample [24], and can be used
for constructing an anomaly detection model as presented in [12].
However, the main drawback of such a model is its computational
cost at the querying stage, especially on large datasets. The
classification accuracy of KDE-based classifiers depends on the
choice of the bandwidth h of the kernel function [12].
Local Outlier Factor (LOF): LOF [8] considers the data-
points that have a considerably lower local density than their
neighbors as anomalies. It estimates a density deviation score,
called local outlier factor, of a given datapoint with respect to
its neighbors. The larger the LOF score a given datapoint
has, the higher the probability the datapoint is anomalous.
The algorithm has shown its power on network anomaly
detection [25]. In practice however, it has some limitations
when dealing with high-dimensional data [2], and the choice
of the number of neighbors k is still an open question.
One-class Support Vector Machine (OCSVM): OCSVM [7]
first maps the normal data into a feature space via a kernel
function, and searches for a hyperplane with maximum margin
between the region containing most of normal data (normal
region) and the origin in the feature space. The idea behind this
is to allocate the region encompassing the origin for anomalies
to appear. That is to say, the OCSVM decision function returns
a positive value in the normal region far from the origin, and
a negative value in the anomaly region near the origin.
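For concreteness, the sketch below shows how such one-class classifiers can be instantiated with scikit-learn, the library used for the OCC implementations in Section V. The toy data and the hyperparameter values (nu, gamma, bandwidth, number of neighbors) are illustrative placeholders, not the settings used in our experiments.

# Minimal sketch of the one-class classifiers discussed above, using scikit-learn.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor, KernelDensity

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 10))                        # normal (one-class) training data
X_test = np.vstack([rng.normal(size=(50, 10)),              # normal queries
                    rng.normal(loc=4.0, size=(50, 10))])    # anomalous queries

# OCSVM: decision_function is positive on the normal side of the hyperplane.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=1.0 / X_train.shape[1])
ocsvm.fit(X_train)
ocsvm_scores = -ocsvm.decision_function(X_test)             # larger = more anomalous

# LOF in novelty mode: fit on normal data only, then score unseen queries.
lof = LocalOutlierFactor(n_neighbors=50, novelty=True)
lof.fit(X_train)
lof_scores = -lof.score_samples(X_test)                     # larger = more anomalous

# KDE: low log-density under the normal model indicates an anomaly.
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(X_train)
kde_scores = -kde.score_samples(X_test)                     # negative log-density

# MDIS and CEN: mean distance to the training points, and distance to the centroid.
mdis_scores = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2).mean(axis=1)
cen_scores = np.linalg.norm(X_test - X_train.mean(axis=0), axis=1)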
B. Autoencoder
An autoencoder [15], [16] is a neural network which con-
sists of two parts: encoder and decoder as shown in Fig. 2(a).
The encoder is defined as a feature extractor that allows the
explicit representation of an input x in a feature space. Let
$f_\theta$ denote the encoder, and $X = \{x_1, x_2, \dots, x_n\}$ be a dataset. The
encoder $f_\theta$ maps an input $x_i \in X$ into a latent vector
$z_i = f_\theta(x_i)$, where $z_i$ is the code or latent representation. The
decoder $g_\theta$ maps the latent representation $z_i$ back into the
input space, which forms a reconstruction $\hat{x}_i = g_\theta(z_i)$. The
encoder and decoder are commonly represented as single-layer
neural networks in the form of non-linear functions of affine
mappings as follows:

$$f_\theta(x) = s_f(Wx + b) \quad (1)$$
$$g_\theta(z) = s_g(W'z + b') \quad (2)$$

where $W$ and $W'$ are the weight matrices of the encoder and
decoder, and $b$ and $b'$ are the bias vectors of the encoder and
decoder. $s_f$ and $s_g$ are the activation functions of the encoder
and decoder, such as a logistic sigmoid or hyperbolic tangent
non-linear function, or a linear identity function.
Autoencoders learn to minimize the loss function in (3)
with respect to the parameters $\theta = \{W, W', b, b'\}$, using a
learning algorithm such as Stochastic Gradient Descent (SGD)
with back-propagation. The reconstruction loss function over
training instances can be written as:

$$L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) = \frac{1}{n}\sum_{i=1}^{n} l\big(x_i, g_\theta(f_\theta(x_i))\big) \quad (3)$$
where $l(x_i, \hat{x}_i)$ is the discrepancy between the input $x_i$ and
its reconstruction $\hat{x}_i$. The choice of the reconstruction loss
depends largely on the appropriate distributional assumptions
on the given data. The mean squared error (MSE)² is commonly
used for real-valued data, whereas a cross-entropy loss³ can be
used for binary data. By compressing input data into a lower
dimensional space, the classical autoencoder avoids simply
learning the identity, and removes redundant information [17].
Denoising autoencoders (DAEs) [26], [27] are regularized
autoencoders that are trained to reconstruct the original input
from a corrupted version of the input. This will allow DAEs
to capture the structure of the input distribution, and again
prevent them from learning the identity. The loss function of
AEs in (3) is rewritten for DAEs as follows:
$$L_{DAE}(\theta; x) = \sum_{i=1}^{n} \mathbb{E}_{p(\tilde{x}|x_i)}\big[l(x_i, g_\theta(f_\theta(\tilde{x})))\big] \quad (4)$$

where $\tilde{x}$ is the corrupted version of $x_i$ drawn from $p(\tilde{x}|x_i)$,
and $\mathbb{E}_{p(\tilde{x}|x_i)}[\cdot]$ is the expectation of the reconstruction loss at $x_i$
over a number of samples $\tilde{x}$ drawn from $p(\tilde{x}|x_i)$.

²$L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2$
³$L_{AE}(\theta; x) = -\frac{1}{n}\sum_{i=1}^{n}\big[x_i \log(\hat{x}_i) + (1 - x_i)\log(1 - \hat{x}_i)\big]$

This is because the
corruption process is performed stochastically on the original
input each time a point xi
is considered. There are many ways
to corrupt the input, such as Gaussian noise or salt and pepper
noise, but randomly masking features of the input to zero is
the most commonly used. This loss function can be optimized
by SGD, as in optimizing the AE loss function.
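The following is a minimal sketch of an AE with the reconstruction loss in (3) and of RE used as an anomaly score. It is written in PyTorch purely for illustration; the framework, the single-hidden-layer architecture and the toy data are assumptions here, not the deeper networks and training scheme described in Section V.

# Minimal autoencoder sketch: MSE reconstruction loss as in (3), and the
# per-point reconstruction error (RE) used as an anomaly score.
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, n_features, n_bottleneck):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_bottleneck), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(n_bottleneck, n_features), nn.Tanh())

    def forward(self, x):
        z = self.encoder(x)          # latent representation z = f_theta(x)
        return self.decoder(z), z    # reconstruction x_hat = g_theta(z)

x_normal = torch.randn(512, 20)                  # stand-in for normal training data
ae = TinyAE(n_features=20, n_bottleneck=5)
opt = torch.optim.Adadelta(ae.parameters())

for _ in range(50):                              # train on normal data only
    x_hat, _ = ae(x_normal)
    loss = nn.functional.mse_loss(x_hat, x_normal)
    opt.zero_grad(); loss.backward(); opt.step()

x_query = torch.randn(8, 20)
x_hat, _ = ae(x_query)
re_score = ((x_query - x_hat) ** 2).mean(dim=1)  # per-point RE anomaly score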
C. Variational Autoencoder
The Variational Autoencoder (VAE) [28] is a neural network
that consists of two parts: a probabilistic encoder representing
the approximate posterior $q_\phi(z|x)$ to the intractable true posterior
$p_\theta(z|x)$, and a probabilistic decoder that represents the
generative model $p_\theta(x|z)$, as shown in Fig. 2(b). The objective
of the VAE is to optimize the variational lower bound on the
marginal likelihood of the data w.r.t. the variational parameters $\phi$ and
the generative parameters $\theta$. Since it is intractable, the marginal
likelihood is computed as a sum over the marginal likelihoods of the
individual datapoints, $\log p_\theta(x_1, \dots, x_n) = \sum_{i=1}^{n} \log p_\theta(x_i)$,
where $\log p_\theta(x_i)$ can be written as:

$$\log p_\theta(x_i) = D_{KL}\big(q_\phi(z|x_i) \,\|\, p_\theta(z|x_i)\big) + \mathcal{L}(\theta, \phi; x_i) \quad (5)$$

The term $\mathcal{L}(\theta, \phi; x_i)$ is the lower bound on the marginal likelihood
of datapoint $x_i$ since the first term, the Kullback-Leibler
divergence (KL-divergence) of the approximate posterior from
the true posterior, is non-negative. The lower bound can be
written as follows:

$$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z|x_i)}\big[-\log q_\phi(z|x_i) + \log p_\theta(x_i, z)\big]$$
$$= -D_{KL}\big(q_\phi(z|x_i) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x_i)}\big[\log p_\theta(x_i|z)\big] \quad (6)$$

where $p_\theta(x_i|z)$ is the likelihood of $x_i$ given the latent variable
$z$, and $p_\theta(z)$ is the prior over the latent variables.
However, the second term in (6) requires a random latent
variable $z$ sampled from the approximate posterior $q_\phi(z|x)$.
This is problematic since back-propagation cannot flow
through a random node $z$. When $q_\phi(z|x)$ is restricted to
certain kinds of parametric distributions, e.g. Gaussian, the
random variable $z$ can be reparameterized as a deterministic
function $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary variable with
independent marginal $p(\epsilon)$. This yields a lower-variance lower
bound estimator called SGVB (Stochastic Gradient Variational
Bayes):

$$\tilde{\mathcal{L}}(\theta, \phi; x_i) = -D_{KL}\big(q_\phi(z|x_i) \,\|\, p_\theta(z)\big) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x_i | z_{i,l}) \quad (7)$$

where $z_{i,l} = g_\phi(\epsilon_{i,l}, x_i)$ and $\epsilon_l \sim p(\epsilon)$. In (7), the KL-divergence
term forces $q_\phi(z|x)$ to be as close as possible to
$p_\theta(z)$ and works as a regularizer, whereas the second term is
an expected negative reconstruction error.
For analytically integrating the KL-divergence in (7), the
true posterior $p_\theta(z|x)$ is assumed to be an approximate Gaussian
with approximately diagonal covariance. Let the prior be
$p_\theta(z) = N(0, I)$, and the approximate posterior be a multivariate
Gaussian with a diagonal covariance structure, $q_\phi(z|x_i) = N(\mu^i, (\sigma^i)^2)$,
where $\mu^i$ and $\sigma^i$ are the mean and s.d. evaluated
at datapoint $i$. Let $\mu^i_j$ and $\sigma^i_j$ denote the $j$-th elements of $\mu^i$
and $\sigma^i$ respectively, and $J$ be the dimensionality of $z$. The
KL-divergence in (7) is written as follows:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = D_{KL}\big(N(\mu^i, (\sigma^i)^2)\,\|\,N(0, I)\big)
= \frac{1}{2}\sum_{j=1}^{J}\Big((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log((\sigma^i_j)^2)\Big) \quad (8)$$

Fig. 2. The architectures of AEs (a), VAEs (b), and the hybrids of the latent
representation models and one-class classifiers (c).
Taking $D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big)$ from (8) into (7), we get the objective
function of the VAE at datapoint $i$ as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{2}\sum_{j=1}^{J}\Big((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log((\sigma^i_j)^2)\Big) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x_i|z_{i,l}) \quad (9)$$
where $z_{i,l} = \mu^i + \sigma^i \odot \epsilon_l$ and $\epsilon_l \sim N(0, I)$. $L$ is the number of
samples per datapoint; in practice, it can be set to 1 as in [28].
When optimizing (maximizing) the objective function in (9)
by Stochastic Gradient Ascent, VAEs learn the recognition
model parameters $\phi$ jointly with the generative model parameters
$\theta$. Given a datapoint $x_i$, the probabilistic encoder outputs
the parameters of the approximate posterior at this datapoint,
$\mu^i$ and $\sigma^i$. An actual value $z_{i,l} \sim q_\phi(z|x_i)$, obtained through
$z_{i,l} = \mu^i + \sigma^i \odot \epsilon_l$, is the input for the probabilistic decoder. The
output of the decoder is the reconstruction $\hat{x}_i$. The distribution
of the encoder output is Gaussian, whereas that of the decoder
depends on the type of data (Gaussian for real-valued data or
Bernoulli for binary).
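A short sketch may help make (7)-(9) concrete: the encoder outputs $\mu$ and $\log\sigma^2$, a latent sample is drawn via the reparameterization $z = \mu + \sigma\odot\epsilon$, and the KL term of (8) is computed in closed form. PyTorch and the tensor shapes below are assumptions for illustration only.

# Sketch of the VAE bottleneck: reparameterization and the closed-form KL of (8).
import torch

def reparameterize(mu, logvar):
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)          # eps ~ N(0, I)
    return mu + sigma * eps                # z_{i,l} = mu_i + sigma_i * eps_l

def kl_to_standard_normal(mu, logvar):
    # D_KL(N(mu, sigma^2) || N(0, I)) = 0.5 * sum_j (sigma_j^2 + mu_j^2 - 1 - log sigma_j^2)
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1)

mu = torch.zeros(4, 3, requires_grad=True)      # stand-in encoder outputs
logvar = torch.zeros(4, 3, requires_grad=True)
z = reparameterize(mu, logvar)                   # gradients flow through mu and logvar
kl = kl_to_standard_normal(mu, logvar)           # one KL value per datapoint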
III. RELATED WORK
In this section, we discuss recent trends and some state-of-
the-art anomaly detection algorithms. This includes Support
Vector Machines [7], [29], [30], and autoencoder-based meth-
ods [5], [14], [17], [18], [19], [31].
Schölkopf et al. [7] and Campbell et al. [30] presented
hyperplane-based one-class SVM approaches as already dis-
cussed. In [7], their aim is to map the input data into the
feature space via a kernel function, and then find a hyperplane
with a maximum margin between the region containing normal
data and the origin in the feature space. The half space
containing the origin is identified as the anomalous region.
The trade-off between the two objectives, maximizing the
margin and minimizing the number of target vectors falling
into the anomalous region, is controlled by the outlier fraction
$\nu \in (0, 1)$. The larger the value of $\nu$, the more normal vectors
are rejected as outliers and the more normal vectors become
support vectors. When $\nu$ approaches 1, almost all normal
vectors become support vectors. The method was evaluated
on the US postal service database of handwritten digits, and
the results show that the classifier performed well. However,
how to choose values for the hyperparameter $\nu$ and kernel
parameters such as $\gamma$ (related to the bandwidth h in KDE)
is still an open question. Instead of allocating the origin region
for anomalies, Campbell et al. [30] proposed a model that
learns to capture the region containing normal instances in
feature space. They attempted to find a hyperplane with respect
to the center of the distribution of normal data, and anomalies
were assumed to appear in the other side. Linear programming
techniques are employed instead of the quadratic programming
in Schölkopf’s approach, which allows their model to learn from large
datasets rapidly.
Tax and Duin [29] proposed a method called Support Vector
Data Description for anomaly detection. In this approach,
normal data is again first mapped into a feature space corre-
sponding to a kernel function. It then finds a hypersphere with
minimum radius which encompasses almost all normal vectors
in the feature space. Any query datapoints lying inside the
hypersphere are considered as normal and others as anomalies.
In order to achieve good classification accuracy, it is desir-
able to reduce the volume of the hypersphere by rejecting
some fraction of training data (the outlier fraction known
as parameter C) when training this model. This illustrates
a theme present in all one-class classification research, the
trade-off between false positive and false negative rates. They
introduced different kernel functions to SVDD that make the
method more flexible, and the Gaussian kernel was found to be
the most suitable for many datasets. When using the Gaussian
kernel, the method is comparable to OCSVM [7]. However,
the technique requires a large number of normal examples,
and extra outlier objects for training in order to improve the
classification accuracy [29]. Both SVDD and OCSVM have
demonstrated their effectiveness on anomaly detection, but
they are limited in their ability to model large-scale and high-
dimensional data due to their time and space complexity [32].
The approach of using stand-alone AEs to build anomaly de-
tection systems was proposed in [5], [18], [19], in which AEs
act as either anomaly detection methods or feature reduction
techniques. Hawkins et al. [18] trained an AE (also known as a
replicator neural network) with three narrow hidden layers on
normal data. Its RE was used as an “outlier score”: an outlier
score above a predetermined threshold indicated an anomaly.
A step-wise activation function was used for the neurons in the
middle hidden layer, which mapped input data into a number
of possible clusters. Each of these clusters was associated with
an active state of these neurons. These neurons were active
with specific steps on a particular class of data (normal or
anomaly). Thus, the labels of these clusters can be used as
an alternative approach for indicating anomalies. The model
was evaluated on the Wisconsin Breast Cancer (WBC) and
the KDD’99 datasets, and both of these models (RE-based
and cluster-based) produced high accuracy. Furthermore, Fiore
et al. [5] constructed an AE using Discriminative Restricted
Boltzmann Machines to test the hypothesis that there is a
deep similarity among normal behaviors. They expected that
their model can describe all the characteristics of normal
traffic when comparing it against unseen anomalous traffic.
Their experiments involving real-world network traces and
the KDD’99 datasets confirmed that its performance suffered
when testing in a network greatly different from that where
training data was collected. In contrast, Sakurada et al. [19]
employed an AE as a nonlinear feature reduction technique for
anomaly detection. They attempted to clarify the properties
of AEs by comparing a classical AE and a DAE to linear
PCA and Kernel PCA. These techniques were evaluated on
an artificial dataset and on spacecraft telemetry data. They
concluded that DAEs not only outperform linear PCA and
Kernel PCA in terms of accuracy, but also can avoid the high
computation costs of kernel PCA.
Hybrid approaches or extensions of AEs have been recently
proposed for anomaly detection [14], [31]. Veeramachaneni
et al. [31] proposed an ensemble learner to combine three
single one-class classifiers: AE-based, density-based, and ma-
trix decomposition-based techniques. They also used a human
expert to provide ongoing correct labels from which the
algorithms can learn. The models were tested on a large
network log file dataset, and achieved promising results. Erfani
et al. [14] introduced a hybrid of a Deep Belief Network
(DBN) and OCCs, such as OCSVM and SVDD, for solving
the problem of high-dimensional anomaly detection. The DBN
was pre-trained in the greedy layer-wise fashion, that is unsu-
pervised training of each Restricted Boltzmann Machine one-
by-one. OCSVM [7] and SVDD [29] were then built on top of
the pre-trained DBN. This structure takes advantage of the high
classification accuracy of these OCCs and the nonlinear
feature reduction of DBNs. The model was evaluated
on eight high-dimensional UCI datasets. The results showed
that the performance of the hybrid models was comparable to
AEs and better than stand-alone OCSVM and SVDD, and the
training and testing times improved significantly.
IV. PROPOSED MODEL
We aim to find a new data representation that facilitates
simple anomaly detection algorithms. This section clarifies
how to construct the data representation by introducing new
regularizers to an AE and a VAE. The new regularizers
together with reconstruction loss will help these AEs to give
a robust representation of normal behavior. The regularizers
will encourage the encoders of these AEs to condense normal
data as close together as possible at a particular region in the
latent feature space, while reconstruction loss promotes these
AEs to keep normal points from overlapping each other. In
order to separate the normal region from anomalies, normal
points will be “pushed” towards the origin at the non-saturating
area of the bottleneck unit outputs by the regularizers. That
is, each coordinate (given by the output of the bottleneck unit
activation) of an encoded point will tend to be pushed closer
to the non-saturating value (zero) of the activation function.
Thus, a trained AE on normal data can keep normal datapoints
close to the origin, whereas anomalous datapoints, if they
differ from normal datapoints, will therefore tend to differ
greatly, and appear in other regions. A number of one-class
classifiers are employed for evaluating the proposed models.
Fig. 2(c) illustrates the hybrid of the data representation
models and one-class classifiers. More details are shown in
Subsections IV-A and IV-B.
Our models are very different from other common
regularized AEs, including Sparse AEs and Contractive AEs.
Sparse AEs attempt to construct a sparse representation in
an overcomplete setting in which a few of the outputs of
the hidden unit activations can vary at a time, and others
are set to a saturating value [33]. Thus, the latent data is
penalized close to the saturating value at zero [34], or the
hidden bias vectors are controlled [35]. Contractive AEs seek
a latent representation that is as insensitive as possible w.r.t the
variances in the input data [36]. Thus, the outputs of the hidden
units are constrained to be close to their marginal values (e.g.
0 or 1 in sigmoid function).
A. Shrink Autoencoder
A new regularizer is added to the loss function of an AE
which encourages the AE to construct a representation of
normal data which will be easy for one-class classification
algorithms. The regularizer is designed to penalize normal
datapoints whose vectors in the latent space are of large
magnitude, that is it will restrict the normal data to lie close
to the origin. Hence, this is called a shrink regularizer, and
the AE is named Shrink AE (SAE). The loss function in (3)
can be redefined for this situation as follows:
$$L_{SAE}(\theta; x_i, z) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) + \lambda\,\frac{1}{n}\sum_{i=1}^{n} \| z_i \|^2 \quad (10)$$

where $\hat{x}_i$ and $z_i$ are the reconstruction and the latent vector
of the observation $x_i$ respectively. The first term is the
reconstruction error, $\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2$, and the second term is
the shrink regularizer. The parameter $\lambda$ controls the trade-off
between the two terms in the loss function.
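A minimal sketch of the SAE loss in (10) is given below; only the composition of the loss (mean squared RE plus $\lambda$ times the mean squared latent norm) follows the text, while the tiny encoder/decoder, the toy mini-batch and the PyTorch framework are illustrative assumptions. The value $\lambda = 10$ anticipates the setting selected in Section V.

# Sketch of the Shrink AE loss in (10): MSE reconstruction term plus the
# shrink regularizer lambda * (1/n) * sum ||z_i||^2.
import torch
import torch.nn as nn

def sae_loss(x, x_hat, z, lam=10.0):
    re = ((x - x_hat) ** 2).sum(dim=1).mean()    # reconstruction error term
    shrink = (z ** 2).sum(dim=1).mean()          # mean squared latent norm
    return re + lam * shrink

encoder = nn.Sequential(nn.Linear(20, 5), nn.Tanh())   # stand-in encoder/decoder
decoder = nn.Sequential(nn.Linear(5, 20), nn.Tanh())
x = torch.randn(64, 20)                                 # stand-in normal mini-batch
z = encoder(x)
loss = sae_loss(x, decoder(z), z, lam=10.0)
loss.backward()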
B. Dirac delta Variational Autoencoder
VAEs attempt to encode data so that it is distributed as a
standard Gaussian in the latent space. Thus, normal data will
reside in a large area centered at the origin. Our strategy is
to compress normal data into a smaller area near the origin.
Therefore, we redesign the KL-divergence at (8) by forcing
the approximate posterior $q_\phi(z|x)$ to be as close as possible
to a new prior $p_\theta(z)$ with a very small standard deviation.
Let us recall the KL-divergence between two multivariate
Gaussian distributions in $\mathbb{R}^n$, $P_1 = N(\mu_1, \Sigma_1)$ and $P_2 = N(\mu_2, \Sigma_2)$,
defined in [37] as:

$$D_{KL}(P_1 \,\|\, P_2) = \frac{1}{2}\Big[\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1) - n + \log\Big(\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\Big)\Big] \quad (11)$$
Let $\mu^i$ and $\Sigma^i$ denote the variational mean and the covariance
matrix evaluated at datapoint $i$, $q_\phi(z|x_i) = N(\mu^i, \Sigma^i)$, and $J$
be the dimensionality of $z$. Consider a constant $\alpha$ ($\alpha \ll 1.0$)
to be the variance of the prior probability, $p_\theta(z) = N(0, \alpha I)$, where $I$
is an identity matrix. Applying these to (11), the KL-divergence
between $q_\phi(z|x_i)$ and $p_\theta(z)$ can be written as follows:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = \frac{1}{2}\Big[\mathrm{tr}\big((\alpha I)^{-1}\Sigma^i\big) + (\mu^i)^T(\alpha I)^{-1}(\mu^i) - J + \log\Big(\frac{\det(\alpha I)}{\det(\Sigma^i)}\Big)\Big] \quad (12)$$
Taking $I$ and $\alpha$ in (12), we get:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = \frac{1}{2}\Big[\alpha^{-1}\,\mathrm{tr}(\Sigma^i) + \alpha^{-1}(\mu^i)^T(\mu^i) - J + \log\Big(\frac{\alpha^J}{\det(\Sigma^i)}\Big)\Big]$$
$$= \frac{1}{2\alpha}\Big[\mathrm{tr}(\Sigma^i) + (\mu^i)^T(\mu^i) - \alpha J + \alpha J\log\alpha - \alpha\log(\det(\Sigma^i))\Big] \quad (13)$$
Because $\Sigma^i$ is a diagonal matrix of size $J \times J$, $\Sigma^i$ can be
treated as a vector of its $J$ diagonal elements. Let $\mu^i_j$ and $(\sigma^i_j)^2$
denote the $j$-th elements of $\mu^i$ and $\Sigma^i$ respectively.
Taking $\mathrm{tr}(\Sigma^i)$ and $\det(\Sigma^i)$, we get:

$$D_{KL}\big(q_\phi(z|x_i)\,\|\,p_\theta(z)\big) = \frac{1}{2\alpha}\Big[\sum_{j=1}^{J}(\sigma^i_j)^2 + \sum_{j=1}^{J}(\mu^i_j)^2 - \alpha\sum_{j=1}^{J}1 + \alpha\sum_{j=1}^{J}\log\alpha - \alpha\log\Big(\prod_{j=1}^{J}(\sigma^i_j)^2\Big)\Big]$$
$$= \frac{1}{2\alpha}\sum_{j=1}^{J}\Big[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log((\sigma^i_j)^2)\Big] \quad (14)$$
Now we apply the KL-divergence in (14) to (7). The
negative log likelihood loss in (7) is replaced by the MSE between
$x_i$ and its reconstruction $\hat{x}_i$ since we will apply our models
only on real-valued datasets. The objective function given
at (7) can be rewritten as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 - \frac{1}{2\alpha}\sum_{j=1}^{J}\Big[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log((\sigma^i_j)^2)\Big] \quad (15)$$
The prior can be seen as a Dirac delta distribution because
$\alpha$ is very small. Thus, this VAE is named the Dirac delta Variational
Autoencoder (DVAE). Maximizing (15) is equivalent
to minimizing its KL-divergence and RE components. We
introduce a parameter $\lambda$ to control the trade-off between
the two components in (15). The objective function can be
rewritten in the form of the loss function of DVAE as follows:

$$L_{DVAE}(\theta, \phi; x_i) = \frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 + \lambda\,\frac{1}{2\alpha}\sum_{j=1}^{J}\Big[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log((\sigma^i_j)^2)\Big] \quad (16)$$
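The sketch below assembles the DVAE loss in (16) from the KL term in (14) and the MSE term, with $\lambda = 0.05$ and $\alpha = 10^{-8}$ as used later in Section V. The stand-in tensors for the encoder/decoder outputs and the PyTorch framework are assumptions for illustration only; the practical $\log_{10}$ scaling of the KL term discussed in Section V is omitted.

# Sketch of the DVAE loss in (16): MSE reconstruction plus lambda times the
# KL-divergence of (14) against the narrow prior N(0, alpha*I).
import math
import torch

def dvae_kl(mu, logvar, alpha=1e-8):
    # (1/(2*alpha)) * sum_j [sigma_j^2 + mu_j^2 - alpha + alpha*log(alpha) - alpha*log(sigma_j^2)]
    var = logvar.exp()
    per_dim = var + mu.pow(2) - alpha + alpha * math.log(alpha) - alpha * logvar
    return per_dim.sum(dim=1) / (2.0 * alpha)

def dvae_loss(x, x_hat, mu, logvar, lam=0.05, alpha=1e-8):
    re = ((x - x_hat) ** 2).sum(dim=1).mean()
    kl = dvae_kl(mu, logvar, alpha).mean()
    return re + lam * kl                          # loss to be minimized

x = torch.randn(64, 20)                           # stand-in mini-batch
x_hat = torch.randn(64, 20, requires_grad=True)   # stand-in decoder output
mu = torch.zeros(64, 5, requires_grad=True)       # stand-in encoder outputs
logvar = torch.zeros(64, 5, requires_grad=True)
loss = dvae_loss(x, x_hat, mu, logvar)
loss.backward()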
V. EVALUATION AND DISCUSSION
This section evaluates the SAE and DVAE algorithms for
constructing data representations that improve the performance
of anomaly detection algorithms. This is demonstrated
by the experimental results produced from five simple one-
class classification (OCC) algorithms LOF, CEN, KDE, MDIS,
OCSVM using the latent representations of SAE and DVAE on
fourteen problems. In order to highlight the strengths of SAE
and DVAE, the results are also compared to those from: (1)
the stand-alone OCCs (without any AE latent representation),
(2) the OCCs using the latent representations of a denoising
AE (DAE) and a VAE, and (3) the RE-based OCC. For
measuring the accuracy of the models, we evaluate the area
under the resulting ROC curve (AUC) by trying many different
thresholds, and create a confusion matrix by choosing only one
threshold. A number of experiments and analysis for exploring
different aspects of the latent representations of SAE and
DVAE are carried out as follows:
• Evaluate the effect of dimensionality and sparsity on
the classification accuracy of the OCCs using the latent
representations given by SAE and DVAE.
• Explore the effect on classification accuracy of OCSVM
and LOF of their parameters $\nu$, $\gamma$, and k. Investigate the
distribution of latent vectors on normal and anomaly data.
• Measure the effect of training size on the AUCs and query
time created by SAE-OCCs and DVAE-OCCs.
• Evaluate the AUCs from the OCCs on specific categories
of attack types in NSL-KDD and UNSW-NB15.
A. Experiments
1) Datasets: The experiments are conducted on fourteen
datasets including network problems as shown in Table I. The
eight network datasets are mostly well-known problems in the
domain of network security. Although the main objective is
to cope with the challenges arising in high-dimensional net-
work data, the models are also evaluated on six non-network
datasets from the UCI Machine Learning Repository [38].
This is because we intend to evaluate the performance of
our models on a diversity of data, and expect to emphasize
their strength on high-dimensional network-related datasets.
The normal traffic in CTU13, UNSW-NB15 and NSL-KDD
is considered as normal data, whereas all the attacks are
treated as anomalies. In PenDigits, the digits ‘0’ and ‘2’ are
chosen as the normal and anomalous classes respectively. For
GLASS, window glass is considered as the normal class, and
other classes as the anomalous class. In the other datasets, the
normal and anomalous classes are indicated following [39].
The CTU13 is a publicly available botnet dataset provided
in 2011 [40]. The data covers a wide range of real-world
botnet traffic mixed with normal traffic and background traf-
fic (unlabeled data). The CTU13 consists of thirteen botnet
scenarios, and each of them involves a specific type of
malware. We choose four scenarios in CTU13, and split each
of them into 40% for training (normal traffic) and 60% for
evaluating (normal and botnet traffic) following [41]. We use
most of the 14 features in CTU13 except source/destination
IP addresses. Three categorical features, protocol, sTos and
dTos, are encoded by one-hot-encoding, which results in higher
dimensional versions of these scenarios.
TABLE I
FOURTEEN DATASETS FOR EVALUATING THE PROPOSED MODELS

Dataset              Dimension⁴  Training set  Normal Test  Anomaly Test
PageBlocks                   10          3930          983           112
WPBC                         32           118           30            10
PenDigits                    16           780          363           364
GLASS                         9           130           33            11
Shuttle                       9          3410        11478          3022
Arrhythmia                  259           189           48            37
Rbot (CTU13-10)              38          6338         9509         63812
Murlo (CTU13-8)              40         29128        43694          3677
Neris (CTU13-9)              41         11986        17981        110993
Virut (CTU13-13)             40         12775        19164         24002
Spambase                     57          2230          558           363
UNSW-NB15⁵                  196         56000        37000         45332
NSL-KDD⁵                    122         67343         9711         12833
InternetAds                1558          1582          396            77
NSL-KDD is a filtered version of the KDD’99 dataset [42],
which was suggested to address the inherent issues mentioned
in [43]. Although NSL-KDD still suffers from some problems
discussed in [44], it can be reasonable to use the data as
an effective benchmark for comparing anomaly detection
algorithms in this work due to the shortage of public intrusion
data. Each 41-feature record in NSL-KDD is labeled as either
normal or a specific attack group in the four main categories:
Denial of Service (DoS), Remote to Local (R2L), User to
Local (U2R) and Probe. NSL-KDD consists of two parts:
KDDTrain+ and KDDTest+, which are drawn from different
distributions (an additional 14 types of attacks appear in KDDTest+
only). Three categorical features, protocol type, service and
flag, are preprocessed by one-hot-encoding which increases
the number of features to 122.
UNSW-NB15 has been recently provided and is expected to
address the inherent issues in the KDD’99 dataset and NSL-
KDD [45]. Each record comprising 47 features is labeled
either as realistic normal traffic or one of the nine modern
attack categories: Fuzzers, Analysis, Backdoor, DoS, Exploit,
Generic, Reconnaissance, Shellcode and Worm. The dataset
is decomposed into two sets, UNSW NB15 training-set and
UNSW NB15 testing-set, for training and evaluating. The
categorical attributes, such as protocol, service and state, are
preprocessed by one-hot-encoding which increases the number
of features to 196. The labelled anomalies in the training parts
of NSL-KDD and UNSW-NB15 are discarded.
PenDigits and Shuttle are already partitioned into training
and testing parts, thus we simply delete labelled anomalies
in the training parts to form training sets. For Spambase,
InternetAds, PageBlocks, WPBC, GLASS and Arrhythmia, we
take 80% of normal data for training and 20% of normal and
anomalies for testing. All datasets are normalized into [-1, 1]
since the activation function of the output layer of these AEs
is the tanh function, and missing values are discarded.
⁴The dimensions of the four CTU13 datasets, UNSW-NB15 and NSL-KDD
are those after preprocessing by one-hot-encoding.
⁵The training sets of UNSW-NB15 and NSL-KDD are much larger than
other datasets, thus we will sample a small proportion (10%) for training.
2) Parameter Settings: Anomalies are not available during
training, so cross-validation can not be used to tune hyperpa-
rameters. This is one of the major difficulties for this task.
We configure the hyperparameters of AEs and OCCs using
common values and rules of thumb, and then confirm that
performance is not sensitive to these values.
OCC Parameters: The Gaussian kernel is used for KDE and
OCSVM. The scaling parameter $\gamma$, related to the bandwidth h
by $\gamma = \frac{1}{2h^2}$, is set to a default value, $\gamma = \frac{1}{n_f}$ as in [46], where
$n_f$ is the number of input features. The trade-off parameter
$\nu$ is set to two separate values⁶, 0.1 and 0.5, referred to as
OCSVM$_{\nu=0.1}$ and OCSVM$_{\nu=0.5}$. In LOF, the number of
nearest neighbors k is chosen as 10% of the training size.
AE Parameters: The architectures of SAE and DVAE are
configured as follows: the number of hidden layers is equal to
5 as in [14], the size of the bottleneck layer m is chosen by
the rule of thumb presented in [13], $m = [1 + \sqrt{n}]$, where n is
the number of input features. The choice of mini-batch size is
dependent on the size of training sets. This is needed because
the sizes of the datasets vary by a factor of 500. For small
training sets (< 2000), we split into 20 batches; for large ones, we
set the mini-batch size to 100. We also want to provide a similar
number of batches for each iteration in training processes
which will help early-stopping work efficiently. In order to
eliminate learning rate and the number of training iterations,
we employ the Adadelta algorithm [47] together with early-
stopping techniques [48] for training these networks, which
enables the training processes to operate automatically and
avoid over-fitting. The hyperbolic tangent function is chosen
as the activation function for these AEs. Weights are initialized
following the scheme in [49].
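The rules of thumb above can be collected in a small helper, sketched below; the function names and the exact rounding choices are ours and merely illustrate the configuration described in this subsection.

# Helper sketch collecting the rules of thumb described above; names are
# illustrative, not from the paper's code.
import math

def bottleneck_size(n_features):
    # m = [1 + sqrt(n)], the rule of thumb from [13]
    return int(1 + math.sqrt(n_features))

def minibatch_size(n_train):
    # small training sets are split into 20 batches; large ones use batches of 100
    return max(1, n_train // 20) if n_train < 2000 else 100

def occ_parameters(n_features, n_train):
    return {
        "gamma": 1.0 / n_features,             # OCSVM / KDE kernel scale, gamma = 1/n_f
        "nu": (0.1, 0.5),                      # the two OCSVM settings compared
        "lof_k": max(1, int(0.1 * n_train)),   # k = 10% of the training size
    }

print(bottleneck_size(122), minibatch_size(67343), occ_parameters(122, 6734))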
In practice, the KL-divergence in the DVAE loss function
is scaled by log10 since its value is extremely large in early
epochs. The distribution of latent data before training seems to
be very similar to the standard Gaussian distribution. The prior
$p_\theta(z)$ is a Dirac delta distribution, thus the KL-divergence is
very large, especially at early iterations of the training process.
Fig. 3 (also Fig. 5 in the supplementary material) illustrates
the distribution of latent data (the first feature z0) during the
training process. Therefore, the log10 scaling is expected to
reduce the domination of this term on the loss function.
Fig. 3. Histogram of latent data (the first feature z0) during the training of
DVAE ($\alpha = 10^{-8}$) on Spambase.
SAE and DVAE are trained to minimize the loss functions
in (10) and (16) by an adaptive SGD algorithm (Adadelta) as in
the training of MLPs. We do not apply a pretraining procedure
for these networks since modern back-propagation methods
(weight initialization [49] and Adadelta [47]), together with
⁶This is expected to show the influence of $\nu$ on the performance of OCSVM.
the new regularization terms, are expected to encourage the
networks to learn the parameters in hidden layers effectively.
Early stopping is controlled by two parameters. Training will
terminate when the loss does not improve by an absolute value
of $10^{-3}$ for t iterations, where t is calculated as 2000 / number
of batches (where number of batches is already defined in
this section). Note that only normal data is employed for the
training process.
We use the same model selection for setting up a five hidden
layer DAE and a five hidden layer VAE⁷. However, the DAE
is trained in greedy layer-wise fashion following the original
scheme proposed in [20], [21]. In the pretraining procedure,
each single denoising autoencoder is trained to minimize MSE
between the reconstruction formed from a corrupted version⁸
of the input, and the original input. This is optimized by
SGD with a common value for the learning rate, $10^{-2}$, and 200
iterations⁹ to initialize weights and biases for the DAE. The
DAE and VAE are then fine-tuned (end-to-end) as in the
training of SAE and DVAE.
Estimating $\lambda$: This is carried out for estimating the parameter
$\lambda$ in the loss functions of SAE (10) and DVAE (16). The
regularizers (shrink in SAE and KL-divergence in DVAE)
force normal datapoints as close together as possible at the
origin, whereas the reconstruction loss attempts to keep them
from overlapping in order to reconstruct them at the output
layer. The two components tend to conflict with each other.
Thus, an appropriate value of $\lambda$ should be chosen to
balance the two components. However, anomalous data is not
available for tuning $\lambda$ or determining the number of training
iterations in order to avoid overfitting. According to [50], there
are three phases in the training process of a feed-forward
network. The generalization error includes two components
called approximation error and complexity error. In the first
phase, the approximation error dominates the complexity error,
and the generalization error decreases gradually. In phase 2,
these components are approximately balanced, and the gener-
alization error continues to decrease further. The complexity
error is increasingly large after phase 2, and dominates the
approximation error due to large network weights, which can
lead to oscillation and high generalization errors (phase 3).
Thus, the training process should be stopped in phase 2.
Therefore, we investigate these loss functions and their two
components on five values, $\lambda_{SAE} \in \{0.1, 1, 5, 10, 50\}$ and
$\lambda_{DVAE} \in \{0.001, 0.01, 0.05, 0.1, 0.5\}$, on four datasets over
1000 epochs. Firstly, we observe three phases on the SAE
training error curves. The larger the value of $\lambda$, the longer
phase 2 will last, which makes it easy to choose early stopping
parameters. When $\lambda$ is large (about 10), phase 2 is longer, but
$\lambda = 50$ makes the training error less stable in phase 2. $\lambda = 10$
seems to be a good value which allows us to choose common
values for the early stopping parameters. When we apply early
stopping with $\lambda_{SAE} = 10$, we see that the stopping point is
⁷Equation (9) is rewritten in the form of the VAE loss function since the
VAE is trained under the same training scheme as DVAE: $L_{VAE}(\theta, \phi; x_i) =
\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 + \lambda\,\frac{1}{2}\sum_{j=1}^{J}\big[(\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log((\sigma^i_j)^2)\big]$.
⁸It is obtained by randomly setting 10% of the input features to zero.
⁹There is no need for using early-stopping here since this is aimed to
initialize weights and biases to be close to a good solution.
mostly in phase 2. We also observe AUC curves, and the early
stopping appears to perform well. Even though the AUCs are very good
at the first few epochs on some datasets, we are not using
AUCs to choose $\lambda$. Similarly, we choose $\lambda_{DVAE} = 0.05$. For
brevity we present only the curves of SAE on CTU13-10 with
$\lambda_{SAE} = 10$ in Fig. 4, and on the four datasets in Figs. 1–4 in
the supplementary material.
Fig. 4. SAE loss function and its components (RE and Shrink losses) (w.r.t the
left y-axis), and the AUCs created by SAE-LOF, SAE-CEN and SAE-OCSVM
(w.r.t the right y-axis) during the training process of SAE on CTU13-10.
3) Main experiments: The bottleneck layers of the trained
DAE, VAE, SAE and DVAE are used as latent representa-
tions for six one-class classifiers LOF, CEN, MDIS, KDE,
OCSVM$_{\nu=0.1}$ and OCSVM$_{\nu=0.5}$. We use the terms DAE-
OCCs, VAE-OCCs, SAE-OCCs, and DVAE-OCCs to refer to
the six one-class classifiers when using the latent representa-
tions of DAE, VAE, SAE and DVAE respectively. The REs of
these AEs are also used as anomaly scores, producing four
further RE-based classifiers. The performance of these stand-
alone one-class classifiers on original data are considered as
baselines. All experiments are implemented in Python 2.7
and run on a machine with an Intel Core 2 Duo i5-3360M
CPU at 2.8 GHz, 8 GB RAM and RAM frequency of 1600
MHz, and the implementation of our algorithms is available on
GitHub (https://github.com/vanloicao/SAEDVAE). The OCCs
provided by scikit-learn are employed [46]. The main results
are shown in Table II.
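As an illustration of the evaluation pipeline, the sketch below encodes data with a trained bottleneck, fits one OCC on the normal latent vectors and computes the AUC with scikit-learn's roc_auc_score; the encode placeholder, the toy data and the identity encoder in the usage example are assumptions, not our trained SAE/DVAE models.

# Sketch of the hybrid evaluation: encode with a trained AE bottleneck, fit a
# one-class classifier on the (normal) training latents, and compute the AUC.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

def evaluate_hybrid(encode, X_train, X_test, y_test, nu=0.1):
    z_train = encode(X_train)                    # latent representation of normal data
    z_test = encode(X_test)
    occ = OneClassSVM(kernel="rbf", nu=nu, gamma=1.0 / z_train.shape[1])
    occ.fit(z_train)
    scores = -occ.decision_function(z_test)      # larger = more anomalous
    return roc_auc_score(y_test, scores)         # y_test: 1 = anomaly, 0 = normal

# toy usage with an identity "encoder" as a stand-in
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 5))
X_test = np.vstack([rng.normal(size=(50, 5)), rng.normal(loc=4.0, size=(50, 5))])
y_test = np.array([0] * 50 + [1] * 50)
print(evaluate_hybrid(lambda x: x, X_train, X_test, y_test))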
B. Analysis and discussion
Discussion: Table II presents the AUCs achieved by DAE-
OCCs, VAE-OCCs, SAE-OCCs and DVAE-OCCs, and their
corresponding RE-based classifiers from the 2nd to the 5th
rows respectively. The results created by the six stand-alone
one-class classifiers are shown in the first row. Each column
represents the AUCs created by a number of classifiers on the
same problem. We use gray-scale to present the performance
of these classifiers on each dataset. In each column, the highest
AUC is highlighted by the lightest gray. The fourteen datasets
are arranged in ascending sparsity order.
Table II shows that when working on the latent repre-
sentations produced by SAE and DVAE, the six one-class
classifiers perform better in terms of classification accuracy
than those using DAE, VAE or stand-alone OCCs on the eight
network-related datasets. These datasets are typically very
high-dimensional and sparse, such as InternetAds with 1558
features. This suggests that the latent representations produced
by SAE and DVAE facilitate these one-class classifiers in deal-
ing with high-dimensional and sparse network-related datasets.
However, VAE-OCCs produce relatively poor performance.
This can be explained as follows: the VAE regularizer has less
influence on learning the representation since the latent data
is already in a good shape before training (see Fig.3). Thus,
most of the representation power of the VAE may be used
for reconstruction. Moreover, normal data resides in a large
region that may give more “room” for anomalies to appear
inside the region. The normal data is also not forced on the
non-saturated part of the activation function.
The hybrid SAE-OCCs and DVAE-OCCs also yield very
similar AUCs on each network-related dataset, even though
these one-class classifiers originate from different algorithms,
and their parameters (e.g. ⌫) are set to different values. This
is clear to see in the 4th and 5th rows, where sparsity > 0.50.
This implies that SAE and DVAE may constrain normal data
in their latent representations in a well-shaped distribution that
is straightforward for these classification algorithms to capture
normal behaviors, and less sensitive to parameter settings.
Moreover, SAE-OCCs and DVAE-OCCs produce comparable
or superior AUCs in comparison to the RE-based DAE classi-
fier on the network-related datasets, especially for high sparsity
and dimensionality. The influence of OCC parameters and the
distribution of latent vectors are explored later.
The influence of dimensionality and sparsity: We next inves-
tigate the influence of sparsity and dimensionality of data on
the classification accuracy produced from hybrid DAE-OCCs,
SAE-OCCs and DVAE-OCCs. We use the term AUC-DIFF to
refer to the difference in AUC between a classifier (e.g. LOF)
on the original data and on the data encoded by an AE. A
positive value of AUC-DIFF indicates an improvement due to
the AE encoding. AUC-DIFF is plotted against sparsity and
dimensionality of datasets shown in Fig. 5(a) and Fig. 5(b).
It can be seen from Fig. 5(a) that there is a clear increasing
trend in the AUC-DIFF lines of SAE-OCCs and DVAE-OCCs,
while the AUC-DIFF graph of DAE-OCCs tends to decrease.
Similar patterns can also be found when investigating the
influence of dimensionality, shown in Fig. 5(b). The ranking of
datasets by sparsity is similar to the ranking by dimensionality,
therefore these two pieces of evidence are partly overlapping.
The conclusion is that the benefit of the new AE encodings
is greater for sparse, high-dimension datasets, whereas the
benefit of the existing DAE encoding is greater for small, non-
sparse datasets.
The influence of OCC parameters: This is to assess the
influence of the OCC parameters $\nu$, $\gamma$, and k on the
classification accuracy of OCSVM and
LOF when using the latent representations of DAE, SAE
and DVAE. The parameter $\gamma$ is fixed at $\frac{1}{n_f}$ when
investigating $\nu$, whereas $\nu$ is set to 0.1 when examining $\gamma$.
Each of these parameters is examined on fifty different values,
$\nu \in [0.01, 0.5]$ and $\gamma \in [2\times10^{-4}, 2\times10^{4}]$. We plot AUCs from
DAE-OCSVM, SAE-OCSVM and DVAE-OCSVM against $\nu$
in Fig. 6(a), and against $\gamma$ in Fig. 6(b). The figures show that
the AUC curves of SAE-OCSVM and DVAE-OCSVM tend to
be stable while those of DAE-OCSVM vary according to the
values of $\nu$ or $\gamma$. This implies that the latent representations
TABLE II
AUCS FROM THE STAND-ALONE ONE-CLASS CLASSIFIERS, HYBRID DAE-OCCS, SAE-OCCS AND DVAE-OCCS, AND THE RE-BASED CLASSIFIERS.
Represen-
-tation
Methods
One-class
Classifiers
Datasets (Sparsity)
P
a
g
e
B
lo
c
k
s
(0
.0
0
)
W
P
B
C
(0
.0
2
)
P
e
n
D
ig
it
s
(0
.1
3
)
G
L
A
S
S
(0
.1
8
)
S
h
u
tt
le
(0
.2
2
)
A
rr
h
y
th
m
ia
(0
.5
0
)
C
T
U
1
3
-1
0
(0
.7
1
)
C
T
U
1
3
-0
8
(0
.7
3
)
C
T
U
1
3
-0
9
(0
.7
3
)
C
T
U
1
3
-1
3
(0
.7
3
)
S
p
a
m
b
a
s
e
(0
.8
1
)
U
N
S
W
-N
B
1
5
(0
.8
4
)
N
S
L
-K
D
D
(0
.8
8
)
In
te
rn
e
tA
d
s
(0
.9
9
)
Stand-alone
LOF 0.971 0.600 0.995 0.972 0.984 0.788 0.902 0.899 0.955 0.963 0.751 0.745 0.793 0.762
CEN 0.944 0.580 0.966 0.961 0.881 0.816 0.996 0.971 0.915 0.916 0.816 0.738 0.955 0.816
MDIS 0.927 0.640 0.962 0.970 0.898 0.786 0.998 0.966 0.734 0.891 0.731 0.801 0.929 0.694
KDE 0.928 0.637 0.961 0.967 0.883 0.787 0.998 0.958 0.720 0.889 0.731 0.800 0.924 0.693
OCSVM⌫=0.5 0.934 0.610 0.961 0.961 0.863 0.794 0.998 0.958 0.851 0.925 0.736 0.807 0.935 0.704
OCSVM⌫=0.1 0.934 0.557 0.968 0.832 0.760 0.807 0.983 0.797 0.852 0.898 0.736 0.792 0.890 0.710
DAE
LOF 0.933 0.553 0.997 0.931 0.985 0.654 0.751 0.896 0.891 0.793 0.392 0.736 0.662 0.476
CEN 0.922 0.693 0.964 0.959 0.931 0.738 0.972 0.949 0.628 0.730 0.476 0.743 0.881 0.337
MDIS 0.905 0.700 0.950 0.994 0.901 0.707 0.981 0.960 0.653 0.855 0.466 0.765 0.888 0.342
KDE 0.903 0.690 0.954 0.992 0.892 0.706 0.980 0.939 0.616 0.857 0.460 0.756 0.861 0.335
OCSVM⌫=0.5 0.912 0.630 0.958 0.989 0.885 0.665 0.981 0.938 0.655 0.711 0.454 0.690 0.854 0.325
OCSVM⌫=0.1 0.920 0.557 0.976 0.606 0.762 0.668 0.937 0.775 0.702 0.332 0.578 0.536 0.697 0.314
RE-Based 0.969 0.540 0.997 0.986 0.821 0.824 0.998 0.988 0.943 0.972 0.805 0.873 0.959 0.842
VAE
LOF 0.512 0.480 0.549 0.444 0.489 0.479 0.490 0.499 0.507 0.500 0.509 0.505 0.501 0.474
CEN 0.514 0.497 0.549 0.526 0.489 0.461 0.490 0.500 0.507 0.499 0.507 0.504 0.501 0.472
MDIS 0.509 0.517 0.553 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467
KDE 0.509 0.527 0.554 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467
OCSVM⌫=0.5 0.510 0.517 0.555 0.521 0.490 0.484 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.466
OCSVM⌫=0.1 0.515 0.537 0.553 0.537 0.491 0.466 0.490 0.498 0.507 0.499 0.505 0.505 0.501 0.463
RE-Based 0.928 0.657 0.959 0.961 0.883 0.784 0.998 0.957 0.698 0.881 0.734 0.801 0.923 0.694
SAE
= 10
LOF 0.954 0.607 0.996 0.959 0.817 0.762 1.000 0.983 0.960 0.975 0.813 0.894 0.937 0.943
CEN 0.964 0.610 0.995 0.915 0.800 0.754 0.999 0.991 0.950 0.969 0.835 0.886 0.963 0.935
MDIS 0.967 0.603 0.996 0.898 0.794 0.757 0.999 0.990 0.950 0.968 0.826 0.887 0.964 0.936
KDE 0.967 0.607 0.996 0.884 0.783 0.756 0.999 0.990 0.949 0.968 0.825 0.886 0.964 0.934
OCSVM⌫=0.5 0.967 0.610 0.996 0.876 0.773 0.756 0.999 0.990 0.950 0.970 0.823 0.891 0.964 0.935
OCSVM⌫=0.1 0.956 0.600 0.996 0.890 0.781 0.740 0.999 0.988 0.944 0.971 0.825 0.893 0.961 0.933
RE-Based 0.929 0.637 0.959 0.959 0.884 0.787 0.997 0.958 0.720 0.888 0.734 0.800 0.925 0.690
DVAE (λ = 0.05, α = 10⁻⁸)
LOF 0.908 0.327 0.987 0.705 0.841 0.807 0.999 0.978 0.954 0.973 0.810 0.876 0.958 0.900
CEN 0.906 0.450 0.988 0.774 0.849 0.777 0.999 0.982 0.956 0.963 0.809 0.879 0.960 0.892
MDIS 0.914 0.437 0.987 0.749 0.810 0.794 0.999 0.984 0.957 0.964 0.806 0.873 0.961 0.883
KDE 0.917 0.430 0.987 0.749 0.802 0.796 0.999 0.985 0.957 0.964 0.806 0.872 0.961 0.882
OCSVM ν=0.5 0.920 0.450 0.988 0.769 0.802 0.797 0.999 0.987 0.957 0.974 0.808 0.872 0.961 0.882
OCSVM ν=0.1 0.922 0.460 0.988 0.791 0.804 0.780 0.999 0.988 0.956 0.973 0.817 0.872 0.959 0.881
RE-Based 0.928 0.640 0.958 0.953 0.880 0.785 0.998 0.922 0.715 0.836 0.734 0.803 0.924 0.694
Fig. 5. The influence of sparsity (a) and dimensionality (b) on the AUCs produced by six one-class classifiers using latent representations of DAE, SAE and
DVAE. The visualization of the latent data (the first two features z0 and z1) created by DAE, SAE and DVAE (c) on CTU13-10.
This implies that the latent representations of SAE and DVAE make OCSVM perform consistently over
a wide range of ν and γ values.
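As a rough guide to reproducing this kind of sensitivity sweep, the following is a minimal sketch (not the authors' exact script) using scikit-learn's OneClassSVM with γ fixed while ν is varied over fifty values; the arrays z_train, z_test and y_test are hypothetical stand-ins for encoded latent data and 0/1 anomaly labels.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical stand-ins for encoded data: normal latent codes near the origin,
# anomalies further away.
z_train = rng.normal(0.0, 0.1, size=(500, 2))
z_test = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
                    rng.normal(1.5, 0.3, size=(50, 2))])
y_test = np.r_[np.zeros(200), np.ones(50)]            # 1 = anomaly

aucs = []
for nu in np.linspace(0.01, 0.5, 50):                 # fifty nu values, as in the experiment
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=0.1).fit(z_train)
    scores = -clf.decision_function(z_test)           # larger score = more anomalous
    aucs.append(roc_auc_score(y_test, scores))
print(min(aucs), max(aucs))                           # a narrow range indicates insensitivity to nu

A flat AUC curve over the swept values is what the stable SAE and DVAE curves in Fig. 6(a) correspond to.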
The number of neighbors k is chosen in the range from
1% to 50% of training size. For example, if k is 10% of a
training dataset of size 200 samples, k is equal to 20. The
AUCs of hybrid DAE-LOF, SAE-LOF and DVAE-LOF are
computed, and plotted against 50 values of k as shown in
Fig. 6(c). The AUC curves of the hybrid SAE-LOF and DVAE-
LOF seem to level off within the range of k while there is
no clear trend for the AUC curve of DAE-LOF. Thus, the latent representations of SAE and DVAE make
LOF insensitive to the choice of k. More results are shown in
Fig. 6 of the supplementary material.
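The LOF sweep can be sketched in the same hedged way, with k taken as a fraction of the training size; again the data arrays below are hypothetical stand-ins rather than the paper's datasets.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
z_train = rng.normal(0.0, 0.1, size=(500, 2))          # hypothetical encoded normal data
z_test = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
                    rng.normal(1.5, 0.3, size=(50, 2))])
y_test = np.r_[np.zeros(200), np.ones(50)]

for frac in np.linspace(0.01, 0.5, 50):                # k from 1% to 50% of training size
    k = max(1, int(frac * len(z_train)))
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True).fit(z_train)
    scores = -lof.decision_function(z_test)            # larger score = more anomalous
    print(f"k={k}: AUC={roc_auc_score(y_test, scores):.3f}")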
These experiments confirm that the one-class classifiers,
such as OCSVM and LOF, perform consistently on wide
ranges of parameter settings when using the latent represen-
tations of SAE and DVAE. This can be explained by: (1)
normal data is represented by well-shaped (Gaussian) distributions, confined to a small region highly
isolated from the regions where anomalies are expected to appear; (2)
the normal data from different sources will have a similar
representation. Fig. 5(c) is a typical example (also Fig. 7 in
the supplementary material). Therefore, OCSVM and LOF can
model normal data very well even though these classifiers use
few datapoints for support vectors in OCSVM (e.g. ⌫ = 0.01)
or for nearest neighbors in LOF (e.g. k = 1% training size).
This happens on several datasets.
The influence of training size: We investigate the influence
of training size on the latent representations of SAE and
DVAE for anomaly detection tasks. Four datasets of more than
10000 training instances are chosen for this experiment, namely CTU13-09, CTU13-13, NSL-KDD and UNSW-NB15. Each
dataset is sub-sampled multiple times (sizes ranging from 500
to 10000) to give smaller training set sizes for this experiment.
Model selection is used as described in Subsection V-A2. The
AUCs and query times produced from the hybrid SAE-OCCs
and DVAE-OCCs are plotted against these training sizes as
shown in Fig. 8 and Fig. 9 in the supplementary material. The
results clearly show that the six one-class classifiers produce
very similar AUCs amongst the five sizes on the same dataset.
This suggests that the representation models, SAE and DVAE,
tend to be consistent on a wide range of training sizes, and
are less sensitive to training size than the hybrid DBN-OCCs
in [14, see Fig. 5]. This is a positive result because it appears
that excessive amounts of data are not required to make this
method perform well. In terms of the complexity at query time,
CEN outperforms the other OCCs, and its query time does not scale with training size.
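A minimal sketch of this sub-sampling experiment is given below, assuming OCSVM as the one-class classifier and synthetic latent arrays as hypothetical stand-ins; the loop records both AUC and query time for each training size.

import time
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
z_normal = rng.normal(0.0, 0.1, size=(10000, 2))       # hypothetical encoded normal data
z_test = np.vstack([rng.normal(0.0, 0.1, size=(2000, 2)),
                    rng.normal(1.5, 0.3, size=(500, 2))])
y_test = np.r_[np.zeros(2000), np.ones(500)]

for size in (500, 1000, 2000, 5000, 10000):            # sub-sampled training sizes
    idx = rng.choice(len(z_normal), size=size, replace=False)
    clf = OneClassSVM(nu=0.1, gamma=0.1).fit(z_normal[idx])
    t0 = time.perf_counter()
    scores = -clf.decision_function(z_test)            # query stage
    query_time = time.perf_counter() - t0
    print(size, round(roc_auc_score(y_test, scores), 3), f"{query_time:.4f}s")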
Specific kinds of attacks: Our representation models are also
examined on the thirteen specific attack groups in NSL-KDD
and UNSW-NB15 as shown in Table III. This table has a
similar structure to Table II, without arrangement according to
sparsity. In general, the hybrid SAE-OCCs and DVAE-OCCs
produce large improvements in classification accuracy compared to their baselines on most of the attack
groups, especially on the groups where the baseline is already good. This reflects a common theme in
classification methods.
Moreover, the performance of SAE-CEN is evaluated on
NSL-KDD by a confusion matrix as shown in Table IV.
The confusion matrix is not the same as in the multi-class
classification problem. This is because the classifiers built from
only normal data use a threshold to classify unseen data into
either the normal or anomalous class. This means that we cannot measure the misclassification of a
normal datapoint as a specific attack group, or of one attack group as another. Therefore, precision
values are only computed for normal and anomaly in the table. In this work, the threshold is set so that
90% of the normal training data is correctly classified.
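This thresholding rule can be sketched in a few lines; the score arrays below are hypothetical stand-ins for the anomaly scores produced by the trained model on normal training data and on unseen data.

import numpy as np

rng = np.random.default_rng(0)
train_scores = rng.gamma(2.0, 1.0, size=5000)    # hypothetical scores of normal training data
test_scores = rng.gamma(2.0, 1.0, size=1000)     # hypothetical scores of unseen data

threshold = np.percentile(train_scores, 90)      # 90% of normal training scores fall below this
is_anomaly = test_scores > threshold             # points above the threshold are flagged as anomalies
print(threshold, is_anomaly.mean())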
TABLE IV
CONFUSION MATRIX OF THE HYBRID SAE-CEN ON NSL-KDD

                         Actual class
Prediction     Normal   Probe     DoS     R2L    U2R   Precision
Normal           8658       3     601     848     10       85.6%
Anomaly          1053    2418    6857    2039     57       91.5%
Recall          89.2%   99.9%   91.9%   70.6%  85.1%       88.8%

Note: the values in bold are correctly classified.
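As a check on how the summary columns of Table IV follow from the counts, for example:

Precision(Normal)  = 8658 / (8658 + 3 + 601 + 848 + 10) = 8658 / 10120 ≈ 85.6%
Recall(Normal)     = 8658 / (8658 + 1053) ≈ 89.2%
Overall accuracy   = (8658 + 2418 + 6857 + 2039 + 57) / 22544 = 20029 / 22544 ≈ 88.8%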
In terms of classification accuracy, the performance of these one-class classification algorithms is
comparable when the encoding is good (e.g. the encoding of SAE and DVAE). When
considering computational complexity, CEN, which is a sim-
ple method without hyperparameters, is very computationally
efficient at both modeling and querying. Thus, it is nominated
as the best model in our experiments.
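To illustrate why CEN is so cheap at query time, a centroid-style scorer over latent codes needs only a norm computation per query. The sketch below is an interpretation rather than the authors' exact implementation: it scores a point by its distance from the origin, where the proposed regularizers place normal data, and reuses the 90% thresholding rule described above.

import numpy as np

def cen_score(z):
    # Distance of each latent point from the origin; normal codes are pushed
    # towards the origin by the SAE/DVAE regularizers, so larger distances
    # indicate more anomalous points.
    return np.linalg.norm(z, axis=1)

rng = np.random.default_rng(0)
z_train = rng.normal(0.0, 0.1, size=(500, 2))          # hypothetical encoded normal data
threshold = np.percentile(cen_score(z_train), 90)      # 90% of normal scores below threshold
z_query = rng.normal(1.5, 0.3, size=(5, 2))            # hypothetical encoded query points
print(cen_score(z_query) > threshold)                  # True => predicted anomaly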
VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed latent representation models,
SAE and DVAE, which help anomaly detection methods
to cope with high-dimensional and sparse network datasets.
Classical AEs do not bring data to a “nice” distribution by
themselves, and the distribution they create is arbitrary. In the
tasks where we rely on good behavior of the encoding, we have
to control the distribution. Even the standard VAE regularization, which does control the distribution,
does not put the network "under pressure" to use all of its representational
power to represent normal data. Our approaches do so, forcing
normal data into a very tight area centered at the origin in
the non-saturating area of the bottleneck unit activations. This
helps AEs trained on normal data to keep normal datapoints
close to the origin and push anomalies far away.
We have demonstrated that the latent representation created by
our models helps well-known anomaly detection algorithms
to perform efficiently and consistently on high-dimensional
and sparse network data, even with relatively few training
points. Amongst these algorithms, CEN is very computationally efficient and can feasibly be run in real time.
More importantly, the representation reduces the difficulty of
model selection for these algorithms since their performance
is insensitive to a wide range of hyperparameter settings.
In future work, we propose to investigate latent representations using Gaussian mixture models. We also
plan to propose an alternative method for estimating the hyperparameter λ in the loss functions of SAE
and DVAE, possibly using multi-objective optimization.
Fig. 6. The influence of ν (a) and γ (b), and k (c) on the performance of OCSVM and LOF respectively when using the latent representations of DAE, SAE
and DVAE on CTU13-13.
TABLE III
AUCS FROM THE CLASSIFIERS MENTIONED IN TABLE II ON SPECIFIC ATTACK GROUPS OF NSL-KDD AND UNSW-NB15.
Rows: representation method and one-class classifier. Columns: NSL-KDD — Probe, DoS, R2L, U2R;
UNSW-NB15 — Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, Worms.
Stand-alone
LOF 0.752 0.796 0.821 0.703 0.455 0.635 0.597 0.614 0.670 0.984 0.436 0.354 0.614
CEN 0.974 0.957 0.933 0.934 0.576 0.732 0.748 0.723 0.633 0.895 0.555 0.508 0.676
MDIS 0.986 0.949 0.831 0.885 0.596 0.890 0.900 0.843 0.660 0.969 0.636 0.583 0.679
KDE 0.985 0.945 0.820 0.871 0.601 0.883 0.893 0.840 0.658 0.969 0.639 0.591 0.684
OCSVM ν=0.5 0.986 0.957 0.838 0.905 0.652 0.855 0.876 0.845 0.733 0.920 0.658 0.603 0.784
OCSVM ν=0.1 0.958 0.936 0.714 0.789 0.576 0.712 0.733 0.746 0.731 0.961 0.555 0.469 0.853
DAE
LOF 0.620 0.666 0.690 0.509 0.473 0.609 0.560 0.588 0.626 0.985 0.462 0.420 0.561
CEN 0.984 0.926 0.680 0.755 0.551 0.788 0.799 0.744 0.571 0.927 0.626 0.608 0.606
MDIS 0.966 0.912 0.761 0.746 0.565 0.818 0.828 0.770 0.588 0.955 0.644 0.606 0.651
KDE 0.964 0.904 0.666 0.743 0.563 0.799 0.809 0.751 0.571 0.949 0.646 0.614 0.642
OCSVM ν=0.5 0.982 0.917 0.584 0.795 0.580 0.770 0.798 0.732 0.499 0.827 0.671 0.618 0.732
OCSVM ν=0.1 0.734 0.834 0.323 0.308 0.391 0.289 0.305 0.417 0.420 0.694 0.527 0.468 0.722
RE-Based 0.981 0.971 0.911 0.930 0.632 0.992 0.957 0.940 0.888 0.979 0.592 0.476 0.816
VAE
LOF 0.489 0.504 0.511 0.488 0.503 0.487 0.522 0.494 0.505 0.501 0.489 0.500 0.464
CEN 0.488 0.504 0.511 0.489 0.504 0.487 0.522 0.494 0.506 0.502 0.488 0.501 0.468
MDIS 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465
KDE 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465
OCSVM ν=0.5 0.489 0.503 0.512 0.489 0.504 0.487 0.523 0.494 0.504 0.501 0.489 0.499 0.464
OCSVM ν=0.1 0.489 0.504 0.511 0.490 0.504 0.487 0.522 0.494 0.505 0.501 0.489 0.499 0.462
RE-Based 0.985 0.945 0.818 0.871 0.605 0.882 0.893 0.840 0.660 0.968 0.642 0.598 0.686
SAE (λ = 10)
LOF 0.964 0.952 0.877 0.920 0.683 0.993 0.963 0.942 0.884 0.992 0.706 0.645 0.909
CEN 0.985 0.971 0.925 0.953 0.646 0.984 0.961 0.952 0.902 0.989 0.625 0.567 0.910
MDIS 0.988 0.971 0.926 0.950 0.629 0.994 0.961 0.952 0.909 0.988 0.646 0.573 0.909
KDE 0.988 0.971 0.925 0.949 0.623 0.993 0.961 0.952 0.909 0.988 0.642 0.559 0.906
OCSVM ν=0.5 0.987 0.972 0.923 0.948 0.632 0.994 0.965 0.956 0.917 0.988 0.656 0.579 0.907
OCSVM ν=0.1 0.987 0.973 0.912 0.908 0.648 0.994 0.967 0.957 0.921 0.988 0.642 0.554 0.902
RE-Based 0.985 0.946 0.822 0.872 0.601 0.881 0.891 0.838 0.657 0.969 0.640 0.592 0.685
DVAE (λ = 0.05, α = 10⁻⁸)
LOF 0.977 0.974 0.896 0.934 0.635 0.996 0.956 0.949 0.898 0.990 0.537 0.457 0.895
CEN 0.983 0.971 0.915 0.929 0.605 0.995 0.958 0.941 0.882 0.990 0.666 0.603 0.881
MDIS 0.982 0.972 0.915 0.927 0.616 0.994 0.955 0.940 0.866 0.990 0.653 0.572 0.854
KDE 0.982 0.972 0.915 0.927 0.608 0.993 0.956 0.939 0.864 0.990 0.658 0.578 0.852
OCSVM ν=0.5 0.982 0.973 0.914 0.926 0.601 0.993 0.960 0.942 0.869 0.990 0.661 0.584 0.860
OCSVM ν=0.1 0.981 0.972 0.908 0.908 0.599 0.994 0.961 0.942 0.871 0.990 0.659 0.586 0.860
RE-Based 0.985 0.945 0.820 0.872 0.602 0.888 0.898 0.843 0.660 0.971 0.642 0.593 0.682
REFERENCES
[1] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly
detection techniques,” Journal of Network and Computer Applications,
vol. 60, pp. 19–31, 2016.
[2] M. Usama, J. Qadir, A. Raza, H. Arif, K.-L. A. Yau, Y. Elkhatib,
A. Hussain, and A. Al-Fuqaha, “Unsupervised machine learning for
networking: Techniques, applications and research challenges,” arXiv
preprint arXiv:1709.06599, 2017.
[3] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,”
ACM computing surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[4] V. V. Phoha, Internet security dictionary. Springer Science & Business
Media, 2007.
[5] U. Fiore, F. Palmieri, A. Castiglione, and A. De Santis, “Network
anomaly detection with the Restricted Boltzmann Machine,” Neurocom-
puting, vol. 122, pp. 13–23, 2013.
[6] K. Shafi and H. A. Abbass, “Evaluation of an adaptive genetic-based
signature extraction system for network intrusion detection,” Pattern
Analysis and Applications, vol. 16, no. 4, pp. 549–566, 2013.
[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C.
Williamson, “Estimating the support of a high-dimensional distribution,”
Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[8] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying
density-based local outliers,” in ACM SIGMOD record, vol. 29, no. 2.
ACM, 2000, pp. 93–104.
[9] S. S. Khan and M. G. Madden, “One-class classification: taxonomy of
study and review of techniques,” The Knowledge Engineering Review,
vol. 29, no. 3, pp. 345–374, 2014.
[10] A. N. Mahmood, C. Leckie, and P. Udaya, “An efficient clustering
scheme to exploit hierarchical data in network traffic analysis,” TKDE,
vol. 20, no. 6, pp. 752–767, 2008.
[11] A. Zimek, E. Schubert, and H.-P. Kriegel, “A survey on unsupervised
outlier detection in high-dimensional numerical data,” Statistical Analy-
sis and Data Mining: The ASA Data Science Journal, vol. 5, no. 5, pp.
363–387, 2012.
[12] V. L. Cao, M. Nicolau, and J. McDermott, “One-class classification for
anomaly detection with kernel density estimation and genetic program-
ming,” in EuroGP, Portugal, vol. 9594. Springer, 2016, pp. 3–18.
[13] V. L. Cao, M. Nicolau, J. McDermott et al., “A hybrid autoencoder and
density estimation model for anomaly detection,” in Parallel Problem
Solving from Nature. Springer, 2016, pp. 717–726.
[14] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, “High-
dimensional and large-scale anomaly detection using a linear one-class
SVM with deep learning,” Pattern Recognition, vol. 58, pp. 121–134,
2016.
[15] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
length and Helmholtz free energy,” in Advances in neural information
processing systems, 1994, pp. 3–10.
[16] H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptrons
and singular value decomposition,” Biological cybernetics, vol. 59, no. 4,
pp. 291–294, 1988.
[17] N. Japkowicz, C. Myers, and M. Gluck, “A novelty detection approach
to classification,” in IJCAI, 1995, pp. 518–523.
[18] S. Hawkins, H. He, G. Williams, and R. Baxter, “Outlier detection
using replicator neural networks,” in Data warehousing and knowledge
discovery. Springer, 2002, pp. 170–180.
[19] M. Sakurada and T. Yairi, “Anomaly detection using autoencoders with
nonlinear dimensionality reduction,” in Proc MLSDA. ACM, 2014, p. 4.
[20] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
2006.
[21] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for
deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554,
2006.
[22] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-
wise training of deep networks,” in Advances in neural information
processing systems, 2007, pp. 153–160.
[23] D. Rajashekar, A. N. Zincir-Heywood, and M. I. Heywood, “Smart
phone user behaviour characterization based on autoencoders and self
organizing maps,” in ICDMW. IEEE, 2016, pp. 319–326.
[24] M. P. Wand and M. C. Jones, Kernel smoothing. CRC Press, 1994.
[25] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A
comparative study of anomaly detection schemes in network intrusion
detection,” in Proc SIAM International Conference on Data Mining.
SIAM, 2003, pp. 25–36.
[26] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting
and composing robust features with denoising autoencoders,” in Proc
ICML. ACM, 2008, pp. 1096–1103.
[27] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol,
“Stacked denoising autoencoders: Learning useful representations in a
deep network with a local denoising criterion,” JMLR, vol. 11, no. 11,
pp. 3371–3408, 2010.
[28] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
preprint arXiv:1312.6114, 2013.
[29] D. M. Tax and R. P. Duin, “Support vector data description,” Machine
learning, vol. 54, no. 1, pp. 45–66, 2004.
[30] C. Bennett and K. Campbell, “A linear programming approach to novelty
detection,” Advances in neural information processing systems, vol. 13,
no. 13, p. 395, 2001.
[31] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li,
“AI2: Training a big data machine to defend,” in Proc BigDataSecurity,
HPSC, and IDS. IEEE, 2016, pp. 49–54.
[32] S. M. Erfani, M. Baktashmotlagh, S. Rajasegarar, S. Karunasekera, and
C. Leckie, “R1SVM: A randomised nonlinear approach to large-scale
anomaly detection,” in AAAI Conference on Artificial Intelligence, 2015.
[33] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A
review and new perspectives,” PAMI, vol. 35, no. 8, pp. 1798–1828,
2013.
[34] M. Ranzato, Y.-l. Boureau, and Y. L. Cun, “Sparse feature learning for
deep belief networks,” in Advances in neural information processing
systems, 2008, pp. 1185–1192.
[35] M. Ranzato, C. Poultney, S. Chopra, and Y. L. Cun, “Efficient learning
of sparse representations with an energy-based model,” in Advances in
neural information processing systems, 2007, pp. 1137–1144.
[36] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive
auto-encoders: Explicit invariance during feature extraction,” in Proc
ICML, 2011, pp. 833–840.
[37] J. Duchi, “Derivations for linear algebra and optimization,” Berkeley,
California, 2007.
[38] M. Lichman, “UCI machine learning repository,” 2013. [Online].
Available: http://archive.ics.uci.edu/ml
[39] G. O. Campos, A. Zimek, J. Sander, R. J. Campello, B. Micenková,
E. Schubert, I. Assent, and M. E. Houle, “On the evaluation of unsu-
pervised outlier detection: measures, datasets, and an empirical study,”
Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 891–927,
2016.
[40] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, “An empirical compar-
ison of botnet detection methods,” Computers & Security, vol. 45, pp.
100–123, 2014.
[41] D. C. Le, A. N. Zincir-Heywood, and M. I. Heywood, “Data analytics
on network traffic flows for botnet behaviour detection,” in SSCI. IEEE,
2016, pp. 1–7.
[42] “KDD Cup Dataset,” 1999, available at the following website
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[43] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed
analysis of the KDD CUP 99 data set,” in CISDA. IEEE, 2009, pp.
1–6.
[44] J. McHugh, “Testing intrusion detection systems: a critique of the 1998
and 1999 DARPA intrusion detection system evaluations as performed
by Lincoln laboratory,” TISSEC, vol. 3, no. 4, pp. 262–294, 2000.
[45] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for
network intrusion detection systems (UNSW-NB15 network data set),”
in MilCIS. IEEE, 2015, pp. 1–6.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,” JMLR, vol. 12, pp.
2825–2830, 2011.
[47] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv
preprint arXiv:1212.5701, 2012.
[48] L. Prechelt, “Early stopping-but when?” Neural Networks: Tricks of the
trade, pp. 553–553, 1998.
[49] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proc International Conference
on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[50] C. Wang, S. S. Venkatesh, and J. S. Judd, “Optimal stopping and effec-
tive machine complexity in learning,” in Advances in neural information
processing systems, 1994, pp. 303–310.
Van Loi Cao received a BSc and an MSc in Computer Science from Le Quy Don Technical University,
Vietnam. He worked for the university as an assistant lecturer. In 2015, he moved to Ireland to study
for a PhD at University College Dublin under the supervision of Assoc. Prof. James McDermott and
Assoc. Prof. Miguel Nicolau, funded by VIED, Vietnam. His main research interests are neural networks,
machine learning, evolutionary computation, and information security.
Miguel Nicolau is an Associate Professor at UCD. He received a BSc in Belgium, followed by a BSc, MSc
and PhD from the University of Limerick. He then worked as an Expert Engineer at INRIA in Paris,
France. In 2010 he moved back to Ireland and worked as a Research Fellow and Lecturer at UCD. His
teaching experience spans over 15 years and includes positions at the University of Limerick, Fudan
University in Shanghai, and UCD.
James McDermott holds a BSc in Computer Science with Mathematics from the National University of
Ireland, Galway. His PhD was from the University of Limerick. His post-doctoral research was at UCD
and the Massachusetts Institute of Technology. He is now an Associate Professor at University College
Dublin. His main research interests are in evolutionary computation, machine learning, and computer
music.
  • 3. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 3 x x 0 1 (a) z z 0 1 (b) Normal Anomaly z0 z1 (c) Fig. 1. Illustrations of data in the original feature space (a), the latent feature space of AEs (b), and the latent feature space of our models (c). II. MATHEMATICS OF ONE-CLASS CLASSIFICATION ALGORITHMS This section is to briefly describe anomaly detection al- gorithms used in this paper. This includes Centroid, Mean distance, KDE, LOF and OCSVM as well as autoencoders. A. Anomaly detection algorithms Centroid (CEN): This is a parametric method which uses a single Gaussian to model training data. The distance (i.e. radius) from the centroid (the origin) to an observation reflects the degree of abnormality of the observation. A larger value implies a higher probability that the datapoint is an anomaly. By imposing a threshold on the distance, a query datapoint can be classified as either normal or an anomaly. This method has no hyperparameters, and works under the assumption that the training data has a Gaussian distribution. Mean Distance (MDIS): The mean of the Euclidean distance from a datapoint to normal training set can be used as anomaly score. By imposing a threshold on the mean distance, the anomaly score of a given datapoint above the threshold indicates an anomaly. MDIS has no hyperparameters, and is a non-parametric method. Kernel Density Estimation (KDE): KDE is used for estimat- ing the probability density function of a sample in data [24]. KDE can be used for constructing an anomaly detection model as presented in [12]. However, the main drawback of the model is its computational cost at querying stage, especially on large datasets. The performance in terms of classification accuracy of KDE-based classifiers will depend on the choice of the bandwidth h of a kernel function [12]. Local Outlier Factor (LOF): LOF [8] considers the data- points that have a considerably lower local density than their neighbors as anomalies. It estimates a density deviation score, called local outlier factor, of a given datapoint with respect to its neighbors. The larger the LOF score a given datapoint has, the higher the probability the datapoint is anomalous. The algorithm has shown its power on network anomaly detection [25]. In practice however, it has some limitations when dealing with high-dimensional data [2], and the choice of the number of neighbors k is still an open question. One-class Support Vector Machine (OCSVM): OCSVM [7] first maps the normal data into a feature space via a kernel function, and searches for a hyperplane with maximum margin between the region containing most of normal data (normal region) and the origin in the feature space. The idea behind this is to allocate the region encompassing the origin for anomalies to appear. That is to say, the OCSVM decision function returns a positive value in the normal region far from the origin, and a negative value in the anomaly region near the origin. B. Autoencoder An autoencoder [15], [16] is a neural network which con- sists of two parts: encoder and decoder as shown in Fig. 2(a). The encoder is defined as a feature extractor that allows the explicit representation of an input x in a feature space. Let f✓ denote the encoder, and X = x1 , x2 , ...xn be a dataset. The encoder f✓ will map the input xi 2 X into a latent vector zi = f✓(xi ), where zi is the code or latent representation. 
The decoder $g_\theta$ maps the latent representation $z_i$ back into the input space, which forms a reconstruction $\hat{x}_i = g_\theta(z_i)$. The encoder and decoder are commonly represented as single-layer neural networks in the form of non-linear functions of affine mappings as follows:

$$f_\theta(x) = s_f(Wx + b) \quad (1)$$

$$g_\theta(z) = s_g(W'z + b') \quad (2)$$

where $W$ and $W'$ are the weight matrices of the encoder and decoder, and $b$ and $b'$ are the bias vectors of the encoder and decoder. $s_f$ and $s_g$ are the activation functions of the encoder and decoder, such as a logistic sigmoid or hyperbolic tangent non-linear function, or a linear identity function. Autoencoders learn to minimize the loss function in (3) with respect to the parameters $\theta = \{W, W', b, b'\}$, using a learning algorithm such as Stochastic Gradient Descent (SGD) with back-propagation. The reconstruction loss function over training instances can be written as:

$$L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) = \frac{1}{n}\sum_{i=1}^{n} l\bigl(x_i, g_\theta(f_\theta(x_i))\bigr) \quad (3)$$

where $l(x_i, \hat{x}_i)$ is the discrepancy between the input $x_i$ and its reconstruction $\hat{x}_i$. The choice of the reconstruction loss depends largely on the appropriate distributional assumptions on the given data. The mean squared error (MSE)2 is commonly used for real-valued data, whereas a cross-entropy loss3 can be used for binary data. By compressing input data into a lower-dimensional space, the classical autoencoder avoids simply learning the identity, and removes redundant information [17]. Denoising autoencoders (DAEs) [26], [27] are regularized autoencoders that are trained to reconstruct the original input from a corrupted version of the input. This allows DAEs to capture the structure of the input distribution, and again prevents them from learning the identity. The loss function of AEs in (3) is rewritten for DAEs as follows:

$$L_{DAE}(\theta; x) = \sum_{i=1}^{n} \mathbb{E}_{p(\tilde{x}|x_i)}\bigl[\, l\bigl(x_i, g_\theta(f_\theta(\tilde{x}))\bigr) \,\bigr] \quad (4)$$

where $\tilde{x}$ is the corrupted version of $x_i$ drawn from $p(\tilde{x}|x_i)$, and $\mathbb{E}_{p(\tilde{x}|x_i)}$ is the expectation of the reconstruction loss at $x_i$ over a number of samples $\tilde{x}$ drawn from $p(\tilde{x}|x_i)$.

2 $L_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2$
3 $L_{AE}(\theta; x) = -\frac{1}{n}\sum_{i=1}^{n}\bigl[ x_i \log(\hat{x}_i) + (1 - x_i)\log(1 - \hat{x}_i) \bigr]$
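As a concrete illustration of (1)-(3), and of reconstruction error used as an anomaly score (the RE-based detectors referred to later in this paper), the following PyTorch sketch trains a deliberately small AE on normal data only and scores query points by their per-point MSE. It is an illustrative sketch rather than the authors' released implementation; the layer sizes, optimizer and the percentile-based threshold are placeholder choices.

```python
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    """Single-hidden-layer encoder/decoder in the form of eqs. (1)-(2), tanh activations."""
    def __init__(self, n_features, n_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_latent), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_features), nn.Tanh())

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def reconstruction_error(model, x):
    """Per-point MSE, used directly as an anomaly score (larger = more anomalous)."""
    with torch.no_grad():
        x_hat, _ = model(x)
        return ((x - x_hat) ** 2).mean(dim=1)

# Train on normal data only, minimising the MSE form of eq. (3).
X_train = torch.rand(512, 40) * 2 - 1      # placeholder for normal traffic scaled to [-1, 1]
model = SimpleAE(n_features=40, n_latent=7)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):
    opt.zero_grad()
    x_hat, _ = model(X_train)
    loss = nn.functional.mse_loss(x_hat, X_train)
    loss.backward()
    opt.step()

scores = reconstruction_error(model, X_train)
threshold = torch.quantile(scores, 0.90)   # e.g. flag the highest-scoring 10% as anomalous
```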
This is because the corruption process is performed stochastically on the original input each time a point $x_i$ is considered. There are many ways to corrupt the input, such as Gaussian noise or salt-and-pepper noise, but randomly masking features of the input to zero is the most commonly used. This loss function can be optimized by SGD, as when optimizing the AE loss function.

C. Variational Autoencoder

The Variational Autoencoder (VAE) [28] is a neural network that consists of two parts: a probabilistic encoder representing the approximate posterior $q_\phi(z|x)$ of the intractable true posterior $p_\theta(z|x)$, and a probabilistic decoder that refers to the generative model $p_\theta(x|z)$, as shown in Fig. 2(b). The objective of the VAE is to optimize the variational lower bound on the marginal likelihood of the data w.r.t. the variational parameters $\phi$ and the generative parameters $\theta$. The intractable marginal likelihood is computed as a sum over the marginal likelihoods of the individual datapoints, $\log p_\theta(x_1, \ldots, x_n) = \sum_{i=1}^{n} \log p_\theta(x_i)$, where $\log p_\theta(x_i)$ can be written as:

$$\log p_\theta(x_i) = D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z|x_i)\bigr) + \mathcal{L}(\theta, \phi; x_i) \quad (5)$$

The term $\mathcal{L}(\theta, \phi; x_i)$ is the lower bound on the marginal likelihood of datapoint $x_i$, since the first term, the Kullback-Leibler divergence (KL-divergence) of the approximate posterior from the true posterior, is non-negative. The lower bound can be written as follows:

$$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z|x)}\bigl[-\log q_\phi(z|x) + \log p_\theta(x, z)\bigr] = -D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) + \mathbb{E}_{q_\phi(z|x_i)}\bigl[\log p_\theta(x_i|z)\bigr] \quad (6)$$

where $p_\theta(x_i|z)$ is the likelihood of $x_i$ given the latent variable $z$, and $p_\theta(z)$ is the prior over latent variables. However, the second term in (6) requires a random latent variable $z$ sampled from the approximate posterior $q_\phi(z|x)$. This is problematic since back-propagation cannot flow through a random node $z$. When $q_\phi(z|x)$ is restricted to certain kinds of parametric distributions, e.g. Gaussian, the random variable $z$ can be reparameterized as a deterministic function $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$. This yields a lower-variance lower bound estimator called SGVB (Stochastic Gradient Variational Bayes):

$$\tilde{\mathcal{L}}(\theta, \phi; x_i) = -D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x_i|z^{i,l}) \quad (7)$$

where $z^{i,l} = g_\phi(\epsilon^{i,l}, x_i)$ and $\epsilon^{l} \sim p(\epsilon)$. In (7), the KL-divergence term forces $q_\phi(z|x)$ to be as close as possible to $p_\theta(z)$ and works as a regularizer, whereas the second term is an expected negative reconstruction error. In order to integrate the KL-divergence in (7) analytically, the true posterior $p_\theta(z|x)$ is assumed to be approximately Gaussian with approximately diagonal covariance. Let the prior be $p_\theta(z) = \mathcal{N}(0, I)$, and the approximate posterior be a multivariate Gaussian with a diagonal covariance structure, $q_\phi(z|x_i) = \mathcal{N}(\mu^i, (\sigma^i)^2)$, where $\mu^i$ and $\sigma^i$ are the mean and s.d. evaluated at datapoint $i$. Let $\mu^i_j$ and $\sigma^i_j$ denote the $j$-th elements of $\mu^i$ and $\sigma^i$ respectively, where $J$ is the dimensionality of $z$.

Fig. 2. The architectures of AEs (a), VAEs (b), and the hybrids of the latent representation models and one-class classifiers (c).
The KL-divergence in (7) is written as follows:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = D_{KL}\bigl(\mathcal{N}(\mu^i, (\sigma^i)^2)\,\|\,\mathcal{N}(0, I)\bigr) = \frac{1}{2}\sum_{j=1}^{J}\Bigl((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log\bigl((\sigma^i_j)^2\bigr)\Bigr) \quad (8)$$

Taking $D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr)$ into (7), we get the objective function of the VAE at datapoint $i$ as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{2}\sum_{j=1}^{J}\Bigl((\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log\bigl((\sigma^i_j)^2\bigr)\Bigr) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x_i|z^{i,l}) \quad (9)$$

where $z^{i,l} = \mu^i + \sigma^i \odot \epsilon^l$ and $\epsilon^l \sim \mathcal{N}(0, I)$. $L$ is the number of samples per datapoint; in practice, it can be set to 1 as in [28]. When optimizing (maximizing) the objective function in (9) by Stochastic Gradient Ascent, VAEs learn the recognition model parameters $\phi$ jointly with the generative model parameters $\theta$. Given a datapoint $x_i$, the probabilistic encoder outputs the parameters of the approximate posterior at this datapoint, $\mu^i$ and $\sigma^i$. An actual value $z^{i,l} \sim q_\phi(z|x_i)$, obtained through $z^{i,l} = \mu^i + \sigma^i \odot \epsilon^l$, is the input to the probabilistic decoder. The output of the decoder is the reconstruction $\hat{x}_i$. The distribution of the encoder output is Gaussian, whereas that of the decoder depends on the type of data (Gaussian for real-valued data or Bernoulli for binary data).

III. RELATED WORK

In this section, we discuss recent trends and some state-of-the-art anomaly detection algorithms. This includes Support Vector Machines [7], [29], [30], and autoencoder-based methods [5], [14], [17], [18], [19], [31]. Schölkopf et al. [7] and Campbell et al. [30] presented hyperplane-based one-class SVM approaches, as already discussed. In [7], the aim is to map the input data into the feature space via a kernel function, and then find a hyperplane with a maximum margin between the region containing normal data and the origin in the feature space. The half space
  • 5. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 5 containing the origin is identified as the anomalous region. The trade-off between the two objectives, maximizing the margin and minimizing the number of target vectors falling into the anomalous region, is controlled by the outlier fraction ⌫ 2 (0, 1). The larger the value of ⌫, the more normal vectors are rejected as outliers and the more normal vectors become support vectors. When ⌫ approaches 1 almost all normal vectors become support vectors. The method was evaluated on the US postal service database of handwritten digits, and the results show that the classifier performed well. However, how to choose values for the hyperparameter ⌫ and kernel parameters such as gamma (related to bandwidth h in KDE) is still an open question. Instead of allocating the origin region for anomalies, Campbell et al. [30] proposed a model that learns to capture the region containing normal instances in feature space. They attempted to find a hyperplane with respect to the center of the distribution of normal data, and anomalies were assumed to appear in the other side. Linear programming techniques are employed instead of the quadratic programming in Schölkopf’s approach, that can make their model learn large datasets rapidly. Tax and Duin [29] proposed a method called Support Vector Data Description for anomaly detection. In this approach, normal data is again first mapped into a feature space corre- sponding to a kernel function. It then finds a hypersphere with minimum radius which encompasses almost all normal vectors in the feature space. Any query datapoints lying inside the hypersphere are considered as normal and others as anomalies. In order to achieve good classification accuracy, it is desir- able to reduce the volume of the hypersphere by rejecting some fraction of training data (the outlier fraction known as parameter C) when training this model. This illustrates a theme present in all one-class classification research, the trade-off between false positive and false negative rates. They introduced different kernel functions to SVDD that make the method more flexible, and the Gaussian kernel was found to be the most suitable for many datasets. When using the Gaussian kernel, the method is comparable to OCSVM [7]. However, the technique requires a large number of normal examples, and extra outlier objects for training in order to improve the classification accuracy [29]. Both SVDD and OCSVM have demonstrated their effectiveness on anomaly detection, but their limitations are the ability to model large-scale and high- dimensional data due to their time and space complexity [32]. The approach of using stand-alone AEs to build anomaly de- tection systems was proposed in [5], [18], [19], in which AEs act as either anomaly detection methods or feature reduction techniques. Hawkins et al. [18] trained an AE (also known as a replicator neural network) with three narrow hidden layers on normal data. Its RE was used as an “outlier score”: an outlier score above a predetermined threshold indicated an anomaly. A step-wise activation function was used for the neurons in the middle hidden layer, which mapped input data into a number of possible clusters. Each of these clusters was associated with an active state of these neurons. These neurons were active with specific steps on a particular class of data (normal or anomaly). Thus, the labels of these clusters can be used as an alternative approach for indicating anomalies. 
The model was evaluated on the Wisconsin Breast Cancer (WBC) and the KDD’99 datasets, and both of these models (RE-based and cluster-based) produced high accuracy. Furthermore, Fiore et al. [5] constructed an AE using Discriminative Restricted Boltzmann Machines to test the hypothesis that there is a deep similarity among normal behaviors. They expected that their model can describe all the characteristics of normal traffic when comparing it against unseen anomalous traffic. Their experiments involving real-world network traces and the KDD’99 datasets confirmed that its performance suffered when testing in a network greatly different from that where training data was collected. In contrast, Sakurada et al. [19] employed an AE as a nonlinear feature reduction technique for anomaly detection. They attempted to clarify the properties of AEs by comparing a classical AE and a DAE to linear PCA and Kernel PCA. These techniques were evaluated on an artificial dataset and on spacecraft telemetry data. They concluded that DAEs not only outperform linear PCA and Kernel PCA in terms of accuracy, but also can avoid the high computation costs of kernel PCA. Hybrid approaches or extensions of AEs have been recently proposed for anomaly detection [14], [31]. Veeramachaneni et al. [31] proposed an ensemble learner to combine three single one-class classifiers: AE-based, density-based, and ma- trix decomposition-based techniques. They also used a human expert to provide ongoing correct labels from which the algorithms can learn. The models were tested on a large network log file dataset, and achieved promising results. Erfani et al. [14] introduced a hybrid of a Deep Belief Network (DBN) and OCCs, such as OCSVM and SVDD, for solving the problem of high-dimensional anomaly detection. The DBN was pre-trained in the greedy layer-wise fashion, that is unsu- pervised training of each Restricted Boltzmann Machine one- by-one. OCSVM [7] and SVDD [29] were then built on top of the pre-trained DBN. This structure takes advantages of high decision classification accuracy from these OCCs and nonlin- ear feature reduction from DBNs. The model was evaluated on eight high-dimensional UCI datasets. The results showed that the performance of the hybrid models was comparable to AEs and better than stand-alone OCSVM and SVDD, and the training and testing times improved significantly. IV. PROPOSED MODEL We aim to find a new data representation that facilitates simple anomaly detection algorithms. This section clarifies how to construct the data representation by introducing new regularizers to an AE and a VAE. The new regularizers together with reconstruction loss will help these AEs to give a robust representation of normal behavior. The regularizers will encourage the encoders of these AEs to condense normal data as close together as possible at a particular region in the latent feature space, while reconstruction loss promotes these AEs to keep normal points from overlapping each other. In order to separate the normal region from anomalies, normal points will be “pushed” towards the origin at the non-saturating area of the bottleneck unit outputs by the regularizers. That is, each coordinate (given by the output of the bottleneck unit
activation) of an encoded point will tend to be pushed closer to the non-saturating value (zero) of the activation function. Thus, an AE trained on normal data can keep normal datapoints close to the origin, whereas anomalous datapoints, if they differ from normal datapoints, will therefore tend to differ greatly, and appear in other regions. A number of one-class classifiers are employed for evaluating the proposed models. Fig. 2(c) illustrates the hybrid of the data representation models and one-class classifiers. More details are shown in Subsections IV-A and IV-B.

Our models are very different from other common regularized AEs, including Sparse AEs and Contractive AEs. Sparse AEs attempt to construct a sparse representation in an overcomplete setting, in which only a few of the hidden unit activations can vary at a time while the others are set to a saturating value [33]. Thus, the latent data is penalized towards the saturating value at zero [34], or the hidden bias vectors are controlled [35]. Contractive AEs seek a latent representation that is as insensitive as possible with respect to variations in the input data [36]. Thus, the outputs of the hidden units are constrained to be close to their marginal values (e.g. 0 or 1 for the sigmoid function).

A. Shrink Autoencoder

A new regularizer is added to the loss function of an AE, which encourages the AE to construct a representation of normal data that will be easy for one-class classification algorithms. The regularizer is designed to penalize normal datapoints whose vectors in the latent space are of large magnitude; that is, it restricts the normal data to lie close to the origin. Hence, this is called a shrink regularizer, and the AE is named Shrink AE (SAE). The loss function in (3) can be redefined for this situation as follows:

$$L_{SAE}(\theta; x, z) = \frac{1}{n}\sum_{i=1}^{n} l(x_i, \hat{x}_i) + \lambda\,\frac{1}{n}\sum_{i=1}^{n} \| z_i \|^2 \quad (10)$$

where $\hat{x}_i$ and $z_i$ are the reconstruction and the latent vector of the observation $x_i$ respectively. The first term is the reconstruction error, $\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2$, and the second term is the shrink regularizer. The parameter $\lambda$ controls the trade-off between the two terms in the loss function.

B. Dirac delta Variational Autoencoder

VAEs attempt to encode data so that it is distributed as a standard Gaussian in the latent space. Thus, normal data will reside in a large area centered at the origin. Our strategy is to compress normal data into a smaller area near the origin. Therefore, we redesign the KL-divergence in (8) by forcing the approximate posterior $q_\phi(z|x)$ to be as close as possible to a new prior $p_\theta(z)$ with very small standard deviation. Let us recall the KL-divergence between two multivariate Gaussian distributions in $\mathbb{R}^n$, $P_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $P_2 = \mathcal{N}(\mu_2, \Sigma_2)$, defined in [37] as:

$$D_{KL}(P_1 \,\|\, P_2) = \frac{1}{2}\Bigl[\operatorname{tr}\bigl(\Sigma_2^{-1}\Sigma_1\bigr) + (\mu_2 - \mu_1)^{T}\Sigma_2^{-1}(\mu_2 - \mu_1) - n + \log\Bigl(\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\Bigr)\Bigr] \quad (11)$$

Let $\mu^i$ and $\Sigma^i$ denote the variational mean and the covariance matrix evaluated at datapoint $i$, $q_\phi(z|x_i) = \mathcal{N}(\mu^i, \Sigma^i)$, and let $J$ be the dimensionality of $z$. Consider a constant $\alpha$ ($\alpha \ll 1.0$) to be the variance of the prior probability, $p_\theta(z) = \mathcal{N}(0, \alpha I)$, where $I$ is the identity matrix.
Applying these to (11), the KL-divergence between $q_\phi(z|x_i)$ and $p_\theta(z)$ can be written as follows:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = \frac{1}{2}\Bigl[\operatorname{tr}\bigl((\alpha I)^{-1}\Sigma^i\bigr) + (\mu^i)^{T}(\alpha I)^{-1}(\mu^i) - J + \log\Bigl(\frac{\det(\alpha I)}{\det(\Sigma^i)}\Bigr)\Bigr] \quad (12)$$

Substituting $(\alpha I)^{-1} = \alpha^{-1} I$ and $\det(\alpha I) = \alpha^{J}$ into (12), we get:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = \frac{1}{2}\Bigl[\alpha^{-1}\operatorname{tr}(\Sigma^i) + \alpha^{-1}(\mu^i)^{T}(\mu^i) - J + \log\Bigl(\frac{\alpha^{J}}{\det(\Sigma^i)}\Bigr)\Bigr] = \frac{1}{2\alpha}\Bigl[\operatorname{tr}(\Sigma^i) + (\mu^i)^{T}(\mu^i) - \alpha J + \alpha J\log\alpha - \alpha\log\bigl(\det(\Sigma^i)\bigr)\Bigr] \quad (13)$$

Because $\Sigma^i$ is a diagonal matrix of size $J \times J$, it can be represented as the vector of its $J$ diagonal elements. Let $\mu^i_j$ and $(\sigma^i_j)^2$ denote the $j$-th elements of $\mu^i$ and $\Sigma^i$ respectively. Expanding $\operatorname{tr}(\Sigma^i)$ and $\det(\Sigma^i)$, we get:

$$D_{KL}\bigl(q_\phi(z|x_i)\,\|\,p_\theta(z)\bigr) = \frac{1}{2\alpha}\Bigl[\sum_{j=1}^{J}(\sigma^i_j)^2 + \sum_{j=1}^{J}(\mu^i_j)^2 - \alpha\sum_{j=1}^{J} 1 + \alpha\sum_{j=1}^{J}\log\alpha - \alpha\log\Bigl(\prod_{j=1}^{J}(\sigma^i_j)^2\Bigr)\Bigr] = \frac{1}{2\alpha}\sum_{j=1}^{J}\Bigl[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\bigl((\sigma^i_j)^2\bigr)\Bigr] \quad (14)$$

Now we apply the KL-divergence in (14) to (7). The log-likelihood term in (7) is replaced by the negative MSE between $x_i$ and its reconstruction $\hat{x}_i$, since we apply our models only to real-valued datasets. The objective function in (7) can be rewritten as follows:

$$\mathcal{L}(\theta, \phi; x_i) \simeq -\frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 - \frac{1}{2\alpha}\sum_{j=1}^{J}\Bigl[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\bigl((\sigma^i_j)^2\bigr)\Bigr] \quad (15)$$

The prior can be seen as a Dirac delta distribution because $\alpha$ is very small. Thus, this VAE is named Dirac delta Variational Autoencoder (DVAE). Maximizing (15) is equivalent to minimizing the sum of its RE and KL-divergence components. We introduce a parameter $\lambda$ to control the trade-off between the two components in (15). The objective function can be rewritten in the form of the loss function of the DVAE as follows:

$$L_{DVAE}(\theta, \phi; x_i) = \frac{1}{n}\sum_{i=1}^{n}\| x_i - \hat{x}_i \|^2 + \lambda\,\frac{1}{2\alpha}\sum_{j=1}^{J}\Bigl[(\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\bigl((\sigma^i_j)^2\bigr)\Bigr] \quad (16)$$
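The two loss functions above map directly onto code. The following PyTorch-style sketch shows the SAE loss of (10) and the DVAE loss of (16), together with the reparameterization step $z = \mu + \sigma \epsilon$ used by the probabilistic encoder. It is a minimal illustration under the paper's notation rather than the authors' implementation; the default values of $\lambda$ and $\alpha$ simply anticipate the settings reported in Section V ($\lambda_{SAE}=10$, $\lambda_{DVAE}=0.05$, $\alpha=10^{-8}$), and the batch averaging is one reasonable reading of the per-datapoint notation.

```python
import math
import torch

def sae_loss(x, x_hat, z, lam=10.0):
    """Shrink AE loss, eq. (10): mean squared reconstruction error + lam * mean ||z||^2."""
    re = ((x - x_hat) ** 2).sum(dim=1).mean()       # (1/n) sum ||x_i - x_hat_i||^2
    shrink = (z ** 2).sum(dim=1).mean()             # (1/n) sum ||z_i||^2
    return re + lam * shrink

def dvae_loss(x, x_hat, mu, log_var, lam=0.05, alpha=1e-8):
    """Dirac delta VAE loss, eq. (16): MSE + lam * KL(q(z|x) || N(0, alpha * I))."""
    re = ((x - x_hat) ** 2).sum(dim=1).mean()
    var = log_var.exp()                             # (sigma_j)^2 per latent dimension
    # Per-dimension bracket of eq. (14), summed over the J latent dimensions.
    kl = (var + mu ** 2 - alpha + alpha * math.log(alpha) - alpha * log_var).sum(dim=1)
    kl = kl.mean() / (2.0 * alpha)                  # average over the batch, scale by 1/(2 alpha)
    return re + lam * kl

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the sampling step of the (D)VAE encoder."""
    eps = torch.randn_like(mu)
    return mu + (0.5 * log_var).exp() * eps
```

In training, `mu` and `log_var` would come from the two output heads of the probabilistic encoder, and the decoder reconstructs `x_hat` from the reparameterized `z`.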
  • 7. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 7 V. EVALUATION AND DISCUSSION This section is to evaluate the SAE and DVAE algorithms on constructing the data representation for improving the perfor- mance of anomaly detection algorithms. This is demonstrated by the experimental results produced from five simple one- class classification (OCC) algorithms LOF, CEN, KDE, MDIS, OCSVM using the latent representations of SAE and DVAE on fourteen problems. In order to highlight the strengths of SAE and DVAE, the results are also compared to those from: (1) the stand-alone OCCs (without any AE latent representation), (2) the OCCs using the latent representations of a denoising AE (DAE) and a VAE, and (3) the RE-based OCC. For measuring the accuracy of the models, we evaluate the area under the resulting ROC curve (AUC) by trying many different thresholds, and create a confusion matrix by choosing only one threshold. A number of experiments and analysis for exploring different aspects of the latent representations of SAE and DVAE are carried out as follows: • Evaluate the effect of dimensionality and sparsity on the classification accuracy of the OCCs using the latent representations given by SAE and DVAE. • Explore the effect on classification accuracy of OCSVM and LOF of their parameters ⌫, , and k. Investigate the distribution of latent vectors on normal and anomaly data. • Measure the effect of training size on the AUCs and query time created by SAE-OCCs and DVAE-OCCs. • Evaluate the AUCs from the OCCs on specific categories of attack types in NSL-KDD and UNSW-NB15. A. Experiments 1) Datasets: The experiments are conducted on fourteen datasets including network problems as shown in Table I. The eight network datasets are mostly well-known problems in the domain of network security. Although the main objective is to cope with the challenges arising in high-dimensional net- work data, the models are also evaluated on six non-network datasets from the UCI Machine Learning Repository [38]. This is because we intend to evaluate the performance of our models on a diversity of data, and expect to emphasize their strength on high-dimensional network-related datasets. The normal traffic in CTU13, UNSW-NB15 and NSL-KDD is considered as normal data, whereas all the attacks are treated as anomalies. In PenDigits, the digits ‘0’ and ‘2’ are chosen as the normal and anomalous classes respectively. For GLASS, window glass is considered as the normal class, and other classes as the anomalous class. In the other datasets, the normal and anomalous classes are indicated following [39]. The CTU13 is a publicly available botnet dataset provided in 2011 [40]. The data covers a wide range of real-world botnet traffic mixed with normal traffic and background traf- fic (unlabeled data). The CTU13 consists of thirteen botnet scenarios, and each of them involves a specific type of malware. We choose four scenarios in CTU13, and split each of them into 40% for training (normal traffic) and 60% for evaluating (normal and botnet traffic) following [41]. We use most of the 14 features in CTU13 except source/destination IP addresses. Three categorical features, protocol, sTos and dTos, are encoded by one-hot-encoding, which results in higher dimensional versions of these scenarios. 
TABLE I FOURTEEN DATASETS FOR EVALUATING THE PROPOSED MODELS Dataset Dimension4 Training set Normal Test Anomaly Test PageBlocks 10 3930 983 112 WPBC 32 118 30 10 PenDigits 16 780 363 364 GLASS 9 130 33 11 Shuttle 9 3410 11478 3022 Arrhythmia 259 189 48 37 Rbot (CTU13-10) 38 6338 9509 63812 Murlo (CTU13-8) 40 29128 43694 3677 Neris (CTU13-9) 41 11986 17981 110993 Virut (CTU13-13) 40 12775 19164 24002 Spambase 57 2230 558 363 UNSW-NB155 196 56000 37000 45332 NSL-KDD5 122 67343 9711 12833 InternetAds 1558 1582 396 77 NSL-KDD is a filtered version of the KDD’99 dataset [42], which was suggested to address the inherent issues mentioned in [43]. Although NSL-KDD still suffers from some problems discussed in [44], it can be reasonable to use the data as an effective benchmark for comparing anomaly detection algorithms in this work due to the shortage of public intrusion data. Each 41-feature record in NSL-KDD is labeled as either normal or a specific attack group in the four main categories: Denial of Service (DoS), Remote to Local (R2L), User to Local (U2R) and Probe. NSL-KDD consists of two parts: KDDTrain+ and KDDTest+ which are drawn from differ- ent distributions (additional 14 types of attacks in KDDTest+ only). Three categorical features, protocol type, service and flag, are preprocessed by one-hot-encoding which increases the number of features to 122. UNSW-NB15 has been recently provided and is expected to address the inherent issues in the KDD’99 dataset and NSL- KDD [45]. Each record comprising 47 features is labeled either as realistic normal traffic or one of the nine modern attack categories: Fuzzers, Analysis, Backdoor, DoS, Exploit, Generic, Reconnaissance, Shellcode and Worm. The dataset is decomposed into two sets, UNSW NB15 training-set and UNSW NB15 testing-set, for training and evaluating. The categorical attributes, such as protocol, service and state, are preprocessed by one-hot-encoding which increases the number of features to 196. The labelled anomalies in the training parts of NSL-KDD and UNSW-NB15 are discarded. PenDigits and Shuttle are already partitioned into training and testing parts, thus we simply delete labelled anomalies in the training parts to form training sets. For Spambase, InternetAds, PageBlocks, WPBC, GLASS and Arrhythmia, we take 80% of normal data for training and 20% of normal and anomalies for testing. All datasets are normalized into [-1, 1] since the activation function of the output layer of these AEs is the tanh function, and missing values are discarded. 4The dimensions of the four CTU13 datasets, UNSW-NB15 and NSL-KDD are preprocessing by on-hot-encoding. 5The training sets of UNSW-NB15 and NSL-KDD are much larger than other datasets, thus we will sample a small proportion (10%) for training.
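The preprocessing described for these datasets (one-hot encoding of categorical attributes, scaling to [-1, 1] for the tanh output layer, and training on normal records only) can be sketched as follows. The column names are placeholders, since the exact field names differ between CTU13, NSL-KDD and UNSW-NB15; this is an assumption-laden illustration, not the authors' preprocessing script.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Placeholder names; e.g. protocol/service/flag in NSL-KDD, protocol/sTos/dTos in CTU13.
CATEGORICAL = ["protocol", "service", "flag"]
LABEL = "label"                      # assumed: 0 = normal, 1 = attack/anomaly

def preprocess(train_df, test_df):
    # One-hot encode categorical attributes on the concatenated frame so that the
    # train and test splits end up with identical (higher-dimensional) feature columns.
    full = pd.concat([train_df, test_df], keys=["train", "test"])
    full = pd.get_dummies(full, columns=CATEGORICAL)
    train, test = full.loc["train"], full.loc["test"]

    # Keep only normal records for training; labelled anomalies are used at test time only.
    train = train[train[LABEL] == 0]
    y_test = test[LABEL].values
    X_train = train.drop(columns=[LABEL]).values
    X_test = test.drop(columns=[LABEL]).values

    # Scale to [-1, 1] because the output activation of the AEs is the tanh function.
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_test
```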
  • 8. 8 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX 2) Parameter Settings: Anomalies are not available during training, so cross-validation can not be used to tune hyperpa- rameters. This is one of the major difficulties for this task. We configure the hyperparameters of AEs and OCCs using common values and rules of thumb, and then confirm that performance is not sensitive to these values. OCC Parameters: The Gaussian kernel is used for KDE and OCSVM. The scaling parameter related to the bandwidth h by = 1 2h2 is set by a default value, = 1 nf as in [46], where nf is the number of input features. The trade-off parameter ⌫ is set to two separate values6 , 0.1 and 0.5, which refers to OCSVM⌫=0.1 and OCSVM⌫=0.5. In LOF, the number of nearest neighbors k is chosen as 10% of the training size. AE Parameters: The architectures of SAE and DVAE are configured as follows: the number of hidden layers is equal to 5 as in [14], the size of the bottleneck layer m is chosen by the rule of thumb presented in [13], m = [1 + p n], where n is the number of input features. The choice of mini-batch size is dependent on the size of training sets. This is needed because the sizes of the datasets vary by a factor of 500. For small training sets (< 2000), we split into 20 batches. For large, we set mini-batch size to 100. We also want to provide a similar number of batches for each iteration in training processes which will help early-stopping work efficiently. In order to eliminate learning rate and the number of training iterations, we employ the Adadelta algorithm [47] together with early- stopping techniques [48] for training these networks, which enables the training processes to operate automatically and avoid over-fitting. The hyperbolic tangent function is chosen as the activation function for these AEs. Weights are initialized following the scheme in [49]. In practice, the KL-divergence in the DVAE loss function is scaled by log10 since its value is extremely large in early epochs. The distribution of latent data before training seems to be very similar to the standard Gaussian distribution. The prior p✓(z) is a Dirac delta distribution, thus the KL-divergence is very large, especially at early iterations of the training process. Fig. 3 (also Fig. 5 in the supplementary material) illustrates the distribution of latent data (the first feature z0) during the training process. Therefore, the log10 scaling is expected to reduce the domination of this term on the loss function. Fig. 3. Histogram of latent data (the first feature z0) during the training of DVAE (↵ = 10 8) on Spambase. SAE and DVAE are trained to minimize the loss functions in (10) and (16) by an adaptive SGD algorithm (Adadelta) as in the training of MLPs. We do not apply a pretraining procedure for these networks since modern back-propagation methods (weight initialization [49] and Adadelta [47]), together with 6This is expected to show the influence of ⌫ on the performance of OCSVM. the new regularization terms, are expected to encourage the networks to learn the parameters in hidden layers effectively. Early stopping is controlled by two parameters. Training will terminate when the loss does not improve by an absolute value of 10 3 for t iterations. t is calculated as 2000 / number of batches (where number of batches is already defined in this section). Note that only normal data is employed for the training process. We use the same model selection for setting up a five hidden layer DAE and a five hidden layer VAE7 . 
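Following the rules of thumb above ($\gamma = 1/n_f$ for the Gaussian kernel, $\nu \in \{0.1, 0.5\}$, $k$ equal to 10% of the training size, and the bandwidth $h$ recovered from $\gamma = 1/(2h^2)$), the one-class classifiers can be instantiated roughly as in the sketch below using scikit-learn; CEN and MDIS have no hyperparameters and reduce to simple distance computations. This is an illustrative configuration under those rules, not the authors' exact code, and it assumes a recent scikit-learn (novelty=True for LOF).

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor, KernelDensity
from sklearn.metrics import pairwise_distances

def build_occs(X_train):
    n, nf = X_train.shape
    gamma = 1.0 / nf                      # default scaling parameter, gamma = 1 / n_f
    h = np.sqrt(1.0 / (2.0 * gamma))      # bandwidth from the relation gamma = 1 / (2 h^2)
    k = max(1, int(0.1 * n))              # LOF neighbours: 10% of the training size
    return {
        "OCSVM_nu0.1": OneClassSVM(kernel="rbf", gamma=gamma, nu=0.1).fit(X_train),
        "OCSVM_nu0.5": OneClassSVM(kernel="rbf", gamma=gamma, nu=0.5).fit(X_train),
        "LOF": LocalOutlierFactor(n_neighbors=k, novelty=True).fit(X_train),
        "KDE": KernelDensity(kernel="gaussian", bandwidth=h).fit(X_train),
    }

def cen_score(X_train, X_test):
    """CEN: distance to the centroid of the training data (larger = more anomalous)."""
    centroid = X_train.mean(axis=0)       # approximately the origin in the SAE/DVAE latent space
    return np.linalg.norm(X_test - centroid, axis=1)

def mdis_score(X_train, X_test):
    """MDIS: mean Euclidean distance from a query point to the normal training set."""
    return pairwise_distances(X_test, X_train).mean(axis=1)
```

Anomaly scores can then be taken as, for example, the negated `decision_function` of OCSVM and LOF or the negated `score_samples` (log-density) of KDE, so that larger values indicate anomalies across all classifiers.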
However, the DAE is trained in greedy layer-wise fashion following the original scheme proposed in [20], [21]. In the pretraining procedure, each single denoising autoencoder is trained to minimize MSE between the reconstruction formed from a corrupted version8 of the input, and the original input. This is optimized by SGD with a common value for learning rate, 10 2 , and 200 iterations9 to initialize weights and biases for the DAE. The DAE and VAE are then fine-tuned (end-to-end) as in the training of SAE and DVAE. Estimating : This is carried out for estimating the param- eter in the loss functions of SAE (10) and DVAE (16). The regularizers (shrink in SAE and KL-divergence in DVAE), force normal datapoints as close together as possible at the origin, whereas the reconstruction loss attempts to keep them from overlapping in order to reconstruct them at the output layer. The two components tend to conflict with each other. Thus, an appropriate value of should be chosen to bal- ance the two components. However, anomalous data is not available for tuning or determining the number of training iterations in order to avoid overfitting. According to [50], there are three phases in the training process of a feed-forward network. The generalization error includes two components called approximation error and complexity error. In the first phase, the approximation error dominates the complexity error, and the generalization error decreases gradually. In phase 2, these components are approximately balanced, and the gener- alization error continues to decrease further. The complexity error is increasingly large after phase 2, and dominates the approximation error due to large network weights, which can lead to oscillation and high generalization errors (phase 3). Thus, the training process should be stopped in phase 2. Therefore, we investigate these loss functions and their two components on five values, SAE 2 {0.1, 1, 5, 10, 50} and DVAE 2 {0.001, 0.01, 0.05, 0.1, 0.5} on four datasets over 1000 epochs. Firstly, we observe three phases on the SAE training error curves. The larger the value of , the longer phase 2 will last, which makes it easy to choose early stopping parameters. When is large (about 10) phase 2 is longer, but = 50 makes the training error less stable on phase 2. = 10 seems to be a good value which allows us to choose common values for early stopping parameters. When we apply early stopping with SAE = 10, we see that the stopping point is 7The equation (9) is rewritten in a form of the VAE loss function since the VAE is trained under the same training scheme in DVAE: LVAE(✓, ; xi) = 1 n Pn 1 k xi x̂i k 2 + 1 2 PJ j=1[( i j)2 + (µi j)2 1 log(( i j)2)]. 8It is obtained by randomly setting 10% of the input features to zero. 9There is no need for using early-stopping here since this is aimed to initialize weights and biases to be close to a good solution.
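The training procedure used for these networks (Adadelta, so no manual learning rate, plus early stopping once the loss fails to improve by $10^{-3}$ for $t = 2000 / \text{number of batches}$ iterations) can be expressed roughly as the loop below. This is a PyTorch-style sketch under the stated rules, interpreting the patience $t$ in epochs; it is not the released implementation.

```python
import torch

def train_with_early_stopping(model, loss_fn, loader, max_epochs=1000,
                              min_delta=1e-3, budget=2000):
    """Adadelta plus the early-stopping rule: patience = budget / number of batches."""
    opt = torch.optim.Adadelta(model.parameters())
    patience = max(1, budget // len(loader))
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for x_batch in loader:                 # loader yields mini-batches of normal data only
            opt.zero_grad()
            loss = loss_fn(model, x_batch)     # e.g. the SAE or DVAE loss defined earlier
            loss.backward()
            opt.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if best - epoch_loss > min_delta:      # improved by at least 1e-3
            best, wait = epoch_loss, 0
        else:
            wait += 1
            if wait >= patience:               # no sufficient improvement for `patience` epochs
                break
    return model
```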
  • 9. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 9 mostly in phase 2. We also observe AUC curves, and the early stopping appears to perform well. Even AUCs are very good at the first few epochs on some datasets, but we are not using AUCs to choose . Similarly, we choose DVAE = 0.05. For brevity we present only the curves of SAE on CTU13-10 with SAE = 10 in Fig. 4, and on the four datasets in Figs. 1–4 in the supplementary material. Fig. 4. SAE loss function and its components (RE and Shrink losses) (w.r.t the left y-axis), and the AUCs created by SAE-LOF, SAE-CEN and SAE-OCSVM (w.r.t the right y-axis) during the training process of SAE on CTU13-10. 3) Main experiments: The bottleneck layers of the trained DAE, VAE, SAE and DVAE are used as latent representa- tions for six one-class classifiers LOF, CEN, MDIS, KDE, OCSVM⌫=0.1 and OCSVM⌫=0.5. We use the terms DAE- OCCs, VAE-OCCs, SAE-OCCs, and DVAE-OCCs to refer to the six one-class classifiers when using the latent representa- tions of DAE, VAE, SAE and DVAE respectively. The REs of these AEs are also used as anomaly score that produces four further RE-based classifiers. The performance of these stand- alone one-class classifiers on original data are considered as baselines. All experiments are implemented in Python 2.7 and run on a machine with an Intel Core 2 Duo i5-3360M CPU at 2.8 GHz, 8 GB RAM and RAM frequency of 1600 MHz, and the implementation of our algorithms is available on GitHub (https://github.com/vanloicao/SAEDVAE). The OCCs provided by scikit-learn are employed [46]. The main results are shown in Table II. B. Analysis and discussion Discussion: Table II presents the AUCs achieved by DAE- OCCs, VAE-OCCs, SAE-OCCs and DVAE-OCCs, and their corresponding RE-based classifiers from the 2nd to the 5th rows respectively. The results created by the six stand-alone one-class classifiers are shown in the first row. Each column represents the AUCs created by a number of classifiers on the same problem. We use gray-scale to present the performance of these classifiers on each dataset. In each column, the highest AUC is highlighted by the lightest gray. The fourteen datasets are arranged in ascending sparsity order. Table II shows that when working on the latent repre- sentations produced by SAE and DVAE, the six one-class classifiers perform better in terms of classification accuracy than those using DAE, VAE or stand-alone OCCs on the eight network-related datasets. These datasets are typically very high-dimensional and sparse, such as InternetAds with 1558 features. This suggests that the latent representations produced by SAE and DVAE facilitate these one-class classifiers in deal- ing with high-dimensional and sparse network-related datasets. However, VAE-OCCs produces relatively poor performance. This can be explained as follows: the VAE regularizer has less influence on learning the representation since the latent data is already in a good shape before training (see Fig.3). Thus, most of the representation power of the VAE may be used for reconstruction. Moreover, normal data resides in a large region that may give more “room” for anomalies to appear inside the region. The normal data is also not forced on the non-saturated part of the activation function. The hybrid SAE-OCCs and DVAE-OCCs also yield very similar AUCs on each network-related dataset, even though these one-class classifiers originate from different algorithms, and their parameters (e.g. ⌫) are set to different values. 
This is clear to see in the 4th and 5th rows where sparsity > 0.50. This implies that SAE and DVAE may constrain normal data in their latent representations in a well-shaped distribution that is straightforward for these classification algorithms to capture normal behaviors, and less sensitive to parameter settings. Moreover, SAE-OCCs and DVAE-OCCs produce comparable or superior AUCs in comparison to the RE-based DAE classi- fier on the network-related datasets, especially for high sparsity and dimensionality. The influence of OCC parameters and the distribution of latent vectors are explored later. The influence of dimensionality and sparsity: We next inves- tigate the influence of sparsity and dimensionality of data on the classification accuracy produced from hybrid DAE-OCCs, SAE-OCCs and DVAE-OCCs. We use the term AUC-DIFF to refer to the difference in AUC between a classifier (e.g. LOF) on the original data and on the data encoded by an AE. A positive value of AUC-DIFF indicates an improvement due to the AE encoding. AUC-DIFF is plotted against sparsity and dimensionality of datasets shown in Fig. 5(a) and Fig. 5(b). It can be seen from Fig. 5(a) that there is a clear increasing trend in the AUC-DIFF lines of SAE-OCCs and DVAE-OCCs, while the AUC-DIFF graph of DAE-OCCs tends to decrease. Similar patterns can also be found when investigating the influence of dimensionality, shown in Fig. 5(b). The ranking of datasets by sparsity is similar to the ranking by dimensionality, therefore these two pieces of evidence are partly overlapping. The conclusion is that the benefit of the new AE encodings is greater for sparse, high-dimension datasets, whereas the benefit of the existing DAE encoding is greater for small, non- sparse datasets. The influence of OCC parameters: This is to assess the influence of OCC parameters, ⌫, and k, on the perfor- mance in terms of classification accuracy of OCSVM and LOF when using the latent representations of DAE, SAE and DVAE. The parameter is fixed being equal to 1 nf for investigating ⌫, whereas ⌫ is set to 0.1 when examining . Each of these parameters is examined on fifty different values, ⌫ 2 [0.01, 0.5] and 2 [2⇥10 4 , 2⇥104 ]. We plot AUCs from DAE-OCSVM, SAE-OCSVM and DVAE-OCSVM against ⌫ in Fig. 6(a), and against in Fig. 6(b). The figures show that the AUC curves of SAE-OCSVM and DVAE-OCSVM tend to be stable while those of DAE-OCSVM vary according to the values of ⌫ or . This implies that the latent representations
  • 10. 10 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX TABLE II AUCS FROM THE STAND-ALONE ONE-CLASS CLASSIFIERS, HYBRID DAE-OCCS, SAE-OCCS AND DVAE-OCCS, AND THE RE-BASED CLASSIFIERS. Represen- -tation Methods One-class Classifiers Datasets (Sparsity) P a g e B lo c k s (0 .0 0 ) W P B C (0 .0 2 ) P e n D ig it s (0 .1 3 ) G L A S S (0 .1 8 ) S h u tt le (0 .2 2 ) A rr h y th m ia (0 .5 0 ) C T U 1 3 -1 0 (0 .7 1 ) C T U 1 3 -0 8 (0 .7 3 ) C T U 1 3 -0 9 (0 .7 3 ) C T U 1 3 -1 3 (0 .7 3 ) S p a m b a s e (0 .8 1 ) U N S W -N B 1 5 (0 .8 4 ) N S L -K D D (0 .8 8 ) In te rn e tA d s (0 .9 9 ) Stand-alone LOF 0.971 0.600 0.995 0.972 0.984 0.788 0.902 0.899 0.955 0.963 0.751 0.745 0.793 0.762 CEN 0.944 0.580 0.966 0.961 0.881 0.816 0.996 0.971 0.915 0.916 0.816 0.738 0.955 0.816 MDIS 0.927 0.640 0.962 0.970 0.898 0.786 0.998 0.966 0.734 0.891 0.731 0.801 0.929 0.694 KDE 0.928 0.637 0.961 0.967 0.883 0.787 0.998 0.958 0.720 0.889 0.731 0.800 0.924 0.693 OCSVM⌫=0.5 0.934 0.610 0.961 0.961 0.863 0.794 0.998 0.958 0.851 0.925 0.736 0.807 0.935 0.704 OCSVM⌫=0.1 0.934 0.557 0.968 0.832 0.760 0.807 0.983 0.797 0.852 0.898 0.736 0.792 0.890 0.710 DAE LOF 0.933 0.553 0.997 0.931 0.985 0.654 0.751 0.896 0.891 0.793 0.392 0.736 0.662 0.476 CEN 0.922 0.693 0.964 0.959 0.931 0.738 0.972 0.949 0.628 0.730 0.476 0.743 0.881 0.337 MDIS 0.905 0.700 0.950 0.994 0.901 0.707 0.981 0.960 0.653 0.855 0.466 0.765 0.888 0.342 KDE 0.903 0.690 0.954 0.992 0.892 0.706 0.980 0.939 0.616 0.857 0.460 0.756 0.861 0.335 OCSVM⌫=0.5 0.912 0.630 0.958 0.989 0.885 0.665 0.981 0.938 0.655 0.711 0.454 0.690 0.854 0.325 OCSVM⌫=0.1 0.920 0.557 0.976 0.606 0.762 0.668 0.937 0.775 0.702 0.332 0.578 0.536 0.697 0.314 RE-Based 0.969 0.540 0.997 0.986 0.821 0.824 0.998 0.988 0.943 0.972 0.805 0.873 0.959 0.842 VAE LOF 0.512 0.480 0.549 0.444 0.489 0.479 0.490 0.499 0.507 0.500 0.509 0.505 0.501 0.474 CEN 0.514 0.497 0.549 0.526 0.489 0.461 0.490 0.500 0.507 0.499 0.507 0.504 0.501 0.472 MDIS 0.509 0.517 0.553 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467 KDE 0.509 0.527 0.554 0.523 0.490 0.488 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.467 OCSVM⌫=0.5 0.510 0.517 0.555 0.521 0.490 0.484 0.490 0.498 0.507 0.500 0.507 0.504 0.501 0.466 OCSVM⌫=0.1 0.515 0.537 0.553 0.537 0.491 0.466 0.490 0.498 0.507 0.499 0.505 0.505 0.501 0.463 RE-Based 0.928 0.657 0.959 0.961 0.883 0.784 0.998 0.957 0.698 0.881 0.734 0.801 0.923 0.694 SAE = 10 LOF 0.954 0.607 0.996 0.959 0.817 0.762 1.000 0.983 0.960 0.975 0.813 0.894 0.937 0.943 CEN 0.964 0.610 0.995 0.915 0.800 0.754 0.999 0.991 0.950 0.969 0.835 0.886 0.963 0.935 MDIS 0.967 0.603 0.996 0.898 0.794 0.757 0.999 0.990 0.950 0.968 0.826 0.887 0.964 0.936 KDE 0.967 0.607 0.996 0.884 0.783 0.756 0.999 0.990 0.949 0.968 0.825 0.886 0.964 0.934 OCSVM⌫=0.5 0.967 0.610 0.996 0.876 0.773 0.756 0.999 0.990 0.950 0.970 0.823 0.891 0.964 0.935 OCSVM⌫=0.1 0.956 0.600 0.996 0.890 0.781 0.740 0.999 0.988 0.944 0.971 0.825 0.893 0.961 0.933 RE-Based 0.929 0.637 0.959 0.959 0.884 0.787 0.997 0.958 0.720 0.888 0.734 0.800 0.925 0.690 DVAE = 0.05 ↵ = 10 8 LOF 0.908 0.327 0.987 0.705 0.841 0.807 0.999 0.978 0.954 0.973 0.810 0.876 0.958 0.900 CEN 0.906 0.450 0.988 0.774 0.849 0.777 0.999 0.982 0.956 0.963 0.809 0.879 0.960 0.892 MDIS 0.914 0.437 0.987 0.749 0.810 0.794 0.999 0.984 0.957 0.964 0.806 0.873 0.961 0.883 KDE 0.917 0.430 0.987 0.749 0.802 0.796 0.999 0.985 0.957 0.964 0.806 0.872 0.961 0.882 OCSVM⌫=0.5 0.920 0.450 0.988 0.769 0.802 0.797 0.999 0.987 0.957 0.974 0.808 0.872 
0.961 0.882 OCSVM⌫=0.1 0.922 0.460 0.988 0.791 0.804 0.780 0.999 0.988 0.956 0.973 0.817 0.872 0.959 0.881 RE-Based 0.928 0.640 0.958 0.953 0.880 0.785 0.998 0.922 0.715 0.836 0.734 0.803 0.924 0.694
Fig. 5. The influence of sparsity (a) and dimensionality (b) on the AUCs produced by six one-class classifiers using latent representations of DAE, SAE and DVAE. The visualization of the latent data (the first two features z0 and z1) created by DAE, SAE and DVAE (c) on CTU13-10.
  • 11. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 11 of SAE and DVAE make OCSVM perform consistently over a wide range of ⌫ and values. The number of neighbors k is chosen in the range from 1% to 50% of training size. For example, if k is 10% of a training dataset of size 200 samples, k is equal to 20. The AUCs of hybrid DAE-LOF, SAE-LOF and DVAE-LOF are computed, and plotted against 50 values of k as shown in Fig. 6(c). The AUC curves of the hybrid SAE-LOF and DVAE- LOF seem to level off within the range of k while there is no clear trend for the AUC curve of DAE-LOF. Thus, the latent representations of SAE and DVAE strengthen LOF to be insensitive to the choice of k. More results are shown in Fig. 6 of the supplementary material. These experiments confirm that the one-class classifiers, such as OCSVM and LOF, perform consistently on wide ranges of parameter settings when using the latent represen- tations of SAE and DVAE. This can be explained by: (1) normal data is represented in very well-shaped (Gaussian) distributions, and allocated in a small region highly isolated from the regions where anomalies are expected to appear; (2) the normal data from different sources will have a similar representation. Fig. 5(c) is a typical example (also Fig. 7 in the supplementary material). Therefore, OCSVM and LOF can model normal data very well even though these classifiers use few datapoints for support vectors in OCSVM (e.g. ⌫ = 0.01) or for nearest neighbors in LOF (e.g. k = 1% training size). This happens on several datasets. The influence of training size: We investigate the influence of training size on the latent representations of SAE and DVAE for anomaly detection tasks. Four datasets of more than 10000 training instances are chosen for this experiment, that is CTU13-09, CTU13-13, NSL-KDD and UNSW-NB15. Each dataset is sub-sampled multiple times (sizes ranging from 500 to 10000) to give smaller training set sizes for this experiment. Model selection is used as described in Subsection V-A2. The AUCs and query times produced from the hybrid SAE-OCCs and DVAE-OCCs are plotted against these training sizes as shown in Fig. 8 and Fig. 9 in the supplementary material. The results clearly show that the six one-class classifiers produce very similar AUCs amongst the five sizes on the same dataset. This suggests that the representation models, SAE and DVAE, tend to be consistent on a wide range of training sizes, and are less sensitive to training size than the hybrid DBN-OCCs in [14, see Fig. 5]. This is a positive result because it appears that excessive amounts of data are not required to make this method perform well. In terms of the complexity at query time, CEN out-performs other OCCs, and its query time does not scale with training size. Specific kinds of attacks: Our representation models are also examined on the thirteen specific attack groups in NSL-KDD and UNSW-NB15 as shown in Table III. This table has a similar structure to Table II, without arrangement according to sparsity. In general, the hybrid SAE-OCCs and DVAE-OCCs produce big improvements in the classification accuracy in comparison to their baselines on most of the attack groups, especially on the attack groups where the baseline is already good. This presents a common theme in classification methods. Moreover, the performance of SAE-CEN is evaluated on NSL-KDD by a confusion matrix as shown in Table IV. The confusion matrix is not the same as in the multi-class classification problem. 
This is because the classifiers built from only normal data use a threshold to classify unseen data into either the normal or anomalous class. This means that we can not measure the incorrect classification of a normal datapoint to a specific attack group, or an attack group to other attack groups. Therefore, precision values are only computed for normal and anomaly in the table. In this work, the threshold is set to correctly classify 90% on normal training data. TABLE IV CONFUSION MATRIX OF THE HYBRID SAE-CEN ON NSL-KDD Actual class Precision N o r m a l P r o b l e D o S R 2 L U 2 R Prediction Normal 8658 3 601 848 10 85.6% Anomaly 1053 2418 6857 2039 57 91.5% Recall 89.2% 99.9% 91.9% 70.6% 85.1% 88.8% Note: the values in bold are correctly classified. In terms of classification accuracy, the performance of these one-class classification algorithms are comparable, when the encoding is good (e.g. the encoding of SAE and DVAE). When considering computational complexity, CEN, which is a sim- ple method without hyperparameters, is very computationally efficient at both modeling and querying. Thus, it is nominated as the best model in our experiments. VI. CONCLUSION AND FUTURE WORK In this paper, we proposed latent representation models, SAE and DVAE, which help anomaly detection methods to cope with high-dimensional and sparse network datasets. Classical AEs do not bring data to a “nice” distribution by themselves, and the distribution they create is arbitrary. In the tasks where we rely on good behavior of the encoding, we have to control the distribution. Even with the standard VAE regu- larization which does control the distribution, it does not put the network “under pressure” to use all of its representational power to represent normal data. Our approaches do so, forcing normal data into a very tight area centered at the origin in the non-saturating area of the bottleneck unit activations. This helps AEs trained on normal data to keep normal datapoints close to the origin and push anomalies far away. We have demonstrated the latent representation created by our models helps well-known anomaly detection algorithms to perform efficiently and consistently on high-dimensional and sparse network data, even with relatively few training points. Amongst these algorithms, CEN is very computation- ally efficient and is easily feasible to perform in real-time. More importantly, the representation reduces the difficulty of model selection for these algorithms since their performance is insensitive to a wide range of hyperparameter settings. In future we propose to investigate latent representations using Gaussian mixture models. We also plan to propose an alternative method for estimating the hyperparameter in the loss functions of SAE and DVAE, possibly using multi- objective optimization.
  • 12. 12 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX (a) (b) (c) Fig. 6. The influence of ⌫ (a) and (b), and k (c) on the performance of OCSVM and LOF respectively when using the latent representations of DAE, SAE and DVAE on CTU13-13. TABLE III AUCS FROM THE CLASSIFIERS MENTIONED IN TABLE II ON SPECIFIC ATTACK GROUPS OF NSL-KDD AND UNSW-NB15. Representation Methods One-class Classifiers NSL-KDD UNSW-NB15 P ro b e D o S R 2 L U 2 R F u z z e rs A n a ly s is B a c k d o o r D o S E x p lo it s G e n e ri c R e c o n n - -a is s a n c e S h e ll c o d e W o rm s Stand-alone LOF 0.752 0.796 0.821 0.703 0.455 0.635 0.597 0.614 0.670 0.984 0.436 0.354 0.614 CEN 0.974 0.957 0.933 0.934 0.576 0.732 0.748 0.723 0.633 0.895 0.555 0.508 0.676 MDIS 0.986 0.949 0.831 0.885 0.596 0.890 0.900 0.843 0.660 0.969 0.636 0.583 0.679 KDE 0.985 0.945 0.820 0.871 0.601 0.883 0.893 0.840 0.658 0.969 0.639 0.591 0.684 OCSVM⌫=0.5 0.986 0.957 0.838 0.905 0.652 0.855 0.876 0.845 0.733 0.920 0.658 0.603 0.784 OCSVM⌫=0.1 0.958 0.936 0.714 0.789 0.576 0.712 0.733 0.746 0.731 0.961 0.555 0.469 0.853 DAE LOF 0.620 0.666 0.690 0.509 0.473 0.609 0.560 0.588 0.626 0.985 0.462 0.420 0.561 CEN 0.984 0.926 0.680 0.755 0.551 0.788 0.799 0.744 0.571 0.927 0.626 0.608 0.606 MDIS 0.966 0.912 0.761 0.746 0.565 0.818 0.828 0.770 0.588 0.955 0.644 0.606 0.651 KDE 0.964 0.904 0.666 0.743 0.563 0.799 0.809 0.751 0.571 0.949 0.646 0.614 0.642 OCSVM⌫=0.5 0.982 0.917 0.584 0.795 0.580 0.770 0.798 0.732 0.499 0.827 0.671 0.618 0.732 OCSVM⌫=0.1 0.734 0.834 0.323 0.308 0.391 0.289 0.305 0.417 0.420 0.694 0.527 0.468 0.722 RE-Based 0.981 0.971 0.911 0.930 0.632 0.992 0.957 0.940 0.888 0.979 0.592 0.476 0.816 VAE LOF 0.489 0.504 0.511 0.488 0.503 0.487 0.522 0.494 0.505 0.501 0.489 0.500 0.464 CEN 0.488 0.504 0.511 0.489 0.504 0.487 0.522 0.494 0.506 0.502 0.488 0.501 0.468 MDIS 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465 KDE 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465 OCSVM⌫=0.5 0.489 0.503 0.512 0.489 0.504 0.487 0.523 0.494 0.504 0.501 0.489 0.499 0.464 OCSVM⌫=0.1 0.489 0.504 0.511 0.490 0.504 0.487 0.522 0.494 0.505 0.501 0.489 0.499 0.462 RE-Based 0.985 0.945 0.818 0.871 0.605 0.882 0.893 0.840 0.660 0.968 0.642 0.598 0.686 SAE = 10 LOF 0.964 0.952 0.877 0.920 0.683 0.993 0.963 0.942 0.884 0.992 0.706 0.645 0.909 CEN 0.985 0.971 0.925 0.953 0.646 0.984 0.961 0.952 0.902 0.989 0.625 0.567 0.910 MDIS 0.988 0.971 0.926 0.950 0.629 0.994 0.961 0.952 0.909 0.988 0.646 0.573 0.909 KDE 0.988 0.971 0.925 0.949 0.623 0.993 0.961 0.952 0.909 0.988 0.642 0.559 0.906 OCSVM⌫=0.5 0.987 0.972 0.923 0.948 0.632 0.994 0.965 0.956 0.917 0.988 0.656 0.579 0.907 OCSVM⌫=0.1 0.987 0.973 0.912 0.908 0.648 0.994 0.967 0.957 0.921 0.988 0.642 0.554 0.902 RE-Based 0.985 0.946 0.822 0.872 0.601 0.881 0.891 0.838 0.657 0.969 0.640 0.592 0.685 DVAE = 0.05 ↵ = 10 8 LOF 0.977 0.974 0.896 0.934 0.635 0.996 0.956 0.949 0.898 0.990 0.537 0.457 0.895 CEN 0.983 0.971 0.915 0.929 0.605 0.995 0.958 0.941 0.882 0.990 0.666 0.603 0.881 MDIS 0.982 0.972 0.915 0.927 0.616 0.994 0.955 0.940 0.866 0.990 0.653 0.572 0.854 KDE 0.982 0.972 0.915 0.927 0.608 0.993 0.956 0.939 0.864 0.990 0.658 0.578 0.852 OCSVM⌫=0.5 0.982 0.973 0.914 0.926 0.601 0.993 0.960 0.942 0.869 0.990 0.661 0.584 0.860 OCSVM⌫=0.1 0.981 0.972 0.908 0.908 0.599 0.994 0.961 0.942 0.871 0.990 0.659 0.586 0.860 RE-Based 0.985 0.945 0.820 0.872 0.602 0.888 0.898 0.843 0.660 0.971 0.642 0.593 0.682 REFERENCES [1] M. Ahmed, A. N. 
Van Loi Cao received a BSc and an MSc in Computer Science from Le Quy Don Technical University, Vietnam, where he worked as an assistant lecturer. In 2015, he moved to Ireland to pursue a PhD at University College Dublin under the supervision of Assoc. Prof. James McDermott and Assoc. Prof. Miguel Nicolau, funded by VIED, Vietnam. His main research interests are neural networks, machine learning, evolutionary computation, and information security.

Miguel Nicolau is an Associate Professor at UCD. He received a BSc in Belgium, followed by a BSc, an MSc and a PhD from the University of Limerick. He then worked as an Expert Engineer at the INRIA institute in Paris, France. In 2010 he moved back to Ireland and worked as a Research Fellow and Lecturer at UCD. His teaching experience spans over 15 years and includes positions at the University of Limerick, Fudan University in Shanghai, and UCD.

James McDermott holds a BSc in Computer Science with Mathematics from the National University of Ireland, Galway. He received his PhD from the University of Limerick. His post-doctoral research was carried out at UCD and the Massachusetts Institute of Technology. He is now an Associate Professor at University College Dublin. His main research interests are evolutionary computation, machine learning, and computer music.