This paper proposes using latent representation models, specifically autoencoders (AEs) and variational autoencoders (VAEs), to improve network anomaly detection. The models are trained on only normal data and introduce regularizers that compress normal data into a tight region around the origin in the latent space, while anomalies will have representations further away. This new latent feature space is then used as input to one-class classifiers to detect anomalies. The goal is for the models to perform well even with limited training data and be insensitive to hyperparameter settings, in order to address challenges of network anomaly detection like lack of labeled anomaly data and high dimensionality.
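The idea of pulling normal data toward the origin of the latent space can be sketched as follows. The summary does not state the paper's exact regularizers, so the squared-norm penalty, the weight `lam`, and the fixed `radius` threshold (standing in for a one-class classifier) are assumptions made for illustration.

```python
import math

def shrink_loss(x, x_hat, z, lam=1.0):
    """Reconstruction error plus an assumed penalty that pulls z toward the origin."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    penalty = sum(v * v for v in z)  # squared distance of the latent code from 0
    return reconstruction + lam * penalty

def latent_distance(z):
    """Euclidean distance of a latent code from the origin."""
    return math.sqrt(sum(v * v for v in z))

def is_anomaly(z, radius):
    """Hypothetical stand-in for a one-class classifier: flag codes outside a ball."""
    return latent_distance(z) > radius
```

A model trained only on normal traffic with such a loss would, under these assumptions, map normal inputs to small `latent_distance` values, so inputs whose codes land outside the chosen radius are flagged.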
Evolving Neural Networks through Augmenting Topologies NEAT, by Aavaas Gajurel
NEAT (NeuroEvolution of Augmenting Topologies) is an evolutionary algorithm that can evolve both the structure and weights of neural networks. It addresses issues with evolving network topologies by using historical markings to track genes, speciation to protect innovations, and incremental growth from minimal structures. Evaluations show NEAT can efficiently evolve solutions to problems like the XOR task and pole balancing. Ablation studies indicate all components of NEAT are important for its performance. The document hypothesizes NEAT could be useful for competitive coevolution by allowing multiple strategies, and for integrating separate networks by evolving connections between them.
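How historical markings let two genomes be lined up before crossover can be sketched as below. Representing a genome as a dictionary from innovation number to connection weight is a simplification assumed here, and `align_genes` is a hypothetical helper name.

```python
def align_genes(parent_a, parent_b):
    """Split innovation numbers into matching, disjoint, and excess genes.

    parent_a, parent_b: dicts mapping innovation number -> connection weight
    (a simplified genome encoding assumed for this sketch).
    """
    matching = sorted(set(parent_a) & set(parent_b))
    unmatched = set(parent_a) ^ set(parent_b)
    # Genes beyond the other parent's highest innovation number are "excess";
    # unmatched genes inside that range are "disjoint".
    cutoff = min(max(parent_a), max(parent_b))
    disjoint = sorted(i for i in unmatched if i <= cutoff)
    excess = sorted(i for i in unmatched if i > cutoff)
    return matching, disjoint, excess

# Usage with two small genomes
a = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.9, 5: 0.3, 8: -0.7}
b = {1: 0.4, 2: 0.6, 3: -0.1, 4: 0.2, 5: 0.8, 6: 0.1, 7: -0.3, 9: 0.5, 10: 0.2}
matching, disjoint, excess = align_genes(a, b)
```

Because genes are matched by innovation number rather than by position, no topological analysis is needed to decide which connections correspond.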
Building Neural Network Through Neuroevolution, by bergel
Neuroevolution is a technique that uses genetic algorithms to train neural networks. It evolves network topologies, weights, and hyperparameters without relying on backpropagation or large labeled training datasets. The NEAT algorithm is introduced as a prominent neuroevolution method. It uses speciation to protect innovative network topologies and incremental evolution. NEAT encodes networks and uses historical markings to solve the competing conventions problem during crossover. Extensions to NEAT aim to further improve the scalability and efficiency of neuroevolution.
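Training weights by evolution rather than backpropagation can be illustrated with a very small genetic algorithm. This is a sketch only: it evolves the weights of a single linear neuron with a fixed topology, and the population size, mutation scale `sigma`, and truncation selection are assumptions, not details from the summarized document.

```python
import random

def fitness(weights, data):
    """Negative squared error of a linear neuron; higher is better."""
    err = 0.0
    for inputs, target in data:
        out = sum(w * x for w, x in zip(weights, inputs))
        err += (out - target) ** 2
    return -err

def evolve(data, n_weights, pop_size=20, generations=50, sigma=0.3, seed=0):
    """Evolve weight vectors by selection and Gaussian mutation, with no gradients."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, data), reverse=True)
        history.append(fitness(population[0], data))
        parents = population[: pop_size // 4]  # keep the top quarter
        children = [parents[0]]                # elitism: best survives unchanged
        while len(children) < pop_size:
            parent = rng.choice(parents)
            children.append([w + rng.gauss(0, sigma) for w in parent])
        population = children
    best = max(population, key=lambda w: fitness(w, data))
    return best, history

# Usage: fit target = 2*x1 - x2 from four examples
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
best, history = evolve(data, n_weights=2)
```

Elitism guarantees that the best fitness never decreases from one generation to the next, which is the only property this sketch relies on.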
Evolving Neural Networks Through Augmenting Topologies, by Daniele Loiacono
The document summarizes the neuroevolution of augmenting topologies (NEAT) approach to evolving neural networks. NEAT uses historical markings to match genes during crossover, speciation to protect innovative topologies from premature competition, and complexification to gradually introduce new structures. It has been shown to more effectively and quickly solve problems compared to previous neuroevolution techniques. NEAT has been applied to games, navigation, vehicle systems and more.
This document discusses using neural networks in games. It describes how neural networks can simplify game code and allow AI to adapt as the game is played. It then provides examples of how neural networks could be used for threat assessment, attack/flee decisions, anticipation, and control. It also describes how the game Pong was used to test the NeuroEvolution of Augmenting Topologies (NEAT) algorithm, which uses genetic algorithms to evolve neural networks. Several trials were run with different fitness functions, with some networks learning to predict ball movement and others getting stuck or unresponsive. Possible remedies are suggested to address the problems encountered.
This document discusses various topological and spatial features of brain networks, including small-world properties, motifs, clusters, degree distributions, and robustness. It provides examples of analyses conducted on structural and functional brain networks, such as detecting clusters in the cat cortex and examining the effects of simulated brain lesions. Modular organization is highlighted as important for local integration and global separation of processing. Developing brain networks are found to require less information to encode connectivity patterns compared to random networks.
The document provides an overview of neural networks including:
- Their history from early models in the 1940s to the breakthrough of backpropagation in the 1980s.
- What a neural network is and how it works at the level of individual neurons and when connected together.
- Common applications of neural networks like prediction, classification, and clustering.
- Key considerations in choosing an appropriate neural network architecture and training data for a given problem.
This document discusses various applications of neural networks, including pattern recognition, autonomous vehicles, medicine, sports prediction, and virus detection. Some key applications mentioned are using neural networks for patient diagnosis, detecting coronary artery disease from medical images, predicting sports outcomes based on team statistics, and forecasting space weather events. The document also notes some limitations of neural networks, such as requiring large datasets and not providing explanations for decisions.
AI neural networks can support disaster recovery and security operations in cloud computing systems. A neural network model is proposed that monitors a cloud computing network and can rebuild failed systems through new neural connections. The network uses cooperative coevolution algorithms and evolutionary algorithms to automate remediation. It involves distributed problem solving agents across the cloud network and a layered neural network collective that independently evaluates needs and repairs. This provides a robust, self-healing organizational model for cloud computing infrastructure and operations.
The document discusses artificial neural networks and their application to cryptography. It begins by explaining that artificial neural networks are designed to model the way the brain performs tasks in a massively parallel manner. It then provides details on the basic structure of artificial neural networks, including processing units, weighted connections, and learning rules. The document next discusses using artificial neural networks for cryptography, including implementing a sequential machine with a Jordan network for encryption/decryption and using a chaotic neural network to encrypt digital signals in a secure manner. It concludes that artificial neural networks provide a novel approach for encrypting and decrypting data.
1) Artificial neural networks (ANNs) are processing systems inspired by biological neural networks, consisting of interconnected nodes that process information via algorithms or hardware components. ANNs can accurately model functions like visual processing in the retina.
2) ANNs are useful for problems like facial recognition that are difficult to solve with algorithms due to their ability to learn from examples in a way similar to the human brain.
3) ANNs have many applications, including pattern recognition, modeling complex relationships in large datasets, and real-time systems due to their parallel architecture.
1. Neural networks can be examined at multiple levels from individual axons between neurons to fibre tracts between brain areas.
2. Types of connectivity include structural (revealed by DTI), functional (inferred from correlated activity), and effective (showing causal relationships).
3. Network analysis examines topological properties like modular clusters, small-world organization with high clustering and short path lengths, as well as spatial organization of brain regions.
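The two quantities behind the small-world description in the list above, clustering and path length, can be computed on a toy graph as a minimal sketch. The adjacency-dictionary encoding and the four-node example graph are assumptions for illustration, not brain data.

```python
from collections import deque

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u in neighbors for v in neighbors
                if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))

def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy undirected graph: a triangle (0, 1, 2) with node 3 attached to node 2
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

A small-world network combines a high average clustering coefficient with an average path length close to that of a random graph of the same size.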
This presentation covers VLSI for neural networks and their applications. Biological neural networks are networks of biological neurons that perform physiological functions. Artificial neural networks are mathematical models inspired by biological neural networks. Neural networks can be digital, analog, or hybrid, and have applications in areas such as pattern and speech recognition, economics, sociology, and basic sciences, for example investigating the impact of treatments over time. The presentation concludes that artificial neural networks, which simulate biological neurons, have potential for wide implementation: they can be trained on input data and then apply that knowledge to new cases.
Artificial Neural Network and its Applications, by shritosh kumar
Abstract
This report is an introduction to Artificial Neural Networks. The various types of neural networks are explained and demonstrated, applications of neural networks such as their use in medicine are described, and a detailed historical background is provided. The connection between the artificial and the real thing is also investigated and explained. Finally, the mathematical models involved are presented and demonstrated.
My invited talk at the 23rd International Symposium of Mathematical Programmi..., by Anirbit Mukherjee
This document provides an overview of the author's research on neural networks. It begins with an introduction to the papers the overview is based on and the collaborators involved. It then discusses open questions about characterizing the functions represented by neural networks and some of the author's results, including: proving certain functions require a depth of log(n+1) to represent; showing depth separations between network depths; and establishing gaps between different network architectures for Boolean functions. The author outlines ongoing work on fully characterizing neural network functions and establishing stronger depth separations.
IRJET-Breast Cancer Detection using Convolution Neural Network, by IRJET Journal
This document discusses using a convolutional neural network (CNN) to detect breast cancer from medical images. CNNs are a type of deep learning model that can learn image features without manual feature engineering. The proposed system would take a sample medical image as input, preprocess it, and compare it to images in a database labeled as cancerous or non-cancerous. If cancer is detected, the system would determine the cancer stage and recommend appropriate treatment. The CNN model would be built and trained using libraries like Keras, TensorFlow, and Numpy to classify images and detect breast cancer at early stages for better treatment outcomes.
This document provides an overview of artificial neural networks (ANNs). It discusses ANN basics such as their structure being inspired by biological neural networks in the brain. The document covers different types of ANNs including feedforward and feedback networks. It also discusses ANN properties like learning strategies, applications, advantages like handling noisy data, and disadvantages like requiring training. The conclusion states that ANNs are flexible and suited for real-time systems due to their parallel architecture.
This document provides an overview and summary of a student project report on simulating a feed forward artificial neural network in C++. The report includes an abstract, table of contents, list of figures, and 5 chapters that discuss the objectives of the project, provide background on artificial neural networks, describe the design and implementation of a 3-layer feed forward neural network using backpropagation, present the results, and provide references. The design section explains the backpropagation algorithm and provides pseudocode for calculating outputs at each layer. The implementation section provides pseudocode for training patterns and minimizing error.
Artificial neural networks seminar presentation using MSWord, by Mohd Faiz
This document provides an overview of artificial neural networks. It discusses neural network architectures including feedforward and recurrent networks. It covers neural network learning methods such as supervised learning, unsupervised learning, and reinforcement learning. Backpropagation is described as a method for training neural networks by calculating partial derivatives of the error function. Higher order learning algorithms and considerations for designing neural networks like choosing the number of hidden layers and activation functions are also summarized.
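What "calculating partial derivatives of the error function" means can be shown for the smallest possible case. The sketch below assumes a single sigmoid neuron with squared error; the function names and the learning rate are illustrative choices, not taken from the summarized slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, b, x, target):
    """Squared error E = 0.5 * (y - target)^2 for one sigmoid neuron."""
    y = sigmoid(w * x + b)
    return 0.5 * (y - target) ** 2

def gradients(w, b, x, target):
    """Partial derivatives dE/dw and dE/db obtained with the chain rule."""
    y = sigmoid(w * x + b)
    delta = (y - target) * y * (1.0 - y)  # dE/dz, where z = w*x + b
    return delta * x, delta

def train(w, b, x, target, rate=0.5, steps=100):
    """Plain gradient descent using the derivatives above."""
    for _ in range(steps):
        dw, db = gradients(w, b, x, target)
        w -= rate * dw
        b -= rate * db
    return w, b
```

In a multilayer network the same `delta` term is propagated backwards through each layer, which is where the name backpropagation comes from.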
Robust Feature Learning with Deep Neural Networks
http://snu-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/display.do?tabs=viewOnlineTab&doc=82SNU_INST21557911060002591
This document is a preface to a book about neural networks. It provides an overview of the book's contents and objectives. The book aims to present a variety of standard neural network architectures along with their training algorithms and examples of applications. It is intended as both a textbook and reference for students and researchers interested in using neural networks. The preface outlines the scope and organization of the material covered in the book.
This document discusses functional brain networks and network science approaches to studying the brain. It begins by defining complex systems and network science. It then outlines the main types of brain networks - anatomical and functional networks. Functional brain networks are constructed from time series data measuring brain activity and can be analyzed using network measures to study properties like segregation, integration and resilience.
The document provides an introduction to neural networks, including:
- Biological neural networks transmit signals via neurons connected by synapses and axons.
- Artificial neural networks are composed of simple processing elements (neurons) that operate in parallel and are determined by network structure and connection strengths (weights).
- Multilayer neural networks consist of an input layer, hidden layers, and output layer connected by weights to solve complex problems. Learning involves updating weights so the network can efficiently perform tasks.
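The layered structure described in the list above can be sketched as a forward pass through one hidden layer. The weights here are hand-picked assumptions chosen so the network computes XOR, a function that a single neuron cannot represent; a step activation is used for readability.

```python
def step(z):
    """Threshold activation: fires (1) when the weighted input is positive."""
    return 1 if z > 0 else 0

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of all inputs plus a bias."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def xor_network(x1, x2):
    # Hidden layer: first neuron acts as OR, second as AND (hand-picked weights)
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Output neuron: OR and not AND
    output = layer(hidden, weights=[[1, -1]], biases=[-0.5])
    return output[0]
```

Learning replaces the hand-picked numbers with weights adjusted from examples, but the forward computation keeps this shape.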
This document summarizes artificial neural networks. It discusses how neural networks are composed of interconnected neurons that can learn complex behaviors through simple principles. Neural networks can be used for applications like pattern recognition, noise reduction, and prediction. The key components of neural networks are neurons, synapses, weights, thresholds, and activation functions. Neural networks offer advantages like adaptability and fault tolerance, though they are not exact and can be complex. Examples of neural network applications discussed include object trajectory learning, radiosity for virtual reality, speechreading, target detection and tracking, and robotics.
Artificial Neural Network Paper Presentation, by guestac67362
The document provides an introduction to artificial neural networks. It discusses how neural networks are designed to mimic the human brain by using interconnected processing elements like neurons. The key aspects covered are:
- Neural networks can perform tasks like pattern recognition that are difficult for traditional algorithms.
- They are composed of interconnected nodes that transmit scalar messages to each other via weighted connections like synapses.
- Neural networks are trained by presenting examples, allowing the weighted connections to adjust until the network produces the desired output for each input.
Neural networks are inspired by biological neural networks and are composed of interconnected processing elements called neurons. Neural networks can learn complex patterns and relationships through a learning process without being explicitly programmed. They are widely used for applications like pattern recognition, classification, forecasting and more. The document discusses neural network concepts like architecture, learning methods, activation functions and applications. It provides examples of biological and artificial neurons and compares their characteristics.
Web server load prediction and anomaly detection from hypertext transfer prot..., by IJECEIAES
As network traffic increases and new intrusions occur, anomaly detection solutions based on machine learning are necessary to detect previously unknown intrusion patterns. Most of the developed models require a labelled dataset, which can be challenging to obtain owing to a shortage of publicly available datasets. These datasets are often too small to effectively train machine learning models, which further motivates the use of real unlabelled traffic. By using real traffic, it is possible to more accurately simulate the types of anomalies that might occur in a real-world network and improve the performance of the detection model. We present a method able to predict and categorize anomalies without the aid of a labelled dataset, demonstrating the model’s usability while also gathering a dataset from real noisy network traffic. The proposed long short-term memory (LSTM) based intrusion detection system was tested in a real-world setting at an antivirus company and successfully detected various intrusions using 5-minute windowing over both the predicted and real update curves, thereby demonstrating its usefulness. Our contribution is a robust model generally applicable to any hypertext transfer protocol (HTTP) traffic with almost real-time anomaly detection, which also outperforms earlier studies in terms of prediction accuracy.
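Windowed comparison of a predicted curve against the observed one can be sketched as below. The abstract does not say how the two curves are compared, so the mean absolute error per window, the non-overlapping windows, and the `threshold` value are all assumptions; `predicted` stands for the output of whatever forecasting model is used.

```python
def flag_windows(actual, predicted, window, threshold):
    """Flag each non-overlapping window whose mean absolute error is too large."""
    flags = []
    for start in range(0, len(actual) - window + 1, window):
        errors = [abs(a - p)
                  for a, p in zip(actual[start:start + window],
                                  predicted[start:start + window])]
        flags.append(sum(errors) / window > threshold)
    return flags

# Usage: requests per interval, with a burst the forecast did not anticipate
actual = [10, 10, 10, 10, 50, 50]
predicted = [10, 11, 10, 9, 10, 10]
flags = flag_windows(actual, predicted, window=2, threshold=5)
```

With samples taken once per minute, `window=5` would correspond to the 5-minute windows mentioned in the abstract.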
Congestion Prediction in Internet of Things Network using Temporal Convolutio..., by vitsrinu
The document proposes a novel congestion prediction approach for Internet of Things (IoT) networks using a Temporal Convolutional Network (TCN). It aims to more accurately predict network congestion compared to other machine learning models. The key contributions are using a Taguchi method to optimize the TCN model hyperparameters, applying dropout to avoid overfitting on heterogeneous IoT data, and developing a Home IoT testbed to generate real network traffic data for model training and evaluation. Experimental results show the TCN approach achieves 95.52% accuracy in predicting IoT network congestion.
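The building block of a temporal convolutional network is a causal, dilated 1-D convolution, sketched below in plain Python. The kernel values and dilation are illustrative assumptions; the summarized document's actual architecture and hyperparameters are not given here.

```python
def causal_dilated_conv(signal, kernel, dilation=1):
    """Causal 1-D convolution: output at time t uses only inputs at t, t-d, t-2d, ...

    Positions before the start of the signal are treated as zero.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out

# Usage: with kernel [1, 1] and dilation 2, each output is s[t] + s[t-2]
result = causal_dilated_conv([1, 2, 3, 4], kernel=[1, 1], dilation=2)
```

Stacking such layers with growing dilation lets the network see a long traffic history without ever looking at future samples.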
The document discusses artificial neural networks and their application to cryptography. It begins by explaining that artificial neural networks are designed to model the way the brain performs tasks in a massively parallel manner. It then provides details on the basic structure of artificial neural networks, including processing units, weighted connections, and learning rules. The document next discusses using artificial neural networks for cryptography, including implementing a sequential machine with a Jordan network for encryption/decryption and using a chaotic neural network to encrypt digital signals in a secure manner. It concludes that artificial neural networks provide a novel approach for encrypting and decrypting data.
1) Artificial neural networks (ANNs) are processing systems inspired by biological neural networks, consisting of interconnected nodes that process information via algorithms or hardware components. ANNs can accurately model functions like visual processing in the retina.
2) ANNs are useful for problems like facial recognition that are difficult to solve with algorithms due to their ability to learn from examples in a way similar to the human brain.
3) ANNs have many applications, including pattern recognition, modeling complex relationships in large datasets, and real-time systems due to their parallel architecture.
1. Neural networks can be examined at multiple levels from individual axons between neurons to fibre tracts between brain areas.
2. Types of connectivity include structural revealed by DTI, functional from correlated activity, and effective showing causal relationships.
3. Network analysis examines topological properties like modular clusters, small-world organization with high clustering and short path lengths, as well as spatial organization of brain regions.
VLSI for neural networks and their applications was presented. Biological neural networks refer to networks of biological neurons that perform physiological functions. Artificial neural networks are mathematical models inspired by biological neural networks. Neural networks can be digital, analog, or hybrid and have applications in areas like pattern and speech recognition, economy, sociology, and basic sciences like investigating the impact of treatments over time. In conclusion, artificial neural networks that simulate human biological neurons have potential for wide implementation and can be trained on input data and then apply that knowledge to new cases.
Artificial Neural Network and its Applicationsshritosh kumar
Abstract
This report is an introduction to Artificial Neural
Networks. The various types of neural networks are
explained and demonstrated, applications of neural
networks like ANNs in medicine are described, and a
detailed historical background is provided. The
connection between the artificial and the real thing is
also investigated and explained. Finally, the
mathematical models involved are presented and
demonstrated.
My invited talk at the 23rd International Symposium of Mathematical Programmi...Anirbit Mukherjee
This document provides an overview of the author's research on neural networks. It begins with an introduction to the papers the overview is based on and the collaborators involved. It then discusses open questions about characterizing the functions represented by neural networks and some of the author's results, including: proving certain functions require a depth of log(n+1) to represent; showing depth separations between network depths; and establishing gaps between different network architectures for Boolean functions. The author outlines ongoing work on fully characterizing neural network functions and establishing stronger depth separations.
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET Journal
This document discusses using a convolutional neural network (CNN) to detect breast cancer from medical images. CNNs are a type of deep learning model that can learn image features without manual feature engineering. The proposed system would take a sample medical image as input, preprocess it, and compare it to images in a database labeled as cancerous or non-cancerous. If cancer is detected, the system would determine the cancer stage and recommend appropriate treatment. The CNN model would be built and trained using libraries like Keras, TensorFlow, and Numpy to classify images and detect breast cancer at early stages for better treatment outcomes.
This document provides an overview of artificial neural networks (ANNs). It discusses ANN basics such as their structure being inspired by biological neural networks in the brain. The document covers different types of ANNs including feedforward and feedback networks. It also discusses ANN properties like learning strategies, applications, advantages like handling noisy data, and disadvantages like requiring training. The conclusion states that ANNs are flexible and suited for real-time systems due to their parallel architecture.
This document provides an overview and summary of a student project report on simulating a feed forward artificial neural network in C++. The report includes an abstract, table of contents, list of figures, and 5 chapters that discuss the objectives of the project, provide background on artificial neural networks, describe the design and implementation of a 3-layer feed forward neural network using backpropagation, present the results, and provide references. The design section explains the backpropagation algorithm and provides pseudocode for calculating outputs at each layer. The implementation section provides pseudocode for training patterns and minimizing error.
Artificial neural networks seminar presentation using MSWord.Mohd Faiz
This document provides an overview of artificial neural networks. It discusses neural network architectures including feedforward and recurrent networks. It covers neural network learning methods such as supervised learning, unsupervised learning, and reinforcement learning. Backpropagation is described as a method for training neural networks by calculating partial derivatives of the error function. Higher order learning algorithms and considerations for designing neural networks like choosing the number of hidden layers and activation functions are also summarized.
Robust Feature Learning with Deep Neural Networks
http://snu-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/display.do?tabs=viewOnlineTab&doc=82SNU_INST21557911060002591
This document is a preface to a book about neural networks. It provides an overview of the book's contents and objectives. The book aims to present a variety of standard neural network architectures along with their training algorithms and examples of applications. It is intended as both a textbook and reference for students and researchers interested in using neural networks. The preface outlines the scope and organization of the material covered in the book.
This document discusses functional brain networks and network science approaches to studying the brain. It begins by defining complex systems and network science. It then outlines the main types of brain networks - anatomical and functional networks. Functional brain networks are constructed from time series data measuring brain activity and can be analyzed using network measures to study properties like segregation, integration and resilience.
The document provides an introduction to neural networks, including:
- Biological neural networks transmit signals via neurons connected by synapses and axons.
- Artificial neural networks are composed of simple processing elements (neurons) that operate in parallel and are determined by network structure and connection strengths (weights).
- Multilayer neural networks consist of an input layer, hidden layers, and output layer connected by weights to solve complex problems. Learning involves updating weights so the network can efficiently perform tasks.
This document summarizes artificial neural networks. It discusses how neural networks are composed of interconnected neurons that can learn complex behaviors through simple principles. Neural networks can be used for applications like pattern recognition, noise reduction, and prediction. The key components of neural networks are neurons, synapses, weights, thresholds, and activation functions. Neural networks offer advantages like adaptability and fault tolerance, though they are not exact and can be complex. Examples of neural network applications discussed include object trajectory learning, radiosity for virtual reality, speechreading, target detection and tracking, and robotics.
Artificial Neural Network Paper Presentationguestac67362
The document provides an introduction to artificial neural networks. It discusses how neural networks are designed to mimic the human brain by using interconnected processing elements like neurons. The key aspects covered are:
- Neural networks can perform tasks like pattern recognition that are difficult for traditional algorithms.
- They are composed of interconnected nodes that transmit scalar messages to each other via weighted connections like synapses.
- Neural networks are trained by presenting examples, allowing the weighted connections to adjust until the network produces the desired output for each input.
Neural networks are inspired by biological neural networks and are composed of interconnected processing elements called neurons. Neural networks can learn complex patterns and relationships through a learning process without being explicitly programmed. They are widely used for applications like pattern recognition, classification, forecasting and more. The document discusses neural network concepts like architecture, learning methods, activation functions and applications. It provides examples of biological and artificial neurons and compares their characteristics.
Web server load prediction and anomaly detection from hypertext transfer prot...IJECEIAES
As network traffic increases and new intrusions occur, anomaly detection solutions based on machine learning are necessary to detect previously unknown intrusion patterns. Most of the developed models require a labelled dataset, which can be challenging owing to a shortage of publicly available datasets. These datasets are often too small to effectively train machine learning models, which further motivates the use of real unlabeled traffic. By using real traffic, it is possible to more accurately simulate the types of anomalies that might occur in a real-world network and improve the performance of the detection model. We present a method able to predict and categorize anomalies without the aid of a labelled dataset, demonstrating the model’s usability while also gathering a dataset from real noisy network traffic. The proposed long short-term memory (LSTM) based intrusion detection system was tested in a real-world setting of an antivirus company and was successful in detecting various intrusions using 5-minute windowing over both the predicted and real update curves, thereby demonstrating its usefulness. Our contribution was the development of a robust model generally applicable to any hypertext transfer protocol (HTTP) traffic with almost real-time anomaly detection, while also outperforming earlier studies in terms of prediction accuracy.
Congestion Prediction in Internet of Things Network using Temporal Convolutio...vitsrinu
The document proposes a novel congestion prediction approach for Internet of Things (IoT) networks using a Temporal Convolutional Network (TCN). It aims to more accurately predict network congestion compared to other machine learning models. The key contributions are using a Taguchi method to optimize the TCN model hyperparameters, applying dropout to avoid overfitting on heterogeneous IoT data, and developing a Home IoT testbed to generate real network traffic data for model training and evaluation. Experimental results show the TCN approach achieves 95.52% accuracy in predicting IoT network congestion.
An efficient approach on spatial big data related to wireless networks and it...eSAT Journals
Abstract
Spatial big data plays an important role in wireless network applications. Spatial and spatio-temporal problems have a distinct role in big data compared to common relational problems. To address these problems, we describe three applications of spatial big data, each imposing a specific design, and we develop highly scalable parallel processing for spatial big data in the Hadoop framework using the MapReduce computational model. Our results show that Hadoop enables highly scalable implementations of algorithms for spatial data processing problems. However, developing these implementations requires specialized knowledge and is not user friendly.
Keywords: Spatial Big Data, Hadoop, Wireless Networks, Map reduce
A data estimation for failing nodes using fuzzy logic with integrated microco...IJECEIAES
Continuous data transmission in wireless sensor networks (WSNs) is one of the most important characteristics which makes sensors prone to failure. A backup strategy needs to co-exist with the infrastructure of the network to assure that no data is missing. The proposed system relies on a backup strategy of building a history file that stores all collected data from these nodes. This file is used later on by fuzzy logic to estimate missing data in case of failure. An easily programmable microcontroller unit is equipped with a data storage mechanism used as cost worthy storage media for these data. An error in estimation is calculated constantly and used for updating a reference “optimal table” that is used in the estimation of missing data. The error values also assure that the system doesn’t go into an incremental error state. This paper presents a system integrated of optimal data table, microcontroller, and fuzzy logic to estimate missing data of failing sensors. The adapted approach is guided by the minimum error calculated from previously collected data. Experimental findings show that the system has great potentials of continuing to function with a failing node, with very low processing capabilities and storage requirements.
Robust encryption algorithm based sht in wireless sensor networksijdpsjournal
In bound applications, the locations of events reportable by a device network have to be compelled to stay anonymous. That is, unauthorized observers should be unable to notice the origin of such events by analyzing the network traffic. I analyze 2 forms of downsides: Communication overhead and machine load problem. During this paper, I gift a brand new framework for modeling, analyzing, and evaluating obscurity in device networks. The novelty of the proposed framework is twofold: initial, it introduces the notion of “interval indistinguishability” and provides a quantitative live to model obscurity in wireless device networks; second, it maps supply obscurity to the applied mathematics downside I showed that the present approaches for coming up with statistically anonymous systems introduce correlation in real intervals whereas faux area unit unrelated. I show however mapping supply obscurity to consecutive hypothesis testing with nuisance Parameters ends up in changing the matter of exposing non-public supply data into checking out associate degree applicable knowledge transformation that removes or minimize the impact of the nuisance data victimization sturdy cryptography algorithmic rule. By doing therefore, I remodel the matter of analyzing real valued sample points to binary codes, that opens the door for committal to writing theory to be incorporated into the study of anonymous networks. In existing work, unable to notice unauthorized observer in network traffic. However our work in the main supported enhances their supply obscurity against correlation check. The most goal of supply location privacy is to cover the existence of real events.
SECURING BGP BY HANDLING DYNAMIC NETWORK BEHAVIOR AND UNBALANCED DATASETSIJCNCJournal
The Border Gateway Protocol (BGP) provides crucial routing information for the Internet infrastructure. A problem with abnormal routing behavior affects the stability and connectivity of the global Internet. The biggest hurdles in detecting BGP attacks are the extremely unbalanced class distribution of the datasets and the dynamic nature of the network, both of which degrade classifier performance. In this paper we propose an efficient approach to managing these problems. The proposed approach tackles the unbalanced classification of datasets by turning the problem of binary classification into a problem of multiclass classification. This is achieved by splitting the majority-class samples evenly into multiple segments using Affinity Propagation, where the number of segments is chosen so that the number of samples in any segment closely matches the number of minority-class samples. These segments of the dataset, together with the minority class, are then viewed as different classes and used to train the Extreme Learning Machine (ELM). The RIPE and BCNET datasets are used to evaluate the performance of the proposed technique. When no feature selection is used, the proposed technique improves the F1 score by 1.9% compared to state-of-the-art techniques. With the Fisher feature selection algorithm, the proposed algorithm achieved the highest F1 score of 76.3%, which was a 1.7% improvement over the compared ones. Additionally, the MIQ feature selection technique improves the accuracy by 3.5%. For the BCNET dataset, the proposed technique improves the F1 score by 1.8% for the Fisher feature selection technique. The experimental findings support the substantial improvement in performance from previous approaches by the new technique.
A COMBINATION OF TEMPORAL SEQUENCE LEARNING AND DATA DESCRIPTION FOR ANOMALYB...IJNSA Journal
Through continuous observation and modelling of normal behavior in networks, an Anomaly-based Network Intrusion Detection System (A-NIDS) offers a way to find possible threats via deviation from the normal model. The analysis of network traffic based on a time series model has the advantage of exploiting the relationship between packets within network traffic and observing trends of behaviors over a period of time. It will generate new sequences with good features that support anomaly detection in network traffic and provide the ability to detect new attacks. Besides, an anomaly detection technique that focuses on the normal data and aims to build a description of it will be an effective technique for anomaly detection in imbalanced data. In this paper, we propose a model that combines a Long Short Term Memory (LSTM) architecture for processing time series with Support Vector Data Description (SVDD) for anomaly detection in A-NIDS, to obtain the advantages of both. In this model, the parameters of the LSTM and SVDD are trained jointly with a joint optimization method. Our experimental results with the KDD99 dataset show that the proposed combined model obtains high performance in intrusion detection, especially for DoS and Probe attacks, with 98.0% and 99.8%, respectively.
A COMBINATION OF TEMPORAL SEQUENCE LEARNING AND DATA DESCRIPTION FOR ANOMALYB...IJNSA Journal
This document summarizes a research paper that proposes combining long short-term memory (LSTM) and support vector data description (SVDD) for anomaly detection in anomaly-based network intrusion detection systems (A-NIDS). The paper argues that analyzing network traffic as a time series using LSTM can capture relationships between packets, but LSTM alone does not directly optimize for anomaly detection. It also notes that A-NIDS often only have normal data available for training. Therefore, the paper proposes combining LSTM to learn temporal features from network traffic with SVDD, an anomaly detection technique that builds a description of normal data. The combined model trains LSTM and SVDD parameters jointly using a joint optimization method. An evaluation on the KDD99 dataset
This document summarizes 5 references related to machine learning and data mining for computer security and anomaly detection. Reference 1 discusses using decision trees to classify server traffic based on a set of designed features. Reference 2 argues that analyzing distributions of packet features can detect and identify a diverse set of anomalies. Reference 3 examines machine learning issues in anomaly detection for computer security. Reference 4 provides an overview of using machine learning and data mining for problems in computer security. Reference 5 covers basic statistical techniques for computer intrusion detection and network monitoring.
BIG DATA ANALYTICS FOR USER-ACTIVITY ANALYSIS AND USER-ANOMALY DETECTION IN...Nexgen Technology
Intelligent black hole detection in mobile AdHoc networksIJECEIAES
Security is a critical and challenging issue in MANET due to its open-nature characteristics such as: mobility, wireless communications, self-organizing and dynamic topology. MANETs are commonly the target of black hole attacks. These are launched by malicious nodes that join the network to sabotage and drain it of its resources. Black hole nodes intercept exchanged data packets and simply drop them. The black hole node uses vulnerabilities in the routing protocol of MANETS to declare itself as the closest relay node to any destination. This work proposed two detection protocols based on the collected dataset, namely: the BDD-AODV and Hybrid protocols. Both protocols were built on top of the original AODV. The BDD-AODV protocol depends on the features collected for the prevention and detection of black hole attack techniques. On the other hand, the Hybrid protocol is a combination of both the MI-AODV and the proposed BDD-AODV protocols. Extensive simulation experiments were conducted to evaluate the performance of the proposed algorithms. Simulation results show that the proposed protocols improved the detection and prevention of black hole nodes, and hence, the network achieved a higher packet delivery ratio, lower dropped packets ratio, and lower overhead. However, this improvement led to a slight increase in the end-to-end delay.
A New Way of Identifying DOS Attack Using Multivariate Correlation Analysisijceronline
This document summarizes a research paper that proposes a new method for identifying denial of service (DoS) attacks using multivariate correlation analysis (MCA). The method involves three main steps: 1) generating basic features from network traffic, 2) using MCA to extract correlations between features and generate triangle area maps, and 3) using an anomaly-based detection mechanism to distinguish attacks from normal traffic based on differences from pre-generated normal profiles. The researchers evaluate their method on the KDD Cup 99 dataset and achieve moderate detection performance. However, they identify issues related to differences in feature scales that reduce detection of some attacks. They propose using statistical normalization to address this.
Reliable and Efficient Data Acquisition in Wireless Sensor NetworkIJMTST Journal
The sensors in the WSN sense their surroundings, collect data, and transfer it to the sink node. It
has been observed that the sensor nodes are deactivated or damaged when exposed to certain radiations or
due to energy problems. This damage leads to the temporary isolation of the nodes from the network which
results in the formation of the holes. These holes are dynamic in nature and can grow and shrink depending
upon the factors causing the damage to the sensor nodes. So a solution has been presented in the base paper
where the dual mode i.e. Radio frequency and the Acoustic mode are considered so that the data can be
transferred easily. Based on this a survey has been done where several factors are studied so that the
performance of the system can be increased.
Security Method in Data Acquisition Wireless Sensor Network Dharmendrasingh417
This document discusses security methods for data acquisition in wireless sensor networks. It first introduces wireless sensor networks and some of their challenges, including security issues. It then outlines the objectives of exploring routing algorithms and an intrusion prevention system to authenticate nodes and ensure data integrity and confidentiality. The document describes the proposed system of sensor nodes communicating with router pairs running dual routing algorithms and an intrusion prevention system to filter unauthorized data packets. It presents some experimental results on security and power consumption and concludes that the existing system focuses on self-powered routing but more research is still needed on secure and energy-efficient solutions.
Titles with Abstracts_2023-2024_Data Mining.pdfinfo751436
Data mining projects offer several advantages across various industries. Here are some key benefits:
Knowledge Discovery:
Data mining allows organizations to discover hidden patterns, trends, and relationships within large datasets that may not be immediately apparent. This knowledge can be invaluable for making informed decisions.
Improved Decision Making:
By analyzing historical data, data mining enables better decision-making processes. Businesses can use insights gained from data mining to make strategic decisions, optimize operations, and identify areas for improvement.
Customer Segmentation:
Data mining helps in identifying customer segments based on their behavior, preferences, and purchasing patterns. This allows businesses to tailor their marketing strategies, leading to more targeted and effective campaigns.
Fraud Detection:
In industries such as finance and healthcare, data mining is used for detecting fraudulent activities. Analyzing patterns in transactions or claims data can help identify anomalies that may indicate fraudulent behavior.
Predictive Analysis:
Data mining enables predictive modeling, allowing organizations to forecast future trends and outcomes. This is particularly useful in fields like finance, marketing, and healthcare for predicting stock prices, customer behavior, or disease outbreaks.
Process Optimization:
By analyzing operational data, organizations can identify bottlenecks, inefficiencies, and areas for improvement. This leads to more streamlined and efficient business processes.
Personalization:
Data mining enables businesses to create personalized experiences for customers. This is evident in recommendation systems used by companies like Amazon and Netflix, which analyze user behavior to suggest products or content tailored to individual preferences.
Healthcare Insights:
In healthcare, data mining can be used to analyze patient records, identify disease patterns, and optimize treatment plans. This can contribute to better patient outcomes and more efficient healthcare delivery.
Risk Management:
Industries such as insurance and finance benefit from data mining for risk assessment. By analyzing historical data, organizations can assess and mitigate risks more effectively.
Scientific Discovery:
In scientific research, data mining is used to analyze large datasets generated by experiments. This can lead to the discovery of new patterns, correlations, or insights that may not be apparent through traditional methods.
Competitive Advantage:
Organizations that effectively leverage data mining gain a competitive advantage. The insights derived from data can help businesses stay ahead of market trends and make strategic decisions that give them an edge over competitors.
Cost Savings:
By identifying and addressing inefficiencies, data mining can contribute to cost savings. This is especially important in industries with tight profit margins.
Cao nicolau-mc dermott-learning-neural-cybernetics-2018-preprint
IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX
Learning Neural Representations for Network Anomaly Detection
Van Loi Cao, Miguel Nicolau and James McDermott
Abstract—This paper proposes latent representation models for
improving network anomaly detection. Well-known anomaly
detection algorithms often suffer from challenges posed by network
data, such as high dimension and sparsity, and a lack of anomaly
data for training, model selection, and hyperparameter tuning.
Our approach is to introduce new regularizers to a classical
Autoencoder (AE) and a Variational Autoencoder (VAE), which
force normal data into a very tight area centered at the origin in
the non-saturating area of the bottleneck unit activations. These
AEs, trained on normal data, will push normal points towards
the origin, whereas anomalies, which differ from normal data,
will be put far away from the normal region. The models are
very different from common regularized AEs, such as the Sparse AE
and the Contractive AE, which tend to make
their latent representation less sensitive to changes of the input
data. The bottleneck feature space is then used as a new data
representation. A number of one-class learning algorithms are
used for evaluating the proposed models. The experiments show
that our models help these classifiers to perform efficiently and
consistently on high-dimensional and sparse network datasets,
even with relatively few training points. More importantly, the
models can minimize the effect of model selection on these
classifiers since their performance is insensitive to a wide range
of hyperparameter settings.
Index Terms—Anomaly detection, latent representation, high
dimension, one-class classification, autoencoders.
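The regularization idea in the abstract can be illustrated with a small sketch. It assumes the regularizer is a squared-norm penalty on the bottleneck activations added to the reconstruction error; the function name `shrink_ae_loss`, the weight `lam`, and the toy values are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
def squared_norm(v):
    """Sum of squared components of a vector given as a list of floats."""
    return sum(c * c for c in v)


def shrink_ae_loss(xs, x_hats, zs, lam=1.0):
    """Mean reconstruction error plus a penalty pulling latent codes to the origin.

    xs, x_hats: inputs and their reconstructions (lists of equal-length vectors).
    zs: bottleneck activations for the same samples.
    lam: hypothetical trade-off weight (an assumption, not a value from the paper).
    """
    n = len(xs)
    recon = sum(
        squared_norm([a - b for a, b in zip(x, x_hat)])
        for x, x_hat in zip(xs, x_hats)
    ) / n
    shrink = sum(squared_norm(z) for z in zs) / n
    return recon + lam * shrink


# Toy data: reconstructions are perfect, so only the latent penalty differs.
xs = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
x_hats = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
z_near = [[0.1, 0.1], [0.1, 0.1]]  # latent codes close to the origin
z_far = [[2.0, 2.0], [2.0, 2.0]]   # latent codes far from the origin

loss_near = shrink_ae_loss(xs, x_hats, z_near)
loss_far = shrink_ae_loss(xs, x_hats, z_far)
print(loss_near, loss_far)
```

With equal reconstruction quality, the loss is lower when the latent codes sit near the origin, which is the pressure that training on normal data alone would apply under this assumed penalty.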
I. INTRODUCTION
THE rapid growth of computer networks has enabled them
to function as a central information system in modern
life. The increase in the size, services and applications, and
infrastructure of computer networks such as the Internet of
Things (IoT), has made them complex and heterogeneous.
Thus, they confront various critical threats such as malicious
activities, network intruders and cyber criminals. Identifying
and preventing these detrimental cyber activities have high
priority these days [1]. Analyzing and monitoring network traffic
to identify such malicious actions in large-scale networks are
crucial tasks, and ideally should be carried out automatically
with little supervision by network administrators [2]. Anomaly
detection is a data analysis task where the goal is to detect
patterns deviating greatly from normal data. It is suitable for
automatically identifying illegal, malicious activities and other
forms of network abuse from the normal behaviors of network
systems [3], [4]. Many machine learning algorithms have been
employed for developing anomaly detection models [1], [2],
[3]. However, several issues, such as the high dimension and
complex types of network data, the lack of labelled anomalous
traffic, and the rapid evolution of intrusion methods, make
network anomaly detection a challenging task. In this work,
we aim to cope with these issues by proposing latent
representation models which compress normal data into a specific
region of a latent feature space. This is expected to facilitate
modelling of normal data.
Manuscript received December 22, 2017; revised March 13, 2018. This
work is funded by Vietnam International Education Development (VIED) and
by agreement with the Irish Universities Association.
VL. Cao is with the School of Computer Science, University College
Dublin, Dublin, Ireland (e-mail: loi.cao@ucdconnect.ie).
J. McDermott and M. Nicolau are with University College Dublin, Dublin,
Ireland (e-mail: james.mcdermott2@ucd.ie and miguel.nicolau@ucd.ie).
As stated, one of the major issues is that labelled anomalous
data tends not to be available for constructing network anomaly
detection models [3]. Collecting anomalies is extremely dif-
ficult due to privacy and security concerns of computer net-
works, and the shortage of intrusion network traffic and events
in host logs [5], [6]. Network administrators tend to avoid
divulging data that could compromise the privacy of their
clients or privileged information of their networks. Labeling a
huge volume of anomalous data covering all possible kinds of
attacks from a real-world network would be a challenging and
time-consuming task. Moreover, malicious actions or intrusive
methods are evolving over time. Thus, it may require a
significant amount of time to gather and label these data after
the awareness of the detailed information and behavior of new
attacks becomes available. Furthermore, new anomalies, such
as zero-day vulnerabilities, often cause serious damage to net-
work systems. Thus anomaly detection models are required to
cope with new anomalous actions efficiently. Most supervised
learning algorithms using knowledge of previous anomalies
are unable to detect novelties [1]. These issues strongly suggest
that the training process should be as independent as possible
from the availability of anomalous data, and anomaly detection
models should be able to respond in a flexible and timely way
to any new anomalous actions.
However, the absence of anomalies implies the crucial issue
that no validation set is available for estimating hyperparam-
eters. Most well-known anomaly detection algorithms, such
as one-class Support Vector Machine (OCSVM) [7] or Local
Outlier Factor (LOF) [8], are highly dependent on the choice of
parameters [8], [9] (more details will be discussed in Sections II
and III). Supposing a small proportion of anomalies are
available for estimating parameters, this may damage the per-
formance of anomaly detection models since new, completely
different anomalies may appear in the future. Therefore, it
is desirable that network anomaly detection models should
provide good predictions on unseen data over a wide range
of parameter settings, and have the ability to detect any new
forms of anomalies instantly as they appear.
2. 2 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX
The high dimension and complexity of network data is
another challenge to network anomaly detection. Network
traffic is typically described by a huge number of features,
such as in CISCO NetFlow data, and in different data types,
such as hierarchies (IP addresses), categories (protocols and
services) or continuous attributes [3], [10]. Anomaly detection
techniques often require some preprocessing on input data,
which may result in a higher-dimensional and sparser version
of the data. The curse of dimensionality is a problem for
anomaly detection algorithms [11]. This leads to a high pro-
portion of irrelevant features effectively producing noise that
conceals true anomalies in network data. If enough subspaces
that contain a subset of features are given, at least one
subspace (mostly relevant features) can be found in which
anomalies appear far from normal data. However, the search
for such subspaces is systematically difficult in high dimension
since the number of subspaces increases exponentially with
the dimensionality, which is called the exponential search
space problem. The curse of dimensionality also results in
concentration of distances. The relative difference between
the pairwise distance of any two datapoints and that of others
vanishes with increasing dimensionality. This is a challenge
to distance-based anomaly detection algorithms. Therefore,
network anomaly detection algorithms are required to deal
with high-dimensional and sparse data (see footnote 1), by discovering more
robust and relevant features.
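The concentration of distances described above can be illustrated with a small numerical sketch. It assumes synthetic points drawn uniformly from the unit hypercube, which is not one of the network datasets used in this paper; the function name `relative_contrast` and the sample sizes are hypothetical choices made only for illustration.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniform random points in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast between the farthest and nearest neighbor shrinks as the
# dimensionality grows, which is what hurts distance-based detectors.
for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(dim))
```

Under these assumptions the printed contrast should decrease steadily with the dimensionality, so the nearest and farthest neighbors of a query become hard to tell apart.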
Unsupervised learning techniques, such as Support Vector
Data Description (SVDD), OCSVM and LOF, have been
widely used for anomaly detection [3]. These techniques
have successfully addressed the task of modeling normal
data without any assumption about its underlying distribu-
tion. LOF [8] is an advanced technique for high-dimensional
anomaly detection, which uses the local density deviation of a
given datapoint from its neighbors as an anomaly score. When
LOF is trained on only normal data, it can be used as a one-
class classifier. Recently, Kernel Density Estimation (KDE)
has been employed for building anomaly detection models,
and proven to efficiently model normal data with unknown
underlying distributions [12], [13]. In practice however, these
anomaly detection algorithms have some drawbacks: less
generalization ability in high dimension due to the curse of
dimensionality phenomenon [11], [14], and the difficulty of
tuning hyperparameters. These algorithms are non-parametric
methods, thus their query time is potentially high (more details
in Sections II and III).
Autoencoders (AEs) [15], [16] are a neural network architecture
which has emerged as a suitable approach to anomaly
detection [5], [17], [18], [19] and as building blocks in deep
learning [20], [21], [22] in recent years. An AE is a feed-
forward neural network which attempts to reconstruct the
original input data at the output layer. The middle hidden
layer, sometimes called the bottleneck layer, like a nonlin-
ear PCA, compresses the redundancies while preserving and
differentiating non-redundant information in the input [17].
Footnote 1: A dataset in which the majority of elements are zero is considered
sparse. Sparsity is the ratio of the number of zero entries to the total
number of entries in a dataset, and it lies in the range [0, 1]. In
this paper, a dataset with a sparsity above 0.5 is regarded as sparse.
In the anomaly detection context, an AE trained on normal
data will behave well on normal instances and will result
in small reconstruction errors (REs), but poorly reconstruct
anomalies giving large REs. Thus, RE is commonly used
as a measure of anomaly score. Alternatively, the middle
hidden layer of a trained AE can be used as a new feature
representation (called a latent representation) for improving
the performance of density-based anomaly detection [13] or
anomaly detection based on self-organizing maps [23]. The
central idea is that the latent representation, which is
lower-dimensional and captures normal behavior more robustly,
would help simple classifiers to identify anomalies. However,
the normal data is allowed to be freely distributed in the latent
feature space. The AE encoder could learn to map points from
the normal class into very different regions of the latent feature
space. Thus, the distribution of normal data in the latent feature
space may have an arbitrary shape which may not encourage
the stability of anomaly detection algorithms.
In order to overcome the limitations of the well-known
anomaly detection algorithms, we aim to find a new data
representation for facilitating simple anomaly detection algorithms.
The new representation should have the following
characteristics: it is lower-dimensional; it captures the
structure of normal data in a straightforward way; normal data
takes a similar shape in the new representation for different
input distributions; and normal data is distributed in a small
region of the feature space, while anomalies are expected to
appear in the rest of the space. This will potentially improve
the performance
of anomaly detection algorithms, and may make them less
sensitive to parameter settings. Our approach is to develop two
AEs, a classical AE and a Variational Autoencoder (VAE),
for constructing such a data representation by introducing
some constraints on the distribution of normal data in the
bottleneck layer. The new regularizers will encourage these
AEs to learn to represent latent data in a more meaningful
way - training data (which is assumed to be normal) appears
close together, and is distributed in a specific region in the
latent feature space. The bottleneck layers of these trained
AEs will then be used as the new data representation. Fig. 1
gives an example of data representation in the original space
(a), in the latent feature space of AEs (b), and in the latent
feature space of our models (c). The normal data shown in
Fig. 1(b) is closer together than that in Fig. 1(a), but has an
arbitrary shape. In Fig. 1(c), the normal data is constrained to
be distributed in a compact shape close to the origin. A number
of one-class classification algorithms are then employed to
capture the region representing normal behavior in the latent
feature space, and identify any datapoint not belonging to
this region as anomalies. More details will be presented in
Section IV.
The remainder of the paper is organized as follows. In
Sections II and III, we briefly describe several anomaly detection
algorithms, and highlight some related work in anomaly
detection. Our methods are presented in Section IV. This is
followed by Section V showing the evaluation and discussion
of our models. Section VI draws some conclusions and sug-
gests future work.
3. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 3
[Figure 1: three scatter plots of Normal and Anomaly points. (a) axes $x_0$ and $x_1$; (b) axes $z_0$ and $z_1$; (c) axes $z_0$ and $z_1$.]
Fig. 1. Illustrations of data in the original feature space (a), the latent feature
space of AEs (b), and the latent feature space of our models (c).
II. MATHEMATICS OF ONE-CLASS CLASSIFICATION
ALGORITHMS
This section briefly describes the anomaly detection algorithms
used in this paper: Centroid, Mean Distance, KDE, LOF and
OCSVM, as well as autoencoders.
A. Anomaly detection algorithms
Centroid (CEN): This is a parametric method which uses
a single Gaussian to model training data. The distance (i.e.
radius) from the centroid (the origin) to an observation reflects
the degree of abnormality of the observation. A larger value
implies a higher probability that the datapoint is an anomaly.
By imposing a threshold on the distance, a query datapoint
can be classified as either normal or an anomaly. This method
has no hyperparameters, and works under the assumption that
the training data has a Gaussian distribution.
Mean Distance (MDIS): The mean of the Euclidean distances
from a datapoint to the normal training set can be used as
an anomaly score. A datapoint whose mean distance is above
a chosen threshold is classified as an anomaly. MDIS has no
hyperparameters, and is a non-parametric method.
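The two distance-based scores above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the synthetic data, the function names `cen_score` and `mdis_score`, and the 95th-percentile threshold rule are assumptions, not settings taken from this paper.

```python
import numpy as np

def cen_score(query, centroid=None):
    """CEN: distance from each query point to the centroid (origin by default)."""
    query = np.atleast_2d(query)
    if centroid is None:
        centroid = np.zeros(query.shape[1])
    return np.linalg.norm(query - centroid, axis=1)

def mdis_score(query, train):
    """MDIS: mean Euclidean distance from each query point to all training points."""
    query = np.atleast_2d(query)
    diffs = query[:, None, :] - train[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 0.1, size=(200, 2))       # synthetic "normal" data near the origin
queries = np.array([[0.05, -0.02], [1.5, 1.5]])   # one typical point, one distant point

# Assumed threshold rule: 95th percentile of the training scores.
threshold = np.quantile(cen_score(train), 0.95)
flags = cen_score(queries) > threshold            # True means "anomaly"
mdis = mdis_score(queries, train)
print(flags, mdis)
```

Note that MDIS needs the whole training set at query time, whereas CEN only needs the centroid, which reflects the non-parametric versus parametric distinction made above.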
Kernel Density Estimation (KDE): KDE is used for estimat-
ing the probability density function of a sample in data [24].
KDE can be used for constructing an anomaly detection model
as presented in [12]. However, the main drawback of the model
is its computational cost at querying stage, especially on large
datasets. The performance in terms of classification accuracy
of KDE-based classifiers will depend on the choice of the
bandwidth h of a kernel function [12].
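As a rough illustration of how a KDE-based score depends on the bandwidth $h$, the sketch below evaluates a Gaussian kernel density estimate directly in NumPy. It is not the model of [12]; the isotropic Gaussian kernel, the function name `kde_log_density` and the synthetic data are assumptions.

```python
import numpy as np

def kde_log_density(query, train, h):
    """Log-density of each query point under a Gaussian KDE with bandwidth h.
    Each query touches every training point, hence the query-time cost."""
    query = np.atleast_2d(query)
    n, d = train.shape
    sq_dists = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    log_kernels = -0.5 * sq_dists / h**2 - 0.5 * d * np.log(2 * np.pi * h**2)
    # Stable log of the summed kernel values (log-sum-exp), then divide by n.
    m = log_kernels.max(axis=1, keepdims=True)
    log_sum = m + np.log(np.exp(log_kernels - m).sum(axis=1, keepdims=True))
    return log_sum.ravel() - np.log(n)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(300, 2))     # synthetic normal data
queries = np.array([[0.0, 0.0], [6.0, 6.0]])    # typical point, distant point

# The negative log-density can serve as an anomaly score.
for h in (0.05, 0.5, 5.0):
    print(h, -kde_log_density(queries, train, h))
```

Running this for several values of `h` shows how strongly the gap between the two scores depends on the bandwidth.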
Local Outlier Factor (LOF): LOF [8] considers the data-
points that have a considerably lower local density than their
neighbors as anomalies. It estimates a density deviation score,
called local outlier factor, of a given datapoint with respect to
its neighbors. The larger the LOF score a given datapoint
has, the higher the probability the datapoint is anomalous.
The algorithm has shown its power on network anomaly
detection [25]. In practice however, it has some limitations
when dealing with high-dimensional data [2], and the choice
of the number of neighbors k is still an open question.
One-class Support Vector Machine (OCSVM): OCSVM [7]
first maps the normal data into a feature space via a kernel
function, and searches for a hyperplane with maximum margin
between the region containing most of normal data (normal
region) and the origin in the feature space. The idea behind this
is to allocate the region encompassing the origin for anomalies
to appear. That is to say, the OCSVM decision function returns
a positive value in the normal region far from the origin, and
a negative value in the anomaly region near the origin.
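For readers who want to experiment with LOF and OCSVM as one-class classifiers, a minimal sketch using scikit-learn follows. The use of scikit-learn, the synthetic data and the particular values of `nu`, `gamma` and `n_neighbors` are assumptions made for illustration; they are not the settings used in this paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(300, 5))         # synthetic normal data only
test_normal = rng.normal(0.0, 1.0, size=(50, 5))
test_anomaly = rng.normal(4.0, 1.0, size=(50, 5))   # shifted cluster as stand-in anomalies

# Both models are fitted on normal data only (one-class setting).
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(train)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)

# decision_function is positive for inliers and negative for outliers.
results = {}
for name, model in (("OCSVM", ocsvm), ("LOF", lof)):
    results[name] = (model.decision_function(test_normal).mean(),
                     model.decision_function(test_anomaly).mean())
    print(name, results[name])
```

Changing `nu`, `gamma` or `n_neighbors` in this sketch is a quick way to see the parameter sensitivity discussed in the text.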
B. Autoencoder
An autoencoder [15], [16] is a neural network which con-
sists of two parts: encoder and decoder as shown in Fig. 2(a).
The encoder is defined as a feature extractor that allows the
explicit representation of an input $x$ in a feature space. Let
$f_\theta$ denote the encoder, and $X = \{x^1, x^2, \ldots, x^n\}$ be a dataset.
The encoder $f_\theta$ will map the input $x^i \in X$ into a latent vector
$z^i = f_\theta(x^i)$, where $z^i$ is the code or latent representation. The
decoder $g_\theta$ will map the latent representation $z^i$ back into the
input space, which forms a reconstruction $\hat{x}^i = g_\theta(z^i)$. The
encoder and decoder are commonly represented as single-layer
neural networks in the form of non-linear functions of affine
mappings as follows:
$$f_\theta(x) = s_f(Wx + b) \tag{1}$$
$$g_\theta(z) = s_g(W'z + b') \tag{2}$$
where $W$ and $W'$ are the weight matrices of the encoder and
decoder, and $b$ and $b'$ are the bias vectors of the encoder and
decoder. $s_f$ and $s_g$ are the activation functions of the encoder
and decoder, such as a logistic sigmoid or hyperbolic tangent
non-linear function, or a linear identity function.
Autoencoders learn to minimize the loss function in (3)
with respect to the parameters $\theta = \{W, W', b, b'\}$, using a
learning algorithm such as Stochastic Gradient Descent (SGD)
with back-propagation. The reconstruction loss function over
training instances can be written as:
$$\mathcal{L}_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} l(x^i, \hat{x}^i) = \frac{1}{n}\sum_{i=1}^{n} l\big(x^i, g_\theta(f_\theta(x^i))\big) \tag{3}$$
where $l(x^i, \hat{x}^i)$ is the discrepancy between the input $x^i$ and
its reconstruction $\hat{x}^i$. The choice of the reconstruction loss
depends largely on the appropriate distributional assumptions
on given data. The mean squared error (MSE; see footnote 2) is commonly
used for real-valued data, whereas a cross-entropy loss (see footnote 3)
can be used for binary data. By compressing input data into a lower
dimensional space, the classical autoencoder avoids simply
learning the identity, and removes redundant information [17].
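The single-layer encoder, decoder and reconstruction loss of (1)-(3) can be written out as a short NumPy sketch. The choice of tanh for the encoder and sigmoid for the decoder, the layer sizes and the random untrained weights are hypothetical; no training step is shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Eq. (1), with tanh chosen as the encoder activation."""
    return np.tanh(x @ W.T + b)

def decode(z, W_prime, b_prime):
    """Eq. (2), with a logistic sigmoid chosen as the decoder activation."""
    return sigmoid(z @ W_prime.T + b_prime)

def mse_loss(x, x_hat):
    """Eq. (3) with l taken as the squared Euclidean distance."""
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
n, d, k = 8, 6, 2                        # batch size, input dim, bottleneck dim (hypothetical)
x = rng.random((n, d))                   # inputs scaled to [0, 1]
W, b = rng.normal(0, 0.1, (k, d)), np.zeros(k)
W_prime, b_prime = rng.normal(0, 0.1, (d, k)), np.zeros(d)

z = encode(x, W, b)                      # latent representation
x_hat = decode(z, W_prime, b_prime)      # reconstruction
print(z.shape, x_hat.shape, mse_loss(x, x_hat))
```

In an anomaly detection setting, the per-point value of the squared reconstruction error would be the RE score, and `z` would be the latent representation passed to a one-class classifier.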
Denoising autoencoders (DAEs) [26], [27] are regularized
autoencoders that are trained to reconstruct the original input
from a corrupted version of the input. This will allow DAEs
to capture the structure of the input distribution, and again
prevent them from learning the identity. The loss function of
AEs in (3) is rewritten for DAEs as follows:
$$\mathcal{L}_{DAE}(\theta; x) = \sum_{i=1}^{n} E_{p(\tilde{x}|x^i)}\Big[ l\big(x^i, g_\theta(f_\theta(\tilde{x}))\big) \Big] \tag{4}$$
where $\tilde{x}$ is the corrupted version of $x^i$ drawn from $p(\tilde{x}|x^i)$.
$E_{p(\tilde{x}|x^i)}$ is the expectation of a reconstruction loss at $x^i$ over
a number of samples $\tilde{x}$ drawn from $p(\tilde{x}|x^i)$. This is because the
corruption process is performed stochastically on the original
input each time a point $x^i$ is considered. There are many ways
to corrupt the input, such as Gaussian noise or salt and pepper
noise, but randomly masking features of the input to zero is
the most commonly used. This loss function can be optimized
by SGD in the same way as the AE loss function.
Footnote 2: $\mathcal{L}_{AE}(\theta; x) = \frac{1}{n}\sum_{i=1}^{n} \| x^i - \hat{x}^i \|^2$
Footnote 3: $\mathcal{L}_{AE}(\theta; x) = -\frac{1}{n}\sum_{i=1}^{n} \big[ x^i \log(\hat{x}^i) + (1 - x^i) \log(1 - \hat{x}^i) \big]$
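The masking corruption mentioned above can be sketched as follows. The function name `mask_corrupt` and the drop probability of 0.3 are hypothetical; the sketch only shows how a corrupted input $\tilde{x}$ might be drawn, not the DAE training loop.

```python
import numpy as np

def mask_corrupt(x, drop_prob, rng):
    """Set each feature to zero independently with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

rng = np.random.default_rng(0)
x = np.ones((4, 10))
# A fresh corruption is drawn every time a point is visited during training;
# the DAE is then trained to map x_tilde back to the clean x.
x_tilde = mask_corrupt(x, drop_prob=0.3, rng=rng)
print(x_tilde)
```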
C. Variational Autoencoder
The Variational Autoencoder (VAE) [28] is a neural network
that consists of two parts: a probabilistic encoder representing
the approximate posterior $q_\phi(z|x)$ to the intractable true posterior
$p_\theta(z|x)$, and a probabilistic decoder that refers to the
generative model $p_\theta(x|z)$ as shown in Fig. 2(b). The objective
of VAE is to optimize the variational lower bound on the
marginal likelihood of data w.r.t. variational parameters $\phi$ and
generative parameters $\theta$. The marginal log-likelihood is a sum
over the marginal log-likelihoods of the individual datapoints,
$\log p_\theta(x^1, \ldots, x^n) = \sum_{i=1}^{n} \log p_\theta(x^i)$,
where each intractable term $\log p_\theta(x^i)$ can be written as:
$$\log p_\theta(x^i) = D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z|x^i)\big) + \mathcal{L}(\theta, \phi; x^i) \tag{5}$$
The term $\mathcal{L}(\theta, \phi; x^i)$ is the lower bound on the marginal likelihood
of datapoint $x^i$ since the first term, the Kullback-Leibler
divergence (KL-divergence) of the approximate posterior from
the true posterior, is non-negative. The lower bound can be
written as follows:
$$\mathcal{L}(\theta, \phi; x^i) = E_{q_\phi(z|x^i)}\big[ -\log q_\phi(z|x^i) + \log p_\theta(x^i, z) \big]$$
$$= -D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big) + E_{q_\phi(z|x^i)}\big[ \log p_\theta(x^i|z) \big] \tag{6}$$
where $p_\theta(x^i|z)$ is the likelihood of $x^i$ given the latent variable
$z$, and $p_\theta(z)$ is the prior over latent variables.
However, the second term in (6) requires sampling a random latent
variable $z$ from the approximate posterior $q_\phi(z|x)$.
This is problematic since back-propagation cannot flow
through a random node $z$. When $q_\phi(z|x)$ is restricted to
some kinds of parametric distributions, e.g. Gaussian, the
random variable $z$ can be reparameterized as a deterministic
function $z = g_\phi(\epsilon, x)$ where $\epsilon$ is an auxiliary variable with
independent marginal $p(\epsilon)$. This yields a lower-variance lower
bound estimator called SGVB (Stochastic Gradient Variational
Bayes):
$$\tilde{\mathcal{L}}(\theta, \phi; x^i) = -D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x^i|z^{i,l}) \tag{7}$$
where $z^{i,l} = g_\phi(\epsilon^{i,l}, x^i)$ and $\epsilon^{l} \sim p(\epsilon)$. In (7), the
KL-divergence term forces $q_\phi(z|x)$ to be as close as possible to
$p_\theta(z)$ and works as a regularizer, whereas the second term is
an expected negative reconstruction error.
For analytically integrating the KL-divergence in (7), the
true posterior $p_\theta(z|x)$ is assumed to be an approximate Gaussian
with approximately diagonal covariance. Let the prior
$p_\theta(z) = N(0, I)$, and the approximate posterior is multivariate
Gaussian with a diagonal covariance structure
$q_\phi(z|x^i) = N(\mu^i, (\sigma^i)^2)$, where $\mu^i$ and $\sigma^i$ are mean and s.d. evaluated
at datapoint $i$. Let $\mu^i_j$ and $\sigma^i_j$ denote the $j$-th element of $\mu^i$
and $\sigma^i$ respectively, where $J$ is the dimensionality of $z$. The
KL-divergence in (7) is written as follows:
$$D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big) = D_{KL}\big(N(\mu^i, (\sigma^i)^2) \,\|\, N(0, I)\big)$$
$$= \frac{1}{2}\sum_{j=1}^{J}\Big( (\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log\big((\sigma^i_j)^2\big) \Big) \tag{8}$$
[Figure 2: three panels. (a) Encoder, Bottleneck, Decoder. (b) Encoder, Bottleneck with $\mu$, $\sigma$ and $z = \mu + \sigma \cdot \epsilon$, $\epsilon \sim N(0, 1)$, Decoder. (c) Latent representation feeding One-class Classifiers.]
Fig. 2. The architectures of AEs (a), VAEs (b), and the hybrids of the latent
representation models and one-class classifiers (c).
Taking $D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big)$ in (7), we get the objective
function of VAE at datapoint $i$ as follows:
$$\mathcal{L}(\theta, \phi; x^i) \simeq -\frac{1}{2}\sum_{j=1}^{J}\Big( (\sigma^i_j)^2 + (\mu^i_j)^2 - 1 - \log\big((\sigma^i_j)^2\big) \Big) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x^i|z^{i,l}) \tag{9}$$
where $z^{i,l} = \mu^i + \sigma^i \odot \epsilon^l$ and $\epsilon^l \sim N(0, I)$. $L$ is the number of
samples per datapoint. In practice, it can be set to 1 as in [28].
When optimizing (maximizing) the objective function at (9)
by Stochastic Gradient Ascent, VAEs learn the recognition
model parameters $\phi$ jointly with the generative model parameters
$\theta$. Given datapoint $x^i$, the probabilistic encoder outputs
the parameters of the approximate posterior at this datapoint,
$\mu^i$ and $\sigma^i$. An actual value $z^{i,l} \sim q_\phi(z|x^i)$ obtained through
$z^{i,l} = \mu^i + \sigma^i \odot \epsilon^l$ is the input for the probabilistic decoder. The
output of the decoder is the reconstruction $\hat{x}^i$. The distribution
of the encoder output is Gaussian, whereas that of the decoder
depends on the type of data (Gaussian for real-value data or
Bernoulli for binary).
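The closed-form KL term in (8) and the reparameterized sampling step can be checked numerically with a short NumPy sketch. The function names and the example values of $\mu$ and $\sigma$ are hypothetical, and the encoder and decoder networks themselves are omitted.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Eq. (8): KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per row."""
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var), axis=1)

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); one sample per datapoint (L = 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, -2.0]])      # encoder means for two datapoints
sigma = np.array([[1.0, 1.0], [0.5, 2.0]])    # encoder standard deviations

kl = kl_to_standard_normal(mu, sigma)
z = reparameterize(mu, sigma, rng)            # would be fed to the decoder
print(kl, z.shape)
```

The first datapoint already matches the prior, so its KL term vanishes; the second is penalized for both its shifted mean and its non-unit variances.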
III. RELATED WORK
In this section, we discuss recent trends and some state-of-
the-art anomaly detection algorithms. This includes Support
Vector Machines [7], [29], [30], and autoencoder-based meth-
ods [5], [14], [17], [18], [19], [31].
Schölkopf et al. [7] and Campbell et al. [30] presented
hyperplane-based one-class SVM approaches as already dis-
cussed. In [7], their aim is to map the input data into the
feature space via a kernel function, and then find a hyperplane
with a maximum margin between the region containing normal
data and the origin in the feature space. The half space
containing the origin is identified as the anomalous region.
The trade-off between the two objectives, maximizing the
margin and minimizing the number of target vectors falling
into the anomalous region, is controlled by the outlier fraction
$\nu \in (0, 1)$. The larger the value of $\nu$, the more normal vectors
are rejected as outliers and the more normal vectors become
support vectors. When $\nu$ approaches 1 almost all normal
vectors become support vectors. The method was evaluated
on the US postal service database of handwritten digits, and
the results show that the classifier performed well. However,
how to choose values for the hyperparameter $\nu$ and kernel
parameters such as gamma (related to bandwidth $h$ in KDE)
is still an open question. Instead of allocating the origin region
for anomalies, Campbell et al. [30] proposed a model that
learns to capture the region containing normal instances in
feature space. They attempted to find a hyperplane with respect
to the center of the distribution of normal data, and anomalies
were assumed to appear on the other side. Linear programming
techniques are employed instead of the quadratic programming
in Schölkopf’s approach, which allows their model to learn from large
datasets rapidly.
Tax and Duin [29] proposed a method called Support Vector
Data Description for anomaly detection. In this approach,
normal data is again first mapped into a feature space corre-
sponding to a kernel function. It then finds a hypersphere with
minimum radius which encompasses almost all normal vectors
in the feature space. Any query datapoints lying inside the
hypersphere are considered as normal and others as anomalies.
In order to achieve good classification accuracy, it is desir-
able to reduce the volume of the hypersphere by rejecting
some fraction of training data (the outlier fraction known
as parameter C) when training this model. This illustrates
a theme present in all one-class classification research, the
trade-off between false positive and false negative rates. They
introduced different kernel functions to SVDD that make the
method more flexible, and the Gaussian kernel was found to be
the most suitable for many datasets. When using the Gaussian
kernel, the method is comparable to OCSVM [7]. However,
the technique requires a large number of normal examples,
and extra outlier objects for training in order to improve the
classification accuracy [29]. Both SVDD and OCSVM have
demonstrated their effectiveness on anomaly detection, but
they are limited in their ability to model large-scale and
high-dimensional data due to their time and space complexity [32].
The approach of using stand-alone AEs to build anomaly de-
tection systems was proposed in [5], [18], [19], in which AEs
act as either anomaly detection methods or feature reduction
techniques. Hawkins et al. [18] trained an AE (also known as a
replicator neural network) with three narrow hidden layers on
normal data. Its RE was used as an “outlier score”: an outlier
score above a predetermined threshold indicated an anomaly.
A step-wise activation function was used for the neurons in the
middle hidden layer, which mapped input data into a number
of possible clusters. Each of these clusters was associated with
an active state of these neurons. These neurons were active
with specific steps on a particular class of data (normal or
anomaly). Thus, the labels of these clusters can be used as
an alternative approach for indicating anomalies. The model
was evaluated on the Wisconsin Breast Cancer (WBC) and
the KDD’99 datasets, and both of these models (RE-based
and cluster-based) produced high accuracy. Furthermore, Fiore
et al. [5] constructed an AE using Discriminative Restricted
Boltzmann Machines to test the hypothesis that there is a
deep similarity among normal behaviors. They expected that
their model can describe all the characteristics of normal
traffic when comparing it against unseen anomalous traffic.
Their experiments involving real-world network traces and
the KDD’99 datasets confirmed that its performance suffered
when testing in a network greatly different from that where
training data was collected. In contrast, Sakurada et al. [19]
employed an AE as a nonlinear feature reduction technique for
anomaly detection. They attempted to clarify the properties
of AEs by comparing a classical AE and a DAE to linear
PCA and Kernel PCA. These techniques were evaluated on
an artificial dataset and on spacecraft telemetry data. They
concluded that DAEs not only outperform linear PCA and
Kernel PCA in terms of accuracy, but also can avoid the high
computation costs of kernel PCA.
Hybrid approaches or extensions of AEs have been recently
proposed for anomaly detection [14], [31]. Veeramachaneni
et al. [31] proposed an ensemble learner to combine three
single one-class classifiers: AE-based, density-based, and ma-
trix decomposition-based techniques. They also used a human
expert to provide ongoing correct labels from which the
algorithms can learn. The models were tested on a large
network log file dataset, and achieved promising results. Erfani
et al. [14] introduced a hybrid of a Deep Belief Network
(DBN) and one-class classifiers (OCCs), such as OCSVM and SVDD, for solving
the problem of high-dimensional anomaly detection. The DBN
was pre-trained in the greedy layer-wise fashion, that is unsu-
pervised training of each Restricted Boltzmann Machine one-
by-one. OCSVM [7] and SVDD [29] were then built on top of
the pre-trained DBN. This structure takes advantage of the high
classification accuracy of these OCCs and the nonlinear
feature reduction of DBNs. The model was evaluated
on eight high-dimensional UCI datasets. The results showed
that the performance of the hybrid models was comparable to
AEs and better than stand-alone OCSVM and SVDD, and the
training and testing times improved significantly.
IV. PROPOSED MODEL
We aim to find a new data representation that facilitates
simple anomaly detection algorithms. This section clarifies
how to construct the data representation by introducing new
regularizers to an AE and a VAE. The new regularizers
together with reconstruction loss will help these AEs to give
a robust representation of normal behavior. The regularizers
will encourage the encoders of these AEs to condense normal
data as close together as possible at a particular region in the
latent feature space, while reconstruction loss promotes these
AEs to keep normal points from overlapping each other. In
order to separate the normal region from anomalies, normal
points will be “pushed” towards the origin at the non-saturating
area of the bottleneck unit outputs by the regularizers. That
is, each coordinate (given by the output of the bottleneck unit
activation) of an encoded point will tend to be pushed closer
to the non-saturating value (zero) of the activation function.
Thus, a trained AE on normal data can keep normal datapoints
close to the origin, whereas anomalous datapoints, if they
differ from normal datapoints, will therefore tend to differ
greatly, and appear in other regions. A number of one-class
classifiers are employed for evaluating the proposed models.
Fig. 2(c) illustrates the hybrid of the data representation
models and one-class classifiers. More details are shown in
Subsections IV-A and IV-B.
Our models are very different from other common
regularized AEs, including Sparse AEs and Contractive AEs.
Sparse AEs attempt to construct a sparse representation in
an overcomplete setting in which a few of the outputs of
the hidden unit activations can vary at a time, and others
are set to a saturating value [33]. Thus, the latent data is
penalized close to the saturating value at zero [34], or the
hidden bias vectors are controlled [35]. Contractive AEs seek
a latent representation that is as insensitive as possible w.r.t. the
variations in the input data [36]. Thus, the outputs of the hidden
units are constrained to be close to their marginal values (e.g.
0 or 1 in sigmoid function).
A. Shrink Autoencoder
A new regularizer is added to the loss function of an AE
which encourages the AE to construct a representation of
normal data which will be easy for one-class classification
algorithms. The regularizer is designed to penalize normal
datapoints whose vectors in the latent space are of large
magnitude, that is it will restrict the normal data to lie close
to the origin. Hence, this is called a shrink regularizer, and
the AE is named Shrink AE (SAE). The loss function in (3)
can be redefined for this situation as follows:
$$\mathcal{L}_{SAE}(\theta; x^i, z^i) = \frac{1}{n}\sum_{i=1}^{n} l(x^i, \hat{x}^i) + \lambda \frac{1}{n}\sum_{i=1}^{n} \| z^i \|^2 \tag{10}$$
where $\hat{x}^i$ and $z^i$ are the reconstruction and the latent vector
of the observation $x^i$ respectively. The first term is the
reconstruction error, $\frac{1}{n}\sum_{i=1}^{n} \| x^i - \hat{x}^i \|^2$, and the second term is
the shrink regularizer. The parameter $\lambda$ controls the trade-off
between the two terms in the loss function.
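The SAE loss in (10) can be sketched as a plain NumPy function. The function name `sae_loss`, the toy arrays and the value of the trade-off weight are hypothetical; in practice `x_hat` and `z` would come from the decoder and encoder of the network being trained.

```python
import numpy as np

def sae_loss(x, x_hat, z, lam):
    """Eq. (10): mean squared reconstruction error plus lam times the
    mean squared norm of the latent vectors (the shrink regularizer)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    shrink = np.mean(np.sum(z ** 2, axis=1))
    return recon + lam * shrink

x = np.array([[1.0, 2.0], [3.0, 4.0]])        # inputs
x_hat = np.array([[1.0, 2.0], [3.0, 3.0]])    # reconstructions
z = np.array([[0.1, 0.0], [0.0, 0.3]])        # latent vectors
loss = sae_loss(x, x_hat, z, lam=10.0)
print(loss)
```

A larger trade-off weight pulls the latent vectors of normal data more strongly towards the origin, at the cost of a looser reconstruction.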
B. Dirac delta Variational Autoencoder
VAEs attempt to encode data so that it is distributed as a
standard Gaussian in the latent space. Thus, normal data will
reside in a large area centered at the origin. Our strategy is
to compress normal data into a smaller area near the origin.
Therefore, we redesign the KL-divergence at (8) by forcing
the approximate posterior $q_\phi(z|x)$ to be as close as possible
to a new prior $p_\theta(z)$ with very small standard deviation.
Let us recall the KL-divergence between two multivariate
Gaussian distributions in $\mathbb{R}^n$, $P_1 = N(\mu_1, \Sigma_1)$ and
$P_2 = N(\mu_2, \Sigma_2)$, defined in [37] as:
$$D_{KL}(P_1 \,\|\, P_2) = \frac{1}{2}\left[ \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) - n + \log\left( \frac{\det(\Sigma_2)}{\det(\Sigma_1)} \right) \right] \tag{11}$$
Let $\mu^i$ and $\Sigma^i$ denote the variational mean and the covariance
matrix evaluated at datapoint $i$, $q_\phi(z|x^i) = N(\mu^i, \Sigma^i)$, and $J$
be the dimensionality of $z$. Consider a constant $\alpha$ ($\alpha \ll 1.0$)
to be the variance of the prior probability, $p_\theta(z) = N(0, \alpha I)$, where $I$
is an identity matrix. Applying these to (11), the KL-divergence
between $q_\phi(z|x^i)$ and $p_\theta(z)$ can be written as follows:
$$D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big) = \frac{1}{2}\left[ \mathrm{tr}\big((\alpha I)^{-1}\Sigma^i\big) + (\mu^i)^T (\alpha I)^{-1} (\mu^i) - J + \log\left( \frac{\det(\alpha I)}{\det(\Sigma^i)} \right) \right] \tag{12}$$
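Taking $I$ and $\alpha$ in (12), we get:
$$D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big) = \frac{1}{2}\left[ \mathrm{tr}(\alpha^{-1}\Sigma^i) + \alpha^{-1}(\mu^i)^T(\mu^i) - J + \log\left( \frac{\alpha^J}{\det(\Sigma^i)} \right) \right]$$
$$= \frac{1}{2\alpha}\Big[ \mathrm{tr}(\Sigma^i) + (\mu^i)^T(\mu^i) - \alpha J + \alpha J \log\alpha - \alpha \log\big(\det(\Sigma^i)\big) \Big] \tag{13}$$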
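Because $\Sigma^i$ is a diagonal matrix of size $J \times J$, $\Sigma^i$ can be
used as a vector of its $J$ diagonal elements. Let $\mu^i_j$ and $(\sigma^i_j)^2$
denote the $j$-th element of $\mu^i$ and $\Sigma^i$ respectively.
Taking $\mathrm{tr}(\Sigma^i)$ and $\det(\Sigma^i)$, we get:
$$D_{KL}\big(q_\phi(z|x^i) \,\|\, p_\theta(z)\big) = \frac{1}{2\alpha}\left[ \sum_{j=1}^{J} (\sigma^i_j)^2 + \sum_{j=1}^{J} (\mu^i_j)^2 - \alpha\sum_{j=1}^{J} 1 + \alpha\sum_{j=1}^{J} \log\alpha - \alpha \log\Big( \prod_{j=1}^{J} (\sigma^i_j)^2 \Big) \right]$$
$$= \frac{1}{2\alpha}\sum_{j=1}^{J}\Big[ (\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\big((\sigma^i_j)^2\big) \Big] \tag{14}$$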
Now we apply the KL-divergence in (14) to (7). The
negative log likelihood loss in (7) is replaced by MSE between
$x^i$ and its reconstruction $\hat{x}^i$ since we will apply our models
only on real-valued datasets. The objective function given
at (7) can be rewritten as follows:
$$\mathcal{L}(\theta, \phi; x^i) \simeq -\frac{1}{n}\sum_{i=1}^{n} \| x^i - \hat{x}^i \|^2 - \frac{1}{2\alpha}\sum_{j=1}^{J}\Big[ (\sigma^i_j)^2 + (\mu^i_j)^2 - \alpha + \alpha\log\alpha - \alpha\log\big((\sigma^i_j)^2\big) \Big] \tag{15}$$
The prior can be seen as a Dirac delta distribution because $\alpha$ is very small. Thus, this VAE is named Dirac delta Variational Autoencoder (DVAE). Maximizing (15) is equivalent to minimizing its KL-divergence and RE components. We introduce a parameter $\lambda$ to control the trade-off between the two components in (15). The objective function can be rewritten in the form of the loss function of DVAE as follows:

$$L_{DVAE}(\theta, \phi; x^i) = \frac{1}{n}\sum_{i=1}^{n}\left\|x^i - \hat{x}^i\right\|^2 + \lambda\,\frac{1}{2\alpha}\sum_{j=1}^{J}\left[(\sigma_j^i)^2 + (\mu_j^i)^2 - \alpha + \alpha\log\alpha - \alpha\log\left((\sigma_j^i)^2\right)\right] \tag{16}$$
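A minimal sketch of how the DVAE loss in (16) might be evaluated for one datapoint is given below. It assumes that the reconstruction term is the squared error averaged over the input features and that the trade-off parameter (called `lam` here) multiplies the KL term; the function name is hypothetical, and the default values $\alpha = 10^{-8}$ and 0.05 are the settings reported for DVAE in Table II.

```python
import math


def dvae_loss(x, x_hat, mu, var, alpha=1e-8, lam=0.05):
    """Return (loss, re, kl) for one datapoint, following the shape of Eq. (16)."""
    # Reconstruction error: squared error averaged over features (assumption).
    re = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # KL term from Eq. (14), summed over the J latent dimensions.
    kl = sum(
        s2 + m * m - alpha + alpha * math.log(alpha) - alpha * math.log(s2)
        for m, s2 in zip(mu, var)
    ) / (2.0 * alpha)
    return re + lam * kl, re, kl


x = [0.5, -0.2]
x_hat = [0.4, -0.1]

# Encoder output that matches the prior exactly: the KL term vanishes.
loss_a, re_a, kl_a = dvae_loss(x, x_hat, mu=[0.0, 0.0], var=[1e-8, 1e-8])

# Moving one latent mean to 1e-4 already adds 0.5 to the KL term.
loss_b, re_b, kl_b = dvae_loss(x, x_hat, mu=[1e-4, 0.0], var=[1e-8, 1e-8])
print(loss_a, kl_a, loss_b, kl_b)
```

The second call illustrates why such a small $\alpha$ acts as a strong pull towards the origin: a shift of only $10^{-4}$ in one latent mean contributes $\frac{(10^{-4})^2}{2\alpha} = 0.5$ to the KL term.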
7. VL. CAO et al.: LEARNING NEURAL REPRESENTATIONS FOR NETWORK ANOMALY DETECTION 7
V. EVALUATION AND DISCUSSION
This section evaluates the SAE and DVAE algorithms in constructing data representations that improve the performance of anomaly detection algorithms. This is demonstrated by the experimental results of five simple one-class classification (OCC) algorithms, LOF, CEN, KDE, MDIS and OCSVM, using the latent representations of SAE and DVAE on fourteen problems. To highlight the strengths of SAE and DVAE, the results are also compared to those from: (1) the stand-alone OCCs (without any AE latent representation), (2) the OCCs using the latent representations of a denoising AE (DAE) and a VAE, and (3) the RE-based OCC. To measure the accuracy of the models, we evaluate the area under the ROC curve (AUC), which is obtained by sweeping over many different thresholds, and create a confusion matrix by choosing a single threshold. A number of experiments and analyses exploring different aspects of the latent representations of SAE and DVAE are carried out as follows:
• Evaluate the effect of dimensionality and sparsity on
the classification accuracy of the OCCs using the latent
representations given by SAE and DVAE.
• Explore the effect on classification accuracy of OCSVM and LOF of their parameters $\nu$, $\gamma$, and $k$. Investigate the distribution of latent vectors on normal and anomaly data.
• Measure the effect of training size on the AUCs and query
time created by SAE-OCCs and DVAE-OCCs.
• Evaluate the AUCs from the OCCs on specific categories
of attack types in NSL-KDD and UNSW-NB15.
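The two accuracy measures mentioned before the list (AUC over all thresholds, and a confusion matrix at one threshold) can be sketched as follows. This is a plain-Python illustration with made-up scores; the paper's own experiments rely on scikit-learn, and the function names here are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that an anomaly (label 1) scores higher than
    a normal point (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_at(scores, labels, threshold):
    """Counts (tp, fp, tn, fn) when scores above the threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    return tp, fp, tn, fn


# Made-up anomaly scores: two normal points followed by two anomalies.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

auc = roc_auc(scores, labels)
counts = confusion_at(scores, labels, threshold=0.5)
print(auc, counts)
```

The pairwise formulation is quadratic in the number of points, which is fine for an illustration but not for the larger test sets; a rank-based implementation would be preferable there.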
A. Experiments
1) Datasets: The experiments are conducted on fourteen
datasets including network problems as shown in Table I. The
eight network datasets are mostly well-known problems in the
domain of network security. Although the main objective is
to cope with the challenges arising in high-dimensional net-
work data, the models are also evaluated on six non-network
datasets from the UCI Machine Learning Repository [38].
This is because we intend to evaluate the performance of
our models on a diversity of data, and expect to emphasize
their strength on high-dimensional network-related datasets.
The normal traffic in CTU13, UNSW-NB15 and NSL-KDD
is considered as normal data, whereas all the attacks are
treated as anomalies. In PenDigits, the digits ‘0’ and ‘2’ are
chosen as the normal and anomalous classes respectively. For
GLASS, window glass is considered as the normal class, and
other classes as the anomalous class. In the other datasets, the
normal and anomalous classes are indicated following [39].
The CTU13 is a publicly available botnet dataset provided
in 2011 [40]. The data covers a wide range of real-world
botnet traffic mixed with normal traffic and background traf-
fic (unlabeled data). The CTU13 consists of thirteen botnet
scenarios, and each of them involves a specific type of
malware. We choose four scenarios in CTU13, and split each
of them into 40% for training (normal traffic) and 60% for
evaluating (normal and botnet traffic) following [41]. We use
most of the 14 features in CTU13 except source/destination
IP addresses. Three categorical features, protocol, sTos and
dTos, are encoded by one-hot-encoding, which results in higher
dimensional versions of these scenarios.
TABLE I
FOURTEEN DATASETS FOR EVALUATING THE PROPOSED MODELS

| Dataset | Dimension$^{4}$ | Training set | Normal test | Anomaly test |
|---|---|---|---|---|
| PageBlocks | 10 | 3930 | 983 | 112 |
| WPBC | 32 | 118 | 30 | 10 |
| PenDigits | 16 | 780 | 363 | 364 |
| GLASS | 9 | 130 | 33 | 11 |
| Shuttle | 9 | 3410 | 11478 | 3022 |
| Arrhythmia | 259 | 189 | 48 | 37 |
| Rbot (CTU13-10) | 38 | 6338 | 9509 | 63812 |
| Murlo (CTU13-8) | 40 | 29128 | 43694 | 3677 |
| Neris (CTU13-9) | 41 | 11986 | 17981 | 110993 |
| Virut (CTU13-13) | 40 | 12775 | 19164 | 24002 |
| Spambase | 57 | 2230 | 558 | 363 |
| UNSW-NB15$^{5}$ | 196 | 56000 | 37000 | 45332 |
| NSL-KDD$^{5}$ | 122 | 67343 | 9711 | 12833 |
| InternetAds | 1558 | 1582 | 396 | 77 |

NSL-KDD is a filtered version of the KDD’99 dataset [42], which was suggested to address the inherent issues mentioned in [43]. Although NSL-KDD still suffers from some of the problems discussed in [44], it is reasonable to use it as an effective benchmark for comparing anomaly detection algorithms in this work, given the shortage of public intrusion data. Each 41-feature record in NSL-KDD is labeled as either normal or a specific attack group in one of the four main categories: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R) and Probe. NSL-KDD consists of two parts, KDDTrain$^{+}$ and KDDTest$^{+}$, which are drawn from different distributions (14 additional types of attacks appear in KDDTest$^{+}$ only). Three categorical features, protocol type, service and flag, are preprocessed by one-hot encoding, which increases the number of features to 122.
UNSW-NB15 has been recently provided and is expected to
address the inherent issues in the KDD’99 dataset and NSL-
KDD [45]. Each record comprising 47 features is labeled
either as realistic normal traffic or one of the nine modern
attack categories: Fuzzers, Analysis, Backdoor, DoS, Exploit,
Generic, Reconnaissance, Shellcode and Worm. The dataset
is decomposed into two sets, UNSW NB15 training-set and
UNSW NB15 testing-set, for training and evaluating. The
categorical attributes, such as protocol, service and state, are
preprocessed by one-hot-encoding which increases the number
of features to 196. The labelled anomalies in the training parts
of NSL-KDD and UNSW-NB15 are discarded.
PenDigits and Shuttle are already partitioned into training
and testing parts, thus we simply delete labelled anomalies
in the training parts to form training sets. For Spambase,
InternetAds, PageBlocks, WPBC, GLASS and Arrhythmia, we
take 80% of normal data for training and 20% of normal and
anomalies for testing. All datasets are normalized into [-1, 1]
since the activation function of the output layer of these AEs
is the tanh function, and missing values are discarded.
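The preprocessing described in this subsection (one-hot encoding of categorical features, then scaling every feature into [-1, 1]) could look like the sketch below. It assumes min-max scaling with bounds estimated on the training rows only, which the text does not state explicitly; the helper names and the toy records are hypothetical.

```python
def one_hot(values):
    """Encode a list of categorical values; returns (rows, categories)."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return rows, categories


def fit_minmax(train_rows):
    """Per-column (min, max) bounds estimated from the training rows."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def scale_to_unit_range(rows, bounds):
    """Map each feature linearly so that the training range becomes [-1, 1]."""
    scaled = []
    for row in rows:
        out = []
        for v, (lo, hi) in zip(row, bounds):
            # Constant columns carry no information; map them to 0.
            out.append(0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0)
        scaled.append(out)
    return scaled


# Toy records: one numeric feature and one categorical feature.
durations = [0.0, 5.0, 10.0, 2.5]
protocols = ["tcp", "udp", "tcp", "icmp"]

encoded, categories = one_hot(protocols)
rows = [[d] + e for d, e in zip(durations, encoded)]
bounds = fit_minmax(rows)
scaled = scale_to_unit_range(rows, bounds)
print(categories, scaled[0])
```

Matching the input range to [-1, 1] is what allows a tanh output layer to reproduce every feature value; test values that fall outside the training range would map slightly outside this interval under the assumption made here.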
$^{4}$ The dimensions of the four CTU13 datasets, UNSW-NB15 and NSL-KDD are those obtained after preprocessing by one-hot encoding.
$^{5}$ The training sets of UNSW-NB15 and NSL-KDD are much larger than those of the other datasets, thus we sample a small proportion (10%) for training.
8. 8 IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. X, XXXX XXXX
2) Parameter Settings: Anomalies are not available during training, so cross-validation cannot be used to tune hyperparameters. This is one of the major difficulties for this task.
We configure the hyperparameters of AEs and OCCs using
common values and rules of thumb, and then confirm that
performance is not sensitive to these values.
OCC Parameters: The Gaussian kernel is used for KDE and OCSVM. The scaling parameter $\gamma$, related to the bandwidth $h$ by $\gamma = \frac{1}{2h^2}$, is set to a default value, $\gamma = \frac{1}{n_f}$ as in [46], where $n_f$ is the number of input features. The trade-off parameter $\nu$ is set to two separate values$^{6}$, 0.1 and 0.5, which are referred to as OCSVM$_{\nu=0.1}$ and OCSVM$_{\nu=0.5}$. In LOF, the number of nearest neighbors $k$ is chosen as 10% of the training size.
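The default OCC settings described above can be derived mechanically from the data shape. The helper below is a hypothetical sketch (the name `occ_defaults` and the returned dictionary are not from the paper); it uses the Spambase shape from Table I as input and rounds the 10% neighbour count to the nearest integer, which is an assumption.

```python
import math


def occ_defaults(n_features, n_train):
    """Rule-of-thumb OCC parameters as described in the text."""
    gamma = 1.0 / n_features                      # kernel scale, gamma = 1 / n_f
    bandwidth = math.sqrt(1.0 / (2.0 * gamma))    # from gamma = 1 / (2 h^2)
    k = max(1, round(0.10 * n_train))             # LOF neighbours: 10% of training size
    return {"gamma": gamma, "bandwidth": bandwidth, "nu": (0.1, 0.5), "k": k}


# Spambase as listed in Table I: 57 features, 2230 training points.
params = occ_defaults(n_features=57, n_train=2230)
print(params)
```

When the classifiers run on a latent representation rather than the original data, `n_features` would be the bottleneck size instead of the input dimension.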
AE Parameters: The architectures of SAE and DVAE are configured as follows: the number of hidden layers is equal to 5 as in [14], and the size of the bottleneck layer $m$ is chosen by the rule of thumb presented in [13], $m = [1 + \sqrt{n}]$, where $n$ is the number of input features. The choice of mini-batch size depends on the size of the training set. This is needed because the sizes of the datasets vary by a factor of 500. For small training sets (< 2000), we split the data into 20 batches. For larger training sets, we set the mini-batch size to 100. We also want to provide a similar number of batches for each iteration in the training processes, which helps early stopping work efficiently. To avoid tuning the learning rate and the number of training iterations, we employ the Adadelta algorithm [47] together with early-stopping techniques [48] for training these networks, which enables the training processes to operate automatically and avoid over-fitting. The hyperbolic tangent function is chosen as the activation function for these AEs. Weights are initialized following the scheme in [49].
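The sizing rules in this paragraph can be written down directly. The sketch below assumes that the brackets in the bottleneck rule denote the integer part and that the 20 batches of a small training set are of (nearly) equal size; both readings are assumptions, and the function names are hypothetical.

```python
import math


def bottleneck_size(n_features):
    """Rule of thumb m = [1 + sqrt(n)], reading [.] as the integer part."""
    return int(1 + math.sqrt(n_features))


def batch_plan(n_train, small_limit=2000, n_small_batches=20, large_batch_size=100):
    """Return (batch_size, n_batches) following the mini-batch rule in the text."""
    if n_train < small_limit:
        n_batches = n_small_batches
        batch_size = math.ceil(n_train / n_batches)
    else:
        batch_size = large_batch_size
        n_batches = math.ceil(n_train / batch_size)
    return batch_size, n_batches


# Shapes taken from Table I.
m_nslkdd = bottleneck_size(122)     # NSL-KDD after one-hot encoding
m_ads = bottleneck_size(1558)       # InternetAds
plan_pendigits = batch_plan(780)    # small training set
plan_rbot = batch_plan(6338)        # large training set
print(m_nslkdd, m_ads, plan_pendigits, plan_rbot)
```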
In practice, the KL-divergence in the DVAE loss function is scaled by $\log_{10}$ since its value is extremely large in early epochs. The distribution of latent data before training seems to be very similar to the standard Gaussian distribution, whereas the prior $p_\theta(z)$ is close to a Dirac delta distribution; thus the KL-divergence is very large, especially at early iterations of the training process. Fig. 3 (also Fig. 5 in the supplementary material) illustrates the distribution of latent data (the first feature $z_0$) during the training process. Therefore, the $\log_{10}$ scaling is expected to reduce the domination of this term in the loss function.
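To give a sense of the magnitudes involved, the sketch below evaluates one latent dimension of the KL term in (14) for an encoder output that still looks like a standard Gaussian, with the prior variance set to $10^{-8}$ as in the DVAE experiments. How exactly the $\log_{10}$ scaling enters the loss is not specified in the text; taking $\log_{10}$ of the raw KL value, as done here, is only one possible reading.

```python
import math


def kl_per_dim(mu, var, alpha):
    """One summand of Eq. (14), including the 1 / (2 * alpha) factor."""
    return (var + mu * mu - alpha + alpha * math.log(alpha)
            - alpha * math.log(var)) / (2.0 * alpha)


alpha = 1e-8

# Before training the latent data is roughly N(0, 1): mu = 0, variance = 1.
kl_init = kl_per_dim(mu=0.0, var=1.0, alpha=alpha)

# Assumed form of the scaling: log10 of the raw KL value.
kl_log10 = math.log10(kl_init)
print(kl_init, kl_log10)
```

The raw term is close to $\frac{1}{2\alpha} = 5 \times 10^{7}$ per latent dimension, several orders of magnitude above a typical MSE on data scaled to [-1, 1], whereas its base-10 logarithm is below 8.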
Fig. 3. Histogram of latent data (the first feature $z_0$) during the training of DVAE ($\alpha = 10^{-8}$) on Spambase.
SAE and DVAE are trained to minimize the loss functions in (10) and (16) by an adaptive SGD algorithm (Adadelta) as in the training of MLPs. We do not apply a pretraining procedure for these networks since modern back-propagation methods (weight initialization [49] and Adadelta [47]), together with the new regularization terms, are expected to encourage the networks to learn the parameters in hidden layers effectively. Early stopping is controlled by two parameters. Training terminates when the loss does not improve by an absolute value of $10^{-3}$ for $t$ iterations, where $t$ is calculated as 2000 / number of batches (the number of batches is defined earlier in this section). Note that only normal data is employed for the training process.

$^{6}$ This is expected to show the influence of $\nu$ on the performance of OCSVM.
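The early-stopping rule described above can be sketched as a small loop over recorded loss values. The code assumes that "improve" means beating the best loss seen so far by at least $10^{-3}$ and that $t$ is obtained by integer division; the function names and the loss sequence are made up for illustration.

```python
def patience(n_batches, budget=2000):
    """Number of non-improving iterations tolerated: t = 2000 / number of batches."""
    return max(1, budget // n_batches)


def early_stop_step(losses, t, min_delta=1e-3):
    """Index of the iteration at which training stops, or None if it never does."""
    best = float("inf")
    stale = 0
    for step, loss in enumerate(losses):
        if best - loss >= min_delta:
            # Sufficient improvement: remember it and reset the counter.
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= t:
                return step
    return None


t_small = patience(20)   # small training sets are split into 20 batches
t_large = patience(64)   # e.g. 6338 training points with mini-batches of 100

# Made-up loss curve that flattens out after the third iteration.
losses = [1.0, 0.5, 0.4, 0.3995, 0.3993, 0.3992, 0.3991]
stop_at = early_stop_step(losses, t=3)
print(t_small, t_large, stop_at)
```

Tying $t$ to the number of batches keeps the total number of mini-batch updates tolerated without improvement roughly constant across datasets of very different sizes.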
We use the same model selection for setting up a five hidden layer DAE and a five hidden layer VAE$^{7}$. However, the DAE is trained in a greedy layer-wise fashion following the original scheme proposed in [20], [21]. In the pretraining procedure, each single denoising autoencoder is trained to minimize the MSE between the reconstruction formed from a corrupted version$^{8}$ of the input and the original input. This is optimized by SGD with a common value for the learning rate, $10^{-2}$, and 200 iterations$^{9}$ to initialize weights and biases for the DAE. The DAE and VAE are then fine-tuned (end-to-end) as in the training of SAE and DVAE.
Estimating $\lambda$: This is carried out to estimate the parameter $\lambda$ in the loss functions of SAE (10) and DVAE (16). The regularizers (shrink in SAE and KL-divergence in DVAE) force normal datapoints as close together as possible at the origin, whereas the reconstruction loss attempts to keep them from overlapping in order to reconstruct them at the output layer. The two components tend to conflict with each other. Thus, an appropriate value of $\lambda$ should be chosen to balance the two components. However, anomalous data is not available for tuning $\lambda$ or for determining the number of training iterations in order to avoid overfitting. According to [50], there are three phases in the training process of a feed-forward network. The generalization error includes two components called approximation error and complexity error. In the first phase, the approximation error dominates the complexity error, and the generalization error decreases gradually. In phase 2, these components are approximately balanced, and the generalization error continues to decrease further. The complexity error becomes increasingly large after phase 2, and dominates the approximation error due to large network weights, which can lead to oscillation and high generalization errors (phase 3). Thus, the training process should be stopped in phase 2.
Therefore, we investigate these loss functions and their two components on five values, $\lambda_{SAE} \in \{0.1, 1, 5, 10, 50\}$ and $\lambda_{DVAE} \in \{0.001, 0.01, 0.05, 0.1, 0.5\}$, on four datasets over 1000 epochs. Firstly, we observe three phases on the SAE training error curves. The larger the value of $\lambda$, the longer phase 2 lasts, which makes it easy to choose early stopping parameters. When $\lambda$ is large (about 10) phase 2 is longer, but $\lambda = 50$ makes the training error less stable in phase 2. $\lambda = 10$ seems to be a good value which allows us to choose common values for early stopping parameters. When we apply early stopping with $\lambda_{SAE} = 10$, we see that the stopping point is mostly in phase 2. We also observe the AUC curves, and early stopping appears to perform well. AUCs are even very good in the first few epochs on some datasets, but we do not use AUCs to choose $\lambda$. Similarly, we choose $\lambda_{DVAE} = 0.05$. For brevity we present only the curves of SAE on CTU13-10 with $\lambda_{SAE} = 10$ in Fig. 4, and on the four datasets in Figs. 1–4 in the supplementary material.

$^{7}$ Equation (9) is rewritten in the form of the VAE loss function since the VAE is trained under the same training scheme as DVAE: $L_{VAE}(\theta, \phi; x^i) = \frac{1}{n}\sum_{i=1}^{n}\|x^i - \hat{x}^i\|^2 + \frac{1}{2}\sum_{j=1}^{J}\left[(\sigma_j^i)^2 + (\mu_j^i)^2 - 1 - \log\left((\sigma_j^i)^2\right)\right]$.
$^{8}$ It is obtained by randomly setting 10% of the input features to zero.
$^{9}$ There is no need for early stopping here since this step only aims to initialize weights and biases close to a good solution.
Fig. 4. SAE loss function and its components (RE and Shrink losses) (w.r.t the
left y-axis), and the AUCs created by SAE-LOF, SAE-CEN and SAE-OCSVM
(w.r.t the right y-axis) during the training process of SAE on CTU13-10.
3) Main experiments: The bottleneck layers of the trained DAE, VAE, SAE and DVAE are used as latent representations for six one-class classifiers: LOF, CEN, MDIS, KDE, OCSVM$_{\nu=0.1}$ and OCSVM$_{\nu=0.5}$. We use the terms DAE-OCCs, VAE-OCCs, SAE-OCCs, and DVAE-OCCs to refer to the six one-class classifiers when using the latent representations of DAE, VAE, SAE and DVAE respectively. The REs of these AEs are also used as anomaly scores, which yields four further RE-based classifiers. The performance of the stand-alone one-class classifiers on the original data is considered as the baseline. All experiments are implemented in Python 2.7 and run on a machine with an Intel Core 2 Duo i5-3360M CPU at 2.8 GHz, 8 GB RAM and RAM frequency of 1600 MHz, and the implementation of our algorithms is available on GitHub (https://github.com/vanloicao/SAEDVAE). The OCCs provided by scikit-learn are employed [46]. The main results are shown in Table II.
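To make the hybrid pipeline concrete, the sketch below scores latent vectors with a centroid-based classifier. It assumes that CEN ranks a point by its Euclidean distance to the centroid of the training latent vectors, which is not spelled out in this section; the latent vectors are made up, and the encoder that would produce them is omitted.

```python
import math


def fit_centroid(latent_train):
    """Mean of the training latent vectors (normal data only)."""
    n = len(latent_train)
    dims = len(latent_train[0])
    return [sum(z[d] for z in latent_train) / n for d in range(dims)]


def centroid_score(z, centroid):
    """Anomaly score: Euclidean distance from z to the centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, centroid)))


# Made-up bottleneck outputs: normal points cluster tightly around the origin.
latent_train = [[0.01, -0.02], [-0.01, 0.02], [0.0, 0.0]]
centroid = fit_centroid(latent_train)

score_normal = centroid_score([0.0, 0.01], centroid)
score_anomaly = centroid_score([0.6, -0.8], centroid)
print(centroid, score_normal, score_anomaly)
```

Under this reading the query cost is a single distance computation per point, independent of how many training points were used to estimate the centroid.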
B. Analysis and discussion
Discussion: Table II presents the AUCs achieved by DAE-OCCs, VAE-OCCs, SAE-OCCs and DVAE-OCCs, and their corresponding RE-based classifiers, in the 2nd to the 5th rows respectively. The results of the six stand-alone one-class classifiers are shown in the first row. Each column represents the AUCs produced by a number of classifiers on the same problem. We use gray-scale to present the performance of these classifiers on each dataset. In each column, the highest AUC is highlighted by the lightest gray. The fourteen datasets are arranged in ascending order of sparsity.
Table II shows that when working on the latent repre-
sentations produced by SAE and DVAE, the six one-class
classifiers perform better in terms of classification accuracy
than those using DAE, VAE or stand-alone OCCs on the eight
network-related datasets. These datasets are typically very
high-dimensional and sparse, such as InternetAds with 1558
features. This suggests that the latent representations produced
by SAE and DVAE facilitate these one-class classifiers in deal-
ing with high-dimensional and sparse network-related datasets.
However, VAE-OCCs produce relatively poor performance. This can be explained as follows: the VAE regularizer has less influence on learning the representation since the latent data is already in a good shape before training (see Fig. 3). Thus, most of the representation power of the VAE may be used for reconstruction. Moreover, normal data resides in a large region that may give more “room” for anomalies to appear inside the region. The normal data is also not forced onto the non-saturated part of the activation function.
The hybrid SAE-OCCs and DVAE-OCCs also yield very similar AUCs on each network-related dataset, even though these one-class classifiers originate from different algorithms, and their parameters (e.g. $\nu$) are set to different values. This is clear to see in the 4th and 5th rows where sparsity > 0.50. This implies that SAE and DVAE may constrain normal data in their latent representations to a well-shaped distribution that makes it straightforward for these classification algorithms to capture normal behaviors, and less sensitive to parameter settings. Moreover, SAE-OCCs and DVAE-OCCs produce comparable or superior AUCs in comparison to the RE-based DAE classifier on the network-related datasets, especially for high sparsity and dimensionality. The influence of OCC parameters and the distribution of latent vectors are explored later.
The influence of dimensionality and sparsity: We next inves-
tigate the influence of sparsity and dimensionality of data on
the classification accuracy produced from hybrid DAE-OCCs,
SAE-OCCs and DVAE-OCCs. We use the term AUC-DIFF to
refer to the difference in AUC between a classifier (e.g. LOF)
on the original data and on the data encoded by an AE. A
positive value of AUC-DIFF indicates an improvement due to
the AE encoding. AUC-DIFF is plotted against sparsity and
dimensionality of datasets shown in Fig. 5(a) and Fig. 5(b).
It can be seen from Fig. 5(a) that there is a clear increasing
trend in the AUC-DIFF lines of SAE-OCCs and DVAE-OCCs,
while the AUC-DIFF graph of DAE-OCCs tends to decrease.
Similar patterns can also be found when investigating the
influence of dimensionality, shown in Fig. 5(b). The ranking of
datasets by sparsity is similar to the ranking by dimensionality,
therefore these two pieces of evidence are partly overlapping.
The conclusion is that the benefit of the new AE encodings
is greater for sparse, high-dimension datasets, whereas the
benefit of the existing DAE encoding is greater for small, non-
sparse datasets.
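As a worked example of AUC-DIFF, the snippet below takes the LOF entries of Table II for the least sparse dataset (PageBlocks) and the sparsest one (InternetAds) and subtracts the stand-alone AUC from the AUC obtained on each encoding. The dictionary layout and function name are illustrative only; the numbers are copied from the table.

```python
def auc_diff(auc_encoded, auc_original):
    """Positive values mean the AE encoding improved the classifier."""
    return auc_encoded - auc_original


# LOF AUCs from Table II.
lof_auc = {
    "PageBlocks": {"stand_alone": 0.971, "DAE": 0.933, "SAE": 0.954, "DVAE": 0.908},
    "InternetAds": {"stand_alone": 0.762, "DAE": 0.476, "SAE": 0.943, "DVAE": 0.900},
}

diffs = {
    dataset: {
        model: round(auc_diff(values[model], values["stand_alone"]), 3)
        for model in ("DAE", "SAE", "DVAE")
    }
    for dataset, values in lof_auc.items()
}
print(diffs)
```

For LOF, all three encodings slightly reduce the AUC on PageBlocks, while on InternetAds the SAE and DVAE encodings add 0.181 and 0.138 and the DAE encoding loses 0.286, in line with the trend described above.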
The influence of OCC parameters: This is to assess the influence of the OCC parameters $\nu$, $\gamma$ and $k$ on the classification accuracy of OCSVM and LOF when using the latent representations of DAE, SAE and DVAE. The parameter $\gamma$ is fixed at $\frac{1}{n_f}$ when investigating $\nu$, whereas $\nu$ is set to 0.1 when examining $\gamma$. Each of these parameters is examined on fifty different values, $\nu \in [0.01, 0.5]$ and $\gamma \in [2 \times 10^{-4}, 2 \times 10^{4}]$. We plot AUCs from DAE-OCSVM, SAE-OCSVM and DVAE-OCSVM against $\nu$ in Fig. 6(a), and against $\gamma$ in Fig. 6(b). The figures show that the AUC curves of SAE-OCSVM and DVAE-OCSVM tend to be stable while those of DAE-OCSVM vary according to the values of $\nu$ or $\gamma$. This implies that the latent representations
TABLE II
AUCS FROM THE STAND-ALONE ONE-CLASS CLASSIFIERS, HYBRID DAE-OCCS, SAE-OCCS AND DVAE-OCCS, AND THE RE-BASED CLASSIFIERS. DATASET SPARSITY IS GIVEN IN PARENTHESES.

| Representation method | One-class classifier | PageBlocks (0.00) | WPBC (0.02) | PenDigits (0.13) | GLASS (0.18) | Shuttle (0.22) | Arrhythmia (0.50) | CTU13-10 (0.71) | CTU13-08 (0.73) | CTU13-09 (0.73) | CTU13-13 (0.73) | Spambase (0.81) | UNSW-NB15 (0.84) | NSL-KDD (0.88) | InternetAds (0.99) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stand-alone | LOF | 0.971 | 0.600 | 0.995 | 0.972 | 0.984 | 0.788 | 0.902 | 0.899 | 0.955 | 0.963 | 0.751 | 0.745 | 0.793 | 0.762 |
| Stand-alone | CEN | 0.944 | 0.580 | 0.966 | 0.961 | 0.881 | 0.816 | 0.996 | 0.971 | 0.915 | 0.916 | 0.816 | 0.738 | 0.955 | 0.816 |
| Stand-alone | MDIS | 0.927 | 0.640 | 0.962 | 0.970 | 0.898 | 0.786 | 0.998 | 0.966 | 0.734 | 0.891 | 0.731 | 0.801 | 0.929 | 0.694 |
| Stand-alone | KDE | 0.928 | 0.637 | 0.961 | 0.967 | 0.883 | 0.787 | 0.998 | 0.958 | 0.720 | 0.889 | 0.731 | 0.800 | 0.924 | 0.693 |
| Stand-alone | OCSVM$_{\nu=0.5}$ | 0.934 | 0.610 | 0.961 | 0.961 | 0.863 | 0.794 | 0.998 | 0.958 | 0.851 | 0.925 | 0.736 | 0.807 | 0.935 | 0.704 |
| Stand-alone | OCSVM$_{\nu=0.1}$ | 0.934 | 0.557 | 0.968 | 0.832 | 0.760 | 0.807 | 0.983 | 0.797 | 0.852 | 0.898 | 0.736 | 0.792 | 0.890 | 0.710 |
| DAE | LOF | 0.933 | 0.553 | 0.997 | 0.931 | 0.985 | 0.654 | 0.751 | 0.896 | 0.891 | 0.793 | 0.392 | 0.736 | 0.662 | 0.476 |
| DAE | CEN | 0.922 | 0.693 | 0.964 | 0.959 | 0.931 | 0.738 | 0.972 | 0.949 | 0.628 | 0.730 | 0.476 | 0.743 | 0.881 | 0.337 |
| DAE | MDIS | 0.905 | 0.700 | 0.950 | 0.994 | 0.901 | 0.707 | 0.981 | 0.960 | 0.653 | 0.855 | 0.466 | 0.765 | 0.888 | 0.342 |
| DAE | KDE | 0.903 | 0.690 | 0.954 | 0.992 | 0.892 | 0.706 | 0.980 | 0.939 | 0.616 | 0.857 | 0.460 | 0.756 | 0.861 | 0.335 |
| DAE | OCSVM$_{\nu=0.5}$ | 0.912 | 0.630 | 0.958 | 0.989 | 0.885 | 0.665 | 0.981 | 0.938 | 0.655 | 0.711 | 0.454 | 0.690 | 0.854 | 0.325 |
| DAE | OCSVM$_{\nu=0.1}$ | 0.920 | 0.557 | 0.976 | 0.606 | 0.762 | 0.668 | 0.937 | 0.775 | 0.702 | 0.332 | 0.578 | 0.536 | 0.697 | 0.314 |
| DAE | RE-Based | 0.969 | 0.540 | 0.997 | 0.986 | 0.821 | 0.824 | 0.998 | 0.988 | 0.943 | 0.972 | 0.805 | 0.873 | 0.959 | 0.842 |
| VAE | LOF | 0.512 | 0.480 | 0.549 | 0.444 | 0.489 | 0.479 | 0.490 | 0.499 | 0.507 | 0.500 | 0.509 | 0.505 | 0.501 | 0.474 |
| VAE | CEN | 0.514 | 0.497 | 0.549 | 0.526 | 0.489 | 0.461 | 0.490 | 0.500 | 0.507 | 0.499 | 0.507 | 0.504 | 0.501 | 0.472 |
| VAE | MDIS | 0.509 | 0.517 | 0.553 | 0.523 | 0.490 | 0.488 | 0.490 | 0.498 | 0.507 | 0.500 | 0.507 | 0.504 | 0.501 | 0.467 |
| VAE | KDE | 0.509 | 0.527 | 0.554 | 0.523 | 0.490 | 0.488 | 0.490 | 0.498 | 0.507 | 0.500 | 0.507 | 0.504 | 0.501 | 0.467 |
| VAE | OCSVM$_{\nu=0.5}$ | 0.510 | 0.517 | 0.555 | 0.521 | 0.490 | 0.484 | 0.490 | 0.498 | 0.507 | 0.500 | 0.507 | 0.504 | 0.501 | 0.466 |
| VAE | OCSVM$_{\nu=0.1}$ | 0.515 | 0.537 | 0.553 | 0.537 | 0.491 | 0.466 | 0.490 | 0.498 | 0.507 | 0.499 | 0.505 | 0.505 | 0.501 | 0.463 |
| VAE | RE-Based | 0.928 | 0.657 | 0.959 | 0.961 | 0.883 | 0.784 | 0.998 | 0.957 | 0.698 | 0.881 | 0.734 | 0.801 | 0.923 | 0.694 |
| SAE ($\lambda = 10$) | LOF | 0.954 | 0.607 | 0.996 | 0.959 | 0.817 | 0.762 | 1.000 | 0.983 | 0.960 | 0.975 | 0.813 | 0.894 | 0.937 | 0.943 |
| SAE ($\lambda = 10$) | CEN | 0.964 | 0.610 | 0.995 | 0.915 | 0.800 | 0.754 | 0.999 | 0.991 | 0.950 | 0.969 | 0.835 | 0.886 | 0.963 | 0.935 |
| SAE ($\lambda = 10$) | MDIS | 0.967 | 0.603 | 0.996 | 0.898 | 0.794 | 0.757 | 0.999 | 0.990 | 0.950 | 0.968 | 0.826 | 0.887 | 0.964 | 0.936 |
| SAE ($\lambda = 10$) | KDE | 0.967 | 0.607 | 0.996 | 0.884 | 0.783 | 0.756 | 0.999 | 0.990 | 0.949 | 0.968 | 0.825 | 0.886 | 0.964 | 0.934 |
| SAE ($\lambda = 10$) | OCSVM$_{\nu=0.5}$ | 0.967 | 0.610 | 0.996 | 0.876 | 0.773 | 0.756 | 0.999 | 0.990 | 0.950 | 0.970 | 0.823 | 0.891 | 0.964 | 0.935 |
| SAE ($\lambda = 10$) | OCSVM$_{\nu=0.1}$ | 0.956 | 0.600 | 0.996 | 0.890 | 0.781 | 0.740 | 0.999 | 0.988 | 0.944 | 0.971 | 0.825 | 0.893 | 0.961 | 0.933 |
| SAE ($\lambda = 10$) | RE-Based | 0.929 | 0.637 | 0.959 | 0.959 | 0.884 | 0.787 | 0.997 | 0.958 | 0.720 | 0.888 | 0.734 | 0.800 | 0.925 | 0.690 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | LOF | 0.908 | 0.327 | 0.987 | 0.705 | 0.841 | 0.807 | 0.999 | 0.978 | 0.954 | 0.973 | 0.810 | 0.876 | 0.958 | 0.900 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | CEN | 0.906 | 0.450 | 0.988 | 0.774 | 0.849 | 0.777 | 0.999 | 0.982 | 0.956 | 0.963 | 0.809 | 0.879 | 0.960 | 0.892 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | MDIS | 0.914 | 0.437 | 0.987 | 0.749 | 0.810 | 0.794 | 0.999 | 0.984 | 0.957 | 0.964 | 0.806 | 0.873 | 0.961 | 0.883 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | KDE | 0.917 | 0.430 | 0.987 | 0.749 | 0.802 | 0.796 | 0.999 | 0.985 | 0.957 | 0.964 | 0.806 | 0.872 | 0.961 | 0.882 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | OCSVM$_{\nu=0.5}$ | 0.920 | 0.450 | 0.988 | 0.769 | 0.802 | 0.797 | 0.999 | 0.987 | 0.957 | 0.974 | 0.808 | 0.872 | 0.961 | 0.882 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | OCSVM$_{\nu=0.1}$ | 0.922 | 0.460 | 0.988 | 0.791 | 0.804 | 0.780 | 0.999 | 0.988 | 0.956 | 0.973 | 0.817 | 0.872 | 0.959 | 0.881 |
| DVAE ($\lambda = 0.05$, $\alpha = 10^{-8}$) | RE-Based | 0.928 | 0.640 | 0.958 | 0.953 | 0.880 | 0.785 | 0.998 | 0.922 | 0.715 | 0.836 | 0.734 | 0.803 | 0.924 | 0.694 |

(a) (b) (c)
Fig. 5. The influence of sparsity (a) and dimensionality (b) on the AUCs produced by six one-class classifiers using latent representations of DAE, SAE and
DVAE. The visualization of the latent data (the first two features z0 and z1) created by DAE, SAE and DVAE (c) on CTU13-10.
of SAE and DVAE make OCSVM perform consistently over a wide range of $\nu$ and $\gamma$ values.
The number of neighbors k is chosen in the range from
1% to 50% of training size. For example, if k is 10% of a
training dataset of size 200 samples, k is equal to 20. The
AUCs of hybrid DAE-LOF, SAE-LOF and DVAE-LOF are
computed, and plotted against 50 values of k as shown in
Fig. 6(c). The AUC curves of the hybrid SAE-LOF and DVAE-
LOF seem to level off within the range of k while there is
no clear trend for the AUC curve of DAE-LOF. Thus, the
latent representations of SAE and DVAE strengthen LOF to
be insensitive to the choice of k. More results are shown in
Fig. 6 of the supplementary material.
These experiments confirm that the one-class classifiers,
such as OCSVM and LOF, perform consistently on wide
ranges of parameter settings when using the latent represen-
tations of SAE and DVAE. This can be explained by: (1)
normal data is represented in very well-shaped (Gaussian)
distributions, and allocated in a small region highly isolated
from the regions where anomalies are expected to appear; (2)
the normal data from different sources will have a similar
representation. Fig. 5(c) is a typical example (also Fig. 7 in
the supplementary material). Therefore, OCSVM and LOF can model normal data very well even though these classifiers use few datapoints for support vectors in OCSVM (e.g. $\nu = 0.01$) or for nearest neighbors in LOF (e.g. $k$ = 1% of the training size). This happens on several datasets.
The influence of training size: We investigate the influence
of training size on the latent representations of SAE and
DVAE for anomaly detection tasks. Four datasets of more than
10000 training instances are chosen for this experiment, that
is CTU13-09, CTU13-13, NSL-KDD and UNSW-NB15. Each
dataset is sub-sampled multiple times (sizes ranging from 500
to 10000) to give smaller training set sizes for this experiment.
Model selection is used as described in Subsection V-A2. The
AUCs and query times produced from the hybrid SAE-OCCs
and DVAE-OCCs are plotted against these training sizes as
shown in Fig. 8 and Fig. 9 in the supplementary material. The
results clearly show that the six one-class classifiers produce
very similar AUCs amongst the five sizes on the same dataset.
This suggests that the representation models, SAE and DVAE,
tend to be consistent on a wide range of training sizes, and
are less sensitive to training size than the hybrid DBN-OCCs
in [14, see Fig. 5]. This is a positive result because it appears
that excessive amounts of data are not required to make this
method perform well. In terms of the complexity at query time,
CEN out-performs other OCCs, and its query time does not
scale with training size.
Specific kinds of attacks: Our representation models are also
examined on the thirteen specific attack groups in NSL-KDD
and UNSW-NB15 as shown in Table III. This table has a
similar structure to Table II, without arrangement according to
sparsity. In general, the hybrid SAE-OCCs and DVAE-OCCs
produce big improvements in the classification accuracy in
comparison to their baselines on most of the attack groups,
especially on the attack groups where the baseline is already
good. This presents a common theme in classification methods.
Moreover, the performance of SAE-CEN is evaluated on NSL-KDD by a confusion matrix as shown in Table IV. The confusion matrix is not the same as in a multi-class classification problem, because classifiers built from only normal data use a threshold to classify unseen data into either the normal or the anomalous class. This means that we cannot measure the misclassification of a normal datapoint as a specific attack group, or of one attack group as another. Therefore, precision values are only computed for normal and anomaly in the table. In this work, the threshold is set to correctly classify 90% of the normal training data.
TABLE IV
CONFUSION MATRIX OF THE HYBRID SAE-CEN ON NSL-KDD

| Prediction | Actual: Normal | Actual: Probe | Actual: DoS | Actual: R2L | Actual: U2R | Precision |
|---|---|---|---|---|---|---|
| Normal | **8658** | 3 | 601 | 848 | 10 | 85.6% |
| Anomaly | 1053 | **2418** | **6857** | **2039** | **57** | 91.5% |
| Recall | 89.2% | 99.9% | 91.9% | 70.6% | 85.1% | 88.8% |

Note: the values in bold are correctly classified.
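The precision and recall figures in Table IV can be recomputed from the raw counts, which also shows that the bottom-right cell (88.8%) is the overall accuracy. The snippet below is a small check written for this purpose; the variable names are illustrative and the counts are copied from the table.

```python
classes = ["Normal", "Probe", "DoS", "R2L", "U2R"]
pred_normal = [8658, 3, 601, 848, 10]        # row "Prediction: Normal"
pred_anomaly = [1053, 2418, 6857, 2039, 57]  # row "Prediction: Anomaly"

# Recall per actual class: share of each column that is classified correctly.
recall = {}
for i, name in enumerate(classes):
    correct = pred_normal[i] if name == "Normal" else pred_anomaly[i]
    recall[name] = correct / (pred_normal[i] + pred_anomaly[i])

# Precision per predicted class: share of each row that is correct.
precision_normal = pred_normal[0] / sum(pred_normal)
precision_anomaly = sum(pred_anomaly[1:]) / sum(pred_anomaly)

# Overall accuracy over all 22544 test records.
correct_total = pred_normal[0] + sum(pred_anomaly[1:])
accuracy = correct_total / (sum(pred_normal) + sum(pred_anomaly))

for name in classes:
    print(name, round(100 * recall[name], 1))
print(round(100 * precision_normal, 1), round(100 * precision_anomaly, 1),
      round(100 * accuracy, 1))
```

The recall of 89.2% on normal test data is close to the 90% of normal training data that the threshold was chosen to accept.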
In terms of classification accuracy, the performance of these one-class classification algorithms is comparable when the encoding is good (e.g. the encoding of SAE and DVAE). When considering computational complexity, CEN, which is a simple method without hyperparameters, is very computationally efficient at both modeling and querying. Thus, it is nominated as the best model in our experiments.
VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed latent representation models,
SAE and DVAE, which help anomaly detection methods
to cope with high-dimensional and sparse network datasets.
Classical AEs do not bring data to a “nice” distribution by
themselves, and the distribution they create is arbitrary. In the
tasks where we rely on good behavior of the encoding, we have
to control the distribution. Even with the standard VAE regu-
larization which does control the distribution, it does not put
the network “under pressure” to use all of its representational
power to represent normal data. Our approaches do so, forcing
normal data into a very tight area centered at the origin in
the non-saturating area of the bottleneck unit activations. This
helps AEs trained on normal data to keep normal datapoints
close to the origin and push anomalies far away.
We have demonstrated the latent representation created by
our models helps well-known anomaly detection algorithms
to perform efficiently and consistently on high-dimensional
and sparse network data, even with relatively few training
points. Amongst these algorithms, CEN is very computation-
ally efficient and is easily feasible to perform in real-time.
More importantly, the representation reduces the difficulty of
model selection for these algorithms since their performance
is insensitive to a wide range of hyperparameter settings.
In future we propose to investigate latent representations using Gaussian mixture models. We also plan to propose an alternative method for estimating the hyperparameter $\lambda$ in the loss functions of SAE and DVAE, possibly using multi-objective optimization.
Fig. 6. The influence of $\nu$ (a), $\gamma$ (b), and $k$ (c) on the performance of OCSVM and LOF respectively when using the latent representations of DAE, SAE and DVAE on CTU13-13.
TABLE III
AUCS FROM THE CLASSIFIERS MENTIONED IN TABLE II ON SPECIFIC ATTACK GROUPS OF NSL-KDD AND UNSW-NB15.
Representation
Methods
One-class
Classifiers
NSL-KDD UNSW-NB15
P
ro
b
e
D
o
S
R
2
L
U
2
R
F
u
z
z
e
rs
A
n
a
ly
s
is
B
a
c
k
d
o
o
r
D
o
S
E
x
p
lo
it
s
G
e
n
e
ri
c
R
e
c
o
n
n
-
-a
is
s
a
n
c
e
S
h
e
ll
c
o
d
e
W
o
rm
s
Stand-alone
LOF 0.752 0.796 0.821 0.703 0.455 0.635 0.597 0.614 0.670 0.984 0.436 0.354 0.614
CEN 0.974 0.957 0.933 0.934 0.576 0.732 0.748 0.723 0.633 0.895 0.555 0.508 0.676
MDIS 0.986 0.949 0.831 0.885 0.596 0.890 0.900 0.843 0.660 0.969 0.636 0.583 0.679
KDE 0.985 0.945 0.820 0.871 0.601 0.883 0.893 0.840 0.658 0.969 0.639 0.591 0.684
OCSVM⌫=0.5 0.986 0.957 0.838 0.905 0.652 0.855 0.876 0.845 0.733 0.920 0.658 0.603 0.784
OCSVM⌫=0.1 0.958 0.936 0.714 0.789 0.576 0.712 0.733 0.746 0.731 0.961 0.555 0.469 0.853
DAE
LOF 0.620 0.666 0.690 0.509 0.473 0.609 0.560 0.588 0.626 0.985 0.462 0.420 0.561
CEN 0.984 0.926 0.680 0.755 0.551 0.788 0.799 0.744 0.571 0.927 0.626 0.608 0.606
MDIS 0.966 0.912 0.761 0.746 0.565 0.818 0.828 0.770 0.588 0.955 0.644 0.606 0.651
KDE 0.964 0.904 0.666 0.743 0.563 0.799 0.809 0.751 0.571 0.949 0.646 0.614 0.642
OCSVM⌫=0.5 0.982 0.917 0.584 0.795 0.580 0.770 0.798 0.732 0.499 0.827 0.671 0.618 0.732
OCSVM⌫=0.1 0.734 0.834 0.323 0.308 0.391 0.289 0.305 0.417 0.420 0.694 0.527 0.468 0.722
RE-Based 0.981 0.971 0.911 0.930 0.632 0.992 0.957 0.940 0.888 0.979 0.592 0.476 0.816
VAE
LOF 0.489 0.504 0.511 0.488 0.503 0.487 0.522 0.494 0.505 0.501 0.489 0.500 0.464
CEN 0.488 0.504 0.511 0.489 0.504 0.487 0.522 0.494 0.506 0.502 0.488 0.501 0.468
MDIS 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465
KDE 0.489 0.503 0.512 0.489 0.504 0.486 0.523 0.494 0.505 0.501 0.489 0.499 0.465
OCSVM⌫=0.5 0.489 0.503 0.512 0.489 0.504 0.487 0.523 0.494 0.504 0.501 0.489 0.499 0.464
OCSVM⌫=0.1 0.489 0.504 0.511 0.490 0.504 0.487 0.522 0.494 0.505 0.501 0.489 0.499 0.462
RE-Based 0.985 0.945 0.818 0.871 0.605 0.882 0.893 0.840 0.660 0.968 0.642 0.598 0.686
SAE
= 10
LOF 0.964 0.952 0.877 0.920 0.683 0.993 0.963 0.942 0.884 0.992 0.706 0.645 0.909
CEN 0.985 0.971 0.925 0.953 0.646 0.984 0.961 0.952 0.902 0.989 0.625 0.567 0.910
MDIS 0.988 0.971 0.926 0.950 0.629 0.994 0.961 0.952 0.909 0.988 0.646 0.573 0.909
KDE 0.988 0.971 0.925 0.949 0.623 0.993 0.961 0.952 0.909 0.988 0.642 0.559 0.906
OCSVM⌫=0.5 0.987 0.972 0.923 0.948 0.632 0.994 0.965 0.956 0.917 0.988 0.656 0.579 0.907
OCSVM⌫=0.1 0.987 0.973 0.912 0.908 0.648 0.994 0.967 0.957 0.921 0.988 0.642 0.554 0.902
RE-Based 0.985 0.946 0.822 0.872 0.601 0.881 0.891 0.838 0.657 0.969 0.640 0.592 0.685
DVAE (λ = 0.05, α = 10^-8)
LOF 0.977 0.974 0.896 0.934 0.635 0.996 0.956 0.949 0.898 0.990 0.537 0.457 0.895
CEN 0.983 0.971 0.915 0.929 0.605 0.995 0.958 0.941 0.882 0.990 0.666 0.603 0.881
MDIS 0.982 0.972 0.915 0.927 0.616 0.994 0.955 0.940 0.866 0.990 0.653 0.572 0.854
KDE 0.982 0.972 0.915 0.927 0.608 0.993 0.956 0.939 0.864 0.990 0.658 0.578 0.852
OCSVM (ν=0.5) 0.982 0.973 0.914 0.926 0.601 0.993 0.960 0.942 0.869 0.990 0.661 0.584 0.860
OCSVM (ν=0.1) 0.981 0.972 0.908 0.908 0.599 0.994 0.961 0.942 0.871 0.990 0.659 0.586 0.860
RE-Based 0.985 0.945 0.820 0.872 0.602 0.888 0.898 0.843 0.660 0.971 0.642 0.593 0.682
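To make the table entries concrete, the following minimal sketch shows how one such score could be computed, assuming the reported values are AUC scores and that CEN denotes a centroid-distance scorer applied to latent vectors. The synthetic two-dimensional data and the helper names (`centroid_scores`, `auc`) are illustrative assumptions, not the authors' implementation or data.

```python
import math
import random


def centroid(points):
    """Mean vector of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def centroid_scores(train_normal, test_points):
    """Anomaly score: Euclidean distance to the centroid of the normal training data."""
    c = centroid(train_normal)
    return [math.dist(p, c) for p in test_points]


def auc(scores, labels):
    """AUC as the probability that a random anomaly (label 1) scores higher
    than a random normal point (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical 2-D "latent" vectors: normal data near the origin,
# anomalies farther away (synthetic, for illustration only).
rng = random.Random(0)
train_normal = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(200)]
test_normal = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(100)]
test_anomaly = [[rng.gauss(1, 0.3), rng.gauss(1, 0.3)] for _ in range(100)]

scores = centroid_scores(train_normal, test_normal + test_anomaly)
labels = [0] * 100 + [1] * 100
print(f"AUC = {auc(scores, labels):.3f}")
```

The pairwise AUC above is quadratic in the number of test points, which is acceptable for a sketch; a rank-based formulation would be preferable for large test sets.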
REFERENCES
[1] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly
detection techniques,” Journal of Network and Computer Applications,
vol. 60, pp. 19–31, 2016.
[2] M. Usama, J. Qadir, A. Raza, H. Arif, K.-L. A. Yau, Y. Elkhatib,
A. Hussain, and A. Al-Fuqaha, “Unsupervised machine learning for
networking: Techniques, applications and research challenges,” arXiv
preprint arXiv:1709.06599, 2017.
[3] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,”
ACM computing surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[4] V. V. Phoha, Internet security dictionary. Springer Science & Business
Media, 2007.
[5] U. Fiore, F. Palmieri, A. Castiglione, and A. De Santis, “Network
anomaly detection with the Restricted Boltzmann Machine,” Neurocomputing,
vol. 122, pp. 13–23, 2013.
[6] K. Shafi and H. A. Abbass, “Evaluation of an adaptive genetic-based
signature extraction system for network intrusion detection,” Pattern
Analysis and Applications, vol. 16, no. 4, pp. 549–566, 2013.
[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C.
Williamson, “Estimating the support of a high-dimensional distribution,”
Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[8] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying
density-based local outliers,” in ACM SIGMOD record, vol. 29, no. 2.
ACM, 2000, pp. 93–104.
[9] S. S. Khan and M. G. Madden, “One-class classification: taxonomy of
study and review of techniques,” The Knowledge Engineering Review,
vol. 29, no. 3, pp. 345–374, 2014.
[10] A. N. Mahmood, C. Leckie, and P. Udaya, “An efficient clustering
scheme to exploit hierarchical data in network traffic analysis,” TKDE,
vol. 20, no. 6, pp. 752–767, 2008.
[11] A. Zimek, E. Schubert, and H.-P. Kriegel, “A survey on unsupervised
outlier detection in high-dimensional numerical data,” Statistical Analysis
and Data Mining: The ASA Data Science Journal, vol. 5, no. 5, pp.
363–387, 2012.
[12] V. L. Cao, M. Nicolau, and J. McDermott, “One-class classification for
anomaly detection with kernel density estimation and genetic programming,”
in EuroGP, Portugal, vol. 9594. Springer, 2016, pp. 3–18.
[13] V. L. Cao, M. Nicolau, J. McDermott et al., “A hybrid autoencoder and
density estimation model for anomaly detection,” in Parallel Problem
Solving from Nature. Springer, 2016, pp. 717–726.
[14] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, “High-dimensional
and large-scale anomaly detection using a linear one-class
SVM with deep learning,” Pattern Recognition, vol. 58, pp. 121–134,
2016.
[15] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
length and Helmholtz free energy,” in Advances in neural information
processing systems, 1994, pp. 3–10.
[16] H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptrons
and singular value decomposition,” Biological cybernetics, vol. 59, no. 4,
pp. 291–294, 1988.
[17] N. Japkowicz, C. Myers, and M. Gluck, “A novelty detection approach
to classification,” in IJCAI, 1995, pp. 518–523.
[18] S. Hawkins, H. He, G. Williams, and R. Baxter, “Outlier detection
using replicator neural networks,” in Data warehousing and knowledge
discovery. Springer, 2002, pp. 170–180.
[19] M. Sakurada and T. Yairi, “Anomaly detection using autoencoders with
nonlinear dimensionality reduction,” in Proc MLSDA. ACM, 2014, p. 4.
[20] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
2006.
[21] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for
deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554,
2006.
[22] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise
training of deep networks,” in Advances in neural information
processing systems, 2007, pp. 153–160.
[23] D. Rajashekar, A. N. Zincir-Heywood, and M. I. Heywood, “Smart
phone user behaviour characterization based on autoencoders and self
organizing maps,” in ICDMW. IEEE, 2016, pp. 319–326.
[24] M. P. Wand and M. C. Jones, Kernel smoothing. CRC Press, 1994.
[25] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A
comparative study of anomaly detection schemes in network intrusion
detection,” in Proc SIAM International Conference on Data Mining.
SIAM, 2003, pp. 25–36.
[26] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting
and composing robust features with denoising autoencoders,” in Proc
ICML. ACM, 2008, pp. 1096–1103.
[27] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol,
“Stacked denoising autoencoders: Learning useful representations in a
deep network with a local denoising criterion,” JMLR, vol. 11, no. 11,
pp. 3371–3408, 2010.
[28] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
preprint arXiv:1312.6114, 2013.
[29] D. M. Tax and R. P. Duin, “Support vector data description,” Machine
learning, vol. 54, no. 1, pp. 45–66, 2004.
[30] C. Campbell and K. P. Bennett, “A linear programming approach to novelty
detection,” Advances in neural information processing systems, vol. 13,
no. 13, p. 395, 2001.
[31] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li,
“AI2: Training a big data machine to defend,” in Proc BigDataSecurity,
HPSC, and IDS. IEEE, 2016, pp. 49–54.
[32] S. M. Erfani, M. Baktashmotlagh, S. Rajasegarar, S. Karunasekera, and
C. Leckie, “R1SVM: A randomised nonlinear approach to large-scale
anomaly detection,” in AAAI Conference on Artificial Intelligence, 2015.
[33] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A
review and new perspectives,” PAMI, vol. 35, no. 8, pp. 1798–1828,
2013.
[34] M. Ranzato, Y.-L. Boureau, and Y. LeCun, “Sparse feature learning for
deep belief networks,” in Advances in neural information processing
systems, 2008, pp. 1185–1192.
[35] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, “Efficient learning
of sparse representations with an energy-based model,” in Advances in
neural information processing systems, 2007, pp. 1137–1144.
[36] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive
auto-encoders: Explicit invariance during feature extraction,” in Proc
ICML, 2011, pp. 833–840.
[37] J. Duchi, “Derivations for linear algebra and optimization,” Berkeley,
California, 2007.
[38] M. Lichman, “UCI machine learning repository,” 2013. [Online].
Available: http://archive.ics.uci.edu/ml
[39] G. O. Campos, A. Zimek, J. Sander, R. J. Campello, B. Micenková,
E. Schubert, I. Assent, and M. E. Houle, “On the evaluation of unsupervised
outlier detection: measures, datasets, and an empirical study,”
Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 891–927,
2016.
[40] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison
of botnet detection methods,” Computers & Security, vol. 45, pp.
100–123, 2014.
[41] D. C. Le, A. N. Zincir-Heywood, and M. I. Heywood, “Data analytics
on network traffic flows for botnet behaviour detection,” in SSCI. IEEE,
2016, pp. 1–7.
[42] “KDD Cup Dataset,” 1999, available at the following website
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[43] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed
analysis of the KDD CUP 99 data set,” in CISDA. IEEE, 2009, pp.
1–6.
[44] J. McHugh, “Testing intrusion detection systems: a critique of the 1998
and 1999 DARPA intrusion detection system evaluations as performed
by Lincoln laboratory,” TISSEC, vol. 3, no. 4, pp. 262–294, 2000.
[45] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for
network intrusion detection systems (UNSW-NB15 network data set),”
in MilCIS. IEEE, 2015, pp. 1–6.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,
A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay,
“Scikit-learn: Machine learning in Python,” JMLR, vol. 12, pp.
2825–2830, 2011.
[47] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv
preprint arXiv:1212.5701, 2012.
[48] L. Prechelt, “Early stopping - but when?” Neural Networks: Tricks of the
trade, pp. 55–69, 1998.
[49] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proc International Conference
on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[50] C. Wang, S. S. Venkatesh, and J. S. Judd, “Optimal stopping and effective
machine complexity in learning,” in Advances in neural information
processing systems, 1994, pp. 303–310.
Van Loi Cao received a BSc and an MSc
in Computer Science from Le Quy Don Technical
University, Vietnam. He worked for the university as
an assistant lecturer. In 2015, he moved to Ireland
to pursue a PhD at University College Dublin under
the supervision of Assoc. Prof. James McDermott
and Assoc. Prof. Miguel Nicolau, funded
by VIED, Vietnam. His main research interests are
neural networks, machine learning, evolutionary computation,
and information security.
Miguel Nicolau is an Associate Professor at
UCD. He received a BSc in Belgium, followed by
a BSc, MSc and PhD at the University of Limerick.
He then worked as an Expert Engineer at the INRIA
Institute in Paris, France. In 2010 he moved back
to Ireland, and worked as a Research Fellow and
Lecturer at UCD. His teaching experience spans over
15 years, and includes positions at the University of
Limerick, Fudan University in Shanghai, and UCD.
James McDermott holds a BSc in Computer
Science with Mathematics from the National
University of Ireland, Galway. His PhD was at the
University of Limerick. His post-doctoral research
was at UCD and the Massachusetts Institute of Technology.
He is now an Associate Professor at University
College Dublin. His main research interests are
in evolutionary computation, machine learning, and
computer music.