Deep Learning-Based Hybrid Intelligent Intrusion Detection System
Introduction
Machine learning (ML) techniques are frequently used at the host and network levels to build
intrusion detection (ID) systems that detect and mitigate hostile cyber threats. Cybersecurity
threats, however, continue to rise. Many existing ID systems struggle to detect malicious threats
because they rely on classical machine learning algorithms that pay little attention to proper
categorization and feature selection. As a result, establishing a reliable and intelligent ID
system is a top goal.
The major goal of this research was to create a hybrid intelligent intrusion detection system (HIIDS)
that could learn critical feature representations from enormous unlabelled raw network traffic data
quickly and automatically.
We developed an efficient HIIDS to detect and classify unforeseen attacks using Spark MLlib
(machine learning library)-based robust classifiers, such as logistic regression (LR) and extreme
gradient boosting (XGB), together with a state-of-the-art deep learning (DL) model, a long
short-term memory autoencoder (LSTMAE). The LSTM is used to recognize temporal features, and the
AE is used to detect global features more effectively. Experiments on a publicly available,
real-life dataset, ISCX-UNB, were used to assess the efficacy of the proposed approach.
For the ISCX-UNB dataset, the 10-fold cross validation test yielded a high accuracy rate of up to
97.52 percent.
Background Study
The intrusion detection (ID) system is a well-known method of detecting hostile network activity. A
well-designed ID system can identify the characteristics of a wide range of dangerous behaviours
and respond immediately by delivering warnings.
By detection method, there are three types of ID systems: signature-based systems (SBS),
anomaly-based systems (ABS), and stateful protocol analysis detection. According to their design,
ID systems can also be divided into three categories: network-based detection systems (NIDS),
host-based detection systems (HIDS), and hybrid approaches.
Proposed System
In this paper, the HIIDS was built using Spark MLlib and the LSTMAE deep learning method as an
effective cybersecurity strategy. We used the ISCX-2012 dataset to train the HIIDS. We used
multiple robust classification algorithms to develop the HIIDS, including LR and XGB for anomaly
detection at Stage 1 and the LSTMAE deep learning technique for misuse detection at Stage 2. The
proposed HIIDS, which is based on DL classification, combines the advantages of both
signature-based (SB) and anomaly-based (AB) techniques, lowering computational complexity while
improving ID accuracy and detection rate (DR).
The result is a two-stage ID system in which Spark MLlib classifiers perform anomaly detection in
Stage 1 and the LSTMAE performs misuse detection in Stage 2. These two stages of the ID
architecture are computationally efficient when using full-feature datasets and provide higher
accuracy with a low false alarm rate (FAR).
Stage 1: Detecting anomalies with Spark MLlib classifiers.
Stage 2: Misuse detection utilizing cutting-edge deep learning techniques like LSTMAE.
Stage 1 used Spark MLlib to detect abnormalities that could be intrusions, while Stage 2 used the
LSTMAE DL model to further classify attacks if they happen.
The hybrid IDS's architecture is depicted in the diagram above; first, network traffic was sorted and
pre-processed. All essential conversions for both Stage-1 Spark MLlib and Stage-2 LSTMAE-based
modules of HIIDS were accomplished during pre-processing; both stages had their own supported
data formats. To show the usefulness of the proposed HIIDS, we employed 1,512,000 network traffic
packets obtained from ISCX-2012 datasets for our hybrid ID experiment.
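The two-stage routing described in this section can be sketched in a few lines of Python. This is
only an illustrative sketch, not the authors' Scala/Java implementation: the function
two_stage_detect and the toy rules toy_stage1 and toy_stage2 are hypothetical stand-ins for the
trained Spark MLlib classifier and the trained LSTMAE.

```python
from typing import Callable, List, Sequence

Record = Sequence[float]


def two_stage_detect(
    records: List[Record],
    is_anomalous: Callable[[Record], bool],
    classify_attack: Callable[[Record], str],
) -> List[str]:
    """Send every record through Stage 1; only flagged records reach Stage 2."""
    labels = []
    for record in records:
        if is_anomalous(record):          # Stage 1: anomaly detection
            labels.append(classify_attack(record))  # Stage 2: misuse classification
        else:
            labels.append("normal")
    return labels


# Hypothetical stand-ins for trained models (assumptions for illustration only).
def toy_stage1(record: Record) -> bool:
    return record[0] > 0.5


def toy_stage2(record: Record) -> str:
    return "DoS" if record[1] > 0.5 else "Scan"


traffic = [(0.1, 0.9), (0.8, 0.7), (0.9, 0.2)]
labels = two_stage_detect(traffic, toy_stage1, toy_stage2)
print(labels)  # ['normal', 'DoS', 'Scan']
```

Because Stage 2 only sees traffic that Stage 1 has flagged, the more expensive model runs on a
fraction of the records, which is one plausible reading of the reduced computational cost
mentioned above.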
Datasets
The dataset for simulating the proposed HIIDS was chosen carefully, because selecting a good ID
dataset plays a key role in evaluating an ID system. The ISCX-2012 ID data includes both malicious
and non-malicious traffic behaviours recorded over a seven-day period.
The ISCX-2012 dataset uses two separate profiles to generate network traffic activities and
states. The α-profile carries out the abnormal or multi-stage phases of an attack, while the
β-profile performs feature characterization and captures the mathematical distributions of the
traffic. For example, the β-profile can include packet size distributions in a protocol's
explicit patterns and the time distribution of requests, whereas the α-profile is built from the
sophisticated attacks carried out on a specific day. There are four types of attack scenarios in
the α-profile.
(1) Internal network traffic intrusion
In most cases, the attacker infiltrates the internal network by exploiting a vulnerable
application program, such as Adobe Acrobat Reader. After successful penetration, a backdoor can be
installed on the victim's workstation and used to launch a series of malicious attacks against the
victim's network. Nmap and port scans were commonly used in this type of attack.
(2) HTTP denial-of-service attacks
The attacker makes a network resource unavailable for a period of time. This is often
accomplished by flooding the resource with excessive requests, overloading it so that some or all
legitimate requests cannot be fulfilled. Slowhttptest, Hulk, Slowloris, and GoldenEye were largely
used to generate these DoS attacks.
(3) DDoS attack using an IRC botnet
These attacks usually happen when multiple systems flood a victim's bandwidth or resources. DDoS
attacks are therefore frequently the result of multiple compromised systems (for example, a
botnet) flooding the target network with large volumes of traffic. LOIC has been used in these
attacks for UDP, TCP, and HTTP.
(4) Brute-force SSH
This is among the most prevalent types of attack, and it can be used to crack passwords as well
as to discover hidden content and pages of web applications. The FTP-Patator and SSH-Patator tools
have been used to launch these attacks.
Implementation details
Experiments on the ISCX-2012 ID dataset, with normal and attack classes, are used to assess the
performance of the HIIDS. The evaluation metrics include false positives, false negatives, true
positives, attack detection precision, and error rate. To demonstrate the effectiveness of the
proposed ID system, we implemented the first stage (anomaly detection) in Scala using Spark MLlib
and the second stage (misuse detection with the DL technique) in Java using Deeplearning4j. The
simulation was run on a 64-bit cluster computer with 32 cores, 32 GB of RAM, and Ubuntu 14.04 as
the operating system. Java (JDK) 1.8, Spark v2.3.0, Deeplearning4j 1.0.0 alpha, and Scala 2.11.8
comprised the software stack. The deep learning model was trained on an RTX 2080 Ti GPU, with
CUDA and cuDNN used to accelerate the pipeline.
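As a concrete reading of these metrics, the sketch below computes accuracy, precision, detection
rate (DR), false alarm rate (FAR), and error rate from confusion-matrix counts. It uses the
standard textbook definitions, which are assumed here rather than taken from the paper; the
function name detection_metrics and the example counts are made up for illustration.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common IDS metrics from confusion-matrix counts.

    tp: attacks flagged as attacks      fp: normal traffic flagged as attacks
    tn: normal traffic passed as normal fn: attacks passed as normal
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Detection rate (DR) is the recall on the attack class.
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # False alarm rate (FAR) is the share of normal traffic that raised an alarm.
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "error_rate": (fp + fn) / total,
    }


# Made-up counts, not results from the paper.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```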
To assess HIIDS performance, we divided the data into training and test datasets. We built the
hybrid framework on the training data and evaluated it on the test data. Figure 1 shows a block
diagram of the proposed HIIDS. To show the effectiveness of the proposed hybrid strategy, we used
the ISCX-2012 dataset with all of its original features. The network traffic, a mix of malicious
and non-malicious data, first passes through the Spark MLlib classifiers in Stage 1, which
categorize it as malicious or non-malicious. The malicious traffic was then used to train the
Stage 2 LSTMAE, which further divided it into four types of attacks. The hybrid technique reduces
computational complexity while using the full feature set of the ISCX-2012 dataset, resulting in
improved ID accuracy and a lower FAR. For training, 80 percent of the data was used with 10-fold
cross-validation, and the model was tested on the remaining 20 percent of held-out data.
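The evaluation protocol (an 80/20 split followed by 10-fold cross-validation on the training
portion) can be sketched with index lists alone. This is a minimal sketch under assumptions: the
helper names, the fixed seed, and the sample count of 1000 are hypothetical, and the paper's own
splitting code is not shown in the source.

```python
import random
from typing import Iterator, List, Tuple


def train_test_split_indices(
    n: int, test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 and hold out a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices: List[int], k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train_idx, test_idx = train_test_split_indices(1000)
splits = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(splits))  # 800 200 10
```

Each record in the training portion is used for validation exactly once, while the held-out 20
percent is never touched during cross-validation.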
• Stage 1
Apache Spark is a capable big data processing engine that can be applied to identifying
cybersecurity attacks. With over 55 machine learning algorithms [1,2], Spark MLlib is one of the
most efficient big data analytics libraries currently available. It is well suited to machine
learning jobs and is reported to be 10 times faster for iterative tasks than Hadoop-based big
data processing technologies. MLlib began as part of the MLbase project in 2012, and by 2013 it
had evolved into a useful open-source library for ML workloads. Spark MLlib contains several
families of machine learning algorithms, such as classification, clustering, regression, and
dimensionality reduction, that are critical to the development of classic real-time machine
learning applications; researchers around the world have used it to advance high-dimensional
data analytics.
At Stage 1, the MLlib-based anomaly detection module was first trained on an established
training set that included both normal and malicious data, and it was then evaluated as the IDS's
anomaly module. Traffic flagged as an attack was passed on to the LSTMAE in Stage 2, where the
misuse detection technique detected and classified the attack.
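To make the Stage 1 idea concrete, here is a minimal logistic regression classifier trained by
batch gradient descent in plain Python. It is a stand-in for illustration only and does not use
Spark MLlib; the two features, the toy data, and the hyperparameters (lr, epochs) are all
assumptions.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 500
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = p - label  # derivative of the log loss w.r.t. the logit
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(weights: List[float], bias: float, features: List[float]) -> int:
    """Return 1 (anomalous) or 0 (normal)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data with two hypothetical scaled features; label 1 marks anomalous traffic.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.9, 0.8], [0.8, 0.95], [0.85, 0.7]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, row) for row in X]
print(predictions)
```

In the setup described above, only the records predicted as 1 would be forwarded to Stage 2.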
• Stage 2
In this stage, the LSTMAE was used to model network traffic misuse, with the goal of further
classifying the anomalous traffic according to certain policies. Figure 1 shows an overview of
the misuse detection module utilizing an LSTMAE. The LSTM is an improved version of the RNN that
was first developed in [3,4] to address the vanishing and exploding gradient problems. All of
the RNN's hidden layers are replaced by memory blocks, each made up of a memory cell that stores
information and three gates (input, output, and forget) that control the flow of information in
the LSTM [5]. The LSTM's most powerful feature is its capacity to capture long-term dependencies
and learn well from variable-length sequences.
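The memory cell and its three gates can be illustrated with a single-unit LSTM step in plain
Python. This is a sketch of the standard LSTM update equations, not the Deeplearning4j model used
in the paper; the class names GateParams and LSTMCell and all parameter values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class GateParams:
    w_x: float  # weight on the current input
    w_h: float  # weight on the previous hidden state
    b: float    # bias

    def pre_activation(self, x: float, h_prev: float) -> float:
        return self.w_x * x + self.w_h * h_prev + self.b


@dataclass
class LSTMCell:
    input_gate: GateParams
    forget_gate: GateParams
    output_gate: GateParams
    candidate: GateParams

    def step(self, x: float, h_prev: float, c_prev: float) -> Tuple[float, float]:
        i = sigmoid(self.input_gate.pre_activation(x, h_prev))    # how much to write
        f = sigmoid(self.forget_gate.pre_activation(x, h_prev))   # how much to keep
        o = sigmoid(self.output_gate.pre_activation(x, h_prev))   # how much to expose
        g = math.tanh(self.candidate.pre_activation(x, h_prev))   # candidate content
        c = f * c_prev + i * g   # memory cell update
        h = o * math.tanh(c)     # new hidden state
        return h, c


# Arbitrary parameter values chosen for illustration.
cell = LSTMCell(
    input_gate=GateParams(0.5, 0.1, 0.0),
    forget_gate=GateParams(0.5, 0.1, 1.0),
    output_gate=GateParams(0.5, 0.1, 0.0),
    candidate=GateParams(1.0, 0.1, 0.0),
)
h, c = 0.0, 0.0
states = []
for x in [0.2, 0.9, 0.1]:  # a short input sequence
    h, c = cell.step(x, h, c)
    states.append((h, c))
print(states)
```

The additive cell update `c = f * c_prev + i * g` is what lets information persist across many
time steps when the forget gate stays close to 1.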
In this module, the LSTMAE is used as the misuse detection mechanism. Its goal is to further
classify the anomalous data from Stage 1 into one of four categories: DoS, Scan, HTTP, and R2L.
Although the LSTMAE is used here for misuse detection, the technique was originally developed
for anomalous traffic. The test set is fed to the trained model, which determines whether each
record is malicious (abnormal) or not. When a match is found, an alarm is raised. Compared with
hand-crafted approaches, the LSTMAE can efficiently capture more of the internal information in
the data.
Overall Analysis
This section summarizes the outcomes of the present technique on the ISCX-2012 data. Because
this dataset was created after the KDD and DARPA data, only a few results are available for
comparison. The best outcomes at each stage are therefore reported in terms of FAR and accuracy,
based on the previous simulation results. The proposed ID system outperforms the compared
state-of-the-art approaches in terms of accuracy and FAR, which is attributable to the
combination of Spark MLlib and LSTMAE. It is important to note that the comparisons are only
meant to serve as a guide, as different studies have used different volumes of data, sampling
procedures, and pre-processing methods. As a result, a direct comparison of quantities such as
testing and training time is rarely appropriate. Although the proposed ID system outperformed
other approaches on the evaluated metrics, it cannot be said that the proposed methodology
outperforms them in every respect.
Conclusion
In this paper, the HIIDS was built using Spark MLlib and the LSTMAE deep learning method as an
effective cybersecurity strategy. We used the ISCX-2012 dataset to train the HIIDS. At Stage 1,
we used several strong classification algorithms, such as LR and XGB, to detect anomalies, and
at Stage 2, we used the LSTMAE deep learning technique to detect misuse. The proposed HIIDS,
which is based on DL classification, combines the advantages of both signature-based (SB) and
anomaly-based (AB) techniques, lowering computational complexity while improving ID accuracy
and detection rate (DR).
References
[1] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman et al., “MLlib: Machine learning in
Apache Spark,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235–1241, 2016.
[2] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust et al., “Apache Spark: A unified engine
for big data processing,” Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8,
pp. 1735–1780, 1997.
[4] F. Gers, N. Schraudolph and J. Schmidhuber, “Learning precise timing with LSTM recurrent
networks,” Journal of Machine Learning Research, vol. 3, pp. 115–143, 2002.
[5] M. A. Khan, M. Karim and Y. Kim, “A two-stage big data analytics framework with real-world
applications using Spark machine learning and long short-term memory network,” Symmetry, vol.
10, no. 485, pp. 1–22, 2018.
[6] M. A. Khan and Y. Kim, “Deep learning-based hybrid intelligent intrusion detection system,”
Comput. Mater. Contin., vol. 68, pp. 671–687, 2021.
