ONDaSCA: On-demand Network Data Set Creation Application for Intrusion Detection System

Felix Larbi Aryeh
Computer Sci. and Eng. Department
University of Mines and Technology
Tarkwa, Ghana
flaryeh@umat.edu.gh
Boniface Kayode Alese
Department of Cybersecurity
The Federal University of Technology
Akure, Nigeria
bkalese@futa.edu.gh
Olayemi Olasehinde
Department of Computer Science
Federal Polytechnic, Ile Oluji
Ondo State, Nigeria
olaolasehinde@fedpolel.edu.ng
Christian Kwaku Amuzuvi
Renewable Energy Engineering
Department
University of Mines and Technology
Tarkwa, Ghana
ckamuzuvi@umat.edu.gh
Abstract—With the growing reliance on Information
Technology in recent times, it has become more important to
find measures to secure every online device, data and
information. A Network Intrusion Detection System (NIDS) is
one of the security options that can help protect such
devices, data and information. However, an IDS needs to be up
to date to mitigate current threats to secure systems. A critical
issue in the development of the right IDS is the scarcity of
current data sets for training these IDS, which affects
system performance. This paper presents the On-demand
Network Data Set Creation Application (ONDaSCA), a
Graphical User Interface software capable of generating
labelled network intrusion data sets. ONDaSCA gives IDS
users or researchers the option to choose a raw or a
processed data set as output, real-time packet capture or
offline upload of an existing PCAP file, and two (2) different
packet capturing methods (Tshark and Dumpcap). ONDaSCA
is highly customisable, and an IDS user or researcher can
leverage its capabilities to suit their needs. The abilities of this
software are compared with other similar products that
generate data sets for use by IDS models.
Keywords— On-demand Network Data Set Creation
Application (ONDaSCA), Packet Capture (PCAP), Offline
Capture Mode (OCM), Realtime Capture Mode (RCM)
I. INTRODUCTION
In recent times, most institutions and organisations
depend heavily on computer networks and the internet in
order to flourish, which leaves these businesses at the mercy
of hackers and viruses. Hence, there is a need to protect these
organisations' networks. Computer network security is an
integral factor to consider when working online, irrespective
of the size of the organisation. While no network is
entirely immune to attacks, having a high-quality network
security system in place can help reduce most known attacks.
Identity and information theft are on the rise;
hence it is the duty of every organisation to keep its own
data and its clients' data safe and secure from
unauthorised access. Having a high-quality network security
system will help prevent organisations and businesses from
becoming victims of data theft.
The most common network security threats that
spread over the internet today are grouped into two (2) main
categories, namely structured and unstructured. A
structured network attack is one performed by an
experienced individual with computer skills who intentionally
targets a specific domain or organisation. An
unstructured network attack, in contrast, is orchestrated by
someone who has less knowledge about the targeted domain and
only uses tools that are readily available online. It is essential
to note that, irrespective of the attack category, exposure of an
organisation's confidential information can have an adverse
effect on the company's integrity.
Network intrusion techniques keep emerging each day;
thus, intrusion detection algorithms also need
to evolve to help counter such techniques. The
community responsible for NIDS helps build and secure
algorithms based on released standard NIDS data sets.
Unfortunately, such data sets are often simulated and become
outdated with time, thereby defeating their purpose. Because
intrusion techniques evolve so quickly, an IDS data set
released a year or two (2) ago might already be irrelevant for
building a current IDS model [1].
In the process of testing and validating any intrusion
detection model, it is critical to use a reliable data set to
improve the efficiency of the IDS model. The quality of the
data set does not necessarily determine which methods are used
to detect anomalies; rather, it should indicate how efficiently
the model is likely to perform when deployed in a production
environment. Reference [2] presented a detailed analysis of
both the KDD Cup 1999 and the NSL-KDD data sets. These data
sets have been widely used for the testing and evaluation of
intrusion detection. Unfortunately, these data sets do not
reflect current network security threats, mainly because they
have multiple missing records, different distributions of
testing and training data sets and a vast amount of redundant
records in their training sets [2][3].
Every network is unique, and the traffic exhibited in
Network A will differ from that in Network B. Hence, an IDS
that works perfectly on Network A might not be good
enough when deployed on Network B. Thus, an IDS encounters
dissimilar network traffic depending on the kind of services
running on that network. Researchers have made many
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 5, May 2020
111 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

attempts to use more recent data sets to train their IDS, but to
no avail. Using most of the data sets available online restricts
the researcher to the environment used to generate the data set,
which usually differs from the researcher’s network
environment [4].
Again, IDS users of such data sets may be running
software or application versions and services that are entirely
different from those of the creators of the data sets, thereby
exposing the IDS user to different kinds of
vulnerabilities. When dealing with an online data set, the
security features of the network used in the data set
creation are paramount. Every institution's or organisation's
firewalls and encrypted communication channels differ from
those of others; hence the network traffic observed and captured
for data set creation will always differ from one network to
another. This is a strong justification for the creation of
on-demand data sets tailored to a unique network.
Many researchers have tried to create on-demand data
sets to help improve their IDS models; however, most of these
efforts have some limitations. Tools used for the
creation of an on-demand data set usually insert malicious
network packets into the captured network traffic files and
create a raw data set, as in the case of references [1][5].
Certain qualities should exist in a data set for it to be
recognised for use in the field of network intrusion detection.
Such qualities have been identified and verified by
researchers [6].
II. RELATED WORKS
A good quality data set is one that allows researchers to
identify the ability of an IDS to detect anomalies, preferably
allowing valid conclusions to be drawn about the appropriateness
of the IDS (its efficiency, accuracy, validity scope, etc.) [7].
Although many of the data sets available nowadays are
valuable to the research community, they unfortunately fail
to fulfil all the requirements proposed by [8]. The existing
data sets can be classified into two categories: static data sets
and dynamically generated data sets (Figure 1 gives a
historical overview).
Fig. 1. A Historical Timeline Overview Of Data sets Intended For
Intrusion Detection [8]
A static data set is usually generated once within a limited
period. Such data sets are generated either synthetically or
over a real-world network. Figure 1 shows both static and
dynamic data sets available for IDS evaluation purposes. Static
data sets, according to researchers, are not adequate for
two (2) main reasons. Firstly, such data sets are out of
date. Most of them were created based on old versions of
software and applications, network protocols and attacks that
exploited old vulnerabilities. Thus, such a data set does not
reflect the current trend of attacks. Secondly,
these data sets contain many defects that indicate synthetic
generation [8].
Network behaviour and traffic patterns change
continuously and attacks evolve rapidly. There is a need for
the creation of data sets that are modifiable, extensible, and
reproducible. Therefore, researchers have proposed tools that
are capable of generating synthetic data sets dynamically.
The main idea of these tools is to consider the contemporary
traffic characteristics and replicate them in synthetic traffic,
which in turn improves the realism of the generated data
sets. In this section, we present the two existing dynamic
data set creation tools, followed by a brief comparison of them
in Table I.
Reference [5], titled “ID2T: A DIY data set creation
toolkit for Intrusion Detection Systems”, presented an IDS
toolkit for the creation of labelled data sets that contain
user-defined synthetic attacks. The ID2T toolkit
labels the generated attacks and merges the existing
PCAP file with the generated synthetic attack files. The
labels are stored either internally, as user-defined
modifications to the attack packets, or externally, as separate
files specifying the time intervals in which attacks occur. To
assess the quality of the generated data sets, emphasis was
placed on the creation of new evaluation techniques.
Unfortunately, the generated data sets could not encompass all
scenarios, and the toolkit was unable to process arbitrarily
large network capture files. Previously captured PCAP files
served as the input of the system.
Reference [9], titled “INSECS-DCS: A Highly
Customisable Network Intrusion Data set Creation
Framework”, also developed a data set creation application
that runs on a network. Several phases lead to the
creation of the data set. These include interface selection,
packet capturing and packet processing.
In this paper, we have studied the existing IDS
data set creation applications in order to identify the
characteristics of a reliable data set.
III. EXPERIMENTS / METHODOLOGY
This section describes the detailed process used to obtain
the processed data set. ONDaSCA was developed in
Python, mainly because of the large number of available
libraries for processing computer network
packets. This choice also makes it easy to extend the work to
other areas of our research.
A live network was used to aid in the completion of the
ONDaSCA data set generating software. The capture of
live network packets was conducted at the University of
Mines and Technology (UMaT) Network Operation Centre
(NOC).
A. UMaT Network Layout
The main network backbone, which serves as the central
base station, is situated at the Network Operation Centre
(NOC) located at the centre of the university campus. The
NOC provides network connectivity to six (6) mini base
stations by association with a Cisco Aironet 1300 wireless
bridge with a 12 dBi omni-antenna located at the NOC. Figure
2 shows the network infrastructure of UMaT.
The Cisco Aironet 1300 wireless bridge serves as a signal
transmitter by receiving network signals from the NOC and
transferring them to the six base stations, namely: Hostels,
Faculties and Departments, University Basic School,
Administration and the Library.
The Internet Service Provider (ISP) provides a point-to-point
Internet connection to the NOC. The LAN switch
connects to a Linux server, which acts as a proxy server and a
firewall for Internet access for the university community, as
shown in Figure 2.
The Cisco Aironet 1300 bridges at the mini base stations
associate with the Cisco Aironet 1300 at the NOC, which
then sends all signals directly to one of the Ethernet ports of
the internal switch. All offices and access points (APs) join the
internal network through individual wall jacks directly
connected to the switches at each mini base station.
Fig. 2. UMaT Network Infrastructure
B. ONDaSCA for Data set creation
The On-demand Network Data Set Creation Application
(ONDaSCA) is proposed to improve on the limitations of
INSecS-DCS. Hence, the primary aim is to enhance the
creation of network intrusion data sets tailored to private
networks, where the network managers can specify relevant
attributes for further analysis [9]. There are very few data
sets that have proper documentation and can be used
effectively to train and test an IDS. Although considerably
old, NSL-KDD is one such example [10]. ONDaSCA allows
network managers to create a custom data set with the
attributes they consider most important for their network.
Figure 3 and Figure 4 illustrate the Command Line Interface
(CLI) and Graphical User Interface (GUI) of the ONDaSCA
software. The CLI tool provides the user with the same
capabilities as the GUI tool, the difference being that the
command line version runs directly from the terminal and
uses fewer system resources.
The steps involved in the data set creation are described in the
following subsections, with each step
having its own software component.
1) Interface Selection: This is one of the crucial
steps in the capture of network packets. Selecting the
wrong interface for a packet
capture can result in an undesired PCAP file. Interface
selection in ONDaSCA is only available in real-time mode.
It gives the user the chance to select the network
interface to be used for capturing packets on the live
network. A list of all the available network interfaces is
displayed for the user to choose from.
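As an illustration of how such an interface list could be obtained, the following minimal sketch parses the numbered list that `tshark -D` (or `dumpcap -D`) prints. It is not taken from the ONDaSCA source: the function names `parse_interface_list` and `list_interfaces` are hypothetical, and the sketch assumes the tool is on the PATH and writes the list to standard output in the form `1. eth0` or `2. lo (Loopback)`.

```python
import re
import subprocess

# One entry per line: "<index>. <name>" optionally followed by "(<description>)".
_INTERFACE_LINE = re.compile(r"^(\d+)\.\s+(\S+)(?:\s+\((.*)\))?\s*$")


def parse_interface_list(text):
    """Parse the numbered interface list into (index, name, description) tuples."""
    interfaces = []
    for line in text.splitlines():
        match = _INTERFACE_LINE.match(line.strip())
        if match:
            index, name, description = match.groups()
            interfaces.append((int(index), name, description or ""))
    return interfaces


def list_interfaces(tool="tshark"):
    """Run `<tool> -D` and return the parsed interface list (hypothetical helper)."""
    result = subprocess.run(
        [tool, "-D"], capture_output=True, text=True, check=True
    )
    return parse_interface_list(result.stdout)
```

A GUI or CLI front end could present the returned tuples as a menu and pass the chosen name on to the capture step.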
2) Packet capturing: The next step is the capturing of
network packets, either from a live network or by uploading an
existing PCAP file. After extensive evaluation of many
packet capturing toolkits and software, tshark and
dumpcap were found to be excellent choices. ONDaSCA
uses these two (2) tools for packet capturing
mainly because they provide better filtering options
than the others.
a) Offline Capture Mode (OCM): The OCM gives the
user the chance to upload an existing PCAP file for
processing into a Comma Separated Values (CSV) file for
use by an IDS model. Here, the user browses to the location of
the file on the computer and selects the
destination for the processed CSV file.
b) Realtime Capture Mode (RCM): ONDaSCA has
two (2) capturing tools embedded within it for
real-time capture: dumpcap and tshark. Reference
[9] provided users with only the tshark option for capturing
network packets.
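The choice between the two capture tools can be sketched as a small command builder. This is an assumption about how a wrapper might invoke the tools, not ONDaSCA's actual code: `build_capture_command` is a hypothetical helper, and it relies only on the `-i` (interface), `-w` (output file) and `-a duration:<seconds>` (autostop) options that both tshark and dumpcap accept.

```python
def build_capture_command(tool, interface, output_path, duration_s=None):
    """Return the argument list for a live capture with tshark or dumpcap."""
    if tool not in ("tshark", "dumpcap"):
        raise ValueError(f"unsupported capture tool: {tool!r}")
    # -i selects the capture interface, -w writes raw packets to a file.
    command = [tool, "-i", interface, "-w", output_path]
    if duration_s is not None:
        # -a duration:N stops the capture after N seconds.
        command += ["-a", f"duration:{duration_s}"]
    return command


if __name__ == "__main__":
    # Print the command instead of running it; a real capture usually
    # needs elevated privileges on the chosen interface.
    print(" ".join(build_capture_command("dumpcap", "eth0", "capture.pcap", 60)))
```

The resulting list can be handed to `subprocess.run`, and the file it produces is then processed in the same way as a PCAP file uploaded in offline mode.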
3) Processing of the captured file: The captured packets
at this stage are in PCAP format and thus need to undergo
processing to convert the PCAP files to CSV format for use
by a Machine Learning (ML) algorithm. CICFlowMeter can be
used for analysing PCAP files. In a bidirectional flow, the
first packet usually determines the forward and backward
directions. The forward flow and backward flow are,
respectively, source to destination and destination to source.
About eighty (80) statistical features relating to network
traffic can be calculated independently in the forward and
backward directions. Some of the features include the number of
packets, duration, length of packets and number of bytes
[12][13]. CSV is one of the standard formats used by Python
libraries for ML. It is easy to read and process with most of
these libraries. This format enables researchers to use
ONDaSCA output in ML readily. The attributes or features
generated from the PCAP files will be similar to those of the
CSE-CIC-IDS2018 on AWS data set.
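The forward and backward bookkeeping described above can be illustrated with a simplified sketch. It is not CICFlowMeter's implementation: it ignores flow timeouts and TCP termination, computes only a handful of the roughly eighty features, and the packet tuple layout and the names `flow_key` and `summarise_flows` are assumptions made for illustration.

```python
def flow_key(src, sport, dst, dport, proto):
    """Direction-independent key so both directions map to the same flow."""
    a, b = (src, sport), (dst, dport)
    return (min(a, b), max(a, b), proto)


def summarise_flows(packets):
    """Aggregate packets into bidirectional flows.

    Each packet is (timestamp, src, sport, dst, dport, proto, length).
    The first packet seen for a flow fixes its forward direction.
    """
    flows = {}
    for ts, src, sport, dst, dport, proto, length in packets:
        key = flow_key(src, sport, dst, dport, proto)
        flow = flows.get(key)
        if flow is None:
            flow = flows[key] = {
                "forward_src": (src, sport),
                "first_ts": ts, "last_ts": ts,
                "fwd_packets": 0, "fwd_bytes": 0,
                "bwd_packets": 0, "bwd_bytes": 0,
            }
        # Packets from the flow's initiator are forward, the rest backward.
        direction = "fwd" if (src, sport) == flow["forward_src"] else "bwd"
        flow[f"{direction}_packets"] += 1
        flow[f"{direction}_bytes"] += length
        flow["last_ts"] = max(flow["last_ts"], ts)
    for flow in flows.values():
        flow["duration"] = flow["last_ts"] - flow["first_ts"]
    return flows
```

Each resulting dictionary corresponds to one row of a flow-level CSV file, with separate columns for the forward and backward counters.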
Fig. 3. Command Line Interface of On-demand Network Data Set
Creation Application (ONDaSCA)
Fig. 4. Graphical User Interface of On-demand Network Data Set Creation
Application (ONDaSCA)
C. The ONDaSCA processed data set
The term processed data set in this research is an
intrusion detection data set that is converted to a CSV file
format. The contrast is an unprocessed or raw data set, which
only present data in raw PCAP format. ONDaSCA gives the
researcher a GUI to upload the PCAP to be converted into a
processed intrusion detection data set. Each record or tuple
in the processed data set contains all the attributes related to
a single packet that has been preprocessed. ONDaSCA
provides a GUI and embeds CICFlowMeter to process the
data set. The CSV format of the processed data set is similar
to the standard CSE-CIC-IDS2018 on AWS data set
generated by Canadian Institute for Cybersecurity [11].
Hence, any data set generated with ONDaSCA can be
considered as a standard data set tailored for the researcher's
use. Researchers can now leverage the capabilities of
ONDaSCA to create an on-demand data set that can serve
their institution or organisation.
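One simple way to turn such a processed CSV file into a labelled data set is to append a label column to every record of a capture whose traffic class is known. The sketch below is an assumption about how this could be done, not a description of ONDaSCA's labelling code; the helper `label_rows`, the column name `Label` and the sample feature names are hypothetical.

```python
import csv
import io


def label_rows(csv_text, label, column="Label"):
    """Return the CSV text with an extra column holding `label` on every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if column in fieldnames:
        raise ValueError(f"column {column!r} already exists")
    fieldnames.append(column)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = label  # the same label applies to the whole capture
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "Flow Duration,Tot Fwd Pkts\n100,2\n250,7\n"
    print(label_rows(sample, "DoS"))
```

Labelling per capture fits a testbed workflow in which each capture session contains one known kind of traffic.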
TABLE I. COMPARISON OF THE CHARACTERISTICS OF ONDASCA WITH OTHER DATA SET CREATION TOOLS

| Capabilities | ONDaSCA | ID2T | INSecS-DCS |
|---|---|---|---|
| Ability to label data set | Yes | Yes | Yes |
| Dumpcap capturing mode | Yes | No | No |
| Tshark capturing mode | Yes | No | Yes |
| Has a GUI | Yes | Yes | No |
| Has a CLI | Yes | No | Yes |
| Open source | Yes | Yes | Yes |
| Real-time packet capture | Yes | No | Yes |
| Allows attack injection within the software | No | Yes | No |
| Supports GUI upload of raw PCAP file for conversion | Yes | No | No |
| Choose network interface for capturing packets | Yes | No | No |
| Ability to divide traffic into time windows and get overall traffic attributes | Yes | No | Yes |
| Ability to select input method (packets captured on a network of choice or a raw PCAP data set from another source) | Yes | No | Yes |
| Processed data set that can be fed into WEKA and other ML tools directly | Yes | No | Yes |
| Attribute selection for processed data set | Yes | No | Yes |
IV. EVALUATION
The applications closest to our application
(ONDaSCA) are the Intelligent Network Security System -
Dataset Creation Software (INSecS-DCS) and the ID2T
toolkit developed by reference [1] at the Telecooperation
Lab, Technische Universität Darmstadt. Table I lists the key
advantages of ONDaSCA in comparison with the
ID2T toolkit and INSecS-DCS. The generation of network
attacks is built into the ID2T framework through attack
scripts, with a PCAP file used as input. However, because
network attack generation software is already highly
advanced in the hacker community and updated versions
keep emerging, it becomes redundant to include attack
generation as an inbuilt software feature [9].
V. CONCLUSION AND FUTURE WORK
This research paper presented the On-demand Network
Data Set Creation Application (ONDaSCA), which is capable of
creating labelled network intrusion data sets. A testbed
environment can run specific network attack traffic for
ONDaSCA to capture as a PCAP file to help create a labelled
data set. ONDaSCA has the following options:
• Choosing a raw data set or a processed data set as
output;
• An RCM for capturing a network packet stream in the
form of a PCAP file;
• An OCM for upload of an existing PCAP file as
input for processing; and
• Two (2) tools (Dumpcap and Tshark) for network
packet capturing for the generation of PCAP files.
ONDaSCA is highly customisable, and an IDS user or
researcher can leverage its capabilities to suit their needs.
The software will be available for use by researchers under
an MIT license.
ACKNOWLEDGMENT
The Department of Cybersecurity at The Federal
University of Technology, Akure, and the Information and
Communication Technology (ICT) Unit at the University of
Mines and Technology (UMaT), Tarkwa, provided the
facilities for this research.
REFERENCES
[1] E. Vasilomanolakis, C. G. Cordero, N. Milanov, and M. Muhlhauser,
“Towards the creation of synthetic, yet realistic, intrusion detection
datasets,” NOMS 2016 - 2016 IEEE/IFIP Network Operations and
Management Symposium, Apr. 2016.
[2] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed
analysis of the KDD CUP 99 data set,” 2009 IEEE Symposium on
Computational Intelligence for Security and Defense Applications,
Jul. 2009.
[3] J. McHugh, “Testing Intrusion Detection Systems: A Critique of the
1998 and 1999 DARPA Intrusion Detection System Evaluations as
Performed by Lincoln Laboratory,” ACM Trans. on Information and
System Security, vol. 3, pp. 264-294, 2000. [Online]. Available:
https://www.scirp.org/reference/ReferencesPapers.aspx?ReferenceID=1730334.
[Accessed: 21-Jan-2020].
[4] A. Gharib, I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “An
Evaluation Framework for Intrusion Detection Dataset,” 2016
International Conference on Information Science and Security
(ICISS), Dec. 2016.
[5] C. G. Cordero, E. Vasilomanolakis, N. Milanov, C. Koch, D.
Hausheer, and M. Muhlhauser, “ID2T: A DIY data set creation
toolkit for Intrusion Detection Systems,” IEEE Conference on
Communications and Network Security (CNS), 2015.
[6] A. O. Adetunmbi, S. O. Falaki, O. S. Adewale, and B. K. Alese,
“Network intrusion detection based on rough set and k-nearest
neighbour,” 2008. [Online]. Available:
https://www.semanticscholar.org/paper/NETWORK-INTRUSION-
DETECTION-BASED-ON-ROUGH-SET-AND-Adetunmbi-
Falaki/fe9a7f896ae79f29e612910d8001dfe9202e9311. [Accessed:
14-Feb-2020].
[7] M. Bhuyan, D. Bhattacharyya, and J. Kalita, “Towards Generating
Real-life Datasets for Network Intrusion Detection,” International
Journal of Network Security, vol. 17, no. 6, pp. 675–693, 2015.
[8] C. G. Cordero, E. Vasilomanolakis, A. Wainakh, M. Mühlhäuser,
and S. Nadjm-Tehrani, “On generating network traffic datasets
with synthetic attacks for intrusion detection,” 2019. [Online].
Available:
https://www.semanticscholar.org/paper/On-generating-network-traffic-datasets-with-attacks-Cordero-Vasilomanolakis/2caa42c15671810c8236ac553753eec1d83f7e48.
[Accessed: 09-May-2020].
[9] N. Rajasinghe, J. Samarabandu, and X. Wang, “INSECS-DCS: A
Highly Customisable Network Intrusion Dataset Creation
Framework,” 2018 IEEE Canadian Conference on Electrical &
Computer Engineering (CCECE), May 2018.
[10] B. Ingre and A. Yadav, “Performance analysis of NSL-KDD dataset
using ANN,” 2015 International Conference on Signal Processing and
Communication Engineering Systems, Jan. 2015.
[11] I. Sharafaldin, A. Habibi Lashkari, and A. A. Ghorbani, “Toward
Generating a New Intrusion Detection Dataset and Intrusion Traffic
Characterization,” Proceedings of the 4th International Conference on
Information Systems Security and Privacy, 2018.
[12] A. H. Lashkari, G. Draper-Gil, M. Saiful, and A. A. Ghorbani,
“Characterization of Tor Traffic using Time based Features,”
2017. [Online]. Available:
https://www.semanticscholar.org/paper/Characterization-of-Tor-Traffic-using-Time-based-Lashkari-Draper-Gil/d76f32eb3af1a163c0fde624e9fc229671ca75b6.
[Accessed: 14-May-2020].
[13] “Network Traffic Flow analyzer,” www.netflowmeter.ca. [Online].
Available: http://www.netflowmeter.ca/netflowmeter.html.