Anomaly Detection Using CFS Subset Selection and Neural Networks in WEKA
Dr. J Jabez1, Dr. S Gowri2, Dr. S Vigneshwari3, Albert Mayan. J4, Dr. Senduru Srinivasulu5
1,2,3,4,5 School of Computing, Sathyabama Institute of Science & Technology, Chennai, Tamilnadu, India.
1jabezme@gmail.com, 2gowriamritha2003@gmail.com, 3vikiraju@gmail.com, 4albertmayan@gmail.com, 5sendurusrinivas@gmail.com
Abstract. An Intrusion Detection System (IDS) is a software application or device that monitors network or system activities for policy violations or malicious activities and reports them to a management system. The principal focus of Intrusion Detection and Prevention Systems (IDPS) is to identify possible incidents, log information about them and report intrusion attempts. Organizations also use IDPS for other purposes, such as identifying problems with security policies, documenting existing threats and deterring individuals from violating security policies. In this paper, anomalies are identified using an enhanced Correlation-based Feature Selection (CFS) method, which is a subset selection technique and is based upon Extreme Learning Machine, Multilayer Perceptron and Feature Selection. The scope of this project is to identify anomalies at an early stage and to increase the accuracy of detection.
Keywords: IDS, Feature Selection, Anomaly, Multilayer Perceptron, ELM.
1 Introduction
Nowadays, systems face a large number of malicious activities. An Intrusion Detection System (IDS) recognizes malicious activities originating inside and outside the system. Protecting systems from intrusions or attacks is becoming harder every day, as intrusions are highly sophisticated and are growing rapidly across networks. The risk of data loss, hacking and intrusion has been increasing with the number of Internet users [1,2,3,4].
The awareness created by integrating networks helps to reduce the damage when an intrusion is detected [5]. The multilayer perceptron approach, which aims to improve detection accuracy for low-frequency attacks and detection stability, has two stages: training with large normal datasets and testing with intrusion datasets [6,7]. The Neural Network (NN), an important machine learning paradigm, is applied in IDS to solve complex real-time problems [8,14]. However, two characteristics of network-based IDS limit its usefulness: (i) lower precision in detection, mainly in the case of low-frequency attacks, and (ii) poor stability of anomaly detection.
2 Literature Survey
This section reviews the work done in the area of network-based IDS (NIDS); most of the detection studies rely on the KDD dataset. Rule-based expert systems and statistical approaches are the two major approaches generally used for intrusion detection [9,10].
The detection rate of the attack remains at 78 % while the rate detection of other
Haystack [11,12] later developed a system for intrusion detection based on user and anomaly strategies. Six types of intrusion were distinguished: masquerade attacks, attempted break-ins by unauthorized users, malicious use, leakage, denial of service, and penetration of the security control system. The normal profile is obtained by examining the call sequences between intrusion detection and verification against the human system.
An attack in this framework is regarded as a sequence that deviates from the normal profile sequence. This framework therefore works offline on previously collected data and applies the View-Table-Algorithm (VTA) to learn program profiles [15,16].
3 Intrusion Detection System (IDS)
Intrusion detection is the process of monitoring and analyzing the events occurring in a computer system or network in order to detect signs of security problems. There are two main IDS techniques: anomaly detection and misuse detection. Anomaly detection tries to identify behavior that does not conform to normal behavior, whereas misuse detection attempts to match patterns and signatures of already known attacks in the network traffic. The basic function of an IDS is to act as a passive alerting system: when an intrusion is detected, the IDS generates an alert and provides all the relevant information (time, IP packets, and so on) that triggered the alert [7,8,17].
Our main aim is to create an Intrusion Detection System (IDS) based on an anomaly detection model that is accurate, hard to fool with small variations in patterns, low in false alarms, adaptive and real-time. Figure 1 depicts the proposed system architecture, in which the intrusion packets are received from the Internet. First, features are extracted from the data packets and sent to the proposed IDS [9]. The proposed IDS then computes the distance between the extracted features and the trained model. Here, the trained model consists of large datasets in a distributed storage environment to improve the performance of the intrusion detection system. Finally, if the outlier value is greater than the predefined threshold, the IDS raises an alarm.
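The distance-and-threshold decision described in this section can be illustrated with a minimal sketch. The paper does not state which distance function or aggregation rule is used, so the Euclidean distance, the nearest-profile rule and the names `outlier_score` and `is_intrusion` below are assumptions made purely for illustration.

```python
import math

def outlier_score(features, trained_profiles):
    """Distance from a feature vector to the closest trained profile.

    Assumption: Euclidean distance and a nearest-profile rule; the paper
    does not specify either choice.
    """
    return min(math.dist(features, profile) for profile in trained_profiles)

def is_intrusion(features, trained_profiles, threshold):
    """Raise an alarm when the outlier score exceeds the predefined threshold."""
    return outlier_score(features, trained_profiles) > threshold

# Hypothetical two-dimensional profiles of normal traffic.
profiles = [(0.0, 0.0), (1.0, 1.0)]
near_normal = is_intrusion((0.1, 0.0), profiles, threshold=0.5)
far_from_normal = is_intrusion((5.0, 5.0), profiles, threshold=0.5)
```

In practice the threshold would have to be tuned on held-out data, since it trades missed attacks against false alarms.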
Fig.1. Proposed System Architecture
WEKA Tool. WEKA stands for Waikato Environment for Knowledge Analysis. It is a Java-based, freely available and widely used machine learning software package. It supports several standard data mining tasks such as clustering, data preprocessing, regression, feature selection, visualization and classification [10]. WEKA helps to uncover hidden information in file systems and databases through a visual interface and simple options.
Correlation based feature selection (CFS). CFS is one of the simplest feature selection methods. It relies on the assumption that features are conditionally independent given the class; the resulting feature subset is used to evaluate the given hypothesis [11]. A good feature subset is one whose features are highly correlated with the class and yet uncorrelated with each other. One of the benefits of CFS is that it is a filter-based algorithm, which makes it significantly faster than a wrapper selection method because it does not need to invoke a learning algorithm.
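The idea of a subset that is strongly correlated with the class but weakly correlated internally is usually scored with the merit heuristic from the CFS literature, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The paper does not state this formula, so the sketch below is an assumption about how the subset would be scored, and `cfs_merit` is a hypothetical helper name.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset under the usual CFS heuristic (assumed here).

    feature_class_corr: one correlation per feature in the subset (with the class).
    feature_feature_corr: one correlation per pair of features in the subset.
    """
    k = len(feature_class_corr)
    if k == 0:
        return 0.0
    r_cf = sum(abs(r) for r in feature_class_corr) / k
    if feature_feature_corr:
        r_ff = sum(abs(r) for r in feature_feature_corr) / len(feature_feature_corr)
    else:
        r_ff = 0.0  # a single feature has no pairs
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Two equally relevant features: independent ones score higher than redundant ones.
independent = cfs_merit([0.8, 0.8], [0.0])
redundant = cfs_merit([0.8, 0.8], [1.0])
```

The comparison shows why redundancy is penalized: adding a fully redundant feature leaves the merit no higher than that of a single feature.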
4 Proposed Algorithm
To overcome the existing problems, we propose a technique that combines the CFS subset algorithm and a Neural Network using WEKA tools. The CFS subset algorithm selects the most relevant and important features. Feature selection identifies and removes unnecessary and irrelevant features. The relationship between features and attributes is measured with correlation coefficients.
CFS Subset Algorithm. Feature selection is a process that selects the relevant features from the original feature set. It is one of the most frequently used and important techniques in data preprocessing for data mining. Its purpose is to identify and remove unnecessary and irrelevant features [13]. Feature selection can be applied in both supervised and unsupervised learning. The optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of the domain grows, the number of features N increases. Finding an optimal feature subset is generally intractable, and many problems related to feature selection have been shown to be NP-hard. A general feature selection process consists of four stages: (i) subset generation, (ii) subset evaluation, (iii) stopping criterion and (iv) result validation.
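The four stages can be sketched as a greedy forward search. This is only one possible search strategy and is not necessarily the one used in the paper; `forward_select` and the toy scoring function are hypothetical, and any subset evaluation criterion can be passed in as `evaluate`.

```python
def forward_select(n_features, evaluate):
    """Greedy forward selection over feature indices 0..n_features-1.

    evaluate: callable taking a list of feature indices and returning a score
    (higher is better). It stands in for the subset evaluation criterion.
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # (i) Subset generation: extend the current subset by one feature.
        # (ii) Subset evaluation: score every candidate extension.
        candidates = [(evaluate(selected + [f]), f) for f in sorted(remaining)]
        score, feature = max(candidates)
        # (iii) Stopping criterion: stop when no extension improves the score.
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    # (iv) Result validation (e.g. on held-out data) is left to the caller.
    return selected, best_score

# Toy criterion: summed relevance minus a penalty per selected feature.
relevance = {0: 0.9, 1: 0.1, 2: 0.8}
toy_score = lambda subset: sum(relevance[f] for f in subset) - 0.3 * len(subset)
chosen, chosen_score = forward_select(3, toy_score)
```

A greedy search evaluates only a small fraction of the possible subsets, which is the usual way around the NP-hardness mentioned above, at the price of possibly missing the optimal subset.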
Another technique is the Neural Network, for which three methods have been used: Multilayer Perceptron (MLP), Logistic Regression and Extreme Learning Machine (ELM). The MLP is used to train the neural network [12]. Logistic regression is a form of regression analysis used to predict the outcome of a categorical dependent variable on the basis of predictor variables. It is used to estimate empirical parameter values in a qualitative response model, and it measures the relationship between the independent variables and the dependent variable. It can be binomial or multinomial. A well-known measure of association between attributes is the linear correlation coefficient, for which the formula is given below.
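$$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - \left(\sum X\right)^2\right]\left[N \sum Y^2 - \left(\sum Y\right)^2\right]}} \tag{1}$$

$$H(Y) = -\sum_{y \in R_y} p(y) \log p(y) \tag{2}$$

$$H(Y \mid X) = -\sum_{x \in R_x} p(x) \sum_{y \in R_y} p(y \mid x) \log p(y \mid x) \tag{3}$$

$$C(Y \mid X) = \frac{H(Y) - H(Y \mid X)}{H(Y)} \tag{4}$$

where X and Y are the two features/attributes, N is the number of observations, and R_x and R_y are the sets of values taken by X and Y.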
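Equations (1) to (4) can be checked numerically with a short sketch. The function names are hypothetical, probabilities are estimated from value counts, and the logarithm base is assumed to be 2 because the paper does not state it; the ratio in equation (4) does not depend on the base.

```python
import math
from collections import Counter

def pearson_r(xs, ys):
    """Linear correlation coefficient, equation (1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def entropy(ys):
    """H(Y), equation (2), with p(y) estimated from frequencies."""
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def conditional_entropy(xs, ys):
    """H(Y|X), equation (3): entropy of Y within each value of X, weighted by p(x)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def uncertainty_coefficient(xs, ys):
    """C(Y|X), equation (4): fraction of the entropy of Y explained by X."""
    h_y = entropy(ys)
    return (h_y - conditional_entropy(xs, ys)) / h_y

labels = ["normal", "normal", "attack", "attack"]
fully_informative = uncertainty_coefficient([0, 0, 1, 1], labels)
uninformative = uncertainty_coefficient([0, 1, 0, 1], labels)
```

A feature that determines the label completely yields a coefficient of 1, while a feature that is independent of the label yields 0, which is what makes the measure usable for ranking features.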
The Multilayer Perceptron (MLP) uses back propagation, which learns a set of weights for predicting the class label, where the class label indicates whether each connection is an attack. For better results, we reduce the training time of the neural network by keeping the input size small.
4.1 Algorithm for MLP
Step 1: Provide the input data in the Attribute-Relation File Format (ARFF). We use the WEKA toolbox for the MLP to compute the activation of every input, denoted 'a' (used in Step 2) and 'u' (used in Step 3).
Step 2: For every tuple, compute the output error term:
$$\Delta_i(t) = \bigl(d_i(t) - y_i(t)\bigr)\, g'\bigl(a_i(t)\bigr)$$
Step 3: Back-propagate the errors to the hidden layer using:
$$\delta_i(t) = g'\bigl(u_i(t)\bigr) \sum_k \Delta_k(t)\, w_{ki}$$
Step 4: Calculate the updated weights using:
$$v_{ij}(t+1) = v_{ij}(t) + \eta\, \delta_i(t)\, x_j(t)$$
$$w_{ij}(t+1) = w_{ij}(t) + \eta\, \Delta_i(t)\, z_j(t)$$
Here d_i is the desired output, y_i the actual output, g the activation function, η the learning rate, x_j the inputs and z_j the hidden-layer outputs.
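The steps of Section 4.1 can be sketched for a single hidden layer. This is an illustrative reading of the update rules, not the WEKA implementation: the sigmoid activation, the bias handling through a constant 1.0 input, the toy AND dataset and all function names are assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def forward(v, w, x):
    """Return hidden pre-activations u, hidden outputs z, output pre-activations a, outputs y."""
    u = [sum(v_ij * x_j for v_ij, x_j in zip(row, x)) for row in v]
    z = [sigmoid(u_i) for u_i in u] + [1.0]  # trailing 1.0 acts as a bias unit
    a = [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in w]
    y = [sigmoid(a_i) for a_i in a]
    return u, z, a, y

def train_step(v, w, x, d, eta):
    """One online update of hidden weights v and output weights w for one tuple."""
    u, z, a, y = forward(v, w, x)
    # Step 2: output error terms.
    delta_out = [(d_i - y_i) * sigmoid_prime(a_i) for d_i, y_i, a_i in zip(d, y, a)]
    # Step 3: errors back-propagated to the hidden units (bias unit excluded).
    delta_hid = [sigmoid_prime(u[i]) * sum(delta_out[k] * w[k][i] for k in range(len(w)))
                 for i in range(len(u))]
    # Step 4: weight updates with learning rate eta.
    for i, row in enumerate(v):
        for j in range(len(row)):
            row[j] += eta * delta_hid[i] * x[j]
    for i, row in enumerate(w):
        for j in range(len(row)):
            row[j] += eta * delta_out[i] * z[j]

def mean_squared_error(v, w, data):
    total = 0.0
    for x, d in data:
        y = forward(v, w, x)[3]
        total += sum((d_i - y_i) ** 2 for d_i, y_i in zip(d, y))
    return total / len(data)

def train_demo(epochs=2000, eta=0.5, hidden=3, seed=0):
    """Train on a toy logical-AND dataset; the last input of 1.0 is a bias."""
    rng = random.Random(seed)
    data = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [0.0]),
            ([1.0, 0.0, 1.0], [0.0]), ([1.0, 1.0, 1.0], [1.0])]
    v = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(hidden)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(1)]
    before = mean_squared_error(v, w, data)
    for _ in range(epochs):
        for x, d in data:
            train_step(v, w, x, d, eta)
    return before, mean_squared_error(v, w, data)

error_before, error_after = train_demo()
```

Keeping the input vector short, as the text recommends, directly reduces the cost of the nested loops in `train_step`, since the hidden-layer update is proportional to the number of inputs.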
5 Results and Discussion
In our study, a dataset is extracted and a number of experiments are performed on it in order to measure the performance of the IDS. The experiments were carried out on the following configuration: Windows 7 on an Intel Pentium(R) G2020 CPU with a processor speed of 2.90 GHz.
The extracted dataset includes training data of about two thousand connection records and test data of five thousand connection records. In addition, the dataset includes a set of forty-one derived features for every connection and a label that identifies whether the connection record is normal or an attack. The features, which comprise symbolic, discrete and continuous variables, fall into four groups:
1. Basic features of the TCP connection, i.e. intrinsic features such as connection duration, type of network service (telnet, http) and protocol type (UDP, TCP).
2. Content features within the connection, which represent domain knowledge and are used to assess the payload content of the TCP packets (such as the number of failed login attempts).
3. Same-host features, which examine the connections established in the previous two seconds that have the same destination host as the current connection, and estimate statistics related to protocol behavior, service, etc.
4. Same-service features, which examine the connections in the last two seconds that have the same service as the current connection.
Fig. 2 Big-Dataset size Vs Execution Time
Figure 2 gives an overview of the execution times for various dataset sizes. The proposed Intrusion Detection System takes less execution time at every size than other existing machine learning approaches. This is because less training data is used, which makes the distance computation between the training and testing datasets simpler.
Figure 3 shows the anomaly detection rate in the computer network. The proposed Intrusion Detection System identifies almost all types of attacks, such as Probe, DoS, U2R and R2L. The anomaly detection rate depends on the outlier values of the testing data. If the propagation value increases, the dataset is assumed to be an intrusion dataset.
Fig 3. Big-Dataset size Vs Anomaly Detection
Fig 4. Big-Dataset size Vs CPU Utilization
Figure 4 shows a graphical comparison of CPU utilization levels for various dataset sizes. The CPU utilization of the machine learning approaches is very high compared with the proposed approach. Most research papers apply machine learning approaches only with the help of large quantities of training data and training functions. In our proposed approach, we use only limited datasets to train the proposed IDS.
6 Conclusion
This work proposed a new approach that combines the CFS subset algorithm with a Neural Network, using MLP, Logistic Regression and ELM (Extreme Learning Machine), for identifying intrusions in computer networks. Our training model contains two large datasets in a distributed environment, which improves the performance of the intrusion detection system. Existing machine learning approaches identify intrusions in computer networks at a considerable cost in execution time and storage. Compared to existing IDS techniques, the proposed IDS takes less time for execution and for storing the test dataset. In this study, the performance of the proposed IDS is better than that of other existing machine learning approaches, and it can detect almost all anomalous data in the computer network. In future, the proposed work could be extended with different distance computation functions between the testing and training data. Our research work can be considered a step towards improving the efficiency of IDS.
References
1. Chih-Fong Tsai, Yu-Feng Hsu, Chia-Ying Lin, Wei-Yang Lin: Intrusion detection by machine learning: A review. Expert Systems with Applications, Elsevier (2009).