II. ATTRIBUTE SELECTION
The information gain measure used in the C4.5 algorithm serves to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. Such an information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple tree is found.
III. EXISTING ALGORITHM: INFORMATION GAIN
Let S be a set of training samples with their corresponding class labels. Suppose there are m classes, the training set contains s_i samples of class i, and s is the total number of samples in the training set. The expected information needed to classify a given sample is computed by:

I(s1, s2, ..., sm) = - Σ_{i=1}^{m} (si/s) log2(si/s)    (1)
An attribute F with values {f1, f2, ..., fv} partitions the training set into v subsets {S1, S2, ..., Sv}. Furthermore, let subset Sj contain s_ij samples of class i. The entropy of the attribute F is

E(F) = Σ_{j=1}^{v} ((s1j + ... + smj)/s) · I(s1j, s2j, ..., smj)    (2)
The information gain for F can then be computed as:

Gain(F) = I(s1, s2, ..., sm) - E(F)    (3)
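Equations (1)-(3) can be computed directly from class counts. The following is a minimal Python sketch (function names and the toy counts are our own illustration, not part of the paper's system):

```python
import math

def expected_info(class_counts):
    """I(s1,...,sm): expected information to classify a sample (Eq. 1)."""
    s = sum(class_counts)
    return -sum((si / s) * math.log2(si / s) for si in class_counts if si > 0)

def entropy_of_attribute(partitions):
    """E(F): weighted expected information over the v subsets induced by F (Eq. 2).
    `partitions` is a list of per-subset class-count lists [s_1j, ..., s_mj]."""
    s = sum(sum(p) for p in partitions)
    return sum((sum(p) / s) * expected_info(p) for p in partitions)

def information_gain(class_counts, partitions):
    """Gain(F) = I(s1,...,sm) - E(F) (Eq. 3)."""
    return expected_info(class_counts) - entropy_of_attribute(partitions)

# Toy example: 9 positive / 5 negative samples, split by a binary attribute
# into subsets with counts [6, 2] and [3, 3].
gain = information_gain([9, 5], [[6, 2], [3, 3]])
```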
In this study, information gain is computed for class labels by using a binary discrimination for each class. That is, for each class, a dataset instance is considered in-class if it has the same label, and out-class if it has a different label. Consequently, instead of computing a single information gain as a general measure of the relevance of the attribute over all classes, an information gain is computed for each class. This indicates how well the attribute can discriminate the given class from the other classes.
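This one-vs-rest computation can be sketched as follows (a minimal illustration under our own naming; the paper does not give an implementation):

```python
import math

def binary_info(p, n):
    """Expected information for a two-class (in-class / out-class) split."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0)

def per_class_gain(labels, attr_values, target_class):
    """Information gain of an attribute for a single class, treating each
    instance as in-class if its label equals target_class, else out-class."""
    in_out = [1 if lab == target_class else 0 for lab in labels]
    p, n = sum(in_out), len(in_out) - sum(in_out)
    base = binary_info(p, n)
    ent = 0.0
    for v in set(attr_values):
        idx = [i for i, a in enumerate(attr_values) if a == v]
        pv = sum(in_out[i] for i in idx)
        ent += (len(idx) / len(labels)) * binary_info(pv, len(idx) - pv)
    return base - ent
```

For example, an attribute that perfectly separates one class from the rest achieves the maximal per-class gain of 1.0 bit.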
IV. PROPOSED ENHANCEMENT: GAIN RATIO
CRITERION
The notion of information gain introduced above tends to favor attributes that have a large number of values. For instance, if we have an attribute D that has a distinct value for each record, then Info(D, T) is 0, and thus Gain(D, T) is maximal. To compensate for this, it was suggested in [6] to use the following ratio in place of the gain. Split info is the information due to the split of T on the basis of the value of the categorical attribute D, which is defined by

SplitInfo(D, T) = - Σ_{i=1}^{n} (|Ti|/|T|) log2(|Ti|/|T|)    (4)
And the gain ratio is then calculated by
GainRatio(D,T) = Gain(D,T)/SplitInfo(D,T) (5)
The gain ratio expresses the proportion of useful information generated by the split, i.e., information that is helpful for classification. If the split is near-trivial, the split information will be small and this ratio will be unstable. To avoid this, the gain ratio criterion selects a test that maximizes the ratio above, subject to the constraint that the information gain must be large, at least as large as the average gain over all tests examined.
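The gain ratio criterion, including the average-gain constraint, can be sketched as follows (an illustrative Python fragment under our own assumptions; the candidate gains below are made-up numbers):

```python
import math

def split_info(subset_sizes):
    """SplitInfo(D,T) over the partition sizes |T_i| (Eq. 4)."""
    total = sum(subset_sizes)
    return -sum((t / total) * math.log2(t / total) for t in subset_sizes if t > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio = Gain / SplitInfo (Eq. 5); guarded against a zero denominator."""
    si = split_info(subset_sizes)
    return gain / si if si > 0 else 0.0

def select_attribute(candidates):
    """Pick the attribute maximizing the gain ratio, restricted to attributes
    whose gain is at least the average gain over all candidate tests.
    `candidates` maps attribute name -> (gain, subset_sizes)."""
    avg_gain = sum(g for g, _ in candidates.values()) / len(candidates)
    eligible = {a: v for a, v in candidates.items() if v[0] >= avg_gain}
    return max(eligible, key=lambda a: gain_ratio(*eligible[a]))

# Made-up candidate tests: (information gain, sizes of induced subsets).
cands = {'outlook': (0.25, [5, 4, 5]),
         'humidity': (0.15, [7, 7]),
         'wind': (0.05, [8, 6])}
best_attr = select_attribute(cands)
```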
V. CLASSIFYING AND DETECTING ANOMALIES
Misuse detection is performed by applying rules to the test data. The test data is collected from the KDDCUP data set and stored in a database. The rules are applied as SQL queries to the database. This classifies the data under the following attack categories:
1) DoS
2) Probe
3) U2R
4) R2L
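A minimal sketch of applying a detection rule as an SQL query is shown below, using an in-memory SQLite database. The two-feature schema and the thresholds are hypothetical stand-ins for the full KDDCUP feature table and the tree-derived rules:

```python
import sqlite3

# Hypothetical miniature schema; the real KDDCUP table has many more features,
# and the actual rules come from the trained decision tree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (id INTEGER, src_bytes INTEGER, "
             "count INTEGER, label TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?,?,?,?)",
                 [(1, 0, 511, None), (2, 215, 3, None)])

# A rule expressed as an SQL UPDATE, e.g. "many connections with zero
# payload -> DoS" (illustrative thresholds only).
conn.execute("UPDATE test_data SET label='DoS' "
             "WHERE src_bytes = 0 AND count > 100")
rows = conn.execute("SELECT id, label FROM test_data ORDER BY id").fetchall()
```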
The C4.5 algorithm generates a decision tree, starting from the root node, by selecting one remaining attribute with the highest information gain as the test for the current node. In this work, Enhanced C4.5, a later revision of the C4.5 algorithm that instead selects the remaining attribute with the highest information gain ratio as the test for the current node, is used to build the decision trees for classification. From Table 3 it is clear that Enhanced C4.5 outperforms the classical C4.5 algorithm.
2012 IEEE International Conference on Computational Intelligence and Computing Research
VI. OVERALL PERFORMANCE: C4.5 ALGORITHM VS ENHANCED C4.5 ALGORITHM
Table 1 illustrates the overall detection rate and false positive rate for the C4.5 and Enhanced C4.5 algorithms. Enhanced C4.5 gives improved accuracy for the DoS, Probe, R2L and U2R categories compared to the C4.5 algorithm.
TABLE 1
OVERALL DETECTION RATE AND FALSE POSITIVE RATE FOR C4.5 AND ENHANCED C4.5 ALGORITHM

Sl. No | Attack Category      | Detection Rate (%) (C4.5) | Detection Rate (%) (Enhanced C4.5) | False Positive (%) (Enhanced C4.5)
1      | DoS                  | 90.6                      | 92.92                              | 0.085
2      | Probe                | 84.0                      | 88.29                              | 0.152
3      | U2R                  | 83.6                      | 84.00                              | 0.220
4      | R2L                  | 53.7                      | 66.91                              | 0.398
       | Average Success Rate | 77.975                    | 83.03                              | 0.213
VII. MODEL RESULT SCREEN SHOTS
Fig. 1. KDDCUP Decision Tree Data Set
Fig. 2. Entropy and Gain Ratio Values of All Attributes
VIII. PROPOSED DETECTION GENETIC ALGORITHM OVERVIEW
The list below illustrates the main steps of the proposed detection algorithm as well as the training process. It first generates the initial population and loads the network audit data. Then the initial population is evolved for a number of generations. In each generation, the fitness of the rules is first computed, and then a number of best-fit rules are selected. The training process starts by randomly generating an initial population of rules (Step 1). Step 2 counts the total number of records in the audit data. Step 3 evaluates the fitness of each rule and selects the best-fit rules into the new population. Step 4 computes the rank selection of individuals. Steps 5-7 apply the crossover and mutation operators to each rule in the new population. Step 8 selects the top best chromosomes into the new population. Finally, Step 9 checks whether to stop the training process or to proceed to the next generation to continue the evolution process.
A. Solution Steps of the Detection Algorithm
Algorithm: Rule set generation with Genetic Algorithm
Input: Number of generations, set of binary strings, population size, crossover probability, mutation probability.
Output: A set of selected attributes.
Step 1) Initialize the population randomly
Step 2) Count the total number of records in the training set
Step 3) Evaluate fitness = f(x) / f(sum), where f(x) is the fitness of individual x and f(sum) is the total fitness of all individuals
Step 4) Rank selection Ps(i) = r(i) / rsum, where Ps(i) is the probability of selecting individual i, r(i) is the rank of individual i, and rsum is the sum of all ranks
Step 5) For every chromosome in the new population
Step 6) Apply the crossover operator to the chromosome
Step 7) Apply the mutation operator to the chromosome
Step 8) Select the best 60% of chromosomes into the new population
Step 9) If the number of generations is not reached, go to Step 3.
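The steps above can be sketched in Python as follows. This is an illustrative implementation under our own assumptions about the operators (one-point crossover, bit-flip mutation) and default parameters; the onemax fitness at the bottom is a toy stand-in for the rule-quality measure of Step 3:

```python
import random

def evolve_rules(evaluate, pop_size=20, chrom_len=16, generations=30,
                 p_cross=0.7, p_mut=0.01, keep_frac=0.6, seed=0):
    """Steps 1-9: rank selection, one-point crossover, bit-flip mutation,
    and retention of the top 60% of chromosomes each generation.
    `evaluate` maps a binary-string rule (list of bits) to a fitness value."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(chrom_len)]
           for _ in range(pop_size)]                       # Step 1
    for _ in range(generations):                           # Step 9 loop
        ranked = sorted(pop, key=evaluate)                 # Step 3: fitness
        ranks = list(range(1, len(ranked) + 1))            # Step 4: Ps(i) = r(i)/rsum
        rsum = sum(ranks)
        def pick():
            r = rng.uniform(0, rsum)
            acc = 0
            for chrom, rk in zip(ranked, ranks):
                acc += rk
                if acc >= r:
                    return chrom[:]
            return ranked[-1][:]
        nxt = []
        while len(nxt) < pop_size:                         # Steps 5-7
            a, b = pick(), pick()
            if rng.random() < p_cross:                     # one-point crossover
                cut = rng.randrange(1, chrom_len)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for c in (a, b):
                for i in range(chrom_len):
                    if rng.random() < p_mut:               # bit-flip mutation
                        c[i] ^= 1
            nxt.extend([a, b])
        # Step 8: keep the best 60%, refill the remainder from the best parents.
        nxt.sort(key=evaluate, reverse=True)
        keep = int(keep_frac * pop_size)
        pop = nxt[:keep] + [c[:] for c in ranked[::-1][:pop_size - keep]]
    return max(pop, key=evaluate)

# Toy fitness: number of 1-bits (stands in for the rule-quality score).
best = evolve_rules(lambda c: sum(c))
```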
IX. EXPERIMENTAL RESULTS
From the above implementation we have successfully
generate some rules that classify the stated attack
connections and for applying Genetic Algorithm on
selected feature set and find the fitness value for each
generation. This section reports four different attack
categories that can recognize.
TABLE 2
ENHANCED RULE BASED GA - DETECTION RATE FOR DOS, R2L, U2R, PROBE ATTACKS

Sl. No | Attack Category      | Detection Rate (%) | False Positive (%)
1      | DoS                  | 93.70              | 0.063
2      | R2L                  | 88.85              | 0.112
3      | U2R                  | 92.50              | 0.075
4      | Probe                | 95.33              | 0.055
       | Average Success Rate | 92.595             | 0.076
TABLE 3
OVERALL PERFORMANCE COMPARISONS OF G.A VS ENHANCED G.A.
The graph in Fig. 3 shows the performance of G.A and Enhanced G.A in terms of accuracy for the DoS, R2L, U2R, and Probe categories.
Fig. 3. Performance of G.A and Enhanced G.A: detection rate (%) by attack category (DoS, Probe, U2R, R2L) for Hoffman, Selvakani, and Enhanced G.A
Sl. No | Attack Category      | Detection Rate (%) (Hoffman) | Detection Rate (%) (Selvakani) | Detection Rate (%) (Enhanced G.A) | False Positive (%) (Enhanced G.A)
1      | DoS                  | 82.9                         | 86.7                           | 93.70                             | 0.063
2      | Probe                | 75.3                         | 79.1                           | 95.33                             | 0.112
3      | U2R                  | 73.1                         | 71.2                           | 92.50                             | 0.075
4      | R2L                  | 85.3                         | 83.3                           | 88.85                             | 0.055
       | Average Success Rate | 79.15                        | 80.075                         | 92.595                            | 0.076
TABLE 4
PERFORMANCE COMPARISON OF ENHANCED G.A VS ENHANCED C4.5
The graph in Fig. 4 shows the performance of Enhanced G.A and Enhanced C4.5 in terms of accuracy for the DoS, R2L, U2R, and Probe categories.
Fig. 4. Performance of Enhanced G.A and Enhanced C4.5 Algorithm
X. CONCLUSION AND FUTURE WORK
The enhanced Genetic Algorithm is a well-suited method for Intrusion Detection compared to the enhanced C4.5 algorithm. Different classification rules for Intrusion Detection were obtained through the Genetic Algorithm. The proposed Genetic Algorithm yields an Intrusion Detection System for detecting DoS, R2L, U2R, and Probe attacks from the KDDCUP99 dataset. The outputs of the experiments are satisfactory, with an average success rate of 92.595%, and the overall results of the implemented technique are good. In future work, we plan to experiment with other features and different classification methods.
REFERENCES
[1] S. Axelsson, "Intrusion Detection Systems: A Survey and Taxonomy," Technical Report, Dept. of Computer Engineering, Chalmers University, 2000.
[2] C. Kruegel and F. Valeur, "Stateful Intrusion Detection for High-Speed Networks," Proceedings of the IEEE Symposium on Research on Security and Privacy, pp. 285-293, 2002.
[3] G. Kayacik, N. Zincir-Heywood, and M. Heywood, "On the Capability of an SOM-based Intrusion Detection System," Proceedings of the International Joint Conference on Neural Networks, 2003.
[4] T. Bass, "Intrusion detection systems and multisensor data fusion," Communications of the ACM, Vol. 43, pp. 99-105, 2000.
[5] S. Selvakani and R. S. Rajesh, "Integrated Intrusion Detection System Using Soft Computing," IJNS, Vol. 10, No. 2, pp. 87-92, March 2010.
[6] S. M. Bridges and R. B. Vaughn, "Fuzzy Data Mining and Genetic Algorithms Applied to Intrusion Detection," Proceedings of the 12th Annual Canadian Information Technology Security Symposium, pp. 109-122, 2000.
[7] M. Crosbie and G. Spafford, "Applying Genetic Programming to Intrusion Detection," Proceedings of the 1995 AAAI Fall Symposium on Genetic Programming, pp. 1-8, Cambridge, Massachusetts, 1995.
[8] A. Chittur, "Model Generation for an Intrusion Detection System using Genetic Algorithms," High School Honors Thesis, http://www.cs.columbia.edu/ids/publications/gaids-thesis01.pdf, accessed in 2006.
[9] C. Xiang and S. M. Lim, "Design of multiple-level hybrid classifier for intrusion detection system," IEEE Transactions on Systems, Man, and Cybernetics, Part A, Vol. 2, No. 28, Mystic, CT, pp. 117-122, May 2005.
[10] J. Shavlik and M. Shavlik, "Selection, combination, and evaluation of effective software sensors for detecting abnormal computer usage," Proceedings of the First International Conference on Network Security, Seattle, Washington, USA, pp. 56-67, May 2003.
Sl. No | Attack Category      | Detection Rate (%) (Enhanced G.A) | False Positive (%) (Enhanced G.A) | Detection Rate (%) (Enhanced C4.5) | False Positive (%) (Enhanced C4.5)
1      | DoS                  | 93.70                             | 0.063                             | 92.92                              | 0.085
2      | Probe                | 95.33                             | 0.112                             | 88.29                              | 0.152
3      | U2R                  | 92.50                             | 0.075                             | 84.00                              | 0.220
4      | R2L                  | 88.85                             | 0.055                             | 66.91                              | 0.398
       | Average Success Rate | 92.595                            | 0.076                             | 83.03                              | 0.213