This document proposes an intrusion detection model using self-organizing maps (SOM) to detect malicious attackers on websites. It discusses how denial of service (DoS) attacks aim to harm systems by flooding servers with traffic. The proposed model uses an unsupervised machine learning technique called SOM to analyze website authentication logs and achieve better security by detecting malicious attackers. The SOM algorithm is chosen to naturally cluster data and produce better results than other clustering algorithms. Pseudocode is provided to demonstrate how the SOM algorithm is implemented on normalized website log data to identify different types of visitors and detect intrusions.
Intrusion Detection Model Using Self-Organizing Map
TUSHAR ASHOK SHINDE, School of Computer Science and Engineering
VIT University
Chennai, Tamil Nadu
shindetushar.ashok2014@vit.ac.in
Prof. I. SUMAIYA THASEEN, School of Computer Science and Engineering
VIT University
Chennai, Tamil Nadu
isumaiyathaseen@vit.ac.in
Denial of Service (DoS) attacks are launched by malicious attackers who aim to harm a system by breaking its internet
security. According to the literature, such attackers target even well-secured websites, and they may visit a website in
groups to attack its server. In this paper we propose a technique that helps prevent malicious attacks on the server. It
applies an unsupervised neural network (NN) learning technique to website authentication logs. The Self-Organizing Map
(SOM), a widely used unsupervised machine learning technique, is deployed for intrusion detection in our paper. SOM is
used to achieve better safety for website servers by detecting malicious attackers. Different types of visitors log on to use
the services of public websites, and the intrusion detection system (IDS) identifies the differences between malicious
website attackers and non-malicious website users.
Categories and Subject Descriptors: 1 [Intrusion Detection System]: Un-supervised—SOM clustering; 2 [MATLAB 2014]:
plotsom—NSL-KDD dataset
Additional Key Words and Phrases: Access Logs, Clustering tools, Denial of Service (DoS), Machine Learning techniques, Web
Attackers, Web-Site data
1. INTRODUCTION
Businesses depend heavily on the internet. The growing demand for internet access and the market
success of websites have changed the way traditional services are provided: banking, transport,
medical, educational, and defense services, among others, are now operated on internet sites. The
number of web-based applications providing such services keeps increasing, and they are available to
people in day-to-day life.
The internet architecture is meant to keep attackers away from the security services of web-based
applications. Distributed Denial-of-Service (DDoS) attacks are among the most important types of
attack on these services. In a survey, the United States Department of Defense reported that malicious
cyber-attacks on individuals, countries, and the world at large target economic, political, and military
organizations. The number of attacks is expected to increase in the near future, causing losses of
billions of dollars for the country. The most commonly used technique for a Denial of Service (DoS)
attack on a website is to flood the targeted system with messages. Attackers aim to reach the interface
of the targeted system and perform malicious operations that cause it to hang, crash, or reboot, or that
alter server system files. The literature shows that most such attacks were prevented by locating the
source of the attack and blocking the malicious traffic on the internet-based system.
Almost all DoS attacks involve a complicated, distributed network of machines controlled by the
attackers (the 'hackers'); such attacks are called Distributed DoS (DDoS) attacks. DDoS attacks are
very difficult to detect because many attackers take part and the source of the attack cannot be
identified. They create a large volume of traffic on the victim's website, which results in the loss of
important data and of the system's services.
In this paper, the SOM algorithm is chosen for the IDS because it produces results based on natural
clustering. Moreover, it is chosen to build the intrusion detection model because it produces better
results than many other popular clustering algorithms. SOM clustering achieves an unsupervised
visualisation of high-dimensional input data in a 2D representation.
The paper is organized as follows: Section 2 reviews related work on supervised and unsupervised
learning techniques. Section 3 discusses the proposed work. Section 6 concludes the work.
2. RELATED WORK
Several studies use supervised machine learning to improve the grouping of website sessions. Supervised
techniques classify sessions on the basis of previously known, labeled data. P. N. Tan and V. Kumar [1]
attempted to separate groups of web attacker sessions by applying a decision tree classifier to a
25-dimensional feature vector space. The 25 features are derived from the navigation properties of
each previously known, pre-labeled session, including the sessions pre-labeled as robots. The study
reports that the decision tree classifier reaches a detection accuracy of up to 90% after four page
requests have been considered. A. Stassopoulou and M. D. Dikaiakos [2] used a supervised Bayesian
classifier to detect web attackers from server log data and compared their results with the earlier
decision tree results. They achieve very high accuracy in detecting web attacks; in another study they
also used logistic regression together with a decision tree. D. Doran and S. S. Gokhale [6] proposed a
tool for attack detection that speeds up pre-processing. Server data containing access logs was used to
detect attacks on the website with better success. Y. Hiltunen and M. Lappalainen [3] used the SOM
algorithm to obtain a low-dimensional classification of users based on the number of web pages they
visit and the pattern of their visits. D. Petrilis and C. Halatsis [4] also examined the use of SOM to
cluster web users, producing results that help to find similar information in less time.
J. Martín-Guerrero and E. Soria-Olivas [5] employed ART algorithms to group users by the similarity of
their interests in web pages. The studies [3], [4], and [5] consider only normal website visitors and
do not cover attacker data. They examine data visualization with the SOM algorithm.
3. PROPOSED SYSTEM
Developers of website servers use an IDS to protect the website against malicious attacks and to avoid
harmful threats to the web server logs. An Intrusion Detection System (IDS) can be software or
hardware that detects intrusive (malicious) sessions.
Figure 1: Architecture of Intrusion Detection System
3.1 Pre-processing of server logs
A web server log file typically contains, for each request, the IP address that identifies the host of
the user, the Uniform Resource Locator (URL) of the requested page, the links visited in the page, the
date and time of the visit, and the size of the file downloaded by the user while browsing the page.
The log also holds user information describing the MAC address and the application the user employs
to visit the site. It further contains a referrer field, which identifies the web page that directed the
user to the current page.
Our web server log file analyser performs the following steps when provided with a log file:
1) It checks the previous entries in the system files to identify the activity of each unique user.
2) For each known session, the analyser examines and matches its unique key to determine
the view representation.
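To illustrate the log fields described above, the following minimal Python sketch parses a single log entry. It assumes the common Apache "combined" log format, which the paper does not specify; the pattern, the function name `parse_log_line`, and the sample entry are hypothetical.

```python
import re
from datetime import datetime

# Apache "combined" log format (an assumption; the paper names no format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp and the download size into typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical sample entry (documentation IP range).
sample = ('192.0.2.10 - - [10/Oct/2014:13:55:36 +0530] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry["ip"], entry["url"], entry["size"], entry["referrer"])
```

Each parsed entry carries the host IP address, the time of the visit, the requested URL, the download size, and the referrer, which are the inputs that session identification needs.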
3.2 SOM Algorithm
The Self-Organizing Map (SOM) is a data representation technique that produces a 2D visualization of
the data; it was invented by Professor Kohonen in the early 1980s. The SOM maps multi-dimensional
input onto a low-dimensional view in which the data is clustered, so that the spatial relationships
between points on the map reflect the similarity of the inputs. Reducing the data to fewer dimensions
makes it possible to visualize web server data. The SOM is a machine learning technique based on a
neural network that is trained with a competitive learning algorithm. The weights are adjusted in
favour of the "winning" neurons (the neurons that most closely match the input sample). Any sample
of the training dataset may be drawn in an iteration, and over the iterations similar samples end up
grouped together.
Basic steps in SOM Algorithm:
Step 1: Initialize Map
Step 2: For t from 0 to 1
Step 3: Randomly select a sample
Step 4: Get best matching unit
Step 5: Scale neighbours
Step 6: Increase t a small amount
Step 7: End for
ALGORITHM 1: SOM Algorithm
Step 1: Choose the topology of the output layer.
Set the current neighbor distance of each node to a positive value.
Step 2: Take an input vector and initialize the weight values with small random values.
Step 3: Set a = 1.
Step 4: do
Select an input sample t_i.
Use the Euclidean distance to measure the similarity between the input sample vector and the weight
vector of each map node, computing the squared distance to t_i:
$$ d_q = \lVert t_i - w_q \rVert^2 $$
where
w_q = weight vector of node q.
t_i = input sample.
k(a) = current iteration.
q = output node.
Determine the output node q* (the best matching unit, BMU) with the minimum distance d_q.
Update the BMU and its neighbor nodes by moving their weights towards the input vector:
$$ w_q \leftarrow w_q + n(a)\,(t_i - w_q) $$
where
n(a) = restraint due to the distance from the BMU and the iteration time.
Step 5: Increment a and repeat while a < the limit on the number of iterations.
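The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the paper's MATLAB implementation: the function names `best_matching_unit` and `train_som`, the exponential decay of the learning rate and neighbourhood width, the Gaussian neighbourhood, and the toy samples are all assumptions made for the example.

```python
import math
import random

def best_matching_unit(weights, t):
    """Return (row, col) of the node whose weight vector is closest to sample t."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            # Squared Euclidean distance between the sample and the node weights.
            d = sum((x - y) ** 2 for x, y in zip(t, w))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(samples, rows, cols, iterations, sigma0=2.0, eta0=0.1, seed=0):
    """Train a rows x cols SOM on a list of equal-length feature vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Step 2: initialize every weight vector with small random values.
    weights = [[[rng.random() * 0.1 for _ in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    for a in range(1, iterations + 1):
        # Assumed schedule: width and learning rate decay exponentially.
        sigma = sigma0 * math.exp(-a / iterations)
        eta = eta0 * math.exp(-a / iterations)
        t = rng.choice(samples)                  # select an input sample
        br, bc = best_matching_unit(weights, t)  # find the BMU
        for r in range(rows):
            for c in range(cols):
                # n(a): restraint from grid distance to the BMU and from time.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                n = eta * math.exp(-grid_d2 / (2 * sigma ** 2))
                w = weights[r][c]
                for k in range(dim):
                    w[k] += n * (t[k] - w[k])    # move weights towards the sample
    return weights

# Toy data: two hypothetical groups of normalized two-feature samples.
samples = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(samples, rows=3, cols=3, iterations=200)
for s in samples:
    print(s, "->", best_matching_unit(som, s))
```

Because every update is a convex combination of the old weight and the sample, the weights stay inside the range of the normalized inputs.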
3.3 Session Identification
Session identification groups the website access logs into sessions. It is done by:
1) Collecting all the HTTP requests to the website that originate from the same IP address,
which identifies the visitor.
2) Applying a timeout to split the requests into sessions: a sequence of HTTP requests from the same
IP address belongs to one session as long as the time between two consecutive requests stays under a
previously defined threshold. The resulting sessions depend on the chosen threshold, because different
website users navigate differently while surfing the site. A. Stassopoulou and M. D. Dikaiakos [2]
employ a 30-minute threshold and obtain fairly successful web attacker classification results.
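The two rules above can be sketched in a few lines of Python. The function name `split_sessions`, the input layout (pairs of IP address and timestamp), and the sample requests are hypothetical; only the 30-minute threshold comes from the text.

```python
from datetime import datetime, timedelta

def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp) requests into per-IP sessions using an idle timeout."""
    sessions = []       # list of (ip, [timestamps]) in order of session start
    last_session = {}   # ip -> index of that IP's most recent session
    for ip, ts in sorted(requests, key=lambda req: req[1]):
        idx = last_session.get(ip)
        if idx is not None and ts - sessions[idx][1][-1] <= timeout:
            # Same IP and the gap is within the threshold: same session.
            sessions[idx][1].append(ts)
        else:
            # First request from this IP, or the gap exceeded the threshold.
            last_session[ip] = len(sessions)
            sessions.append((ip, [ts]))
    return sessions

# Hypothetical requests from two visitors.
requests = [
    ("198.51.100.7", datetime(2014, 10, 10, 10, 0)),
    ("203.0.113.5", datetime(2014, 10, 10, 10, 5)),
    ("198.51.100.7", datetime(2014, 10, 10, 10, 10)),
    ("198.51.100.7", datetime(2014, 10, 10, 11, 0)),  # 50 minutes later
]
sessions = split_sessions(requests)
for ip, stamps in sessions:
    print(ip, len(stamps))
```

In this sample, the first visitor's last request arrives 50 minutes after the previous one and therefore opens a new session.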
4. CODING
4.1 Normalizing data:
clear all;
clc;
profile on;
ticID = tic;
t = cputime;
matrix_normalized = zeros();
excel_file = {'unlabelleddata20.xlsx'};
NF = size(excel_file,1);
output_pdf = {'PDF_sample.csv'};
output_csv = {'NORM_sample.csv'};
protocol_type = {'tcp';'udp';'icmp';'arp'};
Nprotocol = size(protocol_type,1);
M = zeros(Nprotocol,1);
pdf_p = zeros(Nprotocol,1);
% The flag field in KDD has the following values:
flag = {'OTH';'REJ';'RSTO';'RSTOS0';'RSTR'; ...
    'RSTRH';'S0';'S1';'S2';'S3';'SF';'SH'; ...
    'SHR'};
Nflag = size(flag,1);
F = zeros(Nflag,1);
pdf_f = zeros(Nflag,1);
service = ...
    {'aol'; 'http_443'; 'http_8001'; ...
    'http_2784';'domain_u'; 'ftp_data'; ...
    'auth'; 'bgp'; 'courier';'tftp_u'; ...
    'uucp_path'; 'csnet_ns'; 'ctf'; ...
    'daytime';'time'; 'discard'; ...
    'domain'; 'echo';'eco_i'; 'ecr_i'; ...
    'efs'; 'exec'; 'finger'; 'gopher'; ...
    'harvest';'hostnames'; 'http'; 'imap4'; ...
    'IRC';'iso_tsap'; 'klogin'; 'kshell'; ...
    'ldap';'link';'login'; 'smtp'; 'mtp'; ...
    'name';'netbios_dgm'; 'netbios_ns'; ...
    'netbios_ssn'; 'netstat';'nnsp'; ...
    'nntp'; 'ntp_u'; 'other'; 'pm_dump'; ...
    'pop_2';'pop_3'; 'printer'; 'private'; ...
    'red_i'; 'remote_job'; 'rje'; 'shell'; ...
    'sql_net'; 'ssh'; 'sunrpc';'supdup'; ...
    'systat'; 'telnet'; 'tim_i'; ...
    'urh_i'; 'urp_i'; 'uucp';'ftp'; 'vmnet'; ...
    'whois'; 'X11'; 'Z39_50'};
Nservice = size(service,1);
N = zeros(Nservice,1);
pdf_s = zeros(Nservice,1);
% Load the dataset from the xls file.
for f = 1:NF
    % read everything into one cell array
    fprintf('Start processing the File : %s', excel_file{f}); fprintf('\n');
    [~,~,raw] = xlsread(excel_file{f});
    % find numbers
    containsNumbers = cellfun(@isnumeric,raw);
    % convert to string
    raw(containsNumbers) = ...
        cellfun(@num2str,raw(containsNumbers), ...
        'UniformOutput',false);
    row_count = size(raw,1);
    col_count = size(raw,2);
    proto_col = raw(:,2);
    flag_col = raw(:,4);
    service_col = raw(:,3);
    % calculate probabilities of protocol type
    for i = 1:Nprotocol
        M(i) = sum(strcmp(protocol_type(i), proto_col));
        pdf_p(i) = M(i)/row_count;
    end
    % replace each protocol name by its probability
    for p = 1:length(protocol_type)
        proto_col = strrep(proto_col, protocol_type{p}, num2str(pdf_p(p)));
    end
    % calculate probabilities of flag
    for i = 1:Nflag
        F(i) = sum(strcmp(flag(i),flag_col));
        pdf_f(i) = F(i)/row_count;
    end
    for fg = 1:length(flag)
        flag_col = strrep(flag_col,flag{fg}, num2str(pdf_f(fg)));
    end
    % Service column calculation and replacement
    for i = 1:Nservice
        N(i) = sum(strcmp(service(i),service_col));
        pdf_s(i) = N(i)/row_count;
    end
    for s = 1:length(service)
        service_col = strrep(service_col,service{s}, num2str(pdf_s(s)));
    end
    % Set all values back to the main cell matrix
    raw(:,2) = proto_col;
    raw(:,4) = flag_col;
    raw(:,3) = service_col;
    % write the probability (PDF) file used as input to normalization
    fid = fopen(output_pdf{f},'wt');
    for i = 1:row_count
        fprintf(fid,'%s,',raw{i,1:end-1});
        fprintf(fid,'%s\n',raw{i,end});
    end
    fclose(fid);
    fprintf('Currently generated is : %s', output_pdf{f}); fprintf('\n');
    % End Converting
end
% Start Normalization
for f = 1:NF
    % Load the dataset written by the conversion step
    % data = xlsread('PDF.xlsx');
    data = load(output_pdf{f});
    row_count = size(data,1);
    col_count = size(data,2);
    raw_matrix = data;
    % preallocate the output for this file
    matrix_normalized = zeros(row_count, col_count);
    for i = 1:col_count
        selected_column = raw_matrix(:,i);
        maximum = max(selected_column);
        minimum = min(selected_column);
        if maximum > 1
            for j = 1:size(selected_column,1)
                if selected_column(j) == 0
                    matrix_normalized(j,i) = 0;
                else
                    % min-max scaling with the column extremes computed above
                    matrix_normalized(j,i) = ...
                        (selected_column(j)-minimum) / (maximum - minimum);
                end
            end
        else
            % the column already lies within [0, 1]; copy it unchanged
            for j = 1:size(selected_column,1)
                matrix_normalized(j,i) = selected_column(j);
            end
        end
    end
    % write to a csv file
    csvwrite(output_csv{f},matrix_normalized);
    fprintf('>Finished is: %s',output_csv{f});
    fprintf('\n');
end
% End Normalizing
fprintf('Total execution time is: %f \n', cputime-t);
fclose all;
clear;
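The two pre-processing ideas in the normalization listing, replacing each categorical value by its relative frequency and min-max scaling numeric columns whose maximum exceeds 1, can be summarized in the short Python sketch below. It is an illustration under stated assumptions rather than a port of the MATLAB script: the function names and sample columns are hypothetical, and returning zeros for a constant column is an added guard that the listing does not contain.

```python
from collections import Counter

def frequency_encode(column):
    """Replace each categorical value by its relative frequency in the column."""
    counts = Counter(column)
    total = len(column)
    return [counts[value] / total for value in column]

def min_max_normalize(column):
    """Scale a numeric column to [0, 1]; columns with maximum <= 1 are kept as-is."""
    lo, hi = min(column), max(column)
    if hi <= 1:
        return list(column)
    if hi == lo:
        # Assumed guard (not in the listing): avoid dividing by zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical protocol and duration columns.
print(frequency_encode(["tcp", "tcp", "udp", "icmp"]))
print(min_max_normalize([0, 5, 10]))
```

Matching whole values, as `Counter` does here, avoids the ordering sensitivity of substring replacement, where a short name such as `SH` can also match inside a longer one such as `SHR`.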
% SOM Training
% clean up the previous act
close all;
clear;
clc;
clf;
shg;
% Load the normalized training data: an unnamed
% 1024x1934 matrix that consists
% of a total of 1934 input samples, where each
% sample possesses 1024 attributes
load normalize;
% dataRow:
% number of attributes of each sample,
% i.e. 1024
% dataCol:
% Total number of training samples,
% i.e. 1934
[dataRow, dataCol] = size(NORMsample);
% SOM Architecture
% The number of rows and columns of som map
somRow = 10;
somCol = 10;
% Initialize 10x10x1024 som matrix
% This is the SOM map of 10x10 neurons
% Each neuron carries a weight vector of
% 1024 elements
som = zeros(somRow, somCol, dataRow);
% Parameters Settings
% Max number of iteration
N = 20;
% Initial effective width
sigmaInitial = 5;
% Time constant for sigma
t1 = N / log(sigmaInitial);
% Initialize 10x10 matrix to store
% Euclidean distances of each neurons on map
euclideanD = zeros(somRow, somCol);
neighbourhoodF = zeros(somRow, somCol);
% initial learning rate (named etaInitial to match its use in the training loop)
etaInitial = 0.1;
% time constant for eta
t2 = N;
% Initialization
% Generate random weight vectors
% [dataRow x 1] and assign it to the
% third dimension of som
for r = 1:somRow
    for c = 1:somCol
        som(r, c, :) = rand(dataRow, 1);
    end
end
% Initialize iteration count to one
n = 1;
% Start of Iterative Training
% Start of one iterative loop
while n < N
    sigma = sigmaInitial * exp(-n/t1);
    variance = sigma^2;
    eta = etaInitial * exp(-n/t2);
    % Prevent eta from falling below 0.01
    if (eta < 0.01)
        eta = 0.01;
    end
    % Randomly pick the index of one input sample (column) in NORMsample
    i = randi([1,dataCol]);
    % Competition Phase
    % Find the Euclidean Distances
    % from the input vector
    % to all the neurons
    for r = 1:somRow
        for c = 1:somCol
            v = NORMsample(:,i) ...
                - reshape(som(r,c,:),dataRow,1);
            euclideanD(r,c) = sqrt(v' * v);
        end
    end
    % Determine the winner neuron,
    % i.e. the neuron that is the closest to
    % the input vector.
    % winnerRow and winnerCol is index position
    % of the winner neuron on the SOM Map
    [vector,winnerRowVector] = min(euclideanD,[],1);
    % 1 stands for 1st dimension, i.e. row
    [winnerEuclidean,winnerCol] = min(vector,[],2);
    % 2 stands for 2nd dimension, i.e. column
    winnerRow = winnerRowVector(winnerCol);
    % End of Competition Phase
    % Cooperation Phase
    % Compute the neighborhood function
    % of every neuron
    for r = 1:somRow
        for c = 1:somCol
            if (r == winnerRow && c == winnerCol)
                % Is the winner
                neighbourhoodF(r, c) = 1;
                continue;
            else
                % Not the winner
                distance = (winnerRow - r)^2 + ...
                    (winnerCol - c)^2;
                neighbourhoodF(r, c) = ...
                    exp(-distance/(2*variance));
            end
        end
    end
    % End of Cooperation Phase
    % Adaptation Phase
    for r = 1:somRow
        for c = 1:somCol
            oldWeightVector = reshape(som(r, c,:),dataRow,1);
            % Update weight vector of neuron
            som(r, c,:) = oldWeightVector + ...
                eta*neighbourhoodF(r,c)*(NORMsample(:,i) ...
                - oldWeightVector);
        end
    end
    % End of Adaptation Phase
    % Draw updated SOM map
    f1 = figure(1);
    set(f1,'name',strcat('Iteration #', ...
        num2str(n)),'numbertitle','off');
    for r = 1:somRow
        for c = 1:somCol
            region = somCol * (r - 1) + c;
            subplot(somRow,somCol,region,'align')
            img = reshape(som(r, c, :),dataRow,1);
            imshow(double(img));
        end
    end
    % End of Draw updated SOM map
    % Increase iteration count by one
    n = (n + 1);
end
% End of while loop
% Save the trained SOM
save('trained_som', 'som');
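To make the effect of the parameter settings in the training listing concrete, the Python sketch below evaluates the decay schedules with the listed values (N = 20, initial width 5, initial learning rate 0.1, learning-rate floor 0.01). The constant and function names are hypothetical; the formulas mirror the listing. With these values the width shrinks from 5 towards 1 as n approaches N, and the learning rate stays above the 0.01 floor throughout.

```python
import math

N = 20                              # maximum number of iterations
SIGMA_INITIAL = 5.0                 # initial effective width
ETA_INITIAL = 0.1                   # initial learning rate
T1 = N / math.log(SIGMA_INITIAL)    # time constant for sigma
T2 = N                              # time constant for eta

def sigma(n):
    """Effective neighbourhood width at iteration n."""
    return SIGMA_INITIAL * math.exp(-n / T1)

def eta(n):
    """Learning rate at iteration n, never below the 0.01 floor."""
    return max(ETA_INITIAL * math.exp(-n / T2), 0.01)

def neighbourhood(distance_sq, n):
    """Gaussian neighbourhood value for a squared grid distance to the winner."""
    return math.exp(-distance_sq / (2 * sigma(n) ** 2))

for n in (1, 10, 20):
    print(n, round(sigma(n), 3), round(eta(n), 4), round(neighbourhood(1, n), 3))
```

The choice T1 = N / log(sigmaInitial) is what makes the width reach exactly 1 at n = N, so late updates affect little more than the winner and its immediate neighbours.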
5. RESULTS
Figure 2: KDD Unlabelled dataset
Figure 3: Normalized Data
Figure 4: Training SOM Network
Figure 5: SOM Neighbor Distance
Figure 6: SOM Sample Hits
6. CONCLUSION
Detecting malicious web attackers is an emerging trend in industry for improving network security
against unwanted attackers. In this paper, network security issues are identified and the KDD dataset
is clustered using the SOM technique. After normalization, SOM clustering is applied to the KDD data
to build clusters, which are labeled as anomaly or normal users. The result can help future security
systems to detect as many anomalous attacks as possible on the basis of the cluster results.
Distinguishing anomalous attackers from normal users provides a labelling that makes it easy to
assign privileges and so protect important data from being attacked.
REFERENCES
[1] P. N. Tan and V. Kumar, Discovery of Web Robot Sessions Based on their Navigational Patterns, Data Mining and
Knowledge Discovery, 3rd ed. Jan. 2002.
[2] A. Stassopoulou and M. D. Dikaiakos, Web robot detection: A probabilistic reasoning approach, The International Journal of
Computer and Telecommunications Networking, Feb. 2009.
[3] Y. Hiltunen and M. Lappalainen, Automated Personalization of Internet Users Using Self-Organizing Maps, in IDEAL,
Manchester, UK, 2002.
[4] D. Petrilis and C. Halatsis, Two-level Clustering of Web Sites Using Self-Organizing Maps, Neural Processing Letters,
Feb. 2008.
[5] J. Martín-Guerrero, E. Soria-Olivas, P. J. G. Lisboa, A. Palomares, and E. Balaguer-Ballester, User Profiling from Citizen Web
Portal Accesses using the Adaptive Resonance Theory Neural Network, San Sebastian, Spain, 2006.
[6] D. Doran and S. S. Gokhale, Web robot detection techniques: overview and limitations, Data Mining and Knowledge Discovery,
Jun. 2010.
[7] T. Kohonen, Self-Organizing Maps, 3rd ed. New York: Springer-Verlag, Berlin Heidelberg, 2001.
[8] Y. Xie and S.-Z. Yu, Monitoring the Application-Layer DDoS Attacks for Popular Websites, IEEE/ACM Transactions on
Networking, Feb. 2009.
[9] User-Agents.org. [Online], http://www.user-agents.org, Jan. 2011.
[11] Bots vs. Browsers, http://www.botsvsbrowsers.com, Jan. 2011.
[12] S. Kumar and E. Spafford, A Software Architecture to Support Misuse Intrusion Detection, 18th National Information Security
Conference, 1995.