Data Mining Techniques
ITE2006
NETWORK ABUSE DETECTION
PROJECT REPORT
SUBMITTED BY
15BIT0134 RUBAL NANDAL
15BIT0268 KEDAR KUMAR
Guided By:
Dr. Sudha M
CERTIFICATE
This is to certify that the project work entitled "NETWORK ABUSE DETECTION",
submitted by "KEDAR KUMAR (15BIT0268) and RUBAL NANDAL (15BIT0134)", is a record
of bonafide work done in Data Mining Techniques (ITE2006) under my supervision.
The contents of this project work, in full or in part, have neither been taken from
any other source nor been submitted for any other CAL course.
PLACE: VELLORE
DATE: 1/11/2017
KEDAR KUMAR (15BIT0268)
RUBAL NANDAL (15BIT0134)
Table of Contents
Acknowledgements
Problem Statement
Approach
Modules
Proposed Implementation
Implementation
Conclusion
References
ACKNOWLEDGEMENTS
We thank Dr. Sudha M for her guidance and help in carrying out this
project. We also thank everyone else who contributed to its successful
completion. We are grateful to the University Management and the School
Dean for giving us the chance to carry out our studies at the University.
Thank you for such an outstanding opportunity.
Problem Statement
Nowadays many attacks are carried out against people with malicious intent, and most of them
are network attacks. We therefore attempt to develop a network abuse detection (intrusion
detection) system from the KDD-1999 data set that tries to identify normal connections and
attacked connections.
Software to detect network intrusions protects a computer network from unauthorized users, including
perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a
classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and
"good" normal connections.
A connection is a sequence of TCP packets starting and ending at some well defined times,
between which data flows to and from a source IP address to a target IP address under some well
defined protocol. Each connection is labelled as either normal, or as an attack, with exactly one
specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
- PROBING: surveillance and other probing, e.g., port scanning.
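The four categories can be expressed as a lookup from the raw dataset labels. The following is a minimal sketch, assuming the training labels map to categories as in the attack-type list published with the KDD Cup 1999 data; the helper name categorize is hypothetical and not part of our notebooks.

```python
# Mapping from KDD Cup 1999 training labels (without the trailing dot)
# to the four attack categories described above.
ATTACK_CATEGORY = {
    "back": "DOS", "land": "DOS", "neptune": "DOS",
    "pod": "DOS", "smurf": "DOS", "teardrop": "DOS",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
    "ipsweep": "PROBING", "nmap": "PROBING",
    "portsweep": "PROBING", "satan": "PROBING",
}


def categorize(label):
    """Return 'NORMAL', one of the four categories, or 'UNKNOWN'."""
    name = label.rstrip(".")  # raw labels carry a trailing dot, e.g. 'smurf.'
    if name == "normal":
        return "NORMAL"
    return ATTACK_CATEGORY.get(name, "UNKNOWN")


print(categorize("smurf."))
print(categorize("guess_passwd."))
```

Labels that do not appear in the training data fall through to 'UNKNOWN', which is the case the clustering step is meant to handle.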
ABOUT DATASET
Our dataset contains the following features.
Table 1: Basic features of individual TCP connections
feature name | description | type
duration | length (number of seconds) of the connection | continuous
protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete
service | network service on the destination, e.g., http, telnet, etc. | discrete
src_bytes | number of data bytes from source to destination | continuous
dst_bytes | number of data bytes from destination to source | continuous
flag | normal or error status of the connection | discrete
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete
wrong_fragment | number of "wrong" fragments | continuous
urgent | number of urgent packets | continuous
Table 2: Content features within a connection suggested by domain knowledge
feature name | description | type
hot | number of "hot" indicators | continuous
num_failed_logins | number of failed login attempts | continuous
logged_in | 1 if successfully logged in; 0 otherwise | discrete
num_compromised | number of "compromised" conditions | continuous
root_shell | 1 if root shell is obtained; 0 otherwise | discrete
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete
num_root | number of "root" accesses | continuous
num_file_creations | number of file creation operations | continuous
num_shells | number of shell prompts | continuous
num_access_files | number of operations on access control files | continuous
num_outbound_cmds | number of outbound commands in an ftp session | continuous
is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete
Table 3: Traffic features computed using a two-second time window
feature name | description | type
count | number of connections to the same host as the current connection in the past two seconds | continuous
Note: The following features refer to these same-host connections.
serror_rate | % of connections that have "SYN" errors | continuous
rerror_rate | % of connections that have "REJ" errors | continuous
same_srv_rate | % of connections to the same service | continuous
diff_srv_rate | % of connections to different services | continuous
srv_count | number of connections to the same service as the current connection in the past two seconds | continuous
Note: The following features refer to these same-service connections.
srv_serror_rate | % of connections that have "SYN" errors | continuous
srv_rerror_rate | % of connections that have "REJ" errors | continuous
srv_diff_host_rate | % of connections to different hosts | continuous
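To make the two-second window concrete, the same-host features could be derived roughly as follows. This is only a sketch under stated assumptions: the Conn record is a hypothetical minimal stand-in for a real connection record, the set of flags counted as "SYN" errors is an assumption, the current connection itself is not counted, and rates are returned as fractions between 0 and 1.

```python
from collections import namedtuple

# Hypothetical minimal connection record; real records carry many more fields.
Conn = namedtuple("Conn", "time dst_host service flag")

WINDOW = 2.0                                 # seconds, as in Table 3
SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}   # assumption: flags treated as "SYN" errors


def same_host_features(history, current):
    """Compute count, serror_rate and same_srv_rate for `current`.

    Only earlier connections to the same destination host that started
    within the past WINDOW seconds are taken into account.
    """
    recent = [c for c in history
              if c.dst_host == current.dst_host
              and 0 <= current.time - c.time <= WINDOW]
    count = len(recent)
    if count == 0:
        return {"count": 0, "serror_rate": 0.0, "same_srv_rate": 0.0}
    syn_errors = sum(c.flag in SYN_ERROR_FLAGS for c in recent)
    same_srv = sum(c.service == current.service for c in recent)
    return {"count": count,
            "serror_rate": syn_errors / count,
            "same_srv_rate": same_srv / count}


history = [
    Conn(0.0, "10.0.0.5", "http", "SF"),     # outside the two-second window
    Conn(9.0, "10.0.0.5", "http", "S0"),
    Conn(9.5, "10.0.0.5", "telnet", "SF"),
    Conn(9.8, "10.0.0.9", "http", "S0"),     # different host, ignored
]
window_features = same_host_features(history, Conn(10.0, "10.0.0.5", "http", "SF"))
print(window_features)
```

The same-service features (srv_count and the srv_* rates) would follow the same pattern with the filter applied to the service instead of the destination host.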
Approach
1) First, we will do some exploratory data analysis using Pandas.
2) After that, we will do data pre-processing and remove unnecessary features (attributes) from
our dataset.
3) Then we will use clustering and anomaly detection. We want our model to work well with
unknown attack types and also to give an approximation of the closest attack type. We will use
k-means clustering.
4) Then we will build a classifier using Scikit-learn (a machine learning library).
Our classifier will just classify entries as normal or attack. By doing so, we can
generalise the model to new attack types.
Modules
1) Data Pre-processing:
Initially, we will consider all features. Not all of them are numerical, so the categorical
variables need special handling. We will do feature selection to remove unwanted features and
reduce the dimensionality of our data.
2) KMeans clustering
We will apply an anomaly detection approach to the reduced dataset. We will start with k-means
clustering. Once we have the cluster centres, we can use them to identify the attack or normal
clusters in a new dataset.
3) Classification
In classification we will train a classifier on our dataset, use that classifier to predict the
labels of another data file, and then measure the accuracy of those predictions against the
true labels.
4) Predictions
Based on the assumption that new attack types will resemble old types, we will be able to detect
them. Moreover, anything that falls too far from every cluster will be considered anomalous and
therefore a possible attack.
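The rule that a point far from every cluster is treated as a possible attack can be sketched as a distance check against the cluster centres. This is a minimal illustration with toy two-dimensional centres; the function names and the threshold value are hypothetical, and choosing a suitable threshold for real data is left open.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def is_anomalous(point, centroids, threshold):
    """Flag a point whose nearest centroid is farther away than `threshold`."""
    _, distance = nearest_centroid(point, centroids)
    return distance > threshold


centroids = [(0.0, 0.0), (10.0, 10.0)]   # toy cluster centres
THRESHOLD = 3.0                           # hypothetical cut-off distance

print(nearest_centroid((1.0, 1.0), centroids))
print(is_anomalous((5.0, 5.0), centroids, THRESHOLD))
```

A point that is not anomalous inherits the character (attack or normal) of its nearest cluster, which is how new attack types resembling old ones would be caught.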
Proposed Implementation Framework
[Figure: framework diagram. Boxes shown: KDD-1999 labelled raw dataset; feature selection and scaling; KDD-1999 labelled dataset; clustering and anomaly detection; anomaly detection algorithm; KDD-1999 corrected raw dataset; KDD-1999 corrected dataset; unlabelled dataset; labelled dataset; prediction results.]
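The framework names a scaling stage. One common choice is z-score standardization, sketched below; this is an assumption about what such a stage could look like, not a record of what our notebooks do, and the function name standardize is hypothetical. Scaling matters for distance-based methods because a feature with a large range, such as src_bytes, would otherwise dominate the distances.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Rescale each feature column to zero mean and unit variance.

    `columns` maps a feature name to a list of values. Constant columns
    (zero standard deviation) are mapped to 0.0 to avoid division by zero.
    """
    scaled = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            scaled[name] = [0.0 for _ in values]
        else:
            scaled[name] = [(v - mu) / sigma for v in values]
    return scaled


toy = {
    "src_bytes": [0.0, 1000.0, 2000.0],
    "num_outbound_cmds": [0.0, 0.0, 0.0],   # a constant column
}
scaled = standardize(toy)
print(scaled["src_bytes"])
```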
Implementation
1) CLUSTERING
LOADING THE DATA
import pandas
from time import time
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
# Load the 10% training subset (adjust the path to the local copy of the dataset)
kdd_data_10percent = pandas.read_csv(
    "D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
CLUSTERING
from sklearn.cluster import KMeans
k = 30  # number of clusters
km = KMeans(n_clusters=k)
t0 = time()
km.fit(features)
tt = time() - t0
print("Clustered in", round(tt, 3), "seconds")
# Visualise a sample of the cluster assignments
for i in range(600, 620):
    print(km.labels_[i])
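For readers unfamiliar with what k-means does internally, its two alternating steps can be sketched in plain Python. This is an illustrative toy version of Lloyd's algorithm with caller-supplied starting centroids, not the scikit-learn implementation, which uses a more careful initialisation and is far faster.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Run Lloyd's algorithm from the given starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members)
                                           for axis in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centres, assignment = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centres)
print(assignment)
```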
ASSIGNING LABELS
labels = kdd_data_10percent['label']
# Collect the true labels of the connections assigned to each cluster
label_names = [labels[km.labels_ == c] for c in range(k)]
for i in range(k):
    print("Cluster", i, "labels:")
    print(label_names[i].value_counts(), "\n")
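Once the label counts per cluster are known, each cluster can be given the label that occurs most often in it, and new connections can inherit the label of the cluster they are assigned to. The following sketch uses toy data; majority_labels is a hypothetical helper, and majority voting is one possible labelling rule rather than something our notebooks implement.

```python
from collections import Counter, defaultdict


def majority_labels(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts[cluster][label] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in counts.items()}


# Toy training assignments: cluster id and true label per connection.
cluster_ids = [0, 0, 0, 1, 1, 2]
true_labels = ["smurf.", "smurf.", "normal.", "normal.", "normal.", "neptune."]
cluster_to_label = majority_labels(cluster_ids, true_labels)

# Label new connections by the cluster each one was assigned to.
predicted = [cluster_to_label[c] for c in [0, 2, 1]]
print(cluster_to_label)
print(predicted)
```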
LOADING TESTING DATA
kdd_data_corrected = pandas.read_csv("D:studysem5dataminingprojectdatasetdatacorrected",
header=None, names = col_names)
ASSIGNING CLUSTERS
t0 = time()
pred = km.predict(kdd_data_corrected[num_features])
tt = time() - t0
print ("Assigned clusters in",round(tt,3)," seconds")
2) CLASSIFICATIONS
LOADING THE DATA
import pandas
from time import time
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
# Load the 10% training subset (adjust the path to the local copy of the dataset)
kdd_data_10percent = pandas.read_csv(
    "D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
ADDING LABELS
from sklearn.neighbors import KNeighborsClassifier
labels = kdd_data_10percent['label'].copy()
labels[labels!='normal.'] = 'attack.'
labels.value_counts()
1) TRAINING CLASSIFIER WITH BALL TREE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'ball_tree', leaf_size=500)
t0 = time()
clf.fit(features,labels)
tt = time() - t0
print ("Classifier trained in",round(tt,3),"seconds")
LOADING TESTING DATA
kdd_data_corrected = pandas.read_csv("D:studysem5dataminingprojectdatasetdatacorrected",
header=None, names = col_names)
kdd_data_corrected['label'].value_counts()
CONVERTING LABELS
# Use .loc so the assignment modifies the DataFrame itself rather than a temporary copy
kdd_data_corrected.loc[kdd_data_corrected['label'] != 'normal.', 'label'] = 'attack.'
kdd_data_corrected['label'].value_counts()
CREATING TEST SAMPLE
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    kdd_data_corrected[num_features],
    kdd_data_corrected['label'],
    test_size=0.1,
    random_state=42)
PREDICTING
t0 = time()
pred = clf.predict(features_test)
tt = time() - t0
print ("Predicted in",round(tt,3)," seconds")
CHECKING ACCURACY
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
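The value returned by accuracy_score is the fraction of test connections whose predicted label matches the true label, which is classification accuracy rather than an R squared statistic. Since the two classes may be unevenly represented, precision and recall for the attack class can be informative alongside it. The sketch below computes all three by hand on toy labels; binary_report is a hypothetical helper, not a scikit-learn function.

```python
def binary_report(true_labels, predicted_labels, positive="attack."):
    """Return accuracy, precision and recall for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            if truth == positive:
                tp += 1   # attack correctly flagged
            else:
                fp += 1   # normal traffic flagged as attack
        else:
            if truth == positive:
                fn += 1   # attack missed
            else:
                tn += 1   # normal traffic correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


truth = ["attack.", "attack.", "attack.", "normal.", "normal."]
guesses = ["attack.", "attack.", "normal.", "normal.", "attack."]
report = binary_report(truth, guesses)
print(report)
```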
2) TRAINING CLASSIFIER WITH KD-TREE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'kd_tree', leaf_size=500)
t0 = time()
clf.fit(features,labels)
tt = time() - t0
print ("Classifier trained in",round(tt,3),"seconds")
ACCURACY
from sklearn.metrics import accuracy_score
# Predict again with the newly trained classifier before scoring
pred = clf.predict(features_test)
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
3) TRAINING CLASSIFIER WITH BRUTEFORCE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'brute', leaf_size=500)
t0 = time()
clf.fit(features,labels)
tt = time() - t0
print ("Classifier trained in",round(tt,3),"seconds")
ACCURACY
from sklearn.metrics import accuracy_score
# Predict again with the newly trained classifier before scoring
pred = clf.predict(features_test)
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
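The brute-force option compares a query against every training point, while the ball tree and KD tree options build an index to organise that neighbour search. What a brute-force 5-nearest-neighbour vote amounts to can be sketched in plain Python as below; knn_predict is a hypothetical illustration on toy points, not the scikit-learn implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Brute-force k-nearest-neighbour majority vote for one query point."""
    # Rank every training point by its Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Let the k closest points vote on the label.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
train_labels = ["normal."] * 3 + ["attack."] * 3

print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))
print(knn_predict(train_points, train_labels, (9.5, 9.5), k=3))
```

Because every prediction scans the whole training set, the cost of this approach grows with the number of training connections, which is the motivation for the tree-based alternatives.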
CONCLUSION
We have formed clusters. These clusters can be used with real data to tell an
attack from a normal connection. Anything falling far from every cluster can also
be considered an attack.
From classification we obtained the results tabulated below.
ALGORITHM | TIME FOR TRAINING | ACCURACY
Ball-Tree | Least | 0.925 (near max)
KD-Tree | Slightly higher than Ball-Tree | 0.820 (least)
Brute force | High | 0.932 (maximum)
From our experiment we concluded that brute force is the most expensive algorithm
but produced the highest accuracy. KD-tree gave the lowest accuracy on our data,
and ball-tree worked best overall, as it took almost the least time and reached
almost the highest accuracy.
References
Dataset
[1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Software
[2] https://spark.apache.org/downloads.html
Pyspark tutorial
[3] https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
[4] https://www.datacamp.com/community/tutorials/apache-spark-python
Research article
[5] Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009, July). A detailed analysis of
the KDD CUP 99 data set. In Computational Intelligence for Security and Defense Applications,
2009. CISDA 2009. IEEE Symposium on (pp. 1-6). IEEE.

Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
RadiNasr
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
zubairahmad848137
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
wisnuprabawa3
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
enizeyimana36
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 

Recently uploaded (20)

DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 

Data mining final report

  • 4. ACKNOWLEDGEMENTS
    We thank Dr. Sudha M for her direction and help during the execution of this project. We also acknowledge everyone else who contributed to its completion, and the University Management and School Dean for giving us the chance to carry out our studies at the University.

    Problem Statement
    Many attacks are carried out today with malicious intent, and most of them are network attacks. We therefore attempt to build a network abuse (intrusion) detection system from the KDD-1999 data set that distinguishes normal connections from attacked connections.

    Detecting network intrusions protects a computer network from unauthorized users, possibly including insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

    A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labelled as either normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

    Attacks fall into four main categories:
    - DOS: denial-of-service, e.g. syn flood;
    - R2L: unauthorized access from a remote machine, e.g. guessing password;
    - U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
    - PROBING: surveillance and other probing, e.g. port scanning.
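The four categories above can be attached to raw connection labels with a simple lookup. The sketch below is illustrative only: the dictionary lists a handful of well-known KDD Cup 99 labels rather than the full label set, and the names `ATTACK_CATEGORY` and `categorize` are hypothetical helpers, not part of the project code.

```python
# Illustrative subset of KDD Cup 99 labels (labels in the data files end with a dot).
# The real data set contains more attack labels than are listed here.
ATTACK_CATEGORY = {
    "neptune.": "DOS",          # SYN flood
    "smurf.": "DOS",
    "guess_passwd.": "R2L",     # password guessing
    "buffer_overflow.": "U2R",
    "portsweep.": "PROBING",    # port scanning
    "nmap.": "PROBING",
}


def categorize(label: str) -> str:
    """Map a raw label to 'normal', one of the four categories, or 'unknown'."""
    if label == "normal.":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


for raw in ["normal.", "neptune.", "portsweep.", "some_new_attack."]:
    print(raw, "->", categorize(raw))
```

Returning "unknown" for unlisted labels keeps the lookup safe when the test file contains attack types that never appear in the training file.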
  • 5. ABOUT DATASET
    Our dataset contains these features.

    Table 1: Basic features of individual TCP connections
    feature name | description | type
    duration | length (number of seconds) of the connection | continuous
    protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete
    service | network service on the destination, e.g. http, telnet, etc. | discrete
    src_bytes | number of data bytes from source to destination | continuous
    dst_bytes | number of data bytes from destination to source | continuous
    flag | normal or error status of the connection | discrete
    land | 1 if connection is from/to the same host/port; 0 otherwise | discrete
    wrong_fragment | number of "wrong" fragments | continuous
    urgent | number of urgent packets | continuous

    Table 2: Content features within a connection suggested by domain knowledge
    feature name | description | type
    hot | number of "hot" indicators | continuous
    num_failed_logins | number of failed login attempts | continuous
    logged_in | 1 if successfully logged in; 0 otherwise | discrete
    num_compromised | number of "compromised" conditions | continuous
    root_shell | 1 if root shell is obtained; 0 otherwise | discrete
    su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete
    num_root | number of "root" accesses | continuous
  • 6. Table 2 (continued)
    feature name | description | type
    num_file_creations | number of file creation operations | continuous
    num_shells | number of shell prompts | continuous
    num_access_files | number of operations on access control files | continuous
    num_outbound_cmds | number of outbound commands in an ftp session | continuous
    is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete
    is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete

    Table 3: Traffic features computed using a two-second time window
    feature name | description | type
    count | number of connections to the same host as the current connection in the past two seconds | continuous
    Note: The following features refer to these same-host connections.
    serror_rate | % of connections that have "SYN" errors | continuous
    rerror_rate | % of connections that have "REJ" errors | continuous
    same_srv_rate | % of connections to the same service | continuous
    diff_srv_rate | % of connections to different services | continuous
    srv_count | number of connections to the same service as the current connection in the past two seconds | continuous
    Note: The following features refer to these same-service connections.
    srv_serror_rate | % of connections that have "SYN" errors | continuous
    srv_rerror_rate | % of connections that have "REJ" errors | continuous
    srv_diff_host_rate | % of connections to different hosts | continuous
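To make the two-second window in Table 3 concrete, here is a minimal sketch of how `count`, `serror_rate` and `same_srv_rate` could be derived from earlier connections. The data set ships with these features already computed, so this is not the procedure used to build it. The simplified record layout, the function name `window_features`, the choice to return rates as fractions, and the treatment of the `S0` flag as a SYN error are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified record: (timestamp in seconds, destination host, service, flag)
Conn = Tuple[float, str, str, str]


def window_features(history: List[Conn], current: Conn, window: float = 2.0) -> Dict[str, float]:
    """Compute same-host traffic features over the `window` seconds before `current`."""
    t, host, service, _ = current
    # Earlier connections that fall inside the time window.
    recent = [c for c in history if 0 <= t - c[0] <= window]
    # Keep only those that went to the same destination host.
    same_host = [c for c in recent if c[1] == host]
    count = len(same_host)
    if count == 0:
        return {"count": 0, "serror_rate": 0.0, "same_srv_rate": 0.0}
    syn_errors = sum(1 for c in same_host if c[3] == "S0")  # assumption: S0 marks a SYN error
    same_srv = sum(1 for c in same_host if c[2] == service)
    return {
        "count": count,
        "serror_rate": syn_errors / count,
        "same_srv_rate": same_srv / count,
    }


history = [
    (0.0, "h1", "http", "SF"),    # older than two seconds, ignored
    (1.0, "h1", "http", "S0"),
    (1.5, "h2", "http", "SF"),    # different host, ignored
    (2.5, "h1", "telnet", "S0"),
]
print(window_features(history, (3.0, "h1", "http", "SF")))
```

A burst of connections to one host with a high `serror_rate` is the kind of pattern a SYN flood produces, which is why these windowed features help separate DOS traffic from normal traffic.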
  • 7. Approach
    1) First we do some exploratory data analysis using Pandas.
    2) Next we pre-process the data and remove unnecessary features (attributes) from our dataset.
    3) Then we use clustering and anomaly detection. We want our model to work well with unknown attack types and also to give an approximation of the closest attack type. We use k-means clustering.
    4) Finally we build a classifier using Scikit-learn (a machine learning library). Our classifier only classifies entries as normal or attack. By doing so, we can generalise the model to new attack types.
  • 8. Modules
    1) Data Pre-processing
    Initially we consider all features. Not all of them are numerical, so the categorical variables need handling; we apply feature selection to remove unwanted features and reduce the dimensionality of our data.

    2) KMeans clustering
    We apply an anomaly detection approach to the reduced dataset, starting with k-means clustering. Once we have the cluster centres, we can use them to identify clusters of attack or normal connections in a new dataset.

    3) Classification
    We train a classifier on our dataset, use it to predict the labels of a separate data file, and then measure the accuracy of its predictions.

    4) Predictions
    Based on the assumption that new attack types will resemble old ones, we will be able to detect them. Moreover, anything that falls too far from every cluster will be considered anomalous and therefore a possible attack.
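The rule in the Predictions module (a point far from every cluster centre is a possible attack) can be sketched as a distance check. The centres and the threshold below are hypothetical values chosen for illustration; in practice the centres would come from the fitted k-means model and the threshold would need tuning on the data. The function names are also illustrative.

```python
import math
from typing import List, Sequence, Tuple


def nearest_centre(point: Sequence[float], centres: List[Sequence[float]]) -> Tuple[int, float]:
    """Return the index of the closest centre and the Euclidean distance to it."""
    distances = [math.dist(point, c) for c in centres]
    idx = min(range(len(centres)), key=distances.__getitem__)
    return idx, distances[idx]


def is_anomalous(point: Sequence[float], centres: List[Sequence[float]], threshold: float) -> bool:
    """Flag a point whose distance to its nearest centre exceeds the threshold."""
    _, distance = nearest_centre(point, centres)
    return distance > threshold


centres = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical cluster centres
threshold = 3.0                        # hypothetical cut-off distance

print(nearest_centre((1.0, 1.0), centres))            # close to the first centre
print(is_anomalous((5.0, 5.0), centres, threshold))   # far from both centres
```

Because Euclidean distance is dominated by features with large ranges (such as byte counts), a check like this is usually more meaningful after the features have been scaled.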
  • 9. Proposed Implementation Framework
    [Figure: flow diagram. Box labels: KDD-1999 labelled raw dataset; KDD-1999 corrected raw dataset; feature selection and scaling; KDD-1999 labelled dataset; KDD-1999 corrected dataset; clustering and anomaly detection; anomaly detection algorithm; unlabelled dataset; labelled dataset; prediction results.]
  • 10. Implementation
    1) CLUSTERING

    LOADING THE DATA
    In [2]:
        import pandas
        from time import time

        # Column names for the KDD Cup 99 connection records
        col_names = ["duration","protocol_type","service","flag","src_bytes",
            "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
            "logged_in","num_compromised","root_shell","su_attempted","num_root",
            "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
            "is_host_login","is_guest_login","count","srv_count","serror_rate",
            "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
            "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
            "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
            "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
            "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

        # Load the 10% training subset; the file has no header row
        kdd_data_10percent = pandas.read_csv(
            "D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected",
            header=None, names=col_names)
        kdd_data_10percent.describe()
  • 11. OUTPUT

    VIEWING THE LABELS
    In [3]:
        kdd_data_10percent['label'].value_counts()

    OUTPUT
  • 12. FEATURE SELECTION
    In [4]:
        # Keep only the numerical columns (protocol_type, service, flag and label are dropped)
        num_features = [
            "duration","src_bytes",
            "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
            "logged_in","num_compromised","root_shell","su_attempted","num_root",
            "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
            "is_host_login","is_guest_login","count","srv_count","serror_rate",
            "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
            "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
            "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
            "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
            "dst_host_rerror_rate","dst_host_srv_rerror_rate"
        ]
        features = kdd_data_10percent[num_features].astype(float)
        features.describe()

    OUTPUT
  • 13. CLUSTERING
        from sklearn.cluster import KMeans

        k = 30
        km = KMeans(n_clusters=k)
        t0 = time()
        km.fit(features)
        tt = time() - t0
        print("Clustered in", round(tt, 3), "seconds")

        # Visualising a sample of cluster assignments
        for i in range(600, 620):
            print(km.labels_[i])

    ASSIGNING LABELS
        labels = kdd_data_10percent['label']
        # For each cluster, collect the labels of the connections assigned to it
        label_names = [labels[km.labels_ == x] for x in range(k)]
        for i in range(k):
            print("Cluster", i, "labels:")
            print(label_names[i].value_counts(), "\n")
  • 14. LOADING TESTING DATA
        kdd_data_corrected = pandas.read_csv(
            "D:studysem5dataminingprojectdatasetdatacorrected",
            header=None, names=col_names)

    ASSIGNING CLUSTERS
        t0 = time()
        # Assign each test connection to its nearest cluster centre
        pred = km.predict(kdd_data_corrected[num_features])
        tt = time() - t0
        print("Assigned clusters in", round(tt, 3), "seconds")
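Once every test connection has a cluster id, one way to turn that id into a label is to give each cluster the most common label among its training members. The report does not show this step, so the sketch below is an assumption about how the cluster assignments might be used; the helper name `majority_labels` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(cluster_ids, labels):
    """Map each cluster id to the most frequent label among its members."""
    by_cluster = defaultdict(Counter)
    for cid, lab in zip(cluster_ids, labels):
        by_cluster[cid][lab] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in by_cluster.items()}


# Toy stand-ins for km.labels_ and the training labels
train_clusters = [0, 0, 0, 1, 1, 2]
train_labels = ["normal.", "normal.", "smurf.", "smurf.", "smurf.", "neptune."]
cluster_to_label = majority_labels(train_clusters, train_labels)

# Toy stand-in for the cluster ids predicted on the test file
test_clusters = [1, 0, 2]
predicted = [cluster_to_label[c] for c in test_clusters]
print(predicted)
```

A cluster whose members are mixed (like cluster 0 in the toy data) will mislabel its minority members, so inspecting the per-cluster label counts before trusting the mapping is worthwhile.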
  • 15. 2) CLASSIFICATIONS

    LOADING THE DATA
    In [2]:
        import pandas
        from time import time

        # Column names for the KDD Cup 99 connection records
        col_names = ["duration","protocol_type","service","flag","src_bytes",
            "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
            "logged_in","num_compromised","root_shell","su_attempted","num_root",
            "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
            "is_host_login","is_guest_login","count","srv_count","serror_rate",
            "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
            "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
            "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
            "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
            "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

        # Load the 10% training subset; the file has no header row
        kdd_data_10percent = pandas.read_csv(
            "D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected",
            header=None, names=col_names)
        kdd_data_10percent.describe()

    OUTPUT
  • 16. VIEWING THE LABELS
    In [3]:
        kdd_data_10percent['label'].value_counts()

    OUTPUT

    FEATURE SELECTION
  • 17. In [4]:
        # Keep only the numerical columns (protocol_type, service, flag and label are dropped)
        num_features = [
            "duration","src_bytes",
            "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
            "logged_in","num_compromised","root_shell","su_attempted","num_root",
            "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
            "is_host_login","is_guest_login","count","srv_count","serror_rate",
            "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
            "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
            "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
            "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
            "dst_host_rerror_rate","dst_host_srv_rerror_rate"
        ]
        features = kdd_data_10percent[num_features].astype(float)
        features.describe()

    OUTPUT
  • 18. ADDING LABELS
        from sklearn.neighbors import KNeighborsClassifier

        # Collapse every attack type into a single 'attack.' class
        labels = kdd_data_10percent['label'].copy()
        labels[labels != 'normal.'] = 'attack.'
        labels.value_counts()

    1) TRAINING CLASSIFIER WITH BALL TREE
        # algorithm options: brute force, ball tree, kd-tree
        clf = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', leaf_size=500)
        t0 = time()
        clf.fit(features, labels)
        tt = time() - t0
        print("Classifier trained in", round(tt, 3), "seconds")

    LOADING TESTING DATA
        kdd_data_corrected = pandas.read_csv(
            "D:studysem5dataminingprojectdatasetdatacorrected",
            header=None, names=col_names)
        kdd_data_corrected['label'].value_counts()
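As a reference for what the k-nearest-neighbours classifier computes, here is a minimal pure-Python sketch of the brute-force variant: measure the distance from the query to every training point, keep the k closest, and take a majority vote. It is a teaching sketch on toy data, not the scikit-learn implementation; the ball tree and kd-tree options differ in how they search for neighbours rather than in the voting rule. The function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=5):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to x and keep the k closest.
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy two-feature data: three normal points near the origin, three attacks near (5, 5)
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["normal.", "normal.", "normal.", "attack.", "attack.", "attack."]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))
```

Each prediction scans the whole training set, which is why the brute-force search becomes expensive on data sets of this size.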
  • 19. CONVERTING LABELS
        # Collapse every attack type in the test file into a single 'attack.' class
        kdd_data_corrected.loc[kdd_data_corrected['label'] != 'normal.', 'label'] = 'attack.'
        kdd_data_corrected['label'].value_counts()

    CREATING TEST SAMPLE
        # train_test_split lives in sklearn.model_selection in current scikit-learn releases
        from sklearn.model_selection import train_test_split

        features_train, features_test, labels_train, labels_test = train_test_split(
            kdd_data_corrected[num_features],
            kdd_data_corrected['label'],
            test_size=0.1,
            random_state=42)

    PREDICTING
        t0 = time()
        pred = clf.predict(features_test)
        tt = time() - t0
        print("Predicted in", round(tt, 3), "seconds")
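  • 20. CHECKING ACCURACY
        from sklearn.metrics import accuracy_score

        # accuracy_score expects (true labels, predicted labels) and returns
        # the fraction of correct predictions, not an R squared value
        acc = accuracy_score(labels_test, pred)
        print("Accuracy is", round(acc, 4))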
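The score returned by `accuracy_score` is the fraction of test connections whose predicted label matches the true label. The sketch below computes that fraction by hand on toy labels, together with the four confusion counts, which show whether errors are missed attacks or false alarms. The function name `evaluate` and the toy labels are hypothetical.

```python
def evaluate(y_true, y_pred, positive="attack."):
    """Return accuracy and confusion counts, treating `positive` as the attack class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive and guess == positive:
            tp += 1   # attack correctly flagged
        elif truth == positive:
            fn += 1   # attack missed
        elif guess == positive:
            fp += 1   # normal connection flagged as attack
        else:
            tn += 1   # normal connection correctly passed
    return {"accuracy": (tp + tn) / len(y_true), "tp": tp, "fp": fp, "tn": tn, "fn": fn}


y_true = ["attack.", "attack.", "normal.", "normal.", "attack."]
y_pred = ["attack.", "normal.", "normal.", "attack.", "attack."]
print(evaluate(y_true, y_pred))
```

In intrusion detection a missed attack and a false alarm usually carry different costs, so the counts are worth reporting alongside the single accuracy figure.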
  • 21. 2) TRAINING CLASSIFIER WITH KD-TREE
        # algorithm options: brute force, ball tree, kd-tree
        # scikit-learn spells this option 'kd_tree'
        clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=500)
        t0 = time()
        clf.fit(features, labels)
        tt = time() - t0
        print("Classifier trained in", round(tt, 3), "seconds")

    ACCURACY
        from sklearn.metrics import accuracy_score

        # Predict again with the newly trained classifier before scoring
        pred = clf.predict(features_test)
        acc = accuracy_score(labels_test, pred)
        print("Accuracy is", round(acc, 4))
  • 22. 3) TRAINING CLASSIFIER WITH BRUTEFORCE
        # algorithm options: brute force, ball tree, kd-tree
        # scikit-learn spells this option 'brute'; leaf_size has no effect for it
        clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', leaf_size=500)
        t0 = time()
        clf.fit(features, labels)
        tt = time() - t0
        print("Classifier trained in", round(tt, 3), "seconds")

    ACCURACY
        from sklearn.metrics import accuracy_score

        # Predict again with the newly trained classifier before scoring
        pred = clf.predict(features_test)
        acc = accuracy_score(labels_test, pred)
        print("Accuracy is", round(acc, 4))
  • 23. CONCLUSION
    We have formed clusters. These clusters can be used with real data to separate attacks from normal connections, and anything falling far from every cluster can also be treated as an attack.

    From classification we obtained the results tabulated below.

    ALGORITHM | TIME FOR TRAINING | ACCURACY
    Ball-tree | Least | 0.925 (near max)
    KD-tree | Slightly higher than ball-tree | 0.820 (least)
    Brute force | High | 0.932 (maximum)

    From our experiment we conclude that brute force is the most expensive algorithm but produced the highest accuracy, while kd-tree gave the lowest accuracy on our data. Ball-tree worked best overall, as it took almost the least time and reached almost the highest accuracy.