SlideShare a Scribd company logo
1 of 4
Download to read offline
Traffic Classification using a Statistical Approach
Denis Zuev1
and Andrew W. Moore2
1
University of Oxford, Mathematical Institute, zuev@maths.ox.ac.uk
2
University of Cambridge, Computer Laboratory, andrew.moore@cl.cam.ac.uk
Abstract. Accurate traffic classification is the keystone of numerous network
activities. Our work capitalises on hand-classified network data, used as input to
a supervised Bayes estimator. We illustrate the high level of accuracy achieved
with a supervised Na¨ıve Bayes estimator; with the simplest estimator we are able
to achieve better than 83% accuracy on both a per-byte and a per-packet basis.
1 Introduction
Traffic classification enables a variety of other applications and topics, including Qual-
ity of Service, security, monitoring, and intrusion-detection that are of use to researchers,
accountants, network operators and end users. Capitalising on network traffic that had
been previously hand-classified provides us with training and testing data-sets. We use a
supervised Bayes algorithm to demonstrate an accuracy of better than 66% of flows and
better than 83% for packets and bytes. Further, we require only the network protocol
headers of unknown traffic for a successful classification stage.
While machine-learning has been used previously for network-traffic/flow classification
e.g., [1], we consider our work to be the first that combines this technique with the use
of accurate test and training data-sets.
2 Experiment
In order to perform analysis of data using the Na¨ıve Bayes technique, appropriate input
data is needed. To do this, we capitalised on trace-data described and categorised in [2].
This classified data was further reduced, and split into 10 equal time intervals each con-
taining around 25,000–65,000 objects (flows). To evaluate the performance of the Na¨ıve
Bayes technique, each dataset was used as a training set in turn and evaluated against
the remaining datasets, allowing computation of the average accuracy of classification.
Traffic categories Fundamental to classification work is the idea of classes of traffic.
Throughout this work we use classes of traffic defined as a common group of user-
centric applications. Other users of classification may have both simpler definitions,
This work was completed when Denis Zuev was employed by Intel Research, Cambridge.
Andrew Moore thanks the Intel Corporation for its generous support of his research fellowship
e.g., Normal versus Malicious, or more complex definitions, e.g., the identification of
specific applications or specific TCP implementations.
Described further in [2], we consider the following categories of traffic: BULK (e.g.,
ftp), DATABASE (i.e., postgres, etc.), INTERACTIVE (ssh, telnet), MAIL (smtp, etc.),
SEVICES (X11, dns), WWW, P2P (e.g., KaZaA, ...), ATTACK (virus and worm at-
tacks), GAMES (Half-Life, ...), MULTIMEDIA (Windows Media Player, ...).
Importantly, the characteristics of the traffic within each category are not necessarily
unique. For example, the BULK category which is made up of ftp traffic, consists of
both the control channel, which transfers data in both directions, and the data channel
consisting a simplex flow of data for each object transferred. The assignment of cate-
gories to applications is an artificial grouping that further illustrates that such arbitrary
clustering of only-minimally-related traffic-types is possible with our approach.
Objects and Discriminators Our central object for classification is the flow and for
the work presented in this extended-abstract we have limited our definition of a flow to
being a complete TCP flow — that is all the packets between two hosts for a specific
tuple. We restrict ourselves to complete flows, those that start and end validly, e.g., with
the first SYN, and the last FIN ACK.
As noted in Section 1, the application of a classification scheme requires the parame-
terisation of each object to be classified. Using these parameters, the classifier allocates
an object to a class, due to their ability to allow discrimination between classes. We
refer to these object-describing parameters as discriminators. In our research we have
used 249 different discriminators to describe traffic flows, these include: flow duration
statistics, TCP Port information, payload size statistics, fourier transform of the packet
interarrival time discriminators — a complete list is given in [3].
3 Method
Machine Learned classification Here we briefly describe the machine learning (ML)
approach we take, a trained Na¨ıve Bayes classifier, along with a number of the refine-
ments we use. We would direct interested readers to [4] for one of many surveys of all
ML techniques.
Several methods exist for classifying data and all of them fall into two broad classes:
deterministic and probabilistic classification. As the name suggests, deterministic clas-
sification assigns data points to one of a number of mutually-exclusive classes. This
is done by considering some metric that defines the distance between data points and
by defining the class boundaries. On the other hand, the probabilistic method classifies
data by assigning it with probabilities of belonging to each class of interest.
We believe that probabilistic classification of Internet traffic, and our approach in par-
ticular, is more suitable given the need to be robust to measurement error, to allow for
supervised training with pre-classified traffic, to be able to identify similar character-
istics of flows after their probabilistic class assignment. We believe that the method
be tractable and understood, and be able to cope with the unstable-dynamic nature of
Internet traffic and that the method allow identification of when the model requires
retraining. Additionally, the method needs to be available in a number of implementa-
tions.
Na¨ıve Bayesian Classifier The main approach that is used in this work is the Na¨ıve
Bayes technique described in [5]. Consider a collection of flows x = (x1, . . . , xn),
where each flow xi is described by m discriminators {d
(i)
1 , . . . , d
(i)
m } that can take ei-
ther numeric or discrete values. In the context of the Internet traffic, d
(i)
j is a discrim-
inant of flow xi, for example it may represent the mean interarrival time of packets
in the flow xi. In this paper, flows xi belong to exactly one of the mutually-exclusive
classes described in Section 2. The supervised Bayesian classification problem deals
with building a statistical model that describes each class based on some training data,
and where each new flow y receives a probability of getting classified into a particular
class according to the Bayes rule below,
p(cj | y) =
p(cj)f(y | cj)
cj
p(cj)f(y | cj)
(1)
where p(cj) denotes the probability of obtaining class cj independently of the observed
data, f(y | cj) is the distribution function (or the probability of y given cj) and the
denominator acts as a normalising constant.
The Na¨ıve Bayes technique that is considered in this paper assumes the independence
of discriminators d1, . . . , dm as well as the simple Gaussian behaviour of them. The
authors understand that these assumptions are not realistic in the context of the Internet
traffic, but [5] suggest that this model sometimes outperforms certain more complex
models.
4 Na¨ıve Bayes Results
Our experiments have shown that the Na¨ıve Bayes technique classified on average
66.71% of the traffic correctly. Table 1 demonstrates the classification accuracy of this
techinique for each class. It can be seen from this table, that SERVICES and BULK
are very well classified, with around 90% of correctly-predicted flows. In comparision
to other results, it could be concluded that most discriminator distributions are well
separated in the Euclinean space.
At this stage, it is important to note why certain classes performed very poorly. Classes
such as GAMES and INTERACTIVE do not contain enough samples, therefore, Na¨ıve
Bayes training on these classes is not accurate or realistic. ATTACK flows were often
confused with the WWW flows, due to the similarity in discriminators.
WWW MAIL BULK SERV DB INT P2P ATT MMEDIA GAMES
Accuracy (%) 65.97 56.85 89.26 91.19 20.20 22.83 45.59 58.08 59.45 1.39
Probability (%) 98.31 90.69 90.01 35.92 61.78 7.54 4.96 1.10 32.30 100.00
Table 1. Average accuracy of classification of Na¨ıve Bayes technique by class and Probability
that the predictive class is the real class.
Alongside accuracy we consider it important to define several other metrics describ-
ing the classification technique. Table 1 shows how traffic from different classes gets
classified — clearly an important measure. However, if a network administrator were
to use our tool they would be interested in finding out how much trust can be placed
in the classification outcome. Table 1 also shows the average probability that the pre-
dicted flow class is in fact the real class, e.g., if flow xi has been classified as WWW, a
measure of trust gives us a probability that xi is in reality WWW.
A further indication of how well the Na¨ıve Bayes technique performs is to analyse the
volume of accurately-classified bytes and packets. The results obtained are: 83.98% and
83.93% of packets and bytes, respectively, were correctly classified by the Na¨ıve Bayes
technique described above. In contrast port-based classification achieved an accuracy
of 71.02% by packet and 69.07% by bytes (from [2]). Comparing results in this way
highlights the significant improvement of our Na¨ıve Bayes technique over the port-
based classification alone.
5 Conclusions & Further Work
We demonstrate that, in its simplest form, our probabilistic-classification is capable of
67% accuracy per-flow or better than 83% accuracy both per-byte and per-packet. We
maintain that access to a full-payload trace, the only definitive way to characterise net-
work applications, will be limited due to technical and legal restrictions. We illustrate
how data gathered without those restrictions may be used as training input for a sta-
tistical classifier which in turn can provide accurate, albeit estimated, classification of
header-only trace data.
References
1. Anthony McGregor et al.: Flow Clustering Using Machine Learning Techniques. In: Proceed-
ings of the Fifth Passive and Active Measurement Workshop (PAM 2004). (2004)
2. Moore, A.W., Papagiannaki, K.: Toward the accurate identification of network applications.
In: Passive & Active Measurement Workshop 2005 (PAM2005), Boston, MA (2005)
3. Moore, A., Zuev, D.: Discriminators for use in flow-based classification. Technical report,
Intel Research, Cambridge (2005)
4. Mitchell, T.: Machine Learning. McGraw Hill (1997)
5. Witten, I.H., Frank, E.: Data Mining. Morgan Kaufmann Publishers (2000)

More Related Content

What's hot

Dynamic control of coding for progressive packet arrivals in dtns
Dynamic control of coding for progressive packet arrivals in dtnsDynamic control of coding for progressive packet arrivals in dtns
Dynamic control of coding for progressive packet arrivals in dtnsJPINFOTECH JAYAPRAKASH
 
Risk-Aware Response Mechanism with Extended D-S theory
Risk-Aware Response Mechanism with Extended D-S theoryRisk-Aware Response Mechanism with Extended D-S theory
Risk-Aware Response Mechanism with Extended D-S theoryEditor IJCATR
 
An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2
An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2
An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2Aws Ali
 
Routing in Opportunistic Networks
Routing in Opportunistic NetworksRouting in Opportunistic Networks
Routing in Opportunistic NetworksAuwal Amshi
 
Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...
Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...
Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...ijasuc
 
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTINGFAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTINGIJNSA Journal
 
Content Sharing over Smartphone-Based Delay-Tolerant Networks
Content Sharing over Smartphone-Based Delay-Tolerant NetworksContent Sharing over Smartphone-Based Delay-Tolerant Networks
Content Sharing over Smartphone-Based Delay-Tolerant NetworksIJERA Editor
 
Project
Project Project
Project J M
 
IEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTION
IEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTIONIEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTION
IEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTIONranjith kumar
 
Maximizing Efficiency Of multiple–Path Source Routing in Presence of Jammer
Maximizing Efficiency Of multiple–Path Source Routing in Presence of JammerMaximizing Efficiency Of multiple–Path Source Routing in Presence of Jammer
Maximizing Efficiency Of multiple–Path Source Routing in Presence of JammerIOSR Journals
 
Dynamic control of coding for progressive packet arrivals in dt ns
Dynamic control of coding for progressive packet arrivals in dt nsDynamic control of coding for progressive packet arrivals in dt ns
Dynamic control of coding for progressive packet arrivals in dt nsIEEEFINALYEARPROJECTS
 
DPRoPHET in Delay Tolerant Network
DPRoPHET in Delay Tolerant NetworkDPRoPHET in Delay Tolerant Network
DPRoPHET in Delay Tolerant NetworkPhearin Sok
 
Routing in Delay Tolerant Networks
Routing in Delay Tolerant NetworksRouting in Delay Tolerant Networks
Routing in Delay Tolerant NetworksAnubhav Mahajan
 
Secure and energy efficient disjoint multi-path
Secure and energy efficient disjoint multi-pathSecure and energy efficient disjoint multi-path
Secure and energy efficient disjoint multi-pathIMPULSE_TECHNOLOGY
 
Finding bursty topics from microblogs
Finding bursty topics from microblogsFinding bursty topics from microblogs
Finding bursty topics from microblogsmoresmile
 

What's hot (16)

Dynamic control of coding for progressive packet arrivals in dtns
Dynamic control of coding for progressive packet arrivals in dtnsDynamic control of coding for progressive packet arrivals in dtns
Dynamic control of coding for progressive packet arrivals in dtns
 
Risk-Aware Response Mechanism with Extended D-S theory
Risk-Aware Response Mechanism with Extended D-S theoryRisk-Aware Response Mechanism with Extended D-S theory
Risk-Aware Response Mechanism with Extended D-S theory
 
An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2
An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2
An alternative Routing Mechanisms for Mobile Ad-hoc NetworksPresentation2
 
Routing in Opportunistic Networks
Routing in Opportunistic NetworksRouting in Opportunistic Networks
Routing in Opportunistic Networks
 
Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...
Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...
Exploiting Wireless Networks, through creation of Opportunity Network – Wirel...
 
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTINGFAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
 
Content Sharing over Smartphone-Based Delay-Tolerant Networks
Content Sharing over Smartphone-Based Delay-Tolerant NetworksContent Sharing over Smartphone-Based Delay-Tolerant Networks
Content Sharing over Smartphone-Based Delay-Tolerant Networks
 
Project
Project Project
Project
 
IEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTION
IEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTIONIEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTION
IEEE BE-BTECH NS2 PROJECT@ DREAMWEB TECHNO SOLUTION
 
Maximizing Efficiency Of multiple–Path Source Routing in Presence of Jammer
Maximizing Efficiency Of multiple–Path Source Routing in Presence of JammerMaximizing Efficiency Of multiple–Path Source Routing in Presence of Jammer
Maximizing Efficiency Of multiple–Path Source Routing in Presence of Jammer
 
Dynamic control of coding for progressive packet arrivals in dt ns
Dynamic control of coding for progressive packet arrivals in dt nsDynamic control of coding for progressive packet arrivals in dt ns
Dynamic control of coding for progressive packet arrivals in dt ns
 
06775257
0677525706775257
06775257
 
DPRoPHET in Delay Tolerant Network
DPRoPHET in Delay Tolerant NetworkDPRoPHET in Delay Tolerant Network
DPRoPHET in Delay Tolerant Network
 
Routing in Delay Tolerant Networks
Routing in Delay Tolerant NetworksRouting in Delay Tolerant Networks
Routing in Delay Tolerant Networks
 
Secure and energy efficient disjoint multi-path
Secure and energy efficient disjoint multi-pathSecure and energy efficient disjoint multi-path
Secure and energy efficient disjoint multi-path
 
Finding bursty topics from microblogs
Finding bursty topics from microblogsFinding bursty topics from microblogs
Finding bursty topics from microblogs
 

Viewers also liked

Mapa conceptual gerencia publica y privada
Mapa conceptual gerencia publica y privadaMapa conceptual gerencia publica y privada
Mapa conceptual gerencia publica y privadaIng. Rubilet Alvarez
 
Revista de los conejos %282%29
Revista de los conejos %282%29Revista de los conejos %282%29
Revista de los conejos %282%29Iriana Robles
 
Analysis of questionnaire
Analysis of questionnaireAnalysis of questionnaire
Analysis of questionnairefaerbs
 
Presentación final 6ta
Presentación final 6taPresentación final 6ta
Presentación final 6taPabloIvars10
 
Presentación final 5ta
Presentación final 5taPresentación final 5ta
Presentación final 5taPabloIvars10
 
La grandeza-del-mar
La grandeza-del-marLa grandeza-del-mar
La grandeza-del-marXohán Rompe
 
Actividades final primer trimestre curso 2016 2017
Actividades final primer trimestre curso 2016 2017Actividades final primer trimestre curso 2016 2017
Actividades final primer trimestre curso 2016 2017cchh07
 
Stan Novotny 2016 Resume FM
Stan Novotny 2016 Resume FMStan Novotny 2016 Resume FM
Stan Novotny 2016 Resume FMStan Novotny
 
Chapter 14
Chapter 14Chapter 14
Chapter 14ezasso
 
Foundations 30 - Ch 8 review assignment(answers)
Foundations 30 - Ch 8 review assignment(answers)Foundations 30 - Ch 8 review assignment(answers)
Foundations 30 - Ch 8 review assignment(answers)Michelle Sack
 
Visita a corea de los alumnos del ies
Visita a corea de los alumnos del iesVisita a corea de los alumnos del ies
Visita a corea de los alumnos del iesAna Hidalgo
 
Leaf Chromatography
Leaf ChromatographyLeaf Chromatography
Leaf Chromatographythegr8jigs
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)Denis Zuev
 

Viewers also liked (20)

Tercer concilio ecuménico de éfeso
Tercer concilio ecuménico de éfesoTercer concilio ecuménico de éfeso
Tercer concilio ecuménico de éfeso
 
Tumblr
TumblrTumblr
Tumblr
 
Mapa conceptual gerencia publica y privada
Mapa conceptual gerencia publica y privadaMapa conceptual gerencia publica y privada
Mapa conceptual gerencia publica y privada
 
Revista de los conejos %282%29
Revista de los conejos %282%29Revista de los conejos %282%29
Revista de los conejos %282%29
 
Plan de negocio
Plan de negocioPlan de negocio
Plan de negocio
 
Analysis of questionnaire
Analysis of questionnaireAnalysis of questionnaire
Analysis of questionnaire
 
Presentación final 6ta
Presentación final 6taPresentación final 6ta
Presentación final 6ta
 
Presentación final 5ta
Presentación final 5taPresentación final 5ta
Presentación final 5ta
 
La grandeza-del-mar
La grandeza-del-marLa grandeza-del-mar
La grandeza-del-mar
 
Actividades final primer trimestre curso 2016 2017
Actividades final primer trimestre curso 2016 2017Actividades final primer trimestre curso 2016 2017
Actividades final primer trimestre curso 2016 2017
 
Stan Novotny 2016 Resume FM
Stan Novotny 2016 Resume FMStan Novotny 2016 Resume FM
Stan Novotny 2016 Resume FM
 
NABRAIZ RESUME
NABRAIZ RESUMENABRAIZ RESUME
NABRAIZ RESUME
 
Chapter 14
Chapter 14Chapter 14
Chapter 14
 
Foundations 30 - Ch 8 review assignment(answers)
Foundations 30 - Ch 8 review assignment(answers)Foundations 30 - Ch 8 review assignment(answers)
Foundations 30 - Ch 8 review assignment(answers)
 
jairam_book_V05
jairam_book_V05jairam_book_V05
jairam_book_V05
 
Visita a corea de los alumnos del ies
Visita a corea de los alumnos del iesVisita a corea de los alumnos del ies
Visita a corea de los alumnos del ies
 
Leaf Chromatography
Leaf ChromatographyLeaf Chromatography
Leaf Chromatography
 
Backdropdource
Backdropdource Backdropdource
Backdropdource
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
 
Creatividad
CreatividadCreatividad
Creatividad
 

Similar to Traffic Classification using a Statistical Approach

Internet Traffic Classification Using Bayesian Analysis Techniques
Internet Traffic Classification Using Bayesian Analysis TechniquesInternet Traffic Classification Using Bayesian Analysis Techniques
Internet Traffic Classification Using Bayesian Analysis TechniquesDenis Zuev
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional VerificationSai Kiran Kadam
 
Traffic classification svm_im2015_10may2015
Traffic classification svm_im2015_10may2015Traffic classification svm_im2015_10may2015
Traffic classification svm_im2015_10may2015Yang Hong
 
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)Eswar Publications
 
IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...
IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...
IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...ijcseit
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Editor IJCATR
 
IEEE 2015 Java Projects
IEEE 2015 Java ProjectsIEEE 2015 Java Projects
IEEE 2015 Java ProjectsVijay Karan
 
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...IRJET Journal
 
Data mining projects topics for java and dot net
Data mining projects topics for java and dot netData mining projects topics for java and dot net
Data mining projects topics for java and dot netredpel dot com
 
IEEE 2015 Java Projects
IEEE 2015 Java ProjectsIEEE 2015 Java Projects
IEEE 2015 Java ProjectsVijay Karan
 
IEEE Networking 2016 Title and Abstract
IEEE Networking 2016 Title and AbstractIEEE Networking 2016 Title and Abstract
IEEE Networking 2016 Title and Abstracttsysglobalsolutions
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesianAhmad Amri
 
Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...ijwmn
 
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINER
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINERANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINER
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINERIJCSEA Journal
 
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...cscpconf
 
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...csandit
 
Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...redpel dot com
 
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSMULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSijcses
 
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...IRJET Journal
 
A New Way of Identifying DOS Attack Using Multivariate Correlation Analysis
A New Way of Identifying DOS Attack Using Multivariate Correlation AnalysisA New Way of Identifying DOS Attack Using Multivariate Correlation Analysis
A New Way of Identifying DOS Attack Using Multivariate Correlation Analysisijceronline
 

Similar to Traffic Classification using a Statistical Approach (20)

Internet Traffic Classification Using Bayesian Analysis Techniques
Internet Traffic Classification Using Bayesian Analysis TechniquesInternet Traffic Classification Using Bayesian Analysis Techniques
Internet Traffic Classification Using Bayesian Analysis Techniques
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional Verification
 
Traffic classification svm_im2015_10may2015
Traffic classification svm_im2015_10may2015Traffic classification svm_im2015_10may2015
Traffic classification svm_im2015_10may2015
 
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)
 
IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...
IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...
IDENTIFICATION AND INVESTIGATION OF THE USER SESSION FOR LAN CONNECTIVITY VIA...
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
 
IEEE 2015 Java Projects
IEEE 2015 Java ProjectsIEEE 2015 Java Projects
IEEE 2015 Java Projects
 
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
 
Data mining projects topics for java and dot net
Data mining projects topics for java and dot netData mining projects topics for java and dot net
Data mining projects topics for java and dot net
 
IEEE 2015 Java Projects
IEEE 2015 Java ProjectsIEEE 2015 Java Projects
IEEE 2015 Java Projects
 
IEEE Networking 2016 Title and Abstract
IEEE Networking 2016 Title and AbstractIEEE Networking 2016 Title and Abstract
IEEE Networking 2016 Title and Abstract
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesian
 
Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...Enforcing end to-end proportional fairness with bounded buffer overflow proba...
Enforcing end to-end proportional fairness with bounded buffer overflow proba...
 
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINER
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINERANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINER
ANALYSIS AND COMPARISON STUDY OF DATA MINING ALGORITHMS USING RAPIDMINER
 
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
 
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
 
Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...
 
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSMULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
 
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
 
A New Way of Identifying DOS Attack Using Multivariate Correlation Analysis
A New Way of Identifying DOS Attack Using Multivariate Correlation AnalysisA New Way of Identifying DOS Attack Using Multivariate Correlation Analysis
A New Way of Identifying DOS Attack Using Multivariate Correlation Analysis
 

Traffic Classification using a Statistical Approach

  • 1. Traffic Classification using a Statistical Approach Denis Zuev1 and Andrew W. Moore2 1 University of Oxford, Mathematical Institute, zuev@maths.ox.ac.uk 2 University of Cambridge, Computer Laboratory, andrew.moore@cl.cam.ac.uk Abstract. Accurate traffic classification is the keystone of numerous network activities. Our work capitalises on hand-classified network data, used as input to a supervised Bayes estimator. We illustrate the high level of accuracy achieved with a supervised Na¨ıve Bayes estimator; with the simplest estimator we are able to achieve better than 83% accuracy on both a per-byte and a per-packet basis. 1 Introduction Traffic classification enables a variety of other applications and topics, including Qual- ity of Service, security, monitoring, and intrusion-detection that are of use to researchers, accountants, network operators and end users. Capitalising on network traffic that had been previously hand-classified provides us with training and testing data-sets. We use a supervised Bayes algorithm to demonstrate an accuracy of better than 66% of flows and better than 83% for packets and bytes. Further, we require only the network protocol headers of unknown traffic for a successful classification stage. While machine-learning has been used previously for network-traffic/flow classification e.g., [1], we consider our work to be the first that combines this technique with the use of accurate test and training data-sets. 2 Experiment In order to perform analysis of data using the Na¨ıve Bayes technique, appropriate input data is needed. To do this, we capitalised on trace-data described and categorised in [2]. This classified data was further reduced, and split into 10 equal time intervals each con- taining around 25,000–65,000 objects (flows). To evaluate the performance of the Na¨ıve Bayes technique, each dataset was used as a training set in turn and evaluated against the remaining datasets, allowing computation of the average accuracy of classification. Traffic categories Fundamental to classification work is the idea of classes of traffic. Throughout this work we use classes of traffic defined as a common group of user- centric applications. Other users of classification may have both simpler definitions, This work was completed when Denis Zuev was employed by Intel Research, Cambridge. Andrew Moore thanks the Intel Corporation for its generous support of his research fellowship
  • 2. e.g., Normal versus Malicious, or more complex definitions, e.g., the identification of specific applications or specific TCP implementations. Described further in [2], we consider the following categories of traffic: BULK (e.g., ftp), DATABASE (i.e., postgres, etc.), INTERACTIVE (ssh, telnet), MAIL (smtp, etc.), SEVICES (X11, dns), WWW, P2P (e.g., KaZaA, ...), ATTACK (virus and worm at- tacks), GAMES (Half-Life, ...), MULTIMEDIA (Windows Media Player, ...). Importantly, the characteristics of the traffic within each category are not necessarily unique. For example, the BULK category which is made up of ftp traffic, consists of both the control channel, which transfers data in both directions, and the data channel consisting a simplex flow of data for each object transferred. The assignment of cate- gories to applications is an artificial grouping that further illustrates that such arbitrary clustering of only-minimally-related traffic-types is possible with our approach. Objects and Discriminators Our central object for classification is the flow and for the work presented in this extended-abstract we have limited our definition of a flow to being a complete TCP flow — that is all the packets between two hosts for a specific tuple. We restrict ourselves to complete flows, those that start and end validly, e.g., with the first SYN, and the last FIN ACK. As noted in Section 1, the application of a classification scheme requires the parame- terisation of each object to be classified. Using these parameters, the classifier allocates an object to a class, due to their ability to allow discrimination between classes. We refer to these object-describing parameters as discriminators. In our research we have used 249 different discriminators to describe traffic flows, these include: flow duration statistics, TCP Port information, payload size statistics, fourier transform of the packet interarrival time discriminators — a complete list is given in [3]. 3 Method Machine Learned classification Here we briefly describe the machine learning (ML) approach we take, a trained Na¨ıve Bayes classifier, along with a number of the refine- ments we use. We would direct interested readers to [4] for one of many surveys of all ML techniques. Several methods exist for classifying data and all of them fall into two broad classes: deterministic and probabilistic classification. As the name suggests, deterministic clas- sification assigns data points to one of a number of mutually-exclusive classes. This is done by considering some metric that defines the distance between data points and by defining the class boundaries. On the other hand, the probabilistic method classifies data by assigning it with probabilities of belonging to each class of interest. We believe that probabilistic classification of Internet traffic, and our approach in par- ticular, is more suitable given the need to be robust to measurement error, to allow for
  • 3. supervised training with pre-classified traffic, to be able to identify similar character- istics of flows after their probabilistic class assignment. We believe that the method be tractable and understood, and be able to cope with the unstable-dynamic nature of Internet traffic and that the method allow identification of when the model requires retraining. Additionally, the method needs to be available in a number of implementa- tions. Na¨ıve Bayesian Classifier The main approach that is used in this work is the Na¨ıve Bayes technique described in [5]. Consider a collection of flows x = (x1, . . . , xn), where each flow xi is described by m discriminators {d (i) 1 , . . . , d (i) m } that can take ei- ther numeric or discrete values. In the context of the Internet traffic, d (i) j is a discrim- inant of flow xi, for example it may represent the mean interarrival time of packets in the flow xi. In this paper, flows xi belong to exactly one of the mutually-exclusive classes described in Section 2. The supervised Bayesian classification problem deals with building a statistical model that describes each class based on some training data, and where each new flow y receives a probability of getting classified into a particular class according to the Bayes rule below, p(cj | y) = p(cj)f(y | cj) cj p(cj)f(y | cj) (1) where p(cj) denotes the probability of obtaining class cj independently of the observed data, f(y | cj) is the distribution function (or the probability of y given cj) and the denominator acts as a normalising constant. The Na¨ıve Bayes technique that is considered in this paper assumes the independence of discriminators d1, . . . , dm as well as the simple Gaussian behaviour of them. The authors understand that these assumptions are not realistic in the context of the Internet traffic, but [5] suggest that this model sometimes outperforms certain more complex models. 4 Na¨ıve Bayes Results Our experiments have shown that the Na¨ıve Bayes technique classified on average 66.71% of the traffic correctly. Table 1 demonstrates the classification accuracy of this techinique for each class. It can be seen from this table, that SERVICES and BULK are very well classified, with around 90% of correctly-predicted flows. In comparision to other results, it could be concluded that most discriminator distributions are well separated in the Euclinean space. At this stage, it is important to note why certain classes performed very poorly. Classes such as GAMES and INTERACTIVE do not contain enough samples, therefore, Na¨ıve Bayes training on these classes is not accurate or realistic. ATTACK flows were often confused with the WWW flows, due to the similarity in discriminators.
  • 4. WWW MAIL BULK SERV DB INT P2P ATT MMEDIA GAMES Accuracy (%) 65.97 56.85 89.26 91.19 20.20 22.83 45.59 58.08 59.45 1.39 Probability (%) 98.31 90.69 90.01 35.92 61.78 7.54 4.96 1.10 32.30 100.00 Table 1. Average accuracy of classification of Na¨ıve Bayes technique by class and Probability that the predictive class is the real class. Alongside accuracy we consider it important to define several other metrics describ- ing the classification technique. Table 1 shows how traffic from different classes gets classified — clearly an important measure. However, if a network administrator were to use our tool they would be interested in finding out how much trust can be placed in the classification outcome. Table 1 also shows the average probability that the pre- dicted flow class is in fact the real class, e.g., if flow xi has been classified as WWW, a measure of trust gives us a probability that xi is in reality WWW. A further indication of how well the Na¨ıve Bayes technique performs is to analyse the volume of accurately-classified bytes and packets. The results obtained are: 83.98% and 83.93% of packets and bytes, respectively, were correctly classified by the Na¨ıve Bayes technique described above. In contrast port-based classification achieved an accuracy of 71.02% by packet and 69.07% by bytes (from [2]). Comparing results in this way highlights the significant improvement of our Na¨ıve Bayes technique over the port- based classification alone. 5 Conclusions & Further Work We demonstrate that, in its simplest form, our probabilistic-classification is capable of 67% accuracy per-flow or better than 83% accuracy both per-byte and per-packet. We maintain that access to a full-payload trace, the only definitive way to characterise net- work applications, will be limited due to technical and legal restrictions. We illustrate how data gathered without those restrictions may be used as training input for a sta- tistical classifier which in turn can provide accurate, albeit estimated, classification of header-only trace data. References 1. Anthony McGregor et al.: Flow Clustering Using Machine Learning Techniques. In: Proceed- ings of the Fifth Passive and Active Measurement Workshop (PAM 2004). (2004) 2. Moore, A.W., Papagiannaki, K.: Toward the accurate identification of network applications. In: Passive & Active Measurement Workshop 2005 (PAM2005), Boston, MA (2005) 3. Moore, A., Zuev, D.: Discriminators for use in flow-based classification. Technical report, Intel Research, Cambridge (2005) 4. Mitchell, T.: Machine Learning. McGraw Hill (1997) 5. Witten, I.H., Frank, E.: Data Mining. Morgan Kaufmann Publishers (2000)