` Traffic Classification based on Machine Learning

`
Traffic Classification based on Machine Learning
using Flow-level Information
Jong Gun Lee (jglee@an.kaist.ac.kr)
Advanced Networking Lab.

`
Table of Contents
• Motivation of this work
• Background about machine learning
• Our approach using machine learning
• Experiment (dataset and result)
• Conclusion

`
Motivation
• We cannot effectively classify the traffic of some newly
emerging applications,
– such as online games and streaming applications,
– because they expose no application information, such as a
well-known port number or a common byte sequence in the payload
• We propose a methodology to classify Internet traffic
with supervised and unsupervised learning

`
Basic Terminologies of Machine Learning
• Classifier
is a mapping from unlabeled instances to classes
• Instance
is a single object of the world
• Attribute
is a quantity describing an instance
• Feature
is the specification of an attribute and its value
• Feature vector
is a list of features describing an instance

`
Unsupervised and Supervised Learning
• Supervised learning (with answers/a teacher)
– Given a labeled training set, a classifier learns the characteristics
of each class. When a new instance arrives, the classifier predicts
its class.
• Unsupervised learning (without answers/a teacher)
– Given only a set of data (feature vectors), the learner groups the
data into a set of clusters.

`
K-Means
• One of the unsupervised learning methods
• K is the number of clusters, given as an initial parameter
• Procedure
– First, the algorithm randomly chooses K points as the centers of
K subspaces
– Second, it divides the overall vector space into K subspaces by
assigning each point to its nearest center
– Third, it computes a new center for each of the K subspaces
– Then it repeats the second and third steps until no center changes,
or until every center moves less than a threshold value
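The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names, the `threshold` and `max_iter` defaults, the use of the mean as the new center, and the handling of empty clusters are all assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: choose K points at random as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: split the space by assigning each point to its nearest center.
        labels = assign(points, centers)
        # Step 3: recompute each center as the mean of its members (assumption).
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        # Step 4: stop once no center moved farther than the threshold.
        moved = max(sq_dist(a, b) for a, b in zip(centers, new_centers)) ** 0.5
        centers = new_centers
        if moved <= threshold:
            break
    # Final assignment so the labels match the returned centers.
    return centers, assign(points, centers)
```

For example, `kmeans(points, 2)` on eight 2-D points forming two well-separated groups returns one center per group.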

`
Example of K-Means
• Number of instances: 8, K = 2

`
Overall Process of Our Method
[Figure: process diagram]
N packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier
Remaining figure label: Classification Method

`
Flow-level Feature Information
• Protocol number: 6 (TCP) or 17 (UDP)
• Duration (seconds)
• Number of packets per second (PPS)
• Mean size of all packets
• Mean size of non-ACK packets
• Rate of ACK packets
• Interaction Information
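As a rough sketch, the scalar features in this list could be computed from one flow's packets as follows. The `Packet` record, its `is_ack` flag, and the reading of "rate of ACK packets" as the fraction of ACK packets in the flow are assumptions made for illustration; the slides do not define them. Interaction information is omitted here.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    """Hypothetical per-packet record; field names are illustrative."""
    timestamp: float  # capture time in seconds
    size: int         # packet size in bytes
    is_ack: bool      # assumed flag: True for an ACK packet

def flow_features(protocol, packets):
    """Scalar flow-level features for a non-empty list of packets."""
    n = len(packets)
    times = [p.timestamp for p in packets]
    duration = max(times) - min(times)
    sizes = [p.size for p in packets]
    non_ack_sizes = [p.size for p in packets if not p.is_ack]
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,
        # PPS is reported as 0.0 for a zero-duration flow (assumption).
        "pps": n / duration if duration > 0 else 0.0,
        "mean_size": sum(sizes) / n,
        "mean_non_ack_size": (sum(non_ack_sizes) / len(non_ack_sizes)
                              if non_ack_sizes else 0.0),
        # Fraction of packets in the flow that are ACKs (assumed meaning).
        "ack_rate": (n - len(non_ack_sizes)) / n,
    }
```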

`
Feature Extraction (Interaction Information)
• Interaction Information
– H: 2-dimensional histogram, 16x16
– p1, p2, p3, …, pn
• the sequence of packet sizes of a flow and its partner flow,
ordered by timestamp
– Update rule (integer division, bin size of 100 bytes):
For i = 1 : n-1
H[p(i)/100][p(i+1)/100]++
• Example
– Sequence of packet sizes: 40, 80, 1500, …, 40, 1500
– Pair-wise representation: [40, 80], [80, 1500], …, [40, 1500]
– Histogram indices: [40/100, 80/100], [80/100, 1500/100], …, [40/100, 1500/100]
= [0, 0], [0, 15], …, [0, 15]
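The histogram update can be written as a short Python function. This is a sketch under assumptions: the function name and parameters are illustrative, and clamping sizes of 1600 bytes or more into the last bin is an added safeguard that the slide does not mention.

```python
def interaction_histogram(sizes, bin_size=100, bins=16):
    """Count consecutive packet-size pairs in a bins x bins histogram."""
    hist = [[0] * bins for _ in range(bins)]
    # Pair each packet size with the size of the packet that follows it.
    for a, b in zip(sizes, sizes[1:]):
        i = min(a // bin_size, bins - 1)  # clamp oversized packets (assumption)
        j = min(b // bin_size, bins - 1)
        hist[i][j] += 1
    return hist
```

With the sizes 40, 80, 1500, 40, 1500 the pairs fall into cells [0][0], [0][15], [15][0], and [0][15].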

`
Guideline
[Figure: process diagram (Packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier), annotated with open questions]
• Direction: Rx and Tx, Rx only, or Tx only
• #bins and bin size: dynamic or static
• Initial ?? packets
• Effective K estimation
• Efficient threshold
• What kind of learning method and feature extraction
• Other figure labels: yes, no, Unknown Traffic
`
Dataset
Instance counts per file:

File               Full set   Balanced subset
bittorrent.arff        6412              1500
clubbox.arff           4913              1500
edonkey.arff         101355              1500
fileguri.arff         21060              1500
ftp.arff                635                 0
http.arff            200274              1500
https.arff             3611              1500
melon.arff               22                 0
msnp.arff              4986              1500
nateon.arff            1565              1500
nntp.arff               169                 0
pop3.arff                63                 0
sayclub.arff            224                 0
smtp.arff             40556              1500
ssh.arff                 67                 0
total                385912             13500

`
Sum of Squared Error (SSE)
• How to get SSE
• #bins: 8x8
• #clusters: 1 to 20
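The formula on this slide did not survive extraction. A common definition, assumed here, is the sum over all instances of the squared Euclidean distance between the instance and the center of its cluster. A minimal sketch (the function name and argument layout are illustrative):

```python
def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    total = 0.0
    for p, lab in zip(points, labels):
        # Squared Euclidean distance to the center of the point's cluster.
        total += sum((x - c) ** 2 for x, c in zip(p, centers[lab]))
    return total
```

Computing this value for each number of clusters from 1 to 20 gives the kind of SSE curve examined on the following slides.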

`
Fitting of SSE
Y = 1.446×10^4 · X^(-1.194) + 755.8
(X: number of clusters, Y: SSE)

`
Decrease Rate of SSE
0.1% decrease
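The slides do not say how the decrease rate is computed. One possible reading, assumed here, is the relative drop in the fitted SSE when one more cluster is added. The sketch below evaluates the fitted curve from the previous slide and finds the first number of clusters at which that drop falls below a chosen threshold; the function names and the `k_max` search limit are illustrative.

```python
def fitted_sse(k):
    """Fitted power-law model of SSE as a function of the cluster count k."""
    return 1.446e4 * k ** -1.194 + 755.8

def relative_decrease(k):
    """Fractional drop in fitted SSE when going from k-1 to k clusters (k >= 2)."""
    prev = fitted_sse(k - 1)
    return (prev - fitted_sse(k)) / prev

def first_k_below(threshold, k_max=10_000):
    """Smallest k >= 2 whose relative decrease is under the threshold, or None."""
    for k in range(2, k_max + 1):
        if relative_decrease(k) < threshold:
            return k
    return None
```

Calling `first_k_below(0.001)` corresponds to a 0.1% threshold under this reading.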

`
To do list
• Direction
– Rx and Tx, Rx only, and Tx only
• Dynamic bin size
• Initial N packets or all the packets
• Different (un)supervised learning methods
• Different feature extraction methods
