Transcript of "Traffic Classification based on Machine Learning"

  1. Traffic Classification based on Machine Learning using Flow-level Information
     Jong Gun Lee (jglee@an.kaist.ac.kr), Advanced Networking Lab.
  2. Table of Contents
     • Motivation of this work
     • Background on machine learning
     • Our approach using machine learning
     • Experiment (dataset and results)
     • Conclusion
  3. Motivation
     • We cannot effectively classify the traffic of some newly emerging applications,
       – such as online games and streaming applications,
       – because they expose no identifying application information, such as a port number or a common byte sequence in the payload.
     • We propose a methodology to classify Internet traffic with supervised and unsupervised learning.
  4. Basic Terminology of Machine Learning
     • Classifier: a mapping from unlabeled instances to classes
     • Instance: a single object of the world
     • Attribute: a single property describing an instance
     • Feature: the specification of an attribute and its value
     • Feature vector: a list of features describing an instance
  5. Unsupervised and Supervised Learning
     • Supervised learning (with answer/teacher)
       – Given a training set, a classifier learns the characteristics of each class. When a new instance arrives, the classifier predicts its class.
     • Unsupervised learning (without answer/teacher)
       – Given only a set of data (feature vectors), a classifier groups the data into a set of clusters.
  6. K-Means
     • One of the unsupervised learning methods
     • K is the number of clusters and is given as an initial parameter
     • Procedure
       – First, the classifier randomly chooses K points as the centers of K subspaces
       – Second, it divides the overall vector space into K subspaces according to the centers
       – Third, it picks a new center for each of the K subspaces
       – Then it repeats the second and third steps until no center changes, or until every center moves by less than a threshold value
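The procedure on the K-Means slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: the function names, the `init`, `max_iter`, `tol` and `seed` parameters, the use of squared Euclidean distance, and the choice to keep the old center when a cluster becomes empty are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(members):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def assign(points, centers):
    """Index of the nearest center for every point (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=None):
    # Step 1: choose K starting centers (randomly, unless the caller fixes them).
    centers = list(init) if init is not None else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Step 2: split the space by assigning each point to its nearest center.
        labels = assign(points, centers)
        # Step 3: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean_point(members) if members else centers[j])
        # Stop once every center has moved by at most the threshold.
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, assign(points, centers)


# Eight instances, K = 2, with fixed starting centers for reproducibility.
pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centers, labels = kmeans(pts, 2, init=[(1, 1), (9, 9)])
```

With random starting centers the result can depend on the draw, because K-means may settle in a local optimum; passing `init` explicitly keeps the example deterministic.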
  7. Example of K-Means
     • Number of instances: 8, K = 2
  8. Overall Process of Our Method
     [Diagram. Recoverable labels: Unsupervised Learning, Feature Extraction, Supervised Learning, N packets, N feature vectors, Classifier, K Clusters, Classification Method.]
  9. Flow-level Feature Information
     • Protocol number: 6 (TCP) or 17 (UDP)
     • Duration: seconds
     • Number of packets per second (PPS)
     • Mean size of all packets
     • Mean size of non-ACK packets
     • Rate of ACK packets
     • Interaction Information
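As a rough illustration of the scalar features listed on this slide, the sketch below computes them from a list of packets. The packet representation `(timestamp_seconds, size_bytes, is_ack)`, the function name `flow_features`, the dictionary keys, and the way an ACK packet is identified (a flag supplied by the caller) are assumptions for the example; the slides do not define them. The interaction information feature is not covered here.

```python
def flow_features(protocol, packets):
    """Compute scalar flow-level features.

    protocol: IP protocol number, 6 (TCP) or 17 (UDP).
    packets:  non-empty list of (timestamp_seconds, size_bytes, is_ack) tuples.
              How is_ack is determined is left to the caller (assumption).
    """
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    non_ack_sizes = [s for _, s, is_ack in packets if not is_ack]
    n = len(packets)
    duration = max(times) - min(times)
    return {
        "protocol": protocol,
        "duration": duration,
        # A single-instant flow has no meaningful rate; report 0.0 (assumption).
        "pps": n / duration if duration > 0 else 0.0,
        "mean_size": sum(sizes) / n,
        "mean_non_ack_size": (sum(non_ack_sizes) / len(non_ack_sizes)
                              if non_ack_sizes else 0.0),
        "ack_rate": (n - len(non_ack_sizes)) / n,
    }


# Hypothetical four-packet TCP flow lasting two seconds.
features = flow_features(6, [(0.0, 40, True), (0.5, 1500, False),
                             (1.0, 40, True), (2.0, 1500, False)])
```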
  10. Feature Extraction (Interaction Information)
      • Interaction Information
        – H: 2-dimensional histogram, 16x16
        – p1, p2, p3, …, pn: the sequence of packet sizes of a flow and its partner flow, ordered by timestamp
        – Update rule:
            for i = 1 : n-1
                H[p_i / 100][p_(i+1) / 100]++
      • Example
        – Sequence of packet sizes: 40, 80, 1500, …, 40, 1500
        – Pair-wise representation: [40, 80], [80, 1500], …, [40, 1500]
        – Histogram indices (integer division by 100): [40/100, 80/100], [80/100, 1500/100], …, [40/100, 1500/100] = [0, 0], [0, 15], …, [0, 15]
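The histogram update on this slide can be written as a short Python function. This is a sketch under assumptions: the name `interaction_histogram` and its parameters are illustrative, and clamping sizes of 1600 bytes or more into the last bin is a choice made here so the index stays inside the 16x16 grid; the slides do not say how such sizes are handled.

```python
def interaction_histogram(sizes, bins=16, bin_width=100):
    """Count consecutive packet-size pairs in a bins x bins histogram.

    sizes: packet sizes of a flow and its partner flow, in timestamp order.
    """
    hist = [[0] * bins for _ in range(bins)]
    # Walk over consecutive pairs (p_i, p_(i+1)).
    for a, b in zip(sizes, sizes[1:]):
        # Integer division selects the bin; oversized packets are clamped
        # into the last bin (assumption, not stated on the slide).
        i = min(a // bin_width, bins - 1)
        j = min(b // bin_width, bins - 1)
        hist[i][j] += 1
    return hist


h = interaction_histogram([40, 80, 1500, 40, 1500])
# Pairs: [40, 80] -> [0][0], [80, 1500] -> [0][15],
#        [1500, 40] -> [15][0], [40, 1500] -> [0][15]
print(h[0][0], h[0][15], h[15][0])  # prints: 1 2 1
```

Flattening the 256 cells row by row is one way to append this histogram to the scalar features of a flow's feature vector.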
  11. Guideline
      [Diagram. Recoverable labels: Packets, Feature Extraction, N feature vectors, Unsupervised Learning, K clusters, Supervised Learning, Classifier, yes / no, Unknown Traffic.]
      Annotations on the diagram:
      • Rx and Tx, Rx only, Tx only
      • #bins, bin size (dynamic/static)
      • Initial ?? packets
      • Effective K estimation
      • Efficient threshold
      • What kind of learning method
      • Feature extraction
  12. Dataset
      Number of instances per ARFF file. The slide shows two lists without column labels.

      File              First list   Second list
      bittorrent.arff         6412          1500
      clubbox.arff            4913          1500
      edonkey.arff          101355          1500
      fileguri.arff          21060          1500
      ftp.arff                 635             0
      http.arff             200274          1500
      https.arff              3611          1500
      melon.arff                22             0
      msnp.arff               4986          1500
      nateon.arff             1565          1500
      nntp.arff                169             0
      pop3.arff                 63             0
      sayclub.arff             224             0
      smtp.arff              40556          1500
      ssh.arff                  67             0
      total                 385912         13500
  13. (no text in transcript)
  14. (no text in transcript)
  15. Sum of Squared Error (SSE)
      • How to get SSE
      • Number of bins: 8*8
      • Number of clusters: 1~20
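The transcript does not include the computation behind "How to get SSE". Under the usual K-means definition, which is assumed here, the SSE is the sum over all instances of the squared Euclidean distance to the center of the assigned cluster. The function name `sse` and its arguments are illustrative.

```python
def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center.

    points:  list of feature vectors (tuples of numbers)
    centers: list of cluster centers
    labels:  labels[i] is the index in `centers` of the cluster of points[i]
    """
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((x - c) ** 2 for x, c in zip(p, centers[lab]))
    return total


pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
# One cluster centered on the overall mean versus two tight clusters.
print(sse(pts, [(5.0, 5.0)], [0] * 8))                                   # prints: 200.0
print(sse(pts, [(1.5, 1.5), (8.5, 8.5)], [0, 0, 0, 0, 1, 1, 1, 1]))      # prints: 4.0
```

Evaluating this for each number of clusters from 1 to 20 gives the kind of SSE-versus-K curve the following slides work with.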
  16. Fitting of SSE
      Y = 1.446e4 * X^(-1.194) + 755.8
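The fitted curve can be evaluated directly. In the sketch below, X is read as the number of clusters and Y as the SSE, which is an interpretation of the slide rather than something it states; the function name `fitted_sse` and the parameter names `a`, `b`, `c` are illustrative.

```python
def fitted_sse(k, a=1.446e4, b=-1.194, c=755.8):
    """Power-law fit from the slide: Y = a * X^b + c, with X = k >= 1."""
    return a * k ** b + c


# Tabulate the fit over the range of cluster counts used in the experiment.
for k in range(1, 21):
    print(k, round(fitted_sse(k), 1))
```

Because the exponent is negative, the fitted value decreases as k grows and approaches the constant term 755.8 from above.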
  17. Estimation of SSE
  18. Decrease Rate of SSE
      • 0.1% decrease
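One plausible reading of the 0.1% figure, assumed here and not confirmed by the transcript, is a stopping rule: pick the smallest K at which adding one more cluster lowers the SSE by less than 0.1% relative to the previous value. The function name `first_k_below_rate` and its parameters are hypothetical.

```python
def first_k_below_rate(sse_values, threshold=0.001):
    """Smallest K whose relative SSE decrease from K-1 is below threshold.

    sse_values[i] is the SSE for K = i + 1; all values must be positive.
    Returns None if the decrease never falls below the threshold.
    """
    for i in range(1, len(sse_values)):
        prev, cur = sse_values[i - 1], sse_values[i]
        rate = (prev - cur) / prev  # relative decrease when going to K = i + 1
        if rate < threshold:
            return i + 1
    return None


# Applied to the fitted curve Y = 1.446e4 * X^(-1.194) + 755.8, extended
# beyond the measured range of 1~20 clusters.
curve = [1.446e4 * k ** -1.194 + 755.8 for k in range(1, 201)]
k_star = first_k_below_rate(curve)
```

Using the fitted curve instead of the measured values lets the rule be evaluated for cluster counts that were never run, at the cost of trusting the extrapolation.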
  19. To-do list
      • Direction
        – Rx and Tx, Rx only, and Tx only
      • Dynamic bin size
      • Initial N packets or all the packets
      • Different (un)supervised learning methods
      • Different feature extraction methods