IEEE INFOCOM-TMA 2013
 


    IEEE INFOCOM-TMA 2013 Presentation Transcript

    • SELF-LEARNING CLASSIFIER FOR INTERNET TRAFFIC
      Luigi Grimaudo*, Marco Mellia*, Elena Baralis* and Ram Keralapura⌃
      * Politecnico di Torino, ⌃ Narus, Inc.
      TMA 2013, 19/04/2013, Torino
    • INTERNET TRAFFIC CLASSIFICATION
      •  Provide network visibility for operators:
        •  Deep Packet Inspection
          •  Looks for specific tokens inside the packet payload
          •  Does not work if encryption is in place
        •  Behavioral classifier
          •  Exploits some description of application behaviors by means of statistical characteristics (e.g., length of the first packets of a flow)
    • DISADVANTAGES
      •  Both require training
        •  Define patterns to be matched
        •  Define a training set
      •  They identify only the applications they have been trained for
        •  No new application, and no change in the application protocol or behavior
      •  Regular update or retraining is needed
    • EXPERIMENTAL ANALYSIS (1)
      •  Datasets
         Name  Date   Time  Place          Type      IPs   Flows
         DS-1  Aug05  1pm   South America  backbone  108k  527k
         DS-2  Sep10  10am  Asia           backbone  111k  1.8M
         DS-3  Aug11  2am   Europe         access    111k  885k
         DS-4  Aug11  5pm   Europe         access    190k  2.3M
      •  Each dataset is 1h long
      •  Each is split into a client-to-server and a server-to-client dataset (indicated as C and S in the following)
    • EXPERIMENTAL ANALYSIS (2)
      •  DPI as oracle:
        •  Tstat [1]
        •  NarusInsight
      •  23 different protocols, like: HTTP/S, POP3/S, YAHOOIM, FASTTRACK, TELNET, RTSP, IMAP/S, BITTORRENT, ARES, IRC, TLS, XMPP, EMULE, SMB, SMTP/S, MSN, GNUTELLA, FTP
      •  Performance metrics:
        •  Overall accuracy
        •  Recall
        •  Precision
      [1] A. Finamore, M. Mellia, M. Meo, M. Munafò, D. Rossi, "Experiences of Internet Traffic Monitoring with Tstat", IEEE Network, vol. 25, no. 3, 2011
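The three metrics named on this slide can be computed from a pair of label lists, one from the oracle and one from the classifier. The following is a minimal sketch using the standard definitions; the function name `classification_metrics` and the input layout are assumptions for illustration and do not come from the slides.

```python
from collections import Counter


def classification_metrics(true_labels, predicted_labels):
    """Return (overall accuracy, per-class precision, per-class recall).

    true_labels:      labels given by the oracle (e.g. a DPI tool)
    predicted_labels: labels given by the classifier under test
    Both are hypothetical inputs: equal-length lists of protocol names.
    """
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("need two non-empty label lists of equal length")

    actual = Counter(true_labels)          # flows per true class
    predicted = Counter(predicted_labels)  # flows per predicted class
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    # Overall accuracy: fraction of all flows labeled correctly.
    accuracy = sum(correct.values()) / len(true_labels)
    # Precision: of the flows assigned to a class, how many belong to it.
    precision = {c: correct[c] / predicted[c] for c in predicted}
    # Recall: of the flows belonging to a class, how many were found.
    recall = {c: correct[c] / actual[c] for c in actual}
    return accuracy, precision, recall
```

Precision is reported only for classes the classifier actually predicted, and recall only for classes present in the oracle labels, so neither division can hit a zero denominator.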
    • SeLeCT: Self-Learning Classifier for Internet Traffic
      •  Behavioral classifier based on:
        •  Simple layer-4 metrics (segment size, inter-arrival time)
        •  Iterative clustering
        •  Filtering phase
        •  Adaptive/progressive learning
      •  Advantages:
        •  Few and very pure clusters
        •  Quick inspection
        •  Easy manual labeling
    • THE SeLeCT APPROACH
      [Diagram; recoverable labels: Internet, flows, SeLeCT, good, unknown, deep analysis]
    • [Diagram of the processing pipeline; recoverable labels]
      •  L4 data
      •  Features: (IAT_1, pay1_size, IAT_2, pay2_size, IAT_3, pay3_size, IAT_4, pay4_size, IAT_5, pay5_size, IAT_6, pay6_size)
      •  Clustering
      •  Filtering
      •  Outputs: good clusters (ready for inspection), unclustered flows, small clusters (discarded)
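The feature list on this slide interleaves six inter-arrival times with six payload sizes. A minimal sketch of how such a 12-element vector could be built is given below. It rests on assumptions the slide does not state: that IAT_1 is measured from the flow start time, that a flow is available as a list of `(timestamp, payload_size)` pairs, and that shorter flows are zero-padded. The name `flow_features` is hypothetical.

```python
NUM_SEGMENTS = 6  # the slide lists IAT_1..IAT_6 and pay1_size..pay6_size


def flow_features(flow_start, segments):
    """Build [IAT_1, pay1_size, ..., IAT_6, pay6_size] for one flow.

    flow_start: timestamp of the start of the flow (assumed reference for IAT_1)
    segments:   list of (timestamp, payload_size) in arrival order
    """
    features = []
    previous = flow_start
    for timestamp, payload_size in segments[:NUM_SEGMENTS]:
        features.append(timestamp - previous)  # inter-arrival time
        features.append(float(payload_size))   # payload size of this segment
        previous = timestamp

    # Assumption: flows with fewer than six segments are zero-padded
    # so that every flow yields a vector of the same length.
    features.extend([0.0] * (2 * NUM_SEGMENTS - len(features)))
    return features
```

A fixed-length vector matters here because k-means needs every flow to live in the same feature space.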
    • ITERATIVE CLUSTERING
      •  Work on batches: every 10000 flows or similar…
      •  Based on the k-means clustering algorithm
        •  Simple, well understood, and it works!
        •  Groups flows into clusters possibly originated by the same application
      •  How to deal with the server port?
        •  It is known to carry valuable information [2]
        •  KEY IDEA: use the port number for filtering:
          •  Dominated port protocol
          •  Random port protocol
      •  Clustering and filtering phases alternate for a fixed number of iterations
      [2] H. Kim, K. C. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, K. Lee. "Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices." ACM CoNEXT, Madrid, 2008
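For readers unfamiliar with k-means, the clustering step applied to one batch can be sketched as plain Lloyd iterations. This is a generic textbook version, not the authors' implementation. The function names are hypothetical, and taking the first k points as initial centroids is a simplification chosen for reproducibility (random or k-means++ seeding is the usual choice).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    """Cluster one batch of feature vectors; returns (assignment, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Simplified, deterministic initialization: the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignment = assign(points, centroids)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids
```

Note that the server port is deliberately absent from the feature vectors: per the slide, it is used afterwards for filtering, not for clustering.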
    • FILTERING PROCEDURE
      [Flowchart; steps as recoverable from the slide labels]
      •  Input: clusters after k-means execution (Cluster 1 … Cluster n)
      •  For each cluster, find the dominant destination port (bar chart of flows per port, PORT 1 to PORT 4)
      •  Create a new cluster for the dominant port of each cluster
      •  Flows from the original cluster going to the dominant port?
        •  YES: add to new cluster
        •  NO: unclustered flows
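The filtering step can be sketched as follows, under several assumptions that the slide does not confirm: a port counts as dominant when it carries at least a given share of the cluster's flows (`min_port_share`, a hypothetical parameter), clusters sharing the same dominant port are merged into one new cluster, and clusters without a dominant port are returned to the unclustered pool. The handling of random-port protocols is not modeled here.

```python
from collections import Counter, defaultdict


def filter_clusters(clusters, min_port_share=0.5):
    """Split k-means clusters by dominant destination port.

    clusters: dict mapping a cluster id to a list of flows, where each flow
              is a dict with a 'dst_port' key (hypothetical layout).
    Returns (port_clusters, unclustered): a dict port -> flows, and the list
    of flows sent back for the next clustering iteration.
    """
    port_clusters = defaultdict(list)
    unclustered = []
    for flows in clusters.values():
        if not flows:
            continue
        port, count = Counter(f["dst_port"] for f in flows).most_common(1)[0]
        if count / len(flows) < min_port_share:
            # Assumption: no dominant port, so the whole cluster is released.
            unclustered.extend(flows)
            continue
        for flow in flows:
            if flow["dst_port"] == port:
                port_clusters[port].append(flow)  # add to the new cluster
            else:
                unclustered.append(flow)          # back to the unclustered pool
    return dict(port_clusters), unclustered
```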
    • FINAL ITERATIVE STEP
      [Flowchart; steps as recoverable from the slide labels]
      •  N-1 steps: filtering procedure; collect all dominant destination ports
      •  Final step, applied to the unclustered flows:
        •  Port filtering: remove flows going to the collected ports
        •  Clustering
        •  Discard small clusters (outliers)
      •  Output: final clusters (Cluster 1 … Cluster n)
    • ITERATIVE CLUSTERING PERFORMANCE: SeLeCT vs k-means [3]
      [Plot: Accuracy [%] per dataset (1C, 2C, 3C, 4C, 1S, 2S, 3S, 4S); series: k-means, SeLeCT]
      •  Overall accuracy on average ~ 98%
      [3] P.-N. Tan, M. Steinbach, V. Kumar. "Introduction to Data Mining." Pearson Addison Wesley, Boston, 2006
    • INTERESTING FINDINGS (1)
      •  Interestingly, SeLeCT does BETTER than the ground truth
      •  Is the ground truth fooled by non-English welcome messages of SMTP servers?
    • LABELING (1)
      •  Clusters from iterative clustering:
        •  Ready to be inspected by operators
      •  Clusters are labeled by means of seeding flows and a voting scheme
      •  Otherwise clusters are labeled as "unknown"
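The slide does not specify the voting scheme. As an illustration only, a simple majority vote over the seeding flows that fall into a cluster could look like the sketch below. The function name `label_cluster` and the `min_share` threshold are assumptions.

```python
from collections import Counter


def label_cluster(seed_labels, min_share=0.5):
    """Label one cluster from the seeding flows it contains.

    seed_labels: labels of the seeding flows that ended up in this cluster.
    Returns the winning label, or "unknown" when there are no seeding flows
    or no label holds more than min_share of the votes (assumed rule).
    """
    if not seed_labels:
        return "unknown"
    label, votes = Counter(seed_labels).most_common(1)[0]
    return label if votes / len(seed_labels) > min_share else "unknown"
```

With this rule a cluster containing no seeding flows stays "unknown" and is left for manual inspection.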
    • LABELING (2)
      •  Bootstrapping:
        •  Labels can be assigned to each "unknown" cluster by the operator using domain knowledge
        •  Turns out to be trivial for most port-dominated clusters
          •  Cluster of flows all going to port 80 -> HTTP!
          •  Cluster of flows all going to login.skype.com -> Skype login protocol!
    • INTERESTING FINDINGS (2)
      Pure clusters (not identified by free and commercial DPIs) for:
      •  Apple push notification over TLS
      •  Microsoft Media Server (MMS)
      •  …
      •  Backdoor.Laphex.Client
      •  Skype authentication
      •  …
    • SELF-SEEDING
      •  SeLeCT is able to automatically reuse the knowledge from previous batches
        •  Seeding flows are extracted from labeled clusters
          •  Stratified sampling technique
        •  Seeding flows are used to label flows of the next batch
      •  Self-training process:
        •  The system grows the set of labeled data
        •  Augments the coverage of the classification process
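A minimal sketch of stratified extraction of seeding flows follows. The slide names the technique but not the strata or the sampling rate, so the choices here are assumptions: flows are stratified by their label, a fixed `fraction` is drawn from each stratum, and at least one flow is kept per label. The name `sample_seeds` is hypothetical.

```python
import random
from collections import defaultdict


def sample_seeds(labeled_flows, fraction=0.1, seed=0):
    """Pick seeding flows for the next batch, stratified by label.

    labeled_flows: list of (flow_id, label) pairs from labeled clusters.
    Returns a list of (flow_id, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    strata = defaultdict(list)
    for flow_id, label in labeled_flows:
        strata[label].append(flow_id)

    seeds = []
    for label, flow_ids in strata.items():
        # Assumption: keep at least one seed so rare classes are not lost.
        size = max(1, round(len(flow_ids) * fraction))
        seeds.extend((flow_id, label) for flow_id in rng.sample(flow_ids, size))
    return seeds
```

Sampling per stratum rather than over the whole batch keeps small classes represented among the seeds, which plain uniform sampling would tend to drop.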
    • ITERATIVE CLUSTERING PERFORMANCE: Accuracy over different batches
      [Plot: Accuracy [%] vs. Batch (1 to 10); series: DS-1C, DS-1S, DS-2C, DS-2S, DS-3C, DS-3S, DS-4C, DS-4S]
      •  For server: 99% accuracy
        •  Server features are stronger than client features
      •  For client: 90% accuracy
        •  Only P2P traffic is "confused"
    • SEEDING EVOLUTION
      •  S eMule clusters are used at bootstrap
      [Plot: Recall [%] vs. Batch (1 to 10); series: S=1, S=2, S=3]
    • HOW FAST IS SeLeCT TO DETECT A NEW CLASS?
      [Plots vs. Batch (1 to 10): Accuracy [%] for DS-4S; Precision [%] for HTTPS and POP3; Recall [%] for HTTPS and POP3]
      •  HTTPS traffic added at batch 3 and POP3 traffic added at batch 6
    • CONCLUSION
      •  SeLeCT:
        •  Semi-automated Internet flow traffic classifier
        •  Based on clustering and filtering
        •  Adapts the model to traffic changes
        •  Able to increase its knowledge
        •  Extremely good performance
        •  Based on simple behavioral features
        •  Can be extended to include other features
    • Thanks for your attention! Questions?
      Luigi Grimaudo, luigi.grimaudo@polito.it