Exploiting clustering techniques
Presentation Transcript

  • Exploiting Clustering Techniques for Web Session Inference. A. Bianco, G. Mardente, M. Mellia, M. Munafò, L. Muscariello (Politecnico di Torino)
  • Outline
    • Web Session Model
    • Clustering techniques
    • The proposed algorithm
    • Performance of the algorithm
    • Session statistics
  • Web session definition
    • A single web client generates a succession of TCP flows and think times
    [Figure: succession of TCP flows separated by think times T_{off}]
    • A session is defined here as the set of TCP flows that arrive close enough to one another
    • For example, a threshold can be used to discriminate between think times and inter-arrivals of TCP flows
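The threshold idea on this slide can be illustrated with a minimal Python sketch. The function name `split_sessions`, the sample arrival times, and the 5-second threshold are assumptions chosen for illustration; they do not come from the slides.

```python
def split_sessions(arrival_times, threshold):
    """Group flow arrival times (seconds) into sessions.

    A gap larger than `threshold` is treated as a think time and
    starts a new session; smaller gaps stay inside the current session.
    """
    sessions = []
    for t in sorted(arrival_times):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)   # inter-arrival within a session
        else:
            sessions.append([t])     # think time: open a new session
    return sessions


flows = [0.0, 0.4, 0.9, 30.0, 30.2, 95.0]   # illustrative values only
print(split_sessions(flows, threshold=5.0))
# → [[0.0, 0.4, 0.9], [30.0, 30.2], [95.0]]
```

The result depends entirely on the chosen threshold, which is the weakness the following slides address.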
  • Algorithms
    • A threshold-based approach needs a priori knowledge of the source
    • An adaptive algorithm should be able to follow traffic variations
    • Such an algorithm is expected to be less sensitive to traffic characteristics
    • Clustering is the chosen approach
  • Proposed algorithm
    • Three steps
      • K-means is run on all samples to obtain a first clustering; K is chosen very large
      • Hierarchical clustering is run only on the representatives of each cluster; K is reduced
      • K-means is run on all samples again
    • To test the algorithm we need traffic whose structure is known a priori, so it is artificially generated
  • First Step: K-means
    • K is chosen large enough but significantly smaller than the number of samples
    • The K farthest flows determine the first partition
    • K-means is run for 1000 iterations on all samples
    • Each cluster is then represented using a subset of samples, one or two in our algorithm
      • The mean value (Centroid method)
      • The g-th and (100-g)-th percentiles (single linkage method if g=0)
    [Figure: a cluster with its g-th and (100-g)-th percentiles marked]
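A minimal sketch of this first step, under several assumptions not stated on the slide: samples are one-dimensional flow arrival times, "the K farthest flows" is read as a greedy farthest-point seeding, and percentiles use a simple nearest-rank rule. All function names are hypothetical.

```python
def farthest_point_seeds(samples, k):
    """Greedy seeding: each new seed is the sample farthest from all chosen seeds."""
    seeds = [min(samples)]
    while len(seeds) < k:
        seeds.append(max(samples, key=lambda x: min(abs(x - s) for s in seeds)))
    return seeds


def assign(samples, centers):
    """Put every sample in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for x in samples:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    return clusters


def kmeans_1d(samples, k, iterations=1000):
    """Lloyd's iterations on 1-D samples; stops early once centers are stable."""
    centers = farthest_point_seeds(samples, k)
    for _ in range(iterations):
        clusters = assign(samples, centers)
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(samples, centers)


def representatives(cluster, g=None):
    """Centroid if g is None, else the g-th and (100-g)-th percentiles."""
    if g is None:
        return [sum(cluster) / len(cluster)]
    ordered = sorted(cluster)
    last = len(ordered) - 1
    return [ordered[round(g / 100 * last)],
            ordered[round((100 - g) / 100 * last)]]


centers, clusters = kmeans_1d([0.0, 0.5, 1.0, 50.0, 50.5, 51.0], k=2)
print([representatives(c, g=0) for c in clusters])   # border flows of each cluster
```

With g=0 the two representatives are the extreme flows of the cluster, which is why the slide relates that case to single linkage.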
  • Second step: a hierarchical method
    • A hierarchical method is applied only to the representatives
    • This method merges clusters until a quality function determines that the optimal number of clusters Nc has been found
  • Gamma function typical behaviour
    [Figure: gamma versus step]
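The merging loop of the second step might look like the sketch below. The slides do not define the quality function, so this example substitutes a plain distance cut-off (`stop_distance`) as a hypothetical stand-in; it illustrates only the agglomerative merging of representative sets, not the authors' stopping rule. Clusters are assumed to be lists of 1-D representatives.

```python
def merge_until(rep_sets, stop_distance):
    """Agglomerative merging of clusters given as lists of 1-D representatives.

    At each step the two clusters whose closest representatives are nearest
    are merged. Merging stops when that distance exceeds `stop_distance`,
    a placeholder for the quality function mentioned on the slide.
    """
    clusters = [list(r) for r in rep_sets]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_distance:
            break                       # stand-in stopping rule
        clusters[i].extend(clusters.pop(j))
    return clusters


reps = [[0.0, 1.0], [1.5, 2.0], [50.0, 51.0], [51.2, 52.0]]   # illustrative values
print(merge_until(reps, stop_distance=5.0))
# → [[0.0, 1.0, 1.5, 2.0], [50.0, 51.0, 51.2, 52.0]]
```

Because only one or two representatives per cluster are compared, the quadratic pair search runs over K clusters rather than over all samples.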
  • Third Step: K-means
    • A K-means is performed on all samples
    • This last step is not critical, but it rearranges the positions of samples within clusters, that is, of flows within sessions
    • It is not CPU intensive, so there is little cost in including it
  • Performance evaluation
    • Artificial traffic is generated according to an ON/OFF process
    • During ON periods a succession of flows is generated using i.i.d. inter-arrivals
    • In this model, inference means recognizing whether an inter-arrival is an OFF period or an inter-arrival between flows within an ON period
    • Every time the algorithm does not guess correctly, an error is counted
    • Suppose all variables are exponentially distributed
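The evaluation setup can be sketched as follows. This is an assumption-laden illustration: the number of flows per session is fixed here (the slides do not say how it is drawn), the OFF period is taken as the gap between the last flow of one session and the first flow of the next, and the function names and parameters are hypothetical.

```python
import random


def generate_on_off(n_sessions, flows_per_session, mean_arr, mean_off, seed=0):
    """Generate flow arrival times from a simple ON/OFF model.

    Returns (times, is_off), where is_off[i] is True when the gap between
    times[i] and times[i + 1] is an OFF period. Both gap types are drawn
    from exponential distributions with the given means.
    """
    rng = random.Random(seed)
    t, times, is_off = 0.0, [0.0], []
    for s in range(n_sessions):
        for f in range(flows_per_session):
            if s == 0 and f == 0:
                continue                     # first flow already placed at t = 0
            off = (f == 0)                   # first flow of a session follows an OFF period
            t += rng.expovariate(1.0 / (mean_off if off else mean_arr))
            times.append(t)
            is_off.append(off)
    return times, is_off


def error_percentage(true_off, guessed_off):
    """Percentage of inter-arrivals whose ON/OFF label was guessed wrongly."""
    wrong = sum(a != b for a, b in zip(true_off, guessed_off))
    return 100.0 * wrong / len(true_off)


times, labels = generate_on_off(n_sessions=3, flows_per_session=4,
                                mean_arr=1.0, mean_off=100.0)
print(len(times), sum(labels))   # 12 flows, 2 OFF periods
```

Any inference method can then be scored by passing its guessed labels to `error_percentage` together with the true ones.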
  • First step sensitivity (1/2)
    • If the initial number of clusters is chosen large enough the method is less error prone
    • The algorithm is much more sensitive to the value of the idle period
    [Figure: percentage of errors versus T_{off}, for K=1000, K=1500, K=2000, K=2500]
  • First step sensitivity (2/2)
    • Performance is sensitive to the choice of the percentile g
    • When clusters are represented by flows at the border of the session (i.e. g=1), the method is less sensitive to traffic
    • This is because the cluster has a long and narrow shape, and those representatives model this well
    [Figure: percentage of errors versus T_{off}; first panel: single linkage, centroid method, g=1, g=5; second panel: g=15, g=25, g=35, g=45]
  • Comparison with threshold based algorithms – exponential case
    • Threshold based algorithms work well if traffic characteristics are known
    • But they are very sensitive to the threshold value
    • If sessions are already well clustered because idle periods are large enough compared to flow inter-arrivals, our algorithm performs very well
    [Figure: percentage of errors versus T_{off}; first panel: clustering, etha=T_{off}/2, etha=T_{off}/4, etha=T_{off}/8; second panel: etha=T_{off}/16, etha=T_{off}/32, etha=T_{off}/64, etha=T_{off}/128]
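The sensitivity of a threshold-based classifier to its threshold can be shown with a small sketch. The gaps, labels, and T_off value below are toy numbers made up for illustration, not data from the slides; `etha` follows the legend of the plot and `threshold_errors` is a hypothetical helper.

```python
def threshold_errors(gaps, is_off, etha):
    """Percentage of gaps misclassified when 'gap > etha' means OFF period."""
    guessed = [g > etha for g in gaps]
    wrong = sum(a != b for a, b in zip(guessed, is_off))
    return 100.0 * wrong / len(gaps)


# Toy inter-arrival gaps (seconds) with their true labels.
gaps = [0.2, 0.5, 3.0, 40.0, 0.1, 6.0, 55.0, 0.4]
is_off = [False, False, False, True, False, False, True, False]
t_off = 50.0   # assumed mean OFF period for this toy example

for divisor in (2, 8, 32, 128):
    print(divisor, threshold_errors(gaps, is_off, t_off / divisor))
# → 2 0.0
# → 8 0.0
# → 32 25.0
# → 128 50.0
```

In this toy case the error grows as the threshold drops below the longer in-session gaps, in line with the slide's remark that threshold-based algorithms are very sensitive to the threshold value.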
  • Comparison with threshold based algorithms – Pareto case
    • Threshold based algorithms work well if traffic characteristics are known
    • But they are very sensitive to the threshold value
    • If sessions are already well clustered because idle periods are large enough compared to flow inter-arrivals, our algorithm performs very well
    [Figure: percentage of errors versus T_{off}; first panel: clustering, etha=T_{off}/2, etha=T_{off}/4, etha=T_{off}/8; second panel: etha=T_{off}/16, etha=T_{off}/32, etha=T_{off}/64, etha=T_{off}/128]
  • Some statistics on aggregated sessions
    • Session sizes are, broadly speaking, heavy tailed
      • Usually each session is made of a few TCP flows
    • The exact definition of flow termination has little impact
    [Figure: PDF and complementary CDF of the number of TCP connections per session; PDF and complementary CDF of session length [s], measured as First SYN -> Last TCP Tear-Down and as First SYN -> Last Data Segment]
  • Some statistics on aggregated sessions
    • Similar results concerning server to client and client to server data
    • Similar distribution law; asymmetries on volume only
    [Figure: PDF and complementary CDF of session data [bytes], Server -> Client and Client -> Server]
  • Flow’s and session’s inter-arrivals
    • The method infers sessions that are similar even across very different traces
    • T_{arr} and T_{off} are well identified
    [Figure: CDF of time [s] for T_{off} and T_{arr}, traces Apr.04 and Oct.02]
  • Conclusions
    • Clustering techniques can readily be used to infer web sessions
    • The proposed algorithm is a mix of known clustering approaches
    • It is able to deal with huge amounts of data
    • Sessions seem to be very well recognized