2. The Web today
A proliferation of services
that rely on HTTP
Each day, hundreds of thousands of
unique URLs need to be analyzed by
network analysts
- For traffic analysis
- For performance tuning
- For security
- …
3. Malware
DGA technique
State of the art: firewalls block
malicious traffic using static rules.
Countermeasure: a DGA (Domain Generation
Algorithm) generates pseudo-random
domains starting from common seeds
(e.g., the current date or Twitter trends),
eluding static controls based on blacklists.
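To make the idea concrete, a toy generator could derive domain names from a date seed as sketched below. The function name `toy_dga`, the SHA-256 hashing, and the hex-to-letter mapping are illustrative assumptions; they do not reproduce any specific malware family.

```python
import hashlib
from datetime import date


def toy_dga(seed_date, count=5, length=10, tld=".com"):
    """Derive `count` pseudo-random domain names from a date seed (toy example)."""
    domains = []
    for i in range(count):
        # Hash the seed together with a counter so each name differs.
        digest = hashlib.sha256(f"{seed_date.isoformat()}-{i}".encode()).hexdigest()
        # Map each hex digit (0..15) to a letter in a..p.
        name = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(name + tld)
    return domains


# Anyone who knows the seed can regenerate the same list,
# while a static blacklist only ever contains yesterday's names.
today_domains = toy_dga(date(2018, 1, 1))
tomorrow_domains = toy_dga(date(2018, 1, 2))
```

Because the list changes with the seed, a blacklist built from one day's names does not match the next day's.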
rammyjuke.com
C&C Server
Blacklist
swltcho81.com
www.hjaoopoa.top
textspeier.de
…
5. Idea
Ease the analysis by
clustering network traffic
Implement a self-learning
methodology to automatically
associate previously observed
services and identify new traffic
generated by possibly suspicious
applications.
10. LENTA
overview
[Figure: timeline of Day 1 and Day 2. For each day: the clusters found, their sampled version, and the resulting System Knowledge.]
Clusters’ sampling is performed to facilitate computation and storage
11. LENTA
overview
[Figure: timeline of Day 1 to Day 4. For each day: the clusters found, their sampled version, and the resulting System Knowledge.]
Clusters’ sampling is performed to facilitate computation and storage
13. Number of unique URLs found in a
network over a week of observation
URLs observation
14. Clustering
URL comparison is performed with a string distance based on edit distance,
i.e., the number of edits needed to make one string equal to the other
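A minimal sketch of such a distance, assuming the classic Levenshtein definition (insertions, deletions, and substitutions each cost one edit). The slides do not state which variant or normalization is used, so `normalized_distance` is an assumption added only to keep values comparable across URLs of different lengths.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (O(len(a) * len(b)) time)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Two URLs that differ by a single character are one edit apart.
one_edit = edit_distance("example.com/img/1.png", "example.com/img/2.png")  # 1
```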
New: A recursive version of DBSCAN clustering to
- Reduce data complexity
- Improve clustering accuracy
15. CLUE[1] - Big data approach for HTTP mining
HTTP traffic analysis.
Pipeline: Log → URLs extraction → Distance calculation → DBSCAN calculation → Results
- How to find similar URLs?
- How to group similar URLs?
- Which clustering algorithm? Which parameters?
[1] Morichetta, A., Bocchi, E., Metwalley, H., & Mellia, M. (2016, September). CLUE: clustering for mining web URLs. In Teletraffic
Congress (ITC 28), 2016 28th International (Vol. 1, pp. 286-294). IEEE.
16. I-DBSCAN
Input: MinPts and the distance matrix
- Find the ϵ that clusters a given percentage of the whole dataset (threshold defined a priori)
- Compute DBSCAN
- Extract clusters (and noise…)
- Compute the silhouette index of each cluster (values in the figure: -0.3, 0.5, 0.9)
- Iterate the process over clusters with silhouette below a threshold
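The loop above can be sketched in plain Python over a precomputed distance matrix. This is an illustration under stated assumptions, not the authors' implementation: the names `choose_eps`, `silhouettes`, and `i_dbscan`, the list of candidate ϵ values, the `coverage` and `sil_min` parameters, and the recursion limit `depth` are all hypothetical.

```python
from statistics import mean

NOISE = -1


def dbscan(dist, eps, min_pts):
    """Plain DBSCAN on a distance matrix; returns one label per point."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = [j for j in range(n) if dist[i][j] <= eps]
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = [k for k in range(n) if dist[j][k] <= eps]
            if len(reach) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(reach)
    return labels


def choose_eps(dist, min_pts, coverage, candidates):
    """Labels for the smallest candidate eps that clusters >= coverage of the points."""
    labels = []
    for eps in sorted(candidates):
        labels = dbscan(dist, eps, min_pts)
        if sum(label != NOISE for label in labels) >= coverage * len(labels):
            break
    return labels


def silhouettes(dist, labels):
    """Group points by cluster and compute each cluster's mean silhouette."""
    groups = {}
    for i, label in enumerate(labels):
        if label != NOISE:
            groups.setdefault(label, []).append(i)
    scores = {}
    if len(groups) < 2:  # silhouette is undefined with a single cluster
        return groups, scores
    for c, members in groups.items():
        vals = []
        for i in members:
            a = mean(dist[i][j] for j in members if j != i) if len(members) > 1 else 0.0
            b = min(mean(dist[i][j] for j in other)
                    for o, other in groups.items() if o != c)
            vals.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        scores[c] = mean(vals)
    return groups, scores


def i_dbscan(dist, points, min_pts, coverage, sil_min, candidates, depth=2):
    """Cluster `points`; re-cluster any cluster whose silhouette is below sil_min."""
    sub = [[dist[i][j] for j in points] for i in points]
    groups, scores = silhouettes(sub, choose_eps(sub, min_pts, coverage, candidates))
    result = []
    for c, members in groups.items():
        original = [points[m] for m in members]
        if depth > 0 and scores.get(c, 1.0) < sil_min:
            result.extend(i_dbscan(dist, original, min_pts, coverage,
                                   sil_min, candidates, depth - 1))
        else:
            result.append(original)
    return result


# Toy data: 1-D values with absolute difference standing in for the string distance.
values = [0, 1, 2, 4, 5, 6, 9, 11]
toy_dist = [[abs(x - y) for y in values] for x in values]
clusters = i_dbscan(toy_dist, list(range(len(values))), 2, 1.0, 0.6, [1, 2])
```

In this toy run the first pass needs the larger ϵ to cover every point, which chains the first six values into one loose cluster; its low silhouette triggers a second pass that splits it. Points that become noise during a re-clustering pass are simply dropped in this sketch.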
17. More clusters and higher cluster
quality, thanks to recursive clustering
of badly formed clusters
The closer the silhouette is to
one, the better formed the
cluster
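For reference, the standard silhouette of an element i assigned to cluster C_I, with d the string distance, is usually written as follows. Taking a cluster's score as the mean of s(i) over its members is an assumption here, since the slides do not say how the per-cluster value is aggregated.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1]
```

Here a(i) is the mean distance to the other members of the same cluster and b(i) is the mean distance to the nearest other cluster, so s(i) approaches 1 when a(i) is much smaller than b(i).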
I-DBSCAN
Iterative DBSCAN over
clustering results
18. Sampling eases the comparison between clusters, reduces computational
complexity, and keeps compact traffic digests with a smaller footprint
The medoid is appropriate for spherical and homogeneous clusters. We
implemented percentile sampling to produce a sample that is more
representative of the cluster's population
Sampling
19. For each element in the cluster we
compute its mean intra-cluster
distance, i.e., the mean of the pairwise
distances between it and every other
element in the cluster.
We then order the elements by their
mean intra-cluster distance.
Percentile sampling
Choosing representatives
for clusters
20. We extract m percentiles from this
distribution and pick the
corresponding elements.
The idea is to have a set of cluster
subsamples (representatives) that
includes both elements in the
central area of a cluster and those
at its border, dividing it into equal
sets.
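The two steps (ordering by mean intra-cluster distance, then picking m percentiles) can be sketched as follows. The exact percentile positions are not given in the slides; this sketch assumes m evenly spaced ranks from the most central element to the outermost one, and the name `percentile_sample` is hypothetical.

```python
def percentile_sample(dist, m):
    """Pick m representatives of one cluster from its pairwise distance matrix.

    Returns indices ordered from the most central element outwards.
    """
    n = len(dist)
    if n <= 1:
        return list(range(n))
    # Mean distance from each element to every other element (diagonal is 0).
    mean_d = [sum(row) / (n - 1) for row in dist]
    order = sorted(range(n), key=mean_d.__getitem__)
    if m >= n:
        return order
    if m <= 1:
        return order[:1]  # a single representative: the medoid
    # Evenly spaced ranks from the first (center) to the last (border) element.
    return [order[k * (n - 1) // (m - 1)] for k in range(m)]


# Toy cluster: 1-D values with absolute difference as the distance.
values = [0, 1, 2, 3, 10]
toy_dist = [[abs(x - y) for y in values] for x in values]
representatives = percentile_sample(toy_dist, 3)  # [2, 3, 4]
```

With m = 1 the sample reduces to the medoid; larger m adds elements progressively closer to the border of the cluster.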
Percentile sampling
Choosing representatives
for clusters
21. The number of subsamples chosen is
a trade-off between precision and
complexity.
We tested it using two sets of
clustering results from a day of traffic.
The first builds the System
Knowledge and contains half of the
clusters, selected from C. The second
set contains all clusters.
Percentile sampling
Choosing m size
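22. Using string distance, new clusters are compared to the ones in the System
Knowledge and added to it if the distance to the closest old cluster reaches or
exceeds a threshold θ:
\[ \hat{K}(t) = \hat{K}(t-1) \cup \{\, \hat{C}_i(t) \in \hat{C}(t) : \mathrm{dist}(\hat{C}_i(t), \hat{K}(t-1)) \geq \theta \,\} \]
where \hat{K}(t) is the System Knowledge at day t and \hat{C}(t) is the set of sampled clusters of day t.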
When a new cluster is associated with an old one, random replacement
updates the System Knowledge by swapping out “old” representatives
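A minimal sketch of this update rule follows. Several details are assumptions, since the slides do not specify them: the cluster-to-cluster distance is taken as the mean pairwise distance between representatives, exactly one old representative is replaced per association, and `host_distance` is only a toy stand-in for the string distance.

```python
import random
from statistics import mean


def cluster_distance(reps_a, reps_b, dist):
    """Mean pairwise distance between two sets of representatives (assumed)."""
    return mean(dist(a, b) for a in reps_a for b in reps_b)


def update_knowledge(knowledge, new_clusters, dist, threshold, rng=random):
    """Add far-away clusters to the knowledge; refresh the closest one otherwise."""
    previous = list(knowledge)  # snapshot of the clusters known before today
    for reps in new_clusters:
        scores = [cluster_distance(reps, old, dist) for old in previous]
        if not scores or min(scores) >= threshold:
            knowledge.append(list(reps))  # unseen traffic: new entry
        else:
            closest = previous[scores.index(min(scores))]
            # Random replacement: swap one old representative for a new one.
            closest[rng.randrange(len(closest))] = rng.choice(list(reps))
    return knowledge


def host_distance(a, b):
    """Toy distance: 0 if two URLs share the host part, 1 otherwise."""
    return 0.0 if a.split("/")[0] == b.split("/")[0] else 1.0


knowledge = [["a.com/x", "a.com/y"]]
today = [["a.com/z"], ["b.org/1", "b.org/2"]]
update_knowledge(knowledge, today, host_distance, 0.5, random.Random(0))
```

After the call, the `a.com` entry still holds two representatives, one of which is now `a.com/z`, and the `b.org` cluster has been appended as a new entry.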
System knowledge enhancement
23. Starting from an initial group of
almost 33,000 unique URLs, we
artificially create new groups by
progressively injecting URLs
belonging to different applications
into the previous data set.
The picture shows that
LENTA identifies multiple
clusters at each stage.
In vitro experiment
LENTA reaction to
anomalous traffic
25. Future steps
Focus on HTTPS traffic to have a complete view of the network activities
Extend big data approaches to all stages of the system, to scale the
analysis
Apply LENTA to different lexical features, e.g., hostnames in DNS
queries or user agents in HTTP requests
29. HTTP vs HTTPS over time
[Figure: share (%) of each protocol (HTTP, TLS, SPDY, HTTP/2, QUIC, FB-ZERO) from 2013/04 to 2017/10, with markers A to F.]