slides_itc30_2018_Morichetta_v2.pdf

LENTA
Longitudinal Exploration for Network Traffic Analysis
Andrea Morichetta, Marco Mellia

The Web today
A proliferation of services
that rely on HTTP
Each day, hundreds of thousands of
unique URLs need to be analyzed by
the network analyst
- For traffic analysis
- For performance tuning
- For security
- …

Malware
DGA technique
State of the art: firewalls block
malicious traffic using static rules.
Countermeasure: a Domain Generation
Algorithm (DGA) generates
pseudo-random domains starting
from common seeds (e.g., the current
date or Twitter trends), eluding static
controls based on blacklists.
rammyjuke.com
C&C Server
Blacklist
swltcho81.com
www.hjaoopoa.top
textspeier.de
…

Malware
DGA technique
swltcho81.com/NZf4A07d7r7yE1C1dmVyPTQuMCZiaWQ9YjZjYW
VhNjE0NjhhMmQ4ZTc0OGQ3ZTEzMTIyMDZiMDQ4NWY2MjJhY
SZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdvb2ds
ZS5pdCZxPXVpbmZlIG5ZGVzaw==38c
rammyjuke.com/kaI1wWRd8Y5yfbU9dmVyPTQuMCZiaWQ9YjZjY
WVhNjE0NjhhMmQ4ZTc0OGQ3ZTEzMTIyMDZiMDQ4NWY2MjJh
YSZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdvb2d
sZS5pdCZxPWZvcnVtIGFybWF0YSBkZWxsZSB0ZW5lYnJl37g
Looking more closely at the paths,
the structure is similar, so it is
still possible to match them.
Blacklist
swltcho81.com
www.hjaoopoa.top
textspeier.de
… rammyjuke.com
C&C Server

Idea
Ease the analysis by
clustering network traffic
Implement a self-learning
methodology to automatically
associate previously observed
services and identify new traffic
generated by possibly suspicious
applications.

LENTA
overview
[Figure: timeline from Day 1 to Day 4. Each day t produces clusters C(t); the clusters are sampled, and the sampled clusters feed the System Knowledge for that day.]
Cluster sampling is performed to facilitate computation and storage

Traffic Collection
[Figure: HTTP requests from internal clients to external servers, observed at the edge router]

Number of unique URLs found in a
network over a week of observation
URLs observation

Clustering
URL comparison relies on a string distance based on edit distance, i.e., the
number of edits necessary to make one string
equal to the other
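To make the edit-distance idea concrete, here is a minimal sketch in Python. It assumes plain Levenshtein distance (insertions, deletions, substitutions); the exact variant used in LENTA is not stated on the slide. The helper `normalized_distance` and its scaling by the longer string are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Scaling to [0, 1] makes URLs of different lengths comparable under a single clustering radius, which is one plausible reason to normalize.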
New: A recursive version of DBSCAN clustering to
- Reduce data complexity
- Improve clustering accuracy

CLUE [1] - Big data approach for HTTP mining
HTTP traffic analysis pipeline: Log → URLs extraction → Distance calculation → DBSCAN calculation → Results
- How to find similar URLs?
- How to group similar URLs?
- Which clustering algorithm? Which parameters?
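As an illustration of the clustering step, the following is a minimal pure-Python sketch of standard DBSCAN run on a precomputed distance matrix. It is not the CLUE implementation; the function name `dbscan`, the `NOISE` constant, and the list-of-lists matrix format are assumptions made for this example.

```python
from collections import deque

NOISE = -1


def dbscan(dist, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    A point is a core point if at least min_pts points, itself included,
    lie within eps of it.
    """
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand only through core points
        cluster_id += 1
    return labels
```

Working from a distance matrix rather than coordinates is what lets the same routine cluster strings such as URLs.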
[1] Morichetta, A., Bocchi, E., Metwalley, H., & Mellia, M. (2016, September). CLUE: clustering for mining web URLs. In Teletraffic Congress (ITC 28), 2016 28th International (Vol. 1, pp. 286-294). IEEE.

I-DBSCAN
Input: Distance Matrix, MinPts
- Find the ϵ that clusters a certain percentage of the whole dataset
- Threshold defined a priori
- Compute DBSCAN
- Extract clusters (and noise…)
- Silhouette index (values shown in the figure: - 0.3, 0.5, 0.9)
- Iterate the process over clusters with silhouette below a threshold
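One way to find the ϵ that clusters a target share of the dataset is sketched below. It relies on the DBSCAN property that a point ends up in a cluster exactly when it is a core point or lies within ϵ of one, so the clustered fraction can be computed without a full clustering run. The function names, the `target_fraction` parameter, and the choice to search only over observed distances are assumptions, not the authors' procedure.

```python
def clustered_fraction(dist, eps, min_pts):
    """Fraction of points DBSCAN would assign to some cluster for this eps."""
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    # A point is clustered if it is core or has a core point within eps.
    clustered = sum(1 for i in range(n) if any(core[j] for j in neighbors[i]))
    return clustered / n if n else 0.0


def find_eps(dist, min_pts, target_fraction):
    """Smallest observed distance that clusters at least target_fraction."""
    candidates = sorted({d for row in dist for d in row if d > 0})
    lo, hi = 0, len(candidates) - 1
    best = None
    # The clustered fraction never decreases as eps grows, so binary search works.
    while lo <= hi:
        mid = (lo + hi) // 2
        if clustered_fraction(dist, candidates[mid], min_pts) >= target_fraction:
            best = candidates[mid]
            hi = mid - 1
        else:
            lo = mid + 1
    return best
```

`find_eps` returns `None` when no candidate reaches the target, for example when the dataset holds fewer than `min_pts` points.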

More clusters and higher cluster
quality, thanks to recursive clustering
of badly formed clusters
The closer the silhouette is to
one, the better formed the
cluster is
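The silhouette criterion can be sketched as follows, using the standard definition: for each point, a is the mean distance to the rest of its cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The function names, the handling of noise as label -1, and the score of 0 for singleton clusters are assumptions for this example.

```python
def cluster_silhouettes(dist, labels, noise=-1):
    """Mean silhouette of each cluster, from a pairwise distance matrix.

    Noise points are ignored. Returns {cluster_label: mean silhouette}.
    """
    members = {}
    for i, lab in enumerate(labels):
        if lab != noise:
            members.setdefault(lab, []).append(i)

    def mean_dist(i, group):
        return sum(dist[i][j] for j in group) / len(group)

    scores = {}
    for lab, group in members.items():
        others = [g for other, g in members.items() if other != lab]
        total = 0.0
        for i in group:
            rest = [j for j in group if j != i]
            if not rest or not others:
                continue  # singleton or only one cluster: score 0 by convention
            a = mean_dist(i, rest)
            b = min(mean_dist(i, g) for g in others)
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        scores[lab] = total / len(group)
    return scores


def clusters_to_refine(dist, labels, threshold):
    """Labels of clusters whose mean silhouette is below the threshold."""
    scores = cluster_silhouettes(dist, labels)
    return [lab for lab, s in scores.items() if s < threshold]
```

In an iterative scheme, the clusters returned by `clusters_to_refine` would be the ones handed back to DBSCAN for another pass.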
I-DBSCAN
Iterative DBSCAN over
clustering results

Sampling eases the comparison between clusters, reduces
computational complexity, and keeps compact traffic digests with a smaller footprint
The medoid is an appropriate representative only for spherical and homogeneous clusters. We
implemented percentile sampling to obtain a sample that
better represents the population of the cluster
Sampling

For each element in the cluster we
compute its mean intra-cluster
distance, i.e., the mean of the pairwise
distances between it and every other
element in the cluster.
We then order the elements by their
mean intra-cluster distance.
Percentile sampling
Choosing representatives
for clusters

We extract from this distribution m
percentiles and pick the
corresponding elements.
The idea is to have a set of cluster
subsamples (representatives) that
includes both elements in
the central area of a cluster and
elements at its border, dividing it into equal
sets.
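A minimal sketch of percentile sampling under stated assumptions: elements are ranked by mean intra-cluster distance, and m evenly spaced ranks are picked, from the most central element to the most peripheral one. The function name, the inclusion of both extremes, and returning the whole cluster when it has at most m elements are illustrative choices, not details taken from the slides.

```python
def percentile_sample(dist, members, m):
    """Pick up to m representatives of a cluster, from center to border.

    dist is a pairwise distance matrix; members lists the indices of the
    cluster's elements; m is the number of percentiles to extract.
    """
    if len(members) <= m:
        return list(members)

    def mean_intra(i):
        others = [j for j in members if j != i]
        return sum(dist[i][j] for j in others) / len(others)

    # Order elements from the most central to the most peripheral.
    ordered = sorted(members, key=mean_intra)
    if m == 1:
        return [ordered[0]]  # the most central element only
    # Evenly spaced ranks over the ordered list, both extremes included.
    last = len(ordered) - 1
    ranks = [round(k * last / (m - 1)) for k in range(m)]
    return [ordered[r] for r in ranks]
```

With m = 1 the sample reduces to the element with the smallest mean intra-cluster distance, which is the medoid; larger m adds elements progressively closer to the border.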
Percentile sampling
Choosing representatives
for clusters

The number of subsamples chosen is
a trade-off between precision and
complexity.
We tested it on two sets of
clustering results from one day of traffic.
The first builds the System
Knowledge and contains half of the clusters,
selected from C. The second set
contains all clusters.
Percentile sampling
Choosing m size

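Using string distance, new clusters are compared to the ones in the System
Knowledge and added to it if the distance to the closest old cluster is higher
than a threshold $\theta$:
$$\tilde{K}(t) = \tilde{K}(t-1) \cup \{\, \tilde{C}_i(t) \in C(t) : \mathrm{dist}(\tilde{C}_i(t), \tilde{K}(t-1)) \geq \theta \,\}$$
where $\tilde{K}(t)$ is the System Knowledge at day $t$ and $\tilde{C}_i(t)$ is the sampled $i$-th cluster of day $t$.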
When a new cluster is associated with an old one, random replacement
updates the System Knowledge by replacing “old” representatives
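The update rule can be sketched as below. Several details are assumptions made for illustration: the cluster-to-cluster distance is taken as the mean pairwise distance between representatives, a matched cluster has exactly one randomly chosen old representative overwritten, and the names `cluster_distance` and `update_knowledge` are hypothetical.

```python
import random


def cluster_distance(reps_a, reps_b, dist_fn):
    """Mean pairwise distance between two sets of representatives (assumed)."""
    total = sum(dist_fn(a, b) for a in reps_a for b in reps_b)
    return total / (len(reps_a) * len(reps_b))


def update_knowledge(knowledge, new_clusters, dist_fn, threshold, rng=random):
    """Merge one day's sampled clusters into the System Knowledge in place.

    knowledge and new_clusters are lists of representative lists.
    Returns the clusters that were added as previously unseen.
    """
    old = list(knowledge)  # compare only against the previous day's knowledge
    added = []
    for reps in new_clusters:
        nearest = min((cluster_distance(reps, k, dist_fn) for k in old),
                      default=None)
        if nearest is None or nearest >= threshold:
            knowledge.append(list(reps))  # far from everything known: new cluster
            added.append(reps)
        else:
            # Known cluster: overwrite one random old representative.
            closest = min(old, key=lambda k: cluster_distance(reps, k, dist_fn))
            closest[rng.randrange(len(closest))] = rng.choice(reps)
    return added
```

Passing a seeded `random.Random` instance as `rng` makes the replacement reproducible, which helps when comparing runs.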
System knowledge enhancement

Starting from an initial group of
almost 33,000 unique URLs, we
artificially create new groups by
progressively injecting URLs
belonging to different applications
into the previous data set.
The picture shows that
LENTA is able to identify multiple
clusters at each stage.
In vitro experiment
LENTA reaction to
anomalous traffic

Future steps
Focus on HTTPS traffic to have a complete view of network activity
Extend big data approaches to all stages of the system, to scale the
analysis
Apply LENTA to other lexical features, e.g., hostnames in DNS
queries or user agents in HTTP requests

• Add a final slide with “questions” :)
Questions?

[Figure: elapsed time in seconds (10^1 to 10^4) versus dataset size (10^3 to 10^5), Centralized vs. Spark]
Computing the pairwise distances
between points is the most complex
and time-consuming step in
clustering algorithms.
We implemented a parallelized
computation of distances on Spark,
obtaining better results with respect
to a centralized approach.
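The reason this step parallelizes well is that every pair of points is an independent unit of work. The sketch below shows that partitioning with Python's standard `concurrent.futures` thread pool; it is a stand-in for illustration, not the Spark implementation described above. The function name, `workers`, and `chunk_size` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pairwise_distances(items, dist_fn, workers=4, chunk_size=1000):
    """Symmetric distance matrix computed chunk by chunk in a worker pool."""
    n = len(items)
    pairs = list(combinations(range(n), 2))  # each unordered pair once
    chunks = [pairs[k:k + chunk_size] for k in range(0, len(pairs), chunk_size)]

    def process(chunk):
        # Independent unit of work: no chunk needs results from another.
        return [(i, j, dist_fn(items[i], items[j])) for i, j in chunk]

    matrix = [[0.0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for results in pool.map(process, chunks):
            for i, j, d in results:
                matrix[i][j] = matrix[j][i] = d  # fill both halves
    return matrix
```

In CPython, threads do not speed up a CPU-bound pure-Python distance function, so this only illustrates how the work is split; a real speedup would need separate processes or a cluster framework such as Spark.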
Distance matrix
Computing pairwise
distances


HTTP vs HTTPS over time
[Figure: traffic share [%] per protocol (FB-ZERO, SPDY, HTTP/2, TLS, QUIC, HTTP) from 2013/04 to 2017/10, with markers A to F]
