1. CLUE
CLUSTERING FOR MINING WEB URLS
Andrea Morichetta
Enrico Bocchi
Hassan Metwalley
Marco Mellia
name.surname@polito.it
ITC28
Würzburg, September 15th, 2016
3. SCENARIO
Internet evolution and the need for monitoring.
3,256,931,615 users in the world
December 2nd 2015 [http://www.internetlivestats.com]
1,930,257,214 subscriptions to "mobile networks"
December 2013 [Source: ITU]
Network monitoring: to obtain quality and security
6. THE WEB AND MALICIOUS TRAFFIC
HTTP traffic monitoring to track anomalous and potentially malicious behaviors.
Malware
Zero-day
Compromised machines talk to the C&C.
C&C Server
Firewall
Compromised Host
mlw.com/abc
Firewall blocks malicious requests using static rules.
7. THE WEB AND MALICIOUS TRAFFIC
HTTP traffic monitoring to track anomalous and potentially malicious behaviors.
Malware
Zero-day
C&C Server
Firewall
Compromised host
mlw.com/abc
malw.com/abd
Algorithmically generated URLs starting from seeds (e.g. current date or Twitter trends)
They elude static, blacklist-based controls by changing the URLs' paths and hostnames
8. THE WEB AND MALICIOUS TRAFFIC
HTTP traffic monitoring
Group algorithmically generated URLs.
Control and monitor possible, not yet checked, malicious behaviors.
Or, more generally, better understand the traffic on the Web.
9. EXAMPLE: TIDSERV
Malware TidServ analysis.
Profit-making purpose
It spreads with users' complicity
URLs characterized by pseudo-randomness
Trojan Rootkit
10. EXAMPLE: TIDSERV
Malware TidServ analysis.
swltcho81.com/NZf4A07d7r7yE1C1dmVyPTQuMCZiaWQ9YjZjYWVhNjE0NjhhMmQ4ZTc0OGQ3ZTEzMTIy
MDZiMDQ4NWY2MjJhYSZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdvb2dsZS5pdCZxPXV
pbmZlIG5ZGVzaw==38c
rammyjuke.com/kaI1wWRd8Y5yfbU9dmVyPTQuMCZiaWQ9YjZjYWVhNjE0NjhhMmQ4ZTc0OGQ3ZTEzMT
IyMDZiMDQ4NWY2MjJhYSZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdvb2dsZS5pdCZxP
WZvcnVtIGFybWF0YSBkZWxsZSB0ZW5lYnJl37g
11. EXAMPLE: TIDSERV
How to automatically detect this behavior?
Which services adopt these techniques?
13. CLUE in a nutshell
• HTTP traffic analysis -> How to find similar URLs?
• How similar are two strings?
• How to group similar URLs?
• Clustering algorithms -> Which algorithm? Which parameters?
• How to suggest relevant clusters?
• Highlight relevant clusters to further mine
Big data approach for HTTP mining
Log -> Distance calculation -> DBSCAN clustering -> Results
CLUE: CLustering for URL Exploration
14. SCENARIO
Traffic collected from a network with more than 20,000 connected hosts.
[Diagram: Internal Clients, Edge Router, External Servers; HTTP requests; IDS (Labels); pipeline Log -> URLs extraction -> Distance calculation -> DBSCAN calculation -> Results]
16. SIMILARITY
Comparing elements without good a priori knowledge of them.
LEVENSHTEIN DISTANCE
Plain Levenshtein distance: assigns a unit cost to all edit operations.
JARO DISTANCE
The Jaro algorithm is a measure that evaluates the number and order of the features two strings have in common.
URL DISTANCE
Modified Levenshtein: unit weight for add and remove edit operations, double weight for replacements.
Edit Distance
A class of distance functions in which, given two strings s and t, the distance is the cost of the best sequence of edit operations that converts s to t.
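The edit-distance definition above can be illustrated with a small dynamic-programming sketch. This is only an illustration, not the code used in CLUE: the function name `edit_distance` and its cost parameters are hypothetical, and the strings in the usage lines are a textbook example rather than data from the deck.

```python
def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Cost of the cheapest sequence of edits that converts s into t."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] = cost of converting the prefix s[:i] into the prefix t[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost          # remove every character of s[:i]
    for j in range(1, cols):
        d[0][j] = j * ins_cost          # add every character of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = s[i - 1] == t[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                        # remove s[i-1]
                d[i][j - 1] + ins_cost,                        # add t[j-1]
                d[i - 1][j - 1] + (0 if same else sub_cost),   # keep or replace
            )
    return d[-1][-1]


unit = edit_distance("kitten", "sitting")                  # all edits cost 1
weighted = edit_distance("kitten", "sitting", sub_cost=2)  # replacements cost 2
print(unit, weighted)
```

With the default unit costs this is the plain Levenshtein distance; passing `sub_cost=2` gives the modified weighting described for the URL distance.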
17. DISTANCE EVALUATION IN PRACTICE
Comparing the behavior of distance measures on TidServ elements.
LEVENSHTEIN
DISTANCE
a.
swltcho81.com/NZf4A07d7r7yE1C1dmVyPTQuMCZiaWQ9YjZjYWVhNjE0NjhhMmQ4ZTc0OGQ
3ZTEzMTIyMDZiMDQ4NWY2MjJhYSZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdv
b2dsZS5pdCZxPXVpbmZlIG5ZGVzaw==38c
b.
iau71nag001.com/NZf4A07d7r7yE1C1dmVyPTQuMCZiaWQ9YjZjYWVhNjE0NjhhMmQ4ZTc0OG
Q3ZTEzMTIyMDZiMDQ4NWY2MjJhYSZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmd
vb2dsZS5pdCZxPXVpbmZlIG15ZGVzaw==38c
c.
rammyjuke.com/kaI1wWRd8Y5yfbU9dmVyPTQuMCZiaWQ9YjZjYWVhNjE0NjhhMmQ4ZTc0OGQ
3ZTEzMTIyMDZiMDQ4NWY2MjJhYSZhaWQ9NDAxOTcmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdv
b2dsZS5pdCZxPWZvcnVtIGFybWF0YSBkZWxsZSB0ZW5lYnJl37g
d.
iau71nag001.com/Kvb13nWd6P4XrFs3dmVyPTQuMiZiaWQ9MDU0NWQwZDQwY2MyODU4YWNj
YzFlZjJkM2FiZDA5N2RiYmRlYmVkZiZhaWQ9NTAwMTgmc2lkPTAmcmQ9MCZlbmc9d3d3Lmdv
b2dsZS5pdCZxPWZhY2Vib29r27c
e.
zhakazth.cn/qkF3Vrye5c4qHoo4dmVyPTUuMCZzPTAmYmlkPTA1MzMyNGU1MzQzMDY5NTZiYW
YxNGViYTQ5YWY4ZGZhM2I2OWEwYTQmYWlkPTMwNDIxJnNpZD0zJmVuZz13d3cuZ29vZ2xlLml
0JnE9dHJvbWJhdGErdnVhaWVyK2NvbiticmFzaWxpYW5hK2luK3NwaWFnZ2lhJng4Nj02NA==16h
a-b: 11
a-c: 56
a-d: 97
a-e: 182
19. URL DISTANCE
Measure of string similarity.
FORMULA
$$D_{URL}(string_1, string_2) = \frac{\mathrm{LevenshteinDistance}(string_1, string_2)}{|string_1| + |string_2|}$$
Based on the Levenshtein distance, with unit weight for additions and removals, double weight for replacements, plus normalization.
EXAMPLE
$url_1$: 10 characters; $url_2$: 11 characters
$$\mathrm{LevenshteinDistance}(url_1, url_2) = \underbrace{1}_{\text{1 add, weight 1}} + \underbrace{2}_{\text{1 replacement, weight 2}} = 3$$
$$D_{URL}(url_1, url_2) = \frac{3}{10 + 11} = 0.143$$
How to use this metric to group
similar URLs?
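As a rough sketch of the metric defined on this slide (weighted Levenshtein divided by the sum of the two string lengths), the following Python may help. It is an assumption-laden illustration, not the CLUE implementation: the function names are hypothetical, and the two example strings are made up so that they match the slide's numbers (10 and 11 characters, one addition and one replacement).

```python
def weighted_levenshtein(s, t):
    """Levenshtein distance with cost 1 for add/remove and cost 2 for replace."""
    prev = list(range(len(t) + 1))      # converting "" into t[:j] takes j additions
    for i, cs in enumerate(s, start=1):
        curr = [i]                      # converting s[:i] into "" takes i removals
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                            # remove cs
                curr[j - 1] + 1,                        # add ct
                prev[j - 1] + (0 if cs == ct else 2),   # keep or replace
            ))
        prev = curr
    return prev[-1]


def url_distance(u1, u2):
    """Weighted Levenshtein distance normalized by the sum of the lengths."""
    total = len(u1) + len(u2)
    if total == 0:
        return 0.0                      # two empty strings are identical
    return weighted_levenshtein(u1, u2) / total


# Hypothetical strings chosen to reproduce the slide's arithmetic: 3 / (10 + 11)
url1, url2 = "google.com", "1google.con"
print(weighted_levenshtein(url1, url2), round(url_distance(url1, url2), 3))
```

Because a replacement costs as much as one removal plus one addition, the weighted distance can never exceed the sum of the two lengths, so the normalized value always falls between 0 (identical strings) and 1 (no characters in common).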
20. DBSCAN
Clustering algorithm used for grouping URLs together.
Based on the idea of density, intended as the number of points in a specific area; compared to other algorithm families it provides partial solutions (not every point has to be assigned to a cluster).
Features
It allows the presence of outliers: this prevents non-coherent elements from being added to a cluster.
No need to define the number of clusters a priori
No need to define centroids
Does not require points in a Euclidean space
Can handle clusters of different shapes, not only globular ones
Parameters
Epsilon, radius of the considered area
Min points, minimum number of points inside the area
[Figure: Example of clustering with DBSCAN]
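To make the role of Epsilon and Min points concrete, here is a minimal pure-Python DBSCAN that works directly on a precomputed distance matrix, so no Euclidean coordinates are needed. It is a didactic sketch under stated assumptions, not the implementation behind the deck: the function name `dbscan` is hypothetical, a point counts as its own neighbor, and the 1-D toy points stand in for URLs.

```python
def dbscan(dist, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for outliers."""
    n = len(dist)
    labels = [None] * n                 # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1              # provisionally an outlier
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # former outlier becomes a border point
                continue
            if labels[j] is not None:
                continue                # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also a core point: keep expanding
    return labels


# Toy data: two dense groups and one isolated point
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
dist = [[abs(p - q) for q in points] for p in points]
print(dbscan(dist, eps=0.15, min_pts=2))
```

Since the function only reads the matrix, any pairwise measure (such as a normalized string distance between URLs) can be plugged in without changing the algorithm.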
21. SCHEMA
Final schema. Developed in Python.
[Diagram: Log files -> Extract HTTP Object URLs -> URLs List -> Compute Distance Matrix (URL distance between every pair of elements) -> Distance Matrix -> Load Distance Matrix -> Compute DBSCAN -> Clusters, Statistics]
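The "Compute Distance Matrix" step of the schema could look roughly like the sketch below, which evaluates a distance for every pair of elements once and mirrors it across the diagonal. Everything here is illustrative: `distance_matrix` and `toy_distance` are hypothetical names, the toy metric is only a placeholder for the URL distance, and the third URL is invented for the example.

```python
def distance_matrix(items, distance):
    """Symmetric matrix of pairwise distances; each pair is computed only once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(items[i], items[j])
            matrix[i][j] = d
            matrix[j][i] = d            # the distance is assumed to be symmetric
    return matrix


def toy_distance(a, b):
    """Placeholder metric (NOT the URL distance): normalized length difference."""
    return abs(len(a) - len(b)) / max(len(a), len(b), 1)


urls = ["mlw.com/abc", "malw.com/abd", "example.org/index.html"]
matrix = distance_matrix(urls, toy_distance)
for row in matrix:
    print([round(value, 3) for value in row])
```

The number of pairs grows quadratically with the number of URLs, which is one reason to compute the matrix once and reload it before clustering, as the schema suggests.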
23. ANALYSIS
Analysis of DBSCAN clustering on the 34 hosts' test set.
About
Analysis of the HTTP traffic generated by
14 hosts infected by TidServ
20 randomly selected hosts.
DISTINCT URL ELEMENTS
TidServ          228
Other malware     33
Benign         78160
Total          78421
Is it possible to separate all the 228
malicious URLs from the data?
And which parameters shall be used?
24. CLUSTERING
Results for the 34-host test set.
NUMBER OF OUTLIERS
Performance
The number of outliers decreases as Epsilon grows.
[Plot: number of outliers vs. Epsilon, ranging from "Lots of outliers" to "Few outliers"]
25. CLUSTERING
Results for the 34-host test set.
NUMBER OF CLUSTERS
Performance
The relation between Epsilon and the number of clusters is more complicated.
The number of clusters increases for Epsilon = 0.2 and 0.225, because many elements previously considered noise form new clusters.
[Plot: number of clusters vs. Epsilon, ranging from "Lots of very small clusters" to "Few giant clusters"]
Which Epsilon allows us to isolate the 228 malicious URLs?
Note: from 78,000+ URLs to 300 clusters
26. CLUSTERING
Results for the 34-host test set.
CLUSTERING RESULTS FOR TIDSERV - OUTLIERS
Performance
The number of outliers decreases, reaching 0 for Epsilon = 0.4.
All TidServ URLs are clustered
27. CLUSTERING
Results for the 34-host test set.
CLUSTERING RESULTS FOR TIDSERV
Performance
Steady and consistent growth in the number of known elements included, and ability to aggregate additional elements that were not reported.
[Plot annotations: Nr. of IDS-flagged URLs (228); Few giant clusters]
Why are more than 228 URLs actually clustered?
28. CLUSTERING
Results for the 34-host test set.
CLUSTERING RESULTS FOR TIDSERV
Cluster ID   TidServ - IDS Count   All elements Count
A                      5                    5
B                     18                   32
C                      5                    6
D                     75                   79
E                    118                  192
F                      6                    6
G                      1                   37
Total                228                  357
Do those clusters actually contain similar URLs?
29. TIDSERV ANALYSIS
Cluster G: Compare Elements
• gnu4oke0r.com/4VY00y9P7Z5xiPs9dmVyPTQuMCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPT
AmcmQ9MCZlbmc9d3d3Lmdvb2dsZS5pdCZxPWxvdWlzIGNydWlzZXM=16h
• lkckclcklii1i.com/TAR3vUsX844qz1c5Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==27g
• lkckclckl1i1i.com/TAR3vUsX844qz1c5Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==27g
• lkckclcklii1i.com/ZvP1nw3P6z6XLSs7Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==26g
• lkckclckl1i1i.com/ZvP1nw3P6z6XLSs7Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPT
AmcmQ9MA==26g
• lkckclcklii1i.com/yVv4l79D5E7yT8u9Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==18x
• lkckclckl1i1i.com/yVv4l79D5E7yT8u9Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==18x
• lkckclcklii1i.com/3Zh2DpoP583XBvc2Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==05Z
• lkckclckl1i1i.com/3Zh2DpoP583XBvc2Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPTA
mcmQ9MA==05Z
• lkckclcklii1i.com/ZaW4pfQP6P4Q7EO9Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPT
AmcmQ9MA==06c
• lkckclckl1i1i.com/ZaW4pfQP6P4Q7EO9Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPT
AmcmQ9MA==06c
• lkckclcklii1i.com/SVn4kZCE8Y6MEes8Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPT
AmcmQ9MA==38A
• lkckclckl1i1i.com/SVn4kZCE8Y6MEes8Y2xrPTIuNCZiaWQ9NWJjNWFiMjE1YjRmN2I4ZjM3OTRmODNkZjhmNWY0ZjFmODZkYjE1YyZhaWQ9MzAwMDEmc2lkPT
AmcmQ9MA==38A
TidServ OK (better than the IDS!)
But what about the 300+ clusters?
30. SILHOUETTE
Distribution of silhouette values for some representative clustering results.
CALCULATIONS
Performance
Consider clusters with more than 20 elements
Most clusters have silhouette > 0
TidServ's clusters are not those with the highest silhouette (between 0.4 and 0.7)
Clusters with silhouette > 0 are associated with algorithmically generated URLs
This behavior is evident for silhouette > 0.7
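A per-cluster silhouette S(C) can be sketched as the mean, over the members of a cluster, of the standard silhouette value (b - a) / max(a, b), where a is the mean distance to the other members of the same cluster and b is the smallest mean distance to any other cluster. The code below is an illustrative sketch, not the deck's implementation: the function name is hypothetical, outliers (label -1) are skipped, and returning 0.0 for singletons or when only one cluster exists is a convention assumed here.

```python
def cluster_silhouettes(dist, labels):
    """Return {cluster_id: mean silhouette of its members}; label -1 is skipped."""
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)

    scores = {}
    for label, members in clusters.items():
        rivals = [m for other, m in clusters.items() if other != label]
        values = []
        for i in members:
            others = [j for j in members if j != i]
            if not others or not rivals:
                values.append(0.0)      # assumed convention for degenerate cases
                continue
            a = sum(dist[i][j] for j in others) / len(others)              # cohesion
            b = min(sum(dist[i][j] for j in m) / len(m) for m in rivals)   # separation
            values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        scores[label] = sum(values) / len(values)
    return scores


# Toy data: two tight groups plus one outlier labeled -1
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 0, 1, 1, 1, -1]
scores = cluster_silhouettes(dist, labels)
print({label: round(score, 2) for label, score in scores.items()})
```

Values close to 1 indicate cohesive, well-separated clusters, while values near or below 0 indicate sparse or overlapping ones, which matches how the slide uses S(C) to rank clusters.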
[Plot: distribution of S(C), ranging from sparse clusters to cohesive clusters]
31. SILHOUETTE
Examples of groupings (Eps = 0.4, MinPts = 4).
Clusters sorted by silhouette coefficient
S(C) = silhouette coefficient for the cluster
S(C)  Main hostname (number of unique hostnames)  Elements  Activity
0.92  skygo_streaming-i.akamaihd.net (1)          551       Streaming
0.91  ad.doubleclick.net (1)                      99        Advertising
0.87  cookex.amp.yahoo.com (1)                    61        Malware
0.85  static.simply.com (1)                       25        File Hosting
0.81  d24w6bsrhbeh9d.cloudfront.net (1)           63        File Hosting
0.81  mfdclk001.org (1)                           27        Malware
0.78  adserver.webads.it (1)                      35        Advertising
0.77  .com (3)                                    37        TidServ
0.75  pixel.quantserve.com (1)                    57        Advertising
0.72  watson.microsoft.com (1)                    29        Windows Debug
0.70  coadvertise.cubecdn.net (1)                 36        Advertising
0.69  atdmt.com (2)                               768       Tracking
0.65  su.ff.avast.com (1)                         82        Avast Update
0.64  log.dmtry.com (1)                           24        Advertising
0.61  clickpixelabn.com (1)                       32        Malware
Take away
Results on other clusters
Clusters contain very similar URLs
Easy to identify specific services
• Streaming
• Ads
• Malware
• Tracking
• Software update
Most are automatically generated URLs
Helps the monitoring/security analyst to understand network traffic
34. CONCLUSIONS & FUTURE WORK
Benefits of the system and possible next steps.
CLUE automatically provides aggregated views of URLs
• Simplifies network/security administrator's tasks
Use of passively monitored network traffic
• Transparent for the user
Completely unsupervised methodology
…In the future:
• Further analyze clusters to extract common, interesting behaviors
• Allow greater system scalability
• Iterative approach
• Use CLUE to identify other interesting patterns (e.g. look at the User Agent)
Editor's Notes
Impact and size of the Internet
Say that Epsilon will be the distance computed by Ratio