5.
The study's authors defined "creepiness" as the feeling
consumers get when they sense an ad is too personal
because it uses data the consumer did not agree to
provide, such as online-search and browsing history.
Consumers are even more creeped out because
they don't know how and where that information will
be used.
7. Linking Tracker Information
[Figure: visits to amazon.com, imdb.com, and facebook.com, with trackers X and Y; observed IP addresses and identifiers:]
IP 1.1.1.1: ID-A = “aaa”, ID-X = “xxx”
IP 2.2.2.2: ID-B = “bbb”, ID-Y = “yyy”
IP 3.3.3.3: ID-C = “ccc”, ID-X = “xxx”, ID-Y = “yyy”
Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
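The linking shown on this slide can be illustrated with a small union-find over the observed (IP, identifier) pairs. This is only a sketch under simplifying assumptions, not the method of the cited paper: it treats every pair as one observation and merges anything that is transitively seen together. The names `observations` and `link_identifiers` are hypothetical.

```python
from collections import defaultdict

# (IP address, identifier) pairs as listed on the slide.
observations = [
    ("1.1.1.1", "aaa"), ("1.1.1.1", "xxx"),
    ("2.2.2.2", "bbb"), ("2.2.2.2", "yyy"),
    ("3.3.3.3", "ccc"), ("3.3.3.3", "xxx"), ("3.3.3.3", "yyy"),
]

def link_identifiers(pairs):
    """Group IPs and identifiers that are transitively observed together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag nodes so an IP and an identifier can never collide by name.
    for ip, ident in pairs:
        union(("ip", ip), ("id", ident))

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())

clusters = link_identifiers(observations)
```

In this toy data, identifier “xxx” ties IP 1.1.1.1 to 3.3.3.3 and “yyy” ties 2.2.2.2 to 3.3.3.3, so all three IP addresses end up in a single cluster even though the site-specific identifiers differ.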
8. Can’t we block them?
[Figure: a proxy between the user and the requested hosts: Tracker, Tracker, Ad Server, Legitimate site]
9. Available Solutions
AdBlock, DoNotTrack, EasyPrivacy:
crowd-sourced “black lists” of tracker URLs
● not frequently updated
● unclear who blacklists URLs, or based on what criteria
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?
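List-based blocking boils down to checking each requested host against the list. The sketch below assumes the list is a plain set of domains and matches on domain suffixes; real filter lists use a richer rule syntax that is not modeled here. `is_blocked` and the toy `blacklist` are hypothetical names.

```python
def is_blocked(host, blacklist):
    """Return True if host equals, or is a subdomain of, a blacklisted domain."""
    labels = host.lower().rstrip(".").split(".")
    # Try every suffix: "a.b.example.com", "b.example.com", "example.com", "com".
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))

# Toy list for illustration only.
blacklist = {"google-analytics.com", "scorecardresearch.com"}
```

A host that is absent from the list is never blocked, which is why stale lists and “hidden” trackers are a problem for this approach.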
11. Our Goal
Exploit fundamental properties necessary for tracker operation
● structural attributes: connections, network positions
● operational aspects: data volume exchange, communication patterns
Use existing data to build a tracker classifier
12. Can we detect Trackers automatically?
● Are Trackers similar? How?
○ network structure
○ data received/sent
○ response times
○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?
13. The Road to our Goal
● algorithms
● tuning
● features
● combinations of
algorithms and features
and parameters...
15. Basic Dataset Analysis (DataSet API)
● How many requests to Trackers?
● Do Tracker requests have larger latency than other requests?
● How many Trackers
○ per user?
○ per request?
○ per website?
● Do popular websites embed more Trackers than others?
● Do same-topic websites share Trackers?
● Do different users visiting the same website end up on different Trackers?
● Do Trackers send / receive more / less bytes?
● Do they have more / less connections on average?
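A few of these questions can be answered with simple aggregations. The following sketch uses plain Python rather than the DataSet API, and assumes log records of the form (user, referer, host, latency in ms) plus a set of already known trackers; the record layout, the toy values, and the name `basic_stats` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy log records: (user, referer, host, latency_ms). All values are made up.
requests = [
    ("u1", "news.example", "tracker-a.example", 120),
    ("u1", "news.example", "cdn.example", 40),
    ("u2", "news.example", "tracker-b.example", 200),
    ("u2", "shop.example", "tracker-a.example", 100),
    ("u2", "shop.example", "cdn.example", 60),
]
known_trackers = {"tracker-a.example", "tracker-b.example"}

def basic_stats(requests, trackers):
    """Count tracker requests and compare latencies of tracker vs other requests."""
    per_site = defaultdict(set)  # website -> distinct trackers it embeds
    per_user = defaultdict(set)  # user -> distinct trackers contacted
    tracker_latency, other_latency = [], []
    for user, referer, host, latency in requests:
        if host in trackers:
            per_site[referer].add(host)
            per_user[user].add(host)
            tracker_latency.append(latency)
        else:
            other_latency.append(latency)
    return {
        "tracker_request_share": len(tracker_latency) / len(requests),
        "trackers_per_site": {s: len(t) for s, t in per_site.items()},
        "trackers_per_user": {u: len(t) for u, t in per_user.items()},
        "mean_tracker_latency": mean(tracker_latency) if tracker_latency else None,
        "mean_other_latency": mean(other_latency) if other_latency else None,
    }

stats = basic_stats(requests, known_trackers)
```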
16. Main Idea
Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers
[Figure: bipartite graph linking URLs explicitly visited by the user (facebook.com, youtube.com) to embedded URLs (google-analytics.com, b.scorecardresearch.com)]
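A referer → host bipartite graph can be held as two adjacency maps, one per side. The sketch below is a minimal in-memory version, assuming the input is a list of (referer, host) pairs extracted from the logs; the host names are placeholders and `build_bipartite` is a hypothetical helper.

```python
from collections import defaultdict

def build_bipartite(pairs):
    """Build adjacency maps for both sides of a referer -> host bipartite graph."""
    referer_to_hosts = defaultdict(set)
    host_to_referers = defaultdict(set)
    for referer, host in pairs:
        referer_to_hosts[referer].add(host)
        host_to_referers[host].add(referer)
    return dict(referer_to_hosts), dict(host_to_referers)

# Placeholder edges for illustration, not taken from the dataset.
pairs = [
    ("site-a.example", "tracker-x.example"),
    ("site-a.example", "cdn.example"),
    ("site-b.example", "tracker-x.example"),
]
referer_to_hosts, host_to_referers = build_bipartite(pairs)
```

Here `host_to_referers["tracker-x.example"]` contains both sites, which is the kind of structural signal the later attempts try to exploit.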
17. Attempt #1
Relevance Search
Iterative, random walk-like algorithm for bipartite graphs
Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is
Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.
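One way to read “random walk-like” is a random walk with restart from the source node. The sketch below follows that reading and is not the exact algorithm of the cited paper: the restart probability, the iteration count, and the helper names `make_undirected` and `relevance_scores` are assumptions.

```python
from collections import defaultdict

def make_undirected(edges):
    """Turn (referer, host) edges into a symmetric adjacency map."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

def relevance_scores(adj, source, restart=0.15, iters=100):
    """Score every node by a random walk that restarts at `source`."""
    score = {n: 0.0 for n in adj}
    score[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for node, s in score.items():
            if s == 0.0:
                continue
            # Spread the non-restarting mass evenly over the neighbours.
            share = (1.0 - restart) * s / len(adj[node])
            for neighbour in adj[node]:
                nxt[neighbour] += share
        nxt[source] += restart  # the walker jumps back to the source
        score = nxt
    return score

# Toy graph: t1 and t2 share two referers, h3 shares only one with t1.
edges = [("s1", "t1"), ("s1", "t2"), ("s2", "t1"),
         ("s2", "t2"), ("s3", "t1"), ("s3", "h3")]
scores = relevance_scores(make_undirected(edges), source="t1")
```

With `t1` as the source, `t2` ends up with a higher score than `h3` because it is reachable through more shared referers.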
19. Relevance Search Implementation (Gelly API)
● single-source relevance search
○ similar to PageRank
○ easily mapped to vertex-centric iterations
● multi-source relevance search
○ each vertex keeps a vector of scores
○ compute top-k relevant nodes per source
○ merge the top-k lists
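The multi-source steps can be sketched outside of Gelly as follows. The slide does not say how the top-k lists are merged, so this example assumes a union that keeps each node's best score; the per-source scores are toy values and `top_k` and `merge_top_k` are hypothetical helpers.

```python
import heapq

def top_k(scores, k, exclude=()):
    """Return the k highest-scoring (node, score) pairs, skipping excluded nodes."""
    items = [(node, s) for node, s in scores.items() if node not in exclude]
    return heapq.nlargest(k, items, key=lambda item: item[1])

def merge_top_k(per_source_scores, k):
    """Union the per-source top-k lists, keeping the best score per node."""
    merged = {}
    for source, scores in per_source_scores.items():
        for node, s in top_k(scores, k, exclude={source}):
            merged[node] = max(s, merged.get(node, 0.0))
    return merged

# Toy relevance scores for two source nodes (made-up numbers).
per_source_scores = {
    "t1": {"t1": 0.50, "t2": 0.30, "h3": 0.12, "s1": 0.08},
    "t2": {"t2": 0.50, "t1": 0.30, "t4": 0.15, "h3": 0.05},
}
merged = merge_top_k(per_source_scores, k=2)
```

Other merge rules, such as summing scores or counting in how many lists a node appears, would fit the same structure.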
21. Relevance Search Tuning
● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?
22. Relevance Search Problems
● Easy to find the few very similar and the few very
different pages
● Popular trackers are similar to other popular trackers,
but not to less popular ones
● We might keep re-discovering what we already know
28. Attempt #N
Community Detection on the Projection Graph
The Projection Graph captures implicit connections between trackers, through other sites
Do Trackers form communities in the Projection Graph?
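A hosts-projection graph can be derived from the (referer, host) pairs by connecting two hosts whenever they share a referer. The sketch below assumes that definition and, as an extra assumption, weights each edge by the number of shared referers; `project_on_hosts` and the host names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def project_on_hosts(pairs):
    """Connect two hosts if some referer embeds both; weight = shared referers."""
    hosts_by_referer = defaultdict(set)
    for referer, host in pairs:
        hosts_by_referer[referer].add(host)
    weights = defaultdict(int)
    for hosts in hosts_by_referer.values():
        # Sort so each undirected edge has one canonical (a, b) key.
        for a, b in combinations(sorted(hosts), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Placeholder edges for illustration.
pairs = [
    ("site-a.example", "tracker-x.example"),
    ("site-a.example", "tracker-y.example"),
    ("site-b.example", "tracker-x.example"),
    ("site-b.example", "tracker-y.example"),
    ("site-b.example", "cdn.example"),
]
projection = project_on_hosts(pairs)
```

In this toy input the two trackers are linked with weight 2 although they never contact each other directly, which is the implicit connection the slide refers to.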
29. Basic Analysis of the Projection Graph (DataSet & Gelly APIs)
● Do Trackers form connected components?
● Do Trackers have unusually high degrees?
● Are they mainly connected to other Trackers?
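These three questions map to standard graph measurements. The sketch below uses plain Python instead of the DataSet and Gelly APIs, on a toy projection graph with an assumed set of known trackers; all node names and helper functions are hypothetical.

```python
from collections import defaultdict, deque

def adjacency(edges):
    """Build a symmetric adjacency map from undirected edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

def connected_components(adj):
    """Return the connected components as a list of node sets (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

def tracker_neighbor_fraction(adj, trackers, node):
    """Fraction of a node's neighbours that are known trackers."""
    neighbours = adj[node]
    return sum(n in trackers for n in neighbours) / len(neighbours)

# Toy projection graph: t* are assumed trackers, n* are other hosts.
edges = [("t1", "t2"), ("t2", "t3"), ("t1", "t3"), ("t3", "n1"), ("n2", "n3")]
trackers = {"t1", "t2", "t3"}
adj = adjacency(edges)
components = connected_components(adj)
degrees = {node: len(neighbours) for node, neighbours in adj.items()}
```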
31. Final Data Pipeline
raw logs → 1: logs pre-processing → cleaned logs
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results
google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
[Figure labels on the pipeline stages: DataSet API, Gelly, DataSet API]
Very high accuracy and very low FPR :-)
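Accuracy and false positive rate (FPR) can be computed from the T / NT labels once ground truth is available. The sketch below shows the two definitions on made-up labels; it does not reproduce the reported results, and the host names and `accuracy_and_fpr` are hypothetical.

```python
def accuracy_and_fpr(predicted, truth):
    """Compute accuracy and false positive rate, treating "T" as positive."""
    tp = fp = tn = fn = 0
    for host, actual in truth.items():
        guess = predicted[host]
        if guess == "T" and actual == "T":
            tp += 1
        elif guess == "T" and actual == "NT":
            fp += 1
        elif guess == "NT" and actual == "NT":
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(truth)
    # FPR = non-trackers wrongly flagged / all non-trackers.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Made-up labels for illustration only.
truth = {"a.example": "T", "b.example": "T", "c.example": "NT",
         "d.example": "NT", "e.example": "NT"}
predicted = {"a.example": "T", "b.example": "NT", "c.example": "NT",
             "d.example": "NT", "e.example": "T"}
accuracy, fpr = accuracy_and_fpr(predicted, truth)
```

A low FPR matters here because every false positive is a legitimate host that would be blocked.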
32. Lessons Learned
Start simple
Choose features incrementally
Visualize your data
Re-evaluate your models
Try different data representations
Use a flexible system
33. Automatic Detection
of Web Trackers
Vasia Kalavri
Apache Flink PMC, PhD student @KTH
kalavri@kth.se, @vkalavri