The document describes MindYourPrivacy, a system designed to visualize third-party web tracking. It analyzes web browsing traffic to identify trackers and visualize them in a tag cloud. The system was tested at an event where 129 attendees' traffic was captured. Analysis of the web graph found advertising sites clustered together due to their many incoming links but few outgoing links. While the Do Not Track flag was enabled in only 6% of requests, visualization helped reveal privacy issues around third-party tracking. Future work includes adopting more sophisticated detection approaches.
Visualize & Detect Third-Party Web Tracking
1. IEEE 12th Annual Conference on Privacy, Security and Trust (PST 2014)
MindYourPrivacy: Design and Implementation of a Visualization System for Third-Party Web Tracking
Yuuki Takano, Satoshi Ohta, Takeshi Takahashi, Ruo Ando, Tomoya Inoue
2. Introduction
❖ Third-party web tracking is growing every year.
  ❖ Online privacy is now a significant issue.
  ❖ SNSs and targeted ads can associate individuals' real names with tracking information.
❖ We propose MindYourPrivacy to visualize and show third-party web tracking.
  ❖ Deep-packet-inspection-based architecture
  ❖ Supports heterogeneous browsers and devices
❖ We tested MindYourPrivacy at a workshop (WIDE Camp 2013 Autumn in Japan) with 129 attendees.
  ❖ The results reveal that clustering the web graph built from user traffic helps to detect ad sites.
  ❖ Some graph-theoretic features also help to detect ad sites heuristically.
3. Related Work
Web Tracking Mechanism
❖ A third-party web tracker typically tracks users with cookies, ETags, or Flash storage.
[Figure: web tracking mechanism. Labels: first-party web servers, third-party web tracker, contents, tracking ID (cookie, ETags, Flash storage, etc.), web bug (1x1 picture), ads, social widgets.]
6. Related Work
Web Tracking Detection Techniques
❖ ShareMeNot
  ❖ Swaps links to known data-collection sites such as Facebook
  ❖ Roesner et al., "Detecting and defending against third-party tracking on the web", USENIX NSDI 2012
❖ Lightbeam
  ❖ Visualizes the web graph between first- and third-party sites
  ❖ https://www.mozilla.org/lightbeam/
❖ AdBlock Plus
  ❖ Signature-based ad detection and blocking
  ❖ https://adblockplus.org/en/firefox
7. Related Work
Measurements
❖ Several researchers have reported on third-party web trackers.
  ❖ One study reported the third-party trackers found within Alexa's top 500 domains.
  ❖ Roesner et al., "Detecting and defending against third-party tracking on the web", USENIX NSDI 2012
[Figure: Top 20 trackers on Alexa's top 500 domains, from Figure 6 of Roesner et al., NSDI 2012 ("Prevalence of Trackers on Top 500 Domains"). Trackers are counted on domains, i.e., if a particular tracker appears on two pages of a domain, it is counted once.]
8. MindYourPrivacy
Design Principle
❖ We designed and implemented MindYourPrivacy, a visualization system for third-party web tracking.
  ❖ Its goal is to show third-party web trackers clearly to users.
❖ Design principles of MindYourPrivacy
  ❖ Independence from browsers and devices
    ❖ The variety of OSes and devices (Linux, Windows, Mac OS, and smartphone OSes such as Android and iOS) complicates the problem.
    ❖ We adopt a deep-packet-inspection-based approach to support heterogeneous browsers and devices.
  ❖ Accessibility and comprehensibility of the analysis results
    ❖ Easy to access: MindYourPrivacy provides the analysis results as an HTML file via an HTTP server.
    ❖ Easy to understand: trackers are visualized as a tag cloud, and the web graph file is provided for further analysis.
9. Design and Implementation
Web Tracker Identification Methodology (1)
❖ HTTP Referrer Web Graph Analysis
  ❖ Generate a web graph from the HTTP referrer header.
  ❖ If a site is referred to by many other sites, MindYourPrivacy treats it as a suspicious site that may be tracking users.
❖ Domain Aggregation
  ❖ To show users which organizations track them, MindYourPrivacy aggregates domains at either the second or third level.
  ❖ For example, platform.twitter.com and platform0.twitter.com are aggregated into twitter.com.
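The two steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names, the sample requests, the short `THIRD_LEVEL_SUFFIXES` list, and the choice to drop same-site referrals are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of suffixes that are registered at the third level.
THIRD_LEVEL_SUFFIXES = {"co.jp", "ne.jp", "or.jp", "ac.jp", "co.uk"}


def aggregate_domain(host):
    """Collapse a host name to its second- or third-level domain."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in THIRD_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def build_referrer_graph(requests):
    """Return directed edges (referring site, requested site).

    `requests` is an iterable of (Host header, Referer header) pairs.
    Requests without a referrer and same-site referrals are skipped
    (an assumption of this sketch).
    """
    edges = set()
    for host, referer in requests:
        if not referer:
            continue
        src_host = urlsplit(referer).hostname
        if src_host is None:
            continue
        src, dst = aggregate_domain(src_host), aggregate_domain(host)
        if src != dst:
            edges.add((src, dst))
    return edges


# Made-up sample traffic for illustration.
sample_requests = [
    ("platform.twitter.com", "http://news.example.com/a"),
    ("platform0.twitter.com", "http://blog.example.org/b"),
    ("www.google-analytics.com", "http://news.example.com/a"),
    ("static.example.com", "http://news.example.com/a"),  # same site, skipped
    ("ad.i-mobile.co.jp", ""),                            # no referrer, skipped
]
edges = build_referrer_graph(sample_requests)
nodes = {site for edge in edges for site in edge}
print(len(nodes), "nodes,", len(edges), "edges")
```

A production system would more likely rely on a maintained public-suffix list than on a hand-written suffix set.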
10. Design and Implementation
Web Tracker Identification Methodology (2)
❖ DNS-SOA-Record-Based Grouping
  ❖ Aggregate domains by their DNS SOA record.
  ❖ For example, facebook.com and facebook.net are aggregated into dns.facebook.com, which is their DNS SOA record.
  ❖ Krishnamurthy and Wills, "Privacy diffusion on the web: a longitudinal perspective", WWW 2009
❖ Weighted Site Ranking of User Data Leakage
  ❖ MindYourPrivacy shows not only web trackers but also the sites that leak data to them.
  ❖ Leaking sites are scored; the details are omitted here (see our paper).
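As a rough illustration of the SOA-based grouping, the sketch below groups domains under whatever name a lookup function returns for their SOA record. The lookup is passed in as a parameter, so the example runs without network access. `group_by_soa`, `STUB_SOA`, and the fallback for domains without a record are assumptions of this sketch. Only the two facebook entries come from the slide.

```python
from collections import defaultdict


def group_by_soa(domains, lookup_soa):
    """Group domains by the name obtained from their DNS SOA record.

    `lookup_soa` is any callable mapping a domain to an SOA name, or to None
    when no record is found; in that case the domain forms its own group.
    """
    groups = defaultdict(list)
    for domain in domains:
        soa_name = lookup_soa(domain)
        key = soa_name if soa_name is not None else domain
        groups[key].append(domain)
    return {key: sorted(members) for key, members in groups.items()}


# Stand-in for real DNS queries; the facebook entries follow the slide's example.
STUB_SOA = {
    "facebook.com": "dns.facebook.com",
    "facebook.net": "dns.facebook.com",
}

groups = group_by_soa(
    ["facebook.com", "facebook.net", "example.org"], STUB_SOA.get
)
for soa_name, members in groups.items():
    print(soa_name, "<-", ", ".join(members))
```

In practice the callable would wrap a DNS resolver; keeping it as a parameter also makes the grouping easy to test offline.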
11. Design and Implementation
System Model
❖ MindYourPrivacy captures the traffic of users' web access.
❖ The analyzed results are shown via MindYourPrivacy's web server.
❖ Users need not install or configure any specific application.
[Figure: system model. Labels: Users, Router, The Internet, MindYourPrivacy, Web Access, Outgoing Traffic, Traffic Capture, Analyzed Result via HTTP.]
12. Design and Implementation
Implementation Architecture
❖ Catenaccio DPI
  ❖ Captures traffic from the network interface
  ❖ Reconstructs TCP streams and stores the captured data in a NoSQL DB
  ❖ Written in C++
❖ NoSQL DB
  ❖ MongoDB is used as the database
❖ Tracking Analyzer
  ❖ Analyzes the measurement data
  ❖ Written in JavaScript and Python
❖ HTML/Graph File Generator
  ❖ Generates the visualized results
  ❖ Written in Python
❖ HTML Server
  ❖ Serves the HTML/graph files to users
[Figure: implementation architecture. Components: NW/IF, Catenaccio DPI, NoSQL DB, Tracking Analyzer, HTML/Graph File Generator, HTML Server. Data-flow labels: L2 datagram, measurement data, analyzed result, HTML/graph files.]
13. Design and Implementation
Web User Interface
❖ Suspicious web trackers are visualized as a tag cloud.
❖ Domains are grouped by DNS SOA record.
❖ Referring sites are shown in the right pane.
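The slides do not say how the size of each tag is chosen. As a minimal sketch, assuming the font size scales linearly with the number of incoming links, a tag cloud could be rendered as follows. `tag_cloud_html` and the pixel range are hypothetical. The sample counts are the incoming-link figures from Table II of this deck.

```python
from html import escape


def tag_cloud_html(incoming, min_px=12, max_px=48):
    """Render sites as HTML spans whose font size grows with the link count."""
    if not incoming:
        return '<div class="tag-cloud"></div>'
    lo, hi = min(incoming.values()), max(incoming.values())
    spans = []
    for site, count in sorted(incoming.items()):
        # Linear scaling between min_px and max_px (an assumption of this sketch).
        scale = (count - lo) / (hi - lo) if hi > lo else 1.0
        size = round(min_px + scale * (max_px - min_px))
        spans.append(
            f'<span style="font-size:{size}px" '
            f'title="{count} incoming links">{escape(site)}</span>'
        )
    return '<div class="tag-cloud">' + " ".join(spans) + "</div>"


cloud = tag_cloud_html({
    "google-analytics.com": 847,
    "facebook.com": 437,
    "twitter.com": 393,
    "doubleclick.net": 380,
    "google.com": 356,
})
print(cloud)
```

Logarithmic scaling is a common alternative when a few sites dominate the counts.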
14. Experiment at WIDE Camp 2013 Autumn
❖ We tested MindYourPrivacy at WIDE Camp 2013 Autumn.
❖ WIDE Camp 2013 Autumn (Sep. 10 - Sep. 13)
  ❖ A workshop for Internet researchers, operators, and developers
  ❖ 129 attendees, most of whom were either IT specialists or students majoring in IT
  ❖ Every attendee agreed to the experiment (for research purposes only)
❖ We captured and analyzed the attendees' web browsing traffic.
15. Experiment
User Traffic Analysis (1)
❖ We observed 734,194 HTTP requests and 1,661 distinct source IP addresses (IPv4 and IPv6).
❖ A directed web graph was generated from the HTTP referrer header.
❖ The graph has 3,966 nodes and 12,941 edges.
❖ We analyzed this web graph to find web trackers.
16. Experiment
User Traffic Analysis (2)
❖ To find web trackers, we extracted the most-referred sites from the web graph.
❖ Advertisement and social sites, which tend to track users, have many incoming links.
TABLE II: Top-five Most-referred Sites
Site                    # of incoming links
google-analytics.com    847
facebook.com            437
twitter.com             393
doubleclick.net         380
google.com              356
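Ranking sites by incoming links amounts to sorting the nodes of the referrer graph by in-degree. The sketch below shows one way to do that. The function name and the toy edge list are made up for illustration and are not taken from the experiment's data.

```python
from collections import Counter


def most_referred(edges, k=5):
    """Return the k sites with the most distinct referring sites (in-degree)."""
    indegree = Counter(dst for src, dst in set(edges) if src != dst)
    return indegree.most_common(k)


# Toy referrer graph: (referring site, requested site).
toy_edges = [
    ("news.example", "tracker.example"),
    ("blog.example", "tracker.example"),
    ("shop.example", "tracker.example"),
    ("news.example", "social.example"),
    ("blog.example", "social.example"),
    ("news.example", "cdn.example"),
    ("news.example", "tracker.example"),  # duplicate edge, counted once
]
top_sites = most_referred(toy_edges, k=2)
for site, links in top_sites:
    print(f"{site}\t{links}")
```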
17. Experiment
User Traffic Analysis (3)
❖ We then applied a clustering technique (MCODE) to the web graph.
❖ As a result of the clustering, many ad sites were found in the same cluster.
[Figure: excerpt from the paper. After MCODE clustering, the following ad sites were found in the rank 1 cluster: doubleclick.net, amazon-adsystem.com, googleadservices.com, i-mobile.co.jp, advg.jp, adingo.jp, iogous.com, admeld.com, criteo.com.]
18. Experiment
User Traffic Analysis (4)
❖ We analyzed the cluster in terms of graph-theoretic features.
❖ We found that the ad sites' numbers of incoming and outgoing links and their neighborhood connectivity are quite different from those of other sites.
  ❖ Ad sites have many incoming links but few outgoing links.
  ❖ Ad sites' neighborhood connectivity is relatively low.
Fig. 6: Rank 1 Cluster by MCODE (include loops = false, degree cutoff = 2, haircut = true, fluff = false, node score cutoff = 0.2, k-core = 2, and max. depth = 100)
TABLE IV: Feature Vector of Rank 1 Cluster's Edge (Average and Unbiased Variance)
            # of incoming links    # of outgoing links    Neighborhood connectivity
            avg.     var.          avg.     var.          avg.     var.
ad sites    90.2     12405.4       15.2     3972.9        46.0     3972.9
others      30.2     3972.9        29.7     569.3         130.2    5212.0
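The three features in Table IV can be computed from an edge list as sketched below. The slides do not define neighborhood connectivity. This sketch assumes the usual definition, the average number of neighbors over a node's neighbors, and treats the graph as undirected for that feature only. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from statistics import mean, variance


def node_features(edges):
    """Compute in-degree, out-degree, and neighborhood connectivity per node."""
    preds, succs, neigh = defaultdict(set), defaultdict(set), defaultdict(set)
    for src, dst in edges:
        if src == dst:
            continue  # ignore self-loops
        succs[src].add(dst)
        preds[dst].add(src)
        neigh[src].add(dst)
        neigh[dst].add(src)
    features = {}
    for node in neigh:
        features[node] = {
            "in": len(preds[node]),
            "out": len(succs[node]),
            # Average connectivity of the node's neighbors (assumed definition).
            "neighborhood_connectivity": mean(len(neigh[n]) for n in neigh[node]),
        }
    return features


def summarize(values):
    """Average and unbiased (sample) variance, the statistics used in Table IV."""
    values = list(values)
    return mean(values), variance(values)


# Toy graph: three pages refer to one ad site; one page also refers to another.
toy_edges = [
    ("p1.example", "ad.example"),
    ("p2.example", "ad.example"),
    ("p3.example", "ad.example"),
    ("p1.example", "p2.example"),
]
features = node_features(toy_edges)
others_in = [f["in"] for site, f in features.items() if site != "ad.example"]
avg_in, var_in = summarize(others_in)
print(features["ad.example"], avg_in, var_in)
```

Even in this toy graph the ad site has the highest in-degree, no outgoing links, and the lowest neighborhood connectivity, which is the pattern the slide describes.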
19. Experiment
User Traffic Analysis (5)
❖ The Do Not Track (DNT) flag announces to third-party trackers that users do not wish to be tracked.
❖ However, only 40,650 DNT-enabled requests were observed (40,605/734,194 ≈ 6%).
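Measuring the share of DNT-enabled requests comes down to counting requests that carry the header `DNT: 1`. The following sketch assumes each captured request is available as a dictionary of header names and values. `dnt_share` and the sample requests are made up for illustration.

```python
def dnt_share(requests):
    """Return (DNT-enabled count, total count, share) for an iterable of header dicts."""
    total = enabled = 0
    for headers in requests:
        total += 1
        # HTTP header names are case-insensitive, so normalise before the lookup.
        normalised = {name.lower(): value.strip() for name, value in headers.items()}
        if normalised.get("dnt") == "1":
            enabled += 1
    share = enabled / total if total else 0.0
    return enabled, total, share


# Made-up sample requests.
sample = [
    {"Host": "news.example.com", "DNT": "1"},
    {"Host": "news.example.com", "dnt": "0"},
    {"Host": "ad.example.net"},
    {"Host": "social.example.org", "Referer": "http://news.example.com/"},
]
enabled, total, share = dnt_share(sample)
print(f"{enabled}/{total} = {share:.0%}")
```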
20. Conclusion and Future Work
❖ We proposed MindYourPrivacy, a visualization system for third-party web tracking.
  ❖ Browser- and device-independent architecture
  ❖ Web trackers are visualized as a tag cloud
❖ We tested MindYourPrivacy at WIDE Camp 2013 Autumn and analyzed users' web browsing traffic.
  ❖ Generated a web graph from the HTTP referrer header and analyzed it
  ❖ Revealed that graph clustering and some graph-theoretic features are useful for finding web trackers
❖ Future work: adopting the more sophisticated approaches revealed by the experiment, as well as a signature-based approach.