Paper at: http://www2016.net/proceedings/proceedings/p121.pdf
Abstract: Online tracking poses a serious privacy challenge that has drawn significant attention in both academia and industry. Existing approaches for preventing user tracking, based on curated blocklists, suffer from limited coverage and coarse-grained resolution for classification, rely on exceptions that impact sites’ functionality and appearance, and require significant manual maintenance. In this paper we propose a novel approach, based on the concepts leveraged from k-Anonymity, in which users collectively identify unsafe data elements, which have the potential to identify uniquely an individual user, and remove them from requests. We deployed our system to 200,000 German users running the Cliqz Browser or the Cliqz Firefox extension to evaluate its efficiency and feasibility. Results indicate that our approach achieves better privacy protection than blocklists, as provided by Disconnect, while keeping the site breakage to a minimum, even lower than the community-optimized Ad-Block Plus. We also provide evidence of the prevalence and reach of trackers to over 21 million pages of 350,000 unique sites, the largest scale empirical evaluation to date. 95% of the pages visited contain 3rd party requests to potential trackers and 78% attempt to transfer unsafe data. Tracker organizations are also ranked, showing that a single organization can reach up to 42% of all page visits in Germany.
8. Twitter pages only accessible to the authenticated user also load 3rd parties like GA (Google Analytics).
This browsing session on 5 different sites involved more than 60 different 3rd parties.
9. GET /css?family=Open+Sans+Condensed:300,700
Host: fonts.googleapis.com
User-Agent: Mozilla/5.0 ... Firefox/45.0
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
fonts.googleapis.com is a potential tracker
<meetic.com/home/index.php, UID>
<www20016.ca/, UID>
<wired.com/, UID>
However, in THIS request, there is no data element that can be used as a UID.
Since there is no unsafe data element, the request is safe.
10. GET /impression.php/f3ae074XXX/api_key=597038480XXX&lid=115…
Host: www.facebook.com
User-Agent: Mozilla/5.0 … Firefox/45.0
Referer: http://www.meetic.com/home/index.php
Cookie: datr=0IPhVj5YHEJ20XXX; c_user=10973XXXX; … csm=2;
IP: 79.227.235.241
facebook.com is a potential tracker too,
<meetic.com/home/index.php, 10973XXXX>
<www20016.ca/, 10973XXXX>
<wired.com/, 10973XXXX>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 10973XXXX>
Unlike the request to fonts.googleapis.com, the request above is not safe with regard to
privacy, because it contains two values that we consider unsafe and that could therefore be
used as UIDs:
c_user=10973XXXX and datr=0IPhVj5YHEJ20XXX
Because it contains at least one unsafe value, the request is considered unsafe.
11. GET /collect?
v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&
..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
google-analytics.com is a potential tracker too,
<meetic.com/home/index.php, 1291x522:79.227.235.241>
<www20016.ca/, 1291x522:79.227.235.241>
<wired.com/, 1291x522:79.227.235.241>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200,
1291x522:79.227.235.241>
<analytics.twitter.com/user/solso/home, 1291x522:79.227.235.241>
The UID is not as evident as for Facebook, but the combination vp+IP is an
unsafe data element that can be used as a UID. Therefore this request is also
unsafe.
vp+IP = 1291x522:79.227.235.241
14. Tracking in the Wild
Largest field study with real traffic to date,
200,000 users in Germany for a week(*)
21M page loads,
5M unique pages (URLs)
from 350K domains
(*) Between 09/09/2015 and 16/09/2015
15. Tracking in the Wild: Prevalence
Potential trackers are 3rd parties that are present in many different domains.
Unsafe data elements are data elements for which we cannot rule out the possibility that they are UIDs.
[Chart: 21M page loads] without potential trackers: 5%; with potential trackers: 95%.
[Chart: breakdown labeled "1 to 9" and ">= 10"] 24% and 76%.
16. Tracking in the Wild: Prevalence
[Chart: 21M page loads] without unsafe values: 22%; with unsafe values: 78%.
[Chart: breakdown labeled "1 to 9" and ">= 10"] 21% and 79%.
17. Tracking in the Wild: Prevalence
78% of all page loads can be tracked.
18. Tracking in the Wild: Reach
Organization  % of page loads seen  % of page loads seen with unsafe data elements (tracking)  Rank
Google        62.4%                 42.4%                                                      1st
Facebook      21.1%                 18.5%                                                      2nd
AppNexus      10.15%                9.9%                                                       3rd
ADITION       8.7%                  8.4%                                                       4th
Criteo        8.7%                  8.2%                                                       5th
…
Comscore      6.1%                  5.9%                                                       --
DoublePimp    0.5%                  0.5%                                                       --
NewRelic      2%                    0.03%                                                      --
…
19. Tracking in the Wild: Reach
58 organizations with a reach larger than 1%
21. CLIQZ Tracking Protection
Maximize coverage, minimize false positives
Aggressiveness is counter-productive…
• increases site breakage, which forces users to add exceptions, thus reducing protection coverage.
• affects legitimate services and data collection
22. Block only the Ability to Track
GET /collect?
v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&
..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
Intervening only on unsafe data elements (those elements that can be used as UIDs)
should protect the user while minimizing side effects:
a) site breakage for users
b) loss of legitimate data collection for 3rd parties
23. Blocklists are coarse-grained
CDF of the number of requests with observed unsafe data elements by 3rd
party domains contained both in Disconnect Blocklist and CLIQZ list of
potential trackers (~2000 domains each). Intersection is 477 domains.
24. Blocklists are coarse-grained
Only 2% of tracker domains in Disconnect always send unsafe data elements.
25. Blocklists are coarse-grained
98% of tracker domains have a MIXED behavior.
Lack of resolution…
Only 2% of tracker domains in Disconnect always send unsafe data elements.
27. Blocklists are too coarse-grained
EasyPrivacy (from Adblock Plus) has hundreds of regular
expressions to cover for mixed behavior of trackers.
BLOCKLIST by Domain + RegExp Exceptions
28. Blocklists are too coarse-grained
BLOCKLIST by Domain + More RegExp Exceptions
29. We propose a more fine-grained approach
to algorithmically determine the safeness
level of individual data elements within a
request to a 3rd party
30. Determining Safeness
Each 3rd party request to a potential tracker is parsed to obtain a list of tuples
T = [<s, d, k, v>] whose safeness level is evaluated in real-time,
T = [
<s=wired.com/, d=3rdparty.com, k=z, v=1501498154>,
<s=wired.com/, d=3rdparty.com, k=fl,v=21.0>,
<s=wired.com/, d=3rdparty.com, k=u, v=CCAAAABI>,
<s=wired.com/, d=3rdparty.com, k=vr,v=1440x900>,
<s=wired.com/, d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=wired.com/, d=3rdparty.com, k=vp,v=1322x781>,
<s=wired.com/, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]
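A minimal sketch of how such a tuple list could be obtained from a request URL. It only covers query-string parameters; the names DataTuple and parse_request are hypothetical, and the slides do not say how the actual parser handles cookies, paths or POST bodies.

```python
from typing import List, NamedTuple
from urllib.parse import parse_qsl, urlsplit


class DataTuple(NamedTuple):
    s: str  # source page (the 1st-party page that triggered the request)
    d: str  # destination: the 3rd-party domain receiving the request
    k: str  # key of the data element
    v: str  # value of the data element


def parse_request(source_page: str, request_url: str) -> List[DataTuple]:
    """Split the query string of a 3rd-party request into <s, d, k, v> tuples."""
    parts = urlsplit(request_url)
    domain = parts.hostname or ""
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return [DataTuple(source_page, domain, k, v) for k, v in pairs]


# Hypothetical request carrying some of the values shown in T.
example = parse_request(
    "wired.com/",
    "http://3rdparty.com/collect?z=1501498154&fl=21.0&vr=1440x900&c7=e9d4a7e4d2185cec",
)
```

Each element of example corresponds to one row of T, for instance <s=wired.com/, d=3rdparty.com, k=fl, v=21.0>.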
The aim is to identify which data elements (including combinations) are unsafe,
and are therefore candidates to be used as UIDs.
31. Determining Safeness
We cannot do this effectively. But we can do the opposite, identify data
elements that cannot be used effectively as UIDs, and consider them safe.
32. Determining Safeness
T = [
<s=w..., d=3rdparty.com, k=z, v=1501498154>,
<s=w..., d=3rdparty.com, k=fl,v=21.0>,
<s=w..., d=3rdparty.com, k=u, v=CCAAAABI>,
<s=w..., d=3rdparty.com, k=vr,v=1440x900>,
<s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=w..., d=3rdparty.com, k=vp,v=1322x781>,
<s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]
All tuples are UNSAFE by default unless we can determine that the given data-element is not a good UID, hence safe.
33. Determining Safeness
The value 1501498154 has never been seen before for <d, k>.
Thus, it cannot be used as a UID => SAFE
34. Determining Safeness
The value 21.0 is too short to encode any UID => SAFE
35. Determining Safeness
More than 3 different values in less than 2 days for the same tuple <d,k>.
Not persistent, so a bad UID => SAFE
36. Determining Safeness
Always the same value for <d,k>.
We cannot rule out that the data-elements are UIDs => keep as UNSAFE
Using only local information is not enough; vr=1440x1024 is not a UID…
We need something extra.
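The local checks from the preceding slides (never seen before, too short, more than 3 different values in less than 2 days) can be sketched as below. This is an illustration under stated assumptions, not the CLIQZ implementation: the class name LocalHistory is hypothetical, the slides give no minimum length so MIN_UID_LENGTH = 8 is assumed, and observations older than 2 days are assumed to be forgotten.

```python
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple

MIN_UID_LENGTH = 8              # assumption: the slides give no length threshold
MAX_DISTINCT_VALUES = 3         # "more than 3 different values"
WINDOW_SECONDS = 2 * 24 * 3600  # "in less than 2 days"


class LocalHistory:
    """Per-browser record of the values observed for each <d, k> pair."""

    def __init__(self) -> None:
        # (d, k) -> {value: timestamp of the last observation}
        self._values: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)

    def is_locally_safe(self, d: str, k: str, v: str,
                        now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        values = self._values[(d, k)]
        # Forget observations that fell out of the 2-day window.
        for old in [val for val, ts in values.items() if now - ts >= WINDOW_SECONDS]:
            del values[old]
        seen_before = v in values
        values[v] = now
        if len(v) < MIN_UID_LENGTH:
            return True   # too short to encode a UID
        if not seen_before:
            return True   # never seen before for <d, k>
        if len(values) > MAX_DISTINCT_VALUES:
            return True   # not persistent, hence a bad UID
        return False      # always the same value: stays UNSAFE locally
```

With this sketch, a value such as c7=e9d4a7e4d2185cec is safe on first sight and becomes locally unsafe once it repeats, which is the case where local information alone is not enough.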
37. Determining Safeness
Locally UNSAFE, i.e. always the same value for <d, k>.
Globally SAFE since more than 20 other users have observed the same value 1440x900 for tuple <d,k> = <3rdparty.com,vr> in the last 2 days.
38. Determining Safeness
Locally UNSAFE, i.e. always the same value for tuple <d,k>.
Globally SAFE since it has reached the safeness quorum based on k-Anonymity.
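A rough sketch of the quorum idea: a value that looks like a UID locally is still considered safe when enough other users report the same <d, k, v>. The threshold of 20 comes from the slides; the function names and the assumption that each report comes from a distinct user within the 2-day window are illustrative only.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # <d, k, v>
QUORUM = 20                    # "more than 20 other users"


def build_global_safe_set(reports: Iterable[Triple]) -> Set[Triple]:
    """Keep the triples reported by more than QUORUM users.

    Assumes one report per user per triple inside the 2-day window.
    """
    counts = Counter(reports)
    return {triple for triple, n in counts.items() if n > QUORUM}


def classify(d: str, k: str, v: str, locally_safe: bool,
             global_safe_set: Set[Triple]) -> str:
    """Combine the local verdict with the global quorum."""
    if locally_safe or (d, k, v) in global_safe_set:
        return "SAFE"
    return "UNSAFE"
```

A screen resolution shared by many users passes the quorum, whereas a value seen by a single user does not.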
39. Determining Safeness
Locally UNSAFE, i.e. always the same value for <d, k>.
Globally UNSAFE: not enough people have seen the value
for <d, k>, always the same <d, k, u>. Not safe to send.
Two options:
a) it is a UID, or an element that could be used as such.
b) a false positive due to the Transient State (0.07%)
40. Determining Safeness
Locally UNSAFE and Globally UNSAFE
At this point the request analysis is complete:
1) ALLOW Request removing unsafe data-elements
2) ALLOW Request obfuscating unsafe data-elements
3) BLOCK Request or ALLOW Request without alteration
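Options 1) and 2) above could look roughly like this for query-string parameters. The function name strip_unsafe and the placeholder mechanism are assumptions; the slides do not describe how removal or obfuscation is actually performed.

```python
from typing import Optional, Set
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_unsafe(url: str, unsafe_keys: Set[str],
                 placeholder: Optional[str] = None) -> str:
    """Return the URL with unsafe query parameters removed or obfuscated.

    placeholder=None removes the parameter (option 1); any other value
    replaces the unsafe value with that placeholder (option 2).
    """
    parts = urlsplit(url)
    cleaned = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in unsafe_keys:
            if placeholder is None:
                continue          # drop the unsafe data element entirely
            value = placeholder   # keep the key, hide the value
        cleaned.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


url = "http://3rdparty.com/collect?fl=21.0&c7=e9d4a7e4d2185cec&vr=1440x900"
removed = strip_unsafe(url, {"c7"})
obfuscated = strip_unsafe(url, {"c7"}, placeholder="0")
```

In both variants the request still reaches the 3rd party, so safe parameters such as fl and vr keep working.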
41. Safeness Quorum without Tracking
To determine that a data-element is globally safe we need to count the number of
unique users that have observed a tuple
<d,k,v> e.g. <d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
Users could share with CLIQZ tuples that carry a field identifying them (u),
<u=usrXXX, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
but that would make CLIQZ a tracker! Instead, each user sends the
tuple, if observed, once and only once per hour:
<d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
Actual values are not needed for counting and membership tests on the GWL:
<d=ed5c0cf7b05572eb, k=4d3a21d8c684c09c19b93be911827fd5,
v=e60f936dc719ca649a80a97490a09940>
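A sketch of the reporting side described above: no user identifier is attached, the tuple fields are replaced by digests, and each tuple is sent at most once per hour. The digest lengths mirror the example above (16 hex characters for d, 32 for k and v), but the choice of MD5 and the names hash_tuple and HourlyReporter are assumptions; the slides do not name the hash function.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple

HashedTriple = Tuple[str, str, str]
REPORT_INTERVAL_SECONDS = 3600  # "once and only once per hour"


def _digest(text: str) -> str:
    # Assumption: MD5, chosen only because it yields 32 hex characters.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_tuple(d: str, k: str, v: str) -> HashedTriple:
    """Replace the actual values by digests; d is truncated to 16 characters."""
    return (_digest(d)[:16], _digest(k), _digest(v))


class HourlyReporter:
    """Decides whether a tuple should be sent, without any user identifier."""

    def __init__(self) -> None:
        self._last_sent: Dict[HashedTriple, float] = {}

    def maybe_report(self, d: str, k: str, v: str,
                     now: Optional[float] = None) -> Optional[HashedTriple]:
        now = time.time() if now is None else now
        hashed = hash_tuple(d, k, v)
        last = self._last_sent.get(hashed)
        if last is not None and now - last < REPORT_INTERVAL_SECONDS:
            return None  # already reported within the last hour
        self._last_sent[hashed] = now
        return hashed    # payload to send; contains no field u
```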
43. Evaluation: Site Breakage
                                                   Reload Rate  % Increase over baseline  % Increase over CLIQZ
BASELINE (without tracking protection)             0.00101      --                        --
CLIQZ                                              0.00104      4%                        --
Adblock Plus (counting exceptions added by users)  0.00110      10%                       150%
CLIQZ as Blocklist                                 0.00125      25%                       525%
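One way to read the last column: its figures match the relative increase of each "% increase over baseline" value with respect to the 4% of CLIQZ. A small check of that arithmetic, using only the rounded percentages from the table (the helper name is hypothetical):

```python
def relative_increase(new: float, base: float) -> float:
    """Percentage by which new exceeds base."""
    return (new - base) / base * 100


# "% increase over baseline" as listed in the table.
over_baseline = {"CLIQZ": 4, "Adblock Plus": 10, "CLIQZ as Blocklist": 25}

over_cliqz = {
    name: relative_increase(pct, over_baseline["CLIQZ"])
    for name, pct in over_baseline.items()
    if name != "CLIQZ"
}
# Adblock Plus:       (10 - 4) / 4 * 100 = 150
# CLIQZ as Blocklist: (25 - 4) / 4 * 100 = 525
```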
44. Conclusions
Tracking is a BIG problem
– Privacy is seriously at risk
Tracking Protection is not an easy task
– Trade-off between site breakage and protection coverage
Blocklist-based approaches have limitations
– Maintainability
– Coarse-grained resolution
– Too many false positives
CLIQZ tracking protection addresses them to a large extent
45. Future Work
CLIQZ tracking protection might be better than the state of the art, but it is far from perfect:
• it still produces site breakage
• protection coverage is not 100%
• it can be attacked in multiple ways
[Picture from http://mtthwhgn.com/tag/flooding/]
We provide a bigger hammer for the whack-a-tracker.
48. Implementation Details
Realtime Component
1) Parsing request
2) Local safeness: membership test on LWL
3) Global safeness: membership test on GWL
LWL and GWL are Bloom filters, together smaller than 512KB, with an FP ratio of 0.1%.
Takes about 1-12 ms.
Offline Component
Data from users needs to be sent to CLIQZ to build the GWL for the safeness quorum.
The GWL needs to be sent back to the users’ browsers.
We use an eventual consistency model with incremental updates over daily snapshots.
Bandwidth costs per user per day: 90KB upload, 566KB download, for a worst-case propagation lag of 10 minutes.
The rate of false-positive unsafe data elements due to transient state is 0.07%.
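For readers unfamiliar with Bloom filters, the membership tests on LWL and GWL can be pictured with the generic sketch below. It uses the textbook sizing formulas for a target false-positive rate and SHA-256 based double hashing; the class and its parameters are illustrative assumptions, not the data structure shipped by CLIQZ.

```python
import hashlib
import math
from typing import List


class BloomFilter:
    """Generic Bloom filter sized for a capacity and a false-positive rate."""

    def __init__(self, capacity: int, fp_rate: float = 0.001) -> None:
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(1, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        # Double hashing: derive all bit positions from two 64-bit halves of SHA-256.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical global list holding one hashed <d, k, v> entry.
gwl = BloomFilter(capacity=1000, fp_rate=0.001)
gwl.add("ed5c0cf7b05572eb|4d3a21d8c684c09c19b93be911827fd5|e60f936dc719ca649a80a97490a09940")
```

A lookup only reads a handful of bits, which fits a per-request check in the browser; the price is the small false-positive rate quoted on the slide.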
49. Determining Safeness
T = [
<s=w..., d=3rdparty.com, k=z, v=1501498154>,
<s=w..., d=3rdparty.com, k=fl,v=21.0>,
<s=w..., d=3rdparty.com, k=u, v=CCAAAABI>,
<s=w..., d=3rdparty.com, k=vr,v=1440x1024>,
<s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=w..., d=3rdparty.com, k=vp,v=1322x981>,
<s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]
Cookies from potential trackers are always blocked.
POST requests are also analyzed, and blocked only if they:
• match Cookie values
• match QS values declared unsafe
• match values from browser fingerprinting
User-initiated actions are always ALLOWED (even if tracking)
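The POST rule above can be read as a simple predicate. The sketch below is one possible interpretation, assuming that "match" means exact equality between a value in the POST body and a previously flagged value; the function and parameter names are hypothetical.

```python
from typing import Iterable


def should_block_post(body_values: Iterable[str],
                      cookie_values: Iterable[str],
                      unsafe_qs_values: Iterable[str],
                      fingerprint_values: Iterable[str],
                      user_initiated: bool) -> bool:
    """Block a POST to a potential tracker only if it carries a flagged value."""
    if user_initiated:
        return False  # user-initiated actions are always allowed
    flagged = set(cookie_values) | set(unsafe_qs_values) | set(fingerprint_values)
    return any(value in flagged for value in body_values)
```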