Tracking The Trackers WWW 2016

Tracking the Trackers
Zhonghao Yu zhonghao@cliqz.com
Sam Macbeth sam@cliqz.com
Konark Modi konarkm@cliqz.com
Josep M. Pujol josep@cliqz.com

Page load triggers
requests to multiple 3rd
parties

Even on pages on sites
that you probably want
to keep private, like
this dating site.

Of course, general
news domains also load
many 3rd parties

as well as electronic
commerce sites like
Ebay

Twitter pages only
accessible to the
authenticated user also
load 3rd parties like GA

This browsing session
on 5 different sites
involved more than 60
different 3rd parties.

GET /css?family=Open+Sans+Condensed:300,700
Host: fonts.googleapis.com
User-Agent: Mozilla/5.0 ... Firefox/45.0
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
fonts.googleapis.com is a potential tracker,
<meetic.com/home/index.php, UID>
<www2016.ca/, UID>
<wired.com/, UID>
However, in THIS request there is no data element that can be used as a UID.
Since there is no unsafe data element, the request is safe.

GET /impression.php/f3ae074XXX/api_key=597038480XXX&lid=115…
Host: www.facebook.com
User-Agent: Mozilla/5.0 … Firefox/45.0
Cookie: datr=0IPhVj5YHEJ20XXX; c_user=10973XXXX; … csm=2;
IP: 79.227.235.241
facebook.com is a potential tracker too,
<meetic.com/home/index.php, 10973XXXX>
<www2016.ca/, 10973XXXX>
<wired.com/, 10973XXXX>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 10973XXXX>
Unlike the fonts.googleapis.com request, the request above is not safe with regard to
privacy because it contains two values that we consider unsafe, since they could be
used as UIDs:
c_user=10973XXXX and datr=0IPhVj5YHEJ20XXX
Because it contains at least one unsafe value, the request is considered unsafe.
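The rule above (one unsafe value is enough to make the whole request unsafe) can be sketched in a few lines. This is only an illustration: the function names and the `unsafe_values` set are assumptions, not part of the CLIQZ implementation.

```python
def cookie_values(cookie_header):
    """Split a Cookie header into a {name: value} dict."""
    pairs = (
        item.strip().split("=", 1)
        for item in cookie_header.split(";")
        if "=" in item
    )
    return {name: value for name, value in pairs}


def request_is_unsafe(values, unsafe_values):
    """A request is unsafe if at least one of its values is unsafe."""
    return any(v in unsafe_values for v in values.values())


# Cookie values as shown (already truncated) in the Facebook request above.
header = "datr=0IPhVj5YHEJ20XXX; c_user=10973XXXX; csm=2"
values = cookie_values(header)
print(request_is_unsafe(values, {"10973XXXX", "0IPhVj5YHEJ20XXX"}))
```

How a value ends up in `unsafe_values` is the subject of the safeness rules later in the deck; here the set is simply given.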

GET /collect?
v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&
..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
IP: 79.227.235.241
google-analytics.com is a potential tracker too,
<meetic.com/home/index.php, 1291x522:79.227.235.241>
<www2016.ca/, 1291x522:79.227.235.241>
<wired.com/, 1291x522:79.227.235.241>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 1291x522:79.227.235.241>
<analytics.twitter.com/user/solso/home, 1291x522:79.227.235.241>
The UID is not as evident as for Facebook, but the combination vp+IP is an
unsafe data element: it can be used as a UID. Therefore this request is also
unsafe.
vp+IP = 1291x522:79.227.235.241
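To make the point concrete, here is a minimal sketch of how a 3rd party could link page visits using nothing but vp+IP. The helper names and the second visitor are hypothetical; only the first identifier comes from the slide.

```python
from collections import defaultdict


def composite_uid(vp, ip):
    """Join viewport size and IP address into one identifier, as on the slide."""
    return f"{vp}:{ip}"


def link_visits(requests):
    """Group first-party pages by the composite identifier the 3rd party saw."""
    profile = defaultdict(list)
    for page, vp, ip in requests:
        profile[composite_uid(vp, ip)].append(page)
    return dict(profile)


seen = [
    ("meetic.com/home/index.php", "1291x522", "79.227.235.241"),
    ("wired.com/", "1291x522", "79.227.235.241"),
    ("wired.com/", "1440x900", "10.0.0.7"),  # hypothetical second visitor
]
profiles = link_visits(seen)
print(profiles["1291x522:79.227.235.241"])
```

No cookie is involved: two values that look harmless on their own are enough to build a browsing profile once combined.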

Not a conveniently chosen example…
...tracking is a pervasive problem.

Tracking in the Wild
Largest field study with real traffic to date,
200,000 users in Germany for a week(*)
21M page loads,
5M unique pages (URLs)
from 350K domains
(*) Between 09/09/2015 and 16/09/2015

Tracking in the Wild: Prevalence
Potential trackers are 3rd parties that are present in many different domains.
Unsafe data elements are data elements for which we cannot rule out the
possibility that they are UIDs.
21M page loads:
without potential trackers: 5%
with potential trackers: 95% (1 to 9: 24%, >= 10: 76%)

21M page loads:
without unsafe values: 22%
with unsafe values: 78% (1 to 9: 21%, >= 10: 79%)
78% of all page loads can be tracked

Tracking in the Wild: Reach
Columns: organization | % of page loads seen | % of page loads seen with unsafe data elements (tracking) | rank
Google | 62.4% | 42.4% | 1st
Facebook | 21.1% | 18.5% | 2nd
AppNexus | 10.15% | 9.9% | 3rd
ADITION | 8.7% | 8.4% | 4th
Criteo | 8.7% | 8.2% | 5th
…
Comscore | 6.1% | 5.9% | --
DoublePimp | 0.5% | 0.5% | --
NewRelic | 2% | 0.03% | --
…

58 organizations with a reach larger than 1%

CLIQZ Tracking Protection
Maximize coverage, minimize false positives
Aggressiveness is counter-productive…
•  it increases site breakage, which forces users to add exceptions, thus reducing protection coverage.
•  it affects legitimate services and data collection.

Block only the Ability to Track
GET /collect?
v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&
..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
IP: 79.227.235.241
Intervening only on unsafe data elements, those elements that can be used as UIDs,
should protect the user while minimizing side effects:
a)  site breakage for users
b)  disruption of legitimate data collection for 3rd parties

Blocklists are coarse-grained
CDF of the number of requests with observed unsafe data elements by 3rd
party domains contained both in Disconnect Blocklist and CLIQZ list of
potential trackers (~2000 domains each). Intersection is 477 domains.

Only 2% of tracker domains in Disconnect always send unsafe data elements.

98% of tracker domains have MIXED behavior.
Lack of resolution…

Blocklists by domain (reverse suffix) are too coarse-grained.
BLOCKLIST by Domain

Blocklists are too coarse-grained
EasyPrivacy (from Adblock Plus) has hundreds of regular
expressions to cover for mixed behavior of trackers.
BLOCKLIST by Domain + RegExp Exceptions

We propose a more fine-grained approach
to algorithmically determine the safeness
level of individual data elements within a
request to a 3rd party

Determining Safeness
Each 3rd party request to a potential tracker is parsed to obtain a list of tuples
T = [<s, d, k, v>] whose safeness level is evaluated in real-time,
T = [
<s=wired.com/, d=3rdparty.com, k=z, v=1501498154>,
<s=wired.com/, d=3rdparty.com, k=fl,v=21.0>,
<s=wired.com/, d=3rdparty.com, k=u, v=CCAAAABI>,
<s=wired.com/, d=3rdparty.com, k=vr,v=1440x900>,
<s=wired.com/, d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=wired.com/, d=3rdparty.com, k=vp,v=1322x781>,
<s=wired.com/, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]
The aim is to identify which data elements (including combinations) are unsafe
and are therefore candidates to be used as UIDs.
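The parsing step can be sketched as follows. This is a minimal illustration that only covers query-string parameters; the `DataElement` type, the `parse_request` helper and the `/collect` path are assumptions made for the example.

```python
from typing import NamedTuple
from urllib.parse import parse_qsl, urlsplit


class DataElement(NamedTuple):
    s: str  # source: first-party page that triggered the request
    d: str  # destination: 3rd-party domain
    k: str  # key of the data element
    v: str  # value of the data element


def parse_request(source, url):
    """Turn the query string of a 3rd-party request into <s, d, k, v> tuples."""
    parts = urlsplit(url)
    return [
        DataElement(source, parts.hostname or "", k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
    ]


T = parse_request(
    "wired.com/",
    "http://3rdparty.com/collect?z=1501498154&fl=21.0&vr=1440x900",
)
for element in T:
    print(element)
```

Each tuple can then be evaluated on its own, which is what allows intervention on single data elements instead of on whole requests.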

We cannot do this effectively. But we can do the opposite: identify data
elements that cannot be used effectively as UIDs, and consider them safe.

T = [
<s=w..., d=3rdparty.com, k=z, v=1501498154>,
<s=w..., d=3rdparty.com, k=fl,v=21.0>,
<s=w..., d=3rdparty.com, k=u, v=CCAAAABI>,
<s=w..., d=3rdparty.com, k=vr,v=1440x900>,
<s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=w..., d=3rdparty.com, k=vp,v=1322x781>,
<s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]
All tuples are UNSAFE by default unless we can determine that the given
data element is not a good UID, hence safe.

The value 1501498154 has never been seen before for <d, k>.
Thus, it cannot be used as a UID => SAFE

The value 21.0 is too short to encode any UID => SAFE

More than 3 different values in less than 2 days for the same tuple <d,k>.
Not persistent, bad UID => SAFE

Always the same value for <d,k>.
We cannot rule out that the data elements are UIDs => keep as UNSAFE
Only using local information is not enough; vr=1440x1024 is not a UID…
We need something extra.
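The local rules from the last few slides can be put together in a small sketch. The "more than 3 values in less than 2 days" threshold comes from the slides; the minimum length `MIN_UID_LENGTH` is an assumption, since no threshold is given, and the class itself is hypothetical.

```python
from collections import defaultdict

TWO_DAYS = 2 * 24 * 3600   # seconds
MAX_DISTINCT_VALUES = 3    # from the slides
MIN_UID_LENGTH = 8         # assumption: the slides give no threshold


class LocalSafeness:
    def __init__(self):
        # (d, k) -> {value: timestamp of the last observation}
        self.seen = defaultdict(dict)

    def check(self, d, k, v, now):
        """Return "SAFE" or "UNSAFE" for value v of key k sent to domain d."""
        history = self.seen[(d, k)]
        first_time = v not in history
        history[v] = now
        recent = [t for t in history.values() if now - t < TWO_DAYS]
        if first_time:
            return "SAFE"    # never seen before for <d, k>
        if len(v) < MIN_UID_LENGTH:
            return "SAFE"    # too short to encode a UID
        if len(recent) > MAX_DISTINCT_VALUES:
            return "SAFE"    # not persistent, bad UID
        return "UNSAFE"      # always the same value: cannot rule out a UID


local = LocalSafeness()
print(local.check("3rdparty.com", "vr", "1440x900", now=0))
print(local.check("3rdparty.com", "vr", "1440x900", now=3600))
```

The second call returns UNSAFE even though a screen resolution is clearly not a UID, which is exactly the limitation of local information noted above.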

Locally UNSAFE, i.e. always the same value for <d, k>.
Globally SAFE since more than 20 other users have observed the same value
1440x900 for tuple <d,k> = <3rdparty.com,vr> in the last 2 days.

Locally UNSAFE, i.e. always the same value for tuple <d,k>.
Globally SAFE since it has reached the safeness quorum based on k-Anonymity.
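A minimal sketch of the quorum check, under a simplifying assumption: every anonymous report of a <d, k, v> tuple is counted as one user. The quorum of 20 and the 2-day window are taken from the slides; the class and method names are hypothetical.

```python
from collections import defaultdict

QUORUM = 20                # "more than 20 other users"
WINDOW = 2 * 24 * 3600     # "in the last 2 days", in seconds


class GlobalQuorum:
    def __init__(self):
        # (d, k, v) -> timestamps of anonymous reports
        self.reports = defaultdict(list)

    def report(self, d, k, v, now):
        """Record that some user observed value v for <d, k>."""
        self.reports[(d, k, v)].append(now)

    def is_globally_safe(self, d, k, v, now):
        """True once more than QUORUM reports fall inside the time window."""
        recent = [t for t in self.reports.get((d, k, v), []) if now - t < WINDOW]
        return len(recent) > QUORUM


quorum = GlobalQuorum()
for i in range(21):
    quorum.report("3rdparty.com", "vr", "1440x900", now=i)
print(quorum.is_globally_safe("3rdparty.com", "vr", "1440x900", now=100))
```

A value shared by many users cannot single out any one of them, which is the k-Anonymity argument behind the quorum.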

Locally UNSAFE, i.e. always the same value for <d, k>.
Globally UNSAFE: not enough people have seen the value for <d, k>, and it is
always the same for <d, k, u>. Not safe to send.
Two options:
a)  it is a UID, or an element that could be used as such.
b)  it is a false positive due to the Transient State (0.07%)

Locally UNSAFE and Globally UNSAFE
At this point the request analysis is complete:
1)  ALLOW Request removing unsafe data-elements
2)  ALLOW Request obfuscating unsafe data-elements
3)  BLOCK Request or ALLOW Request without alteration
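Option 1 can be sketched for query-string parameters as below. The `strip_unsafe` helper, the `/collect` path and the choice of `c7` as the unsafe key are assumptions for illustration; option 2 would replace the value with a placeholder instead of dropping the parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_unsafe(url, unsafe_keys):
    """Return the URL with every query parameter in unsafe_keys removed."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in unsafe_keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "http://3rdparty.com/collect?fl=21.0&c7=e9d4a7e4d2185cec&vr=1440x900"
print(strip_unsafe(url, {"c7"}))
```

The request still reaches the 3rd party with its safe data elements, so the service keeps working while the candidate UID is withheld.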

Safeness Quorum without Tracking
To determine that a data element is globally safe we need to count the number of
unique users that have observed a tuple
<d,k,v> e.g. <d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
Users could share with CLIQZ tuples that carry a field identifying them (u),
<u=usrXXX, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
but that would make CLIQZ a tracker! Instead, each user sends the
tuple, if observed, once and only once per hour:
<d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
Actual values are not needed for counting and membership tests on the GWL, so hashes suffice:
<d=ed5c0cf7b05572eb, k=4d3a21d8c684c09c19b93be911827fd5,
v=e60f936dc719ca649a80a97490a09940>
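The client side of this scheme might look like the sketch below: the message carries hashed fields and no user identifier, and each tuple is sent at most once per hour. The hash function is not named in the slides, so truncated SHA-256 is an assumption (chosen to match the field lengths shown above), as are the `Reporter` class and its method.

```python
import hashlib


def h(text, hex_chars=32):
    """Truncated hex digest; SHA-256 is an assumption for illustration."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:hex_chars]


class Reporter:
    def __init__(self):
        self.last_sent = {}  # hashed tuple -> hour index of the last report

    def message_for(self, d, k, v, now):
        """Return the hashed <d, k, v> to send, or None if sent this hour."""
        msg = (h(d, 16), h(k), h(v))  # note: no user field (u) in the message
        hour = int(now // 3600)
        if self.last_sent.get(msg) == hour:
            return None
        self.last_sent[msg] = hour
        return msg


reporter = Reporter()
print(reporter.message_for("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=0))
print(reporter.message_for("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=1800))
```

Because the collector only needs to count and test membership, equal hashes are as useful as equal values, and the raw values never leave the browser.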

Evaluation: Protection Coverage
Columns: system | requests blocked | false positives ratio (requests blocked without unsafe data-elements) | protection misses (requests allowed with unsafe data-elements)
CLIQZ | 51.7% | -- | --
Disconnect | 66.1% | 38.8% | 12.3%
Kontaxis & Chew (Firefox Tracking Protection) [est.] | 36.6% | 29.4% | 25.4%

Evaluation: Site Breakage
Columns: system | reload rate | % increase over baseline | % increase over CLIQZ
BASELINE (without tracking protection) | 0.00101 | -- | --
CLIQZ | 0.00104 | 4% | --
Adblock Plus (counting exceptions added by users) | 0.00110 | 10% | 150%
CLIQZ as Blocklist | 0.00125 | 25% | 525%
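The "% increase over CLIQZ" figures appear to be the relative growth of the "% increase over baseline" figures with respect to CLIQZ's 4%. This reading is inferred from the numbers, not stated on the slide; the helper below is a hypothetical check of that arithmetic.

```python
def relative_increase(value, reference):
    """Relative increase of value over reference, in percent."""
    return (value - reference) / reference * 100


# Increase over baseline: CLIQZ 4%, Adblock Plus 10%, CLIQZ as Blocklist 25%.
print(relative_increase(10, 4))
print(relative_increase(25, 4))
```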

Conclusions
Tracking is a BIG problem
–  Privacy is seriously at risk
Tracking Protection is not an easy task
–  Trade-off between site breakage and protection
coverage
Blocklist-based approaches have limitations
–  Maintainability
–  Coarse-grained resolution
–  Too many false positives
CLIQZ tracking protection addresses them to a large extent

Future Work
CLIQZ tracking protection might be better than the state-of-the-art. But it is far
from perfect,
•  still produces site breakage
•  protection coverage is not 100%
•  it can be attacked in multiple ways
[Picture from http://mtthwhgn.com/tag/flooding/]
we provide a bigger hammer for the whack-a-tracker

Thanks a lot!
Q&A
Zhonghao Yu Sam Macbeth Konark Modi

Implementation Details
Realtime Component
1) Parsing request
2) Local safeness: membership test on LWL
3) Global safeness: membership test on GWL
LWL and GWL are Bloom Filters, combined less than 512KB, with an FP ratio of 0.1%.
Takes about 1-12 ms.
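For readers unfamiliar with Bloom filters, here is a minimal, generic sketch of the data structure behind the membership tests. It is not the CLIQZ implementation: the sizing uses the standard textbook formulas, and the class, the SHA-256 based hashing and the key format are assumptions.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity, fp_rate=0.001):
        # Standard sizing for a target false-positive rate (0.1% by default).
        self.size = max(1, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting the item with an index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


gwl = BloomFilter(capacity=1000)
gwl.add("3rdparty.com|vr|1440x900")
print("3rdparty.com|vr|1440x900" in gwl)
```

A Bloom filter never returns a false negative for an item that was added, but it may occasionally report an item that was not, which is where the FP ratio comes from.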
Offline Component
Data from users needs to be sent to CLIQZ to build the GWL for the safeness quorum.
The GWL needs to be sent back to the users' browsers.
We use an eventual consistency model with incremental updates over daily snapshots.
Bandwidth costs per user per day: 90KB upload, 566KB download, for a worst-case
propagation lag of 10 minutes.
The rate of false-positive unsafe data elements due to transient state is 0.07%.

Cookies from potential trackers are always blocked.
POST requests are also analyzed, blocked only if they:
•  match Cookie values
•  match QS values declared unsafe
•  match values from browser-fingerprinting
User-initiated actions are always ALLOWED (even if tracking)
