User data is the primary input of digital advertising, fueling the free Internet as we know it. As a result, web companies invest heavily in elaborate tracking mechanisms to acquire user data that they can sell to data markets and advertisers. However, with the same-origin policy and cookies as the primary identification mechanism on the web, each tracker knows the same user by a different ID. To mitigate this, Cookie Synchronization (CSync) came to the rescue, facilitating an information-sharing channel between 3rd parties that may or may not have direct access to the website the user visits. In the background, with CSync, they merge the user data they own, but also reconstruct a user’s browsing history, bypassing the same-origin policy.
In this paper, we perform what is, to our knowledge, the first in-depth study of CSync in the wild, using a year-long weblog from 850 real mobile users. Through our study, we aim to understand the characteristics of the CSync protocol and the impact it has on web users’ privacy. For this, we design and implement CONRAD, a holistic mechanism to detect CSync events in real time, and the privacy loss on the user side, even when the synced IDs are obfuscated. Using CONRAD, we find that 97% of the regular web users are exposed to CSync: most of them within the first week of their browsing, and the median userID gets leaked, on average, to 3.5 different domains. Finally, we see that CSync increases the number of domains that track the user by a factor of 6.75.
Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask
1. Cookie Synchronization:
Everything You Always Wanted to Know But
Were Afraid to Ask
Panagiotis (Panos) Papadopoulos
Nicolas Kourtellis Evangelos P. Markatos
2. Online user data is the new oil
Panos Papadopoulos ~ panpap@brave.com
• Trackers, data brokers (e.g., Acxiom) and data
management platforms (e.g., Cambridge Analytica, Turn)
collect and process user data to form user profiles
• User profiles may contain information
not only from the online but from the offline world, too
✓ e.g., phone number, city/state, email address, SSN,
bankruptcy/education information, employment details,
information on marriage/divorce, property records, etc.
• Profiles are sold in data markets to advertisers
for targeted advertising.
3. Attribution of collected data
All this volume of collected data must be attributed to a single ID to be useful:
gender
birthdate
browsing history
interests
sexual preferences
“Bond, James Bond”
or just:
“ade87e60-5336-4dd9-9a2a-763e85516f6d-tuct150ff6a”
4. Universal User Identification
Cookies are domain specific…
- Data Broker: “Psst… I have data for user: userABC”
- Advertiser: “Huh? Who is that user?”
Broker knows the user by the ID “userABC”
but advertiser knows the same user as “user123”
➢ How can such a data merge be finalized?
➢ i.e., userABC==user123
Some universal user identification must appear!
6. Cookie … what?
• Cookie Synchronization: a mechanism to
➢ bypass the same-origin policy
➢ allow web companies to share cookies and match the different IDs they assign to the same user.
• 157 of the top 200 websites (i.e., 78%) have 3rd parties which synchronize cookies with at least one other 3rd party
➢ they can reconstruct 62-73% of a user’s browsing history*
*Steven Englehardt and Arvind Narayanan. Online Tracking: A 1-million-site Measurement and Analysis. (ACM CCS ’16).
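The ID matching described above can be illustrated with a minimal sketch. It assumes the common redirect-based pattern, in which one 3rd party sends the browser to a partner with its own userID embedded in the URL. The domain `tracker.com`, the `/sync` endpoint, and the `partner_uid` parameter are hypothetical names chosen for illustration; they are not taken from the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical cookie jar: under the same-origin policy each domain
# can only read the cookie it set itself.
browser_cookies = {
    "tracker.com": "user123",
    "advertiser.com": "userABC",
}

def build_sync_redirect(partner_domain, own_id):
    """The first 3rd party redirects the browser to its partner,
    placing its own userID in the URL (hypothetical endpoint and parameter)."""
    return f"https://{partner_domain}/sync?" + urlencode({"partner_uid": own_id})

def handle_sync_request(url, cookies, match_table):
    """The partner receives the request together with its OWN cookie,
    so it can link the ID in the URL to the ID it already knows."""
    parts = urlsplit(url)
    foreign_id = parse_qs(parts.query)["partner_uid"][0]
    own_id = cookies[parts.hostname]
    match_table[own_id] = foreign_id

match_table = {}
redirect = build_sync_redirect("advertiser.com", browser_cookies["tracker.com"])
handle_sync_request(redirect, browser_cookies, match_table)
print(redirect)      # https://advertiser.com/sync?partner_uid=user123
print(match_table)   # {'userABC': 'user123'}
```

Neither party reads the other's cookie directly; the link is made because the browser attaches the partner's cookie to a request whose URL carries the first party's ID.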
7. How does it work? (1/3)
[Diagram: website1, website2, website3]
8. How does it work? (2/3)
[Diagram: website1, website2, website3]
9. How does it work? (3/3)
[Diagram: website1, website2, website3]
Panagiotis Papadopoulos ~ panpap@ics.forth.gr
How to measure Privacy loss?
• Loss of anonymity on the web
• Users have a specific budget of aliases (cookie IDs)
• Cookie Synchronization (CSync): a mechanism to increase user identifiability
After CSync: user123 == userABC
Fig. 2: High-level overview of the internal components of CONRAD.
[Figure: HTTP traffic → Cookie Synchronization Detection (heuristics-based detection; ML-based cookie-less detection) → Privacy Analysis (statistics; diffusion of leaked IDs; personal information leaks) → Results]
TABLE I: Examples of userIDs getting synchronized between different entities.
URLs of Cookie Synchronization HTTP Requests
1. a.atemda.com/id/csync?s=L2zaWQvMS9lkLzMxOUwOTUw
2. bidtheater.com/UserMatch.ashx?bidderid=23&bidderuid=L2zaWQvMS9lkLzMxOUwOTUw&expiration=1426598931
3. d.turn.com/r/id/L2zaWQvMS9lkLzMxOUwOTUw/mpid/
mechanisms [1], [11], [3] are as follows: (i) It offers the
ability to detect synchronizations when the userID is embedded
not only in the URL’s parameter, but also in its path (either
in case of request/response URL or Location URL of the
referrer). (ii) By filtering-out domains of the same provider,
our approach can discriminate between intentional CSync
Example of a userID getting synced between different 3rd
parties.
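To make the point about userIDs embedded in either a URL parameter or the URL path concrete, here is a minimal sketch that extracts ID-looking strings from both places and applies it to the three Table I URLs. The thresholds in `looks_like_id` (minimum length of 10, at least one digit and one letter) are assumptions made for illustration, not the values used by CONRAD.

```python
import re
from urllib.parse import urlsplit, parse_qsl

MIN_LEN = 10  # assumed threshold, not taken from the paper
ID_CHARS = re.compile(r"[A-Za-z0-9_-]+")

def looks_like_id(token):
    """Assumed heuristic: long enough, URL-safe characters only,
    and a mix of digits and letters."""
    if len(token) < MIN_LEN or not ID_CHARS.fullmatch(token):
        return False
    return any(c.isdigit() for c in token) and any(c.isalpha() for c in token)

def candidate_ids(url):
    """Collect ID-looking strings from the query parameters AND the path."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    values = [value for _, value in parse_qsl(parts.query)]
    segments = [seg for seg in parts.path.split("/") if seg]
    return {tok for tok in values + segments if looks_like_id(tok)}

urls = [
    "a.atemda.com/id/csync?s=L2zaWQvMS9lkLzMxOUwOTUw",
    "bidtheater.com/UserMatch.ashx?bidderid=23"
    "&bidderuid=L2zaWQvMS9lkLzMxOUwOTUw&expiration=1426598931",
    "d.turn.com/r/id/L2zaWQvMS9lkLzMxOUwOTUw/mpid/",
]
shared = set.intersection(*(candidate_ids(u) for u in urls))
print(shared)  # {'L2zaWQvMS9lkLzMxOUwOTUw'}
```

Requiring both digits and letters keeps the `expiration` timestamp out of the candidates, but it would also miss purely numeric userIDs, so a real detector would need a more careful rule.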
10. Privacy implications for users
advertiser.com learns that:
1. the user has just visited website3.com
2. the user it knew as “userABC” is also known as “user123”
• Reduction of user aliases -> loss of anonymity
3. server-to-server user data merges
• merge data known for “userABC” and “user123” into a single profile
4. coupled with evercookies or user fingerprinting, CSync allows re-identification of users even after they delete their cookies
• re-link the two user profiles (before and after cookie erasure)
• users cannot abolish their assigned userIDs
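The reduction of user aliases can be modeled as a small sketch in which every (domain, userID) pair is an alias and every sync event links two aliases. Union-find is used here only as a convenient illustration; the slides do not say how the authors track aliases, and `tracker.com` and `other.com` are hypothetical domains.

```python
class AliasGraph:
    """Union-find over (domain, userID) aliases; each sync links two aliases."""

    def __init__(self):
        self.parent = {}

    def find(self, alias):
        # Register unseen aliases as their own root.
        self.parent.setdefault(alias, alias)
        while self.parent[alias] != alias:
            # Path halving keeps later lookups short.
            self.parent[alias] = self.parent[self.parent[alias]]
            alias = self.parent[alias]
        return alias

    def sync(self, a, b):
        """Record that aliases a and b were matched by a CSync event."""
        self.parent[self.find(a)] = self.find(b)

    def distinct_identities(self):
        """Number of aliases that are still unlinked from each other."""
        return len({self.find(alias) for alias in list(self.parent)})

graph = AliasGraph()
for alias in [("advertiser.com", "userABC"),
              ("tracker.com", "user123"),
              ("other.com", "userXYZ")]:
    graph.find(alias)

before = graph.distinct_identities()
graph.sync(("advertiser.com", "userABC"), ("tracker.com", "user123"))
after = graph.distinct_identities()
print(before, after)  # 3 2
```

Each sync can only lower the count of distinct identities, which is one way to read "reduction of user aliases -> loss of anonymity".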
12. Studying Cookie Synchronization in the wild
• 179M HTTP requests from 850
volunteering users across 2016
• web traffic redirection through a
set of proxies
• Heuristic-based CSync detection
Heuristics-based Cookie Synchronization detection mechanism (flowchart; step labels in the figure: 1, 2a, 2b, 2c, 2d, 3):
• Capture set HTTP cookies: filter out session cookies, extract/store cookie IDs
• HTTP requests: find ID-looking strings in params/path/Referrer (length, # of digits/alphas)
• Have you seen this ID again? No: store it along with its domain. Yes: continue
• From a different domain? (DNS whois, blacklists) Yes: continue
• ID == cookie ID? Yes: Cookie Synchronization!
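The detection flow can be sketched in a few lines. This is an illustrative approximation, not the CONRAD implementation: the ID thresholds are assumptions, `base_domain` is a naive stand-in for the same-provider check (the figure mentions DNS whois and blacklists), and the example assumes that the ID from Table I was set as a cookie by atemda.com.

```python
import re
from urllib.parse import urlsplit, parse_qsl

ID_CHARS = re.compile(r"[A-Za-z0-9_-]+")

def base_domain(host):
    """Naive same-provider check: compare the last two labels only
    (wrong for suffixes such as co.uk; a stand-in for whois/blacklists)."""
    return ".".join(host.split(".")[-2:])

def looks_like_id(token, min_len=10):
    """Assumed thresholds: long enough and mixing digits with letters."""
    return (len(token) >= min_len
            and ID_CHARS.fullmatch(token) is not None
            and any(c.isdigit() for c in token)
            and any(c.isalpha() for c in token))

class CSyncDetector:
    def __init__(self):
        self.cookie_ids = {}  # cookie ID -> domain that set it
        self.seen = {}        # ID-looking string -> domain it was first seen with

    def on_set_cookie(self, domain, value, is_session=False):
        if not is_session:  # session cookies are filtered out
            self.cookie_ids[value] = base_domain(domain)

    def on_request(self, url, referrer=""):
        domain = base_domain(urlsplit(url).hostname)
        tokens = []
        for u in filter(None, (url, referrer)):
            parts = urlsplit(u)
            tokens += [value for _, value in parse_qsl(parts.query)]
            tokens += [seg for seg in parts.path.split("/") if seg]
        events = []
        for token in filter(looks_like_id, tokens):
            # First sighting: store the ID along with its domain.
            first_domain = self.seen.setdefault(token, domain)
            # Seen before, from a different domain, and equal to a cookie ID.
            if first_domain != domain and token in self.cookie_ids:
                events.append((token, first_domain, domain))
        return events

uid = "L2zaWQvMS9lkLzMxOUwOTUw"
detector = CSyncDetector()
detector.on_set_cookie("a.atemda.com", uid)  # assumption, see lead-in
first = detector.on_request("http://a.atemda.com/id/csync?s=" + uid)
second = detector.on_request(
    "http://bidtheater.com/UserMatch.ashx?bidderid=23&bidderuid="
    + uid + "&expiration=1426598931")
print(first)   # []
print(second)  # [('L2zaWQvMS9lkLzMxOUwOTUw', 'atemda.com', 'bidtheater.com')]
```

The first request only stores the ID; the second one carries the same ID to a different domain, and since the ID equals a stored cookie ID it is reported as a synchronization.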
15. Number of affected users
Fig. 7: Number of unique userIDs set per domain, across the year (CDF of users vs. median number of different userIDs per domain). 80% of users are known to a single domain with only ∼2 aliases, on average, throughout the entire year. About 1% of users are more cautious and erase cookies, thus receiving more than 9.5 different userIDs, on average.
Fig. 8: Distribution of the time it takes for the first CSync to appear per user (CDF vs. # of days). Around 20% of the users get their first userID synced in 1 day or less, and 38% of users get synced within their first week of browsing.
Fig. 9: [caption truncated in the source] (y-axis: average syncs/req per user)
Distribution of the time it takes for the first CSync
to appear per user.
Around 20% of the users get their
first userID synced in 1 day or less
• 97% of users with regular activity on
the web (>10 HTTP reqs/day) affected
16. Number of synchronizations per ID
Distribution of synchronizations per userID.
The median userID gets synced
with 3 different 3rd parties.
17. Re-using previously set IDs of other domains
• Cases of domains setting cookies using userIDs previously used by other domains.
• Example:
1. baidu.com sets cookie baiduid = {idA}
2. Later different domains set their own cookies… again by using baiduid = {idA}
• “Cookies are domain-specific”
:-) … Yeah, right!
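Spotting such re-used IDs amounts to grouping observed cookies by value and keeping the values that more than one domain has set. The sketch below assumes a log of (domain, cookie name, value) tuples; `example-a.com` and `example-b.com` are hypothetical domains, and `idA`/`idB` stand in for real IDs.

```python
from collections import defaultdict

def reused_ids(set_cookie_log):
    """Return cookie values set by more than one domain,
    with the domains listed in order of first use."""
    domains_by_value = defaultdict(list)
    for domain, _name, value in set_cookie_log:
        if domain not in domains_by_value[value]:
            domains_by_value[value].append(domain)
    return {value: domains
            for value, domains in domains_by_value.items()
            if len(domains) > 1}

log = [
    ("baidu.com", "baiduid", "idA"),
    ("example-a.com", "baiduid", "idA"),  # hypothetical domain re-using idA
    ("example-b.com", "uid", "idB"),      # hypothetical unrelated cookie
]
reused = reused_ids(log)
print(reused)  # {'idA': ['baidu.com', 'example-a.com']}
```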
18. Summaries of set cookies
Fig. 11: Distribution of synchronizations per userID (x-axis: # of synchronizations). The median userID gets synced with 3.5 different entities.
Fig. 12: Distribution of the number of entities that learned at least one userID of the user, with and without the effect of Cookie Synchronization (x-axis: domains learned about the user). As we can see, after syncing, the number of entities that learned about the median user grew by a factor of 6.75.
[…] that tracking entities may […] measure the number of […] user. Evidently, a median […], and 3% of users has up […] apparent that the IDs of a […] domains through CSync. […], in Figure 11, we plot the […]ts per userID. As we see, […] average, to 3.5 different […] 14%, that gets leaked to […] CSync on the diffusion of […] for each user the number […] e., that learned at least one […]yncs. As we can see from […] the entities that learned […]ncs grew by a factor of […]ctor becomes > 10. This […]nc, when the user visited […]rack them were only the […]ies, but in an independent […]
TABLE V: Example of an ID Summary stored on the user’s browser. It includes userIDs and expiration dates used for the particular user by 4 different domains.
ID Summary stored in cookie by adap.tv
“key=valueclickinc:value=708b532c-5128-4b00-a4f2-2b1fac03de81:expiresat=wed apr 01 15:03:42 pdt 2015,key=mediamathinc:value=60e05435-9357-4b00-8135-273a46820ef2:expiresat=thu mar 19 01:09:47 pst 2015,key=turn:value=2684830505759170345:expiresat=fri mar 06 16:43:34 pst 2015,key=rocketfuelinc:value=639511149771413484:expiresat=sun mar 29 15:43:36 pst 2015”
Summaries. In these summaries, we see the userIDs that other
domains use for the particular user previously obtained by
CSyncs. An example of such summaries in JSON is shown in
Table V. As one can see, the cookie set by adap.tv includes
the userIDs and cookie expiration dates of valueclick.com,
mediamath.com, turn.com and rocketfuel.com. In our
dataset, we find at least 3 such companies providing ID
Summaries to other collaborating entities. This user-side info
allows (i) the synchronizing entities to learn more userIDs
through a single synchronization request, and (ii) adap.tv to
• Example of an ID Summary stored in a cookie on the user’s mobile browser.
• It includes (previously synced) userIDs and expiration dates as set by 4 different domains.
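As a minimal sketch of how such an ID Summary could be read, the snippet below parses the adap.tv example from Table V into one entry per partner. The `key=…:value=…:expiresat=…` layout is inferred from that single example, and the rocketfuel value, which is split across two lines in the source, is assumed to be one contiguous number.

```python
import re

# One entry looks like key=<name>:value=<id>:expiresat=<date>;
# entries are separated by ",key=". The date itself contains colons,
# so it is matched lazily up to the next entry or the end of the string.
ENTRY = re.compile(
    r"key=(?P<key>[^:]+):value=(?P<value>[^:]+):"
    r"expiresat=(?P<expires>.+?)(?=,key=|$)"
)

summary = (
    "key=valueclickinc:value=708b532c-5128-4b00-a4f2-2b1fac03de81:"
    "expiresat=wed apr 01 15:03:42 pdt 2015,"
    "key=mediamathinc:value=60e05435-9357-4b00-8135-273a46820ef2:"
    "expiresat=thu mar 19 01:09:47 pst 2015,"
    "key=turn:value=2684830505759170345:"
    "expiresat=fri mar 06 16:43:34 pst 2015,"
    # Assumption: the line break inside this value in the source is a wrap.
    "key=rocketfuelinc:value=639511149771413484:"
    "expiresat=sun mar 29 15:43:36 pst 2015"
)

def parse_id_summary(cookie_value):
    """Map each partner key to its (userID, expiration string)."""
    return {m["key"]: (m["value"], m["expires"])
            for m in ENTRY.finditer(cookie_value)}

parsed = parse_id_summary(summary)
print(sorted(parsed))   # ['mediamathinc', 'rocketfuelinc', 'turn', 'valueclickinc']
print(parsed["turn"])   # ('2684830505759170345', 'fri mar 06 16:43:34 pst 2015')
```

A single cookie of this kind hands over four partner userIDs at once, which is why one synchronization request can reveal several IDs.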
19. Diffusion of user ID sharing
[Plot: CDF of synced users vs. domains that learned about the user, before CSync and after CSync]
Distribution of the number of entities that learned at least one userID of
the user (i) with and (ii) without the effect of Cookie Synchronization.
After syncing, the number of entities
that reached the median
user grew 6.75x
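One plausible way to compute such a growth factor is sketched below, assuming that for every user we know the domains that learned a userID directly and the (sender, receiver) pairs of the observed syncs. The users and domains are toy values made up for illustration, so the result says nothing about the 6.75x figure, and the exact metric definition used by the authors may differ.

```python
from statistics import median

def domains_after_sync(direct_domains, sync_events):
    """Domains that know at least one userID once sync receivers are added."""
    learned = set(direct_domains)
    for _sender, receiver in sync_events:
        learned.add(receiver)
    return learned

# Toy input: user -> (domains that learned an ID directly, sync events).
users = {
    "u1": ({"a.com", "b.com"},
           [("a.com", "c.com"), ("a.com", "d.com"), ("b.com", "c.com")]),
    "u2": ({"a.com"}, [("a.com", "b.com")]),
    "u3": ({"a.com", "b.com"}, []),
}

factors = [
    len(domains_after_sync(direct, syncs)) / len(direct)
    for direct, syncs in users.values()
]
print(factors)          # [2.0, 2.0, 1.0]
print(median(factors))  # 2.0
```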
20. Sensitive information leaked along with the userIDs
• 13 sync requests leaking the user’s city-level location
• 2 leaking the user’s registered phone number
• 10 leaking the user’s gender
• 9 leaking the user’s exact age
• 3 leaking the user’s full birth date
• 2 leaking the user’s first and last name
• 16 leaking the user’s email address
• 4 leaking user login credentials: username/password
21. What if userID in the HTTP request is concealed?
22. What if userID is concealed?
• DoubleClick encrypts its userIDs during Cookie Synchronization
How can one detect synchronizations in this case?
• Based on our ground truth dataset, we train a Decision Tree to classify
cookie synchronizing requests
• Features from the HTTP request:
Domain -- Type of domain (i.e., Advertiser, Analytics, Social) -- Status code -- Number of
parameters -- Names of the parameters -- and more
• Detection Performance: 90% Accuracy - 0.97 AUC
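The features listed above can be derived from a request without looking at any cookie value. The sketch below builds one feature record per request; the `DOMAIN_TYPES` lookup, the example URL, and its parameter names are hypothetical, and the slide's "and more" features are not covered. Records like these would still need to be vectorized before training a decision tree (for example with a library such as scikit-learn), a step that is omitted here.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical lookup; the slide does not say how domain types were assigned.
DOMAIN_TYPES = {"doubleclick.net": "Advertiser"}

def request_features(url, status_code):
    """Build a cookie-less feature record for one HTTP request."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain = ".".join(host.split(".")[-2:])  # naive base-domain extraction
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "domain": domain,
        "domain_type": DOMAIN_TYPES.get(domain, "Unknown"),
        "status_code": status_code,
        "num_params": len(params),
        "param_names": sorted({name for name, _ in params}),
    }

# Hypothetical request with an opaque (encrypted) token instead of a userID.
features = request_features(
    "https://sync.doubleclick.net/match?partner=p1&token=ENCRYPTED", 302)
print(features)
```

Because none of these features depends on the ID value itself, a classifier trained on them can flag a syncing request even when the userID in the URL is encrypted.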
23. In Summary…
1. We study the cookie sharing flows between the user-tracking entities
➢ Cookie Synchronization: the catalyst for any user data sharing
2. We propose a CSync detection mechanism
➢ capable of detecting syncing requests even when cookie information in the URLs is concealed
3. We analyze a dataset of 850 real mobile users
4. Our results show that:
• 97% of regular web users are exposed to CSync.
• The median userID gets synced, on average, with 3 different 3rd parties.
• CSync increases the number of trackers reaching the user by 6.75x.