Data Collection without Privacy Side Effects

Josep M. Pujol, Researcher at CLIQZ
CLIQZ @ BIG 2016…
Data Collection without Privacy Side-Effects
Konark Modi (@konarkmodi)
Josep M. Pujol (@solso)
Data collection on Big Data
Where does the data of Big Data come from?
The Elephant in the room
Applications of Big Data
Who collects data on the Web?
Wired, eBay and Meetic collect data as 1st parties when a user visits or interacts with their sites. However, many 3rd parties also collect data.
From CLIQZ’s paper “Tracking the Trackers”, to be presented at WWW 2016 [1]:
>> 78% of page loads send information to at least one 3rd party that is deemed unsafe with respect to privacy.
Motivation: A recurring real-life conversation
Company X: Hi, this is company X. CLIQZ anti-tracking is affecting us. Can we talk?
CLIQZ: Sure.
Company X: We are not trackers. We only measure audiences (or collect aggregated data, or measure goal conversion or site performance metrics). We take privacy very seriously.
CLIQZ: Understood, let us check what’s going on.
Company X: Thanks a lot.
CLIQZ: Well, you are actually tracking users. See the attachment. You have the ability to know that these 20 webpages were visited by the same person and, to make things worse, you can derive their real identity. Users’ privacy is at risk.
Company X: No, no. We do NOT use that information at all, we remove it as soon as it is received. We are only interested in measuring XYZ.
CLIQZ: But we just showed you an example of tracking. Whether it is intentional or not should not matter, right?
Company X: I repeat that we are NOT using this data at all for anything, see our Privacy Policy. To implement our service we require a data element that can be used as a user identifier, there is no other way…
CLIQZ: There is another way. Happy to show you …
Motivation: A recurring real-life conversation
Company X: ...mmm, thanks… ...er... ...we will get back to you...
CLIQZ: Great! Looking forward to it.
Unfortunately, they never come back :( We formulated 3 hypotheses:
1) They are interested in collecting data from users: they are “intentionally” tracking.
2) They are not concerned about privacy side-effects. On the trade-off between privacy and convenience, they chose the latter.
3) We could not successfully explain our alternative approach for privacy-preserving data collection.
Motivation: A recurring real-life conversation
We hope that it is not #1. That is why we decided:
•  To open-source a prototype of a Google Analytics look-alike that does not rely on tracking, hoping that the code will be more explanatory.
•  To write this paper and presentation.
An Example of Unintentional Tracking
Google Analytics (GA)
•  GA is massive, present in 44% of all page loads.
•  GA does not offer any (public) service that requires building a session with all of a user’s activity.
•  GA actually cares a lot about privacy:
–  Ephemeral UIDs
–  Sanitization of URLs
Privacy Breaches are Unavoidable (even for GA)
wired.com/ 09:49:12 [137.9.10.X, 1140x645]
ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
www.meetic.com/home/index.php 09:59:01 [137.9.10.X, 1140x645]
analytics.twitter.com/user/solso/home 10:05:45 [137.9.10.X, 1140x645]
The last page is only accessible after login and it contains my username => Personally Identifiable Information (PII) leak.
IP: 137.9.10.XX https://www.google-analytics.com/collect? … dl=https%3A%2F%2Fanalytics.twitter.com%2Fuser%2Fsolso%2Fhome& ... &vp=1140x645&...
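To make the leak concrete, the request above can be decoded with a few lines of Python. This is only a sketch: the URL below is a hypothetical reduction of the one on the slide, keeping just the two parameters that are visible there (dl and vp) and omitting the elided ones.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical reduction of the request on the slide: only the two
# parameters that are visible there (dl, vp); the elided ones are omitted.
request_url = (
    "https://www.google-analytics.com/collect?"
    "dl=https%3A%2F%2Fanalytics.twitter.com%2Fuser%2Fsolso%2Fhome"
    "&vp=1140x645"
)

# parse_qs percent-decodes the values for us.
params = parse_qs(urlsplit(request_url).query)
page_url = params["dl"][0]
viewport = params["vp"][0]

# The path of the visited page carries the username.
path_parts = urlsplit(page_url).path.strip("/").split("/")

print(page_url)    # https://analytics.twitter.com/user/solso/home
print(viewport)    # 1140x645
print(path_parts)  # ['user', 'solso', 'home']
```

Anyone who receives this request, together with the sender's IP address, can read the username straight out of the page URL.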
Example: Counting Unique Visitors
Records received by the GA backend:
wired.com/xyz 09:48:40 82.143.2.X
wired.com/xyz 09:48:42 137.9.10.X
wired.com/xyz 09:48:59 137.9.10.X
wired.com/xyz 09:49:12 137.9.10.X
4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved?
The same records with the viewport size added:
wired.com/xyz 09:48:40 [82.143.2.X, 1320x910]
wired.com/xyz 09:48:42 [137.9.10.X, 1266x809]
wired.com/xyz 09:48:59 [137.9.10.X, 940x645]
wired.com/xyz 09:49:12 [137.9.10.X, 940x645]
The GA backend has to identify which records come from the same person to avoid over-counting. A UID is needed.
4 visits, 3 unique visitors
Without that data, the question cannot be resolved:
wired.com/xyz 09:48:40 ---
wired.com/xyz 09:48:42 ---
wired.com/xyz 09:48:59 ---
wired.com/xyz 09:49:12 ---
With it, the same person can be followed across sites:
wired.com/ 09:49:12 [137.9.10.X, 1140x645]
ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
www.meetic.com/home/index.php 09:59:01 [137.9.10.X, 1140x645]
analytics.twitter.com/user/solso/home 10:05:45 [137.9.10.X, 1140x645]
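The server-side count can be sketched as follows. The sketch assumes, purely for illustration, that the UID is the pair (IP, viewport); the slides do not state how GA actually derives its identifier, and the function name is hypothetical.

```python
# Records for wired.com/xyz as shown on the slide: (page, time, IP, viewport).
records = [
    ("wired.com/xyz", "09:48:40", "82.143.2.X", "1320x910"),
    ("wired.com/xyz", "09:48:42", "137.9.10.X", "1266x809"),
    ("wired.com/xyz", "09:48:59", "137.9.10.X", "940x645"),
    ("wired.com/xyz", "09:49:12", "137.9.10.X", "940x645"),
]


def count_visits_and_uniques(records):
    """Server-side counting; the UID is assumed to be the (IP, viewport) pair."""
    uids = {(ip, viewport) for _page, _time, ip, viewport in records}
    return len(records), len(uids)


visits, uniques = count_visits_and_uniques(records)
print(visits, uniques)  # 4 3

# With the IP alone, the same records would collapse into 2 uniques.
uniques_by_ip = len({ip for _page, _time, ip, _viewport in records})
print(uniques_by_ip)  # 2
```

The count is only as good as the identifier: the finer the UID, the more accurate the number of uniques, and the easier it becomes to single out one person.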
As long as aggregation of data per user on the server-side is needed, we will always incur undesired privacy side-effects.
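The side-effect can be illustrated with the records from the earlier example. The sketch below again assumes a UID made of (IP, viewport), which is an illustrative choice and not a description of any real backend; the function name is hypothetical.

```python
from collections import defaultdict

# Page loads from the earlier slide, as they could reach one 3rd-party backend.
records = [
    ("wired.com/", "09:49:12", "137.9.10.X", "1140x645"),
    ("ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200",
     "09:50:02", "137.9.10.X", "1140x645"),
    ("twitter.com/solso", "09:52:10", "137.9.10.X", "1140x645"),
    ("www.meetic.com/home/index.php", "09:59:01", "137.9.10.X", "1140x645"),
    ("analytics.twitter.com/user/solso/home",
     "10:05:45", "137.9.10.X", "1140x645"),
]


def sessions_by_uid(records):
    """Group page loads by the assumed UID, i.e. the (IP, viewport) pair."""
    sessions = defaultdict(list)
    for page, time, ip, viewport in records:
        sessions[(ip, viewport)].append((time, page))
    return dict(sessions)


sessions = sessions_by_uid(records)
history = sessions[("137.9.10.X", "1140x645")]
print(len(history))    # 5
print(history[-1][1])  # analytics.twitter.com/user/solso/home
```

The grouping that de-duplicates visitors is the same grouping that reconstructs one person's browsing session, including the page that reveals the username.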
Since server-side aggregation is the root of the problem, we should move the aggregation of data to the client-side (i.e. the user’s browser).
Counting Unique Visitors…
Server-side Aggregation – Google Analytics
•  First visit to wired.com/xyz: the 3rd party tracking script in the browser sends “wired.com/xyz [137.9.10.X, 940x645]” to the GA backend.
•  Second visit to wired.com/xyz: the script sends “wired.com/xyz [137.9.10.X, 940x645]” again.
•  The GA backend counts uniques from the records it has received.
Client-side Aggregation – CLIQZ Green Tracker (CGT)
•  First visit to wired.com/xyz: the state in the browser is empty (state = []), so the 3rd party tracking script sends “visit wired.com/xyz” and “unique-visit wired.com/xyz” to the CGT backend, and keeps state = [H(wired.com/xyz, unique-visit, timestamp)].
•  Second visit to wired.com/xyz: the state already contains H(wired.com/xyz, unique-visit, timestamp), so the script sends only “visit wired.com/xyz”.
•  The CGT backend counts uniques from the messages it has received.
Possible if you control the browser (i.e. CLIQZ). But also possible with HTML5 LocalStorage and PostMessage APIs.
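The client-side flow can be simulated in a few lines of Python. This is a sketch of the idea, not the green-tracker code: the class names, the use of SHA-256 for H, and storing the timestamp next to the hash rather than inside it are all assumptions made for illustration.

```python
import hashlib
import time


def H(*parts):
    """Hash used as the state key; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class Client:
    """Stands in for the tracking script plus its browser-side state."""

    def __init__(self):
        # H(url, "unique-visit") -> timestamp of the first visit.
        self.state = {}

    def on_page_load(self, url, now=None):
        """Return the messages for the backend; none of them carries a UID."""
        now = time.time() if now is None else now
        messages = [("visit", url)]
        key = H(url, "unique-visit")
        if key not in self.state:
            self.state[key] = now
            messages.append(("unique-visit", url))
        return messages


class Backend:
    """Stands in for the collector: it only increments counters."""

    def __init__(self):
        self.counts = {}

    def receive(self, messages):
        for kind, url in messages:
            self.counts[(kind, url)] = self.counts.get((kind, url), 0) + 1


backend = Backend()
alice, bob = Client(), Client()  # two hypothetical browsers
for client in (alice, alice, alice, bob):
    backend.receive(client.on_page_load("wired.com/xyz"))

print(backend.counts[("visit", "wired.com/xyz")])         # 4
print(backend.counts[("unique-visit", "wired.com/xyz")])  # 2
```

The backend gets the same two numbers a UID-based count would produce, but no message links one page load to another.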
Beyond Counting Unique Visitors?
Working prototype of a GA-clone featuring:
–  Unique visits and page loads.
–  Returning customers.
–  Goal conversion to track campaigns.
–  Cross-site correlations.
–  In-site click-throughs.
–  Visits and time on page per user (without beacons).
A privacy-preserving tracking agent, green-tracker, implements all 6 of these use-cases in less than 200 lines of code.
Demo: http://site1.test.cliqz.com/
Conclusions
Data collection based on server-side aggregation of users’ data is very problematic, as it implies tracking users.
Tracking leads to privacy side-effects; we provided evidence of privacy leaks on Google Analytics.
Tracking can be avoided if one switches the design pattern to client-side aggregation.
To demonstrate the feasibility of client-side aggregation, we built and open-sourced a Google Analytics look-alike:
https://github.com/cliqz/green-tracker
It implements, in a privacy-preserving way, a wide range of use-cases that normally require tracking users.
Q&A
Thanks for your attention!
Appendix
Keeping State on the Client
Modern browsers have the ability to keep state via HTML5 LocalStorage. Therefore, a privacy-preserving tracking script can keep a persistent state across multiple sites if it is loaded from an IFRAME.
•  Looks pretty familiar, but is slightly different:
–  LocalStorage belongs to green-tracker.fbt.co (the collector backend)
–  Respects CORS
–  IFRAME is sandboxed (no access to Document)
–  Explicit control from the site-owner (postMessage)
–  Explicit control from the user (messages and state can be inspected and removed at will)
Limitations
As always, there are limitations that one must consider:
•  Deployment is not immediate. It requires code changes both in the tracking script and in the collectors.
•  Unplanned use-cases might not be possible retrospectively.
•  The business logic of the data collector is explicit to the user.
•  The state of the client can become a privacy issue if not handled properly; take care not to create a duplicate of the user’s browsing history.
•  Browsers might have factory-default options that prevent LocalStorage from working as expected. For instance, Safari blocks 3rd party cookies, which also affects LocalStorage; the user can change the setting, but this is sub-optimal.
1 of 36

Recommended

Tracking The Trackers WWW 2016 by
Tracking The Trackers WWW 2016Tracking The Trackers WWW 2016
Tracking The Trackers WWW 2016Josep M. Pujol
3.1K views51 slides
The little engine(s) that could: scaling online social networks by
The little engine(s) that could: scaling online social  networksThe little engine(s) that could: scaling online social  networks
The little engine(s) that could: scaling online social networksJosep M. Pujol
1.3K views66 slides
keyvi the key value index @ Cliqz by
keyvi the key value index @ Cliqzkeyvi the key value index @ Cliqz
keyvi the key value index @ CliqzHendrik Muhs
1.5K views13 slides
Philippine Republic Act No. 10173 Data Privacy Act of 2012 by
Philippine Republic Act No. 10173 Data Privacy Act of 2012Philippine Republic Act No. 10173 Data Privacy Act of 2012
Philippine Republic Act No. 10173 Data Privacy Act of 2012Macoy Mejia
2.5K views14 slides
PayPal Big Data and MySQL Cluster by
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterMat Keep
16.8K views29 slides
Data privacy act of 2012 presentation by
Data privacy act of 2012 presentationData privacy act of 2012 presentation
Data privacy act of 2012 presentationKittelson & Carpo Consulting
27.3K views28 slides

More Related Content

Similar to Data Collection without Privacy Side Effects

Confection Investor Pitch Deck by
Confection Investor Pitch DeckConfection Investor Pitch Deck
Confection Investor Pitch DeckQuimby Melton
203 views24 slides
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google by
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleData Con LA
6.5K views74 slides
Corporates' malicious behaviour: Intent or accident? by
Corporates' malicious behaviour: Intent or accident?Corporates' malicious behaviour: Intent or accident?
Corporates' malicious behaviour: Intent or accident?Konark modi
78 views59 slides
Let’s Get Cirrus About Personal Clouds by
Let’s Get Cirrus About Personal CloudsLet’s Get Cirrus About Personal Clouds
Let’s Get Cirrus About Personal CloudsT.Rob Wyatt
1.9K views23 slides
Mrmcd2017 by
Mrmcd2017Mrmcd2017
Mrmcd2017Konark modi
244 views68 slides
Blackhat Analytics 2 @ Superweek by
Blackhat Analytics 2  @ SuperweekBlackhat Analytics 2  @ Superweek
Blackhat Analytics 2 @ SuperweekPhil Pearce
8.5K views105 slides

Similar to Data Collection without Privacy Side Effects(20)

Confection Investor Pitch Deck by Quimby Melton
Confection Investor Pitch DeckConfection Investor Pitch Deck
Confection Investor Pitch Deck
Quimby Melton203 views
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google by Data Con LA
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
Data Con LA6.5K views
Corporates' malicious behaviour: Intent or accident? by Konark modi
Corporates' malicious behaviour: Intent or accident?Corporates' malicious behaviour: Intent or accident?
Corporates' malicious behaviour: Intent or accident?
Konark modi78 views
Let’s Get Cirrus About Personal Clouds by T.Rob Wyatt
Let’s Get Cirrus About Personal CloudsLet’s Get Cirrus About Personal Clouds
Let’s Get Cirrus About Personal Clouds
T.Rob Wyatt1.9K views
Blackhat Analytics 2 @ Superweek by Phil Pearce
Blackhat Analytics 2  @ SuperweekBlackhat Analytics 2  @ Superweek
Blackhat Analytics 2 @ Superweek
Phil Pearce8.5K views
Big Data Scotland 2017 by Ray Bugg
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017
Ray Bugg2.4K views
Release The Hounds: Part 2 “11 Years Is A Long Ass Time” by Casey Ellis
Release The Hounds: Part 2 “11 Years Is A Long Ass Time”Release The Hounds: Part 2 “11 Years Is A Long Ass Time”
Release The Hounds: Part 2 “11 Years Is A Long Ass Time”
Casey Ellis152 views
Smartly Secure, Securely Smart _ Enterprise IT News by Krishna Arani
Smartly Secure, Securely Smart _ Enterprise IT NewsSmartly Secure, Securely Smart _ Enterprise IT News
Smartly Secure, Securely Smart _ Enterprise IT News
Krishna Arani351 views
Outside the Comfort Zone: Cross Industry Use Cases in Big Data Analytics by Rising Media Ltd.
Outside the Comfort Zone: Cross Industry Use Cases in Big Data AnalyticsOutside the Comfort Zone: Cross Industry Use Cases in Big Data Analytics
Outside the Comfort Zone: Cross Industry Use Cases in Big Data Analytics
Rising Media Ltd.977 views
BSides Lisbon 2017: David Sopas's 'GTFO Mr. User' by Checkmarx
BSides Lisbon 2017: David Sopas's 'GTFO Mr. User'BSides Lisbon 2017: David Sopas's 'GTFO Mr. User'
BSides Lisbon 2017: David Sopas's 'GTFO Mr. User'
Checkmarx85 views
Killing the golden calf of coding - We are Developers keynote by Christian Heilmann
Killing the golden calf of coding - We are Developers keynoteKilling the golden calf of coding - We are Developers keynote
Killing the golden calf of coding - We are Developers keynote
Christian Heilmann3.1K views
South By South Best 2018 by James Quinlan
South By South Best 2018 South By South Best 2018
South By South Best 2018
James Quinlan797 views
Less is More: Behind the Data at Risk I/O by Michael Roytman
Less is More: Behind the Data at Risk I/OLess is More: Behind the Data at Risk I/O
Less is More: Behind the Data at Risk I/O
Michael Roytman1.6K views
ActivityStrea.ms: Is It Getting Streamy In Here? by Chris Messina
ActivityStrea.ms: Is It Getting Streamy In Here?ActivityStrea.ms: Is It Getting Streamy In Here?
ActivityStrea.ms: Is It Getting Streamy In Here?
Chris Messina66.2K views
Progressing JavaScript and Apps the Web way… by Christian Heilmann
 Progressing JavaScript and Apps the Web way…  Progressing JavaScript and Apps the Web way…
Progressing JavaScript and Apps the Web way…
Christian Heilmann1.6K views
Blackhat Analytics 3 @ superweek - Do be evil: Force awakens by Phil Pearce
Blackhat Analytics 3 @  superweek - Do be evil: Force awakensBlackhat Analytics 3 @  superweek - Do be evil: Force awakens
Blackhat Analytics 3 @ superweek - Do be evil: Force awakens
Phil Pearce5.2K views

Recently uploaded

CYTOSKELETON STRUCTURE.ppt by
CYTOSKELETON STRUCTURE.pptCYTOSKELETON STRUCTURE.ppt
CYTOSKELETON STRUCTURE.pptEstherShobhaR
14 views19 slides
vitamine B1.pptx by
vitamine B1.pptxvitamine B1.pptx
vitamine B1.pptxajithkilpart
29 views22 slides
Krishna VSC 692 Credit Seminar.pptx by
Krishna VSC 692 Credit Seminar.pptxKrishna VSC 692 Credit Seminar.pptx
Krishna VSC 692 Credit Seminar.pptxKrishnaSharma682993
11 views54 slides
TF-FAIR.pdf by
TF-FAIR.pdfTF-FAIR.pdf
TF-FAIR.pdfDirk Roorda
6 views120 slides
IMMUNODIAGNOSTICS KITS.pdf by
IMMUNODIAGNOSTICS KITS.pdfIMMUNODIAGNOSTICS KITS.pdf
IMMUNODIAGNOSTICS KITS.pdfvetrivel303632
17 views10 slides

Recently uploaded(20)

Indian council for child welfare by RenuWaghmare2
Indian council for child welfareIndian council for child welfare
Indian council for child welfare
RenuWaghmare27 views
2. Natural Sciences and Technology Author Siyavula.pdf by ssuser821efa
2. Natural Sciences and Technology Author Siyavula.pdf2. Natural Sciences and Technology Author Siyavula.pdf
2. Natural Sciences and Technology Author Siyavula.pdf
ssuser821efa11 views
Factors affecting fluorescence and phosphorescence.pptx by SamarthGiri1
Factors affecting fluorescence and phosphorescence.pptxFactors affecting fluorescence and phosphorescence.pptx
Factors affecting fluorescence and phosphorescence.pptx
SamarthGiri17 views
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor... by Trustlife
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Trustlife114 views
Presentation on experimental laboratory animal- Hamster by Kanika13641
Presentation on experimental laboratory animal- HamsterPresentation on experimental laboratory animal- Hamster
Presentation on experimental laboratory animal- Hamster
Kanika136416 views
Best Hybrid Event Platform.pptx by Harriet Davis
Best Hybrid Event Platform.pptxBest Hybrid Event Platform.pptx
Best Hybrid Event Platform.pptx
Harriet Davis8 views
Applications of Large Language Models in Materials Discovery and Design by Anubhav Jain
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
Anubhav Jain14 views
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ... by ILRI
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
ILRI6 views
A giant thin stellar stream in the Coma Galaxy Cluster by Sérgio Sacani
A giant thin stellar stream in the Coma Galaxy ClusterA giant thin stellar stream in the Coma Galaxy Cluster
A giant thin stellar stream in the Coma Galaxy Cluster
Sérgio Sacani19 views
Note on the Riemann Hypothesis by vegafrank2
Note on the Riemann HypothesisNote on the Riemann Hypothesis
Note on the Riemann Hypothesis
vegafrank28 views
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F... by SwagatBehera9
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
SwagatBehera95 views

Data Collection without Privacy Side Effects

  • 1. CLIQZ @ BIG 2016… Data Collection without Privacy Side-Effects Konark Modi Josep M. Pujol @konarkmodi @solso
  • 2. CLIQZ @ BIG 2016… Data collection on Big Data Where does the data of Big Data comes from? The Elephant in the room Applications of Big Data
  • 3. CLIQZ @ BIG 2016… Who collects data on the Web? Wired, Ebay and Meetic collect data as 1st parties as a user visits/interacts with their sites. However there are a lot of 3rd parties that also collect data. On CLIQZ’s paper: “Tracking the Trackers”. To be presented at WWW 2016 [1] >> 78% of page loads send information to at least one 3rd party that is deemed unsafe wrt privacy.
  • 4. CLIQZ @ BIG 2016… Motivation: A recurring real-life conversation Hi, this is company X. CLIQZ anti-tracking is affecting us. Can we talk? We are not trackers. We only measure audiences (or collect aggregated or measure goal conversion or site performance metrics). We take privacy very seriously. Sure Understood, let us check what’s going on Well, you are actually tracking users. See the attachment. You have the ability to know that these 20 webpages were visited by the same person, and to make things worse, you can derive his real identity. Users privacy is at risk Thanks a lot No, no. We do NOT use that information at all, we remove it as soon it is received. We are only interesting in measuring XYZ. But we just show you an example of tracking. Intentionally or not does not should not matter, right? I repeat that we are NOT using this data at all for anything, see our Privacy Policy. To implement our service we require that data element that can be used as user identifier, there is no other way… There is another way. Happy to show you …
  • 5. CLIQZ @ BIG 2016… Motivation: A recurring real-life conversation Unfortunately, they never come back L. We formulated 3 hypotheses: 1) They were interested in collecting data from users. They are “intentionally” tracking. 2) They are not concerned about privacy side-effects. On the trade-off between privacy and convenience, chose the later. 3) We could not successfully explain our alternative approach for privacy-preserving data collection. ...mmm, thanks… ...er... ...we will get back to you... Great! Looking forward to it. There is another way. Happy to show you …
  • 6. CLIQZ @ BIG 2016… Motivation: A recurring real-life conversation We hope that it is not #1, that’s why we decided: •  To open-source a prototype of a Google Analytics look- alike that does not rely on tracking. Hoping that the code will be more explanatory. •  To write this paper and presentation. ...mmm, thanks… ...er... ...we will get back to you... Great! Looking forward to it. There is another way. Happy to show you …
  • 7. CLIQZ @ BIG 2016… An Example of Unintentional Tracking Google Analytics (GA) •  GA is massive, present in 44% of all page loads. •  GA does not offer any service (public) that requires to build the a session with all user’s activity •  GA actually cares a lot about privacy –  Ephemeral UIDs –  Sanitization of URLs
  • 8. CLIQZ @ BIG 2016… Privacy Breaches are Unavoidable (even for GA) wired.com/ 09:49:12 [137.9.10.X, 1140x645]
  • 9. CLIQZ @ BIG 2016… Privacy Breaches are Unavoidable (even for GA) wired.com/ 09:49:12 [137.9.10.X, 1140x645] ebay-kleinanzeigen.de/ s-muenchen/cyclocross/ k0l6411r200 09:50:02 [137.9.10.X, 1140x645]
  • 10. CLIQZ @ BIG 2016… Privacy Breaches are Unavoidable (even for GA) wired.com/ 09:49:12 [137.9.10.X, 1140x645] ebay-kleinanzeigen.de/ s-muenchen/cyclocross/ k0l6411r200 09:50:02 [137.9.10.X, 1140x645] twitter.com/solso 09:52:10 [137.9.10.X, 1140x645]
  • 11. CLIQZ @ BIG 2016… Privacy Breaches are Unavoidable (even for GA) wired.com/ 09:49:12 [137.9.10.X, 1140x645] ebay-kleinanzeigen.de/ s-muenchen/cyclocross/ k0l6411r200 09:50:02 [137.9.10.X, 1140x645] twitter.com/solso 09:52:10 [137.9.10.X, 1140x645] www.meetic.com/ home/index.php 09:59:01 [137.9.10.X, 1140x645]
  • 12. CLIQZ @ BIG 2016… Privacy Breaches are Unavoidable (even for GA) wired.com/ 09:49:12 [137.9.10.X, 1140x645] ebay-kleinanzeigen.de/ s-muenchen/cyclocross/ k0l6411r200 09:50:02 [137.9.10.X, 1140x645] twitter.com/solso 09:52:10 [137.9.10.X, 1140x645] www.meetic.com/ home/index.php 09:59:01 [137.9.10.X, 1140x645] analytics.twitter.com/ user/solso/home 10:05:45 [137.9.10.X, 1140x645]
  • 13. CLIQZ @ BIG 2016… wired.com/ 09:49:12 [137.9.10.X, 1140x645] ebay-kleinanzeigen.de/ s-muenchen/cyclocross/ k0l6411r200 09:50:02 [137.9.10.X, 1140x645] twitter.com/solso 09:52:10 [137.9.10.X, 1140x645] www.meetic.com/ home/index.php 09:59:01 [137.9.10.X, 1140x645] analytics.twitter.com/ user/solso/home 10:05:45 [137.9.10.X, 1140x645] Last page is only accessible after login and it contains my username => Personal Identifiable Information (PII) leak. IP: 137.9.10.XX https://www.google- analytics.com/collect? … dl=https%3A%2F %2Fanalytics.twitter.com%2Fuser%2Fsolso %2Fhome& ... &vp=1140x645&... Privacy Breaches are Unavoidable (even for GA)
  • 14. CLIQZ @ BIG 2016… Example: Counting Unique Visitors wired.com/xyz 09:48:40 82.143.2.X wired.com/xyz 09:48:42 137.9.10.X wired.com/xyz 09:48:59 137.9.10.X wired.com/xyz 09:49:12 137.9.10.X 4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved? GA backend
  • 15. CLIQZ @ BIG 2016… Example: Counting Unique Visitors wired.com/xyz 09:48:40 82.143.2.X wired.com/xyz 09:48:42 137.9.10.X wired.com/xyz 09:48:59 137.9.10.X wired.com/xyz 09:49:12 137.9.10.X 4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved? GA backend wired.com/xyz 09:48:40 [82.143.2.X, 1320x910] wired.com/xyz 09:48:42 [137.9.10.X, 1266x809] wired.com/xyz 09:48:59 [137.9.10.X, 940x645] wired.com/xyz 09:49:12 [137.9.10.X, 940x645] GA backend Identifying which records come from the same person to avoid over- counting. A UID is needed 4 visits, 3 unique visitors
  • 16. CLIQZ @ BIG 2016… Example: Counting Unique Visitors wired.com/xyz 09:48:40 --- wired.com/xyz 09:48:42 --- wired.com/xyz 09:48:59 --- wired.com/xyz 09:49:12 --- 4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved? GA backend wired.com/xyz 09:48:40 [82.143.2.X, 1320x910] wired.com/xyz 09:48:42 [137.9.10.X, 1266x809] wired.com/xyz 09:48:59 [137.9.10.X, 940x645] wired.com/xyz 09:49:12 [137.9.10.X, 940x645] GA backend Identifying which records come from the same person to avoid over- counting. A UID is needed 4 visits, 3 unique visitors wired.com/ 09:49:12 [137.9.10.X, 1140x645] ebay-kleinanzeigen.de/ s-muenchen/cyclocross/ k0l6411r200 09:50:02 [137.9.10.X, 1140x645] twitter.com/solso 09:52:10 [137.9.10.X, 1140x645] www.meetic.com/ home/index.php 09:59:01 [137.9.10.X, 1140x645] analytics.twitter.com/ user/solso/home 10:05:45 [137.9.10.X, 1140x645]
  • 17. CLIQZ @ BIG 2016… As long as aggregation of data per user on the server-side is needed, we will always incur on undesired privacy side- effects.
  • 18. CLIQZ @ BIG 2016… Since server-side aggregation is the root of the problem, we should move the aggregation of data to the client-side (i.e. the user’s browser)
  • 19. CLIQZ @ BIG 2016… Counting Unique Visitors… Server-side Aggrega-on – Google Analy-cs wired.com/xyz wired.com/xyz GA Backend CGT Backend Client-side Aggrega-on – CLIQZ Green Tracker Browser Browser
  • 20. CLIQZ @ BIG 2016… Counting Unique Visitors… Server-side Aggrega-on – Google Analy-cs wired.com/xyz wired.com/xyz 3rd party tracking script GA Backend CGT Backend Client-side Aggrega-on – CLIQZ Green Tracker 3rd party tracking script Browser Browser
  • 21. CLIQZ @ BIG 2016… Counting Unique Visitors… Server-side Aggrega-on – Google Analy-cs wired.com/xyz wired.com/xyz 3rd party tracking script wired.com/xyz [137.9.10.X, 940x645] GA Backend CGT Backend Client-side Aggrega-on – CLIQZ Green Tracker 3rd party tracking script Browser Browser visit wired.com/xyz unique-visit wired.com/xyz state = []
  • 22. CLIQZ @ BIG 2016… Counting Unique Visitors… Server-side Aggrega-on – Google Analy-cs wired.com/xyz wired.com/xyz 3rd party tracking script wired.com/xyz [137.9.10.X, 940x645] GA Backend CGT Backend Client-side Aggrega-on – CLIQZ Green Tracker 3rd party tracking script Browser Browser visit wired.com/xyz unique-visit wired.com/xyz state = [ H(wired.com/xyz, unique-visit, timestamp)]
  • 24. Counting Unique Visitors… [Diagram build: both backends count uniques. The GA Backend holds wired.com/xyz [137.9.10.X, 940x645]; the CGT Backend holds "visit wired.com/xyz" and "unique-visit wired.com/xyz".]
  • 25. Counting Unique Visitors… [Diagram build: the two panels again show Browser, wired.com/xyz and backend, with the earlier records kept: wired.com/xyz [137.9.10.X, 940x645] on the server side; "visit wired.com/xyz" and "unique-visit wired.com/xyz" on the client side.]
  • 26. Counting Unique Visitors… [Diagram build: the 3rd party tracking script is added in both panels. The browser in the client-side panel holds state = [ H(wired.com/xyz, unique-visit, timestamp) ].]
  • 27. Counting Unique Visitors… [Diagram build. Server-side panel: a second record wired.com/xyz [137.9.10.X, 940x645] is shown. Client-side panel: "visit wired.com/xyz" and "unique-visit wired.com/xyz" are shown a second time, next to state = [ H(wired.com/xyz, unique-visit, timestamp) ].]
  • 28. Counting Unique Visitors… [Diagram build: same as slide 27, with a note.] Possible if you control the browser (i.e. CLIQZ). But also possible with the HTML5 LocalStorage and postMessage APIs.
  • 29. Counting Unique Visitors… [Diagram build. Client-side panel: for the second page load only "visit wired.com/xyz" is shown, with state = [ H(wired.com/xyz, unique-visit, timestamp) ]; the server-side panel still shows two records wired.com/xyz [137.9.10.X, 940x645].]
  • 30. Counting Unique Visitors… [Diagram build: both backends count uniques. The GA Backend holds two records wired.com/xyz [137.9.10.X, 940x645]; the CGT Backend holds "visit wired.com/xyz" twice and "unique-visit wired.com/xyz" once.]
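The client-side flow in the diagram can be sketched as follows. This is a minimal simulation under stated assumptions, not the green-tracker code: the class names `ClientTracker` and `Backend` are hypothetical, H is taken to be SHA-256, the "timestamp" inside H is assumed to be a day bucket, and the date used is made up for the example. The point it illustrates is that the de-duplication state stays in the browser, so the backend only receives messages that carry no user identifier.

```python
import hashlib
from collections import Counter


def h(url: str, kind: str, bucket: str) -> str:
    """H(url, kind, timestamp) from the slides; SHA-256 is an assumption."""
    return hashlib.sha256(f"{url}|{kind}|{bucket}".encode()).hexdigest()


class ClientTracker:
    """Hypothetical stand-in for the tracking script running in one browser."""

    def __init__(self):
        # The state never leaves the browser. The slides show a list;
        # a set is used here for convenient membership checks.
        self.state = set()

    def on_page_load(self, url: str, day: str):
        """Return the messages to send for one page load."""
        messages = [("visit", url)]
        key = h(url, "unique-visit", day)
        if key not in self.state:
            # First load in this time bucket: record it locally and
            # tell the backend about one more unique visit.
            self.state.add(key)
            messages.append(("unique-visit", url))
        return messages


class Backend:
    """Hypothetical collector: it only sums up anonymous messages."""

    def __init__(self):
        self.counts = Counter()

    def receive(self, messages):
        for kind, url in messages:
            self.counts[(kind, url)] += 1


backend = Backend()
browser = ClientTracker()
day = "2016-03-01"  # made-up time bucket for the example

backend.receive(browser.on_page_load("wired.com/xyz", day))  # first load
backend.receive(browser.on_page_load("wired.com/xyz", day))  # second load

print(backend.counts[("visit", "wired.com/xyz")])         # 2
print(backend.counts[("unique-visit", "wired.com/xyz")])  # 1
```

A second `ClientTracker` instance, standing for another user's browser, would contribute its own "unique-visit" message, so the backend's total stays correct without ever linking records to a person.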
  • 31. Beyond Counting Unique Visitors? Working prototype of a GA-clone featuring:
      –  Unique visits and page loads.
      –  Returning customers.
      –  Goal conversion to track campaigns.
      –  Cross-site correlations.
      –  In-site click-throughs.
      –  Visits and time on page per user (without beacons).
      A privacy-preserving tracking agent, green-tracker, implements all 6 of these use cases in less than 200 lines of code. Demo: http://site1.test.cliqz.com/
  • 32. Conclusions
      –  Data collection based on server-side aggregation of users’ data is very problematic, as it implies tracking users.
      –  Tracking leads to privacy side-effects; we provided evidence of privacy leaks on Google Analytics.
      –  Tracking can be avoided if one switches the design pattern to client-side aggregation.
      –  To demonstrate the feasibility of client-side aggregation, we built and open-sourced a Google Analytics look-alike, https://github.com/cliqz/green-tracker, which implements in a privacy-preserving way a wide range of use cases that would normally require tracking users.
  • 33. Q&A Thanks for your attention!
  • 34. Appendix
  • 35. Keeping State on the Client
      Modern browsers can keep state via HTML5 LocalStorage. Therefore, a privacy-preserving tracking script can keep a persistent state across multiple sites if it is loaded from an IFRAME.
      •  Looks pretty familiar, but is slightly different:
        –  LocalStorage belongs to green-tracker.fbt.co (the collector backend)
        –  Respects CORS
        –  IFRAME is sandboxed (no access to Document)
        –  Explicit control from site-owner (postMessage)
        –  Explicit control from user (messages and state can be removed and inspected at will)
  • 36. Limitations
      As always, there are limitations that one must consider:
      •  Deployment is not immediate. It requires code changes both in the tracking script and in the collectors.
      •  Unplanned use cases might not be possible retrospectively.
      •  The business logic of the data collector is explicit to the user.
      •  The state on the client can become a privacy issue if not handled properly; take care not to create a duplicate history.
      •  Browsers might have factory-default options that prevent LocalStorage from working as expected. For instance, Safari blocks 3rd party cookies, which also affects LocalStorage; the user can change the setting, but this is sub-optimal.