Data Collection without Privacy Side Effects

Presented at WWW BIG 2016. Paper available at: http://josepmpujol.net/public/papers/big_green_tracker.pdf

Abstract: The standard approach to collect users’ activity data on the Web relies on server-side processing. This approach requires the presence of user-identifiers in order to aggregate data in sessions, which leads to tracking. Server-side aggregation is bound to produce side-effects because the scope of sessions cannot be safely limited to a particular use-case. We provide several examples of such side-effects.
To preserve privacy we propose an alternative approach based on client-side aggregation, where user-identifiers are not needed because sessions only exist on the client-side (i.e. the user’s browser). We demonstrate the feasibility of this approach by providing an implementation of a tracking agent – green-tracker – able to gather the data needed to power a service functionally equivalent to Google Analytics.
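
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal simulation, not the green-tracker code: the record values are taken from the slides, while the `Browser` class, the helper names, and the choice to truncate the timestamp inside `H(...)` to a day are assumptions made for illustration.

```python
import hashlib

# Records as a server-side backend would receive them (values from the slides):
# (url, time, (ip, viewport)). The (ip, viewport) pair acts as a de facto user ID.
SERVER_RECORDS = [
    ("wired.com/xyz", "09:48:40", ("82.143.2.X", "1320x910")),
    ("wired.com/xyz", "09:48:42", ("137.9.10.X", "1266x809")),
    ("wired.com/xyz", "09:48:59", ("137.9.10.X", "940x645")),
    ("wired.com/xyz", "09:49:12", ("137.9.10.X", "940x645")),
]


def count_server_side(records):
    """Return (visits, unique visitors); needs a per-user identifier on the server."""
    return len(records), len({uid for _url, _time, uid in records})


def h(url, event, day):
    """Hypothetical stand-in for H(url, event, timestamp); timestamp truncated to a day."""
    return hashlib.sha256(f"{url}|{event}|{day}".encode("utf-8")).hexdigest()


class Browser:
    """Simulates the client: session state never leaves this object."""

    def __init__(self):
        self.state = set()

    def page_load(self, url, day):
        """Return the identifier-free messages sent to the collector for one page load."""
        messages = [("visit", url)]
        key = h(url, "unique-visit", day)
        if key not in self.state:  # first load of this page in this period
            self.state.add(key)
            messages.append(("unique-visit", url))
        return messages


def count_client_side(messages):
    """The collector only counts message types; no field links a message to a user."""
    visits = sum(1 for event, _url in messages if event == "visit")
    uniques = sum(1 for event, _url in messages if event == "unique-visit")
    return visits, uniques


# Three browsers, one of which loads the page twice on the same day.
a, b, c = Browser(), Browser(), Browser()
inbox = []
for browser in (a, b, c, c):
    inbox.extend(browser.page_load("wired.com/xyz", day=0))

print(count_server_side(SERVER_RECORDS))  # (4, 3)
print(count_client_side(inbox))           # (4, 3)
```

Both paths arrive at 4 visits and 3 unique visitors, but only the server-side path has to hold a value that ties records to a person.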

Published in: Science

Data Collection without Privacy Side Effects

  1. CLIQZ @ BIG 2016… Data Collection without Privacy Side-Effects. Konark Modi (@konarkmodi), Josep M. Pujol (@solso)
  2. Data collection on Big Data. Where does the data of Big Data come from? The elephant in the room. Applications of Big Data.
  3. Who collects data on the Web? Wired, Ebay and Meetic collect data as 1st parties as a user visits/interacts with their sites. However, there are a lot of 3rd parties that also collect data. From CLIQZ’s paper “Tracking the Trackers”, to be presented at WWW 2016 [1]: 78% of page loads send information to at least one 3rd party that is deemed unsafe with respect to privacy.
  4. Motivation: A recurring real-life conversation
     Company X: Hi, this is company X. CLIQZ anti-tracking is affecting us. Can we talk?
     CLIQZ: Sure.
     Company X: We are not trackers. We only measure audiences (or collect aggregated data, or measure goal conversion or site performance metrics). We take privacy very seriously.
     CLIQZ: Understood, let us check what’s going on.
     Company X: Thanks a lot.
     CLIQZ: Well, you are actually tracking users. See the attachment. You have the ability to know that these 20 webpages were visited by the same person and, to make things worse, you can derive his real identity. Users’ privacy is at risk.
     Company X: No, no. We do NOT use that information at all, we remove it as soon as it is received. We are only interested in measuring XYZ.
     CLIQZ: But we just showed you an example of tracking. Whether it is intentional or not should not matter, right?
     Company X: I repeat that we are NOT using this data at all for anything, see our Privacy Policy. To implement our service we require that data element that can be used as user identifier, there is no other way…
     CLIQZ: There is another way. Happy to show you …
  5. Motivation: A recurring real-life conversation (continued)
     CLIQZ: There is another way. Happy to show you …
     Company X: ...mmm, thanks… ...er... ...we will get back to you...
     CLIQZ: Great! Looking forward to it.
     Unfortunately, they never come back :(. We formulated 3 hypotheses:
     1) They were interested in collecting data from users. They are “intentionally” tracking.
     2) They are not concerned about privacy side-effects. On the trade-off between privacy and convenience, they chose the latter.
     3) We could not successfully explain our alternative approach for privacy-preserving data collection.
  6. Motivation: A recurring real-life conversation (continued). We hope that it is not #1; that’s why we decided:
     • To open-source a prototype of a Google Analytics look-alike that does not rely on tracking, hoping that the code will be more explanatory.
     • To write this paper and presentation.
  7. An Example of Unintentional Tracking: Google Analytics (GA)
     • GA is massive, present in 44% of all page loads.
     • GA does not offer any (public) service that requires building a session with all of a user’s activity.
     • GA actually cares a lot about privacy:
       – Ephemeral UIDs
       – Sanitization of URLs
  8–13. Privacy Breaches are Unavoidable (even for GA). The slides build up, one record at a time, what GA receives as one user browses:
     wired.com/                                               09:49:12  [137.9.10.X, 1140x645]
     ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200  09:50:02  [137.9.10.X, 1140x645]
     twitter.com/solso                                        09:52:10  [137.9.10.X, 1140x645]
     www.meetic.com/home/index.php                            09:59:01  [137.9.10.X, 1140x645]
     analytics.twitter.com/user/solso/home                    10:05:45  [137.9.10.X, 1140x645]
     The last page is only accessible after login and it contains my username => Personally Identifiable Information (PII) leak. The request sent for that page: IP: 137.9.10.XX https://www.google-analytics.com/collect? … dl=https%3A%2F%2Fanalytics.twitter.com%2Fuser%2Fsolso%2Fhome& ... &vp=1140x645&...
  14. Example: Counting Unique Visitors. The GA backend receives:
     wired.com/xyz  09:48:40  82.143.2.X
     wired.com/xyz  09:48:42  137.9.10.X
     wired.com/xyz  09:48:59  137.9.10.X
     wired.com/xyz  09:49:12  137.9.10.X
     4 people visited wired.com/xyz? 1 person visited wired.com/xyz 4 times? How can it be resolved?
  15. Example: Counting Unique Visitors (continued). The GA backend has to identify which records come from the same person to avoid over-counting. A UID is needed:
     wired.com/xyz  09:48:40  [82.143.2.X, 1320x910]
     wired.com/xyz  09:48:42  [137.9.10.X, 1266x809]
     wired.com/xyz  09:48:59  [137.9.10.X, 940x645]
     wired.com/xyz  09:49:12  [137.9.10.X, 940x645]
     => 4 visits, 3 unique visitors
  16. Example: Counting Unique Visitors (continued). Three panels side by side: (a) the four wired.com/xyz records with the identifier replaced by ---, next to the open question "4 people, or 1 person 4 times?"; (b) the same records with [IP, viewport], giving 4 visits, 3 unique visitors; (c) the records from the privacy-breach example, where the same [137.9.10.X, 1140x645] pair appears on wired.com/, ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, twitter.com/solso, www.meetic.com/home/index.php and analytics.twitter.com/user/solso/home.
  17. As long as aggregation of data per user on the server-side is needed, we will always incur undesired privacy side-effects.
  18. Since server-side aggregation is the root of the problem, we should move the aggregation of data to the client-side (i.e. the user’s browser).
  19–30. Counting Unique Visitors… (diagram built up over twelve slides). Two pipelines are compared for a Browser that loads wired.com/xyz, which embeds a 3rd party tracking script:
     • Server-side Aggregation – Google Analytics: on every load the script sends wired.com/xyz [137.9.10.X, 940x645] to the GA Backend, which then runs Count Uniques.
     • Client-side Aggregation – CLIQZ Green Tracker: the script keeps its state in the Browser. On the first load, state = [], so it sends two messages to the CGT Backend, visit wired.com/xyz and unique-visit wired.com/xyz, and the state becomes state = [ H(wired.com/xyz, unique-visit, timestamp)]. On a repeated load the state already holds that entry, so only visit wired.com/xyz is sent. The CGT Backend then runs Count Uniques on the messages.
     Possible if you control the browser (i.e. CLIQZ). But also possible with HTML5 LocalStorage and PostMessage APIs.
  31. Beyond Counting Unique Visitors? Working prototype of a GA clone featuring:
     – Unique visits and page loads.
     – Returning customers.
     – Goal conversion to track campaigns.
     – Cross-site correlations.
     – In-site click-throughs.
     – Visits and time in page per user (without beacons).
     A privacy-preserving tracking agent, green-tracker, implements all 6 of these use-cases in less than 200 lines of code. Demo: http://site1.test.cliqz.com/
  32. Conclusions. Data collection based on server-side aggregation of users’ data is very problematic as it implies tracking users. Tracking leads to privacy side-effects; we provided evidence of privacy leaks on Google Analytics. Tracking can be avoided if one switches the design pattern to client-side aggregation. To demonstrate the feasibility of client-side aggregation we built and open-sourced a Google Analytics look-alike, https://github.com/cliqz/green-tracker, which implements in a privacy-preserving way a wide range of use-cases that would otherwise require tracking users.
  33. Q&A. Thanks for your attention!
  34. Appendix
  35. Keeping State on the Client. Modern browsers have the ability to keep state via HTML5 LocalStorage. Therefore, a privacy-preserving tracking script can keep a persistent state across multiple sites if loaded from an IFRAME.
     • Looks pretty familiar, but is slightly different:
       – LocalStorage belongs to green-tracker.fbt.co (the collector backend)
       – Respects CORS
       – IFRAME is sandboxed (no access to Document)
       – Explicit control from site-owner (postMessage)
       – Explicit control from user (messages and state can be removed and inspected at will)
  36. Limitations. As always, there are limitations that one must consider:
     • Deployment is not immediate. It requires code changes both in the tracking script and in the collectors.
     • Unplanned use-cases might not be possible retrospectively.
     • Business logic of the data collector is explicit to the user.
     • The state of the client can become a privacy issue if not handled properly; care must be taken not to create a duplicated history.
     • Browsers might have factory-default options that prevent LocalStorage from working as expected. For instance, Safari blocks 3rd party cookies, which also affects LocalStorage; the user can change the setting, but this is sub-optimal.
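
The points about client state on the last two slides (the user can inspect and remove it, and it must not turn into a second browsing history) can be sketched as follows. This is an illustrative in-memory stand-in for LocalStorage, not the green-tracker implementation: `ClientState`, `entry_key`, and the 30-day `RETENTION_DAYS` window are hypothetical names and values chosen for the example.

```python
import hashlib

RETENTION_DAYS = 30  # hypothetical retention window, not taken from the slides


def entry_key(url, event, day):
    """Hash of (url, event, day); the raw URL is never stored in the state."""
    return hashlib.sha256(f"{url}|{event}|{day}".encode("utf-8")).hexdigest()


class ClientState:
    """In-memory stand-in for the tracker's LocalStorage (assumed layout)."""

    def __init__(self):
        self._entries = {}  # hash -> day on which it was recorded

    def seen(self, url, event, day):
        """Return True if this event was already recorded; record it otherwise."""
        key = entry_key(url, event, day)
        if key in self._entries:
            return True
        self._entries[key] = day
        return False

    def prune(self, today):
        """Drop entries at or beyond the retention window so the state stays small."""
        cutoff = today - RETENTION_DAYS
        self._entries = {k: d for k, d in self._entries.items() if d > cutoff}

    def inspect(self):
        """What the user would see when inspecting the state: hashes only."""
        return sorted(self._entries)

    def clear(self):
        """Explicit user control: remove all state at will."""
        self._entries.clear()


state = ClientState()
first = state.seen("wired.com/xyz", "unique-visit", day=1)   # not seen before
second = state.seen("wired.com/xyz", "unique-visit", day=1)  # already recorded
state.prune(today=1 + RETENTION_DAYS)                        # entry has expired
```

Hashing alone does not hide a URL from someone who can guess it, so in this sketch the retention window does more to limit the stored history than the hash does.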
