
Tracking The Trackers WWW 2016

Paper at: http://www2016.net/proceedings/proceedings/p121.pdf

Abstract: Online tracking poses a serious privacy challenge that has drawn significant attention in both academia and industry. Existing approaches for preventing user tracking, based on curated blocklists, suffer from limited coverage and coarse-grained resolution for classification, rely on exceptions that impact sites’ functionality and appearance, and require significant manual maintenance. In this paper we propose a novel approach, based on the concepts leveraged from k-Anonymity, in which users collectively identify unsafe data elements, which have the potential to identify uniquely an individual user, and remove them from requests. We deployed our system to 200,000 German users running the Cliqz Browser or the Cliqz Firefox extension to evaluate its efficiency and feasibility. Results indicate that our approach achieves better privacy protection than blocklists, as provided by Disconnect, while keeping the site breakage to a minimum, even lower than the community-optimized Ad-Block Plus. We also provide evidence of the prevalence and reach of trackers to over 21 million pages of 350,000 unique sites, the largest scale empirical evaluation to date. 95% of the pages visited contain 3rd party requests to potential trackers and 78% attempt to transfer unsafe data. Tracker organizations are also ranked, showing that a single organization can reach up to 42% of all page visits in Germany.


  1. 1. Tracking the Trackers Zhonghao Yu zhonghao@cliqz.com Sam Macbeth sam@cliqz.com Konark Modi konarkm@cliqz.com Josep M. Pujol josep@cliqz.com
  2. 2. Page load triggers requests to multiple 3rd parties
  3. 3. Even on pages on sites that you probably want to keep private, like this dating site.
  4. 4. Of course, general news domains also load many 3rd parties
  5. 5. as well as electronic commerce sites like Ebay
  6. 6. Twitter pages only accessible to the authenticated user also load 3rd parties like GA
  7. 7. This browsing session on 5 different sites involved more than 60 different 3rd parties.
  8. 8. GET /css?family=Open+Sans+Condensed:300,700 Host: fonts.googleapis.com User-Agent: Mozilla/5.0 ... Firefox/45.0 Referer: http://www.meetic.com/home/index.php IP: 79.227.235.241 fonts.googleapis.com is a potential tracker <meetic.com/home/index.php, UID> <www20016.ca/, UID> <wired.com/, UID> However, in THIS request, there is no data element that can be used as a UID. Since there is no unsafe data element, the request is safe.
  9. 9. GET /impression.php/f3ae074XXX/api_key=597038480XXX&lid=115… Host: www.facebook.com User-Agent: Mozilla/5.0 … Firefox/45.0 Referer: http://www.meetic.com/home/index.php Cookie: datr=0IPhVj5YHEJ20XXX; c_user=10973XXXX; … csm=2; IP: 79.227.235.241 facebook.com is a potential tracker too, <meetic.com/home/index.php, 10973XXXX> <www20016.ca/, 10973XXXX> <wired.com/, 10973XXXX> <ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 10973XXXX> Unlike the fonts.googleapis.com request, the request above is not safe with regard to privacy: it contains two values that we consider unsafe because they could be used as UIDs, c_user=10973XXXX and datr=0IPhVj5YHEJ20XXX. Because it contains at least one unsafe value, the request is considered unsafe.
  10. 10. GET /collect? v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524& ..._u=QCCAAAABI~&jid=&cid=6531474... Host: www.google-analytics.com Referer: http://www.meetic.com/home/index.php IP: 79.227.235.241 google-analytics.com is a potential tracker too, <meetic.com/home/index.php, 1291x522:79.227.235.241> <www20016.ca/, 1291x522:79.227.235.241> <wired.com/, 1291x522:79.227.235.241> <ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 1291x522:79.227.235.241> <analytics.twitter.com/user/solso/home, 1291x522:79.227.235.241> The UID is not as evident as for Facebook. But the combination vp+IP is an unsafe data element, it can be used as a UID. Therefore this request is also unsafe. vp+IP = 1291x522:79.227.235.241
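To make the point about composite identifiers concrete, here is a minimal sketch of how a 3rd party could link page visits once it logs the viewport (vp) and IP of each request. The log rows, the second user, and the function name `link_visits` are illustrative assumptions, not data or code from the talk.

```python
from collections import defaultdict

# Hypothetical tracker-side log: (referer, viewport, client IP) per request.
# The first user mirrors the slide's example; the second is made up for contrast.
LOG = [
    ("meetic.com/home/index.php", "1291x522", "79.227.235.241"),
    ("wired.com/", "1291x522", "79.227.235.241"),
    ("wired.com/", "1440x900", "203.0.113.7"),
    ("analytics.twitter.com/user/solso/home", "1291x522", "79.227.235.241"),
]


def link_visits(log):
    """Group visited pages under the composite key vp:IP."""
    history = defaultdict(list)
    for referer, vp, ip in log:
        history[f"{vp}:{ip}"].append(referer)
    return dict(history)


histories = link_visits(LOG)
for uid, pages in histories.items():
    print(uid, pages)
```

No cookie is involved: the composite key alone is enough to rebuild one user's browsing history across unrelated sites.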
  12. 12. Not a conveniently chosen example… ...tracking is a pervasive problem.
  13. 13. Tracking in the Wild Largest field study with real traffic to date, 200,000 users in Germany for a week(*) 21M page loads, 5M unique pages (URLs) from 350K domains (*) Between 09/09/2015 and 16/09/2015
  14. 14. Tracking in the Wild: Prevalence Potential trackers are 3rd parties that are present in many different domains. Unsafe data elements are data elements for which we cannot rule out the possibility that they are UIDs. [Chart: 21M page loads. Without potential trackers: 5%; with potential trackers: 95%. By count, 1 to 9: 24%; >= 10: 76%.]
  15. 15. Tracking in the Wild: Prevalence Potential trackers are 3rd parties that are present in many different domains. Unsafe data elements are data elements for which we cannot rule out the possibility that they are UIDs. [Chart: 21M page loads. Without unsafe values: 22%; with unsafe values: 78%. By count, 1 to 9: 21%; >= 10: 79%.]
  16. 16. Tracking in the Wild: Prevalence Potential trackers are 3rd parties that are loaded in many different domains. Unsafe values are data elements for which we cannot rule out the possibility that they are UIDs. [Chart: 21M page loads. Without unsafe values: 22%; with unsafe values: 78%. By count, 1 to 9: 21%; >= 10: 79%.] 78% of all page loads can be tracked.
  17. 17. Tracking in the Wild: Reach
      Organization | % of page loads seen | % of page loads seen with unsafe data elements (tracking) | Rank
      Google | 62.4% | 42.4% | 1st
      Facebook | 21.1% | 18.5% | 2nd
      AppNexus | 10.15% | 9.9% | 3rd
      ADITION | 8.7% | 8.4% | 4th
      Criteo | 8.7% | 8.2% | 5th
      …
      Comscore | 6.1% | 5.9% | --
      DoublePimp | 0.5% | 0.5% | --
      NewRelic | 2% | 0.03% | --
      …
  18. 18. Tracking in the Wild: Reach
      Organization | % of page loads seen | % of page loads seen with unsafe data elements (tracking) | Rank
      Google | 62.4% | 42.4% | 1st
      Facebook | 21.1% | 18.5% | 2nd
      AppNexus | 10.15% | 9.9% | 3rd
      ADITION | 8.7% | 8.4% | 4th
      Criteo | 8.7% | 8.2% | 5th
      …
      Comscore | 6.1% | 5.9% | --
      DoublePimp | 0.5% | 0.5% | --
      NewRelic | 2% | 0.03% | --
      …
      58 organizations with a reach larger than 1%
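The two reach columns above can be read as "share of page loads on which the organization was seen" and "share on which it was seen with at least one unsafe data element". A minimal sketch of that computation follows. The records, the total, and the function name `reach` are made-up illustrations, not the study's data or code.

```python
from collections import defaultdict

# Hypothetical records: (page_load_id, organization, request_had_unsafe_element).
REQUESTS = [
    (1, "Google", True),
    (1, "Facebook", True),
    (2, "Google", False),
    (3, "Google", True),
    (4, "NewRelic", False),
]
TOTAL_PAGE_LOADS = 4


def reach(requests, total):
    """Return {org: (% page loads seen, % page loads seen with unsafe data)}."""
    seen = defaultdict(set)
    tracked = defaultdict(set)
    for load_id, org, unsafe in requests:
        seen[org].add(load_id)
        if unsafe:
            tracked[org].add(load_id)
    return {
        org: (100 * len(seen[org]) / total, 100 * len(tracked[org]) / total)
        for org in seen
    }


result = reach(REQUESTS, TOTAL_PAGE_LOADS)
```

An organization such as NewRelic in the table, present on many pages but rarely sending unsafe data, shows a large gap between the two columns.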
  19. 19. CLIQZ Tracking Protection Maximize coverage, minimize false positives
  20. 20. CLIQZ Tracking Protection Maximize coverage, minimize false positives Aggressiveness is counter-productive… •  increases site breakage, which forces users to add exceptions, thus reducing protection coverage. •  affects legitimate services and data collection
  21. 21. Block only the Ability to Track GET /collect? v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524& ..._u=QCCAAAABI~&jid=&cid=6531474... Host: www.google-analytics.com Referer: http://www.meetic.com/home/index.php IP: 79.227.235.241 Intervening only on unsafe data elements (those that can be used as UIDs) should protect the user while minimizing side effects: a)  site breakage for users b)  impact on legitimate data collection for 3rd parties
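As a rough illustration of intervening only on unsafe data elements, the sketch below drops selected query-string parameters and leaves the rest of the request intact. The URL, the key names, and the function `strip_unsafe` are hypothetical; deciding which keys are unsafe is a separate step and is simply passed in here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_unsafe(url, unsafe_keys):
    """Return url with the query parameters named in unsafe_keys removed."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in unsafe_keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical request; "vp" and "uid" stand in for elements judged unsafe.
url = "http://tracker.example/collect?v=1&t=event&vp=1291x524&uid=abc123"
cleaned = strip_unsafe(url, {"vp", "uid"})
print(cleaned)
```

The 3rd party still receives the event, so aggregate data collection keeps working, but the request no longer carries the candidate UIDs.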
  22. 22. Blocklists are coarse-grained CDF of the number of requests with observed unsafe data elements by 3rd party domains contained both in Disconnect Blocklist and CLIQZ list of potential trackers (~2000 domains each). Intersection is 477 domains.
  23. 23. Blocklists are coarse-grained CDF of the number of requests with observed unsafe data elements by 3rd party domains contained both in Disconnect Blocklist and CLIQZ list of potential trackers (~2000 domains each). Intersection is 477 domains. Only 2% of tracker domains in Disconnect always send unsafe data elements.
  24. 24. Blocklists are coarse-grained CDF of the number of requests with observed unsafe data elements by 3rd party domains contained both in Disconnect Blocklist and CLIQZ list of potential trackers (~2000 domains each). Intersection is 477 domains. Only 2% of tracker domains in Disconnect always send unsafe data elements. 98% of tracker domains have a MIXED behavior. Lack of resolution…
  25. 25. Blocklists are coarse-grained Blocklists by domain (reverse suffix) are too coarse-grained. BLOCKLIST by Domain
  26. 26. Blocklists are too coarse-grained EasyPrivacy (from Adblock Plus) has hundreds of regular expressions to cover for mixed behavior of trackers. BLOCKLIST by Domain + RegExp Exceptions
  27. 27. Blocklists are too coarse-grained BLOCKLIST by Domain + More RegExp Exceptions EasyPrivacy (from Adblock Plus) has hundreds of regular expressions to cover for mixed behavior of trackers.
  28. 28. We propose a more fine-grained approach to algorithmically determine the safeness level of individual data elements within a request to a 3rd party
  29. 29. Determining Safeness Each 3rd party request to a potential tracker is parsed to obtain a list of tuples T = [<s, d, k, v>] whose safeness level is evaluated in real-time, T = [ <s=wired.com/, d=3rdparty.com, k=z, v=1501498154>, <s=wired.com/, d=3rdparty.com, k=fl,v=21.0>, <s=wired.com/, d=3rdparty.com, k=u, v=CCAAAABI>, <s=wired.com/, d=3rdparty.com, k=vr,v=1440x900>, <s=wired.com/, d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=wired.com/, d=3rdparty.com, k=vp,v=1322x781>, <s=wired.com/, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] The aim is to identify which data elements (including combinations) are unsafe, and therefore, they are candidates to be used as UIDs.
  30. 30. Determining Safeness Each 3rd party request to a potential tracker is parsed to obtain a list of tuples T = [<s, d, k, v>] whose safeness level is evaluated in real-time, T = [ <s=wired.com/, d=3rdparty.com, k=z, v=1501498154>, <s=wired.com/, d=3rdparty.com, k=fl,v=21.0>, <s=wired.com/, d=3rdparty.com, k=u, v=CCAAAABI>, <s=wired.com/, d=3rdparty.com, k=vr,v=1440x900>, <s=wired.com/, d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=wired.com/, d=3rdparty.com, k=vp,v=1322x781>, <s=wired.com/, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] The aim is to identify which data elements (including combinations) are unsafe, and therefore, they are candidates to be used as UIDs. We cannot do this effectively. But we can do the opposite, identify data elements that cannot be used effectively as UIDs, and consider them safe.
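A minimal sketch of the parsing step, assuming (from the example above) that s is the source page, d the 3rd-party domain, k a query-string key and v its value. The class name, the function name and the request URL are illustrative, and using the raw hostname for d is a simplification.

```python
from typing import NamedTuple
from urllib.parse import parse_qsl, urlsplit


class DataElement(NamedTuple):
    s: str  # source page that triggered the request
    d: str  # 3rd-party host receiving the request
    k: str  # query-string key
    v: str  # query-string value


def parse_request(source_page, request_url):
    """Split a 3rd-party request into one <s, d, k, v> tuple per parameter."""
    parts = urlsplit(request_url)
    return [
        DataElement(source_page, parts.hostname, k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
    ]


# Hypothetical request carrying three of the slide's example values.
T = parse_request(
    "wired.com/",
    "http://3rdparty.com/pixel?z=1501498154&fl=21.0&vr=1440x900",
)
```

Cookies and POST bodies would need the same treatment; only the query string is handled in this sketch.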
  31. 31. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] All tuples are UNSAFE by default unless we can determine that the given data-element is not a good UID, hence safe.
  32. 32. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] The value 1501498154 has never been seen before for <d, k>. Thus, cannot be used as UID => SAFE
  33. 33. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] The value 21.0 is too short to encode any UID => SAFE
  34. 34. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] More than 3 different values in less than 2 days by the same tuple <d,k>. Not persistent, bad UID => SAFE
  35. 35. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] Always the same value for <d,k>. We cannot rule out that the data-elements are UIDs => keep as UNSAFE. Only using local information is not enough; vr=1440x1024 is not a UID… We need something extra.
  36. 36. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] Locally UNSAFE, i.e. always the same value for <d, k>. Globally SAFE since more than 20 other users have observed the same value 1440x900 for tuple <d,k> = <3rdparty.com,vr> in the last 2 days.
  37. 37. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] Locally UNSAFE, i.e. always the same value for tuple <d,k>. Globally SAFE since it has reached the safeness-quorum based on k-Anonymity.
  38. 38. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] Locally UNSAFE, i.e. always the same value for <d, k>. Globally UNSAFE: not enough people have seen the value for <d, k>, always same <d, k, u>. Not safe to send. Two options: a)  it is a UID, or an element that could be used as such. b)  a false positive due to the Transient State (0.07%)
  39. 39. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x900>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x781>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] Locally UNSAFE and Globally UNSAFE At this point the request analysis is complete: 1)  ALLOW Request removing unsafe data-elements 2)  ALLOW Request obfuscating unsafe data-elements 3)  BLOCK Request or ALLOW Request without alteration
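The rules walked through on slides 31 to 39 can be sketched as a single decision function. This is a simplified reading of the slides, not the deployed implementation: the length threshold `MIN_UID_LENGTH`, the class `LocalHistory`, the function names and the `global_count` callback are assumptions, while the "more than 3 values", "2 days" and "more than 20 other users" figures come from the slides.

```python
from collections import defaultdict

MIN_UID_LENGTH = 8               # assumed; the slides only say "too short"
MAX_DISTINCT_VALUES = 3          # "more than 3 different values"
WINDOW_SECONDS = 2 * 24 * 3600   # "in less than 2 days"
QUORUM = 20                      # "more than 20 other users"


class LocalHistory:
    """Values this browser has seen per <d, k>, with last-seen timestamps."""

    def __init__(self):
        self._seen = defaultdict(dict)  # (d, k) -> {value: last_seen_seconds}

    def observe(self, d, k, v, now):
        self._seen[(d, k)][v] = now

    def known(self, d, k, v):
        return v in self._seen.get((d, k), {})

    def recent_count(self, d, k, now):
        values = self._seen.get((d, k), {})
        return sum(1 for ts in values.values() if now - ts < WINDOW_SECONDS)


def is_safe(d, k, v, history, global_count, now):
    """Unsafe by default; return True only if a rule shows v is a poor UID."""
    if len(v) < MIN_UID_LENGTH:
        return True   # too short to encode a UID
    if not history.known(d, k, v):
        return True   # never seen before for <d, k>
    if history.recent_count(d, k, now) > MAX_DISTINCT_VALUES:
        return True   # not persistent for <d, k>
    # Locally unsafe: fall back to the global safeness quorum.
    return global_count(d, k, v) > QUORUM
```

A caller would evaluate each tuple with `is_safe` before recording it via `observe`, then remove, obfuscate or block according to the options listed on slide 39.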
  40. 40. Safeness Quorum without Tracking To determine that a data-element is globally safe we need to count the number of unique users that have observed a tuple <d,k,v> e.g. <d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec> Users could share tuples with a field that identifies them (u), <u=usrXXX, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec> with CLIQZ. But that would make CLIQZ a tracker! Instead, each user sends the tuple – if observed – once and only once per hour: <d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec> Actual values are not needed; counting and membership test on GWL <d=ed5c0cf7b05572eb, k=4d3a21d8c684c09c19b93be911827fd5, v=e60f936dc719ca649a80a97490a09940>
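A minimal sketch of the reporting scheme described above: each client hashes the tuple, sends it without any user identifier, and sends it at most once per hour. The hash function (SHA-256 here), the class names and the in-memory counter are assumptions; the slides do not name the hash or describe the server side.

```python
import hashlib
from collections import Counter


def digest(text):
    """Hash a field so the collector never sees the raw value (assumed SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Client:
    def __init__(self):
        self._last_sent = {}  # hashed tuple -> hour index of the last report

    def report(self, d, k, v, now_seconds):
        """Return the message to send, or None if already sent this hour."""
        message = (digest(d), digest(k), digest(v))
        hour = now_seconds // 3600
        if self._last_sent.get(message) == hour:
            return None
        self._last_sent[message] = hour
        return message


class Collector:
    def __init__(self):
        self.counts = Counter()  # hashed <d, k, v> -> number of reports

    def receive(self, message):
        if message is not None:
            self.counts[message] += 1
```

Because messages carry no user field, the collector can only count reports. The once-per-hour limit is what keeps that count close to the number of distinct users within an hour; how reports are combined across hours is not specified in the slides and is left out here.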
  41. 41. Evaluation: Protection Coverage
      System | Requests blocked | False positives ratio (requests blocked without unsafe data-elements) | Protection misses (requests allowed with unsafe data-elements)
      CLIQZ | 51.7% | -- | --
      Disconnect | 66.1% | 38.8% | 12.3%
      Kontaxis & Chew (Firefox Tracking Protection) [est.] | 36.6% | 29.4% | 25.4%
  42. 42. Evaluation: Site Breakage
      Configuration | Reload rate | % increase over baseline | % increase over CLIQZ
      BASELINE (without tracking protection) | 0.00101 | -- | --
      CLIQZ | 0.00104 | 4% | --
      Adblock Plus (counting exceptions added by users) | 0.00110 | 10% | 150%
      CLIQZ as Blocklist | 0.00125 | 25% | 525%
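The "% increase over CLIQZ" figures on the site-breakage slide appear to be the relative difference between each configuration's increase over baseline and CLIQZ's 4%. That reading is an inference from the numbers, not something the slide states. A quick check:

```python
def relative_increase(value, reference):
    """Percent by which value exceeds reference."""
    return 100 * (value - reference) / reference


CLIQZ_INCREASE = 4  # % increase over baseline reported for CLIQZ

adblock_plus = relative_increase(10, CLIQZ_INCREASE)
cliqz_as_blocklist = relative_increase(25, CLIQZ_INCREASE)
print(adblock_plus, cliqz_as_blocklist)
```

Both results match the slide's 150% and 525%.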
  43. 43. Conclusions Tracking is a BIG problem –  Privacy is seriously at risk Tracking Protection is not an easy task –  Trade-off between site breakage and protection coverage Blocklist-based approaches have limitations –  Maintainability –  Coarse-grained resolution –  Too many false positives CLIQZ tracking protection addresses them to a large extent
  44. 44. Future Work CLIQZ tracking protection might be better than the state-of-the-art. But it is far from perfect: •  it still produces site breakage •  protection coverage is not 100% •  it can be attacked in multiple ways [Picture from http://mtthwhgn.com/tag/flooding/] We provide a bigger hammer for the whack-a-tracker.
  45. 45. Thanks a lot! Q&A Zhonghao Yu Sam Macbeth Konark Modi
  46. 46. Appendix
  47. 47. Implementation Details Realtime Component 1) Parsing request 2) Local safeness: membership test on LWL 3) Global safeness: membership test on GWL LWL and GWL are Bloom Filters, with a combined size under 512KB and an FP ratio of 0.1%. Takes about 1-12 ms. Offline Component Data from users needs to be sent to CLIQZ to build GWL for the safeness quorum. GWL needs to be sent back to the users’ browsers. We use an eventual consistency model with incremental updates over daily snapshots. Bandwidth costs per user per day: 90KB upload, 566KB download. For a worst-case propagation lag of 10 minutes. False-positive unsafe data elements due to transient state: 0.07%
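For readers unfamiliar with Bloom filters, here is a minimal sketch of the kind of membership test the slide refers to. The sizing, the number of hash functions, the SHA-256-based hashing and the `d|k|v` key format are illustrative assumptions, not the parameters used by CLIQZ.

```python
import hashlib


class BloomFilter:
    """Compact set with no false negatives and a tunable false-positive rate."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            raw = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(raw[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical 1KB filter holding one tuple that reached the quorum.
gwl = BloomFilter(size_bits=8 * 1024, num_hashes=7)
gwl.add("3rdparty.com|vr|1440x900")
print("3rdparty.com|vr|1440x900" in gwl)
```

The filter stores only bits, never the values themselves, which fits the slide's point that counting and membership tests do not need the actual values. A lookup can return a false positive but never a false negative.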
  48. 48. Determining Safeness T = [ <s=w..., d=3rdparty.com, k=z, v=1501498154>, <s=w..., d=3rdparty.com, k=fl,v=21.0>, <s=w..., d=3rdparty.com, k=u, v=CCAAAABI>, <s=w..., d=3rdparty.com, k=vr,v=1440x1024>, <s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>, <s=w..., d=3rdparty.com, k=vp,v=1322x981>, <s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>, ] Cookies from potential trackers are always blocked. POST requests are also analyzed, and blocked only if they: •  match Cookie values •  match QS values declared unsafe •  match values from browser fingerprinting User initiated actions are always ALLOWED (even if tracking)
  49. 49. Protection Coverage
  50. 50. Unsafe Data Origins
