
Privacy-preserving Data Mining in Industry (WWW 2019 Tutorial)

Preserving privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we will first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and evolution of privacy techniques leading to differential privacy definition / techniques. Then, we will focus on the application of privacy-preserving data mining techniques in practice, by presenting case studies such as Apple's differential privacy deployment for iOS / macOS, Google's RAPPOR, LinkedIn Salary, and Microsoft's differential privacy deployment for collecting Windows telemetry. We will conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.


  1. 1. Privacy-preserving Data Mining in Industry WWW 2019 Tutorial May 2019 Krishnaram Kenthapadi (AI @ LinkedIn) Ilya Mironov (Google AI) Abhradeep Thakurta (UC Santa Cruz) https://sites.google.com/view/www19-privacy-tutorial
  2. 2. Fairness Privacy Transparency Explainability
  3. 3. Fairness Privacy Transparency Explainability Related WWW’19 sessions: 1.Tutorial: Designing Equitable Algorithms for the Web 2.Tutorial: Economic Theories of Distributive Justice for Fair Machine Learning 3.Tutorial: Socially Responsible NLP 4.Tutorial: Fairness-aware Machine Learning in Practice 5.Tutorial: Explainable Recommendation and Search 6.Workshop: FATE and Society on the Web 7.Session: Fairness, Credibility, and Search (Wednesday, 10:30 – 12:30) 8.Session: Privacy and Trust (Wednesday, 16:00 – 17:30) 9.Special Track: Designing an Ethical Web (Friday)
  4. 4. “Privacy by Design” for AI products
  5. 5. Outline / Learning Outcomes • Privacy breaches and lessons learned • Evolution of privacy techniques • Differential privacy: definition and techniques • Privacy techniques in practice: Challenges and Lessons Learned • Google’s RAPPOR • Apple’s differential privacy deployment for iOS • Privacy in AI @ LinkedIn (Analytics framework & LinkedIn Salary) • Key Takeaways
  6. 6. Privacy: A Historical Perspective Evolution of Privacy Techniques and Privacy Breaches
  7. 7. Privacy Breaches and Lessons Learned Attacks on privacy •Governor of Massachusetts •AOL •Netflix •Web browsing data •Facebook •Amazon •Genetic data
  8. 8. William Weld vs Latanya Sweeney. Massachusetts Group Insurance Commission (1997): anonymized medical history of state employees (all hospital visits, diagnoses, prescriptions). Governor William Weld: born July 31, 1945, resident of ZIP code 02138. Latanya Sweeney (MIT grad student): $20 – Cambridge voter roll
  9. 9. 64% uniquely identifiable with ZIP + birth date + gender (in the US population). Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, 2006
  10. 10. Attacker's Advantage Auxiliary information
  11. 11. AOL Data Release. August 4, 2006: AOL Research publishes anonymized search logs of 650,000 users. August 9: The New York Times re-identifies searcher No. 4417749 as Thelma Arnold.
  12. 12. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs
  13. 13. Netflix Prize
  14. 14. Netflix Prize. Oct 2006: Netflix announces the Netflix Prize • 10% of their users • average 200 ratings per user. Narayanan, Shmatikov (2006)
  15. 15. Deanonymizing Netflix Data. Narayanan, Shmatikov, Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset), 2008
  16. 16. ● Noam Chomsky in Our Times ● Fahrenheit 9/11 ● Jesus of Nazareth ● Queer as Folk
  17. 17. De-anonymizing Web Browsing Data with Social Networks. Key idea: ● Similar intuition as the attack on medical records ● Medical records: each person can be identified based on a combination of a few attributes ● Web browsing history: browsing history is unique for each person ● Each person has a distinctive social network ⇒ the set of links appearing in one’s feed is unique ● Users visit links in their feed with higher probability than a random user ● “Browsing histories contain tell-tale marks of identity” Su et al., De-anonymizing Web Browsing Data with Social Networks, 2017
  18. 18. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality
  19. 19. Privacy Attacks on Ad Targeting. Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM 2010
  20. 20. Facebook vs Korolova. 10 campaigns targeting 1 person (zip code, gender, workplace, alma mater). Age: 21, 22, 23, …, 30; Ad impressions in a week: 0, 0, 8, …, 0. Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM 2010
  21. 21. Facebook vs Korolova. 10 campaigns targeting 1 person (zip code, gender, workplace, alma mater). Interest: A, B, C, …, Z; Ad impressions in a week: 0, 0, 8, …, 0. Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM 2010
  22. 22. ● Context: Microtargeted Ads ● Takeaway: Attackers can instrument ad campaigns to identify individual users. ● Two types of attacks: ○ Inference from Impressions ○ Inference from Clicks Facebook vs Korolova: Recap
  23. 23. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality Active
  24. 24. Attacking Amazon.com. Items frequently bought together. Bought: A B C D E. Z: “Customers Who Bought This Item Also Bought” A C D E. Calandrino, Kilzer, Narayanan, Felten, Shmatikov, “You Might Also Like: Privacy Risks of Collaborative Filtering”, 2011
  25. 25. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality Active Observant
  26. 26. Genetic data. Homer et al., “Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays”, PLoS Genetics, 2008
  27. 27. Reference population Bayesian Analysis
  28. 28. “In all mixtures, the identification of the presence of a person’s genomic DNA was possible.”
  29. 29. Zerhouni, NIH Director: “As a result, the NIH has removed from open-access databases the aggregate results (including P values and genotype counts) for all the GWAS that had been available on NIH sites” … one week later
  30. 30. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality Active Observant Clever
  31. 31. Negative Results
  32. 32. Dinur-Nissim. [figure: a database of n bits (0 1 1 0 1 0 0 0 1 1 0 1); each query returns a noisy subset sum Σ] Dinur-Nissim 2003: If the error is o(√n), then reconstruction of the database is possible up to n − o(n) bits ... even if 23.9% of errors are arbitrary [DMT07] ... even with O(n) queries [DY08]
  33. 33. Dwork-Naor. Tore Dalenius’ desideratum (a.k.a. “semantic security”): “Access to a statistical database should not enable one to learn anything about an individual that could not be learned without access.” (1977) Dwork-Naor (~2006): If the database teaches us anything, there is always some auxiliary information that breaks the Dalenius desideratum.
  34. 34. Differential Privacy
  35. 35. Curator Defining Privacy
  36. 36. Curator Defining Privacy: Fool's Errand
  37. 37. Defining Privacy. [figure: two curators, one over the database with your data (+ your data), one over the database without your data (− your data)]
  38. 38. Differential Privacy. Databases D and D′ are neighbors if they differ in one person’s data. Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Dwork, McSherry, Nissim, Smith [TCC 2006]
  39. 39. Differential Privacy. ε-Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Parameter ε quantifies information leakage: ∀S: Pr[M(D) ∈ S] ≤ exp(ε) ∙ Pr[M(D′) ∈ S]. Dwork, McSherry, Nissim, Smith [TCC 2006]
  40. 40. Differential Privacy. (ε, δ)-Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Parameter ε quantifies information leakage, parameter δ gives some slack: ∀S: Pr[M(D) ∈ S] ≤ exp(ε) ∙ Pr[M(D′) ∈ S] + δ. Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006]
  41. 41. “Bad Outcomes” Interpretation. [figure: output distributions f(D) and f(D′); legend: bad outcomes; probability with record x; probability without record x]
  42. 42. Bayesian Interpretation. ● Prior on databases p ● Observed output O ● Does the database contain record x?
  43. 43. Differential Privacy ● Robustness to auxiliary data ● Post-processing: If M(D) is differentially private, so is f(M(D)). ● Composability: Run two ε-DP mechanisms. Full interaction is 2ε-DP. ● Group privacy: Graceful degradation in the presence of correlated inputs.
  44. 44. Differential Privacy: Laplace Mechanism. If the ℓ1-sensitivity of f: D → ℝⁿ, max_{D,D′} ||f(D) − f(D′)||₁ < 1, then the Laplace mechanism f(D) + Laplaceⁿ(1/ε) offers ε-differential privacy. Dwork, McSherry, Nissim, Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, TCC 2006
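As an illustration of the slide above, here is a minimal Python sketch of the Laplace mechanism (not from the tutorial; the function name, parameters, and example histogram are ours):

```python
import numpy as np

def laplace_mechanism(true_counts, sensitivity=1.0, epsilon=0.5, rng=None):
    """Release f(D) with eps-DP by adding Laplace(sensitivity/epsilon) noise
    per coordinate, assuming the l1-sensitivity of f is at most `sensitivity`."""
    rng = rng or np.random.default_rng()
    noise = rng.laplace(0.0, sensitivity / epsilon, size=np.shape(true_counts))
    return np.asarray(true_counts, dtype=float) + noise

# Example: a histogram query where one person changes at most one bucket by 1,
# so the l1-sensitivity is 1.
print(laplace_mechanism([120, 45, 7], sensitivity=1.0, epsilon=0.5))
```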
  45. 45. Differential Privacy: Gaussian Mechanism. If the ℓ2-sensitivity of f: D → ℝⁿ, max_{D,D′} ||f(D) − f(D′)||₂ < 1, then the Gaussian mechanism f(D) + Nⁿ(0, σ²) offers (ε, δ)-differential privacy, where δ = (4/5)·exp(−(εσ)²/2). Dwork, Kenthapadi, McSherry, Mironov, Naor, “Our Data, Ourselves”, Eurocrypt 2006
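A matching sketch for the Gaussian mechanism, again with illustrative parameter names; the (ε, δ) accounting follows the formula on the slide:

```python
import numpy as np

def gaussian_mechanism(true_values, sigma=4.0, rng=None):
    """Add N(0, sigma^2) noise per coordinate; for a function with l2-sensitivity
    at most 1 this is (eps, delta)-DP with delta = (4/5)*exp(-(eps*sigma)**2 / 2)
    per the slide above (larger sigma means smaller eps for a fixed delta)."""
    rng = rng or np.random.default_rng()
    return np.asarray(true_values, dtype=float) + rng.normal(0.0, sigma, size=np.shape(true_values))
```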
  46. 46. What Differential Privacy Isn’t ● Algorithm, architecture, or a rule book ● Secure Computation: what not how ● All-encompassing guarantee: trends may be sensitive too
  47. 47. Strava Fitness App
  48. 48. BBC: “Fitness app Strava lights up staff at military bases”
  49. 49. Differential Privacy: Takeaway points • Privacy as a notion of stability of randomized algorithms with respect to small perturbations in their input • Worst-case definition • Robust (to auxiliary data, correlated inputs) • Composable • Quantifiable • Concept of a privacy budget • Noise injection
  50. 50. Case Studies
  51. 51. Google’s RAPPOR
  52. 52. ...Mountain View, 2014
  53. 53. Central Model Curator
  54. 54. Local Model
  55. 55. Differential Privacy ε-Differential Privacy: The distribution of the output M(D) on database D is (nearly) the same as M(D′) for all adjacent databases D and D′: ∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S].
  56. 56. Local Differential Privacy ε-Differential Privacy: The distribution of the output M(D) on database D is (nearly) the same as M(D′) for all adjacent databases D and D′: ∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S].
  57. 57. Local-Differentially Private Mechanisms ● Stanley L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias", Journal of American Statistical Association, March 1965. ● Arijit Chaudhuri, Rahul Mukerjee. Randomized Response. Theory and Techniques. 1988.
  58. 58. Randomized Response (Warner 1965). Q1: Are you a citizen of the United States? Q2: Are you not a citizen of the United States? θ – the true fraction of citizens in the sample. Answer Q1 with probability p; answer Q2 with probability 1 − p. The report is ε-DP with ε = ln(p / (1 − p)).
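A small Python sketch of Warner's randomized response and the corresponding unbiased estimator of θ (our own illustration; p = 0.75 gives ε = ln 3):

```python
import numpy as np

def randomized_response(true_answers, p=0.75, rng=None):
    """Each respondent answers Q1 (the sensitive question) with probability p and
    its negation Q2 with probability 1 - p, so a single report is ln(p/(1-p))-DP."""
    rng = rng or np.random.default_rng()
    answers_q1 = rng.random(len(true_answers)) < p
    truth = np.asarray(true_answers, dtype=bool)
    return np.where(answers_q1, truth, ~truth)

def estimate_theta(reports, p=0.75):
    """E[mean(reports)] = p*theta + (1-p)*(1-theta); solve for theta."""
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

# Example: 100k respondents, 30% true "yes" rate.
rng = np.random.default_rng(0)
truth = rng.random(100_000) < 0.3
print(estimate_theta(randomized_response(truth, rng=rng)))  # close to 0.3
```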
  59. 59. RAPPOR Erlingsson, Pihur, Korolova. "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response." ACM CCS 2014.
  60. 60. RAPPOR: two-level randomized response Can we do repeated surveys of sensitive attributes? — Average of randomized responses will reveal a user’s true answer :-( Solution: Memoize! Re-use the same random answer — Memoization can hurt privacy too! Long, random bit sequence can be a unique tracking ID :-( Solution: Use 2-levels! Randomize the memoized response
  61. 61. RAPPOR: two-level randomized response ● Store client value v into bloom filter B using hash functions ● Memoize a Permanent Randomized Response (PRR) B′ ● Report an Instantaneous Randomized Response (IRR) S
  62. 62. RAPPOR: two-level randomized response ● Store client value v into Bloom filter B using hash functions ● Memoize a Permanent Randomized Response (PRR) B′ ● Report an Instantaneous Randomized Response (IRR) S. Parameters: f = ½, q = ¾, p = ½
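A simplified Python sketch of the PRR/IRR pipeline with the slide's parameters (f = ½, q = ¾, p = ½); the hashing scheme and Bloom-filter size are illustrative, not Chrome's exact implementation:

```python
import hashlib
import numpy as np

def bloom_bits(value, num_bits=128, num_hashes=2):
    """Hash a string into `num_hashes` positions of a `num_bits`-bit Bloom filter."""
    bits = np.zeros(num_bits, dtype=int)
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest, "big") % num_bits] = 1
    return bits

def permanent_rr(bits, f=0.5, rng=None):
    """PRR: keep each Bloom bit with prob 1-f, otherwise set it to 1 or 0 with prob f/2 each.
    In RAPPOR the PRR is memoized per value and reused for every future report."""
    rng = rng or np.random.default_rng()
    u = rng.random(len(bits))
    return np.where(u < f / 2, 1, np.where(u < f, 0, bits))

def instantaneous_rr(prr, q=0.75, p=0.5, rng=None):
    """IRR: report 1 with prob q where PRR = 1 and with prob p where PRR = 0; a fresh draw per report."""
    rng = rng or np.random.default_rng()
    probs = np.where(prr == 1, q, p)
    return (rng.random(len(prr)) < probs).astype(int)

# One client's report for "www.google.com" with the slide's parameters.
report = instantaneous_rr(permanent_rr(bloom_bits("www.google.com")))
```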
  63. 63. RAPPOR: Life of a report Value Bloom Filter PRR IRR “www.google.com”
  64. 64. Value Bloom Filter PRR IRR “www.google.com” P(1) = 0.25 P(1) = 0.75 RAPPOR: Life of a report
  65. 65. Value Bloom Filter PRR IRR “www.google.com” P(1) = 0.50 P(1) = 0.75 RAPPOR: Life of a report
  66. 66. Differential privacy of RAPPOR ● Permanent Randomized Response satisfies differential privacy at ε = 4 ln(3) ● Instantaneous Randomized Response has differential privacy at ε = ln(3)
  67. 67. Differential Privacy of RAPPOR: Measurable privacy bounds Each report offers differential privacy with ε = ln(3) Attacker’s guess goes from 0.1% → 0.3% in the worst case Differential privacy even if attacker gets all reports (infinite data!!!) Also… Base Rate Fallacy prevents attackers from finding needles in haystacks
  68. 68. Cohorts. Bloom filter: 2 bits out of 128 — too many false positives. Solution: randomly assign each user to one of 128 cohorts, each with its own set of hash functions (e.g., user 0xA0FE91B76 reports google.com within cohort 2, using that cohort’s hashes h1, h2).
  69. 69. Decoding RAPPOR
  70. 70. From Raw Counts to De-noised Counts True bit counts, with no noise De-noised RAPPOR reports
  71. 71. From De-Noised Count to Distribution True bit counts, with no noise De-noised RAPPOR reports google.com: yahoo.com: bing.com:
  72. 72. From De-Noised Count to Distribution. Linear regression: min_X ||B − AX||₂. LASSO: min_X ||B − AX||₂² + λ||X||₁. Hybrid: 1. Find the support of X via LASSO 2. Solve linear regression on that support to find the weights
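A sketch of the hybrid decoding step in Python, assuming the de-noised bit counts B and the candidate-string design matrix A (column j = Bloom-filter bit pattern of candidate j) have already been computed; the regularization strength is illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

def decode_rappor(A, B, lam=1.0):
    """Hybrid decoder: (1) LASSO to find which candidate strings are present
    (the support of X in min ||B - A X||^2 + lambda*||X||_1), then (2) ordinary
    least squares restricted to that support to estimate their weights."""
    lasso = Lasso(alpha=lam, positive=True, fit_intercept=False)
    lasso.fit(A, B)
    support = np.flatnonzero(lasso.coef_)
    weights, *_ = np.linalg.lstsq(A[:, support], B, rcond=None)
    return dict(zip(support.tolist(), weights))
```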
  73. 73. Deploying RAPPOR
  74. 74. Coverage
  75. 75. Explaining RAPPOR “Having the cake and eating it too…” “Seeing the forest without seeing the trees…”
  76. 76. Metaphor for RAPPOR
  77. 77. Microdata: An Individual’s Report
  78. 78. Microdata: An Individual’s Report Each bit is flipped with probability 25%
  79. 79. Big Picture Remains!
  80. 80. Google Chrome Privacy White Paper https://www.google.com/chrome/browser/privacy/whitepaper.html Phishing and malware protection Google Chrome includes an optional feature called "Safe Browsing" to help protect you against phishing and malware attacks. This helps prevent evil-doers from tricking you into sharing personal information with them (“phishing”) or installing malicious software on your computer (“malware”). The approach used to accomplish this was designed specifically to protect your privacy and is also used by other popular browsers. If you'd rather not send any information to Safe Browsing, you can also turn these features off. Please be aware that Chrome will no longer be able to protect you from websites that try to steal your information or install harmful software if you disable this feature. We really don't recommend turning it off. … If a URL was indeed dangerous, Chrome reports this anonymously to Google to improve Safe Browsing. The data sent is randomized, constructed in a manner that ensures differential privacy, permitting only monitoring of aggregate statistics that apply to tens of thousands of users at minimum. The reports are an instance of Randomized Aggregatable Privacy-Preserving Ordinal Responses, whose full technical details have been published in a technical report and presented at the 2014 ACM Computer and Communications Security conference. This means that Google cannot infer which website you have visited from this.
  81. 81. Developers’ Uptake
  82. 82. RAPPOR: Lessons Learned
  83. 83. Growing Pains ● Transitioning from a research prototype to a real product ● Scalability ● Versioning
  84. 84. Communicating Uncertainty
  85. 85. The "Three Commas" Club
  86. 86. Follow-up - Bassily, Smith, “Local, Private, Efficient Protocols for Succinct Histograms,” STOC 2015 - Kairouz, Bonawitz, Ramage, “Discrete Distribution Estimation under Local Privacy”, https://arxiv.org/abs/1602.07387 - Qin et al., “Heavy Hitter Estimation over Set-Valued Data with Local Differential Privacy”, CCS 2016
  87. 87. Key takeaway points RAPPOR - locally differentially-private mechanism for reporting of categorical and string data ● First Internet-scale deployment of differential privacy ● Explainable ● Conservative ● Open-sourced
  88. 88. Apple's On-Device Differential Privacy Abhradeep Guha Thakurta, UC Santa Cruz
  89. 89. Apple WWDC, June 2016
  90. 90. References https://arxiv.org/abs/1709.02753
  91. 91. Learning from private data. Learn new (and frequent) words typed (e.g., phablet, derp, photobomb, woot, OMG, troll, prepone, awwww).
  92. 92. Learning from private data Learn frequent emojis typed
  93. 93. Apple's On-Device Differential Privacy: Discovering New Words
  94. 94. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  95. 95. Private Frequency Oracle
  96. 96. Private frequency oracle. Building block for private heavy hitters. Clients hold values d₁, d₂, …, d_n; for any word s in 𝒮 (e.g., “phablet”) the server can estimate frequency(s), with all errors within γ = O(√(n·log|𝒮|)).
  97. 97. Private frequency oracle: Design constraints. Computational and communication constraints: Client side: logarithmic in the size of the domain (|S|) and in n. Communication to server: very few bits. Server-side cost for one query: size of the domain (|S|) and n
  98. 98. Private frequency oracle: Design constraints Computational and communication constraints: Client side: size of the domain (|S|) and n # characters > 3,000 For 8-character words: size of the domain |S|=3,000^8 number of clients ~ 1B Efficiently [BS15] ~ n Our goal ~ O(log |S|)
  99. 99. Private frequency oracle: Design constraints Computational and communication constraints: Client side: O(log |S|) Communication to server: O(1) bits Server-side cost for one query: O(log |S|)
  100. 100. Private frequency oracle. A starter solution: randomized response. Client value d is encoded as a one-hot bit vector (1 at position i = d, 0 elsewhere); randomized response flips its bits to produce the report d′. Protects ε-differential privacy (with the right bias).
  101. 101. Private frequency oracle. A starter solution: randomized response. The server sums the reports and, with bias correction, obtains a frequency estimate for all domain elements. Error in each estimate: Θ(√(n·log|𝒮|)), the optimal error under privacy.
  102. 102. Private frequency oracle. A starter solution: randomized response. Computational and communication constraints: Client side: O(|S|). Communication to server: O(|S|) bits. Server-side cost for one query: O(1)
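For concreteness, a sketch of this starter solution: each client randomizes its full one-hot vector, and the server de-biases the summed reports (splitting ε across the two coordinates that can change is our choice of parameterization):

```python
import numpy as np

def client_report(value_index, domain_size, epsilon=1.0, rng=None):
    """Starter solution: send the whole one-hot vector, flipping every bit
    independently with prob gamma = 1/(exp(eps/2)+1); changing one's value flips
    two coordinates, so the report is eps-locally-DP.  Communication is O(|S|)."""
    rng = rng or np.random.default_rng()
    gamma = 1.0 / (np.exp(epsilon / 2) + 1.0)
    bits = np.zeros(domain_size, dtype=int)
    bits[value_index] = 1
    flips = rng.random(domain_size) < gamma
    return np.where(flips, 1 - bits, bits)

def debias(summed_reports, num_clients, epsilon=1.0):
    """Bias correction: E[sum_j] = n_j*(1-gamma) + (n-n_j)*gamma; solve for n_j."""
    gamma = 1.0 / (np.exp(epsilon / 2) + 1.0)
    return (summed_reports - num_clients * gamma) / (1.0 - 2.0 * gamma)
```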
  103. 103. Private frequency oracle. Non-private count-min sketch [CM05]. Client value d is hashed by k ≈ log|𝒮| hash functions h₁, h₂, …, h_k, each into its own row of hash bins. Client computation: O(log|𝒮|).
  104. 104. Private frequency oracle. Non-private count-min sketch [CM05]: reducing server computation. [figure: client sketches are summed into a k-row table of counts, e.g., 245, 127, 9123, 2132, …]
  105. 105. Private frequency oracle. Non-private count-min sketch [CM05]: reducing server computation. To query “phablet”, hash it with each of the k ≈ log|𝒮| hash functions and read the matching counters (e.g., 9146, 2212, 2132); the frequency estimate is their minimum: min(9146, 2212, 2132). Error in each estimate: O(√(n·log|𝒮|)). Server-side query cost: O(log|𝒮|).
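A minimal non-private count-min sketch in Python, to make the min-over-rows query concrete (hash functions and table dimensions are illustrative):

```python
import hashlib

class CountMinSketch:
    """Non-private count-min sketch [CM05]: k ~ log|S| hash rows, m bins each;
    a query returns the minimum of the k matching counters (an overestimate)."""
    def __init__(self, num_hashes=16, num_bins=1024):
        self.k, self.m = num_hashes, num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row, word):
        digest = hashlib.sha256(f"{row}:{word}".encode()).digest()
        return int.from_bytes(digest, "big") % self.m

    def add(self, word):
        for row in range(self.k):
            self.table[row][self._bin(row, word)] += 1

    def query(self, word):
        return min(self.table[row][self._bin(row, word)] for row in range(self.k))
```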
  106. 106. Private frequency oracle. Private count-min sketch: making the client computation differentially private. Randomizing all k rows of the client’s sketch would only be kε-differentially private, since the report carries k pieces of information about d.
  107. 107. Private frequency oracle. Private count-min sketch: each client samples a single hash function and reports only that randomized row. Theorem: Sampling ensures ε-differential privacy without hurting accuracy; rather, it improves it by a factor of k.
  108. 108. Private frequency oracle. Private count-min sketch: reducing client communication. Apply the Hadamard transform to the (±1-valued) sampled row before randomizing.
  109. 109. Private frequency oracle. Private count-min sketch: reducing client communication. The client reports a single (randomized) Hadamard coefficient. Communication: O(1) bits. Theorem: The Hadamard transform and sampling do not hurt accuracy.
  110. 110. Private frequency oracle. Private count-min sketch. Computational and communication constraints: Client side: O(log |S|). Communication to server: O(1) bits. Server-side cost for one query: O(log |S|). Error in each estimate: O(√(n·log|𝒮|))
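A heavily simplified sketch of the client side, in the spirit of the construction above: sample one of the k hash functions and randomize only that row. The Hadamard-transform step that compresses the report to O(1) bits is omitted, and all parameters are illustrative, not Apple's production values:

```python
import hashlib
import numpy as np

def private_cms_client(word, epsilon=2.0, num_hashes=16, num_bins=1024, rng=None):
    """Client side, simplified: sample ONE of the k hash functions (the sampling
    step that keeps the report eps-DP instead of k*eps), build the +/-1 row with
    +1 at h_j(word), and apply randomized response to every coordinate."""
    rng = rng or np.random.default_rng()
    j = rng.integers(num_hashes)                       # sampled hash index
    digest = hashlib.sha256(f"{j}:{word}".encode()).digest()
    row = -np.ones(num_bins)
    row[int.from_bytes(digest, "big") % num_bins] = 1  # +1 at h_j(word), -1 elsewhere
    flip = rng.random(num_bins) < 1.0 / (np.exp(epsilon / 2) + 1.0)
    return j, np.where(flip, -row, row)                # (hash index, privatized row)
```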
  111. 111. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  112. 112. Private heavy hitters: Using the frequency oracle Private frequency oracle Private count-min sketch Domain 𝒮 Too many elements in 𝒮 to search. Element s in S Frequency(s) Find all s in S with frequency > γ
  113. 113. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  114. 114. Puzzle piece algorithm (works well in practice, no theoretical guarantees) [Bassily Nissim Stemmer Thakurta, 2017 and Apple differential privacy team, 2017]
  115. 115. Private heavy hitters. Observation: If a word is frequent, its bi-grams are frequent too. E.g., “Phablet$” splits into bi-grams Ph, ab, le, t$; if the word has frequency > γ, each bi-gram has frequency > γ.
  116. 116. Private heavy hitters. Natural algorithm: Cartesian product of frequent bi-grams. Clients submit sanitized bi-grams, and the complete word. Frequent bi-grams per position: P1: {ab, ad, ph}, P2: {ba, ab, ax}, P3: {le, ab, ab}, P4: {le, ab, t$}.
  117. 117. Private heavy hitters. Natural algorithm: Cartesian product of frequent bi-grams. Candidate words = P1 × P2 × P3 × P4; query the private frequency oracle (private count-min sketch) on each candidate to find the frequent words.
  118. 118. Private heavy hitters. Natural algorithm: Cartesian product of frequent bi-grams. Candidate words = P1 × P2 × P3 × P4; query the private frequency oracle (private count-min sketch) to find frequent words. Problem: combinatorial explosion; in practice, all bi-grams are frequent.
  119. 119. Puzzle piece algorithm. Tag the bi-grams of a word with a hash of the complete word: h = Hash(“Phablet”), where Hash: 𝒮 → {1, …, ℓ}. The client submits the privatized bi-grams (Ph, ab, le, t$), each tagged with h, and the complete word.
  120. 120. Puzzle piece algorithm: Server side. Frequent bi-grams tagged with a hash in {1, …, ℓ}: P1: {(ab,1), (ad,5), (Ph,3)}, P2: {(ba,4), (ab,3), (ax,9)}, P3: {(le,3), (le,7), (ab,1)}, P4: {(le,1), (ab,9), (t$,3)}. Candidate words = P1 × P2 × P3 × P4, combining only matching bi-grams (same tag); then use the private frequency oracle (private count-min sketch) to find the frequent words.
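A sketch of the server-side combination step of the puzzle piece algorithm; `frequent_bigrams_by_position` (a list of (bigram, tag) pairs per position) and `frequency_oracle` (a callable backed by the private count-min sketch) are assumed inputs, not names from the tutorial:

```python
from itertools import product

def candidate_words(frequent_bigrams_by_position, frequency_oracle, threshold):
    """Combine only bi-grams that share the same word-hash tag, which avoids the
    full Cartesian-product blow-up, then keep candidates the oracle says are frequent."""
    results = {}
    # Tags that appear at every position (a full word must contribute to all positions).
    tags = set.intersection(*[{tag for _, tag in pos} for pos in frequent_bigrams_by_position])
    for h in tags:
        pieces = [[bg for bg, tag in pos if tag == h] for pos in frequent_bigrams_by_position]
        for combo in product(*pieces):
            word = "".join(combo)
            estimate = frequency_oracle(word)   # query the private frequency oracle
            if estimate > threshold:
                results[word] = estimate
    return results
```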
  121. 121. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  122. 122. Tree histogram algorithm (works well in practice + optimal theoretical guarantees) [Bassily Nissim Stemmer Thakurta, 2017]
  123. 123. Private heavy hitters: Tree histograms (based on [CM05]). Any string in 𝒮 can be written as log|𝒮| bits. Idea: Construct the prefixes of each heavy hitter bit by bit.
  124. 124. Private heavy hitters: Tree histograms. [figure: binary prefix tree, level 1: prefixes 0 and 1]
  125. 125. Private heavy hitters: Tree histograms. Level 1: Frequent prefixes of length 1 (use the private frequency oracle). If a string is a heavy hitter, its prefixes are too.
  126. 126. Private heavy hitters: Tree histograms. [figure: binary prefix tree, level 2: prefixes 00, 01, 10, 11]
  127. 127. Private heavy hitters: Tree histograms. Level 2: Frequent prefixes of length two. Idea: Each level has ≈ √n heavy hitters.
  128. 128. Private heavy hitters: Tree histograms. Computational and communication constraints: Client side: O(log |S|). Communication to server: O(1) bits. Server-side computation: O(n log |S|). Theorem: Finds all heavy hitters with frequency at least O(√(n·log|𝒮|)).
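A sketch of the prefix-growing loop of the tree histogram protocol; `frequency_oracle(prefix)` stands in for the private estimate of how many clients hold a value starting with that prefix (in the full protocol, clients' reports are arranged so such an oracle exists per level):

```python
def tree_histogram(frequency_oracle, num_bits, threshold):
    """Grow heavy-hitter prefixes one bit at a time, keeping only prefixes whose
    (privately estimated) frequency clears the threshold: a prefix of a heavy
    hitter must itself be a heavy hitter, so nothing frequent is pruned away."""
    prefixes = [""]
    for _ in range(num_bits):
        extended = [p + b for p in prefixes for b in ("0", "1")]
        prefixes = [p for p in extended if frequency_oracle(p) > threshold]
    return prefixes
```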
  129. 129. Key takeaway points • Keeping local differential privacy constant: •One low-noise report is better than many noisy ones •Weak signal with probability 1 is better than strong signal with small probability • We can learn the dictionary – at a cost • Longitudinal privacy remains a challenge
  130. 130. NeurIPS 2017
  131. 131. Microsoft: Discretization of continuous variables "These guarantees are particularly strong when user’s behavior remains approximately the same, varies slowly, or varies around a small number of values over the course of data collection."
  132. 132. Microsoft's deployment "Our mechanisms have been deployed by Microsoft across millions of devices ... to protect users’ privacy while collecting application usage statistics." B. Ding, J. Kulkarni, S. Yekhanin, NeurIPS 2017
  133. 133. Microsoft Research Blog, Dec 8, 2017
  134. 134. Privacy in AI @ LinkedIn • Framework to compute robust, privacy-preserving analytics • Privacy challenges/design for a large crowdsourced system (LinkedIn Salary)
  135. 135. Analytics & Reporting Products at LinkedIn: Profile View Analytics, Content Analytics, Ad Campaign Analytics. All showing demographics of members engaging with the product.
  136. 136. • Admit only a small # of predetermined query types • Querying for the number of member actions, for a specified time period, together with the top demographic breakdowns Analytics & Reporting Products at LinkedIn
  137. 137. Analytics & Reporting Products at LinkedIn • Admit only a small # of predetermined query types • Querying for the number of member actions (e.g., clicks on a given ad), for a specified time period, together with the top demographic breakdowns (e.g., Title = “Senior Director”)
  138. 138. Privacy Requirements • Attacker cannot infer whether a member performed an action • E.g., click on an article or an ad • Attacker may use auxiliary knowledge • E.g., knowledge of attributes associated with the target member (say, obtained from this member’s LinkedIn profile) • E.g., knowledge of all other members that performed similar action
  139. 139. Possible Privacy Attacks. Targeting: Senior directors in US, who studied at Cornell. Matches ~16k LinkedIn members → over minimum targeting threshold. Demographic breakdown: Company = X. May match exactly one person → can determine whether the person clicked on the ad or not. Require a minimum reporting threshold? Still amenable to attacks (refer to our ACM CIKM’18 paper for details). Rounding mechanism (e.g., report in increments of 10)? Still amenable to attacks, e.g., using incremental counts over time to infer individuals’ actions. Need rigorous techniques to preserve member privacy (not reveal exact aggregate counts).
  140. 140. Key Product Desiderata • Coverage & Utility • Data Consistency • for repeated queries • over time • between total and breakdowns • across entity/action hierarchy • for top k queries
  141. 141. Problem Statement Compute robust, reliable analytics in a privacy- preserving manner, while addressing the product desiderata such as coverage, utility, and consistency.
  142. 142. Differential Privacy: Random Noise Addition. If the ℓ1-sensitivity of f: D → ℝⁿ, max_{D,D′} ||f(D) − f(D′)||₁ = s, then adding Laplace noise to the true output, f(D) + Laplaceⁿ(s/ε), offers ε-differential privacy. Dwork, McSherry, Nissim, Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, TCC 2006
  143. 143. PriPeARL: A Framework for Privacy-Preserving Analytics. K. Kenthapadi, T. T. L. Tran, ACM CIKM 2018. Pseudo-random noise generation, inspired by differential privacy: ● Entity id (e.g., ad creative/campaign/account) ● Demographic dimension ● Stat type (impressions, clicks) ● Time range ● Fixed secret seed → cryptographic hash → normalize to (0,1) → uniformly random fraction → Laplace noise (fixed ε) → noisy count = true count + random noise (a sketch follows below). To satisfy consistency requirements: ● Pseudo-random noise → the same query has the same result over time, avoiding averaging attacks ● For non-canonical queries (e.g., time ranges, aggregates over multiple entities): ○ use the hierarchy and partition into canonical queries ○ compute noise for each canonical query and sum up the noisy counts
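A minimal sketch of the pseudo-random noise generation described on this slide; the hashing and normalization details are illustrative, not the exact production implementation (see the CIKM 2018 paper):

```python
import hashlib
import math

def pripearl_noisy_count(true_count, entity_id, dimension, stat_type, time_range,
                         secret_seed, epsilon=1.0):
    """Hash the query attributes together with a secret seed, map the digest to a
    uniform number in (0,1), and convert it to Laplace(1/eps) noise via the inverse
    CDF, so the same query always receives the same noise (defeats averaging attacks)."""
    key = f"{secret_seed}|{entity_id}|{dimension}|{stat_type}|{time_range}".encode()
    digest = hashlib.sha256(key).digest()
    u = (int.from_bytes(digest, "big") % (2**53)) / float(2**53)   # uniform in [0, 1)
    u = min(max(u, 1e-12), 1 - 1e-12)                              # keep strictly inside (0, 1)
    # Inverse CDF of Laplace(0, 1/eps): -sign(u - 0.5) * ln(1 - 2|u - 0.5|) / eps
    noise = -math.copysign(1.0, u - 0.5) * math.log(1 - 2 * abs(u - 0.5)) / epsilon
    return true_count + noise
```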
  144. 144. System Architecture
  145. 145. Lessons Learned from Deployment (> 1 year) • Semantic consistency vs. unbiased, unrounded noise • Suppression of small counts • Online computation and performance requirements • Scaling across analytics applications • Tools for ease of adoption (code/API library, hands-on how-to tutorial) help!
  146. 146. Summary • Framework to compute robust, privacy-preserving analytics • Addressing challenges such as preserving member privacy, product coverage, utility, and data consistency • Future • Utility maximization problem given constraints on the ‘privacy loss budget’ per user • E.g., noise with larger variance to impressions but less noise to clicks (or conversions) • E.g., more noise to broader time range sub-queries and less noise to granular time range sub-queries • Reference: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018. • https://engineering.linkedin.com/blog/2019/04/privacy-preserving-analytics-and-reporting-at-linkedin
  147. 147. Acknowledgements •Team: • AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran • Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian Koeppe • Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke •Acknowledgements (in alphabetical order) • Deepak Agarwal, Igor Perisic, Arun Swami
  148. 148. LinkedIn Salary
  149. 149. Outline • LinkedIn Salary Overview • Challenges: Privacy, Modeling • System Design & Architecture • Privacy vs. Modeling Tradeoffs
  150. 150. LinkedIn Salary (launched in Nov, 2016)
  151. 151. Salary Collection Flow via Email Targeting
  152. 152. Current Reach (May 2019) • A few million responses out of several millions of members targeted • Targeted via emails since early 2016 • Countries: US, CA, UK, DE, IN, … • Insights available for a large fraction of US monthly active users
  153. 153. Data Privacy Challenges • Minimize the risk of inferring any one individual’s compensation data • Protection against data breach • No single point of failure. Achieved by a combination of techniques: encryption, access control, de-identification, aggregation, thresholding. K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
  154. 154. Modeling Challenges • Evaluation • Modeling on de-identified data • Robustness and stability • Outlier detection X. Chen, Y. Liu, L. Zhang, and K. Kenthapadi, How LinkedIn Economic Graph Bonds Information and Product: Applications in LinkedIn Salary, KDD 2018 (arxiv.org/abs/1806.09063) K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal, Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary, CIKM 2017 (arxiv.org/abs/1703.09845)
  155. 155. Problem Statement •How do we design LinkedIn Salary system taking into account the unique privacy and security challenges, while addressing the product requirements?
  156. 156. Differential Privacy? [Dwork et al, 2006] • Rich privacy literature (Adam-Worthmann, Samarati-Sweeney, Agrawal-Srikant, …, Kenthapadi et al, Machanavajjhala et al, Li et al, Dwork et al) • Limitation of anonymization techniques (as discussed in the first part) • Worst-case sensitivity of quantiles to any one user’s compensation data is large • ⇒ Large noise would have to be added, hurting reliability/usefulness • Need compensation insights on a continual basis • Theoretical work on applying differential privacy under continual observations • No practical implementations / applications • Local differential privacy / randomized response based approaches (Google’s RAPPOR, Apple’s iOS differential privacy, Microsoft’s telemetry collection) not applicable
  157. 157. De-identification Example. Original submission (stored as encrypted objects): Title = User Exp Designer, Region = SF Bay Area, Company = Google, Industry = Internet, Years of exp = 12, Degree = BS, FoS = Interactive Media, Skills = {UX, Graphics, …}, $$ = 100K. De-identified cohorts (#data points > threshold? Yes ⇒ copy to Hadoop/HDFS): (Title, Region): User Exp Designer, SF Bay Area: 100K, 115K, …; (Title, Region, Industry): User Exp Designer, SF Bay Area, Internet: 100K; (Title, Region, Years of exp): User Exp Designer, SF Bay Area, 10+: 100K; (Title, Region, Company, Years of exp): User Exp Designer, SF Bay Area, Google, 10+: 100K.
  158. 158. System Architecture
  159. 159. Collection & Storage
  160. 160. Collection & Storage • Allow members to submit their compensation info • Extract member attributes • E.g., canonical job title, company, region, by invoking LinkedIn standardization services • Securely store member attributes & compensation data
  161. 161. De-identification & Grouping
  162. 162. De-identification & Grouping • Approach inspired by k-Anonymity [Samarati-Sweeney] • “Cohort” or “Slice” • Defined by a combination of attributes • E.g., “User experience designers in SF Bay Area” • Contains aggregated compensation entries from corresponding individuals • No user name, id or any attributes other than those that define the cohort • A cohort available for offline processing only if it has at least k entries • Apply LinkedIn standardization software (free-form attribute → canonical version) before grouping • Analogous to the generalization step in k-Anonymity
  163. 163. De-identification & Grouping • Slicing service • Access member attribute info & submission identifiers (no compensation data) • Generate slices & track # submissions for each slice • Preparation service • Fetch compensation data (using submission identifiers), associate with the slice data, copy to HDFS
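A sketch of the grouping-and-thresholding step (k-anonymity style): cohorts are keyed by their standardized attribute combination and released for offline processing only once they contain at least k entries. The input format and names are our assumptions, not the production schema:

```python
from collections import defaultdict

def build_cohorts(submissions, k=10):
    """Group de-identified submissions into cohorts and keep only cohorts with at
    least k compensation entries.  `submissions` is assumed to be an iterable of
    (cohort_key, compensation) pairs, where cohort_key is a tuple of standardized
    attributes (e.g., (title, region)) with all identifying attributes already stripped."""
    cohorts = defaultdict(list)
    for cohort_key, compensation in submissions:
        cohorts[cohort_key].append(compensation)
    return {key: entries for key, entries in cohorts.items() if len(entries) >= k}
```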
  164. 164. Insights & Modeling
  165. 165. Insights & Modeling • Salary insight service • Check whether the member is eligible • Give-to-get model • If yes, show the insights • Offline workflow • Consume de-identified HDFS dataset • Compute robust compensation insights • Outlier detection • Bayesian smoothing/inference • Populate the insight key-value stores
  166. 166. Security Mechanisms
  167. 167. Security Mechanisms • Encryption of member attributes & compensation data using different sets of keys • Separation of processing • Limiting access to the keys
  168. 168. Security Mechanisms • Key rotation • No single point of failure • Infra security
  169. 169. Preventing Timestamp Join based Attacks • Inference attack by joining these on timestamp • De-identified compensation data • Page view logs (when a member accessed compensation collection web interface) • ⇒ Not desirable to retain the exact timestamp • Perturb by adding random delay (say, up to 48 hours) • Modification based on k-Anonymity • Generalization using a hierarchy of timestamps • But, need to be incremental • ⇒ Process entries within a cohort in batches of size k • Generalize to a common timestamp • Make additional data available only in such incremental batches
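A sketch of the two mitigations on this slide: random delay on submission timestamps, and incremental release in batches of size k with a common (generalized) timestamp. The record format and parameter values are illustrative:

```python
import random

def perturb_timestamp(ts_seconds, max_delay_hours=48, rng=random):
    """Add a random delay (up to ~48 hours, per the slide) so de-identified entries
    cannot be joined to page-view logs on the exact submission time."""
    return ts_seconds + rng.uniform(0, max_delay_hours * 3600)

def release_in_batches(entries, k=10):
    """Incremental generalization: process a cohort's entries in batches of size k,
    stamp every entry in a batch with the batch's latest timestamp, and release
    only complete batches; a partial batch stays unreleased."""
    entries = sorted(entries, key=lambda e: e["timestamp"])
    batches = []
    for i in range(0, len(entries) - len(entries) % k, k):
        batch = entries[i:i + k]
        common_ts = batch[-1]["timestamp"]
        batches.append([{**e, "timestamp": common_ts} for e in batch])
    return batches
```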
  170. 170. Privacy vs Modeling Tradeoffs • LinkedIn Salary system deployed in production for ~3 years • Study tradeoffs between privacy guarantees (‘k’) and data available for computing insights • Dataset: Compensation submission history from 1.5M LinkedIn members • Amount of data available vs. minimum threshold, k • Effect of processing entries in batches of size, k
  171. 171. Amount of data available vs. threshold, k
  172. 172. Percent of data available vs. batch size, k
  173. 173. Median delay due to batching vs. batch size, k
  174. 174. Key takeaway points • LinkedIn Salary: a new internet application, with unique privacy/modeling challenges • Privacy vs. Modeling Tradeoffs • Potential directions • Privacy-preserving machine learning models in a practical setting [e.g., Chaudhuri et al, JMLR 2011; Papernot et al, ICLR 2017] • Provably private submission of compensation entries?
  175. 175. Beyond Randomized Response
  176. 176. Beyond Randomized Response • Federated Learning • DP + Machine Learning • Encode-Shuffle-Analyze architecture "Prochlo: Strong Privacy for Analytics in the Crowd" Bittau et al., SOSP 2017 • Amplification by Shuffling
  177. 177. Federated Learning "Practical secure aggregation for privacy-preserving machine learning" Bonawitz, Ivanov, Kreuter, Marcedone, McMahan, Patel, Ramage, Segal, Seth, ACM CCS 2017
  178. 178. ML and Differential Privacy
  179. 179. "Generalization Implies Privacy" Fallacy We don’t overfit, therefore our model cannot possibly violate privacy.
  180. 180. “Generalization Implies Privacy” Fallacy Generalization ● average case ● model’s accuracy Privacy ● worst case ● model’s parameters
  181. 181. “Generalization Implies Privacy” Fallacy ● Examples when it just ain’t so: ○ Person-to-person similarities ○ Support Vector Machines ● Models can be very large ○ Millions of parameters
  182. 182. Somali to English Translation
  183. 183. Somali to English Translation
  184. 184. Somali to English Translation
  185. 185. Somali to English Translation
  186. 186. Somali to English Translation
  187. 187. Maori to English
  188. 188. “Understanding Deep Networks Requires Rethinking Generalization”, Zhang et al.’17
  189. 189. ML + Differential Privacy • [DP-SGD] Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, Zhang, "Deep Learning with Differential Privacy", ACM CCS 2016 • [PATE] Papernot, Abadi, Erlingsson, Goodfellow, Talwar, "Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data", ICLR 2017 • [PATE] Papernot, Song, Mironov, Raghunathan, Talwar, Erlingsson, "Scalable Private Learning with PATE", ICLR 2018 https://github.com/tensorflow/privacy
  190. 190. Statistics + Differential Privacy. Harvard Privacy Tools Project
  191. 191. Census 2020 and Differential Privacy
  192. 192. Key takeaway points • Notion of differential privacy is a principled foundation for privacy- preserving data analyses • Local differential privacy is a powerful technique appropriate for Internet-scale telemetry • Other techniques (thresholding, shuffling) can be combined with differentially private algorithms or be used in isolation.
  193. 193. References. Differential privacy: review: Dwork, "A Firm Foundation for Private Data Analysis", Comm. ACM 2011; book: Dwork and Roth, "The Algorithmic Foundations of Differential Privacy"
  194. 194. References. Google's RAPPOR: paper "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response", ACM CCS 2014, Erlingsson, Pihur, Korolova; blog (https://security.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html). Apple's implementation: article "Learning with Privacy at Scale", Apple ML J., Dec 2017; paper "Practical Locally Private Heavy Hitters", NIPS 2017, Bassily, Nissim, Stemmer, Thakurta; paper "Privacy Loss in Apple's Implementation of Differential Privacy on MacOS 10.12", Tang, Korolova, Bai, Wang, Wang. LinkedIn's privacy-preserving analytics framework: paper "PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn", CIKM 2018, Kenthapadi, Tran; blog (https://engineering.linkedin.com/blog/2019/04/privacy-preserving-analytics-and-reporting-at-linkedin). LinkedIn Salary: paper "LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers", IEEE PAC 2017, Kenthapadi, Chudhary, Ambler; blog (https://engineering.linkedin.com/blog/2017/12/statistical-modeling-for-linkedin-salary)
  195. 195. Fairness Privacy Transparency Explainability Related WWW’19 sessions: 1.Tutorial: Designing Equitable Algorithms for the Web 2.Tutorial: Economic Theories of Distributive Justice for Fair Machine Learning 3.Tutorial: Socially Responsible NLP 4.Tutorial: Fairness-aware Machine Learning in Practice 5.Tutorial: Explainable Recommendation and Search 6.Workshop: FATE and Society on the Web 7.Session: Fairness, Credibility, and Search (Wednesday, 10:30 – 12:30) 8.Session: Privacy and Trust (Wednesday, 16:00 – 17:30) 9.Special Track: Designing an Ethical Web (Friday)
  196. 196. Thanks! Questions? • Tutorial website: https://sites.google.com/view/www19-privacy-tutorial • Feedback most welcome :) • kkenthapadi@linkedin.com, mironov@google.com, aguhatha@ucsc.edu
