Privacy-preserving Data Mining
in Industry: Practical Challenges
and Lessons Learned
KDD 2018 Tutorial
August 2018
Krishnaram Kenthapadi (AI @ LinkedIn)
Ilya Mironov (Google AI)
Abhradeep Thakurta (UC Santa Cruz)
https://sites.google.com/view/kdd2018privacytutorial
Outline / Learning Outcomes
• Privacy breaches and lessons learned
• Evolution of privacy techniques
• Differential privacy: definition and techniques
• Privacy techniques in practice: Challenges and Lessons Learned
• Google’s RAPPOR
• Apple’s differential privacy deployment for iOS
• LinkedIn Salary: Privacy Design
• Key Takeaways
Privacy: A Historical Perspective
Evolution of Privacy Techniques and Privacy Breaches
Privacy Breaches and Lessons Learned
Attacks on privacy
•Governor of Massachusetts
•AOL
•Netflix
•Web browsing data
•Facebook
•Amazon
•Genomic data
born July 31, 1945
resident of 02138
Massachusetts Group Insurance Commission (1997):
Anonymized medical history of state employees (all
hospital visits, diagnosis, prescriptions)
Latanya Sweeney (MIT grad student): $20 – Cambridge
voter roll
William Weld vs Latanya Sweeney
64% uniquely identifiable with ZIP + birth date + gender
(in the US population)
Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”,
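The uniqueness claim above can be checked on any table by counting how many records have a one-of-a-kind combination of quasi-identifiers. The sketch below is illustrative only: the records are made up, and `unique_fraction` is a hypothetical helper name, not something from the tutorial.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Toy, made-up records (hypothetical data, not from any real release).
records = [
    {"zip": "02138", "dob": "1945-07-31", "gender": "M"},
    {"zip": "02138", "dob": "1962-03-14", "gender": "F"},
    {"zip": "02138", "dob": "1962-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1980-11-02", "gender": "F"},
]
frac = unique_fraction(records, ["zip", "dob", "gender"])
```

Any record counted here can be re-identified by joining against an auxiliary table (such as a voter roll) that carries the same three attributes plus a name.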
Attacker's Advantage
Auxiliary information
August 4, 2006: AOL
Research publishes
anonymized search logs
of 650,000 users
August 9:
New York Times
AOL Data Release
Attacker's Advantage
Auxiliary information
Enough to succeed on a small fraction of inputs
Netflix Prize
Oct 2006: Netflix announces Netflix
Prize
• 10% of their users
• average 200 ratings per user
Narayanan, Shmatikov (2006):
Netflix Prize
Deanonymizing Netflix Data
Narayanan, Shmatikov, Robust De-
anonymization of Large Datasets (How to
Break Anonymity of the Netflix Prize
Dataset), 2008
● Noam Chomsky in Our Times
● Fahrenheit 9/11
● Jesus of Nazareth
● Queer as Folk
Key idea:
● Similar intuition as the attack on medical records
● Medical records: Each person can be identified
based on a combination of a few attributes
● Web browsing history: Browsing history is unique for
each person
● Each person has a distinctive social network → the set of links
appearing in one’s feed is unique
● Users likely to visit links in their feed with higher
probability than a random user
● “Browsing histories contain tell-tale marks of identity”
Su et al, De-anonymizing Web Browsing Data with Social Networks, 2017
De-anonymizing Web Browsing Data with Social Networks
Attacker's Advantage
Auxiliary information
Enough to succeed on a small fraction of inputs
High dimensionality
Ad targeting:
Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM
Privacy Attacks On Ad Targeting
10 campaigns targeting 1 person (zip code, gender,
workplace, alma mater)
Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM
Facebook vs Korolova
Age | Ad impressions in a week
21 | 0
22 | 0
23 | 8
… | …
30 | 0
10 campaigns targeting 1 person (zip code, gender,
workplace, alma mater)
Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM
Facebook vs Korolova
Interest | Ad impressions in a week
A | 0
B | 0
C | 8
… | …
Z | 0
● Context: Microtargeted Ads
● Takeaway: Attackers can instrument ad campaigns to
identify individual users.
● Two types of attacks:
○ Inference from Impressions
○ Inference from Clicks
Facebook vs Korolova: Recap
Attacker's Advantage
Auxiliary information
Enough to succeed on a small fraction of inputs
High dimensionality
Active
Items frequently bought together
Bought: A B C D E
Z: Customers Who Bought This Item Also Bought
Calandrino, Kilzer, Narayanan, Felten, Shmatikov, “You Might Also Like: Privacy Risks of Collaborative
Attacking Amazon.com
A C D E
Attacker's Advantage
Auxiliary information
Enough to succeed on a small fraction of inputs
High dimensionality
Active
Observant
Homer et al., “Resolving individuals contributing trace
amounts of DNA to highly complex mixtures using high-
density SNP genotyping microarrays”, PLoS Genetics,
2008
Genetic data
Reference population
Bayesian Analysis
“In all mixtures, the identification
of the presence of a person’s
genomic DNA was possible.”
Zerhouni, NIH Director:
“As a result, the NIH has removed from
open-access databases the aggregate
results (including P values and genotype
counts) for all the GWAS that had been
available on NIH sites”
… one week later
Attacker's Advantage
Auxiliary information
Enough to succeed on a small fraction of inputs
High dimensionality
Active
Observant
Clever
Differential Privacy
Defining Privacy
[Figure: the curator's database, with and without your data in it]
Intuition:
● A member’s privacy is preserved if …
○ “The released result would nearly be the same, whether or
not the user’s information is taken into account”
● An attacker gains very little additional knowledge about any specific member from the
published result
Defining Privacy
Databases D and D′ are neighbors if they differ in one person’s data.
Differential Privacy [DMNS06]: The distribution of the curator’s output M(D)
on database D is (nearly) the same as M(D′).
Differential Privacy
ε-Differential Privacy: The distribution of the curator’s output M(D) on
database D is (nearly) the same as M(D′).
Parameter ε quantifies
information leakage
∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S].
Differential Privacy
Dwork, McSherry, Nissim, Smith [TCC 2006]
(ε, δ)-Differential Privacy: The distribution of the curator’s output M(D) on
database D is (nearly) the same as M(D′).
Parameter ε quantifies
information leakage
Parameter δ allows for
a small probability of
failure
∀S: Pr[M(D)∊S] ≤ exp(ε) ∙ Pr[M(D′)∊S]+δ.
Dwork, McSherry, Nissim, Smith [TCC 2006]; Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006]
Differential Privacy
[Figure: output distributions for f(D) and f(D′); legend: bad outcomes, probability with record x, probability without record x]
“Bad Outcomes” Interpretation
● Prior on databases p
● Observed output O
● Does the database contain record x?
Bayesian Interpretation
● Robustness to auxiliary data
● Post-processing:
If M(D) is differentially private, so is f(M(D)).
● Composability:
Run two ε-DP mechanisms. Full interaction is 2ε-DP.
● Group privacy:
Graceful degradation in the presence of
correlated inputs.
Differential Privacy
Differential Privacy: Takeaway points
• Privacy as a notion of stability of randomized algorithms with
respect to small perturbations in their input
• Worst-case definition
• Robust (to auxiliary data, correlated inputs)
• Composable
• Quantifiable
• Concept of a privacy budget
• Noise injection
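To make "noise injection", "composable", and "privacy budget" concrete, here is a minimal sketch. The Laplace mechanism is a standard technique that the slides do not name explicitly, so its use here is an assumption; `PrivacyBudget` and the sample data are hypothetical, and only basic additive composition is tracked.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Counting query (sensitivity 1) released with Laplace(1/epsilon) noise."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

class PrivacyBudget:
    """Tracks cumulative epsilon under basic (additive) composition."""
    def __init__(self, total):
        self.total = total
        self.spent = 0.0

    def charge(self, epsilon):
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38, 47, 31]   # made-up data
budget = PrivacyBudget(total=1.0)

budget.charge(0.5)
over_30 = private_count(ages, lambda a: a > 30, 0.5, rng)
budget.charge(0.5)
over_40 = private_count(ages, lambda a: a > 40, 0.5, rng)
# Two 0.5-DP releases compose to 1.0-DP; a third query would exceed the budget.
```

Post-processing is free: rounding or clamping `over_30` afterwards does not consume any additional budget.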
Case Studies
Google’s RAPPOR
London, 1854: Broad Street Cholera Outbreak
...Mountain View, 2014
Central Model
Curator
Local Model
Randomized response:
Collecting a sensitive Boolean
Developed in the 1960s for sensitive surveys
“Have you had an abortion?”
- flip a coin, in private
- if coin lands heads, respond “YES”
- if coin lands tails, respond with the truth
Unbiased estimate calculated as: 2 × (fraction of “YES” - ½ )
Randomized response:
Collecting a sensitive Boolean
Developed in the 1960s for sensitive surveys
“Have you had an abortion?”
- flip a coin, in private
- if coin lands heads, flip another coin to respond “YES” or “NO”
- if coin lands tails, respond with the truth
Unbiased estimate calculated as: 2 × (fraction of “YES” - ¼ )
Satisfies differential privacy
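A small simulation of the two-coin variant, as a sketch under stated assumptions: the population rate of 30% is made up, and the function names are illustrative. With two coins, P(YES) = ¼ + p/2 for a true rate p, so the estimator subtracts ¼ before doubling.

```python
import math
import random

def randomized_response(truth, rng):
    """Two-coin randomized response for a Boolean answer."""
    if rng.random() < 0.5:          # first coin: heads
        return rng.random() < 0.5   # second coin picks YES/NO at random
    return truth                    # first coin: tails, answer truthfully

def estimate_true_fraction(responses):
    """Invert P(YES) = 1/4 + p/2 to recover p."""
    yes_fraction = sum(responses) / len(responses)
    return 2 * (yes_fraction - 0.25)

rng = random.Random(42)
true_rate = 0.30                      # hypothetical population rate
truths = [rng.random() < true_rate for _ in range(100_000)]
responses = [randomized_response(t, rng) for t in truths]
estimate = estimate_true_fraction(responses)

# Worst-case likelihood ratio: P(YES | truth=YES) / P(YES | truth=NO) = 0.75 / 0.25
epsilon = math.log(0.75 / 0.25)       # = ln(3)
```

Each individual answer is deniable, yet the aggregate estimate concentrates around the true rate as the number of respondents grows.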
RAPPOR
Erlingsson, Pihur, Korolova. "RAPPOR: Randomized aggregatable privacy-preserving
ordinal response." ACM CCS 2014.
RAPPOR: two-level randomized response
Can we do repeated surveys of sensitive attributes?
— Average of randomized responses will reveal a user’s true answer :-(
Solution: Memoize! Re-use the same random answer
— Memoization can hurt privacy too! Long, random bit sequence can
be a unique tracking ID :-(
Solution: Use 2-levels! Randomize the memoized response
RAPPOR: two-level randomized response
● Store client value v into bloom filter B using hash functions
● Memoize a Permanent Randomized Response (PRR) B′
● Report an Instantaneous Randomized Response (IRR) S
RAPPOR: two-level randomized response
● Store client value v into bloom filter B using hash functions
● Memoize a Permanent Randomized Response (PRR) B′
● Report an Instantaneous Randomized Response (IRR) S
f = ½
q = ¾ , p = ½
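The three client-side steps can be sketched as follows, using the parameters on the slide (f = ½, q = ¾, p = ½). This is an illustration, not the production RAPPOR client: the SHA-256 based hashing, the 128-bit filter with 2 hash functions, and all function names are assumptions made for the example.

```python
import hashlib
import random

def bloom_bits(value, num_bits, num_hashes, cohort=0):
    """Set num_hashes positions of a num_bits-wide Bloom filter for value."""
    bits = [0] * num_bits
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{cohort}:{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % num_bits] = 1
    return bits

def permanent_rr(bits, f, rng):
    """PRR: with prob f/2 force 1, with prob f/2 force 0, else keep the bit."""
    out = []
    for b in bits:
        u = rng.random()
        out.append(1 if u < f / 2 else 0 if u < f else b)
    return out

def instantaneous_rr(prr, p, q, rng):
    """IRR: report 1 with prob q where the PRR bit is 1, with prob p where it is 0."""
    return [1 if rng.random() < (q if b else p) else 0 for b in prr]

rng = random.Random(7)
B = bloom_bits("www.google.com", num_bits=128, num_hashes=2)
B_prime = permanent_rr(B, f=0.5, rng=rng)               # memoized per value on the client
S = instantaneous_rr(B_prime, p=0.5, q=0.75, rng=rng)   # fresh for every report
```

With f = ½, a PRR bit is 1 with probability 0.75 when the Bloom bit is set and 0.25 otherwise; the IRR then reports 1 with probability 0.75 or 0.50 depending on the PRR bit.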
RAPPOR: Life of a report
Value
Bloom
Filter
PRR
IRR
“www.google.com”
Value
Bloom
Filter
PRR
IRR
“www.google.com”
P(1) =
0.25
P(1) =
0.75
RAPPOR: Life of a report
Value
Bloom
Filter
PRR
IRR
“www.google.com”
P(1) =
0.50
P(1) =
0.75
RAPPOR: Life of a report
Differential privacy of RAPPOR
● Permanent Randomized Response satisfies differential privacy at ε = 4 ln(3)
● Instantaneous Randomized Response has differential privacy at ε = ln(3)
Differential Privacy of RAPPOR:
Measurable privacy bounds
Each report offers differential privacy with
ε = ln(3)
Attacker’s guess goes from 0.1% → 0.3% in the worst case
Differential privacy even if attacker gets all reports (infinite data!!!)
Also… Base Rate Fallacy prevents attackers from finding needles in
haystacks
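The numbers above can be reproduced with a few lines of arithmetic. The sketch assumes the longitudinal bound from the RAPPOR paper, ε = 2h · ln((1 − f/2)/(f/2)) for h hash functions, and the standard Bayesian reading of ε-differential privacy (posterior odds grow by at most a factor of e^ε); the function names are illustrative.

```python
import math

def prr_epsilon(f, num_hashes):
    """Longitudinal bound from the RAPPOR paper: 2h * ln((1 - f/2) / (f/2))."""
    return 2 * num_hashes * math.log((1 - f / 2) / (f / 2))

def posterior_upper_bound(prior, epsilon):
    """Largest posterior an epsilon-DP output can induce from a given prior."""
    boosted = math.exp(epsilon) * prior
    return boosted / (boosted + (1 - prior))

eps_permanent = prr_epsilon(f=0.5, num_hashes=2)       # 4 * ln(3)
worst_case = posterior_upper_bound(prior=0.001, epsilon=math.log(3))
```

With a prior of 0.1% and ε = ln(3), the worst-case posterior stays just under 0.3%, which matches the slide's "0.1% → 0.3%".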
Cohorts
Bloom Filter: 2 bits out of 128 — too many false positives
[Figure: each user (e.g., user 0xA0FE91B76 reporting google.com) is assigned to one of cohorts 1, 2, ..., 128; each cohort uses its own hash functions (e.g., h2)]
Decoding RAPPOR
From Raw Counts to De-noised Counts
True bit counts, with no noise
De-noised RAPPOR reports
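De-noising a raw bit count amounts to inverting the expected effect of the two randomization levels. In the sketch below, p* and q* are the probabilities that a reported bit is 1 when the true Bloom bit is 0 or 1; they follow from the f, p, q definitions, while the function name and the sample counts are made up for illustration.

```python
def denoise_bit_counts(counts, num_reports, f, p, q):
    """Estimate how many clients truly had each Bloom bit set."""
    p_star = p + 0.5 * f * q - 0.5 * f * p      # P(report 1 | true bit 0)
    gain = (1 - f) * (q - p)                    # P(report 1 | true 1) - P(report 1 | true 0)
    return [(c - p_star * num_reports) / gain for c in counts]

# Expected raw counts if 1,000 of 10,000 clients set bit 0 and none set bit 1.
N, f, p, q = 10_000, 0.5, 0.5, 0.75
q_star = 0.5 * f * (p + q) + (1 - f) * q
p_star = 0.5 * f * (p + q) + (1 - f) * p
raw = [1_000 * q_star + 9_000 * p_star, N * p_star]
estimates = denoise_bit_counts(raw, N, f, p, q)
```

On real reports the raw counts fluctuate around these expectations, so the de-noised values are unbiased but noisy and can even be negative.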
From De-Noised Count to Distribution
True bit counts, with no noise
De-noised RAPPOR reports
google.com:
yahoo.com:
bing.com:
From De-Noised Count to Distribution
Linear Regression:
min_X ||B - A X||₂
LASSO:
min_X (||B - A X||₂)² + λ||X||₁
Hybrid:
1. Find support of X via LASSO
2. Solve linear regression to find weights
Deploying RAPPOR
Coverage
Explaining RAPPOR
“Having the cake and eating it too…”
“Seeing the forest without seeing the trees…”
Metaphor for RAPPOR
Microdata: An Individual’s Report
Microdata: An Individual’s Report
Each bit is flipped with probability 25%
Big Picture Remains!
Google Chrome Privacy White Paper
https://www.google.com/chrome/browser/privacy/whitepaper.html
Phishing and malware protection
Google Chrome includes an optional feature called "Safe Browsing" to help protect you against phishing and malware attacks. This
helps prevent evil-doers from tricking you into sharing personal information with them (“phishing”) or installing malicious software
on your computer (“malware”). The approach used to accomplish this was designed specifically to protect your privacy and is also
used by other popular browsers.
If you'd rather not send any information to Safe Browsing, you can also turn these features off. Please be aware that Chrome will no
longer be able to protect you from websites that try to steal your information or install harmful software if you disable this feature.
We really don't recommend turning it off.
…
If a URL was indeed dangerous, Chrome reports this anonymously to Google to improve Safe Browsing. The data sent is randomized,
constructed in a manner that ensures differential privacy, permitting only monitoring of aggregate statistics that apply to tens of
thousands of users at minimum. The reports are an instance of Randomized Aggregatable Privacy-Preserving Ordinal Responses,
whose full technical details have been published in a technical report and presented at the 2014 ACM Computer and Communications
Security conference. This means that Google cannot infer which website you have visited from this.
Developers’ Uptake
RAPPOR:
Lessons Learned
Growing Pains
● Transitioning from a research prototype to a real product
● Scalability
● Versioning
Communicating Uncertainty
Maintaining Candidates List
[Figure: decoded distributions with no missing candidates vs. three missing candidates; labeled values: 4%, 13%, 17%]
RAPPOR Metrics in Chrome
https://chromium.googlesource.com/chromium/src/+log/master/tools/metrics/rappor/rappor.xml
Open Source Efforts
https://github.com/google/rappor
- demo you can run with a couple
of shell commands
- client library in several languages
- analysis tool and simulation
- documentation
Follow-up
- Bassily, Smith, “Local, Private, Efficient Protocols for Succinct
Histograms,” STOC 2015
- Kairouz, Bonawitz, Ramage, “Discrete Distribution Estimation under
Local Privacy”, https://arxiv.org/abs/1602.07387
- Qin et al., “Heavy Hitter Estimation over Set-Valued Data with Local
Differential Privacy”, CCS 2016
Key takeaway points
RAPPOR - locally differentially-private mechanism for reporting of
categorical and string data
● First Internet-scale deployment of differential privacy
● Explainable
● Conservative
● Open-sourced
Apple's On-Device Differential
Privacy
Abhradeep Thakurta, UC Santa Cruz
Apple WWDC, June 2016
References
https://arxiv.org/abs/1709.02753
Phablet
Derp
Photobomb
Woot
Phablet
OMG
Woot
Troll
Prepone
Phablet
awwww
dp
Learning from private data
Learn new (and frequent) words typed
Learning from private data
Learn frequent emojis typed
Apple's On-Device Differential
Privacy: Discovering New Words
Roadmap
1. Private frequency estimation with count-min-sketch
2. Private heavy hitters with puzzle piece algorithm
3. Private heavy hitters with tree histogram protocol
Private Frequency Oracle
Private frequency oracle
Building block for private heavy hitters
Client data: d_1, d_2, ..., d_n
All errors within γ = O(√(n log|𝒮|))
[Figure: histogram of frequency over words (𝒮), with error band γ]
"phablet"
frequency("phablet")
Private frequency oracle:
Design constraints
Computational and communication constraints:
Client side:
size of the domain (|S|) and n
Communication to server:
very few bits
Server-side cost for one query:
size of the domain (|S|) and n
Private frequency oracle:
Design constraints
Computational and communication constraints:
Client side:
size of the domain (|S|) and n
# characters > 3,000
For 8-character words:
size of the domain |S|=3,000^8
number of clients ~ 1B
Efficiently [BS15] ~ n
Our goal ~ O(log |S|)
Private frequency oracle:
Design constraints
Computational and communication constraints:
Client side:
O(log |S|)
Communication to server:
O(1) bits
Server-side cost for one query:
O(log |S|)
Private frequency oracle
A starter solution: Randomized response
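[Figure: each client encodes its item d as a bit vector with a 1 at position i, then randomizes the bits to obtain d′; the server adds up all randomized vectors, with bias correction, to estimate the frequency of all domain elements]
Protects ε-differential privacy (with the right bias)
Error in each estimate: Θ(√(n log|𝒮|))
Optimal error under privacy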
Private frequency oracle
A starter solution: Randomized response
Computational and communication constraints:
Client side:
O(|S|)
Communication to server:
O(|S|) bits
Server-side cost for one query:
O(1)
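The starter solution can be sketched as per-bit randomized response on a one-hot vector. The flip probability 1/(1 + e^(ε/2)) is one standard choice of "the right bias" and is an assumption here, as are the function names and the toy data; two neighboring inputs differ in two bits, which is why ε/2 is spent per bit.

```python
import math
import random

def randomize_one_hot(item, domain_size, epsilon, rng):
    """One-hot encode item, then flip each bit independently.

    Neighboring inputs differ in two bits, so flipping with probability
    1 / (1 + e^(epsilon/2)) gives epsilon-local differential privacy.
    """
    flip = 1.0 / (1.0 + math.exp(epsilon / 2))
    return [(1 if i == item else 0) ^ (rng.random() < flip)
            for i in range(domain_size)]

def estimate_counts(reports, epsilon):
    """Bias-corrected per-item counts from the summed reports."""
    n = len(reports)
    flip = 1.0 / (1.0 + math.exp(epsilon / 2))
    sums = [sum(col) for col in zip(*reports)]
    return [(s - n * flip) / (1 - 2 * flip) for s in sums]

rng = random.Random(3)
items = [0] * 6000 + [1] * 3000 + [2] * 1000      # made-up client data
reports = [randomize_one_hot(x, 4, 2.0, rng) for x in items]
counts = estimate_counts(reports, 2.0)
```

The client work and the message size both grow linearly with the domain size, which is exactly the O(|S|) cost listed above and the reason this does not scale to word discovery.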
Private frequency oracle
A starter solution: Randomized response
[Figure: the client's item d is hashed into k one-hot rows, one per hash function]
Hash functions: h_1, h_2, ..., h_k
Number of hash bins: √n
Computation = O(log|𝒮|)
k ≈ log|𝒮|
Private frequency oracle
Non-private count-min sketch [CM05]
[Figure: the server adds every client's k hashed one-hot rows into a k-row table of counts (example cells: 245, 127, 9123, 2132)]
Reducing server computation
Private frequency oracle
Non-private count-min sketch [CM05]
Reducing server computation
[Figure: k-row table of counts with example cells 245, 127, 9123, 2132; “Phablet” hashes to the cells holding 9146, 2212, and 2132]
Frequency estimate:
min (9146, 2212, 2132)
Error in each estimate:
O(√(n log|𝒮|))
Server side query cost:
O(log|𝒮|)
k ≈ log|𝒮|
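For reference, a minimal non-private count-min sketch looks like this. It is a sketch under assumptions: the salted SHA-256 hashing, the table dimensions, and the word counts are all chosen for illustration.

```python
import hashlib

class CountMinSketch:
    """k hash rows of m counters; estimates never undercount."""
    def __init__(self, num_rows, num_bins):
        self.k, self.m = num_rows, num_bins
        self.table = [[0] * num_bins for _ in range(num_rows)]

    def _bin(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item, count=1):
        for j in range(self.k):
            self.table[j][self._bin(j, item)] += count

    def estimate(self, item):
        # Collisions only inflate a cell, so the minimum is the tightest bound.
        return min(self.table[j][self._bin(j, item)] for j in range(self.k))

sketch = CountMinSketch(num_rows=3, num_bins=64)
for word, count in [("phablet", 2132), ("woot", 90), ("derp", 37)]:
    sketch.add(word, count)
```

A query touches one cell per row, so its cost is proportional to the number of rows k rather than to the domain size.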
Private frequency oracle
Non-private count-min sketch [CM05]
"phablet"
Private frequency oracle
Private count-min sketch
Making client computation differentially private
[Figure: the client applies randomized response to each of its k hashed one-hot rows]
kε-differentially private, since k pieces of information are released
Private frequency oracle
Private count-min sketch
[Figure: the client samples one of its k rows and randomizes only that row]
Theorem: Sampling ensures ε-differential privacy without hurting accuracy;
rather, it improves accuracy by a factor of k
Private frequency oracle
Private count-min sketch
[Figure: the Hadamard transform maps the sampled one-hot row (0 0 1) to a vector of ±1 entries (+1 +1 -1)]
Reducing client communication
Private frequency oracle
Private count-min sketch
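[Figure: the Hadamard transform maps the sampled one-hot row to a vector of ±1 entries; the client sends one randomly chosen, randomized ±1 coefficient]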
Communication: 𝑂(1) bit
Theorem: Hadamard transform and sampling
do not hurt accuracy
Reducing client communication
Private frequency oracle
Private count-min sketch
Error in each estimate:
O(√(n log|𝒮|))
Computational and communication constraints:
Client side:
O(log |S|)
Communication to server:
O(1) bits
Server-side cost for one query:
O(log |S|)
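The pieces above (hashing, row sampling, Hadamard transform, one-bit randomized response) can be combined into a small end-to-end simulation. This is a sketch in the spirit of the slides, not Apple's implementation: the hash construction, the parameters k, m, and ε, the debiasing constant, and the collision correction are assumptions chosen so that the estimator is unbiased over random hash functions.

```python
import hashlib
import math
import random

def bin_of(row, item, m):
    """Deterministic stand-in for the j-th hash function h_j: item -> [0, m)."""
    digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def hadamard(l, v):
    """Entry (l, v) of the m x m Hadamard matrix, m a power of two."""
    return -1 if bin(l & v).count("1") % 2 else 1

def client_report(item, k, m, epsilon, rng):
    """Sample one row and one Hadamard coefficient; send a single noisy sign."""
    j, l = rng.randrange(k), rng.randrange(m)
    sign = hadamard(l, bin_of(j, item, m))
    if rng.random() < 1.0 / (math.exp(epsilon) + 1.0):
        sign = -sign                       # randomized response on one bit
    return j, l, sign

def build_sketch(reports, k, m, epsilon):
    """Debias the signs, accumulate them, and invert the Hadamard transform."""
    c = (math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)
    coeffs = [[0.0] * m for _ in range(k)]
    for j, l, sign in reports:
        coeffs[j][l] += k * c * sign
    return [[sum(row[l] * hadamard(l, u) for l in range(m)) for u in range(m)]
            for row in coeffs]

def estimate(sketch, item, n, k, m):
    """Average the item's cells over rows, then correct for hash collisions."""
    avg = sum(sketch[j][bin_of(j, item, m)] for j in range(k)) / k
    return (m / (m - 1)) * (avg - n / m)

rng = random.Random(11)
k, m, eps = 8, 128, 4.0
words = ["phablet"] * 10_000 + ["woot"] * 6_000 + ["derp"] * 4_000   # made-up data
reports = [client_report(w, k, m, eps, rng) for w in words]
sketch = build_sketch(reports, k, m, eps)
phablet_estimate = estimate(sketch, "phablet", len(words), k, m)
derp_estimate = estimate(sketch, "derp", len(words), k, m)
```

Each client sends a row index, a coefficient index, and one privatized bit, while the server answers a query by reading one cell per row.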
Roadmap
1. Private frequency estimation with count-min-sketch
2. Private heavy hitters with puzzle piece algorithm
3. Private heavy hitters with tree histogram protocol
Private heavy hitters:
Using the frequency oracle
Private frequency oracle
Private count-min sketch
Domain 𝒮
Too many elements in 𝒮 to search.
Element s in S
Frequency(s)
Find all s in S with
frequency > γ
Puzzle piece algorithm
(works well in practice, no theoretical guarantees)
[Bassily Nissim Stemmer Thakurta, 2017 and Apple differential privacy team, 2017]
Private heavy hitters
Ph ab le t$ Frequency > 𝛾
Each bi-gram frequency > 𝛾
Observation: If a word is frequent, its bigrams are frequent too.
Private heavy hitters
Sanitized
bi-grams, and the
complete word
Position P1 | Position P2 | Position P3 | Position P4
ab | ba | le | le
ad | ab | ab | ab
ph | ax | ab | t$
Frequent bi-grams
Natural algorithm: Cartesian product of frequent bi-grams
Private heavy hitters
Position P1 | Position P2 | Position P3 | Position P4
ab | ba | le | le
ad | ab | ab | ab
ph | ax | ab | t$
Frequent bi-grams Candidate words
P1 x P2 x P3 x P4
Private frequency oracle
Private count-min sketch
Find frequent
words
Natural algorithm: Cartesian product of frequent bi-grams
Private heavy hitters
Candidate words
P1 x P2 x P3 x P4
Private frequency oracle
Find frequent
words
Combinatorial explosion
In practice, all bi-grams are frequent
Natural algorithm: Cartesian product of frequent bi-grams
Private count-min sketch
Puzzle piece algorithm
Ph ab le t$
≜
h=Hash(Phablet)
Hash: 𝒮 → {1, … , ℓ}
(Ph, h) (ab, h) (le, h) (t$, h)
Privatized
bi-grams tagged
with the hash, and
the complete
word
Puzzle piece algorithm: Server side
Position P1 | Position P2 | Position P3 | Position P4
ab 1 | ba 4 | le 3 | le 1
ad 5 | ab 3 | le 7 | ab 9
Ph 3 | ax 9 | ab 1 | t$ 3
Frequent bi-grams tagged with {1, … , ℓ}
Candidate words
P1 x P2 x P3 x P4
Private frequency oracle
Find frequent
words
Combine only matching
bi-grams
Private count-min sketch
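The server-side join can be sketched as follows: instead of taking the full Cartesian product, only bi-grams that carry the same hash tag are combined. The privatization and the frequency-oracle check are omitted here, and the function name is hypothetical; the input mirrors the tagged bi-grams in the example.

```python
from collections import defaultdict
from itertools import product

def candidate_words(frequent_by_position):
    """Join frequent (bigram, tag) pairs across positions, matching on the tag."""
    by_tag = defaultdict(lambda: [[] for _ in frequent_by_position])
    for pos, pairs in enumerate(frequent_by_position):
        for bigram, tag in pairs:
            by_tag[tag][pos].append(bigram)
    words = set()
    for pieces in by_tag.values():
        for combo in product(*pieces):   # empty if any position lacks this tag
            words.add("".join(combo))
    return words

# Frequent tagged bi-grams per position, as in the example.
frequent = [
    [("ab", 1), ("ad", 5), ("Ph", 3)],
    [("ba", 4), ("ab", 3), ("ax", 9)],
    [("le", 3), ("le", 7), ("ab", 1)],
    [("le", 1), ("ab", 9), ("t$", 3)],
]
candidates = candidate_words(frequent)
```

Here the untagged product would contain 3 × 3 × 3 × 3 = 81 candidates, while matching on the tag leaves only the word assembled from the pieces tagged 3.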
Roadmap
1. Private frequency estimation with count-min-sketch
2. Private heavy hitters with puzzle piece algorithm
3. Private heavy hitters with tree histogram protocol
Tree histogram algorithm
(works well in practice + optimal theoretical guarantees)
[Bassily Nissim Stemmer Thakurta, 2017]
Private heavy hitters:
Tree histograms (based on [CM05])
1 0 0
Any string in 𝒮:
log |𝒮| bits
Idea: Construct prefixes of the heavy hitter bit by bit
Private heavy hitters:
Tree histograms
0 1
Private heavy hitters:
Tree histograms
0 1
Level 1: Frequent prefix of length 1
Use private frequency oracle
If a string is a heavy hitter, its prefixes are too.
Private heavy hitters:
Tree histograms
00 01 10 11
Private heavy hitters:
Tree histograms
Level 2: Frequent prefix of length two
Idea: Each level has ≈ √n heavy hitters
00 01 10 11
Private heavy hitters:
Tree histograms
Theorem: Finds all heavy hitters with frequency at least
O(√(n log|𝒮|))
Computational and communication constraints:
Client side:
O(log |S|)
Communication to server:
O(1) bits
Server-side computation:
O(n log |S|)
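The level-by-level prefix search can be sketched as below. To keep the example short, the private frequency oracle is replaced by an exact, non-private prefix counter; that substitution, the threshold, and the toy data are all assumptions for illustration.

```python
def prefix_frequency(data, prefix):
    """Stand-in oracle: exact count of strings that start with prefix."""
    return sum(1 for s in data if s.startswith(prefix))

def tree_heavy_hitters(data, num_bits, threshold, oracle=prefix_frequency):
    """Grow frequent prefixes one bit per level; prune the rest."""
    frontier = [""]
    for _ in range(num_bits):
        frontier = [p + b for p in frontier for b in "01"
                    if oracle(data, p + b) >= threshold]
    return frontier

# Made-up 4-bit strings held by 11 clients.
data = ["1001"] * 5 + ["1011"] * 4 + ["0110"] + ["0000"]
heavy = tree_heavy_hitters(data, num_bits=4, threshold=3)
```

Because a heavy hitter's prefixes are heavy too, pruning never discards a true heavy hitter, and the frontier stays small at every level.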
Key takeaway points
• Keeping local differential privacy constant:
•One low-noise report is better than many noisy ones
•Weak signal with probability 1 is better than strong signal with small probability
• We can learn the dictionary – at a cost
• Longitudinal privacy remains a challenge
NIPS 2017
Microsoft: Discretization of continuous variables
"These guarantees are particularly strong when user’s behavior remains
approximately the same, varies slowly, or varies around a small number of
values over the course of data collection."
Microsoft's deployment
"Our mechanisms have been deployed by Microsoft
across millions of devices ... to protect users’ privacy
while collecting application usage statistics."
B. Ding, J. Kulkarni, S. Yekhanin, NIPS 2017
More details: today, 1pm-5pm
T12: Privacy at Scale: Local Differential Privacy in
Practice, G. Cormode, T. Kulkarni, N. Li, T. Wang
LinkedIn Salary
Outline
• LinkedIn Salary Overview
• Challenges: Privacy, Modeling
• System Design & Architecture
• Privacy vs. Modeling Tradeoffs
LinkedIn Salary (launched in Nov, 2016)
Salary Collection Flow via Email Targeting
Current Reach (August 2018)
• A few million responses out of several millions of members targeted
• Targeted via emails since early 2016
• Countries: US, CA, UK, DE, IN, …
• Insights available for a large fraction of US monthly active users
Data Privacy Challenges
• Minimize the risk of inferring any one individual’s compensation data
• Protection against data breach
• No single point of failure
Achieved by a combination of
techniques: encryption, access control,
de-identification, aggregation,
thresholding
K. Kenthapadi, A. Chudhary, and S.
Ambler, LinkedIn Salary: A System
for Secure Collection and
Presentation of Structured
Compensation Insights to Job
Seekers, IEEE PAC 2017
(arxiv.org/abs/1705.06976)
Modeling Challenges
• Evaluation
• Modeling on de-identified data
• Robustness and stability
• Outlier detection
X. Chen, Y. Liu, L. Zhang, and K.
Kenthapadi, How LinkedIn
Economic Graph Bonds
Information and Product:
Applications in LinkedIn Salary,
KDD 2018
(arxiv.org/abs/1806.09063)
K. Kenthapadi, S. Ambler,
L. Zhang, and D. Agarwal,
Bringing salary transparency to
the world: Computing robust
compensation insights via
LinkedIn Salary, CIKM 2017
(arxiv.org/abs/1703.09845)
Problem Statement
• How do we design the LinkedIn Salary system, taking into
account the unique privacy and security challenges,
while addressing the product requirements?
Differential Privacy? [Dwork et al, 2006]
• Rich privacy literature (Adam-Wortmann, Samarati-Sweeney, Agrawal-Srikant, …,
Kenthapadi et al, Machanavajjhala et al, Li et al, Dwork et al)
• Limitation of anonymization techniques (as discussed in the first part)
• Worst case sensitivity of quantiles to any one user’s compensation data is
large
• ⇒ Large noise must be added, hurting reliability/usefulness
• Need compensation insights on a continual basis
• Theoretical work on applying differential privacy under continual observations
• No practical implementations / applications
• Local differential privacy / Randomized response based approaches (Google’s RAPPOR; Apple’s
iOS differential privacy; Microsoft’s telemetry collection) not applicable
Title | Region | $$
User Exp Designer | SF Bay Area | 100K
User Exp Designer | SF Bay Area | 115K
... | ... | ...

Title | Region | $$
User Exp Designer | SF Bay Area | 100K
De-identification Example
Original submission:
Title | Region | Company | Industry | Years of exp | Degree | FoS | Skills | $$
User Exp Designer | SF Bay Area | Google | Internet | 12 | BS | Interactive Media | UX, Graphics, ... | 100K

De-identified slices:
Title | Region | Industry | $$
User Exp Designer | SF Bay Area | Internet | 100K

Title | Region | Years of exp | $$
User Exp Designer | SF Bay Area | 10+ | 100K

Title | Region | Company | Years of exp | $$
User Exp Designer | SF Bay Area | Google | 10+ | 100K

#data points > threshold? Yes ⇒ Copy to Hadoop (HDFS)
Note: Original submission stored as encrypted objects.
System
Architecture
Collection
&
Storage
Collection & Storage
• Allow members to submit their compensation info
• Extract member attributes
• E.g., canonical job title, company, region, by invoking LinkedIn standardization services
• Securely store member attributes & compensation data
De-identification
&
Grouping
De-identification & Grouping
• Approach inspired by k-Anonymity [Samarati-Sweeney]
• “Cohort” or “Slice”
• Defined by a combination of attributes
• E.g, “User experience designers in SF Bay Area”
• Contains aggregated compensation entries from corresponding individuals
• No user name, id or any attributes other than those that define the cohort
• A cohort available for offline processing only if it has at least k entries
• Apply LinkedIn standardization software (free-form attribute → canonical version)
before grouping
• Analogous to the generalization step in k-Anonymity
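The cohort rule can be sketched in a few lines: group submissions by the attributes that define the cohort, keep only the compensation values, and release a cohort only once it has at least k entries. The field names, the value of k, and the sample records are hypothetical and not taken from the LinkedIn system.

```python
from collections import defaultdict

def build_cohorts(submissions, cohort_keys, k):
    """Group salaries by cohort attributes; release only cohorts with >= k entries."""
    cohorts = defaultdict(list)
    for sub in submissions:
        key = tuple(sub[a] for a in cohort_keys)
        cohorts[key].append(sub["salary"])   # drop every other attribute
    return {key: sorted(vals) for key, vals in cohorts.items() if len(vals) >= k}

# Made-up submissions with already-standardized attributes.
submissions = [
    {"member_id": 1, "title": "User Exp Designer", "region": "SF Bay Area", "salary": 100_000},
    {"member_id": 2, "title": "User Exp Designer", "region": "SF Bay Area", "salary": 115_000},
    {"member_id": 3, "title": "User Exp Designer", "region": "SF Bay Area", "salary": 108_000},
    {"member_id": 4, "title": "Data Scientist", "region": "Seattle", "salary": 120_000},
]
released = build_cohorts(submissions, ["title", "region"], k=3)
```

The single Seattle entry is withheld because its cohort is below the threshold, and no member identifier survives in the released cohort.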
De-identification & Grouping
• Slicing service
• Access member attribute info &
submission identifiers (no
compensation data)
• Generate slices & track #
submissions for each slice
• Preparation service
• Fetch compensation data (using
submission identifiers), associate
with the slice data, copy to HDFS
Insights
&
Modeling
Insights & Modeling
• Salary insight service
• Check whether the member is
eligible
• Give-to-get model
• If yes, show the insights
• Offline workflow
• Consume de-identified HDFS
dataset
• Compute robust compensation
insights
• Outlier detection
• Bayesian smoothing/inference
• Populate the insight key-value
stores
Security
Mechanisms
Security
Mechanisms
• Encryption of
member attributes
& compensation
data using different
sets of keys
• Separation of
processing
• Limiting access to
the keys
Security
Mechanisms
• Key rotation
• No single point of
failure
• Infra security
Preventing Timestamp Join based Attacks
• Inference attack by joining these on timestamp
• De-identified compensation data
• Page view logs (when a member accessed compensation collection web interface)
• ⇒ Not desirable to retain the exact timestamp
• Perturb by adding random delay (say, up to 48 hours)
• Modification based on k-Anonymity
• Generalization using a hierarchy of timestamps
• But, need to be incremental
• ⇒ Process entries within a cohort in batches of size k
• Generalize to a common timestamp
• Make additional data available only in such incremental batches
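The incremental batching idea can be sketched as follows. Entries of one cohort are released only in full batches of k, and every entry in a batch receives the same timestamp. Using the latest timestamp in the batch as the common value is an assumption made for this example, as are the function name and the data.

```python
def release_in_batches(entries, k):
    """Release a cohort's entries in full batches of k with one shared timestamp.

    entries: list of (timestamp, salary), in arrival order.
    Returns (released, pending); pending entries wait for the next full batch.
    """
    released = []
    full = len(entries) - len(entries) % k
    for start in range(0, full, k):
        batch = entries[start:start + k]
        shared_ts = max(ts for ts, _ in batch)   # assumption: use the batch's latest time
        released.extend((shared_ts, salary) for _, salary in batch)
    return released, entries[full:]

# Made-up (hour, salary) arrivals for one cohort.
arrivals = [(1, 100), (5, 115), (9, 108), (20, 97), (31, 122)]
released, pending = release_in_batches(arrivals, k=3)
```

A join on timestamp against page-view logs now matches at least k entries at once, at the cost of delaying the two pending entries until a third one arrives.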
Privacy vs Modeling Tradeoffs
• LinkedIn Salary system deployed in production for ~2.5 years
• Study tradeoffs between privacy guarantees (‘k’) and data available for
computing insights
• Dataset: Compensation submission history from 1.5M LinkedIn members
• Amount of data available vs. minimum threshold, k
• Effect of processing entries in batches of size, k
Amount of
data
available vs.
threshold, k
Percent of
data available
vs. batch size,
k
Median delay
due to
batching vs.
batch size, k
Key takeaway points
• LinkedIn Salary: a new internet application, with
unique privacy/modeling challenges
• Privacy vs. Modeling Tradeoffs
• Potential directions
• Privacy-preserving machine learning models in a practical setting
[e.g., Chaudhuri et al, JMLR 2011; Papernot et al, ICLR 2017]
• Provably private submission of compensation entries?
Beyond Randomized Response
Beyond Randomized Response
• LDP + Machine Learning:
• "Is interaction necessary for distributed private learning?"
Smith, Thakurta, Upadhyay, S&P 2017
• Federated Learning
• Encode-Shuffle-Analyze architecture
"Prochlo: Strong Privacy for Analytics in the Crowd"
Bittau et al., SOSP 2017
• Amplification by Shuffling
LDP + Machine Learning
Interactivity as a major implementation constraint
...
parallel
LDP + Machine Learning
Interactivity as a major implementation constraint
...
sequential
"Is interaction necessary for distributed private
learning?" [STU 2017]
• Single parameter learning (e.g., median):
• Maximal accuracy with full parallelism
• Multi-parameter learning:
• Polylog number of iterations
• Lower bounds
Federated Learning
"Practical secure aggregation for privacy-preserving machine learning"
Bonawitz, Ivanov, Kreuter, Marcedone, McMahan, Patel, Ramage, Segal,
Seth, ACM CCS 2017
PROCHLO:
Strong Privacy for Analytics in the Crowd
Bittau, Erlingsson, Maniatis, Mironov, Raghunathan,
Lie, Rudominer, Kode, Tinnes, Seefeld
SOSP 2017
The ESA Architecture and Its Prochlo realization
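[Figure: several encoders (E) send reports to a shuffler (S), which forwards them to an analyzer (A) that computes aggregates (Σ)]
ESA: Encode, Shuffle, Analyze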
Prochlo: A hardened ESA realization using Intel's SGX + crypto
[Figure: ESA pipeline with multiple shufflers (S) and analyzers (A). Encode: local DP. Shuffle: unlinkability, randomized thresholding. Analyze: central DP.]
Key takeaway points
• Notion of differential privacy is a principled foundation for privacy-
preserving data analyses
• Local differential privacy is a powerful technique appropriate for
Internet-scale telemetry
• Other techniques (thresholding, shuffling) can be combined with
differentially private algorithms or be used in isolation.
References
Differential privacy:
review "A Firm Foundation For Private Data Analysis", C. ACM 2011
by Dwork
book "The Algorithmic Foundations of Differential Privacy"
by Dwork and Roth
References
Google's RAPPOR:
paper "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal
Response", ACM CCS 2014, Erlingsson, Pihur, Korolova
blog
Apple's implementation:
article "Learning with Privacy at Scale", Apple ML J., Dec 2017
paper "Practical Locally Private Heavy Hitters", NIPS 2017,
by Bassily, Nissim, Stemmer, Thakurta
paper "Privacy Loss in Apple's Implementation of Differential Privacy on
MacOS 10.12" by Tang, Korolova, Bai, Wang, Wang
LinkedIn Salary:
paper "LinkedIn Salary: A System for Secure Collection and Presentation of
Structured Compensation Insights to Job Seekers", IEEE PAC 2017,
Kenthapadi, Chudhary, Ambler
blog
Thanks & Questions
• Tutorial website: https://sites.google.com/view/kdd2018privacytutorial
• Feedback most welcome
• kkenthapadi@linkedin.com, mironov@google.com
• Related KDD’18 tutorial this afternoon (1pm-5pm):
T12: Privacy at Scale: Local Differential Privacy in Practice, G. Cormode,
T. Kulkarni, N. Li, T. Wang

More Related Content

What's hot

Generative AI Use-cases for Enterprise - First Session
Generative AI Use-cases for Enterprise - First SessionGenerative AI Use-cases for Enterprise - First Session
Generative AI Use-cases for Enterprise - First SessionGene Leybzon
 
Responsible Data Use in AI - core tech pillars
Responsible Data Use in AI - core tech pillarsResponsible Data Use in AI - core tech pillars
Responsible Data Use in AI - core tech pillarsSofus Macskássy
 
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Edureka!
 
Overview of blockchain technology and architecture
Overview of blockchain technology and   architectureOverview of blockchain technology and   architecture
Overview of blockchain technology and architectureEY
 
Orange3 widget basic
Orange3 widget basicOrange3 widget basic
Orange3 widget basicYOONSEOK JANG
 
How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?Linkurious
 
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...Edureka!
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systemsXavier Amatriain
 
Lessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsLessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsXavier Amatriain
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & AnalysisScott Sanders
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
POLE Investigations with Neo4j
POLE Investigations with Neo4jPOLE Investigations with Neo4j
POLE Investigations with Neo4jNeo4j
 

Privacy-preserving Data Mining in Industry: Practical Challenges and Lessons Learned

  • 1. Privacy-preserving Data Mining in Industry: Practical Challenges and Lessons Learned KDD 2018 Tutorial August 2018 Krishnaram Kenthapadi (AI @ LinkedIn) Ilya Mironov (Google AI) Abhradeep Thakurta (UC Santa Cruz) https://sites.google.com/view/kdd2018privacytutorial
  • 2. Outline / Learning Outcomes • Privacy breaches and lessons learned • Evolution of privacy techniques • Differential privacy: definition and techniques • Privacy techniques in practice: Challenges and Lessons Learned • Google’s RAPPOR • Apple’s differential privacy deployment for iOS • LinkedIn Salary: Privacy Design • Key Takeaways
  • 3. Privacy: A Historical Perspective Evolution of Privacy Techniques and Privacy Breaches
  • 4. Privacy Breaches and Lessons Learned Attacks on privacy •Governor of Massachusetts •AOL •Netflix •Web browsing data •Facebook •Amazon •Genomic data
  • 5. born July 31, 1945 resident of 02138 Massachusetts Group Insurance Commission (1997): Anonymized medical history of state employees (all hospital visits, diagnosis, prescriptions) Latanya Sweeney (MIT grad student): $20 – Cambridge voter roll William Weld vs Latanya Sweeney
  • 6. 64% uniquely identifiable with ZIP + birth date + gender (in the US population) Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”,
  • 8. August 4, 2006: AOL Research publishes anonymized search logs of 650,000 users August 9: New York Times AOL Data Release
  • 9. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs
  • 11. Oct 2006: Netflix announces Netflix Prize • 10% of their users • average 200 ratings per user Narayanan, Shmatikov (2006): Netflix Prize
  • 12. Deanonymizing Netflix Data Narayanan, Shmatikov, Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset), 2008
  • 13. ● Noam Chomsky in Our Times ● Fahrenheit 9/11 ● Jesus of Nazareth ● Queer as Folk
  • 14. Key idea: ● Similar intuition as the attack on medical records ● Medical records: Each person can be identified based on a combination of a few attributes ● Web browsing history: Browsing history is unique for each person ● Each person has a distinctive social network → the set of links appearing in one’s feed is unique ● Users are likely to visit links in their feed with higher probability than a random user ● “Browsing histories contain tell-tale marks of identity” Su et al, De-anonymizing Web Browsing Data with Social Networks, 2017 De-anonymizing Web Browsing Data with Social Networks
  • 15. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality
  • 16. Ad targeting: Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM Privacy Attacks On Ad Targeting
  • 17. 10 campaigns targeting 1 person (zip code, gender, workplace, alma mater) Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM Facebook vs Korolova Ad impressions in a week, by age: 21: 0; 22: 0; 23: 8; …; 30: 0
  • 18. 10 campaigns targeting 1 person (zip code, gender, workplace, alma mater) Korolova, “Privacy Violations Using Microtargeted Ads: A Case Study”, PADM Facebook vs Korolova Ad impressions in a week, by interest: A: 0; B: 0; C: 8; …; Z: 0
  • 19. ● Context: Microtargeted Ads ● Takeaway: Attackers can instrument ad campaigns to identify individual users. ● Two types of attacks: ○ Inference from Impressions ○ Inference from Clicks Facebook vs Korolova: Recap
  • 20. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality Active
  • 21. Items frequently bought together Bought: A B C D E Z: Customers Who Bought This Item Also Bought Calandrino, Kilzer, Narayanan, Felten, Shmatikov, “You Might Also Like: Privacy Risks of Collaborative Attacking Amazon.com A C D E
  • 22. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality Active Observant
  • 23. Genetic data Homer et al., “Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays”, PLoS Genetics, 2008
  • 26. “In all mixtures, the identification of the presence of a person’s genomic DNA was possible.”
  • 27. Zerhouni, NIH Director: “As a result, the NIH has removed from open-access databases the aggregate results (including P values and genotype counts) for all the GWAS that had been available on NIH sites” … one week later
  • 28. Attacker's Advantage Auxiliary information Enough to succeed on a small fraction of inputs High dimensionality Active Observant Clever
  • 31. Defining Privacy [Figure: two curators, each holding a database labeled “Your data in the database”]
  • 32. Defining Privacy Intuition: ● A member’s privacy is preserved if … ○ “The released result would nearly be the same, whether or not the user’s information is taken into account” ● An attacker gains very little additional knowledge about any specific member from the published result
  • 33. Differential Privacy Databases D and D′ are neighbors if they differ in one person’s data. Differential Privacy [DMNS06]: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′).
  • 34. Differential Privacy ε-Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Parameter ε quantifies information leakage: ∀S: Pr[M(D) ∈ S] ≤ exp(ε) ∙ Pr[M(D′) ∈ S]. Dwork, McSherry, Nissim, Smith [TCC 2006]
  • 35. Differential Privacy (ε, δ)-Differential Privacy: The distribution of the curator’s output M(D) on database D is (nearly) the same as M(D′). Parameter ε quantifies information leakage; parameter δ allows for a small probability of failure: ∀S: Pr[M(D) ∈ S] ≤ exp(ε) ∙ Pr[M(D′) ∈ S] + δ. Dwork, McSherry, Nissim, Smith [TCC 2006]; Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006]
  • 36. “Bad Outcomes” Interpretation [Figure: output distributions f(D) and f(D′), marking the bad outcomes and comparing their probability with record x and without record x]
  • 37. Bayesian Interpretation ● Prior on databases p ● Observed output O ● Does the database contain record x?
  • 38. Differential Privacy ● Robustness to auxiliary data ● Post-processing: If M(D) is differentially private, so is f(M(D)). ● Composability: Run two ε-DP mechanisms. Full interaction is 2ε-DP. ● Group privacy: Graceful degradation in the presence of correlated inputs.
  • 39. Differential Privacy: Takeaway points • Privacy as a notion of stability of randomized algorithms with respect to small perturbations in their input • Worst-case definition • Robust (to auxiliary data, correlated inputs) • Composable • Quantifiable • Concept of a privacy budget • Noise injection
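The composability and privacy-budget points can be checked directly from the ε-DP inequality. The derivation below is a sketch under simplifying assumptions that are not stated on the slides: outputs are discrete, and the two mechanisms M_1 and M_2 use independent randomness and do not adapt to each other's output.

```latex
\begin{align*}
\Pr[M_1(D)=a,\ M_2(D)=b]
  &= \Pr[M_1(D)=a]\cdot\Pr[M_2(D)=b] \\
  &\le e^{\varepsilon_1}\Pr[M_1(D')=a]\cdot e^{\varepsilon_2}\Pr[M_2(D')=b] \\
  &= e^{\varepsilon_1+\varepsilon_2}\,\Pr[M_1(D')=a,\ M_2(D')=b].
\end{align*}
% Group privacy: if D and D'' differ in k people, walk through a chain of
% neighbors D = D_0, D_1, \dots, D_k = D'' and apply the definition k times:
\[
\Pr[M(D)\in S] \;\le\; e^{k\varepsilon}\,\Pr[M(D'')\in S].
\]
```

With ε_1 = ε_2 = ε the first bound gives the 2ε figure quoted for two mechanisms. Read as an accounting rule, the "privacy budget" is the running sum of the ε values spent on one dataset.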
  • 42. London, 1854: Broad Street Cholera Outbreak
  • 47. Randomized response: Collecting a sensitive Boolean Developed in the 1960s for sensitive surveys “Have you had an abortion?” - flip a coin, in private - if coin lands heads, respond “YES” - if coin lands tails, respond with the truth Unbiased estimate calculated as: 2 × (fraction of “YES” - ½ )
  • 48. Randomized response: Collecting a sensitive Boolean Developed in the 1960s for sensitive surveys “Have you had an abortion?” - flip a coin, in private - if coin lands heads, flip another coin to respond “YES” or “NO” - if coin lands tails, respond with the truth Unbiased estimate calculated as: 2 × (fraction of “YES” - ¼) Satisfies differential privacy
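A minimal simulation of the two-coin variant may help make the estimator concrete. It assumes fair coins, under which a respondent answers "YES" with probability ¼ + p/2 when the true fraction is p, so the server inverts with 2 × (fraction of "YES" − ¼). The function names are illustrative only.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Two-coin randomized response for one sensitive Boolean."""
    if rng.random() < 0.5:          # first coin lands heads
        return rng.random() < 0.5   # second coin picks YES or NO
    return truth                    # first coin lands tails: answer truthfully


def estimate_true_fraction(responses: list[bool]) -> float:
    """Invert P(YES) = 1/4 + p/2 to get an unbiased estimate of p."""
    yes_fraction = sum(responses) / len(responses)
    return 2.0 * (yes_fraction - 0.25)


def simulate(n: int, true_fraction: float, seed: int = 0) -> float:
    """Draw n respondents, randomize their answers, and estimate p."""
    rng = random.Random(seed)
    truths = [rng.random() < true_fraction for _ in range(n)]
    responses = [randomized_response(t, rng) for t in truths]
    return estimate_true_fraction(responses)
```

Under these assumptions a "YES" is reported with probability ¾ when the truth is "YES" and ¼ when it is "NO", a ratio of 3, which corresponds to ε = ln(3).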
  • 49. RAPPOR Erlingsson, Pihur, Korolova. "RAPPOR: Randomized aggregatable privacy-preserving ordinal response." ACM CCS 2014.
  • 50. RAPPOR: two-level randomized response Can we do repeated surveys of sensitive attributes? — Average of randomized responses will reveal a user’s true answer :-( Solution: Memoize! Re-use the same random answer — Memoization can hurt privacy too! Long, random bit sequence can be a unique tracking ID :-( Solution: Use 2-levels! Randomize the memoized response
  • 51. RAPPOR: two-level randomized response ● Store client value v into bloom filter B using hash functions ● Memoize a Permanent Randomized Response (PRR) B′ ● Report an Instantaneous Randomized Response (IRR) S
  • 52. RAPPOR: two-level randomized response ● Store client value v into bloom filter B using hash functions ● Memoize a Permanent Randomized Response (PRR) B′ ● Report an Instantaneous Randomized Response (IRR) S Parameters: f = ½, q = ¾, p = ½
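The three client-side steps can be sketched as follows. This is an illustration, not the RAPPOR client library: the SHA-256 based Bloom hashing, the `RapporClient` class, and the 128-bit filter with 2 hash functions are assumptions made for the example. The per-bit rules follow the RAPPOR paper: the PRR sets a bit to 1 with probability f/2, to 0 with probability f/2, and keeps it otherwise; the IRR reports 1 with probability q if the PRR bit is 1 and p if it is 0.

```python
import hashlib
import random

K = 128                     # Bloom filter size in bits (assumed)
H = 2                       # number of hash functions (assumed)
F, P, Q = 0.5, 0.5, 0.75    # f, p, q as on the slide


def bloom_bits(value: str, cohort: int) -> list[int]:
    """Hypothetical Bloom encoding: H SHA-256 based hashes per cohort."""
    bits = [0] * K
    for i in range(H):
        digest = hashlib.sha256(f"{cohort}:{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % K] = 1
    return bits


def permanent_rr(bits: list[int], rng: random.Random) -> list[int]:
    """PRR: 1 w.p. f/2, 0 w.p. f/2, otherwise keep the true bit."""
    out = []
    for b in bits:
        u = rng.random()
        out.append(1 if u < F / 2 else 0 if u < F else b)
    return out


def instantaneous_rr(prr: list[int], rng: random.Random) -> list[int]:
    """IRR: report 1 w.p. q if the PRR bit is 1, w.p. p if it is 0."""
    return [1 if rng.random() < (Q if b else P) else 0 for b in prr]


class RapporClient:
    """Memoizes one PRR per value and draws a fresh IRR for every report."""

    def __init__(self, cohort: int, seed: int):
        self.cohort = cohort
        self.rng = random.Random(seed)
        self.memo: dict[str, list[int]] = {}

    def report(self, value: str) -> list[int]:
        if value not in self.memo:
            bits = bloom_bits(value, self.cohort)
            self.memo[value] = permanent_rr(bits, self.rng)
        return instantaneous_rr(self.memo[value], self.rng)
```

Memoizing the PRR keeps repeated reports from averaging out the noise, while the fresh IRR keeps the memoized bit string from acting as a stable tracking identifier.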
  • 53. RAPPOR: Life of a report Value Bloom Filter PRR IRR “www.google.com”
  • 56. Differential privacy of RAPPOR ● Permanent Randomized Response satisfies differential privacy at ε = 4 ln(3) ● Instantaneous Randomized Response has differential privacy at ε = ln(3)
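The 4 ln(3) figure for the Permanent Randomized Response can be reproduced from the bound given in the RAPPOR paper (Erlingsson, Pihur, Korolova, CCS 2014), assuming h = 2 Bloom-filter hash functions and f = ½. The symbol ε_∞ and the value of h are taken from that paper's setup rather than from this slide.

```latex
\[
\varepsilon_\infty
  \;=\; 2h\,\ln\!\left(\frac{1-\tfrac{1}{2}f}{\tfrac{1}{2}f}\right)
  \;=\; 2\cdot 2\cdot\ln\!\left(\frac{3/4}{1/4}\right)
  \;=\; 4\ln 3 \;\approx\; 4.39 .
\]
```

The factor 2h appears because two values can differ in at most 2h Bloom-filter bits. The per-report bound for the Instantaneous Randomized Response also depends on p and q and is not derived here.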
  • 57. Differential Privacy of RAPPOR: Measurable privacy bounds Each report offers differential privacy with ε = ln(3) Attacker’s guess goes from 0.1% → 0.3% in the worst case Differential privacy even if attacker gets all reports (infinite data!!!) Also… Base Rate Fallacy prevents attackers from finding needles in haystacks
  • 58. Cohorts Bloom filter: 2 bits out of 128, too many false positives [Figure: user 0xA0FE91B76 reporting google.com is mapped to one of cohort 1, cohort 2, ..., cohort 128; hash function h2 shown]
  • 60. From Raw Counts to De-noised Counts True bit counts, with no noise De-noised RAPPOR reports
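One way to go from raw to de-noised counts is to invert the expected per-bit reporting probabilities. The sketch below assumes the two-level scheme with parameters f, p, q: a bit whose true value is 0 is reported as 1 with probability p* = p + ½f(q − p), and a bit whose true value is 1 with probability q* = q − ½f(q − p). The function name is illustrative.

```python
def denoise_bit_counts(counts, num_reports, f=0.5, p=0.5, q=0.75):
    """Estimate, per Bloom bit, how many reports truly had that bit set.

    counts[i] is the number of reports in which bit i was 1.
    Solves E[count] = t * q_star + (N - t) * p_star for t.
    """
    p_star = p + 0.5 * f * (q - p)   # P(reported 1 | true bit is 0)
    q_star = q - 0.5 * f * (q - p)   # P(reported 1 | true bit is 1)
    return [(c - p_star * num_reports) / (q_star - p_star) for c in counts]
```

For example, with 1,000 reports of which 200 truly set a bit, the expected raw count is 200 × 0.6875 + 800 × 0.5625 = 587.5, and the function maps 587.5 back to 200. Estimates from real reports are noisy and can fall below zero.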
  • 61. From De-Noised Count to Distribution True bit counts, with no noise De-noised RAPPOR reports google.com: yahoo.com: bing.com:
  • 62. From De-Noised Count to Distribution Linear regression: min_X ‖B − A X‖₂ LASSO: min_X (‖B − A X‖₂)² + λ‖X‖₁ Hybrid: 1. Find support of X via LASSO 2. Solve linear regression to find weights
  • 65. Explaining RAPPOR “Having the cake and eating it too…” “Seeing the forest without seeing the trees…”
  • 68. Microdata: An Individual’s Report Each bit is flipped with probability 25%
  • 70. Google Chrome Privacy White Paper https://www.google.com/chrome/browser/privacy/whitepaper.html Phishing and malware protection Google Chrome includes an optional feature called "Safe Browsing" to help protect you against phishing and malware attacks. This helps prevent evil-doers from tricking you into sharing personal information with them (“phishing”) or installing malicious software on your computer (“malware”). The approach used to accomplish this was designed specifically to protect your privacy and is also used by other popular browsers. If you'd rather not send any information to Safe Browsing, you can also turn these features off. Please be aware that Chrome will no longer be able to protect you from websites that try to steal your information or install harmful software if you disable this feature. We really don't recommend turning it off. … If a URL was indeed dangerous, Chrome reports this anonymously to Google to improve Safe Browsing. The data sent is randomized, constructed in a manner that ensures differential privacy, permitting only monitoring of aggregate statistics that apply to tens of thousands of users at minimum. The reports are an instance of Randomized Aggregatable Privacy-Preserving Ordinal Responses, whose full technical details have been published in a technical report and presented at the 2014 ACM Computer and Communications Security conference. This means that Google cannot infer which website you have visited from this.
  • 73. Growing Pains ● Transitioning from a research prototype to a real product ● Scalability ● Versioning
  • 75. Maintaining Candidates List No missing candidates Three missing candidates 4% 13% 17%
  • 76. RAPPOR Metrics in Chrome https://chromium.googlesource.com/chromium/src/+log/master/tools/metrics/rappor/rappor.xml
  • 77. Open Source Efforts https://github.com/google/rappor - demo you can run with a couple of shell commands - client library in several languages - analysis tool and simulation - documentation
  • 78. Follow-up - Bassily, Smith, “Local, Private, Efficient Protocols for Succinct Histograms,” STOC 2015 - Kairouz, Bonawitz, Ramage, “Discrete Distribution Estimation under Local Privacy”, https://arxiv.org/abs/1602.07387 - Qin et al., “Heavy Hitter Estimation over Set-Valued Data with Local Differential Privacy”, CCS 2016
  • 79. Key takeaway points RAPPOR - locally differentially-private mechanism for reporting of categorical and string data ● First Internet-scale deployment of differential privacy ● Explainable ● Conservative ● Open-sourced
  • 84. Learning from private data Learn frequent emojis typed
  • 86. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  • 88. Private frequency oracle Building block for private heavy hitters Clients hold d_1, d_2, …, d_n; for any word in 𝒮 (e.g., "phablet") the oracle returns frequency("phablet") All errors within γ = O(√(n log|𝒮|))
  • 89. Private frequency oracle: Design constraints Computational and communication constraints: Client side: size of the domain (|S|) and n Communication to server: very few bits Server-side cost for one query: size of the domain (|S|) and n
  • 90. Private frequency oracle: Design constraints Computational and communication constraints: Client side: size of the domain (|S|) and n # characters > 3,000 For 8-character words: size of the domain |S|=3,000^8 number of clients ~ 1B Efficiently [BS15] ~ n Our goal ~ O(log |S|)
  • 91. Private frequency oracle: Design constraints Computational and communication constraints: Client side: O(log |S|) Communication to server: O(1) bits Server-side cost for one query: O(log |S|)
  • 92. Private frequency oracle A starter solution: Randomized response 𝑑 0 1 0 𝑖 1 0 1 𝑖 Protects ε-differential privacy (with the right bias) Randomized response: d′
  • 93. Private frequency oracle A starter solution: Randomized response [Figure: randomized bit vectors from all clients are summed, with bias correction, to give a frequency estimate for all domain elements] Error in each estimate: Θ(√(n log|𝒮|)) Optimal error under privacy
  • 94. Computational and communication constraints: Client side: O(|S|) Communication to server: O(|S|) bits Server-side cost for one query: O(1) Private frequency oracle A starter solution: Randomized response 1 0 1 𝑖
  • 95. 𝑑 0 01 0 01 0 01 Hash function: ℎ1 Hash function: ℎ2 Hash function: ℎ 𝑘 Number of hash bins: 𝑛 Computation= 𝑂(log|𝒮|) 𝑘 ≈ log|𝒮| Private frequency oracle Non-private count-min sketch [CM05]
  • 96. 0 01 0 01 0 01 0 01 1 00 0 11 1 𝑘 1 + 245 127 9123 2132 𝑛 Reducing server computation Private frequency oracle Non-private count-min sketch [CM05]
  • 97. Reducing server computation 1 𝑘 1 Phablet 245 127 9123 2132 𝑛 9146 2212 Frequency estimate: min (9146, 2212, 2132) Error in each estimate: O( 𝑛log|𝒮|) Server side query cost: 𝑂(log|𝒮|) 𝑘 ≈ log |𝒮| Private frequency oracle Non-private count-min sketch [CM05] "phablet"
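The non-private count-min sketch from these slides can be illustrated in a few lines. This is a generic sketch of the data structure, not the deployed code; the SHA-256 based row hashing and the class name are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """k hash rows of counters; a query takes the minimum over the rows."""

    def __init__(self, num_rows: int, num_bins: int):
        self.num_rows = num_rows
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_rows)]

    def _bin(self, row: int, item: str) -> int:
        # Hypothetical per-row hash: SHA-256 of "row:item", reduced mod bins.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bins

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.num_rows):
            self.table[row][self._bin(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever add to a counter, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bin(row, item)]
            for row in range(self.num_rows)
        )
```

A query touches one counter per row, which is where the O(log|𝒮|) server-side query cost comes from when the number of rows is k ≈ log|𝒮|.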
  • 98. Private frequency oracle Private count-min sketch 𝑑 Making client computation differentially private 0 01 0 01 0 01 1 01 1 00 0 00 𝑘𝜖-diff. private, since 𝑘 pieces of information
  • 99. Private frequency oracle Private count-min sketch 𝑑 Theorem: Sampling ensures 𝜖-differential privacy without hurting accuracy, rather improves it by a factor of 𝑘 0 01 1 00
  • 100. Private frequency oracle Private count-min sketch 0 01 +1 +1-1 Hadamard transform Reducing client communication
  • 101. Private frequency oracle Private count-min sketch 0 01 +1 +1-1 Hadamard transform -1 +1 Communication: 𝑂(1) bit Theorem: Hadamard transform and sampling do not hurt accuracy Reducing client communication
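The one-bit communication idea can be sketched for a single hash row. The assumptions here go beyond the slide: the client's one-hot vector has length m (a power of two), the client samples one Hadamard coefficient uniformly, flips its sign with probability 1/(e^ε + 1), and the server debiases and inverts the transform. All function names are illustrative.

```python
import math
import random


def hadamard_entry(row: int, col: int) -> int:
    """Entry of the m x m Hadamard matrix: (-1)^popcount(row & col)."""
    return -1 if bin(row & col).count("1") % 2 else 1


def client_report(bin_index: int, m: int, epsilon: float, rng: random.Random):
    """Send one sampled coefficient index plus a single randomized sign bit."""
    coeff_index = rng.randrange(m)
    bit = hadamard_entry(coeff_index, bin_index)   # transform of a one-hot vector
    if rng.random() < 1.0 / (math.exp(epsilon) + 1.0):
        bit = -bit                                 # randomized response on one bit
    return coeff_index, bit


def server_histogram(reports, m: int, epsilon: float):
    """Debias the signs, accumulate coefficients, invert the transform."""
    scale = (math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)
    coeffs = [0.0] * m
    for coeff_index, bit in reports:
        coeffs[coeff_index] += m * scale * bit     # m undoes the 1/m sampling
    return [
        sum(hadamard_entry(j, l) * coeffs[l] for l in range(m)) / m
        for j in range(m)
    ]
```

The sampled index is independent of the client's data, so only the sign bit carries information, and flipping it with probability 1/(e^ε + 1) bounds the likelihood ratio by e^ε.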
  • 102. Private frequency oracle Private count-min sketch Error in each estimate: O(√(n log|𝒮|)) Computational and communication constraints: Client side: O(log |S|) Communication to server: O(1) bits Server-side cost for one query: O(log |S|)
  • 103. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  • 104. Private heavy hitters: Using the frequency oracle Private frequency oracle Private count-min sketch Domain 𝒮 Too many elements in 𝒮 to search. Element s in S Frequency(s) Find all s in S with frequency > γ
  • 105. Puzzle piece algorithm (works well in practice, no theoretical guarantees) [Bassily Nissim Stemmer Thakurta, 2017 and Apple differential privacy team, 2017]
  • 106. Private heavy hitters Ph ab le t$ Frequency > 𝛾 Each bi-gram frequency > 𝛾 Observation: If a word is frequent, its bigrams are frequent too.
  • 107. Private heavy hitters Sanitized bi-grams, and the complete word ab ad ph ba ab ax le ab ab Position P1 Position P2 Position P3 le ab t$ Position P4 Frequent bi-grams Natural algorithm: Cartesian product of frequent bi-grams
  • 108. Private heavy hitters ab ad ph ba ab ax le ab ab Position P1 Position P2 Position P3 le ab t$ Position P4 Frequent bi-grams Candidate words P1 x P2 x P3 x P4 Private frequency oracle Private count-min sketch Find frequent words Natural algorithm: Cartesian product of frequent bi-grams
  • 109. Private heavy hitters Candidate words P1 x P2 x P3 x P4 Private frequency oracle Find frequent words Combinatorial explosion In practice, all bi-grams are frequent Natural algorithm: Cartesian product of frequent bi-grams Private count-min sketch
  • 110. Puzzle piece algorithm Ph ab le t$ ≜ h=Hash(Phablet) Hash: 𝒮 → 1, … , ℓ Ph ab le t$h h h h Privatized bi-grams tagged with the hash, and the complete word
  • 111. Puzzle piece algorithm: Server side ab 1 ad 5 Ph 3 ba 4 ab 3 ax 9 le 3 le 7 ab 1 Position P1 Position P2 Position P3 le 1 ab 9 t$ 3 Position P4 Frequent bi-grams tagged with {1, … , ℓ} Candidate words P1 x P2 x P3 x P4 Private frequency oracle Find frequent words Combine only matching bi-grams Private count-min sketch
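The server-side combination step can be illustrated on the values shown in the slide. This sketch covers only the "combine only matching bi-grams" step; it assumes the frequent (bi-gram, hash tag) pairs per position have already been found, and it leaves the final check against the private frequency oracle out. The function name is illustrative.

```python
from collections import defaultdict
from itertools import product


def assemble_candidates(frequent_by_position):
    """Join bi-grams across positions only when their hash tags agree.

    frequent_by_position: one list per position of (bigram, hash_tag) pairs.
    Returns candidate words to be checked against the frequency oracle.
    """
    grouped = []
    for pairs in frequent_by_position:
        by_tag = defaultdict(list)
        for bigram, tag in pairs:
            by_tag[tag].append(bigram)
        grouped.append(by_tag)

    # A tag can produce a full word only if it occurs at every position.
    shared_tags = set(grouped[0])
    for by_tag in grouped[1:]:
        shared_tags &= set(by_tag)

    candidates = []
    for tag in sorted(shared_tags):
        for pieces in product(*(by_tag[tag] for by_tag in grouped)):
            candidates.append("".join(pieces))
    return candidates


# Values from the slide: positions P1..P4, each bi-gram tagged with a hash.
slide_example = [
    [("ab", 1), ("ad", 5), ("Ph", 3)],
    [("ba", 4), ("ab", 3), ("ax", 9)],
    [("le", 3), ("le", 7), ("ab", 1)],
    [("le", 1), ("ab", 9), ("t$", 3)],
]
```

On this input the plain Cartesian product would yield 3⁴ = 81 candidates, while matching on the hash tag leaves only the word tagged 3.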
  • 112. Roadmap 1. Private frequency estimation with count-min-sketch 2. Private heavy hitters with puzzle piece algorithm 3. Private heavy hitters with tree histogram protocol
  • 113. Tree histogram algorithm (works well in practice + optimal theoretical guarantees) [Bassily Nissim Stemmer Thakurta, 2017]
  • 114. Private heavy hitters: Tree histograms (based on [CM05]) 1 0 0 Any string in 𝒮: log |𝒮| bits Idea: Construct prefixes of the heavy hitter bit by bit
  • 115. Private heavy hitters: Tree histograms 0 1
  • 116. Private heavy hitters: Tree histograms 0 1 Level 1: Frequent prefix of length 1 Use private frequency oracle If a string is a heavy hitter, its prefixes are too.
  • 117. Private heavy hitters: Tree histograms 00 01 10 11
  • 118. Private heavy hitters: Tree histograms Level 2: Frequent prefix of length two Idea: Each level has ≈ 𝑛 heavy hitters 00 01 10 11
  • 119. Private heavy hitters: Tree histograms Theorem: Finds all heavy hitters with frequency at least O(√(n log|𝒮|)) Computational and communication constraints: Client side: O(log |S|) Communication to server: O(1) bits Server-side computation: O(n log |S|)
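The level-by-level prefix search can be sketched as below. To keep the example deterministic, an exact prefix counter stands in for the private frequency oracle; that substitution, the bit-string encoding of items, and the function names are assumptions made for the illustration.

```python
from collections import Counter


def exact_prefix_oracle(data, length):
    """Stand-in for the private frequency oracle: exact prefix counts."""
    counts = Counter(item[:length] for item in data)
    return lambda prefix: counts.get(prefix, 0)


def heavy_hitters(data, num_bits, threshold):
    """Grow frequent prefixes one bit at a time, pruning infrequent ones."""
    frontier = [""]
    for length in range(1, num_bits + 1):
        frequency = exact_prefix_oracle(data, length)
        frontier = [
            prefix + bit
            for prefix in frontier
            for bit in "01"
            if frequency(prefix + bit) >= threshold
        ]
    return frontier
```

Because every prefix of a heavy hitter is itself frequent, pruning at each level never discards a true heavy hitter, and each level queries the oracle for at most twice the number of surviving prefixes.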
  • 120. Key takeaway points • Keeping local differential privacy constant: •One low-noise report is better than many noisy ones •Weak signal with probability 1 is better than strong signal with small probability • We can learn the dictionary – at a cost • Longitudinal privacy remains a challenge
  • 122. Microsoft: Discretization of continuous variables "These guarantees are particularly strong when user’s behavior remains approximately the same, varies slowly, or varies around a small number of values over the course of data collection."
  • 123. Microsoft's deployment "Our mechanisms have been deployed by Microsoft across millions of devices ... to protect users’ privacy while collecting application usage statistics." B. Ding, J. Kulkarni, S. Yekhanin, NIPS 2017 More details: today, 1pm-5pm T12: Privacy at Scale: Local Differential Privacy in Practice, G. Cormode, T. Kulkarni, N. Li, T. Wang
  • 125. Outline • LinkedIn Salary Overview • Challenges: Privacy, Modeling • System Design & Architecture • Privacy vs. Modeling Tradeoffs
  • 126. LinkedIn Salary (launched in Nov, 2016)
  • 127. Salary Collection Flow via Email Targeting
  • 128. Current Reach (August 2018) • A few million responses out of several millions of members targeted • Targeted via emails since early 2016 • Countries: US, CA, UK, DE, IN, … • Insights available for a large fraction of US monthly active users
  • 129. Data Privacy Challenges • Minimize the risk of inferring any one individual’s compensation data • Protection against data breach • No single point of failure Achieved by a combination of techniques: encryption, access control, , aggregation, thresholding K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
  • 130. Modeling Challenges • Evaluation • Modeling on de-identified data • Robustness and stability • Outlier detection X. Chen, Y. Liu, L. Zhang, and K. Kenthapadi, How LinkedIn Economic Graph Bonds Information and Product: Applications in LinkedIn Salary, KDD 2018 (arxiv.org/abs/1806.09063) K. Kenthapadi, S. Ambler, L. Zhang, and D. Agarwal, Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary, CIKM 2017 (arxiv.org/abs/1703.09845)
  • 131. Problem Statement • How do we design the LinkedIn Salary system to account for its unique privacy and security challenges while addressing the product requirements?
  • 132. Differential Privacy? [Dwork et al, 2006] • Rich privacy literature (Adam-Wortmann, Samarati-Sweeney, Agrawal-Srikant, …, Kenthapadi et al, Machanavajjhala et al, Li et al, Dwork et al) • Limitation of anonymization techniques (as discussed in the first part) • Worst case sensitivity of quantiles to any one user’s compensation data is large • ⇒ Large noise to be added, reducing reliability/usefulness • Need compensation insights on a continual basis • Theoretical work on applying differential privacy under continual observations • No practical implementations / applications • Local differential privacy / Randomized response based approaches (Google’s RAPPOR; Apple’s iOS differential privacy; Microsoft’s telemetry collection) not applicable
  • 133. De-identification Example Original submission (Title, Region, Company, Industry, Years of exp, Degree, FoS, Skills, $$): User Exp Designer, SF Bay Area, Google, Internet, 12, BS, Interactive Media, "UX, Graphics, ...", 100K De-identified slices: (Title, Region, $$): User Exp Designer, SF Bay Area, 100K; (Title, Region, Industry, $$): User Exp Designer, SF Bay Area, Internet, 100K; (Title, Region, Years of exp, $$): User Exp Designer, SF Bay Area, 10+, 100K; (Title, Region, Company, Years of exp, $$): User Exp Designer, SF Bay Area, Google, 10+, 100K Cohort (Title, Region, $$): User Exp Designer, SF Bay Area, 100K; User Exp Designer, SF Bay Area, 115K; ... #data points > threshold? Yes ⇒ Copy to Hadoop (HDFS) Note: Original submission stored as encrypted objects.
  • 136. Collection & Storage • Allow members to submit their compensation info • Extract member attributes • E.g., canonical job title, company, region, by invoking LinkedIn standardization services • Securely store member attributes & compensation data
  • 138. De-identification & Grouping • Approach inspired by k-Anonymity [Samarati-Sweeney] • “Cohort” or “Slice” • Defined by a combination of attributes • E.g., “User experience designers in SF Bay Area” • Contains aggregated compensation entries from corresponding individuals • No user name, id or any attributes other than those that define the cohort • A cohort available for offline processing only if it has at least k entries • Apply LinkedIn standardization software (free-form attribute → canonical version) before grouping • Analogous to the generalization step in k-Anonymity
  • 139. De-identification & Grouping • Slicing service • Access member attribute info & submission identifiers (no compensation data) • Generate slices & track # submissions for each slice • Preparation service • Fetch compensation data (using submission identifiers), associate with the slice data, copy to HDFS
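The grouping-and-thresholding idea can be sketched as follows. This is an illustration of cohort slicing with a minimum size k, not the LinkedIn Salary services: the dictionary-based records, the field names, and the `build_cohorts` function are hypothetical.

```python
from collections import defaultdict


def build_cohorts(submissions, cohort_keys, k):
    """Group compensation entries into slices and keep slices with >= k entries.

    submissions: list of dicts with standardized attributes plus "compensation".
    cohort_keys: list of attribute-name tuples, each defining one kind of slice.
    """
    cohorts = defaultdict(list)
    for submission in submissions:
        for keys in cohort_keys:
            # The slice id holds only the attributes that define the cohort,
            # so no user name, id, or other attribute travels with the entry.
            slice_id = tuple((key, submission[key]) for key in keys)
            cohorts[slice_id].append(submission["compensation"])
    return {
        slice_id: entries
        for slice_id, entries in cohorts.items()
        if len(entries) >= k
    }
```

For example, with cohort_keys = [("title", "region"), ("title", "region", "company")] and k = 2, a company-level slice with a single submission is withheld while the broader title and region slice is released.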
  • 141. Insights & Modeling • Salary insight service • Check whether the member is eligible • Give-to-get model • If yes, show the insights • Offline workflow • Consume de-identified HDFS dataset • Compute robust compensation insights • Outlier detection • Bayesian smoothing/inference • Populate the insight key-value stores
  • 143. Security Mechanisms • Encryption of member attributes & compensation data using different sets of keys • Separation of processing • Limiting access to the keys
  • 144. Security Mechanisms • Key rotation • No single point of failure • Infra security
  • 145. Preventing Timestamp Join based Attacks • Inference attack by joining these on timestamp • De-identified compensation data • Page view logs (when a member accessed compensation collection web interface) • ⇒ Not desirable to retain the exact timestamp • Perturb by adding random delay (say, up to 48 hours) • Modification based on k-Anonymity • Generalization using a hierarchy of timestamps • But, need to be incremental • ⇒ Process entries within a cohort in batches of size k • Generalize to a common timestamp • Make additional data available only in such incremental batches
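The incremental batching step can be sketched as below. Two choices are assumptions made for the example rather than details from the slide: the common timestamp is taken to be the latest timestamp in each batch, and entries that do not yet fill a batch are withheld until enough arrive.

```python
def release_in_batches(entries, k):
    """Release one cohort's entries in batches of size k with a shared timestamp.

    entries: list of (timestamp, compensation) pairs for a single cohort.
    Returns (common_timestamp, compensation) pairs for complete batches only.
    """
    ordered = sorted(entries)
    complete = len(ordered) - len(ordered) % k   # drop the unfinished tail
    released = []
    for start in range(0, complete, k):
        batch = ordered[start:start + k]
        common_timestamp = batch[-1][0]          # assumed: latest time in batch
        released.extend((common_timestamp, comp) for _, comp in batch)
    return released
```

Since every released timestamp is shared by k entries, a join against page view logs on exact time can no longer single out one submission. The cost is the delay studied in the batch-size tradeoff later in the deck.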
  • 146. Privacy vs Modeling Tradeoffs • LinkedIn Salary system deployed in production for ~2.5 years • Study tradeoffs between privacy guarantees (‘k’) and data available for computing insights • Dataset: Compensation submission history from 1.5M LinkedIn members • Amount of data available vs. minimum threshold, k • Effect of processing entries in batches of size, k
  • 149. Median delay due to batching vs. batch size, k
  • 150. Key takeaway points • LinkedIn Salary: a new internet application, with unique privacy/modeling challenges • Privacy vs. Modeling Tradeoffs • Potential directions • Privacy-preserving machine learning models in a practical setting [e.g., Chaudhuri et al, JMLR 2011; Papernot et al, ICLR 2017] • Provably private submission of compensation entries?
  • 152. Beyond Randomized Response • LDP + Machine Learning: • "Is interaction necessary for distributed private learning?" Smith, Thakurta, Upadhyay, S&P 2017 • Federated Learning • Encode-Shuffle-Analyze architecture "Prochlo: Strong Privacy for Analytics in the Crowd" Bittau et al., SOSP 2017 • Amplification by Shuffling
  • 153. LDP + Machine Learning Interactivity as a major implementation constraint ... parallel
  • 154. LDP + Machine Learning Interactivity as a major implementation constraint ... sequential
  • 155. "Is interaction necessary for distributed private learning?" [STU 2017] • Single parameter learning (e.g., median): • Maximal accuracy with full parallelism • Multi-parameter learning: • Polylog number of iterations • Lower bounds
  • 156. Federated Learning "Practical secure aggregation for privacy-preserving machine learning" Bonawitz, Ivanov, Kreuter, Marcedone, McMahan, Patel, Ramage, Segal, Seth, ACM CCS 2017
  • 157. PROCHLO: Strong Privacy for Analytics in the Crowd Bittau, Erlingsson, Maniatis, Mironov, Raghunathan, Lie, Rudominer, Kode, Tinnes, Seefeld SOSP 2017
  • 158. The ESA Architecture and Its Prochlo Realization ESA: Encode, Shuffle, Analyze Prochlo: A hardened ESA realization using Intel's SGX + crypto [Figure: encoders (E) feed a shuffler (S), which feeds an analyzer (A)]
  • 160. Key takeaway points • Notion of differential privacy is a principled foundation for privacy-preserving data analyses • Local differential privacy is a powerful technique appropriate for Internet-scale telemetry • Other techniques (thresholding, shuffling) can be combined with differentially private algorithms or be used in isolation.
  • 161. References Differential privacy: review "A Firm Foundation For Private Data Analysis", C. ACM 2011 by Dwork book "The Algorithmic Foundations of Differential Privacy" by Dwork and Roth
  • 162. References Google's RAPPOR: paper "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response", ACM CCS 2014, Erlingsson, Pihur, Korolova blog Apple's implementation: article "Learning with Privacy at Scale", Apple ML J., Dec 2017 paper "Practical Locally Private Heavy Hitters", NIPS 2017, by Bassily, Nissim, Stemmer, Thakurta paper "Privacy Loss in Apple's Implementation of Differential Privacy on MacOS 10.12" by Tang, Korolova, Bai, Wang, Wang LinkedIn Salary: paper "LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers", IEEE PAC 2017, Kenthapadi, Chudhary, Ambler blog
  • 163. Thanks & Questions • Tutorial website: https://sites.google.com/view/kdd2018privacytutorial • Feedback most welcome • kkenthapadi@linkedin.com, mironov@google.com • Related KDD’18 tutorial this afternoon (1pm-5pm): T12: Privacy at Scale: Local Differential Privacy in Practice, G. Cormode, T. Kulkarni, N. Li, T. Wang