Preserving Privacy and Utility in Text Data Analysis

Tom Diethe, Oluwaseyi Feyisetan, Thomas Drake, Borja Balle
{sey,tdiethe,draket}@amazon.com
borja.balle@gmail.com
PrivateNLP Workshop, WSDM
February 7 2020
Outline
1 Alexa AI
2 Algorithmic Privacy
3 Privacy for Text
4 Differential Privacy in Euclidean Spaces
5 Differential Privacy in Hyperbolic Spaces
6 Optimizing the Privacy Utility Trade-off
7 Summary
Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 1 / 41
Alexa AI
What is Alexa?
A cloud-based voice service that can help
you with tasks, entertainment, general
information, shopping, and more
The more you talk to Alexa, the more
Alexa adapts to your speech patterns,
vocabulary, and personal preferences
How do we ...
create robust and efficient AI systems?
maintain the privacy of customer data?
Failure Modes
Unintentional failures: ML system produces a formally correct but completely unsafe
outcome
Outliers/anomalies
Dataset shift
Limited memory
Intentional failures: failure is caused by an active adversary attempting to subvert the
system to attain her goals, such as to:
misclassify the result
infer private training data
steal the underlying algorithm
A first attempt: Can’t I just anonymize my data?
k-anonymity: the information for each person cannot be distinguished from that of at least k − 1
other individuals whose information also appears in the release
Suppose a company is audited for salary discrimination
The auditor can see salaries by gender, age and nationality for each department and office
If the auditor has a friend, an ex, or a date working for the company, she will learn that
person's salary
Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case)
Office Dept. Salary D.O.B. Nationality Gender
London IT £##### May 1985 Portuguese Female
UK IT £##### 1980-1985 - Female
Still presents a risk of re-identification! If there are 10 females born between 1980 and 1985 in the
whole of the UK's IT department, 9 of them could conspire to learn the salary of the 10th one
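The effect of coarsening can be checked mechanically. The sketch below is illustrative only: the records, the field names and the `generalise` helper are hypothetical and not taken from the talk. It computes k as the size of the smallest group of records that share the same quasi-identifier values, before and after reducing granularity.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("office", "dept", "dob", "gender")

def k_anonymity(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return k: the size of the smallest group sharing all quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def generalise(record):
    """Toy coarsening rule: office -> country, birth month -> year range."""
    return {**record, "office": "UK", "dob": "1980-1985"}

# Hypothetical records, invented for illustration.
records = [
    {"office": "London", "dept": "IT", "dob": "May 1985", "gender": "Female"},
    {"office": "London", "dept": "IT", "dob": "Jan 1982", "gender": "Female"},
    {"office": "Leeds", "dept": "IT", "dob": "Mar 1984", "gender": "Female"},
]

k_fine = k_anonymity(records)                             # each record is unique
k_coarse = k_anonymity([generalise(r) for r in records])  # all records fall in one group
```

A larger k after coarsening does not remove the collusion risk: k − 1 members of a group can still pool what they know to isolate the remaining one.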
Anonymized Data Isn’t
Example 1: In the mid-1990s, the Massachusetts “Group Insurance Commission” (GIC) released
“anonymized” data on state employees that showed every hospital visit
The goal was to help researchers. All obvious identifiers such as name, address, and
social security number were removed
Latanya Sweeney, then an MIT PhD student, requested a copy of the data and attempted
to reverse the anonymization
Reidentification
William Weld, then Governor of Massachusetts, assured the public that GIC had protected
patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital
records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts,
population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the
city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every
voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6
people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code.
Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office.
Anonymized Data Isn’t
Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated
movies over a six-year period
Netflix “anonymized” the data before releasing it by removing usernames, but assigned
unique identification numbers to users in order to allow for continuous tracking of user
ratings and trends
Reidentification
Researchers used this information to uniquely identify individual Netflix users by crossing the
data with the public IMDb database. According to the study, if a person has information about
when and how a user rated six movies, that person can identify 99% of people in the Netflix
database.
Differential Privacy
A randomised mechanism $M : X \to Y$ is $\varepsilon$-differentially private if for all neighbouring inputs
$x \simeq x'$ (i.e. $\|x - x'\|_1 = 1$) and for all sets of outputs $E \subseteq Y$ we have
$$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E]$$
[Figure: output densities of M(D) and M(D'); their ratio is bounded by $e^{\varepsilon}$]
Mechanisms:
Randomised response → plausible deniability
Laplace mechanism: e.g. $\tilde{\mu} = \mu + \xi$, $\xi \sim \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$
Output perturbation
...
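As a concrete instance of the Laplace mechanism, here is a minimal sketch for releasing a mean. It assumes values bounded in [0, 1] and neighbouring datasets that differ in one record, so the mean has sensitivity 1/n; the function name `private_mean` is ours, not the talk's.

```python
import numpy as np

def private_mean(values, epsilon, rng):
    """Release the mean of values in [0, 1] with Laplace noise of scale 1/(n * epsilon)."""
    values = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)  # enforce the assumed bounds
    n = len(values)
    sensitivity = 1.0 / n  # replacing one record moves the mean by at most 1/n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(values.mean() + noise)

rng = np.random.default_rng(0)
noisy = private_mean([0.2, 0.9, 0.4, 0.7], epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` give a larger noise scale, so the released mean is more private and less accurate.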
Randomized Response [Warner ’65]
Say you want to release a bit x ∈ {Yes, No}. Do the following:
1 flip a coin
2 if tails, respond truthfully with x
3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails
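Claim: the above algorithm satisfies (log 3)-differential privacy
$$\frac{\Pr[\text{Response} = \text{Yes} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{Yes} \mid x = \text{No}]} = \frac{1/2 \times 1 + 1/2 \times 1/2}{1/2 \times 0 + 1/2 \times 1/2} = \frac{3/4}{1/4} = 3 \implies e^{\varepsilon} = 3$$
The same holds for $\frac{\Pr[\text{Response} = \text{No} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{No} \mid x = \text{No}]} = \frac{1/4}{3/4} = \frac{1}{3} = e^{-\varepsilon}$.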
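The two-coin procedure can be simulated directly. The sketch below implements the steps as stated and adds a debiasing step that is not on the slide: since P[Yes response] = 1/4 + p/2 when a fraction p of the population truly holds "Yes", the analyst can invert this to estimate p. Function names are ours.

```python
import math
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Warner-style randomized response with two fair coins."""
    if rng.random() < 0.5:       # first coin is tails: answer truthfully
        return truth
    return rng.random() < 0.5    # first coin is heads: answer with a second coin

def estimate_yes_rate(responses):
    """Invert P[Yes response] = 1/4 + p/2 to estimate the true 'Yes' rate p."""
    observed = sum(responses) / len(responses)
    return 2.0 * (observed - 0.25)

# Privacy level implied by the response probabilities 3/4 and 1/4.
EPSILON = math.log(0.75 / 0.25)

rng = random.Random(0)
truths = [i < 30_000 for i in range(100_000)]  # 30% of respondents truly hold "Yes"
responses = [randomized_response(t, rng) for t in truths]
estimate = estimate_yes_rate(responses)
```

Each individual answer is deniable, yet the aggregate estimate stays close to the true rate when the population is large.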
Important Properties
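Robustness to post-processing: if $M$ is $(\varepsilon, \delta)$-DP, then $f(M)$ is $(\varepsilon, \delta)$-DP
Composition: if $M_1, \dots, M_n$ are $(\varepsilon_i, \delta_i)$-DP, then $g(M_1, \dots, M_n)$ is
$\left(\sum_{i=1}^{n} \varepsilon_i, \sum_{i=1}^{n} \delta_i\right)$-DP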
Protects against arbitrary side knowledge
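Basic composition amounts to adding up budgets. A minimal sketch, with the hypothetical helper `compose` taking one (ε, δ) pair per mechanism:

```python
def compose(budgets):
    """Basic composition: the total privacy cost is the sum of the individual costs."""
    total_epsilon = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_epsilon, total_delta

# Three releases from the same data, each with its own (epsilon, delta).
total_epsilon, total_delta = compose([(0.5, 1e-6), (0.25, 0.0), (0.25, 1e-6)])
```

This is the simple additive bound stated above; tighter accounting methods exist but are outside the scope of this slide.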
User-AI system interaction via natural language
User’s goal: meet some specific need with respect to an
issued query x
Agent’s goal: satisfy the user’s request
Privacy violation: occurs when x is used to make personal
inferences, e.g. when unrestricted PII is present
Mechanism: Modify the query to protect privacy whilst
preserving semantics
Our approach: Metric Differential Privacy
Desired Functionality
Intent               Query $x$                        Modified query $\hat{x}$
GetWeather           Will it be colder in Cleveland   Will it be colder in Ohio
PlayMusic            Play Cantopop on lastfm          Play C-pop on lastfm
BookRestaurant       Book a restaurant in Milladore   Book a restaurant in Wood County
SearchCreativeWork   I want to watch Manthan film     I want to watch Hindi film
Word Embeddings
Mapping from words into vectors of real numbers (many ways to do this!)
e.g. neural-network-based models (Word2Vec, GloVe, fastText)
Defines a mapping $\phi : W \to \mathbb{R}^n$
Nearest neighbours are often synonyms
Metric Differential Privacy
Recall the definition of DP ...
$$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E] \quad \text{for } x, x' \in X \text{ s.t. } \|x - x'\|_1 = 1$$
This can be rewritten into a single equation as:
$$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \|x - x'\|_1}$$
Metric differential privacy generalises this to use any valid metric $d(x, x')$:
$$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon\, d(x, x')}$$
(easy to see that standard DP is metric DP with $d(x, x') = \|x - x'\|_1$)
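The step from neighbouring inputs to arbitrary inputs is a chaining argument. A sketch, assuming the inputs are count vectors so that any $x$ and $x'$ with $\|x - x'\|_1 = k$ can be joined by a path $x = x_0, x_1, \dots, x_k = x'$ in which consecutive points are neighbours, and assuming the probabilities involved are non-zero:

```latex
\begin{align*}
\frac{P[M(x) \in E]}{P[M(x') \in E]}
  &= \prod_{i=1}^{k} \frac{P[M(x_{i-1}) \in E]}{P[M(x_i) \in E]}
  && \text{(telescoping product)} \\
  &\le \prod_{i=1}^{k} e^{\varepsilon}
  && \text{(standard DP applied to each neighbouring pair)} \\
  &= e^{\varepsilon k} = e^{\varepsilon \|x - x'\|_1}
\end{align*}
```

Metric DP keeps this form of the bound and swaps the $\ell_1$ distance for a metric suited to the data.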
Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
Given:
$w \in W$: word to be “privatised” from word space $W$ (dictionary)
$\phi : W \to Z$: embedding function from word space to embedding space $Z$ (e.g. $\mathbb{R}^n$)
$v = \phi(w)$: corresponding word vector
$d : Z \times Z \to \mathbb{R}$: distance function in embedding space
$\Omega(\varepsilon)$: the DP noise sampling distribution (e.g. $\Omega_i(\varepsilon) = \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$, $i = 1, \dots, n$ for $\mathbb{R}^n$)
Metric DP Mechanism for word embeddings
1 Perturb the word vector: $\hat{v} = v + \xi$ where $\xi \sim \Omega(\varepsilon)$
2 The new vector $\hat{v}$ will not be a word (a.s.)
3 Project back to $W$: $\hat{w} = \arg\min_{u \in W} d(\hat{v}, \phi(u))$, return $\hat{w}$
What do we need?
$d$ satisfies the axioms of a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality)
A way to sample using $\Omega$ in the metric space that respects $d$ and gives us $\varepsilon$-metric DP
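The three steps can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the vocabulary and its 2-D vectors are made up, the noise is i.i.d. Laplace with scale 1/ε per coordinate (which gives ε-metric DP with respect to the $\ell_1$ distance), and the calibration used in the cited papers may differ. The projection step is post-processing, so it does not weaken the guarantee.

```python
import numpy as np

def privatise_word(word, vocab, vectors, epsilon, rng):
    """Perturb a word's embedding and return the nearest vocabulary word."""
    v = vectors[vocab.index(word)]
    # Step 1: add i.i.d. Laplace noise (assumed scale 1/epsilon per coordinate).
    v_hat = v + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=v.shape)
    # Step 3: project back onto the vocabulary using the L1 distance.
    distances = np.abs(vectors - v_hat).sum(axis=1)
    return vocab[int(np.argmin(distances))]

# Hypothetical toy embedding, invented for illustration.
vocab = ["cleveland", "ohio", "manthan", "hindi"]
vectors = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.2]])

rng = np.random.default_rng(0)
private_word = privatise_word("cleveland", vocab, vectors, epsilon=10.0, rng=rng)
```

With a large ε the noise is small and the original word is usually returned; with a small ε the output can be any word in the vocabulary.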
Differential Privacy in the Space of Euclidean Word Embeddings
Adding noise to a location always produces a valid location: a point somewhere on
the earth's surface
Adding noise to a word embedding produces a new point in the embedding
space, but it is almost surely not the location of a valid word embedding
We perform an approximate nearest-neighbour search to find the nearest valid embedding
The nearest valid embedding could be that of the original word itself: in that case, the
original word is returned
Practical Considerations
To help choose ε, we define:
Uncertainty statistics for the adversary over the outputs
Indistinguishability statistics: plausible deniability
Find a radius of high protection: guarantee on the likelihood of changing any word in the
embedding vocabulary
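Statistics of this kind can be estimated empirically by running the mechanism many times on one word. The sketch below is an assumption-laden illustration: it uses per-coordinate Laplace noise of scale 1/ε and an $\ell_1$ nearest-neighbour projection on a made-up vocabulary, and reports how often the word comes back unchanged and how many distinct outputs appear. The function and variable names are ours.

```python
from collections import Counter
import numpy as np

def output_statistics(index, vectors, epsilon, trials, rng):
    """Return (times the word is unchanged, number of distinct output words)."""
    v = vectors[index]
    noise = rng.laplace(0.0, 1.0 / epsilon, size=(trials, v.shape[0]))
    perturbed = v + noise                                   # shape (trials, n)
    # L1 distance from every perturbed vector to every vocabulary vector.
    dists = np.abs(perturbed[:, None, :] - vectors[None, :, :]).sum(axis=2)
    outputs = dists.argmin(axis=1)                          # nearest word per trial
    counts = Counter(outputs.tolist())
    return counts[index], len(counts)

# Hypothetical toy embedding with four words.
vectors = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.2]])
rng = np.random.default_rng(0)

unchanged_hi, distinct_hi = output_statistics(0, vectors, epsilon=1e6, trials=100, rng=rng)
unchanged_lo, distinct_lo = output_statistics(0, vectors, epsilon=0.01, trials=500, rng=rng)
```

A word that is rarely returned unchanged and spreads over many outputs offers more plausible deniability; sweeping ε and watching these two counts is one way to pick a value.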
Euclidean Experiments: Setup
Dataset             IMDb                   Enron                   InsuranceQA
Task type           Sentiment analysis     Author identification   Question answering
Evaluation metric   accuracy               accuracy                MAP, MRR
Training set size   25,000                 8,517                   12,887
Test set size       25,000                 850                     1,800
Total word count    5,958,157              307,639                 92,095
Vocabulary size     79,428                 15,570                  2,745
Sentence length     µ = 42.27, σ = 34.38   µ = 30.68, σ = 31.54    µ = 7.15, σ = 2.06
Scenario 1 (train-time protection): little access to public data (10%), but abundant
access to private training data (90%); model training is done on the combined dataset
(i.e. public subset + perturbed private subset)
Scenario 2 (test-time protection): models are trained on the complete training set and
evaluated on privatized versions of the test sets
We used 300-D GloVe word embeddings with biLSTM models
Results
IMDb reviews – Accuracy vs baseline for different values of ε
[Figure: two panels, “Accuracy (at training time)” and “Accuracy (at test time)”; x-axis: epsilon (200 to 1000); y-axis: accuracy (0.0 to 1.0); series: Accuracy, Baseline]
Results
Enron emails – Accuracy vs baseline for different values of ε
[Figure: two panels, “Accuracy (at training time)” and “Accuracy (at test time)”; x-axis: epsilon (200 to 1000); y-axis: accuracy (0.0 to 1.0); series: Accuracy, Baseline]
Results
InsuranceQA – MAP/MRR scores for different values of ε on the dev set
[Figure: two panels, “Scores for dev at training time” and “Scores for dev at test time”; x-axis: epsilon (200 to 1000); y-axis range: 0.0 to 1.0; series: MAP on dev, MRR on dev, MAP baseline, MRR baseline]
Privacy Evaluation
In the previous experiments, we didn’t explicitly evaluate privacy
Problem: ε is an arbitrary number that is hard to interpret
This is especially true in metric DP, since ε is on a different scale
As we have seen, there are empirical ways to calibrate ε according to statistics of the word
embeddings
But how do we convince stakeholders that the privacy guarantees are holding, and that there
are no bugs?
Solution: machine auditors, i.e. machine learning algorithms designed to carry out different types of
privacy attacks on the data
Machine Auditors
Probabilistic record linkage auditing attack
Objective: link a user in a public dataset to a user in a (leaked) private dataset.
Attack simulation: simulate public and “leaked” datasets by randomly splitting
an initial dataset. The attack takes advantage of rare words and queries issued
by users. A vector of word counts can be extracted from user queries and used to
perform the linkage.
Assumptions: attacker is able to narrow the attack set (using side knowledge)
Evaluation: how many accurate links can the attacker reconstruct?
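A toy version of such an auditor is sketched below. It is a simplification: a plain cosine-similarity matcher over word-count vectors stands in for a probabilistic record-linkage model, and the user ids and queries are invented for illustration.

```python
from collections import Counter
import math

def word_counts(queries):
    """Bag-of-words count vector for one user's queries."""
    return Counter(w for q in queries for w in q.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link_users(public, leaked):
    """Map each leaked user id to the most similar public user id."""
    public_vectors = {user: word_counts(qs) for user, qs in public.items()}
    links = {}
    for user, queries in leaked.items():
        vec = word_counts(queries)
        links[user] = max(public_vectors, key=lambda p: cosine(vec, public_vectors[p]))
    return links

# Hypothetical data: two known users and two pseudonymous leaked profiles.
public = {
    "alice": ["play cantopop on lastfm", "book a restaurant in milladore"],
    "bob": ["will it be colder in cleveland", "weather in cleveland tomorrow"],
}
leaked = {"u1": ["play cantopop playlist"], "u2": ["is cleveland colder today"]}

links = link_users(public, leaked)
```

Running the same attack on queries that have first been privatised, and counting how many links survive, gives an empirical read on the protection offered.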
Machine Auditors
Membership auditing attack [Shokri et al ’17, Song & Shmatikov ’18]
Objective: identify whether an individual’s data (queries) were used in the
training set of an ML model.
Attack simulation: train ML model on queries from m users. Train “shadow”
models using data from a different set of n users. The attack model is a classifier
built using the output of the shadow models
Assumptions: attacker is able to narrow the attack set (using side knowledge)
Evaluation: can the attacker correctly distinguish the m users inside the
model's dataset from users outside it?
Hyperbolic Spaces
[Figure: (a) projection of a point in the Lorentz model $\mathbb{H}^n$ to the Poincaré model; (b) WebIsADb is-a relationships in the GloVe vocabulary on the $\mathbb{B}^2$ Poincaré disk]
Continuous analogue of a tree structure
Natural language captures hypernymy and hyponymy
→ embeddings require fewer dimensions
Use models of hyperbolic space: projections into Euclidean space
Hyperbolic Differential Privacy
Distances in the $n$-dimensional Poincaré ball are given by:
$$d_{\mathbb{B}^n}(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$$
Claim: $d_{\mathbb{B}^n}(u, v)$ is a valid metric. Proof (via the Lorentzian model) in the paper
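The distance formula translates directly into code. A minimal sketch (the function name is ours), which also rejects points outside the open unit ball:

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points of the open unit (Poincare) ball."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_u, sq_v = np.dot(u, u), np.dot(v, v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = np.dot(u - v, u - v)
    return float(np.arccosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))))

d_origin = poincare_distance([0.0, 0.0], [0.5, 0.0])   # distance from the origin
d_edge = poincare_distance([0.0, 0.0], [0.99, 0.0])    # a point close to the boundary
```

Distances grow without bound as points approach the boundary of the ball, which is what lets tree-like hierarchies fit in few dimensions.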
Hyperbolic Noise
Recall for Euclidean metric DP, we use Laplacian
noise to achieve $\varepsilon$-mDP, i.e.:
$$\xi \sim \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$$
We derive the Hyperbolic Laplace distribution:
$$p(x \mid \mu = 0, \varepsilon) = \frac{1 + \varepsilon}{2\,{}_2F_1(1, \varepsilon; 2 + \varepsilon; -1)} \left(-\frac{2}{x - 1} - 1\right)^{-\varepsilon}$$
where ${}_2F_1(a, b; c; z)$ is the hypergeometric function
For sampling, we developed a Lorentzian Metropolis-Hastings sampler (see paper)
[Figure: plot with both axes ranging from −0.4 to 0.4]
Hyperbolic Privacy Experiments 1
Task: obfuscation vs. Koppel’s authorship attribution algorithm
Datasets: TPAN@Clef tasks, correct author predictions (lower=better)
ε      Pan-11          Pan-12
       small   large   set-A   set-C   set-D   set-I
0.5    36      72      4       3       2       5
1      35      73      3       3       2       5
2      40      78      4       3       2       5
8      65      116     4       5       4       5
∞      147     259     6       6       6       12
Correct author predictions (lower is better)
Hyperbolic Privacy Experiments 2
Task: expected privacy vs Euclidean baseline
Datasets: 100/200/300d GloVe embeddings
ε       worst-case Nw   expected value Nw
                        hyp-100   euc-100   euc-200   euc-300
0.125   134             1.25      38.54     39.66     39.88
0.5     148             1.62      42.48     43.62     43.44
1       172             2.07      48.80     50.26     53.82
2       297             3.92      92.42     93.75     90.90
8       960             140.67    602.21    613.11    587.68
Privacy comparisons (lower Nw is better)
Hyperbolic Utility Experiments
5 classification tasks: sentiment x2, product reviews, opinion polarity, question-type
3 natural language tasks: NL inference, paraphrase detection, semantic textual similarity
baselines: utility results baselined using SentEval against random replacement
                      hyp-100d                                original
dataset   random      ε = 0.125    ε = 1        ε = 8        InferSent   SkipThought   fastText
MR        58.19       58.38        63.56        74.52        81.10       79.40         78.20
CR        77.48       83.21**      83.92**      85.19**      86.30       83.1          80.20
MPQA      84.27       88.53*       88.62*       88.98*       90.20       89.30         88.00
SST-5     30.81       41.76        42.40        42.53        46.30       −             45.10
TREC-6    75.20       82.40        82.40        84.20*       88.20       88.40         83.40
SICK-E    79.20       81.00**      82.38**      82.34**      86.10       79.5          78.9
MRPC      69.86       74.78*       75.07*       75.01*       76.20       −             74.40
STS14     0.17/0.16   0.44/0.45    0.45/0.46*   0.52/0.53*   0.68/0.65   0.44/0.45     0.65/0.63
Accuracy scores on classification tasks. * indicates results better than 1 baseline, ** better than 2 baselines
[Figure: utility vs. privacy]
Example: Differentially Private SGD
Algorithm 1: Differentially Private SGD
Input: dataset $z = (z_1, \ldots, z_n)$
Hyperparameters: learning rate $\eta$, mini-batch size $m$, number of epochs $T$, noise variance $\sigma^2$, clipping norm $L$
Initialize $w \leftarrow 0$
for $t \in [T]$ do
    for $k \in [n/m]$ do
        Sample $S \subset [n]$ with $|S| = m$ uniformly at random
        Let $g \leftarrow \frac{1}{m} \sum_{j \in S} \mathrm{clip}_L(\nabla \ell(z_j, w)) + \frac{2L}{m} \mathcal{N}(0, \sigma^2 I)$
        Update $w \leftarrow w - \eta g$
return $w$
5+ hyper-parameters affecting both privacy and utility
For deep learning applications we only have empirical utility (not analytic)
How do we find the hyperparameters that give us an optimal trade-off?
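A minimal sketch of Algorithm 1 in plain Python, to show where each hyperparameter enters. The logistic loss in `logistic_grad` and the function names are assumptions made for illustration; the slide leaves the loss unspecified. No privacy accounting is performed here, so the sketch does not by itself report the ε that a given choice of σ, m and T yields.

```python
import math
import random


def clip(vec, L):
    """Rescale vec so that its Euclidean norm is at most L."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, L / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def logistic_grad(z, w):
    """Gradient of the logistic loss for one example z = (features, label in {0, 1})."""
    x, y = z
    s = sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable sigmoid.
    p = 1.0 / (1.0 + math.exp(-s)) if s >= 0 else math.exp(s) / (1.0 + math.exp(s))
    return [(p - y) * xi for xi in x]


def dp_sgd(data, dim, eta, m, T, sigma, L, grad=logistic_grad, seed=0):
    rng = random.Random(seed)
    n = len(data)
    w = [0.0] * dim                          # Initialize w <- 0
    for _ in range(T):                       # epochs
        for _ in range(n // m):              # mini-batches per epoch
            S = rng.sample(range(n), m)      # |S| = m, uniformly at random
            g = [0.0] * dim
            for j in S:
                # Average of per-example gradients, each clipped to norm L.
                for i, gi in enumerate(clip(grad(data[j], w), L)):
                    g[i] += gi / m
            # Add (2L / m) * N(0, sigma^2 I), as in the update on the slide.
            g = [gi + (2.0 * L / m) * rng.gauss(0.0, sigma) for gi in g]
            w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```

Clipping bounds each example's influence on the update, which is what allows the Gaussian noise scale to be tied to L and m.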
The Privacy-Utility Pareto Front
[Figure: hyper-parameter space mapped to the privacy loss vs. error plane, with the Pareto-optimal points highlighted]
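A small sketch of how the Pareto-optimal points can be extracted from a set of evaluated hyper-parameter settings, assuming each setting has already been reduced to a (privacy loss, error) pair and that both objectives are minimised. The function name `pareto_front` is hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (privacy_loss, error) pairs, sorted by privacy loss.

    A point dominates another if it is no worse in both objectives and
    strictly better in at least one (both objectives are minimised).
    """
    front = []
    best_error = float("inf")
    # After sorting by privacy loss, a point is non-dominated exactly when
    # its error is strictly lower than every error seen so far.
    for eps, err in sorted(set(points)):
        if err < best_error:
            front.append((eps, err))
            best_error = err
    return front


evaluated = [(0.5, 0.40), (1.0, 0.30), (1.0, 0.35), (2.0, 0.32), (4.0, 0.20), (8.0, 0.20)]
print(pareto_front(evaluated))  # [(0.5, 0.4), (1.0, 0.3), (4.0, 0.2)]
```

Every point on the front is a defensible choice: moving along it trades a lower error for a higher privacy loss.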
Bayesian Optimization
Gradient-free optimization for black-box functions
Widely used in applications (HPO in ML, scheduling & planning, experimental design ...)
In multi-objective problems, BO aims to learn the Pareto front with a minimal number of
evaluations.
DPareto
Repeat:
  1 For each objective (privacy, utility):
      1 Fit a surrogate model (Gaussian process (GP)) using the available dataset
      2 Calculate the predictive distribution using the GP mean and variance functions
  2 Use the posterior of the surrogate models to form an acquisition function
  3 Collect the next point at the estimated global max. of the acquisition function
until budget exhausted
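The loop above can be sketched for a single scalar hyper-parameter as follows. This is a toy under stated assumptions: a zero-mean GP with a fixed RBF kernel, a finite candidate grid, and a stand-in acquisition that sums optimistic lower bounds on the two objectives. The acquisition actually used by DPareto is different, and all names here (`gp_posterior`, `dpareto_sketch`, `kappa`) are hypothetical.

```python
import math


def rbf(a, b, length=0.3):
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(X, y, x_new, noise=1e-4):
    """GP regression on centred targets: predictive mean and variance at x_new."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k = [rbf(a, x_new) for a in X]
    alpha = solve(K, [yi - mean_y for yi in y])
    v = solve(K, k)
    mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
    return mu, var


def dpareto_sketch(privacy_fn, utility_fn, candidates, init, budget, kappa=1.0):
    X = list(init)
    priv = [privacy_fn(x) for x in X]
    err = [utility_fn(x) for x in X]
    for _ in range(budget):
        best, best_score = None, -math.inf
        for c in candidates:
            if c in X:
                continue
            # Step 1: one GP surrogate per objective (mean and variance).
            mp, vp = gp_posterior(X, priv, c)
            me, ve = gp_posterior(X, err, c)
            # Step 2: toy acquisition (assumption), higher is better.
            score = -((mp - kappa * math.sqrt(vp)) + (me - kappa * math.sqrt(ve)))
            if score > best_score:
                best, best_score = c, score
        if best is None:
            break
        # Step 3: evaluate both objectives at the acquisition maximiser.
        X.append(best)
        priv.append(privacy_fn(best))
        err.append(utility_fn(best))
    return list(zip(X, priv, err))
```

In practice each call to `utility_fn` means training a model, which is why the number of evaluations is the budget being spent.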
DPareto vs Random Sampling
[Figure, three panels. "Hypervolume Evolution": PF hypervolume vs. sampled points for MLP1 (RS), MLP1 (BO), MLP2 (RS), MLP2 (BO). "MLP2 Pareto Fronts": classification error vs. ε for Initial, +256 RS, +256 BO. "LogReg+SGD Samples": classification error vs. ε for 1500 RS, 256 BO.]
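The Pareto-front (PF) hypervolume used to compare BO with random sampling can be sketched for two minimised objectives as the area dominated by the front and bounded by a reference point. The reference point `ref` and the function name are assumptions for illustration; the slides do not state which reference point was used.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimised)."""
    # Only points that improve on the reference in both objectives contribute.
    pts = sorted(p for p in set(points) if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:
            # New front point: add the strip between the previous best y and y.
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


front = [(1.0, 3.0), (2.0, 1.0)]
print(hypervolume_2d(front, ref=(4.0, 4.0)))  # 7.0
```

A larger hypervolume means the sampled points push the front closer to the ideal corner, so the curve in the first panel rises as better trade-offs are found.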
Summary: Privacy Enhancing Technologies
Privacy
Privacy risks can be counter-intuitive and tricky to formalize
High-dimensional data and side knowledge make privacy hard
Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
Differential privacy is a mature privacy enhancing technology
Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
Empirical privacy-utility trade-off evaluation enables application-specific decisions
Bayesian optimization provides a computationally efficient method to recover the Pareto front (esp. with a large number of hyper-parameters)
Questions?
tdiethe@amazon.com
Preserving Privacy and Utility in Text Data Analysis

  • 1. Preserving Privacy and Utility in Text Data Analysis Tom Diethe, Oluwaseyi Feyisetan, Thomas Drake, Borja Balle {sey,tdiethe,draket}@amazon.com borja.balle@gmail.com PrivateNLP Workshop, WSDM February 7 2020
  • 2. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 1 / 41
  • 3. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 2 / 41
  • 4. Alexa AI What is Alexa? A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 3 / 41
  • 5. Alexa AI What is Alexa? A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences How do we ... create robust and efficient AI systems? maintain the privacy of customer data? Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 3 / 41
  • 6. Failure Modes Unintentional failures: ML system produces a formally correct but completely unsafe outcome Outliers/anomalies Dataset shift Limited memory Intentional failures: failure is caused by an active adversary attempting to subvert the system to attain her goals, such as to: misclassify the result infer private training data steal the underlying algorithm Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 4 / 41
  • 7. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 5 / 41
  • 8. A first attempt: Can’t I just anonymize my data? k-anonymity: information for each person cannot be distinguished from at least k − 1 individuals whose information also appear in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex, a date, working for the company she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office Dept. Salary D.O.B. Nationality Gender London IT £##### May 1985 Portuguese Female Still presents risk of re-identification!. If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 9. A first attempt: Can’t I just anonymize my data? k-anonymity: information for each person cannot be distinguished from at least k − 1 individuals whose information also appear in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex, a date, working for the company she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office Dept. Salary D.O.B. Nationality Gender London IT £##### May 1985 Portuguese Female Still presents risk of re-identification!. If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 10. A first attempt: Can’t I just anonymize my data? k-anonymity: information for each person cannot be distinguished from at least k − 1 individuals whose information also appear in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex, a date, working for the company she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office Dept. Salary D.O.B. Nationality Gender UK IT £##### 1980-1985 - Female Still presents risk of re-identification!. If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 11. Anonymized Data Isn’t Example 1: Mid 1990’s: Massachusetts “Group Insurance Commission” released “anonymized” data on state employees that showed every hospital visit Goal was to help researchers. Removed all obvious identifiers such as name, address, and social security number MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization, requested a copy of the data Reidentification William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6 people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code. Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 7 / 41
  • 12. Anonymized Data Isn’t Example 1: Mid 1990’s: Massachusetts “Group Insurance Commission” released “anonymized” data on state employees that showed every hospital visit Goal was to help researchers. Removed all obvious identifiers such as name, address, and social security number MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization, requested a copy of the data Reidentification William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6 people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code. Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 7 / 41
  • 13. Anonymized Data Isn’t Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period Netflix “anonymized” the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends Reidentification Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDB database. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 8 / 41
  • 14. Anonymized Data Isn’t Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period Netflix “anonymized” the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends Reidentification Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDB database. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 8 / 41
  • 15. Differential Privacy A randomised mechanism M : X → Y is -differentially private if for all neighbouring inputs x x (i.e. x − x 1 = 1) and for all sets of outputs E ⊆ Y we have P[M(x) ∈ E] ≤ e P M x ∈ E 0 5 10 15 20 25 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 Ratio bounded by e M(D) M(D') Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 9 / 41
  • 16. Differential Privacy A randomised mechanism M : X → Y is -differentially private if for all neighbouring inputs x x (i.e. x − x 1 = 1) and for all sets of outputs E ⊆ Y we have P[M(x) ∈ E] ≤ e P M x ∈ E 0 5 10 15 20 25 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 Ratio bounded by e M(D) M(D') Mechanisms: Randomised response −→ plausible deniability Laplace mechanism: e.g. ˜µ = µ + ξ, ξ ∼ Lap 1 n Output perturbation ... Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 9 / 41
  • 17. Randomized Response [Warner ’65] Say you want to release a bit x ∈ {Yes, No}. Do the following: 1 flip a coin 2 if tails, respond truthfully with x 3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 10 / 41
  • 18. Randomized Response [Warner ’65] Say you want to release a bit x ∈ {Yes, No}. Do the following: 1 flip a coin 2 if tails, respond truthfully with x 3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails Claim: Above algorithm satisfies (log 3)-differential privacy Pr[Response = Yes|x = Yes] Pr[Response = Yes|x = No] = 1/2 × 1 + 1/2 × 1/2 1/2 × 0 + 1/2 × 1/2 = 3/4 1/4 = 3 =⇒ e = 3 Same for Pr[Response=No|x=Yes] Pr[Response=No|x=No] . Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 10 / 41
  • 19. Important Properties Robustness to post-processing: M is ( , δ)-DP, then f (M) is ( , δ)-DP Composition: if M1, . . . , Mn are ( , δ)-DP, then g (M1, . . . , Mn) is ( n i=1 i , n i=1 δi )-DP Protects against arbitrary side knowledge Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 11 / 41
  • 20. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 12 / 41
  • 21. User-AI system interaction via natural language
    User’s goal: meet some specific need with respect to an issued query x
    Agent’s goal: satisfy the user’s request
    Privacy violation: occurs when x is used to make a personal inference, e.g. when unrestricted PII is present
    Mechanism: modify the query to protect privacy whilst preserving semantics
    Our approach: Metric Differential Privacy
  • 27. Desired Functionality

    | Intent             | Query x                        | Modified Query x'                |
    |--------------------|--------------------------------|----------------------------------|
    | GetWeather         | Will it be colder in Cleveland | Will it be colder in Ohio        |
    | PlayMusic          | Play Cantopop on lastfm        | Play C-pop on lastfm             |
    | BookRestaurant     | Book a restaurant in Milladore | Book a restaurant in Wood County |
    | SearchCreativeWork | I want to watch Manthan film   | I want to watch Hindi film       |
  • 28. Word Embeddings
    Mapping from words into vectors of real numbers (many ways to do this!)
    e.g. neural-network-based models (Word2Vec, GloVe, fastText)
    Defines a mapping $\phi : W \to \mathbb{R}^n$
    Nearest neighbours are often synonyms
  • 29. Metric Differential Privacy
    Recall the definition of DP:
    $P[M(x) \in E] \le e^{\varepsilon} \, P[M(x') \in E]$ for $x, x' \in X$ s.t. $\|x - x'\|_1 = 1$
    This can be rewritten into a single equation as:
    $\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \|x - x'\|_1}$
    Metric differential privacy generalises this to use any valid metric $d(x, x')$:
    $\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \, d(x, x')}$
    (easy to see that standard DP is metric DP with $d(x, x') = \|x - x'\|_1$)
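One way to see why the neighbour-based definition yields the single inequality is a chaining argument. The slide does not spell this out, so the following is a sketch under the assumption that any two inputs with $\|x - x'\|_1 = k$ are joined by a chain $x = x_0, x_1, \ldots, x_k = x'$ of neighbouring inputs, and that the probabilities involved are non-zero:

```latex
\[
  \frac{P[M(x) \in E]}{P[M(x') \in E]}
  = \prod_{i=1}^{k} \frac{P[M(x_{i-1}) \in E]}{P[M(x_i) \in E]}
  \le \prod_{i=1}^{k} e^{\varepsilon}
  = e^{\varepsilon k}
  = e^{\varepsilon \|x - x'\|_1} .
\]
```

Each factor in the product compares two neighbouring inputs, so it is bounded by $e^{\varepsilon}$ under the standard definition.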
  • 32. Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
    Given:
      $w \in W$: word to be “privatised” from word space W (dictionary)
      $\phi : W \to Z$: embedding function from word space to embedding space Z (e.g. $\mathbb{R}^n$)
      $v = \phi(w)$: corresponding word vector
      $d : Z \times Z \to \mathbb{R}$: distance function in embedding space
      $\Omega(\varepsilon)$: the DP noise sampling distribution (e.g. $\Omega_i(\varepsilon) = \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$, $i = 1, \ldots, n$ for $\mathbb{R}^n$)
    Metric DP mechanism for word embeddings:
      1 Perturb the word vector: $v' = v + \xi$ where $\xi \sim \Omega(\varepsilon)$
      2 The new vector $v'$ will not be a word (a.s.)
      3 Project back to W: $w' = \arg\min_{u \in W} d(v', \phi(u))$, return $w'$
    What do we need?
      d satisfies the axioms of a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality)
      A way to sample using Ω in the metric space that respects d and gives us ε-metric DP
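The three steps of the mechanism above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the vocabulary and embeddings are made up, the function names are hypothetical, and the noise is drawn from one common choice for Euclidean metric DP, a density proportional to exp(−ε‖ξ‖₂), sampled as a uniform direction scaled by a Gamma(n, 1/ε) radius. An exact nearest-neighbour search stands in for the approximate one used at scale.

```python
import numpy as np

def sample_noise(n, epsilon, rng):
    """Sample xi in R^n with density proportional to exp(-epsilon * ||xi||_2).

    Assumed sampling scheme: uniform direction on the unit sphere,
    radius drawn from Gamma(shape=n, scale=1/epsilon).
    """
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=n, scale=1.0 / epsilon)
    return radius * direction

def privatise_word(word, vocab, embeddings, epsilon, rng):
    """Perturb the word vector, then project back to the nearest vocabulary word."""
    v = embeddings[vocab.index(word)]
    v_noisy = v + sample_noise(v.shape[0], epsilon, rng)      # step 1: perturb
    dists = np.linalg.norm(embeddings - v_noisy, axis=1)      # Euclidean d
    return vocab[int(np.argmin(dists))]                       # step 3: project

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only).
vocab = ["cleveland", "ohio", "milladore"]
embeddings = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 5.0, 0.0]])
rng = np.random.default_rng(0)
print([privatise_word("cleveland", vocab, embeddings, 2.0, rng) for _ in range(5)])
```

With a large ε the noise radius is small and the original word is usually returned; with a small ε the output is spread over more of the vocabulary.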
  • 36. Differential Privacy in the Space of Euclidean Word Embeddings
    Adding noise to a location always produces a valid location: a point somewhere on the earth’s surface
    Adding noise to a word embedding produces a new point in the embedding space, but it is almost surely not the location of a valid word embedding
    We perform an approximate nearest-neighbour search to find the nearest valid embedding
    The nearest valid embedding could be that of the original word itself: in that case, the original word is returned
  • 37. Practical Considerations
    To help choose ε, we define:
      Uncertainty statistics for the adversary over the outputs
      Indistinguishability statistics: plausible deniability
    Find a radius of high protection: a guarantee on the likelihood of changing any word in the embedding vocabulary
  • 38. Euclidean Experiments: Setup

    | Dataset           | IMDb                 | Enron                 | InsuranceQA        |
    |-------------------|----------------------|-----------------------|--------------------|
    | Task type         | Sentiment analysis   | Author identification | Question answering |
    | Evaluation metric | accuracy             | accuracy              | MAP, MRR           |
    | Training set size | 25,000               | 8,517                 | 12,887             |
    | Test set size     | 25,000               | 850                   | 1,800              |
    | Total word count  | 5,958,157            | 307,639               | 92,095             |
    | Vocabulary size   | 79,428               | 15,570                | 2,745              |
    | Sentence length   | μ = 42.27, σ = 34.38 | μ = 30.68, σ = 31.54  | μ = 7.15, σ = 2.06 |

    Scenario 1: Train-time protection. Little access to public data (10%), but abundant access to private training data (90%); model training is done on the combined dataset (i.e. public subset + perturbed private subset)
    Scenario 2: Test-time protection. Models trained on the complete training set; evaluation on privatized versions of the test sets
    We used 300-D GloVe word embeddings with biLSTM models
  • 39. Results
    IMDb reviews: accuracy vs baseline for different values of ε
    [Figure: two panels, “Accuracy (at training time)” and “Accuracy (at test time)”; x-axis: epsilon (200 to 1000); y-axis: accuracy (0.0 to 1.0); series: Accuracy, Baseline]
  • 40. Results
    Enron emails: accuracy vs baseline for different values of ε
    [Figure: two panels, “Accuracy (at training time)” and “Accuracy (at test time)”; x-axis: epsilon (200 to 1000); y-axis: accuracy (0.0 to 1.0); series: Accuracy, Baseline]
  • 41. Results
    InsuranceQA: MAP/MRR scores for different values of ε on the dev set
    [Figure: two panels, “Scores for dev at training time” and “Scores for dev at test time”; x-axis: epsilon (200 to 1000); y-axis: score (0.0 to 1.0); series: MAP on dev, MRR on dev, MAP baseline, MRR baseline]
  • 42. Privacy Evaluation
    In the previous experiments, we didn’t explicitly evaluate privacy
    Problem: ε is an arbitrary number that is hard to interpret
    This is especially true in metric DP, since ε is on a different scale
    As we have seen, there are empirical ways to calibrate ε according to statistics of the word embeddings
    But how do we convince stakeholders that the privacy guarantees are holding, and that there are no bugs?
    Solution: machine auditors, i.e. machine learning algorithms designed to simulate different types of privacy attacks on the data
  • 48. Machine Auditors
    Probabilistic record linkage auditing attack
    Objective: link a user in a public dataset to a user in a (leaked) private dataset.
    Attack simulation: simulate public and “leaked” datasets by randomly splitting an initial dataset. The attack takes advantage of rare words and queries issued by users. A vector of word counts can be extracted from user queries and used to perform the linkage.
    Assumptions: attacker is able to narrow the attack set (using side knowledge)
    Evaluation: how many accurate links can the attacker reconstruct?
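The word-count linkage idea above can be illustrated with a deliberately simplified stand-in. The sketch below is not the probabilistic record linkage used by the authors: it links each leaked user to the public user with the highest cosine similarity between word-count vectors, and all names and toy queries are hypothetical.

```python
import math
from collections import Counter

def count_vector(queries):
    """Bag-of-words count vector over all of a user's queries."""
    return Counter(word for q in queries for word in q.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def link_users(public, leaked):
    """Map each leaked user id to the most similar public user id."""
    public_vecs = {user: count_vector(qs) for user, qs in public.items()}
    links = {}
    for leaked_id, queries in leaked.items():
        vec = count_vector(queries)
        links[leaked_id] = max(public_vecs, key=lambda u: cosine(public_vecs[u], vec))
    return links

# Toy split of two users' queries into a "public" and a "leaked" half.
public = {"alice": ["play cantopop on lastfm", "play more cantopop"],
          "bob": ["book a restaurant in milladore"]}
leaked = {"u1": ["book a table in milladore"],
          "u2": ["cantopop playlist please"]}
print(link_users(public, leaked))
```

Rare words such as “milladore” or “cantopop” dominate the similarity, which is why replacing them with less specific neighbours is expected to lower the number of correct links.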
  • 49. Machine Auditors
    Membership auditing attack [Shokri et al ’17, Song & Shmatikov ’18]
    Objective: identify whether an individual’s data (queries) were used in the training set of an ML model.
    Attack simulation: train an ML model on queries from m users. Train “shadow” models using data from a different set of n users. The attack model is a classifier built using the output of the shadow models.
    Assumptions: attacker is able to narrow the attack set (using side knowledge)
    Evaluation: can the attacker correctly distinguish the m users inside the model’s training set from users outside it?
  • 51. Hyperbolic Spaces
    [Figure: (a) Projection of a point in the Lorentz model $H^n$ to the Poincaré model; (b) WebIsADb is-a relationships in the GloVe vocabulary on the $B^2$ Poincaré disk]
    Continuous analog of a tree structure
    Natural language captures hypernymy and hyponymy → embeddings require fewer dimensions
    Use models of hyperbolic space: projections into Euclidean space
  • 52. Hyperbolic Differential Privacy
    Distances in the n-dimensional Poincaré ball are given by:
    $d_{B^n}(u, v) = \operatorname{arcosh}\left(1 + 2 \frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$
    Claim: $d_{B^n}(u, v)$ is a valid metric. Proof (via the Lorentzian model) in the paper
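The distance formula above translates directly into code. The sketch below is a plain NumPy transcription (the function name is hypothetical) and assumes both points lie strictly inside the unit ball.

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points of the open unit (Poincare) ball."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

origin = [0.0, 0.0]
near_centre = [0.1, 0.0]
near_edge = [0.9, 0.0]
print(poincare_distance(origin, near_centre))
print(poincare_distance(origin, near_edge))
```

Points near the boundary of the ball are much farther from the origin than their Euclidean norm suggests, which is what lets the ball accommodate tree-like hierarchies in few dimensions.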
  • 53. Hyperbolic Noise
    Recall that for Euclidean metric DP we use Laplacian noise to achieve ε-mDP, i.e. $\xi \sim \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$
    We derive the Hyperbolic Laplace distribution:
    $p(x \mid \mu = 0, \varepsilon) = \frac{1 + \varepsilon}{2 \, {}_2F_1(1, \varepsilon; 2 + \varepsilon; -1)} \left(-\frac{2}{x - 1} - 1\right)^{-\varepsilon}$
    where ${}_2F_1(a, b; c; z)$ is the hypergeometric function
    For sampling, we developed a Lorentzian Metropolis-Hastings sampler (see paper)
    [Figure: two-dimensional plot, both axes ranging from −0.4 to 0.4]
  • 56. Hyperbolic Privacy Experiments 1
    Task: obfuscation vs. Koppel’s authorship attribution algorithm
    Datasets: PAN@Clef tasks

    | ε   | Pan-11 small | Pan-11 large | Pan-12 set-A | Pan-12 set-C | Pan-12 set-D | Pan-12 set-I |
    |-----|--------------|--------------|--------------|--------------|--------------|--------------|
    | 0.5 | 36           | 72           | 4            | 3            | 2            | 5            |
    | 1   | 35           | 73           | 3            | 3            | 2            | 5            |
    | 2   | 40           | 78           | 4            | 3            | 2            | 5            |
    | 8   | 65           | 116          | 4            | 5            | 4            | 5            |
    | ∞   | 147          | 259          | 6            | 6            | 6            | 12           |

    Correct author predictions (lower is better)
  • 57. Hyperbolic Privacy Experiments 2
    Task: expected privacy vs Euclidean baseline
    Datasets: 100/200/300d GloVe embeddings

    | ε     | worst-case N_w | expected N_w: hyp-100 | expected N_w: euc-100 | expected N_w: euc-200 | expected N_w: euc-300 |
    |-------|----------------|-----------------------|-----------------------|-----------------------|-----------------------|
    | 0.125 | 134            | 1.25                  | 38.54                 | 39.66                 | 39.88                 |
    | 0.5   | 148            | 1.62                  | 42.48                 | 43.62                 | 43.44                 |
    | 1     | 172            | 2.07                  | 48.80                 | 50.26                 | 53.82                 |
    | 2     | 297            | 3.92                  | 92.42                 | 93.75                 | 90.90                 |
    | 8     | 960            | 140.67                | 602.21                | 613.11                | 587.68                |

    Privacy comparisons (lower N_w is better)
  • 58. Hyperbolic Utility Experiments
    5 classification tasks: sentiment (x2), product reviews, opinion polarity, question-type
    3 natural language tasks: NL inference, paraphrase detection, semantic textual similarity
    Baselines: utility results baselined using SentEval against random replacement

    | Task   | random    | hyp-100d ε = 0.125 | hyp-100d ε = 1 | hyp-100d ε = 8 | original: InferSent | original: SkipThought | original: fastText |
    |--------|-----------|--------------------|----------------|----------------|---------------------|-----------------------|--------------------|
    | MR     | 58.19     | 58.38              | 63.56          | 74.52          | 81.10               | 79.40                 | 78.20              |
    | CR     | 77.48     | 83.21**            | 83.92**        | 85.19**        | 86.30               | 83.1                  | 80.20              |
    | MPQA   | 84.27     | 88.53*             | 88.62*         | 88.98*         | 90.20               | 89.30                 | 88.00              |
    | SST-5  | 30.81     | 41.76              | 42.40          | 42.53          | 46.30               | −                     | 45.10              |
    | TREC-6 | 75.20     | 82.40              | 82.40          | 84.20*         | 88.20               | 88.40                 | 83.40              |
    | SICK-E | 79.20     | 81.00**            | 82.38**        | 82.34**        | 86.10               | 79.5                  | 78.9               |
    | MRPC   | 69.86     | 74.78*             | 75.07*         | 75.01*         | 76.20               | −                     | 74.40              |
    | STS14  | 0.17/0.16 | 0.44/0.45          | 0.45/0.46*     | 0.52/0.53*     | 0.68/0.65           | 0.44/0.45             | 0.65/0.63          |

    Accuracy scores on classification tasks. * indicates results better than 1 baseline, ** better than 2 baselines
  • 60. [Figure: utility vs. privacy]
  • 61. Example: Differentially Private SGD
    Algorithm 1: Differentially Private SGD
      Input: dataset $z = (z_1, \ldots, z_n)$
      Hyperparameters: learning rate η, mini-batch size m, number of epochs T, noise variance $\sigma^2$, clipping norm L
      Initialize $w \leftarrow 0$
      for $t \in [T]$ do
        for $k \in [n/m]$ do
          Sample $S \subset [n]$ with $|S| = m$ uniformly at random
          Let $g \leftarrow \frac{1}{m} \sum_{j \in S} \mathrm{clip}_L(\nabla \ell(z_j, w)) + \frac{2L}{m} N(0, \sigma^2 I)$
          Update $w \leftarrow w - \eta g$
      return w
    5+ hyper-parameters affecting both privacy and utility
    For deep learning applications we only have empirical utility (not analytic)
    How do we find the hyperparameters that give us an optimal trade-off?
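Algorithm 1 can be made concrete for a simple model. The sketch below follows the pseudocode step by step for logistic regression; the choice of model, the loss gradient, and all function names are assumptions made for illustration. It does not compute the resulting privacy loss ε, which would require a separate privacy accountant.

```python
import numpy as np

def clip(g, L):
    """Rescale g so that its Euclidean norm is at most L."""
    norm = np.linalg.norm(g)
    return g if norm <= L else g * (L / norm)

def logistic_grad(x, y, w):
    """Gradient of the logistic loss for one example (label y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

def dp_sgd(X, y, eta, m, T, sigma, L, rng):
    """Differentially private SGD as in Algorithm 1 (illustrative model: logistic regression)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        for _ in range(n // m):
            S = rng.choice(n, size=m, replace=False)           # mini-batch of size m
            grads = [clip(logistic_grad(X[j], y[j], w), L) for j in S]
            noise = (2.0 * L / m) * rng.normal(0.0, sigma, size=d)
            g = np.mean(grads, axis=0) + noise                 # clipped mean + Gaussian noise
            w = w - eta * g
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(float)
w = dp_sgd(X, y, eta=0.5, m=50, T=10, sigma=1.0, L=1.0, rng=rng)
print("learned weights:", w)
```

Every argument of `dp_sgd` other than the data is one of the hyper-parameters listed on the slide, and each of them influences both the noise injected and the accuracy reached.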
  • 62. The Privacy-Utility Pareto Front
    [Figure: labels “Hyper-parameter Space” and “Pareto-Optimal Points”; axes: Privacy Loss, Error]
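To make the notion of Pareto-optimal points concrete, the sketch below filters a set of (privacy loss, error) pairs down to the non-dominated ones, treating both objectives as quantities to minimise. The function name and the sample points are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (privacy_loss, error) pairs, sorted by privacy loss.

    A point p is dominated if some other point q is no worse in both
    objectives and differs from p in at least one of them.
    """
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical outcomes of six hyper-parameter settings.
observed = [(0.1, 0.90), (1.0, 0.30), (1.0, 0.50),
            (5.0, 0.20), (5.0, 0.40), (0.5, 0.95)]
print(pareto_front(observed))
```

Each point on the front represents a setting for which no other observed setting gives both lower privacy loss and lower error.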
  • 71. Bayesian Optimization
    Gradient-free optimization for black-box functions
    Widely used in applications (HPO in ML, scheduling & planning, experimental design, ...)
    In multi-objective problems, BO aims to learn the Pareto front with a minimal number of evaluations
  • 81. DPareto
    Repeat:
      1 For each objective (privacy, utility):
          1 Fit a surrogate model (Gaussian process (GP)) using the available dataset
          2 Calculate the predictive distribution using the GP mean and variance functions
      2 Use the posterior of the surrogate models to form an acquisition function
      3 Collect the next point at the estimated global maximum of the acquisition function
    until budget exhausted
  • 82. DPareto vs Random Sampling
    [Figure: three panels. “Hypervolume Evolution”: PF hypervolume vs number of sampled points for MLP1 (RS), MLP1 (BO), MLP2 (RS), MLP2 (BO). “MLP2 Pareto Fronts”: classification error vs ε (log scale, 10⁻¹ to 10¹) for Initial, +256 RS, +256 BO. “LogReg+SGD Samples”: classification error vs ε (log scale, 10⁻¹ to 10¹) for 1500 RS, 256 BO]
  • 84. Summary: Privacy Enhancing Technologies
    Privacy
      Privacy risks can be counter-intuitive and tricky to formalize
      High-dimensional data and side knowledge make privacy hard
      Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
    Differential privacy is a mature privacy enhancing technology
    Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
    Empirical privacy-utility trade-off evaluation enables application-specific decisions
    Bayesian optimization provides a computationally efficient method to recover the Pareto front (especially with a large number of hyper-parameters)
  • 85. Questions? tdiethe@amazon.com