Privacy-preserving Data Sharing
PyData Berlin 2018
Omar Ali Fdal - omar@statice.io
http://statice.io
Plan
● Privacy notion and expectation
● A brief history of privacy-preserving mechanisms
○ Pseudonymization
○ k-Anonymity
○ Differential privacy
○ Data synthesization
● Final notes
2
Privacy Notion and Expectation
3
Privacy
English dictionary definition:
- A state in which one is not observed or disturbed by other
people
● Lack of privacy => behavioral change
● Fundamental to a free society
○ Anonymous voting guarantees freedom of
choice
https://weburbanist.com/2014/05/15/real-life-panopticons-deserted-dystopian-prisons-in-cuba/
Panopticon
4
Privacy in the present
● Digital tracking everywhere
● Social circle, browsing habits, shopping
details, location tracking, emails, calls, ...
5
Privacy: all or nothing?
● Privacy is not necessarily complete non-disclosure
● Sharing sensitive information is common when it makes sense
○ With your doctor, tax accountant, etc.
6
Why share data in the first place?
● Society benefits from individuals sharing their data
○ Medical advances
○ Sociological research, understanding society dynamics
● Examples:
○ Tracking commute patterns to improve public transport
networks
○ Detect epidemics and act fast by looking at search-engine disease
queries and medicine orders
7
Privacy: a challenging constraint
● Non-trivial constraint
○ Protecting privacy of individual people
8
A Brief History of Privacy Preservation Mechanisms
9
Illustration Dataset
10
Personally Identifying Information
11
Illustration: Cambridge Analytica
● Infamous leak involved Personally Identifiable Information of over
50 million people
https://www.theguardian.com/technology/2018/mar/17/facebook-cambridge-analytica-kogan-data-algorithm
12
Solution?
Remove all “Personally Identifying Information”
● A.k.a
○ Pseudonymization
○ Sanitization
○ De-identification
○ Anonymization
13
Information not unique to you: "quasi-identifiers"
14
Illustration: Massachusetts Governor leak
Sweeney L. Weaving Technology and Policy Together to Maintain Confidentiality. Journal of Law, Medicine and Ethics. 1997;25:98-110.
15
Fingerprint-like information
● On its own, a fingerprint seems cryptic
● Around 100 minutiae in a fingerprint
● Experts declare a fingerprint match if 12
minutiae match
● Precise identification is possible if
fingerprints are indexed and queryable
16
Illustration: Netflix Movie Preferences
Join movie ratings
Ratings of only 4-5 movies allowed successful identification of a large number of users
17
Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Security and Privacy, 2008. SP 2008. IEEE Symposium on. IEEE; 2008. pp. 111-125.
Illustration: Strava Running tracks
Heatmap of 30 million runners worldwide; French military base in Mali
Not that many runners in the Sahara
18
And many more ...
● Search queries
● Browser configuration
19
Common issue: Linkage Attack
● With an auxiliary dataset an attacker can “link” the pseudonymous
data
● Auxiliary dataset may or may not be obtained legally
● Auxiliary dataset may contain Personally Identifiable Information
20
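A minimal sketch of such a linkage attack with pandas (all column names and values are made up for illustration): the pseudonymized release keeps quasi-identifiers such as ZIP code, birth date, and sex, and an auxiliary dataset, e.g. a public voter roll, maps those same attributes back to names.

```python
import pandas as pd

# Pseudonymized release: direct identifiers removed, quasi-identifiers kept.
released = pd.DataFrame({
    "zip":        ["10115", "10245", "13353"],
    "birth_date": ["1970-03-01", "1985-07-12", "1992-11-30"],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["diabetes", "asthma", "hypertension"],
})

# Auxiliary dataset obtained elsewhere (e.g. a voter roll) that contains names.
voter_roll = pd.DataFrame({
    "name":       ["Alice", "Bob", "Carol"],
    "zip":        ["10115", "10245", "13353"],
    "birth_date": ["1970-03-01", "1985-07-12", "1992-11-30"],
    "sex":        ["F", "M", "F"],
})

# The linkage attack is nothing more than a join on the quasi-identifiers.
linked = released.merge(voter_roll, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```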
Take-away: Pseudonymization is not enough
● “We do not share information of data in any personally identifiable
form” -- Serial Pseudonymizer
● Still the most widely used approach in data releases today
21
Solution?
K-anonymity: avoiding unique joins
● Avoid unique joins on the “quasi-identifier” attributes
● Each combination of quasi-identifier values appears in at least k rows
● Achieved using suppression, generalization, binning, and top-coding (a checking sketch follows below)
22
Sweeney L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems. 2002 Oct;10(05):557-70.
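A hedged sketch of how one might check k-anonymity with pandas (the dataframe, column names, and generalization rules are illustrative assumptions, not from the talk):

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

df = pd.DataFrame({
    "zip":        ["10115", "10117", "10245", "10243"],
    "birth_date": ["1970-03-01", "1971-08-21", "1985-07-12", "1986-01-02"],
    "diagnosis":  ["A", "B", "A", "C"],
})

# Generalization: truncate ZIP codes and coarsen birth dates to decades.
df["zip"] = df["zip"].str[:3] + "**"
df["birth_decade"] = df["birth_date"].str[:3] + "0s"

print(is_k_anonymous(df, ["zip", "birth_decade"], k=2))  # True
```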
K-anonymity: Example
23
Original dataset 3-anonymous dataset
K-anonymity: Linkage protection
24
Record to be linked
3-anonymous dataset
K-anonymity issues
25
Records to be linked
3-anonymous dataset
● Lack of diversity
● Background knowledge
K-anonymity: only superficial linkage avoidance
● A syntactic guarantee
● Makes strong assumptions about attack scenario:
○ no external knowledge
● Curse of dimensionality
○ Hard to achieve in high dimensional spaces
26
Take-away: assume the worst
● Do not assume anything about the attack scenario
○ Resourceful attacker
○ Access to auxiliary information
● Any method trying to preserve privacy needs to take this into
account
27
Privacy Promise: Opt-out scenario
● My data must have no effect on any analysis carried out on the
dataset
● Problem: if nobody’s data has any effect on any analysis, then
there is no utility.
28
Privacy Promise: Inference Protection
● I have a secret, denoted secret(me)
● Given any result of any analysis carried out on a dataset, I expect:
Pr(secret(me) | result) = Pr(secret(me))
● Problem:
○ if the secret does not concern only me, i.e. can be generally
learned from other people, then the result will affect me even
if my data is not part of the dataset
29
Privacy Promise: what can we expect?
● A tradeoff
○ With or without my data, any outcome of any analysis based
on a given dataset is almost equally likely
○ The impact of my sharing information in the dataset will be
limited to the general learnings, not the specifics of my
information
30
Privacy Promise: differential privacy
31
(Diagram: a possible world including my data (D') and a possible world excluding my data (D), both producing a result R)
Given a result R, it should not be
possible to guess whether my
data was included or not in the
dataset
Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography Conference (TCC) 2006. Springer, Berlin, Heidelberg; 2006. pp. 265-284.
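Behind the picture is the formal guarantee from Dwork et al. (2006): a randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ differing in a single record and every set of possible results $S$,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

The smaller $\varepsilon$ is, the harder it is to tell the two possible worlds apart from the result alone.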
Differential Privacy implementation:
The Laplace Mechanism

$$M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)$$

where $\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$ is the global sensitivity of the function $f$.
It represents the maximum contribution of a single record in the dataset.
The ratio of the output distributions

$$\frac{\Pr[M(D) = r]}{\Pr[M(D') = r]}$$

is bounded by $e^{\varepsilon}$.
32
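A minimal Python/NumPy sketch of the Laplace mechanism (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to a query answer."""
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query ("how many people are over 40?") has global
# sensitivity 1, since adding or removing one record changes the count by at most 1.
ages = np.array([23, 45, 31, 67, 52, 29, 41])
true_count = int(np.sum(ages > 40))
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(true_count, noisy_count)
```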
Differential Privacy: advantages
● No assumptions about the attack scenario
● Formal theoretical definition
● Composition theorems to reason about complex computations
33
Differential Privacy: drawbacks
● Must reason about multiple queries by setting a privacy budget (see the sketch below)
○ If a user asks the same query multiple times she might be
able to remove the noise and deduce the true answer
● Hard to devise a differentially private mechanism in complex
cases
● Limited adoption
34
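A toy illustration of why a privacy budget is needed (numbers are made up): answering the same query many times and averaging washes the Laplace noise out, which is exactly what sequential composition accounts for by adding up the epsilons.

```python
import numpy as np

rng = np.random.default_rng(0)
true_count, sensitivity, epsilon = 42, 1, 0.1

# 1000 noisy answers to the same counting query: their mean is very close
# to the true value, so the noise no longer protects anything.
answers = true_count + rng.laplace(scale=sensitivity / epsilon, size=1000)
print(answers.mean())

# Sequential composition: the total privacy cost is the sum of the epsilons,
# here 1000 * 0.1 = 100, i.e. essentially no privacy guarantee left.
print(1000 * epsilon)
```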
Differential Privacy: Data Release?
● Release of aggregates (histograms)
○ US Census Bureau case
○ Challenge: maintain consistency across datasets
● What if we want to release data at the same granularity?
35
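A sketch of the aggregate-release idea on a toy age column (illustrative data, not the Census Bureau's actual method): a histogram over disjoint bins has L1 sensitivity 1, so adding Laplace noise per bin gives a differentially private release, and clipping/rounding afterwards is pure post-processing with no extra privacy cost.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = rng.integers(18, 90, size=10_000)  # toy stand-in for real data

bins = np.arange(18, 100, 10)
true_hist, edges = np.histogram(ages, bins=bins)

# Each person falls into exactly one bin, so the histogram has L1 sensitivity 1.
epsilon = 1.0
noisy_hist = true_hist + rng.laplace(scale=1.0 / epsilon, size=true_hist.shape)

# Post-processing (free, privacy-wise): clip negatives and round to integers.
released = np.clip(np.round(noisy_hist), 0, None).astype(int)
print(list(zip(edges[:-1], released)))
```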
Data Synthesization: illustration
● US army anthropometric data
36
Data Synthesization: illustration
● US army synthetic anthropometric data
based on the learned joint distribution
37
Differentially Private Synthesization
● Estimate the joint distribution in
a differentially private way
38
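One way to sketch this for low-dimensional categorical data (an illustrative approach, not necessarily the one behind the anthropometric example): build a contingency table over the attributes, make its counts differentially private with Laplace noise, normalize into a joint distribution, and sample synthetic records from it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Toy categorical dataset (made-up attributes and values).
df = pd.DataFrame({
    "age_group": rng.choice(["20-39", "40-59", "60+"], size=1000),
    "diagnosis": rng.choice(["A", "B", "C"], size=1000),
})

# 1. Joint distribution as a contingency table; each record contributes to
#    exactly one cell, so the table has L1 sensitivity 1.
table = pd.crosstab(df["age_group"], df["diagnosis"])

# 2. Differentially private counts via the Laplace mechanism.
epsilon = 1.0
noisy = (table + rng.laplace(scale=1.0 / epsilon, size=table.shape)).clip(lower=0)

# 3. Normalize into probabilities and sample synthetic records from them.
probs = (noisy / noisy.values.sum()).stack()
picks = rng.choice(len(probs), size=len(df), p=probs.values)
synthetic = pd.DataFrame(list(probs.index[picks]), columns=["age_group", "diagnosis"])
print(synthetic.head())
```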
Data Synthesization: Applications
● Test data in development phases
● Limit access to real data and replace it with synthetic data when
possible
● Share synthetic data for exploration purposes
39
Final Notes
● Pseudonymization is not anonymization, do not use it
● Legal constraints (e.g. GDPR) are a trigger for organizations to use better
privacy-preserving mechanisms
● Privacy-preserving data release is an open research question
40
References
● Sweeney L. Weaving Technology and Policy Together to Maintain Confidentiality. Journal of Law, Medicine and Ethics. 1997;25:98-110.
● Sweeney L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2002 Oct;10(05):557-70.
● Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Security and Privacy, 2008. SP 2008. IEEE Symposium on. IEEE; 2008. pp. 111-125.
● Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography Conference (TCC) 2006. Springer, Berlin, Heidelberg; 2006. pp. 265-284.
● Nissim K, Steinke T, Wood A, Altman M, Bembenek A, Bun M, Gaboardi M, O’Brien DR, Vadhan S. Differential Privacy: A Primer for a Non-technical Audience. Privacy Tools for Sharing Research Data Working Group, Harvard University, Boston, MA, USA. Tech. Rep. TR-2017-03; 2017 May 7.
● Dwork C, Roth A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science. 2014 Aug 11;9(3-4):211-407.
41
