Internal lab seminar slides. Privacy protection is a perennial concern when releasing raw data. Even when identifiers are de-identified, several record-linkage attacks, which re-identify observations by joining them with other records, have succeeded and raised concern. Adding suitable noise to de-identified observations, or releasing only aggregated data, can serve as alternatives. These slides introduce differential privacy, a way to theoretically quantify how well personal information is protected, along with examples.
3. 𝜖-differential privacy
• 𝑋1 and 𝑋2 : 𝑛 by 𝑝 data.frames
• Both can be arbitrary but differ in only one row
• 𝜅 : user-query on 𝑋 + noise
e.g. Let 𝑝 = 1 and 𝑋 be an 𝑛 by 1 binary data vector;
consider sum(𝑋) + random Laplace noise with density 𝑝(𝑥) ∝ exp(−|𝑥|/𝜆).
• The probability 𝑃(⋅) comes from the randomness 𝑝(⋅) of the noise.
• Other queries : counting, histogram, first name..
• Interactive setting vs. non-interactive setting
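The running example can be sketched in code (a minimal illustration; function name and toy data are mine, not from [1] or the slides). Recall that 𝜖-differential privacy requires 𝑃(𝜅(𝑋1) ∈ 𝑆) ≤ exp(𝜖) · 𝑃(𝜅(𝑋2) ∈ 𝑆) for any pair of neighboring data sets and any outcome set 𝑆. For the sum of a binary column, neighboring data sets change the true answer by at most 1, so Laplace noise with scale 𝜆 = 1/𝜖 suffices:

```python
import numpy as np

def laplace_sum(x, eps, rng=None):
    """eps-DP release of sum(x) for a binary vector x.

    Neighboring data sets differ in one row, so the sensitivity of the
    sum is 1 and Laplace noise with scale lambda = 1/eps suffices."""
    if rng is None:
        rng = np.random.default_rng()
    return float(np.sum(x)) + rng.laplace(loc=0.0, scale=1.0 / eps)

x = np.array([1, 0, 1, 1, 0, 1])   # an n-by-1 binary column
noisy = laplace_sum(x, eps=0.5)    # noisy answer to the sum query
```

Smaller 𝜖 means stronger privacy but larger noise; the released value is unbiased for the true sum.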
[1] Sánchez, Domingo-Ferrer, and Martínez. Improving the Utility of Differential Privacy via Univariate Microaggregation. LNCS 2014.
2016-08-17 3
4. Why consider differential privacy?
• Data Cannot be Fully Anonymized and Remain Useful.
• (de-identified) medical encounter data + (publicly available) voter
registration records in Massachusetts = re-identification
• (de-identified) movie records published by Netflix + (publicly available)
the Internet Movie Database (IMDb) = re-identification
• Re-Identification of “Anonymized” Records is Not the Only
Risk.
• membership disclosure can itself cause harm (e.g., membership in a
small patient cohort can suggest a diagnosis)
• Queries Over Large Sets are Not Protective.
• e.g. if it is known that Mr. X is in a certain medical database,
“How many people in the database have disease A?” +
“How many people, not named X, in the database have disease A?”
yield the A-status of Mr. X.
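The differencing attack above can be made concrete (the toy records are mine, not from the slides): two individually innocuous large-set counts differ by exactly one record.

```python
# Toy data: (name, disease) pairs; Mr. X is known to be in the database.
records = [("X", "A"), ("Y", "B"), ("Z", "A"), ("W", "A")]

# "How many people in the database have disease A?"
q1 = sum(1 for name, d in records if d == "A")
# "How many people, not named X, in the database have disease A?"
q2 = sum(1 for name, d in records if d == "A" and name != "X")

# Exact answers to both queries pin down one individual's status.
x_has_A = (q1 - q2 == 1)
```

This is why exact answers, even to queries over large sets, are not protective; noisy answers break the exact cancellation.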
Chapter 1.1 of Dwork and Roth (2014), The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science 9(3–4), 211–407.
5. Why consider differential privacy?
• Query Auditing Is Problematic.
• refusal to answer can itself be disclosive
• query auditing can be computationally infeasible
• Summary Statistics are Not “Safe.”
• (GWAS, SNP) the National Institutes of Health and Wellcome Trust
terminated public access to aggregate frequency data from the
studies they fund.
• “Ordinary” Facts are Not “OK.”
• [frequent bread purchases for a long time -> suddenly infrequent
purchases] may hint at, e.g., a type II diabetes diagnosis
• “Just a Few.”
• The “it only harms just a few” philosophy can involve ethnic issues.
Chapter 1.1 of Dwork and Roth (2014), The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science 9(3–4), 211–407.
6. Why consider differential privacy?
• What differential privacy promises
• “the probability of harm is not significantly increased by their choice to
participate.”
• What differential privacy does not promise
• NOT : what one believes to be one’s secrets will remain secret.
“Differential privacy promises that the behavior of an algorithm will
be roughly unchanged even if a single entry in the database is
modified.”
• Summary
• de-identification is not sufficient
• need to contaminate(?) not only (demographic) keys but also all the
other attributes
• need to apply this philosophy at the stage of responding to user queries
Chapter 2.3 of Dwork and Roth (2014), The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science 9(3–4), 211–407.
7. Some remarks on differential privacy
• Another definition ((𝜖, 𝛿)-differential privacy)
• Algorithms can be combined
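The combinability referred to here is the composition property: running several differentially private algorithms on the same database degrades the privacy parameters at most additively. A standard (basic) statement is:

```latex
\text{If } \mathcal{M}_i \text{ is } (\epsilon_i,\delta_i)\text{-differentially private for } i = 1,\dots,k,
\text{ then the composition } (\mathcal{M}_1,\dots,\mathcal{M}_k) \text{ is }
\Bigl(\textstyle\sum_{i=1}^{k}\epsilon_i,\ \sum_{i=1}^{k}\delta_i\Bigr)\text{-differentially private.}
```

With every 𝛿𝑖 = 0 this covers the pure 𝜖-DP case: e.g., answering two 𝜖-DP queries jointly satisfies 2𝜖-DP. Sharper “advanced composition” bounds exist in Dwork and Roth (2014).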
Dwork and Roth (2014), The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science Vol. 9, Nos. 3–4, 211–407.
8. Some remarks on differential privacy
• Algorithms can be combined (continued)
Differential privacy, Wikipedia webpage
Dwork and Roth (2014), The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science Vol. 9, Nos. 3–4, 211–407.
9. [1] Univariate microaggregation
• “Microaggregation” : an alternative to histogram queries
• Procedure
• Assume 𝑛 by 𝑝 table 𝑋 with numeric attributes only
• Input : 𝑋 and 𝑘 ∈ ℕ
• For 𝑗 = (from 1 to 𝑝)
1. Divide X[ ,j] into [𝑛/𝑘] clusters of 𝑘 consecutive (sorted) values.
2. Calculate the average of each cluster.
3. Perturb the averages with independent Laplace noise.
4. Replace the original values in X[ ,j] by the perturbed averages from step 3.
• Guarantee : The output is 𝑝𝜖-differentially private.
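The four steps can be sketched as follows (a minimal sketch, not [1]'s implementation; in particular the Laplace scale 1/𝜖 is a placeholder, since [1] calibrates the noise to the sensitivity of the cluster means to obtain the 𝑝𝜖-DP guarantee):

```python
import numpy as np

def univariate_microaggregation(X, k, eps, rng=None):
    """Sketch of the per-column microaggregation procedure.

    The Laplace scale 1/eps below is a placeholder; [1] calibrates the
    noise so that the full output is p*eps-differentially private."""
    if rng is None:
        rng = np.random.default_rng()
    X = np.asarray(X, dtype=float).copy()
    n, p = X.shape
    for j in range(p):
        order = np.argsort(X[:, j])                # sort, then take consecutive values
        for start in range(0, n, k):               # 1. clusters of (about) k values
            idx = order[start:start + k]
            avg = X[idx, j].mean()                 # 2. cluster average
            avg += rng.laplace(scale=1.0 / eps)    # 3. perturb (placeholder scale)
            X[idx, j] = avg                        # 4. replace the originals
    return X
```

Averaging within clusters first reduces the sensitivity of what is released, which is the source of the utility improvement over naive per-cell noise.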
[1] Sánchez, Domingo-Ferrer, and Martínez. Improving the Utility of Differential Privacy via Univariate Microaggregation. LNCS 2014.
10. [2] Exponential random graphs
• 𝑋 : (observed) exponential random graph
• Non-interactive query
• Want to give a differentially-private random graph 𝑌 based on 𝑋
• Randomized response for edges
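Randomized response for edges can be sketched as below (a minimal sketch; the function name and details are mine, while [2] then builds likelihood-based inference on top of such a released 𝑌). Keeping each edge indicator with probability exp(𝜖)/(1 + exp(𝜖)) and flipping it otherwise is the standard randomized-response mechanism, which is 𝜖-differentially private at the edge level:

```python
import numpy as np

def randomized_response_graph(A, eps, rng=None):
    """Release a noisy graph Y from a symmetric 0/1 adjacency matrix A.

    Standard randomized response: keep each edge indicator with
    probability exp(eps) / (1 + exp(eps)), flip it otherwise; this is
    eps-edge-differentially private."""
    if rng is None:
        rng = np.random.default_rng()
    keep = np.exp(eps) / (1.0 + np.exp(eps))
    Y = A.copy()
    n = A.shape[0]
    for i in range(n):
        for j in range(i + 1, n):          # perturb each dyad exactly once
            if rng.random() > keep:
                Y[i, j] = Y[j, i] = 1 - A[i, j]
    return Y
```

Because the flip probability is known, the likelihood of the observed 𝑌 given the model parameters can be written down, which is what makes the downstream inference in [2] possible.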
[2] Karwa, Slavković, and Krivitsky. Differentially Private Exponential Random Graphs. LNCS 2014.
11. [2] Exponential random graphs
[2] Karwa, Slavković, and Krivitsky. Differentially Private Exponential Random Graphs. LNCS 2014.
12. [2] Exponential random graphs
• Of interest : likelihood-based inference on 𝜃
• Given 𝑌 : if 𝜋 (denoted 𝛾 below) is known, MCMC can be applied.
[2] Karwa, Slavković, and Krivitsky. Differentially Private Exponential Random Graphs. LNCS 2014.
13. [3] 𝑘^𝑚-Anonymity for Continuous Data
• How to guarantee 𝑘^𝑚-anonymity with minimal information loss?
[3] Gkountouna, Angeli, Zigomitros, Terrovitis, and Vassiliou. 𝑘^𝑚-Anonymity for Continuous Data Using Dynamic Hierarchies, LNCS 2014.
14. [3] 𝑘^𝑚-Anonymity for Continuous Data
• Illustration of the proposed algorithm
[3] Gkountouna, Angeli, Zigomitros, Terrovitis, and Vassiliou. 𝑘^𝑚-Anonymity for Continuous Data Using Dynamic Hierarchies, LNCS 2014.
15. [3] 𝑘^𝑚-Anonymity for Continuous Data
• 𝑘-anonymity vs. 𝑘^𝑚-anonymity
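For reference, the guarantee itself can be checked by brute force (a minimal sketch under the standard definition of 𝑘^𝑚-anonymity for transaction data: every itemset of at most 𝑚 items that occurs at all must occur in at least 𝑘 records; the function name is mine, and the efficient dynamic-hierarchy construction of [3] is beyond this sketch):

```python
from itertools import combinations

def satisfies_km_anonymity(transactions, k, m):
    """Brute-force k^m-anonymity check for transaction data.

    Returns True iff every itemset of size <= m occurring in some
    transaction is contained in at least k transactions, so an adversary
    knowing up to m items of a record cannot narrow it below k candidates."""
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                support = sum(1 for u in transactions if set(combo) <= set(u))
                if support < k:
                    return False
    return True
```

Unlike 𝑘-anonymity, the adversary is only assumed to know up to 𝑚 items per record, not the full quasi-identifier, which is what makes less information loss achievable.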
[3] Gkountouna, Angeli, Zigomitros, Terrovitis, and Vassiliou. 𝑘^𝑚-Anonymity for Continuous Data Using Dynamic Hierarchies, LNCS 2014.
16. [4] Logistic regression + elastic net
• Query : 𝜃 (penalized regression coefficients) from (𝑋, 𝑌)
stored in a DB. Tentatively 𝑋 is a SNP dataset and 𝑌 = ±1.
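To make the query concrete, here is a sketch of a simpler stand-in: 𝜖-DP logistic regression via output perturbation in the style of Chaudhuri–Monteleoni–Sarwate, NOT the objective-perturbation algorithm of [4]. It uses a plain L2 penalty (the elastic net's L1 part is omitted, since this sensitivity bound needs a strongly convex, smooth setting), and assumes ‖𝑥𝑖‖₂ ≤ 1 and 𝑦𝑖 ∈ {−1, +1}; the function name is mine:

```python
import numpy as np

def dp_logistic_regression(X, y, eps, lam=1.0, iters=500, lr=0.1, rng=None):
    """eps-DP logistic regression via output perturbation (a stand-in
    sketch, not the algorithm of [4]).

    Assumes ||x_i||_2 <= 1 and y_i in {-1, +1}; then the L2 sensitivity
    of the L2-regularized minimizer is 2 / (n * lam)."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):               # plain gradient descent on the ERM objective
        z = y * (X @ w)
        grad = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0) + lam * w
        w -= lr * grad
    sens = 2.0 / (n * lam)
    b = rng.normal(size=d)
    b /= np.linalg.norm(b)               # uniform random direction
    b *= rng.gamma(shape=d, scale=sens / eps)   # ||b|| ~ Gamma(d, sens/eps)
    return w + b
```

Objective perturbation (the route [4] generalizes to non-differentiable convex penalties) typically gives better utility than this output-perturbation baseline.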
[4] Yu, Rybar, Uhler, and Fienberg. Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases. LNCS 2014.
17. [4] Logistic regression + elastic net
• If we select a tuning parameter by cross-validation, will the
fitted model from the CV also be differentially private?
• Introduces somewhat involved notation
[4] Yu, Rybar, Uhler, and Fienberg. Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases. LNCS 2014.
18. [4] Logistic regression + elastic net
• If 𝑞 is (𝛽1, 𝛽2, 𝛿)-stable and 𝒯 is 𝜖-differentially private,
• [5] : an ordinary CV-procedure (with some ‘randomization’ during
validation with ‘randomness’ 𝜖’) is (𝜖 + 𝜖’)-differentially private.
• [5] assumed the regularization function is differentiable.
[4] extended the assumption to non-differentiable but convex
penalties.
• [4] applied this general framework to [logistic regression +
elastic net].
• Application to GWAS : usually two-step (screening -> logistic reg.)
• The first-stage screening in differentially private manner was
developed in the literature.
• This paper focused on the second-stage only.
[5] Chaudhuri and Vinterbo (2013). A stability-based validation procedure for differentially private machine learning. Advances in Neural Information Processing Systems, 1–19.
[4] Yu, Rybar, Uhler, and Fienberg. Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases. LNCS 2014.