These slides present Anatomy, a novel technique for publishing sensitive data. Anatomy releases all quasi-identifier and sensitive values directly, in two separate tables. Combined with a grouping mechanism, this approach protects privacy while capturing a large amount of the correlation in the microdata.
2. Privacy preserving data publishing
Microdata
• Purposes:
– Allow researchers to effectively study the correlation
between various attributes
– Protect the privacy of every patient
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
3. A naïve solution
• Publish the microdata with the Name column removed
• It does not work. See next.
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
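Why removing names is not enough can be illustrated with a small linking attack. The sketch below assumes the adversary already knows Bob's Age, Sex, and Zipcode from some outside source; the names `published`, `external`, and `link` are illustrative and do not come from the slides.

```python
# Published table with the Name column removed (rows from the slide).
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# Assumed external knowledge: the adversary knows Bob's QI values.
external = {"Bob": (23, "M", 11000)}

def link(name):
    """Return the diseases of all published rows that match the person's QI values."""
    qi = external[name]
    return [row[3] for row in published if row[:3] == qi]

matches = link("Bob")
print(matches)  # a single matching row pins down Bob's disease
```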
4. Generalization
A generalized table
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
• Transform each QI value into a less specific form
How much generalization do we need?
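Generalization of a single tuple can be sketched as follows. The intervals are the ones shown in the generalized table; how they are chosen is not specified on this slide, so the function simply takes them as inputs. The name `generalize` is hypothetical.

```python
def generalize(row, age_range, zip_range):
    """Replace the exact Age and Zipcode of a tuple with enclosing intervals."""
    age, sex, zipcode, disease = row
    # The intervals must actually contain the original values.
    if not (age_range[0] <= age <= age_range[1]):
        raise ValueError("age outside the chosen interval")
    if not (zip_range[0] <= zipcode <= zip_range[1]):
        raise ValueError("zipcode outside the chosen interval")
    return (age_range, sex, zip_range, disease)

row = (23, "M", 11000, "pneumonia")
generalized = generalize(row, (21, 60), (10001, 60000))
print(generalized)
```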
5. l-diversity
• A QI-group with m tuples is l-diverse, iff each sensitive
value appears no more than m / l times in the QI-group.
• A table is l-diverse, iff all of its QI-groups are l-diverse.
• The table below has 2 QI-groups and is 2-diverse.
Quasi-identifier (QI) attributes Sensitive attribute
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
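The l-diversity definition on this slide translates directly into a check over the sensitive values of each QI-group. This is a minimal sketch that assumes every group is non-empty; `is_l_diverse` is an illustrative name.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """groups: list of QI-groups, each given as a list of sensitive values."""
    for values in groups:
        m = len(values)
        # Each sensitive value may appear at most m / l times in the group.
        if max(Counter(values).values()) > m / l:
            return False
    return True

# Sensitive values of the two QI-groups in the generalized table.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse(groups, 2), is_l_diverse(groups, 3))
```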
6. What l-diversity guarantees
• From an l-diverse generalized table, an adversary
(without any prior knowledge) can infer the sensitive value
of each individual with confidence at most 1/l
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
A 2-diverse generalized table
7. Defect of generalization
• Query A: SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
• Estimated answer: 2 * p, where p is the probability that each of the
two tuples satisfies the query conditions
8. Defect of generalization (cont.)
• Query A: SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]
• p = Area( R1 ∩ Q) / Area( R1 ) = 0.05
• Estimated answer for query A: 2 * p = 0.1
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] pneumonia
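The value p = 0.05 can be reproduced with a short calculation. The sketch assumes that tuples are uniformly distributed inside the generalized rectangle R1 and that Age and Zipcode take integer values; `overlap_fraction` is a hypothetical helper.

```python
def overlap_fraction(r, q):
    """Fraction of the integer interval r = (lo, hi) that is covered by q."""
    lo, hi = max(r[0], q[0]), min(r[1], q[1])
    return max(0, hi - lo + 1) / (r[1] - r[0] + 1)

# R1: generalized rectangle of the two pneumonia tuples.
r1_age, r1_zip = (21, 60), (10001, 60000)
# Q: rectangle of the query conditions on Age and Zipcode.
q_age, q_zip = (0, 30), (10001, 20000)

# Area(R1 ∩ Q) / Area(R1), computed one dimension at a time.
p = overlap_fraction(r1_age, q_age) * overlap_fraction(r1_zip, q_zip)
estimate = 2 * p  # two pneumonia tuples fall in this QI-group
print(p, estimate)
```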
9. Defect of generalization (cont.)
• Query A: SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]
• Estimated answer from the generalized table: 0.1
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
• The exact answer should be: 1
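For comparison, the exact answer can be computed by evaluating query A on the original microdata. The sketch below does this in plain Python and reports how far the generalized estimate of 0.1 is from it; the variable names are illustrative.

```python
microdata = [
    ("Bob", 23, "M", 11000, "pneumonia"),
    ("Ken", 27, "M", 13000, "dyspepsia"),
    ("Peter", 35, "M", 59000, "dyspepsia"),
    ("Sam", 59, "M", 12000, "pneumonia"),
    ("Jane", 61, "F", 54000, "flu"),
    ("Linda", 65, "F", 25000, "gastritis"),
    ("Alice", 65, "F", 25000, "flu"),
    ("Mandy", 70, "F", 30000, "bronchitis"),
]

# Query A: pneumonia, Age in [0, 30], Zipcode in [10001, 20000].
exact = sum(
    1
    for _name, age, _sex, zipcode, disease in microdata
    if disease == "pneumonia" and 0 <= age <= 30 and 10001 <= zipcode <= 20000
)

estimate_from_generalization = 0.1  # value from the previous slide
relative_error = abs(estimate_from_generalization - exact) / exact
print(exact, relative_error)
```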
10. Contributions
1. We propose Anatomy, an alternative to
generalization that allows much more
accurate data analysis while still
preserving privacy.
2. We develop an algorithm for computing
anatomized tables that
• runs in a linear number of I/Os
• (nearly) minimizes information loss
11. Outline
• Basic Idea of Anatomy
• Preserving Correlation
• Algorithm for Anatomy
• Experimental Results
12. Basic Idea of Anatomy
• For a given microdata table, Anatomy releases a quasi-
identifier table (QIT) and a sensitive table (ST)
Microdata
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Quasi-identifier Table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Sensitive Table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
13. Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples
A 2-diverse partition
Age Sex Zipcode Disease
QI group 1
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
QI group 2
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
14. Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive
table (ST) based on the selected partition
quasi-identifier table (QIT)
Age Sex Zipcode
group 1
23 M 11000
27 M 13000
35 M 59000
59 M 12000
group 2
61 F 54000
65 F 25000
65 F 25000
70 F 30000
sensitive table (ST)
Disease
group 1
pneumonia
dyspepsia
dyspepsia
pneumonia
group 2
flu
gastritis
flu
bronchitis
15. Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive
table (ST) based on the selected partition
quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease
1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia
2 flu
2 gastritis
2 flu
2 bronchitis
16. Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive
table (ST) based on the selected partition
quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
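The construction of the QIT and ST from a chosen partition can be sketched in a few lines. This assumes each tuple is stored as (Age, Sex, Zipcode, Disease) and that the partition is already given; `build_qit_st` is an illustrative name, not one used in the slides.

```python
from collections import Counter

def build_qit_st(partition):
    """partition: list of QI-groups, each a list of (age, sex, zipcode, disease)."""
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        # QIT: exact QI values plus the group id, one row per tuple.
        for age, sex, zipcode, _disease in group:
            qit.append((age, sex, zipcode, group_id))
        # ST: one row per distinct sensitive value in the group, with its count.
        counts = Counter(row[3] for row in group)
        for disease, count in sorted(counts.items()):
            st.append((group_id, disease, count))
    return qit, st

partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]
qit, st = build_qit_st(partition)
for row in st:
    print(row)
```

The exact QI values survive unchanged in the QIT; only the link between an individual row and its disease is hidden inside the group.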
17. Privacy Preservation
• From a pair of QIT and ST generated from an l-diverse
partition, the adversary can infer the sensitive value of
each individual with confidence at most 1/l
quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Name Age Sex Zipcode
Bob 23 M 11000
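The 1/l bound can be checked on the running example. The sketch assumes the adversary knows the target's QI values, knows the target is in the microdata, and treats every matching QIT row as equally likely; `confidence` is a hypothetical helper.

```python
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def confidence(qit, st, qi, disease):
    """Adversary's confidence that the person with QI values qi has the disease."""
    group_ids = [row[3] for row in qit if row[:3] == qi]
    if not group_ids:
        return 0.0
    total = 0.0
    for gid in group_ids:
        size = sum(count for g, _d, count in st if g == gid)
        hits = sum(count for g, d, count in st if g == gid and d == disease)
        total += hits / size
    return total / len(group_ids)

bob = (23, "M", 11000)
bob_pneumonia = confidence(qit, st, bob, "pneumonia")
print(bob_pneumonia)  # bounded by 1/l with l = 2
```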
18. Accuracy of Data Analysis
• Query A: SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]
quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
19. Accuracy of Data Analysis (cont.)
• Query A: SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]
• 2 patients have contracted pneumonia
• 2 out of 4 patients satisfy the query conditions on Age and
Zipcode
• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the
actual result from the original microdata
Tuple Age Sex Zipcode Group-ID
t1 23 M 11000 1
t2 27 M 13000 1
t3 35 M 59000 1
t4 59 M 12000 1
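The estimate 2 * 2 / 4 = 1 generalizes to a simple per-group rule: multiply the number of tuples carrying the queried disease (from the ST) by the fraction of the group's QIT rows that satisfy the QI conditions. A minimal sketch, with `estimate_count` as an illustrative name:

```python
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate a COUNT(*) query from an anatomized pair (QIT, ST)."""
    estimate = 0.0
    for gid in sorted({row[3] for row in qit}):
        rows = [row for row in qit if row[3] == gid]
        # Rows of this group that satisfy the conditions on Age and Zipcode.
        qualifying = sum(
            1
            for age, _sex, zipcode, _gid in rows
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        # Tuples of this group that carry the queried disease.
        with_disease = sum(c for g, d, c in st if g == gid and d == disease)
        estimate += with_disease * qualifying / len(rows)
    return estimate

answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```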
20. Preserving Correlation
• Let us first examine the correlation between Age and
Disease in our running example
• Each tuple in the microdata can be mapped to a point in
the (Age, Disease) domain
• For example, tuple t1 (shown below) is mapped to (23, pneumonia).
Age Sex Zipcode Disease
23 M 11000 pneumonia
.... … … …
t1
23. Anatomize
• An algorithm for computing anatomized
tables that
– runs in I/O cost linear in the cardinality n of
the microdata table
– minimizes the re-construction error (RCE) when n is
a multiple of l; otherwise achieves an RCE that
exceeds the lower bound by a factor of at most
1 + 1/n
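The slide does not list the steps of the algorithm, so the following is only a sketch of one plausible bucket-based greedy strategy for producing an l-diverse partition: hash tuples into buckets by sensitive value, repeatedly draw one tuple from each of the l largest buckets to form a group, then place any leftover tuples into groups that lack their sensitive value. All names are hypothetical. The sketch runs in memory and makes no attempt to meet the linear I/O bound, and its groups may differ from the two-group partition used in the running example.

```python
from collections import defaultdict

def anatomize(tuples, l):
    """Greedy grouping sketch; the sensitive value is the last field of each tuple."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[-1]].append(t)

    groups = []
    # Group creation: while at least l buckets are non-empty,
    # take one tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted(buckets, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue assignment: each leftover tuple joins a group
    # that does not yet contain its sensitive value.
    for value, bucket in buckets.items():
        for t in bucket:
            target = next(
                (g for g in groups if all(u[-1] != value for u in g)), None
            )
            if target is None:
                raise ValueError("no l-diverse partition found for this input")
            target.append(t)
    return groups

microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
groups = anatomize(microdata, 2)
for g in groups:
    print(g)
```

Every group produced this way holds at least l tuples with pairwise distinct sensitive values, so each value appears at most m / l times in a group of m tuples.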
25. Summary
• Anatomy outperforms generalization by allowing
much more accurate data analysis on the
published data.
• Anatomized tables (with a nearly optimal quality
guarantee) can be computed in I/O cost linear in
the database cardinality.