Anatomy: Simple and Effective Privacy Preservation

Jitendra Kuldeep
Information Technology

Privacy preserving data publishing
Microdata
• Purposes:
– Allow researchers to effectively study the correlation
between various attributes
– Protect the privacy of every patient
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis

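A naïve solution
• Remove the Name column and publish the remaining attributes.
• It does not work: an adversary who knows Jane's Age, Sex, and Zipcode (61, F, 54000) can match her to a single published tuple and learn that she has flu.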
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis

Generalization
• Transform each QI value into a less specific form
A generalized table
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
How much generalization do we need?

l-diversity
• A QI-group with m tuples is l-diverse, iff each sensitive
value appears no more than m / l times in the QI-group.
• A table is l-diverse, iff all of its QI-groups are l-diverse.
• The table below has 2 QI-groups and is 2-diverse.
Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
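
The l-diversity condition on this slide can be checked mechanically. The following is a minimal sketch, assuming each QI-group is represented simply as the list of its sensitive values; the function name `is_l_diverse` is hypothetical and not part of the original slides.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Return True iff every QI-group meets the slide's condition:
    each sensitive value appears at most m / l times among the group's m tuples."""
    for sensitive_values in groups:
        m = len(sensitive_values)
        counts = Counter(sensitive_values)
        # Compare c * l <= m instead of c <= m / l to stay in integer arithmetic.
        if any(c * l > m for c in counts.values()):
            return False
    return True

# Sensitive values of the two QI-groups in the generalized table.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse(groups, 2))
print(is_l_diverse(groups, 3))
```

With l = 2 both groups pass, since no disease occurs more than 4 / 2 = 2 times; with l = 3 the check fails because pneumonia already occurs twice in a group of four.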

What l-diversity guarantees
• From an l-diverse generalized table, an adversary
(without any prior knowledge) can infer the sensitive value
of each individual with confidence at most 1/l
A 2-diverse generalized table
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000

Defect of generalization
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
• Estimated answer: 2 * p, where p is the probability that each of the
two pneumonia tuples in the first QI-group satisfies the query conditions
on Age and Zipcode

Defect of generalization (cont.)
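• p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the (Age, Zipcode)
rectangle [21, 60] × [10001, 60000] of the first QI-group and Q is the
query rectangle [0, 30] × [10001, 20000]
• Estimated answer for query A: 2 * p = 0.1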
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] pneumonia
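
The value p = 0.05 can be reproduced with a short calculation. This is a sketch under two assumptions that the slides do not state explicitly: values are integers, and they are uniformly distributed inside each generalized interval. The helper name `overlap_fraction` is hypothetical.

```python
from fractions import Fraction

def overlap_fraction(group_range, query_range):
    """Fraction of the integers in group_range (inclusive) that also lie in query_range."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    overlap = max(0, min(g_hi, q_hi) - max(g_lo, q_lo) + 1)
    return Fraction(overlap, g_hi - g_lo + 1)

# Generalized intervals of the first QI-group vs. the ranges of query A.
p_age = overlap_fraction((21, 60), (0, 30))                 # 10 of 40 ages
p_zip = overlap_fraction((10001, 60000), (10001, 20000))    # 10000 of 50000 zipcodes
p = p_age * p_zip

# Two pneumonia tuples fall in this QI-group, so the estimate is 2 * p.
estimate = 2 * p
print(float(p), float(estimate))
```

The age overlap contributes 1/4 and the zipcode overlap 1/5, which gives p = 1/20 and an estimate of 1/10, matching the figures on the slide.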

Defect of generalization (cont.)
• Estimated answer from the generalized table: 0.1
• The exact answer should be: 1 (only Bob's tuple, with Age 23, Zipcode 11000, and pneumonia, satisfies every condition)

Contributions
1. We propose Anatomy, an alternative to
generalization that allows much more
accurate data analysis while still
preserving privacy.
2. We develop an algorithm for computing
anatomized tables that
• runs in a linear number of I/Os
• (nearly) minimizes information loss

Outline
• Basic Idea of Anatomy
• Preserving Correlation
• Algorithm for Anatomy
• Experimental Results

Basic Idea of Anatomy
• For a given microdata table, Anatomy releases a quasi-identifier
table (QIT) and a sensitive table (ST)
Quasi-identifier Table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Sensitive Table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1

Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples
[Figure: the microdata tuples divided into QI group 1 and QI group 2, a 2-diverse partition]

2. Generate a quasi-identifier table (QIT) and a sensitive
table (ST) based on the selected partition
quasi-identifier table (QIT)
Age Sex Zipcode
23 M 11000
27 M 13000
35 M 59000
59 M 12000
61 F 54000
65 F 25000
65 F 25000
70 F 30000
sensitive table (ST)
Disease
pneumonia
dyspepsia
dyspepsia
pneumonia
flu
gastritis
flu
bronchitis
In both tables, the first four rows form group 1 and the last four rows form group 2.

quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease
1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia
2 flu
2 gastritis
2 flu
2 bronchitis

quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
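
The two steps above can be sketched in a few lines of code. This is an illustrative sketch only, assuming the partition has already been chosen and is passed in as a list of groups; the function name `anatomize_partition` and the tuple layout are hypothetical, and the code does not attempt to choose the partition itself.

```python
from collections import Counter

def anatomize_partition(groups):
    """Build (QIT, ST) from a given partition.

    groups: list of QI-groups, each a list of (age, sex, zipcode, disease) tuples.
    QIT rows are (age, sex, zipcode, group_id); ST rows are (group_id, disease, count).
    """
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        # QIT keeps the exact QI values and replaces the disease by the group ID.
        for age, sex, zipcode, _disease in group:
            qit.append((age, sex, zipcode, gid))
        # ST keeps, per group, how often each disease occurs.
        counts = Counter(row[3] for row in group)
        for disease in sorted(counts):
            st.append((gid, disease, counts[disease]))
    return qit, st

partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]
qit, st = anatomize_partition(partition)
for row in st:
    print(row)
```

Run on the running example, the ST rows come out as the five (Group-ID, Disease, Count) rows shown on the slide, and the QIT holds the eight unmodified QI tuples tagged with their group.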

Privacy Preservation
• From a pair of QIT and ST generated from an l-diverse
partition, the adversary can infer the sensitive value of
each individual with confidence at most 1/l
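quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Name Age Sex Zipcode
Bob 23 M 11000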
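
To make the 1/l bound concrete, the sketch below computes the highest confidence an adversary can reach for a person in a given group. It assumes the adversary has no prior knowledge and has already located the person's tuple in the QIT (Bob's QI values place him in group 1); the function name `max_confidence` is hypothetical.

```python
from fractions import Fraction

# ST of the running example: (group_id, disease, count).
ST = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def max_confidence(st, group_id):
    """Largest probability the adversary can assign to any single disease
    for a person known to belong to group_id."""
    counts = [count for gid, _disease, count in st if gid == group_id]
    return Fraction(max(counts), sum(counts))

print(max_confidence(ST, 1))  # Bob's group
print(max_confidence(ST, 2))
```

In both groups the most frequent disease accounts for 2 of the 4 tuples, so the confidence never exceeds 1/2, which is the 1/l bound for l = 2.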

Accuracy of Data Analysis
quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1

Accuracy of Data Analysis (cont.)
• In the ST, 2 patients in group 1 have contracted pneumonia (group 2 has none)
• In the QIT, 2 out of the 4 patients in group 1 satisfy the query conditions on
Age and Zipcode
• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the
actual result from the original microdata
t1: 23 M 11000 1
t2: 27 M 13000 1
t3: 35 M 59000 1
t4: 59 M 12000 1
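
The estimate 2 * 2 / 4 = 1 generalizes to any count query of this shape. The following is a minimal sketch, assuming the QIT and ST are held as in-memory lists of tuples; the function name `estimate_count` and the argument layout are hypothetical.

```python
from fractions import Fraction

# QIT rows: (age, sex, zipcode, group_id); ST rows: (group_id, disease, count).
QIT = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
ST = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age, and Zipcode from QIT and ST."""
    total = Fraction(0)
    for gid, d, count in st:
        if d != disease:
            continue
        members = [row for row in qit if row[3] == gid]
        matching = [
            row for row in members
            if age_range[0] <= row[0] <= age_range[1]
            and zip_range[0] <= row[2] <= zip_range[1]
        ]
        # Each of the `count` patients is equally likely to be any group member.
        total += Fraction(count * len(matching), len(members))
    return total

print(estimate_count(QIT, ST, "pneumonia", (0, 30), (10001, 20000)))
```

Because the QIT keeps exact Age and Zipcode values, the only uncertainty left is which group member carries which disease, which is why the estimate here coincides with the true answer.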

Preserving Correlation
• Let us first examine the correlation between Age and
Disease in our running example
• Each tuple in the microdata can be mapped to a point in
the (Age, Disease) domain
• For example, tuple t1 (Age 23, Disease pneumonia) is mapped to the point (23, pneumonia).

Preserving Correlation (cont.)
• We model this tuple using a probability density function
(pdf):

Preserving Correlation (cont.)

Anatomize
• An algorithm for computing anatomized
tables that
– has I/O cost linear in the cardinality n of
the microdata table
– minimizes the re-construction error (RCE) when
n is a multiple of l; otherwise achieves an RCE
that exceeds the lower bound by a factor of at
most 1 + 1/n
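
The slides do not define the RCE at this point, so the sketch below rests on an assumption: that the RCE is the sum, over all tuples, of the squared difference between the tuple's exact point-mass distribution over sensitive values (1 at its true disease, 0 elsewhere) and the distribution reconstructed from the ST (count / group size for each disease in the tuple's group). The function name `rce` is hypothetical, and the code only evaluates this quantity for a given partition; it is not the Anatomize algorithm.

```python
from collections import Counter
from fractions import Fraction

def rce(groups):
    """Re-construction error of a partition, under the assumed definition above.

    groups: list of QI-groups, each given as the list of its sensitive values.
    """
    total = Fraction(0)
    for diseases in groups:
        size = len(diseases)
        counts = Counter(diseases)
        for true_value in diseases:
            for value, count in counts.items():
                exact = 1 if value == true_value else 0
                reconstructed = Fraction(count, size)
                total += (reconstructed - exact) ** 2
    return total

groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(rce(groups))
```

Under this assumed definition, group 1 contributes 4 * 1/2 = 2 and group 2 contributes 5/2, so the partition of the running example has an RCE of 9/2.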

Summary
• Anatomy outperforms generalization by allowing
much more accurate data analysis on the
published data.
• Anatomized tables (with a nearly optimal quality
guarantee) can be computed with I/O cost linear in
the database cardinality.
