Distinct l-diversity Anonymization of Set-Valued Data
Submitted by,
Khude Rohan Ravindra
Abhishek Puligudla
Abhilash Namdev
Guidance by,
B. K. Tripathy

Contents
• Basic
• Abstract
• Introduction
• Literature survey
• Proposed algorithm
• Conclusion and Future work
• References

Set Valued data
There are two ways of representing data in a table:
• Single-valued data: each record holds one value, e.g.
  Blood Pressure
  Heart disease
  Hemorrhoids
  Hemorrhoids
• Set-valued data: each record holds a set of values, e.g.
  “Cancer”, “Blood Pressure”
  “Blood Pressure”, “Heart disease”
  “Hemorrhoids”, “Blood Pressure”
  “Heart disease”, “Blood Pressure”, “Diabetes”, “Hemorrhoids”, “Cancer”
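A minimal Python sketch of the two representations, using the example values from this slide. Grouping the quoted values into four records is an assumption about the original layout, and the helper name `records_containing` is hypothetical.

```python
# Single-valued: each record holds exactly one value for the attribute.
single_valued = ["Blood Pressure", "Heart disease", "Hemorrhoids", "Hemorrhoids"]

# Set-valued: each record holds a set of values for the same attribute.
# (Assumed grouping of the slide's values into four records.)
set_valued = [
    frozenset({"Cancer", "Blood Pressure"}),
    frozenset({"Blood Pressure", "Heart disease"}),
    frozenset({"Hemorrhoids", "Blood Pressure"}),
    frozenset({"Heart disease", "Blood Pressure", "Diabetes",
               "Hemorrhoids", "Cancer"}),
]


def records_containing(records, item):
    """Return the indices of the set-valued records that contain `item`."""
    return [i for i, record in enumerate(records) if item in record]


print(records_containing(set_valued, "Cancer"))  # [0, 3]
```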

Anonymization
• Data anonymization is a type of information sanitization whose intent is
privacy protection
• It is the process of either encrypting or removing personally
identifiable information from data sets, so that the people whom the
data describe remain anonymous

Privacy Preservation
• Privacy here means the logical security of data, not the traditional
security of data (e.g. access control, theft, hacking)
• Here, the adversary uses legitimate methods
• Various databases are published, e.g. census data and hospital records
• Publishing allows researchers to effectively study the correlations
between various attributes

Need for Privacy
• Suppose a hospital has person-specific patient data that it
wants to publish
• It wants to publish the data such that:
  • The information remains practically useful
  • The identity of an individual cannot be determined
• Otherwise, an adversary might infer secret/sensitive data from the
published database

Need for Privacy
Published data (Zip, Age and Nationality are non-sensitive; Condition is sensitive)
#  Zip    Age  Nationality  Condition
1  13053  28   Indian       Heart Disease
2  13067  29   American     Heart Disease
3  13053  35   Canadian     Viral Infection
4  13067  36   Japanese     Cancer

Voter list
#  Name   Zip    Age  Nationality
1  John   13053  28   American
2  Bob    13067  29   American
3  Chris  13053  23   American

Data leak! Records in the published data can be linked to names in the
voter list through Zip, Age and Nationality.
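The leak can be reproduced as a join on the non-sensitive columns. The following is a minimal sketch using the rows from the two tables on this slide; the function name `link` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
published = [
    {"zip": "13053", "age": 28, "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13053", "age": 35, "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]
voters = [
    {"name": "John", "zip": "13053", "age": 28, "nationality": "American"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "American"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "American"},
]
QUASI_IDENTIFIERS = ("zip", "age", "nationality")


def link(published_rows, voter_rows, qi=QUASI_IDENTIFIERS):
    """Join the two tables on the quasi-identifier and return name -> conditions."""
    # Index the published conditions by their quasi-identifier values.
    index = {}
    for row in published_rows:
        index.setdefault(tuple(row[a] for a in qi), []).append(row["condition"])
    # Look up every voter in that index.
    leaks = {}
    for voter in voter_rows:
        key = tuple(voter[a] for a in qi)
        if key in index:
            leaks[voter["name"]] = index[key]
    return leaks


print(link(published, voters))  # {'Bob': ['Heart Disease']}
```

With this data only Bob matches a published row on all three attributes, which is enough to attach a condition to a name.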

Classification of Attributes
Key attributes
Name, address, phone number: uniquely identifying!
Always removed before release
Quasi-identifiers
Attributes whose values, taken together, can uniquely identify an individual
{zip code, nationality, age}
Sensitive attributes
Private information corresponding to individuals
{medical condition, salary, location}

Abstract
• Preserving the privacy of set-valued data is important to guard
against tampering
• Anonymizing means that an adversary is not able to identify which
individual a record belongs to
• k-anonymity fails under attacks such as the homogeneity attack and the
background knowledge attack
• l-diversity overcomes these drawbacks of k-anonymity
• In this paper we propose the use of l-diversity, which takes the
sensitive attribute into account when generalizing the data during anonymization
• We also propose an algorithm for hiding sensitive attribute information
that could reveal an individual's identity under homogeneity and
background knowledge attacks

Types of anonymization on set valued data
• k-anonymity
• Top-down, local generalization
• Recoding
• l-diversity (we use distinct l-diversity)

K - anonymity
• The information for each person contained in the released table
cannot be distinguished from that of at least k-1 other individuals whose
information also appears in the release
• Anonymity: the condition of being anonymous
• Change the data in such a way that for each tuple in the resulting table
there are at least (k-1) other tuples with the same values for the
quasi-identifier
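The definition above can be checked mechanically by counting how often each quasi-identifier combination occurs. This is a minimal sketch; the function name `is_k_anonymous` and the row layout are assumptions made for illustration, and the sample rows are taken from the tables shown in this deck.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


QI = ("zip", "age", "nationality")

# Raw published table: every quasi-identifier combination is unique.
raw = [
    {"zip": "13053", "age": "28", "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13067", "age": "29", "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13053", "age": "35", "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": "36", "nationality": "Japanese", "condition": "Cancer"},
]

# Generalized table: all four rows share one quasi-identifier combination.
generalized = [dict(row, zip="130**", age="< 40", nationality="*") for row in raw]

print(is_k_anonymous(raw, QI, 2))          # False
print(is_k_anonymous(generalized, QI, 4))  # True
```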

Techniques for k-anonymization
• Generalization
  - Replace the original value by a semantically consistent but less
    specific value
• Suppression
  - Data not released at all
  - Can be cell-level or (more commonly) tuple-level

Techniques for anonymization
#  Zip    Age   Nationality  Condition
1  130**  < 40  *            Heart Disease
2  130**  < 40  *            Heart Disease
3  130**  < 40  *            Viral Infection
4  130**  < 40  *            Cancer
Zip and Age: generalization. Nationality: suppression (cell-level).
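A minimal sketch of how a row of this table could be produced from a raw record: the zip code and age are generalized, and the nationality cell is suppressed. The helper names and the fixed parameters (three kept zip digits, an age bound of 40) are assumptions chosen to match the table, not a description of the authors' algorithm.

```python
def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` digits and mask the remaining ones with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def generalize_age(age, bound=40):
    """Replace an exact age by a coarse range relative to `bound`."""
    return f"< {bound}" if age < bound else f">= {bound}"


def suppress(_value):
    """Cell-level suppression: the value is not released at all."""
    return "*"


def anonymize(row):
    """Generalize zip and age, suppress nationality, keep the condition."""
    return {
        "zip": generalize_zip(row["zip"]),
        "age": generalize_age(row["age"]),
        "nationality": suppress(row["nationality"]),
        "condition": row["condition"],
    }


record = {"zip": "13053", "age": 28, "nationality": "Indian",
          "condition": "Heart Disease"}
print(anonymize(record))
# {'zip': '130**', 'age': '< 40', 'nationality': '*', 'condition': 'Heart Disease'}
```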

Generalization Hierarchies
ZIP: 13058, 13053 → 1305; 13067, 13063 → 1306; 1305, 1306 → 130
Age: 28, 29 → < 30; 35, 36 → 3*; < 30, 3* → < 40; < 40 → *
Nationality: US, Canadian → American; Japanese, Indian → Asian; American, Asian → *
• Generalization hierarchies: the data owner defines how values
can be generalized
• Table generalization: a table generalization is created by
generalizing all values in a column to a specific level of generalization
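One way to sketch these hierarchies is as child-to-parent maps, with table generalization as walking every value in a column up the same number of levels. The dictionary names and the function names `generalize` and `generalize_column` are hypothetical; the node labels follow the hierarchy figure on this slide.

```python
# Child -> parent maps, one per attribute (labels as in the slide's figure).
ZIP_PARENT = {
    "13058": "1305", "13053": "1305",
    "13067": "1306", "13063": "1306",
    "1305": "130", "1306": "130",
}
AGE_PARENT = {
    "28": "< 30", "29": "< 30",
    "35": "3*", "36": "3*",
    "< 30": "< 40", "3*": "< 40",
    "< 40": "*",
}
NATIONALITY_PARENT = {
    "US": "American", "Canadian": "American",
    "Japanese": "Asian", "Indian": "Asian",
    "American": "*", "Asian": "*",
}


def generalize(value, parent, levels):
    """Walk `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        value = parent.get(value, value)
    return value


def generalize_column(values, parent, levels):
    """Table generalization: lift every value in a column by the same level."""
    return [generalize(v, parent, levels) for v in values]


print(generalize_column(["13053", "13067"], ZIP_PARENT, 2))  # ['130', '130']
print(generalize_column(["28", "35"], AGE_PARENT, 1))        # ['< 30', '3*']
```

Values already at the root stay unchanged, so asking for more levels than the hierarchy has is harmless in this sketch.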

K-Anonymity Drawbacks
• k-anonymity alone does not provide full privacy!
• Two types of attack affect k-anonymity:
  • Homogeneity attacks
  • Background knowledge attacks

Homogeneity attacks
[Figure: original table and its 4-anonymous version, with the records matching Bob highlighted]
Alice and Bob are neighbors, so Alice knows that Bob is a 31-year-old American male who lives in zip code 13053.
Hence, Alice knows that Bob's record is number 9, 10, 11, or 12. Because all of these records carry the same
condition, she can conclude from the published data that Bob has cancer.
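The inference step can be sketched in a few lines. The tables from the figure are not reproduced in this text, so the four candidate records below are hypothetical stand-ins that only mirror what the slide states (records 9 to 12 all show cancer); the function name `infer_sensitive` is also an assumption.

```python
def infer_sensitive(candidates, sensitive="condition"):
    """Return the sensitive value if all candidate records agree, else None."""
    values = {row[sensitive] for row in candidates}
    return next(iter(values)) if len(values) == 1 else None


# Hypothetical equivalence class that Bob's quasi-identifier maps to.
bob_candidates = [
    {"record": 9, "condition": "Cancer"},
    {"record": 10, "condition": "Cancer"},
    {"record": 11, "condition": "Cancer"},
    {"record": 12, "condition": "Cancer"},
]

print(infer_sensitive(bob_candidates))  # Cancer
```

The class is 4-anonymous, yet the attacker learns the condition without ever identifying which of the four records is Bob's.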

Background Knowledge Attacks
[Figure: original table and its 4-anonymous version, with the records matching Umeko and Bob highlighted]
Alice knows that Umeko is a 21-year-old Japanese female living in zip code 13068. From this information,
Alice determines that Umeko's information is contained in record number 1, 2, 3, or 4. Using the supplementary
knowledge that Japanese people have an extremely low incidence of heart disease, Alice can conclude with
near certainty that Umeko has a viral infection.
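A minimal sketch of this elimination step. The figure's tables are not reproduced in this text, so the candidate records are hypothetical: they assume records 1 to 4 contain only heart disease and viral infection, which is consistent with the conclusion stated on the slide. The function name `remaining_values` is likewise an assumption.

```python
def remaining_values(candidates, ruled_out, sensitive="condition"):
    """Sensitive values still possible after removing those ruled out."""
    return {row[sensitive] for row in candidates} - set(ruled_out)


# Hypothetical equivalence class that Umeko's quasi-identifier maps to.
umeko_candidates = [
    {"record": 1, "condition": "Heart Disease"},
    {"record": 2, "condition": "Heart Disease"},
    {"record": 3, "condition": "Viral Infection"},
    {"record": 4, "condition": "Viral Infection"},
]

# Background knowledge: heart disease is very unlikely for Umeko.
print(remaining_values(umeko_candidates, ruled_out=["Heart Disease"]))
# {'Viral Infection'}
```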

Distinct L-diversity
• An equivalence class is said to have l-diversity if there are at least l
“well-represented” values for the sensitive attribute.
• A table is said to have l-diversity if every equivalence class of the table
has l-diversity.
• The simplest reading of “well-represented” requires each equivalence class
to have at least l distinct values for the sensitive field. This is called
distinct l-diversity.
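The distinct l-diversity condition can be sketched as a check over equivalence classes. This is an illustrative check only, assuming a single-valued sensitive attribute; it is not the proposed anonymization algorithm, and the function names are hypothetical. The sample rows reuse the generalized table from the earlier slide.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in quasi_identifiers)].append(row)
    return classes


def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return all(
        len({row[sensitive] for row in group}) >= l
        for group in classes.values()
    )


QI = ("zip", "age", "nationality")
table = [
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Viral Infection"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Cancer"},
]

print(is_distinct_l_diverse(table, QI, "condition", 3))  # True
print(is_distinct_l_diverse(table, QI, "condition", 4))  # False
```

The single class holds four rows but only three distinct conditions, so the table is 4-anonymous while being at most distinct 3-diverse.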

Applying Algorithm on Sensitive data

CONCLUSION AND FUTURE WORK
• Our algorithm is efficient enough to hide the sensitive attribute
information that might otherwise be used to reveal an individual's identity
under homogeneity and background knowledge attacks.
• To do so, we generalize the sensitive attribute after obtaining diverse
clustered data.
• The anonymization technique we have proposed only serves
to make privacy breaches more difficult.
• It is still not clear how the data could be de-anonymized.
• Our algorithm can be further extended to anonymize datasets
that have more than one sensitive attribute.
