Implementation of
Laplacian Differential
Privacy with varying
epsilon
A thesis presented for the degree of
Masters of Technology in Communication
and Information Technology
by
Jaibran Mohammad
Enrollment Number: 2019MECECI001
under the supervision of
Associate Professor Farida Khursheed
...
Department of Electronics
National Institute of Technology Srinagar
2021
Contents

Abstract

1 Introduction

2 Knowledge on Privacy Preserving
  2.1 Privacy
  2.2 Privacy Preserving Data publishing Techniques
    2.2.1 k-anonymity
    2.2.2 l-diversity
    2.2.3 t-closeness
  2.3 Privacy Failures
    2.3.1 NYC Taxicab Data
    2.3.2 The Netflix Prize
    2.3.3 Massachusetts Group Insurance Commission
  2.4 Differential Privacy

3 Literature survey on Differential privacy
  3.1 Randomized response
  3.2 Differential privacy
  3.3 Properties of Differential Privacy
    3.3.1 Sensitivity
  3.4 Mechanism of Differential Privacy
    3.4.1 Laplacian Mechanism
    3.4.2 Counting Queries
  3.5 Challenges in differential privacy
    3.5.1 Choosing the Privacy Parameter ε

4 Implementation Framework
  4.1 Scenario Description
  4.2 Architecture
    4.2.1 User Interface
    4.2.2 Privacy Preserving Mechanism
    4.2.3 Database
  4.3 Dataset
  4.4 Database Query
  4.5 Programming Language and Editors

5 Result
  5.1 Utility Of Data and Level of Privacy

6 Conclusion

7 Future Work
Certificate
Certified that the project dissertation work entitled Implementation of Differential Privacy with Varying Epsilon is bona fide work carried out by Jaibran Mohammad, Enrollment Number: 2019MECECI001, Class Roll Number: 001, in partial fulfillment of the requirements for the award of the Masters of Technology degree in Communication and Information Technology from the National Institute of Technology, Srinagar.
Thesis Supervisor
Head of the Department
Acknowledgment
I would like to express my deep, sincere gratitude to my advisor Associate
Professor Farida Khursheed for her relentless patience, motivation, unending
support and most importantly for many insightful conversations we had during
the development of ideas in this thesis.
List of Figures

1. Illustration of Netflix Attack
2. Model of Differential Privacy
3. Laplacian Distribution
4. Experimental Framework
5. Counting Query
6. Noisy Distribution
7. Original Distribution

List of Tables

1. Anonymity Table 1
2. Anonymity Table 2
3. UCI's Adult Dataset
4. Varying ε
Abstract

Significant advances in computing technology are bringing many benefits to societies, with changes and financial opportunities created in health care, transportation, education, law enforcement, security, commerce, and social interactions. Today it is a known fact that data is everywhere. The world has become information centric, and this is not a surprise, especially at a time when data storage is cheap and accessible. Various areas of human endeavour use this data to conduct research, track the behavior of users, recommend products, evaluate national security risks, and much more. Many of these benefits, however, involve the use of sensitive data, and thus raise concerns about privacy. Methods that allow the extraction of knowledge from data while preserving the privacy of users are known as privacy-preserving data mining (PPDM) techniques. Differential privacy is one of the best privacy-preserving data mining techniques. In this thesis we implement one of the mechanisms of differential privacy, the Laplacian mechanism, and verify the claim made by differential privacy that the smaller the value of epsilon, the better the privacy.
Chapter 1
1 Introduction
More and more information is collected by electronic devices in electronic form, and much of this information is available on the Web. With powerful data mining tools being developed and put into use, there are growing concerns that data mining poses a threat to our privacy and data security. However, it is important to note that many data mining applications do not touch personal data at all. Examples include applications involving natural resources, the prediction of floods, meteorology, astronomy, geography, geology, biology, and other scientific and engineering data. Moreover, most studies in data mining research focus on the development of reliable algorithms and do not involve accessing personal data. For the data mining applications that do involve personal data, simple methods such as removing sensitive attributes from the data may, in many cases, protect the privacy of most individuals. Nevertheless, privacy concerns still exist wherever personally identifiable information is collected and stored in digital form, and data mining programs are able to access such data, even during data preparation [1]. To prevent privacy violations, privacy preservation methods have been developed that limit the exposure of data owners by modifying the original data [2], [3]. However, transforming the data into another form may also reduce its utility, resulting in inaccurate extraction of knowledge through data mining. This trade-off is the paradigm known as Privacy-Preserving Data Mining (PPDM). Various PPDM models are designed to guarantee some level of privacy while maximising the utility of the data, such that data mining can still be performed efficiently on the transformed data. In this thesis we discuss various privacy-preserving data mining techniques and the problems associated with them, show why differential privacy is among the best PPDM techniques, and implement one of the mechanisms of differential privacy, the Laplacian mechanism, on an adult dataset. We also describe the framework associated with the implementation.
Chapter 2
2 Knowledge on Privacy Preserving
2.1 Privacy
Although everyone has some concept of privacy in mind, there is no universally accepted standard definition [4]. Privacy has been recognized as a right in the Universal Declaration of Human Rights [5]. The difficulty in defining privacy is partly a consequence of the broadness of the areas to which privacy applies. The scope of privacy can be divided into four categories [6]: information, which concerns the collection and maintenance of personal data; bodily, which relates to physical harm from inappropriate procedures; communications, which refers to any form of communication; and territorial, which concerns the invasion of physical boundaries. In this thesis we focus on the information category, which comprises systems that collect, analyse and publish data. In the information scope, Westin [7] defined privacy as "the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others", or in other words, the right to control the handling of one's information. Other authors define privacy as "the right of an individual to be secure from unauthorised disclosure of information about oneself that is contained in an electronic repository". Thus, one can conclude that the main idea of information privacy is to have control over the collection and handling of one's personal data. An additional data privacy model, developed by Dalenius in 1977 [8], articulated the privacy goal of databases: anything about a particular individual that can be learned from the database should be learnable without access to the database. An important component of this notion is ensuring that the change between the adversary's beliefs about a particular individual before and after accessing the database is small. This type of privacy, however, cannot be achieved. Dwork demonstrated that such an assurance is impossible because background knowledge exists. Thus, a new approach to privacy protection was adopted by Dwork: the risk to one's privacy, or in general any risk, such as the risk of being denied automobile insurance, should not substantially increase as a result of participating in a database [8].

Previously, much work has been done to protect privacy; in the next sections we discuss the privacy models used in privacy-preserving data publishing, such as k-anonymity, l-diversity and t-closeness, and then differential privacy.
2.2 Privacy Preserving Data publishing Techniques
PPDM at the data publishing stage is also known as Privacy-Preserving Data Publishing (PPDP). It has been shown that merely removing attributes that explicitly identify users (known as explicit identifiers) is not an effective method [9]. Users can still be identified by quasi-identifiers (QIDs) and by sensitive attributes. A QID is a non-sensitive attribute (or a set of attributes) that does not explicitly identify a user, but can be combined with data from other public sources to de-anonymize the owner of a record; these types of attacks are known as linkage attacks [9]. Sensitive attributes are person-specific private attributes that must not be publicly disclosed, and that may also be linked to identify individuals (e.g. diseases in medical records). Various techniques of PPDP are explained below.
2.2.1 k-anonymity
Sweeney and Samarati introduced a notion of data privacy known as k-anonymity [10]. Consider a dataset with attributes that differ in meaning. Explicit identifiers, such as name and Social Security number, would be completely removed from the dataset. There are also non-sensitive quasi-identifiers, which on their own do not identify anyone but which in combination can be used to identify an individual; in the example given below, these include date of birth, ZIP code, and sex. Finally, we have the sensitive attributes, such as the medical condition; these should remain in the dataset as they are. A dataset is k-anonymous if, for any row, at least k-1 other rows have exactly the same quasi-identifier values. Examples are given in Table 1 and Table 2 below, where Table 1 is 4-anonymous and Table 2 is 6-anonymous.
              Non-Sensitive                    Sensitive
 #   Zip code   Age     Nationality            Condition
 1   130**      <30     *                      AIDS
 2   130**      <30     *                      Heart Disease
 3   130**      <30     *                      Viral Infection
 4   130**      <30     *                      Viral Infection
 5   130**      >=40    *                      Cancer
 6   130**      >=40    *                      Heart Disease
 7   130**      >=40    *                      Viral Infection
 8   130**      >=40    *                      Viral Infection
 9   130**      3*      *                      Cancer
10   130**      3*      *                      Cancer
11   130**      3*      *                      Cancer
12   130**      3*      *                      Cancer

Table 1
              Non-Sensitive                    Sensitive
 #   Zip code   Age     Nationality            Condition
 1   130**      <35     *                      AIDS
 2   130**      <35     *                      Tuberculosis
 3   130**      <35     *                      Flu
 4   130**      <35     *                      Tuberculosis
 5   130**      >=35    *                      Cancer
 6   130**      >=35    *                      Cancer
 7   130**      >=40    *                      Cancer
 8   130**      >=40    *                      Cancer
 9   130**      3*      *                      Cancer
10   130**      3*      *                      Tuberculosis
11   130**      3*      *                      Viral Infection
12   130**      3*      *                      Viral Infection

Table 2
Several vulnerabilities exist in this strategy. Imagine a friend of mine who is 35 years old visited the hospital corresponding to Table 1; since every record in the 3* age group of Table 1 has the condition Cancer, we would be able to conclude that he has cancer. Also, suppose someone I know who is 28 years old visited both hospitals and thus is present in both tables: the only condition that appears both in the <30 group of Table 1 and in the <35 group of Table 2 is AIDS, so we can infer that they have AIDS. Such issues were highlighted by Ganta, Kasiviswanathan and Smith [11].
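As an illustration, the following is a minimal sketch (mine, not part of the thesis experiment) of how one could check the k-anonymity level of a table with pandas, assuming the table is held in a DataFrame whose quasi-identifier columns are named zipcode and age:

import pandas as pd

# Hypothetical toy table mirroring Table 1: zipcode and age are the
# quasi-identifiers, condition is the sensitive attribute.
df = pd.DataFrame({
    "zipcode":   ["130**"] * 12,
    "age":       ["<30"] * 4 + [">=40"] * 4 + ["3*"] * 4,
    "condition": ["AIDS", "Heart Disease", "Viral Infection", "Viral Infection",
                  "Cancer", "Heart Disease", "Viral Infection", "Viral Infection",
                  "Cancer", "Cancer", "Cancer", "Cancer"],
})

def k_anonymity_level(table: pd.DataFrame, quasi_identifiers: list) -> int:
    """The table is k-anonymous for k equal to the size of its smallest QID group."""
    return int(table.groupby(quasi_identifiers).size().min())

print(k_anonymity_level(df, ["zipcode", "age"]))  # prints 4 for this table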
2.2.2 l-diversity
The k-anonymity model is extended by requiring that every equivalence class adhere to the l-diversity principle. An l-diverse equivalence class is a set of entries whose sensitive attribute has at least l "well-represented" values. A table is l-diverse if all of its equivalence classes are l-diverse.

"Well-represented" values do not have a single concrete meaning; instead, there are different versions of the l-diversity principle that differ in this particular definition. A simple instantiation considers the sensitive attribute well represented when it takes at least l distinct values in an equivalence class, which is known as distinct l-diversity. Under this definition there are at least l records in each l-diverse equivalence class (since l distinct values are required), so the table also satisfies k-anonymity with k = l. A stronger principle is entropy l-diversity, defined as follows. An equivalence class is entropy-l-diverse if the entropy of its sensitive attribute value distribution is at least log(l), that is:

−∑_{s∈S} P(QID, s) log P(QID, s) ≥ log(l)

where s is a possible value of the sensitive attribute S, and P(QID, s) is the fraction of records in the QID equivalence group that have the value s for attribute S.
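For concreteness, here is a minimal sketch (illustrative only, not from the thesis) that computes the distinct and entropy l-diversity levels of one equivalence class, given its sensitive values as a Python list:

import math
from collections import Counter

def distinct_l(sensitive_values):
    """Distinct l-diversity level: number of distinct sensitive values in the class."""
    return len(set(sensitive_values))

def entropy_l(sensitive_values):
    """Largest l such that the class is entropy-l-diverse, i.e. exp(entropy)."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)  # the class is entropy-l-diverse for any l <= this value

# Example: the <30 equivalence class of Table 1
block = ["AIDS", "Heart Disease", "Viral Infection", "Viral Infection"]
print(distinct_l(block))            # 3
print(round(entropy_l(block), 2))   # about 2.83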
2.2.3 t-closeness
Under this model, the distribution of sensitive values in each equivalence class is required to be as close as possible to the corresponding distribution in the original table, where "close" is bounded by a threshold t. That is, the distance between the distribution of a sensitive attribute in the original table and the distribution of the same attribute in any equivalence class must be less than or equal to t (the t-closeness principle). Formally, and using the notation found in [49], this principle may be written as follows. Let Q = (q1, q2, ..., qm) be the distribution of the values of the sensitive attribute in the original table and P = (p1, p2, ..., pm) be the distribution of the same attribute in an equivalence class. The class satisfies t-closeness if the following inequality holds:

Dist(P, Q) ≤ t
2.3 Privacy Failures
2.3.1 NYC Taxicab Data
In 2014, the NYC Taxi Limo Commission was quite
active on Twitter, sharing visualizations of taxi usage
statistics. This quickly caught the attention of several In-
ternet users, who inquired where the data was sourced.Taxi
Limo Commission responded that the data is available,
however one must file a Freedom of Information Law
(FOIL) request. Freedom of Information laws allow cit-
izens to request data from government agencies. In the
US, this is governed at the federal level.In Canada, there
are similar laws, including the Freedom of Information
and Protection of Privacy Act (FIPPA) and the Munic-
ipal Freedom of Information and Protection of Privacy
Act (MFIPPA). Chris Whong submitted a FOIL request
and released the dataset online [Who14].
He obtained a dataset of all taxi fares and trips in NYC
during 2013 - totaling 19 GB.Thus In New York City, we
would know every taxi driver’s location and income. This
is information we generally consider sensitive, and a taxi
driver might prefer to keep private... It is not surprising
that the Commission used a form of anonymization to
obscure information.
A typical row in the trips dataset looks like the following:
6B111958A39B24140C973B262EA9FEA5, D3B035A03C8A
34DA17488129DA581EE7, VTS, 5, ,2013-12-03 15:46:00,
2013-12-03 16:47:00,1, 3660,22.71, -73.813927,40.698135,-
74.093307,40.829346
15
These fields are:
medallion, hack license, vendor id, rate code, store and
fwd flag, pickup datetime, dropoff datetime, passenger
count, trip time in secs, trip distance, pickup longitude,
pickup latitude, dropoff longitude, dropoff latitude
Although most of these fields are self-explanatory, such as the time and location fields, we are primarily interested in the first two, which indicate a taxi driver's medallion and license number. However, the standard format for these fields is rather different from what is provided; it appears that the dataset has been somehow anonymized to mask these values. Upon inspection, Jason Hall posted on Reddit [Hal14] that the medallion number CFCD208495D565EF66E7DFF9F98764DA, which appears throughout the dataset, had a number of unusually profitable days, earning far more than a taxi driver could hope to make. Vijay Pandurangan dug a bit deeper and made the following discovery [Pan14]: feeding the string '0' to the MD5 hash algorithm produces exactly this identifier. Pandurangan hypothesized that this value corresponded to records where the medallion number was not available, but he took it as a hint: all the medallion and license numbers were simply the plaintext values hashed with MD5. Because the license and medallion numbers are only a few characters long, he was able to compute the MD5 hashes of all possible combinations and obtain the pre-hashing values for the entire dataset. Using other publicly available data, he was further able to match these with driver names, thus matching the real-life names of drivers with incomes and locations: a massive privacy violation! A side-information attack goes beyond revealing information about only the drivers. Imagine saying goodbye to a co-worker as they leave for the day: as you wave them off, you note the location and time of the pickup. You could then consult this dataset later and discover their home address (as well as whether or not they are a generous tipper).
2.3.2 The Netflix Prize
The Netflix Prize competition is another case study of data anonymization gone wrong. A big part of Netflix's strategy is data analysis and statistical thinking: their hit TV shows are conceived based on user data, and their recommendations are tuned to maximize engagement with users. Between 2006 and 2009, they held a contest challenging researchers to improve their recommendation engine. The prize was a highly publicized US $1,000,000, won by a team named BellKor's Pragmatic Chaos using matrix factorization techniques. Netflix provided a training dataset of user data to help teams design their strategies. Each datapoint consisted of an (anonymized) user ID, a movie ID, a rating, and a date. Netflix assured users that the data had been anonymized to protect individual privacy; indeed, the Video Privacy Protection Act of 1988 requires them to do this. Media consumption history is generally considered sensitive or private information, because one might consume media associated with certain minority groups (including of a political or sexual nature).

Sadly, Narayanan and Shmatikov demonstrated that this form of anonymization was insufficient to maintain user privacy [9]. Their approach is illustrated in Figure 1. They took the Netflix dataset and cross-referenced it with public information from the online movie database IMDb, which holds over a hundred million movie reviews. In particular, they attempted to match users between the two datasets by finding users who rated the same movies similarly at around the same time. A few weak matches turned out to be sufficient to re-identify many users, and as a result these users' movie watching history, which they had not revealed publicly, was exposed. A class action lawsuit was filed against Netflix in response to this discovery, and a planned sequel competition was cancelled.

This example illustrates why simple anonymization is insufficient to guarantee privacy, particularly when side information is present.
Figure 1
2.3.3 Massachusetts Group Insurance Commission
In the mid-1990s, the Massachusetts Group Insurance Commission initiated a program in which researchers could obtain hospital visit records for every state employee, at no cost. Due to the sensitive nature of this information, the dataset was anonymized. During his time as governor of Massachusetts, William Weld (later a 2020 Republican presidential candidate) promised that patient privacy was protected. Specifically, while the original dataset included information such as an individual's SSN, name, sex, date of birth, ZIP code, and condition, it was anonymized by removing identifying features such as the individual's name and SSN. As a computer science graduate student, Latanya Sweeney bought the voter rolls of the city of Cambridge, which were then available for $20. These records contained every registered voter's name, address, ZIP code, sex, and date of birth. Sweeney later showed that 87% of the US population is uniquely identified by the combination of ZIP code, sex, and date of birth. Thus, by joining these two datasets, large-scale re-identification of individuals in the hospital visit dataset is straightforward. Sweeney made the point emphatically by sending Governor Weld his own medical records.
2.4 Differential Privacy
Differential privacy is a powerful standard for data privacy proposed by Dwork [14]. It is based on the idea that the outcome of a statistical analysis is essentially independent of whether or not any one individual joins the database: either way, one learns approximately the same thing [12]. It ensures that an adversary's ability to cause harm or benefit to a participant is essentially the same regardless of whether that participant is in the dataset or not. Differential privacy achieves this by adding noise to query results, so that any difference in the output due to the presence or absence of a single person is covered up [8].

In theory, differential privacy offers a robust privacy guarantee even against an adversary with the worst possible background knowledge [8]. As a result, linkage attacks and statistical attacks are neutralized. Because of this strong protection against worst-case background knowledge, differential privacy has come to be regarded as an effective privacy preservation technique.

Therefore, throughout this thesis, we describe its properties and analyze a selected case study of it.
Chapter 3
3 Literature survey on Differential privacy
In this chapter we introduce differential privacy. We start with the first differentially private algorithm, due to Warner in 1965 [13].
3.1 Randomized response
We work in a very simple setting. Assume you are the instructor of a large class that has just taken an important exam. You suspect that many students in the class cheated, but you are not sure. What is the best way to estimate how many students cheated? Naturally, students are unlikely to admit to cheating if asked directly.

More precisely: there are n people, and individual i has a sensitive bit Xi ∈ {0, 1}. Their goal is to prevent anyone else from learning Xi. The analyst receives a message Yi from each person, which may depend on Xi and on some random numbers generated by that individual. Based on these Yi's, the analyst would like to estimate

p = (1/n) ∑_{i=1}^{n} Xi
We can start with the most obvious approach: each individual sends Yi equal to the sensitive bit Xi,

Yi = { Xi      with probability 1
       1 − Xi  with probability 0 }    (1)

The analyst can then simply compute p = (1/n) ∑_{i=1}^{n} Yi. In other words, the result is perfectly accurate. However, the analyst sees Yi, which is equal to Xi, and thus learns each individual's private bit exactly: there is no privacy.
Consider an alternative strategy, as follows:

Yi = { Xi      with probability 1/2
       1 − Xi  with probability 1/2 }    (2)

In this case, Yi is perfectly private: it is in fact a uniform bit which does not depend on Xi at all, so the analyst cannot infer anything about Xi. But at the same time, all accuracy is lost: Z = (1/n) ∑_{i=1}^{n} Yi is distributed as (1/n)·Binomial(n, 1/2), which is completely independent of the statistic p we are trying to estimate.

At this point we have two approaches: one which is perfectly private but not accurate, and one which is perfectly accurate but not private. The right approach is to interpolate between these two extremes. Consider the following strategy, which we will call Randomized Response, parameterized by some γ ∈ [0, 1/2]:
Yi = { Xi      with probability 1/2 + γ
       1 − Xi  with probability 1/2 − γ }    (3)

How private is the message Yi with respect to the true bit Xi? Note that γ = 1/2 corresponds to the first strategy, and γ = 0 to the second. What if we choose a γ in the middle, such as γ = 1/4? Then there is a certain level of "plausible deniability" associated with the individual's disclosure of the private bit: while Yi = Xi with probability 3/4, it could be that their true bit was 1 − Yi, and this event happened with probability 1/4. Informally speaking, how "deniable" their response is corresponds to the level of privacy they are afforded; in this way, they get a stronger privacy guarantee as γ approaches 0. Observe that

E[Yi] = 2γXi + 1/2 − γ    (4)

and thus

E[ (1/(2γ)) (Yi − 1/2 + γ) ] = Xi    (5)
This leads to the following natural estimator:

p̃ = (1/n) ∑_{i=1}^{n} (1/(2γ)) (Yi − 1/2 + γ)    (6)

It can be shown that, with high probability, the error |p − p̃| satisfies

|p − p̃| ≤ O( 1/(γ√n) )    (7)

As n → ∞, this error goes to 0. Equivalently, if we want additive error α, we require on the order of 1/(γ²α²) samples. Note that as γ gets closer to 0 (corresponding to stronger privacy), the error increases. This is natural: the stronger the privacy guarantee, the lower the accuracy. In order to go further in quantifying the level of privacy, we must (finally) introduce differential privacy.

Differential privacy is a formalization of the notion of "plausible deniability" mentioned above.
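As an illustration, here is a minimal Python sketch (mine, not taken from the thesis code) of randomized response and the estimator in equation (6), assuming NumPy is available:

import numpy as np

rng = np.random.default_rng(0)

def randomized_response(x, gamma, rng):
    """Each individual reports their true bit with probability 1/2 + gamma."""
    keep = rng.random(x.shape) < 0.5 + gamma
    return np.where(keep, x, 1 - x)

def estimate_p(y, gamma):
    """Unbiased estimator of the true proportion p, as in equation (6)."""
    return np.mean((y - 0.5 + gamma) / (2 * gamma))

n = 100_000
x = (rng.random(n) < 0.3).astype(int)   # true bits, so p = 0.3
for gamma in [0.25, 0.1, 0.01]:
    y = randomized_response(x, gamma, rng)
    print(gamma, round(estimate_p(y, gamma), 3))
# Smaller gamma (stronger privacy) gives a noisier estimate of p = 0.3.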
3.2 Differential privacy
"Differential privacy" refers to a promise, made by a data holder, or curator, to a data owner: "There will be no adverse effects on you if you allow your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available." At their best, differentially private database mechanisms make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data protection plans, data usage agreements or restricted views.

Differential privacy addresses the paradox of learning useful information about a population while learning nothing about any individual. We may learn from a medical database that smoking causes cancer, which might affect an insurance company's view of a smoker's long-term medical costs. Has the analysis harmed the smoker? Perhaps: he may see his insurance premium rise if the insurer knows he smokes. He may also be helped: learning of his health risks, he can enter a smoking cessation program. Has the smoker's privacy been violated? It is true that we know more about him now than we did before, but was his information "leaked"? Differential privacy argues that it was not, with the rationale that the effect on the smoker is the same whether or not he participated in the study. It is the conclusions of the study that impact the smoker, not his presence in or absence from the data set.
Differential privacy ensures that the same conclusions, for example that smoking causes cancer, will be reached whether or not any particular individual takes part in the medical study. Specifically, it ensures that any sequence of outputs (responses to queries) is "essentially" equally likely to occur, independent of the presence or absence of any individual in the data set. Here the probabilities are taken over random choices made by the privacy mechanism (something controlled by the data curator), and the term "essentially" is captured by a parameter ε. A smaller ε gives better privacy (and less accurate responses).

Differential privacy is a definition, not an algorithm. For a given computational task T and a given value of ε, there can be many ε-differentially private algorithms; some will have better accuracy than others. When ε is small, finding a highly accurate ε-differentially private algorithm for T can be difficult, much as finding a numerically stable algorithm for a specific computational task can require effort.

We now describe the setting of differential privacy, sometimes called the central model of differential privacy. We imagine there are n individuals, X1 through Xn, who each have their own datapoint in the dataset. They send this datapoint to a "trusted curator": all individuals trust this curator with their raw datapoint, but no one else. Given their data, the curator runs an algorithm M and publicly outputs the result of this computation. Differential privacy is a property of the algorithm M which says that no individual's data has a large impact on the output of the algorithm. The idea is shown in Figure 2 below:
Figure 2
More formally, suppose we have an algorithm M : X^n → Y. Consider any two datasets X, X' ∈ X^n which differ in exactly one entry. We call these neighbouring datasets, and sometimes denote this by X ∼ X'. We say that M is ε-(pure) differentially private (ε-DP) if, for all neighbouring X, X' and all T ⊆ Y, we have

Pr[M(X) ∈ T] ≤ e^ε · Pr[M(X') ∈ T]    (8)

where the randomness is over the choices made by M.

This definition was given by Dwork, McSherry, Nissim, and Smith in their seminal paper in 2006 [14]. It is now widely accepted as a strong and rigorous notion of data privacy. Various technical points about differential privacy are given below:
• Differential privacy is quantitative in nature. A small ε means strong privacy, and the guarantee degrades as ε increases.

• ε should be thought of as a small constant. Anything between 0.1 and 3 might be a reasonable privacy guarantee, and one should be skeptical of claims that fall significantly outside this range.
• This is a worst-case guarantee over all neighbouring datasets X and X'. Even if we expect our data to be randomly generated, we still require privacy for all possible datasets, no matter what.

• The definition bounds the multiplicative increase in the probability of M's output satisfying any event when a single point in the dataset is changed.

• The use of a multiplicative factor e^ε in the probability might seem unnatural. For small ε, a Taylor expansion allows us to treat it as approximately (1 + ε). The definition as given is convenient because e^{ε1} · e^{ε2} = e^{ε1 + ε2}, which is useful mathematically.

• While the definition may not look symmetric, it is: one can simply swap the roles of X and X'.
• Any non-trivial differentially private algorithm must
be randomized.
• We will generally use the term "neighbouring datasets" for datasets where one point in X is changed to obtain X'. This is sometimes called "bounded" differential privacy, in contrast to "unbounded" differential privacy, where a data point is either added or removed. The former definition is usually more convenient mathematically.

• Differential privacy prevents many of the types of attacks we have seen before. The linkage attacks we observed are essentially ruled out: if such an attack were effective with your data in the dataset, it would be almost as effective without it. Reconstruction attacks are also prevented.

• The definition of differential privacy is information theoretic in nature. That is, even an adversary with unlimited computational power and arbitrary background information is still unable to do harm. This is in contrast to cryptography, where the focus is on computationally bounded adversaries.
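To connect the definition back to randomized response from Section 3.1, the following is a minimal sketch (an illustration of mine, not from the thesis) that empirically checks the bound in equation (8) for a single individual's randomized response, whose privacy parameter works out to ε = ln((1/2 + γ)/(1/2 − γ)):

import math
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.25
epsilon = math.log((0.5 + gamma) / (0.5 - gamma))  # ln(3) ≈ 1.10 for gamma = 0.25

def mechanism(x_bit, trials):
    """Randomized response on one bit: report it truthfully with probability 1/2 + gamma."""
    keep = rng.random(trials) < 0.5 + gamma
    return np.where(keep, x_bit, 1 - x_bit)

trials = 200_000
# Neighbouring "datasets" here are the two possible values of one person's bit.
p_out1_given_x1 = np.mean(mechanism(1, trials) == 1)   # roughly 0.75
p_out1_given_x0 = np.mean(mechanism(0, trials) == 1)   # roughly 0.25

ratio = p_out1_given_x1 / p_out1_given_x0
print(ratio, math.exp(epsilon))  # the ratio stays close to and below about e^epsilon = 3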
3.3 Properties of Differential Privacy
3.3.1 Sensitivity
Sensitivity quantifies how much noise is required by a differential privacy mechanism. Two notions of sensitivity are used most often: global sensitivity and local sensitivity.
Global Sensitivity. Global sensitivity measures the maximal difference between the query results on any pair of neighbouring databases that will be used in a differentially private mechanism. The formal definition:

Definition 1: Let f : X^n → R^k. The ℓ1-sensitivity of f is

Δf = max_{X, X'} ‖f(X) − f(X')‖_1

where X and X' are neighbouring databases.

Queries with low sensitivity, such as count or sum, work well with global sensitivity. A counting query, for example, has Δf = 1, which is typically much smaller than the true answer. For queries such as the median or the average, however, the global sensitivity can be much higher.
Local Sensitivity. The amount of noise added by the Laplace mechanism depends on the global sensitivity GS(f) and the privacy parameter ε, but not on the database D itself. For many functions, adding noise calibrated to the global sensitivity yields far more noise than necessary, failing to reflect the function's insensitivity to individual inputs on the actual database. Nissim et al. therefore proposed local sensitivity, which adapts to the difference between query results on the neighbouring databases of the actual input. The formal definition:

Definition 2 (Local Sensitivity) [15]: For f : D^n → R^k and D1 ∈ D^n, the local sensitivity of f at D1 is

LS_f(D1) = max_{D2 ∼ D1} ‖f(D1) − f(D2)‖_1

Observe that the global sensitivity of Definition 1 can be written as

GS(f) = max_{D1} LS_f(D1)

so calibrating noise to the local sensitivity can add less noise than the worst case for queries whose sensitivity varies across databases. For queries such as count or range, the local sensitivity is identical to the global sensitivity. One can also observe that, on many inputs, every differentially private algorithm must add noise at least as large as the local sensitivity. However, finding algorithms whose error matches the local sensitivity is not straightforward: an algorithm that releases f with noise magnitude proportional to LS_f(D1) on input D1 is not, in general, differentially private [15], since the noise magnitude itself can leak information.
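As a small illustration (mine, not from the thesis), the sketch below estimates the local sensitivity of a counting query and of the median on a toy dataset by brute force over neighbouring databases obtained by changing one entry, assuming entries are bounded in [0, 100]:

import numpy as np

def local_sensitivity(f, data, domain):
    """Max change in f when a single entry of `data` is replaced by any domain value."""
    base = f(data)
    worst = 0.0
    for i in range(len(data)):
        for v in domain:
            neighbour = data.copy()
            neighbour[i] = v
            worst = max(worst, abs(f(neighbour) - base))
    return worst

data = np.array([0, 0, 50, 100, 100], dtype=float)
domain = np.arange(0, 101, dtype=float)        # assumed bounded domain [0, 100]

count_over_30 = lambda d: float(np.sum(d > 30))
print(local_sensitivity(count_over_30, data, domain))   # 1.0: a count changes by at most 1
print(local_sensitivity(np.median, data, domain))       # 50.0: the median can jump a lot here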
3.4 Mechanism of Differential Privacy
3.4.1 Laplacian Mechanism
Definition 1: The probability density function of the Laplace distribution with location parameter 0 and scale parameter b is

p(x) = (1/(2b)) exp(−|x|/b)

The variance of this distribution is 2b². The Laplace distribution is plotted in Figure 3 below. It is sometimes called the double exponential distribution, because it can be seen as a symmetrized exponential distribution: the exponential distribution is supported on x ∈ [0, ∞) with density proportional to exp(−cx), while the Laplace distribution is supported on x ∈ R with density proportional to exp(−c|x|).
Figure 3
Definition 2: Let f : X^n → R^k. The Laplace mechanism is defined as

M(X) = f(X) + (Y1, ..., Yk)

where the Yi are independent Laplace(Δf/ε) random variables.

Theorem 1: The Laplace mechanism is ε-differentially private.
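The following is a minimal sketch of the Laplace mechanism in Python (the function name and interface are my own, not taken from the thesis code), using numpy.random with scale Δf/ε:

import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=rng):
    """Return f(X) + Laplace(sensitivity / epsilon) noise, added coordinate-wise."""
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(true_answer))
    return np.asarray(true_answer, dtype=float) + noise

# Example: a counting query (sensitivity 1) answered with epsilon = 1
print(laplace_mechanism(38, sensitivity=1.0, epsilon=1.0))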
3.4.2 Counting Queries
Counting queries are one of the basic applications of the Laplace mechanism of differential privacy. A counting query asks "How many rows in the dataset have property P?" If we ask just one such question, the analysis goes as follows: each individual has a bit Xi ∈ {0, 1} indicating whether or not the property holds for their row, and the function f we consider is the sum of these bits. The sensitivity is 1, and thus an ε-differentially private release of this statistic is f(X) + Laplace(1/ε). This introduces error on the order of O(1/ε), independent of the size of the database.

If we want to ask many counting queries, the Laplace mechanism works as follows. Suppose we have k counting queries f = (f1, ..., fk). We simply output the vector f(X) + Y, where the Yi are independent and identically distributed Laplace random variables. The ℓ1-sensitivity in this scenario is k. With this sensitivity bound Δ = k in hand, we add Yi ∼ Laplace(k/ε) noise to each coordinate, answering each counting query with error of magnitude O(k/ε).
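A short sketch of this vector case (illustrative, with made-up counts) looks as follows:

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical true answers to k = 4 counting queries on the same database.
true_counts = np.array([38, 129, 267, 515], dtype=float)
k = len(true_counts)
epsilon = 1.0

# Answering all k queries with total privacy budget epsilon: scale k / epsilon per query.
noisy_counts = true_counts + rng.laplace(scale=k / epsilon, size=k)
print(noisy_counts)  # each answer is off by roughly O(k / epsilon)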
3.5 Challenges in differential privacy
3.5.1 Choosing the Privacy Parameter ε

Since the introduction of the privacy parameter, there has been the question of how to set ε, and choosing the right value has not been addressed adequately. In the usual interpretation, the parameter in ε-differential privacy does not indicate what is revealed about a person; rather, it limits the influence an individual has on the outcome. For queries that retrieve general properties of the data, the influence of ε on an individual is less clear, whereas for queries that ask for specific information, e.g. "Is Mr. Y in the database?", ε relates directly to the disclosure of that information. Lee and Clifton [16] showed that, for a given setting of ε, an adversary's ability to identify a particular individual in a database varies depending on the values in the data, the queries, and even on values that are not in the data. An improper value of ε can therefore cause a privacy breach; even for the same value of ε, the degree of protection provided by an ε-differentially private mechanism varies with the values of the domain attributes and the type of queries.
Chapter 4
4 Implementation Framework
In this chapter we present the framework used for experimenting with the Laplacian differential privacy mechanism, and the tools we used: programming language, libraries, database, editors, data, etc.
4.1 Scenario Description
Differential privacy can currently be implemented in several ways and under various settings, so some assumptions must be made. This thesis follows a basic architecture in which a secured server is connected to a data store and provides differential privacy mechanisms (described in the next section) while remaining efficient. The scenario is that a data owner places datasets in a secured system so that a data analyst can use the information for a particular purpose, while data privacy is provided by differential privacy methods.

Of the various mechanisms of differential privacy, we have implemented one of the primary methods for protecting sensitive data: the Laplace mechanism, whose use in our experiment is described in the following sections.
4.2 Architecture

The architecture in the figure below depicts how we implement one of the existing mechanisms of differential privacy, the Laplacian mechanism. The elements of the architecture are described in the following sections.
Figure 4
4.2.1 User Interface

The user interface is used by the data analyst to request execution of a query against the database; the result returned to the analyst is the noisy version of the query result, generated by one of the mechanisms of differential privacy. This interface is thus the data analyst's way of doing their job.
4.2.2 Privacy Preserving Mechanism
The privacy-preserving mechanism used in this experiment is the Laplacian differential privacy mechanism, which was discussed in the previous sections. As we saw for counting queries, if the number of queries is one then the sensitivity taken for the algorithm is 1, i.e. Δ = 1. We take different values of epsilon, repeat the experiment 50 times for each, and record the resulting values, which are reported in the result section.
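The thesis code itself is not reproduced here; the following is a minimal sketch, under my own naming assumptions, of how such an experiment loop could look: for each ε, a sensitivity-1 count is perturbed with Laplace(1/ε) noise 50 times and the noisy answers are averaged.

import numpy as np

rng = np.random.default_rng(42)

def run_experiment(true_count, epsilons, repetitions=50):
    """For each epsilon, average `repetitions` noisy answers of a sensitivity-1 count."""
    results = {}
    for eps in epsilons:
        noisy = true_count + rng.laplace(scale=1.0 / eps, size=repetitions)
        results[eps] = noisy.mean()
    return results

# Hypothetical true count for one education-number bucket (see the result section).
print(run_experiment(true_count=38, epsilons=[0.001, 0.01, 0.1, 0.5, 1, 2]))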
4.2.3 Database
The database used in the experiment is PostgreSQL, a free and open-source relational database, which enables us to store and manage the data (the adult dataset) used in the experiment.
4.3 Dataset
We have used one of the well-known benchmark datasets, UCI's adult dataset, which was extracted from the 1994 US census and donated to the UCI repository in 1996. It contains more than 30,000 rows (census records) with 15 columns, as follows:
Attribute        Type       Values
Age              Numerical
Workclass        Nominal    Private, Self-emp-not-inc, Self-emp-inc, Federal-gov,
                            Local-gov, State-gov, Without-pay, Never-worked
Fnlwgt           Numerical
Education        Nominal    Bachelors, Some-college, 11th, HS-grad, Prof-school,
                            Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters,
                            1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education-num    Numerical
Marital-status   Nominal    Married-civ-spouse, Divorced, Never-married, Separated,
                            Widowed, Married-spouse-absent, Married-AF-spouse
Occupation       Nominal    Tech-support, Craft-repair, Other-service, Sales,
                            Exec-managerial, Prof-specialty, Handlers-cleaners,
                            Machine-op-inspct, Adm-clerical, Farming-fishing,
                            Transport-moving, Priv-house-serv
Relationship     Nominal    Wife, Own-child, Husband, Not-in-family, Other-relative,
                            Unmarried
Race             Nominal    White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other,
                            Black
Sex              Nominal    Female, Male
Capital-gain     Numerical
Capital-loss     Numerical
Hours-per-week   Numerical
Native-country   Nominal    United-States, Cambodia, England, Puerto-Rico, Canada,
                            Germany, Outlying-US(Guam-USVI-etc), India, Japan,
                            Greece, South, China, Cuba, Iran, Honduras, Philippines,
                            Italy, Poland, Jamaica, Vietnam, Mexico, Portugal,
                            Ireland, France, Dominican-Republic, Laos, Ecuador,
                            Taiwan, Columbia, Haiti, Hungary, Guatemala, Nicaragua,
                            Scotland, Thailand, Yugoslavia, El-Salvador,
                            Trinadad&Tobago, Peru, Hong, Holand-Netherlands
Income           Nominal    <=50K, >50K

Table 3
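As a hedged illustration (the exact loading code and file path used in the thesis are not shown), the adult dataset could be loaded into pandas with the column names above before being inserted into PostgreSQL:

import pandas as pd

columns = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]

# "adult.data" is the standard UCI file name; the local path is an assumption.
adult = pd.read_csv("adult.data", names=columns, skipinitialspace=True)
print(adult[["education_num", "race", "income"]].head())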
4.4 Database Query
The query used in the experiment is a counting query that returns, by income and race, the number of people for each number of years spent in education. The query is shown in the figure below:
Figure 5
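The exact SQL shown in Figure 5 is not reproduced in this text, so the following is only a plausible reconstruction, with hypothetical table and column names, of how such a counting query could be executed through psycopg2:

import psycopg2

# Connection parameters are placeholders, not the ones used in the thesis.
conn = psycopg2.connect(dbname="adult_db", user="postgres",
                        password="postgres", host="localhost")

query = """
    SELECT education_num, COUNT(*) AS num_people
    FROM adult                       -- hypothetical table name for the adult dataset
    WHERE race = 'White'             -- filter assumed from the result section
    GROUP BY education_num
    ORDER BY education_num;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)               # one true count per education-number bucket
    true_counts = cur.fetchall()
print(true_counts)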
4.5 Programming Language and Editors
The programming language used in this experiment is Python, together with libraries such as NumPy, pandas, matplotlib, and psycopg2. The editor used in this experiment is Sublime Text.
Chapter 5
5 Result
5.1 Utility Of Data and Level of Privacy
The distribution of the query results in their original form, and after the Laplacian differential privacy mechanism is applied, is shown in the two figures below. The epsilon value was taken as 1. As seen from the figures, the two distributions are almost the same, which supports the claim of differential privacy that the result of the query is essentially the same whether or not any particular individual is present in the database.
Figure 6
Figure 7
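Figures 6 and 7 were produced by the thesis code, which is not reproduced here; a minimal matplotlib sketch of the same comparison, assuming the true counts are already available as a list, could look like this:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

true_counts = [38, 129, 267, 515, 381, 708, 927, 308,
               7362, 4952, 874, 680, 2667, 666, 132, 93]   # true values from Table 4
epsilon = 1.0
noisy_counts = np.array(true_counts) + rng.laplace(scale=1.0 / epsilon,
                                                   size=len(true_counts))

x = np.arange(1, len(true_counts) + 1)       # education-number buckets
plt.bar(x - 0.2, true_counts, width=0.4, label="original counts")
plt.bar(x + 0.2, noisy_counts, width=0.4, label="noisy counts (epsilon = 1)")
plt.xlabel("education number")
plt.ylabel("count")
plt.legend()
plt.show()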
The utility of the data and the level of privacy depend significantly on ε. With ε = 1 we saw only a small change in the distribution, which is not easily visible from the graphs. As discussed in the previous sections, the smaller the value of ε, the greater the privacy. To see this in action for the counting query used in our experiment, the experiment was repeated for varying values of ε, specifically ε = 0.001, 0.01, 0.1, 0.5, 1, and 2. For the sake of accuracy, the experiment was repeated 50 times for each value of ε. The table below shows the results collected:
ε = 0.001   ε = 0.01   ε = 0.1    ε = 0.5    ε = 1       ε = 2      True value
27.81249    35.39051   37.32668   38.00431   37.94904    38.14261   38
114.259     129.8866   128.2117   128.9040   128.98781   128.9992   129
376.4223    264.8106   265.7166   266.9909   267.0048    266.9134   267
508.4103    521.4273   514.2764   515.4326   515.1102    515.0179   515
318.2986    367.7279   380.4541   381.2337   381.0258    380.9488   381
633.4579    710.7502   708.2796   707.9570   708.0018    707.9361   708
905.9031    912.2191   927.7372   926.6723   927.0431    927.0012   927
229.2557    310.3037   307.2509   308.2056   308.1023    308.0594   308
7361.250    7363.948   7365.089   7361.968   7362.020    7361.984   7362
4766.829    4986.021   4951.782   4951.468   4951.640    4951.942   4952
999.5070    889.2168   874.2032   873.7415   874.0083    874.0150   874
668.4595    680.4382   678.6779   679.9279   679.9711    680.0249   680
2745.33     2675.99    2667.15    2666.96    2666.99     2666.95    2667
643.886     672.841    665.517    666.025    666.140     666.002    666
93.3477     141.3640   131.3123   131.9790   132.0313    132.035    132
114.49      89.435     92.744     93.117     93.166      92.967     93

Table 4
In total there are 16 values of the education number in the counting query described earlier. For each of these 16 values, the "True value" column of the table is the number of people of race 'White' who have spent x years in education, where x is the education number, without applying Laplacian differential privacy. The other columns, one for each value of ε, show how the true value changes when differential privacy is applied. Observing the table, as the value of ε becomes smaller, the noise added by the Laplace mechanism becomes larger, which verifies the claims of the theory and of previous experimental work on differential privacy.
Chapter 6
6 Conclusion
Differential privacy is one of the most widely used privacy-preserving mechanisms in the world today. I have implemented one of the mechanisms of differential privacy, the Laplacian mechanism, in Python, and using a counting query I have verified that the parameter ε is the most important factor in determining the utility of the data and the level of privacy preserved. We have shown how the result is affected by taking different values of ε, and have verified the claim [15], [16] that a smaller value of epsilon results in better privacy.
Chapter 7
7 Future Work
Although some work has been done on how to choose the value of ε, there is no universally accepted protocol for determining the value of ε for particular types of situations. More work therefore needs to be done in this area.
References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Burlington, MA, USA: Morgan Kaufmann, 2011.

[2] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models and algorithms," in Privacy-Preserving Data Mining. New York, NY, USA: Springer, 2008, pp. 11–52.

[3] C. C. Aggarwal, Data Mining: The Textbook. New York, NY, USA: Springer, 2015.

[4] M. Langheinrich, "Privacy in ubiquitous computing," in Ubiquitous Computing Fundamentals. Boca Raton, FL, USA: CRC Press, 2009, ch. 3, pp. 95–159.

[5] Universal Declaration of Human Rights, United Nations General Assembly, New York, NY, USA, 1948, pp. 1–6. [Online]. Available: http://www.un.org/en/documents/udhr/

[6] D. Banisar et al., "Privacy and human rights: An international survey of privacy laws and practice," Global Internet Liberty Campaign, London, U.K., Tech. Rep., 1999.

[7] A. F. Westin, "Privacy and freedom," Washington & Lee Law Review, vol. 25, no. 1, p. 166, 1968.

[8] A. Blum and Y. Monsour, "Learning, regret minimization, and equilibria," 2007.

[9] A. Narayanan and V. Shmatikov, "How to break anonymity of the Netflix prize dataset," 2006. [Online]. Available: https://arxiv.org/abs/cs/0610105

[10] P. Samarati and L. Sweeney, "Generalizing data to provide anonymity when disclosing information," in Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '98), New York, NY, USA, 1998, p. 188.

[11] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith, "Composition attacks and auxiliary information in data privacy," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), New York, NY, USA, 2008, pp. 265–273.

[12] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin, "Pan-private streaming algorithms," in Proceedings of the 1st Symposium on Innovations in Computer Science (ICS), 2010.

[13] S. L. Warner, "Randomized response: A survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–69, 1965.

[14] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Proceedings of the 3rd Conference on Theory of Cryptography (TCC '06), Berlin, Heidelberg, 2006, pp. 265–284.

[15] K. Nissim, S. Raskhodnikova, and A. Smith, "Smooth sensitivity and sampling in private data analysis," in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC '07), San Diego, CA, USA, 2007, pp. 75–84.

[16] J. Lee and C. Clifton, "How much is enough? Choosing ε for differential privacy," in Information Security, 14th International Conference (ISC 2011), Xi'an, China, October 26–29, 2011, Proceedings, 2011, pp. 325–340.
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 

Recently uploaded (20)

Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
Our Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdfOur Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdf
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of Play
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdfUGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Model Attribute _rec_name in the Odoo 17
Model Attribute _rec_name in the Odoo 17Model Attribute _rec_name in the Odoo 17
Model Attribute _rec_name in the Odoo 17
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 

Implementation_of_laplacian_differential_privacy_with_varying_epsilonv3.pdf

  • 5. List of Figures 1. Illustration of Netflix Attack .......14 2. Model of Differential privacy ........21 3. Laplacian Distribution .............. 28 4. Experimental Framework .............. 32 5. Counting Query ...................... 36 6. Noisy Distribution .................. 37 7. Original Distribution ............... 38 4
  • 6. List of Tables: 1. Anonymity Table (p. 8); 2. Anonymity Table 2 (p. 9); 3. UCI's Dataset (pp. 35-36); 4. Varying ε (p. 32)
  • 7. Abstract Significant advances in computing technology are bringing many benefits to society, creating change and financial opportunity in health care, transportation, education, law enforcement, security, commerce, and social interaction. Data is now everywhere: the world has become information centric, which is no surprise at a time when data storage is cheap and accessible. Many areas of human endeavour use this data to conduct research, track the behaviour of users, recommend products, or evaluate national security risks. Many of these benefits, however, involve the use of sensitive data and therefore raise privacy concerns. Methods that allow knowledge to be extracted from data while preserving the privacy of users are known as privacy-preserving data mining (PPDM) techniques. Differential privacy is one of the best of these techniques. In this thesis we implement one mechanism of differential privacy, the Laplacian mechanism, and verify the claim made by differential privacy that the smaller the value of epsilon, the better the privacy.
  • 8. Chapter 1 1 Introduction More and more information is collected by electronic devices in electronic form and made available on the Web. With powerful data mining tools being developed and put into use, there is growing concern that data mining poses a threat to our privacy and data security. It is important to note, however, that many data mining applications do not touch personal data at all; examples include applications involving natural resources, flood prediction, meteorology, astronomy, geography, geology, biology, and other scientific and engineering data. Moreover, most data mining research focuses on developing reliable algorithms and does not involve access to personal data. For the applications that do involve personal data, simple measures such as removing sensitive attributes can in many cases protect the privacy of most individuals. Nevertheless, privacy concerns remain wherever personally identifiable information is collected and stored in digital form and data mining programs can access it, even during data preparation [1]. To prevent privacy violations, privacy preservation methods have been developed that protect the data owner from exposure by modifying the original data [2], [3]. However, transforming the data may also reduce its utility, resulting in inaccurate
  • 9. extraction of knowledge through data mining. This is the paradigm known as Privacy-Preserving Data Mining (PPDM). PPDM models are designed to guarantee some level of privacy while maximising the utility of the data, so that data mining can still be performed efficiently on the transformed data. In this thesis we discuss several privacy preserving data mining techniques and the problems associated with them, show why differential privacy is the best PPDM technique, and implement one of its mechanisms, the Laplacian mechanism, on the Adult dataset. We also describe the framework associated with differential privacy.
  • 10. Chapter 2 2 Knowledge on Privacy Preserving 2.1 Privacy Although everyone has some concept of privacy in mind, there is no universally accepted standard definition [4]. Privacy has been recognized as a right in the Universal Declaration of Human Rights [5]. The difficulty in defining privacy is largely a consequence of the broad range of areas to which it applies. The scope of privacy can be divided into four categories [6]: information, which concerns the collection and maintenance of personal data; bodily, which relates to physical harm from inappropriate procedures; communications, which refers to any form of communication; and territorial, which concerns the invasion of physical boundaries. In this thesis we focus on the information category, which comprises systems that collect, analyse and publish data. In the information scope, Westin [7] defined privacy as "the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others", in other words the right to control the handling of one's information. Other authors define privacy as "the right of an individual to be secure from unauthorised disclosure of information about oneself that is contained in an electronic repository". Thus, the main idea of information privacy is to have control over the collection and handling of one's personal data.
  • 11. An additional data privacy model, developed by Dalenius in 1977 [8], articulated the privacy goal of databases: anything about a particular individual that can be learned from the database should also be learnable without access to the database. An important component of this notion is ensuring that the difference between the adversary's beliefs about an individual before and after seeing the output is small. This type of privacy, however, cannot be achieved: Dwork demonstrated that such an assurance is impossible whenever background knowledge exists. A new approach to privacy protection was therefore adopted by Dwork: the risk to one's privacy, or in general any risk, such as the risk of being denied automobile insurance, should not substantially increase as a result of participating in a database [8]. Much earlier work has been done to protect privacy; in the next sections we discuss the privacy models used in privacy preserving data publishing, namely k-anonymity, l-diversity and t-closeness, and then differential privacy. 2.2 Privacy Preserving Data Publishing Techniques PPDM at the data publishing stage is also known as Privacy Preserving Data Publishing (PPDP). It has been shown that merely removing attributes that explicitly identify users (explicit identifiers) is not an effective method [9]. Users can still be identified through pseudo or quasi-identifiers (QIDs) and through sensitive attributes. A QID is a non-sensitive attribute (or set of attributes) that does not explicitly identify a user but can be combined with data
  • 12. from other public sources to de-anonymize the owner of a record; such attacks are known as linkage attacks [9]. Sensitive attributes are person-specific private attributes that must not be publicly disclosed and that may also be linked to identify individuals (e.g. diseases in medical records). Various PPDP techniques are explained below. 2.2.1 k-anonymity Sweeney and Samarati introduced a notion of data privacy known as k-anonymity [10]. Consider a dataset with attributes of differing kinds. Explicit identifiers such as name and Social Security number would be removed from the dataset completely. Next there are the quasi-identifiers, which do not identify an individual on their own but can be used to do so in combination; in the example below these include date of birth, ZIP code, and sex. Finally there are the sensitive attributes, such as the medical condition, which should remain in the dataset as they are. A dataset is k-anonymous if, for any row, at least k-1 other rows have exactly the same quasi-identifier values. Examples are given in Table 1 and Table 2 below (a programmatic check of this property follows the tables), where Table 1 is 4-anonymous and Table 2 is 6-anonymous.
  • 13. Table 1 (Zipcode, Age and Nationality are non-sensitive quasi-identifiers; Condition is the sensitive attribute):
  #   Zipcode   Age    Nationality   Condition
  1   130**     <30    *             Aids
  2   130**     <30    *             HeartDisease
  3   130**     <30    *             ViralInfection
  4   130**     <30    *             ViralInfection
  5   130**     >=40   *             Cancer
  6   130**     >=40   *             HeartDisease
  7   130**     >=40   *             ViralInfection
  8   130**     >=40   *             ViralInfection
  9   130**     3*     *             Cancer
  10  130**     3*     *             Cancer
  11  130**     3*     *             Cancer
  12  130**     3*     *             Cancer

  Table 2:
  #   Zipcode   Age    Nationality   Condition
  1   130**     <35    *             Aids
  2   130**     <35    *             Tuberculosis
  3   130**     <35    *             Flu
  4   130**     <35    *             Tuberculosis
  5   130**     >=35   *             Cancer
  6   130**     >=35   *             Cancer
  7   130**     >=40   *             Cancer
  8   130**     >=40   *             Cancer
  9   130**     3*     *             Cancer
  10  130**     3*     *             Tuberculosis
  11  130**     3*     *             ViralInfection
  12  130**     3*     *             ViralInfection
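To make the definition concrete, here is a minimal sketch of how one might check k-anonymity of a released table programmatically. It is illustrative only: the column names mirror Tables 1 and 2 above, and this is not code from the thesis implementation.

```python
# Sketch: checking k-anonymity of a released table with pandas.
# The reported k is the size of the smallest equivalence class over the
# quasi-identifier columns.
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the largest k for which the table is k-anonymous."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy generalised table in the spirit of Table 1
released = pd.DataFrame({
    "zipcode":     ["130**"] * 8,
    "age":         ["<30", "<30", "<30", "<30", ">=40", ">=40", ">=40", ">=40"],
    "nationality": ["*"] * 8,
    "condition":   ["Aids", "HeartDisease", "ViralInfection", "ViralInfection",
                    "Cancer", "HeartDisease", "ViralInfection", "ViralInfection"],
})

print(k_anonymity(released, ["zipcode", "age", "nationality"]))  # -> 4
```

The same check applied to the full tables reports the anonymity level claimed in the text, since it simply measures the smallest group that shares the same quasi-identifier values.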
  • 14. Several vulnerabilities exist in this strategy. Imagine a friend of mine who is 35 years old visited the hospital corresponding to Table 1; then we would be able to conclude that he has cancer, since every record in his age group shares that condition. Also, suppose someone I know who is 28 years old visited both hospitals and thus appears in both tables: by intersecting the two age groups we can infer that they have AIDS. Such issues were highlighted by Ganta, Kasiviswanathan and Smith [11]. 2.2.2 l-diversity The k-anonymity model is extended by requiring that every equivalence class adhere to the l-diversity principle. An l-diverse equivalence class is a set of entries whose sensitive attribute has at least l 'well-represented' values, and a table is l-diverse if all of its equivalence classes are l-diverse. 'Well-represented' has no single concrete meaning; there are different versions of the l-diversity principle that differ on this definition. A simple instantiation considers the sensitive attribute well represented if it takes at least l distinct values in an equivalence class, which is known as distinct l-diversity. Under this condition there are at least l records in each equivalence class (since l distinct values are required), so the table also satisfies k-anonymity with k = l. A stronger principle is entropy l-diversity, defined as follows: an equivalence class is entropy-l-diverse if the entropy of its sensitive attribute value distribution is at least log(l). That is:
  • 15. $$-\sum_{s \in S} P(QID, s)\,\log P(QID, s) \ge \log(l)$$ where s is a possible value of the sensitive attribute S, and P(QID, s) is the fraction of records in the QID equivalence class that take the value s for S. 2.2.3 t-closeness This model requires the distribution of sensitive values in each equivalence class to be as close as possible to the corresponding distribution in the original table, where "close" is bounded by a threshold t. That is, the distance between the distribution of a sensitive attribute in the original table and the distribution of the same attribute in any equivalence class must be less than or equal to t (the t-closeness principle). Formally, the principle may be written as follows. Let $Q = (q_1, q_2, \ldots, q_m)$ be the distribution of the values of the sensitive attribute in the original table and $P = (p_1, p_2, \ldots, p_m)$ the distribution of the same attribute in an equivalence class. The class satisfies t-closeness if $\mathrm{Dist}(P, Q) \le t$.
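The l-diversity and t-closeness conditions above can be evaluated per equivalence class with a few lines of pandas and NumPy. The sketch below is an illustration under stated assumptions: the column names mirror the earlier tables, and the distance Dist is left abstract in the text, so total variation distance is used here purely as a placeholder.

```python
# Sketch: per-class distinct/entropy l-diversity and a simple closeness measure.
# Column names and the choice of Dist (total variation, as a placeholder) are
# assumptions for illustration, not part of the thesis implementation.
import numpy as np
import pandas as pd

def diversity_and_closeness(df, quasi_identifiers, sensitive):
    overall = df[sensitive].value_counts(normalize=True)        # distribution Q
    report = {}
    for qid, group in df.groupby(quasi_identifiers):
        p = group[sensitive].value_counts(normalize=True)       # distribution P
        distinct_l = int(p.size)                                 # distinct l-diversity
        entropy = -(p * np.log(p)).sum()
        entropy_l = float(np.exp(entropy))                       # entropy-l-diverse for any l <= exp(entropy)
        idx = p.index.union(overall.index)
        q_full = overall.reindex(idx, fill_value=0.0)
        p_full = p.reindex(idx, fill_value=0.0)
        dist = float(0.5 * np.abs(p_full - q_full).sum())        # placeholder Dist(P, Q)
        report[qid] = {"distinct_l": distinct_l, "entropy_l": entropy_l, "dist": dist}
    return report
```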
  • 16. 2.3 Privacy Failures 2.3.1 NYC Taxicab Data In 2014, the NYC Taxi and Limousine Commission was quite active on Twitter, sharing visualizations of taxi usage statistics. This quickly caught the attention of several Internet users, who asked where the data came from. The Commission responded that the data was available, but that one had to file a Freedom of Information Law (FOIL) request. Freedom of Information laws allow citizens to request data from government agencies; in the US this is governed at the federal level, and Canada has similar laws, including the Freedom of Information and Protection of Privacy Act (FIPPA) and the Municipal Freedom of Information and Protection of Privacy Act (MFIPPA). Chris Whong submitted a FOIL request and released the dataset online [Who14]: all taxi fares and trips in NYC during 2013, totalling 19 GB. With this data we would know every New York City taxi driver's locations and income, information that is generally considered sensitive and that a driver might prefer to keep private. It is not surprising, then, that the Commission used a form of anonymization to obscure identifying information. A typical row in the trips dataset looks like the following: 6B111958A39B24140C973B262EA9FEA5, D3B035A03C8A34DA17488129DA581EE7, VTS, 5, , 2013-12-03 15:46:00, 2013-12-03 16:47:00, 1, 3660, 22.71, -73.813927, 40.698135, -74.093307, 40.829346
  • 17. These fields are: medallion, hack license, vendor id, rate code, store and fwd flag, pickup datetime, dropoff datetime, passenger count, trip time in secs, trip distance, pickup longitude, pickup latitude, dropoff longitude, dropoff latitude. Although most of these fields are self-explanatory, such as the time and location fields, we are primarily interested in the first two, which indicate a taxi's medallion and the driver's license number. However, the standard format for these fields is rather different from what appears in the data: it seems the dataset was somehow anonymized to mask these values. Upon inspection, Jason Hall posted on Reddit [Hal14] that the medallion identifier CFCD208495D565EF66E7DFF9F98764DA appeared in the dataset with a number of unusually profitable days, earning far more than a taxi driver could hope to make. Vijay Pandurangan dug a bit deeper and made the following discovery [Pan14]: feeding the string '0' to the MD5 hash algorithm produces exactly the identifier given above. Pandurangan hypothesized that this identifier corresponded to trips where the medallion number was unavailable, and he took it as a hint that all the medallion and license numbers were simply the plaintext values hashed with MD5. Because license and medallion numbers are only a few characters long, he was able to compute the MD5 hashes of all possible combinations and recover the pre-hashing values for the entire dataset. Using other publicly available data, he was further able
  • 18. to match these with driver names, thereby linking the real names of drivers to their incomes and locations: a massive privacy violation. A side-information attack goes beyond revealing information about the drivers alone. Imagine saying goodbye to a co-worker as they leave for the day; as you wave them off, you note the location and time of their pickup. You could later look up this trip in the dataset and discover their home address (as well as whether or not they are a generous tipper). 2.3.2 The Netflix Prize The Netflix Prize competition is another case study of data anonymization gone wrong. A big part of Netflix's strategy is data analysis and statistical thinking: its hit TV shows are conceived based on user data, and its recommendations are tuned to maximize user engagement. Between 2006 and 2009, Netflix held a contest challenging researchers to improve its recommendation engine. The prize was a highly publicized US $1,000,000, eventually won by the team BellKor's Pragmatic Chaos using matrix factorization techniques. Netflix provided a training dataset of user data to help teams design their strategies; each data point consisted of an (anonymized) user ID, a movie ID, a rating, and a date. Netflix assured users that the data had been anonymized to protect individual privacy; indeed, the Video Privacy Protection Act of 1988 requires it to do so. Media consumption history is generally considered sensitive or private information, because one might consume media associated with
  • 19. certain minority groups (including groups of a political or sexual nature). Sadly, Narayanan and Shmatikov demonstrated that this form of anonymization was insufficient to protect user privacy [9]. Their approach is illustrated in Figure 1. They took the Netflix dataset and cross-referenced it with public information from the online movie database IMDb, which hosts over a hundred million movie reviews. In particular, they attempted to match users between the two datasets by finding users who rated the same movies similarly at around the same time. A few weak matches turned out to be sufficient to re-identify many users, and as a result these users' movie-watching histories, which they had not revealed publicly, were exposed. A class action lawsuit was filed against Netflix in response to this discovery, which led to the cancellation of a sequel competition. The example illustrates why simple de-identification is insufficient to guarantee privacy, particularly when side information is available.
  • 20. Figure 1: Illustration of the Netflix attack. 2.3.3 Massachusetts Group Insurance Commission In the mid-1990s, the Massachusetts Group Insurance Commission initiated a program in which researchers could obtain hospital visit records for every state employee at no cost. Due to the sensitive nature of this information, the dataset was anonymized. During his time as governor of Massachusetts (and later a 2020 Republican presidential candidate), William Weld promised that patient privacy was protected. Specifically, while the original dataset included information such as an
  • 21. individual's SSN, name, sex, date of birth, ZIP code, and medical condition, it was anonymized by removing explicitly identifying features such as the individual's name and SSN. As a computer science graduate student, Latanya Sweeney bought the voter rolls for the city of Cambridge, which were available at the time for $20. These records contained every registered voter's name, address, ZIP code, sex, and date of birth. It has been shown that 87% of the U.S. population can be uniquely identified by the combination of ZIP code, sex, and date of birth; thus, by linking the two datasets, large-scale re-identification of individuals in the hospital visit data was straightforward. Sweeney made the point vividly by sending Governor Weld his own medical records. 2.4 Differential Privacy Differential privacy is a powerful standard for data privacy proposed by Dwork [10]. It is based on the idea that the outcome of a statistical analysis should be essentially independent of whether or not any one person joins the database: either way, one learns approximately the same thing [12]. It ensures that an adversary's ability to cause harm or benefit to any participant is essentially the same regardless of whether that individual is in the dataset or not. Differential privacy achieves this by adding noise to query results, so that any difference in output due to the presence or absence of a single person is covered up [8]. In theory, differential privacy offers a robust privacy guarantee even against an adversary with the worst possible background knowledge [8]. As a result, linkage attacks and statistical attacks are neutralized. Due to its strong
  • 22. privacy protection against worst-case background knowledge attacks, differential privacy has come to be regarded as an effective privacy preservation technique. Throughout this thesis we therefore describe its properties and analyse a selected case study.
  • 23. Chapter 3 3 Literature survey on Differential privacy In this section we introduce differential privacy, starting with the first differentially private algorithm, due to Warner in 1965 [13]. 3.1 Randomized response We work in a very simple setting. Assume you are the teacher of a large class that has just taken an important exam. You suspect that many students cheated, but you are not sure. What is the best way to estimate how many students cheated? Naturally, students are unlikely to admit to cheating honestly. More precisely: there are n people, and individual i holds a sensitive bit $X_i \in \{0, 1\}$. Each individual's goal is to prevent anyone else from learning $X_i$. The analyst receives a message $Y_i$ from each person, which may depend on $X_i$ and on random numbers generated by that individual. Based on these $Y_i$'s, the analyst would like to estimate $$p = \frac{1}{n}\sum_{i=1}^{n} X_i.$$ The most obvious approach is for each individual to send $Y_i$ equal to the sensitive bit $X_i$: $$Y_i = \begin{cases} X_i & \text{with probability } 1 \\ 1 - X_i & \text{with probability } 0 \end{cases} \quad (1)$$
  • 24. It is clear that the analyst can simply compute $p = \frac{1}{n}\sum_{i=1}^{n} Y_i$; in other words, the result is perfectly accurate. However, the analyst sees $Y_i$, which equals $X_i$, and thus learns each individual's private bit exactly: there is no privacy. Consider an alternative strategy: $$Y_i = \begin{cases} X_i & \text{with probability } 1/2 \\ 1 - X_i & \text{with probability } 1/2 \end{cases} \quad (2)$$ In this case $Y_i$ is perfectly private: it is a uniform random bit that does not depend on $X_i$ at all, so the curator can infer nothing about $X_i$. But at the same time all accuracy is lost: $Z = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is distributed as $\frac{1}{n}\,\mathrm{Binomial}(n, 1/2)$, which is completely independent of the statistic $p$ we want. At this point we have two approaches: one that is perfectly private but not accurate, and one that is perfectly accurate but not private. The right approach is to strike a balance between these two extremes. Consider the following strategy, which we call Randomized Response, parameterized by some $\gamma \in [0, 1/2]$: $$Y_i = \begin{cases} X_i & \text{with probability } 1/2 + \gamma \\ 1 - X_i & \text{with probability } 1/2 - \gamma \end{cases} \quad (3)$$ How private is the message $Y_i$ with respect to the true bit $X_i$? Note that $\gamma = 1/2$ corresponds to the first strategy and $\gamma = 0$ to the second. What if we choose a $\gamma$ in the middle, such as $\gamma = 1/4$? Then
  • 25. there will be a certain level of "plausible deniability" associated with the individual's disclosure of their private bit: while $Y_i = X_i$ with probability 3/4, it could be that their true bit was $1 - Y_i$, an event that happens with probability 1/4. Informally, how "deniable" their response is corresponds to the level of privacy they are afforded; in this way, they get a stronger privacy guarantee as $\gamma$ approaches 0. Observe that $$\mathbb{E}[Y_i] = 2\gamma X_i + 1/2 - \gamma \quad (4)$$ and thus $$\mathbb{E}\left[\frac{1}{2\gamma}\left(Y_i - 1/2 + \gamma\right)\right] = X_i \quad (5)$$ This leads to the following natural estimator: $$\tilde{p} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2\gamma}\left(Y_i - 1/2 + \gamma\right) \quad (6)$$ It can be shown that the error satisfies $$|p - \tilde{p}| \le \frac{1}{\gamma\sqrt{n}} \quad (7)$$ As $n \to \infty$ this error goes to 0. Equivalently, if we want additive error $\alpha$ for a fixed $\gamma$, we require $n = \Omega(1/\alpha^2)$ samples. Note that as $\gamma$ gets closer to 0 (corresponding to stronger privacy), the error increases. This is natural: the stronger the privacy guarantee, the less accurate the estimate. To go further in quantifying the level of privacy, we must (finally) introduce differential privacy, which is a formalization of this notion of "plausible deniability."
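A short simulation makes equations (3) through (7) tangible: as γ shrinks the estimate stays unbiased but its error grows. This is an illustrative sketch, not part of the thesis code; the sensitive bits are drawn synthetically with p = 0.3.

```python
# Sketch: simulating randomized response and the unbiased estimator in (6).
# gamma in (0, 1/2]; smaller gamma gives stronger privacy and larger error.
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(x, gamma):
    """Each bit is reported truthfully with probability 1/2 + gamma."""
    truthful = rng.random(x.shape) < (0.5 + gamma)
    return np.where(truthful, x, 1 - x)

def estimate_p(y, gamma):
    """Unbiased estimator: p_tilde = (1/n) * sum (y_i - 1/2 + gamma) / (2*gamma)."""
    return np.mean((y - 0.5 + gamma) / (2 * gamma))

n = 10_000
x = rng.binomial(1, 0.3, size=n)           # true sensitive bits, p = 0.3
for gamma in (0.25, 0.1, 0.05):
    y = randomized_response(x, gamma)
    print(gamma, abs(estimate_p(y, gamma) - x.mean()))
```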
  • 26. 3.2 Differential privacy "Differential privacy" is a promise made by a data holder, or curator, to a data owner: "You will not be adversely affected by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available." At their best, differentially private database mechanisms make accurate analyses of confidential data widely available without resorting to data clean rooms, data protection plans, data usage agreements or restricted views. Differential privacy addresses the paradox of learning useful information about a population while learning nothing about any individual. We may learn from a medical database that smoking causes cancer, which might affect an insurance company's view of a smoker's long-term medical costs. Has the analysis harmed the smoker? Perhaps: his insurance premium may rise if the insurer knows he smokes. He may also be helped: learning of his health risks, he can enter a smoking cessation program. Has the smoker's privacy been violated? It is true that we know more about him now than before, but was his information "leaked"? Differential privacy argues that it was not, on the grounds that the effects are the same whether or not the smoker participated in the study. It is the conclusions of the study that affect the smoker, not his presence or absence in the data set. Differential privacy ensures that the same conclusions, for example that smoking causes cancer, will be
  • 27. output, independent of whether any individual takes part in the medical study or not. Specifically, it ensures that any sequence of outputs (responses to queries) is "essentially" equally likely to occur, independent of the presence or absence of any individual in the data set. Here the probabilities are taken over the random choices made by the privacy mechanism (something controlled by the data curator), and the term "essentially" is captured by a parameter ε. A smaller ε gives better privacy (and less accurate responses). Differential privacy is a definition, not an algorithm: for a given computational task T and a given value of ε there can be many algorithms that achieve T in an ε-differentially private manner, some with better accuracy than others. When ε is small, finding a highly accurate ε-differentially private algorithm for T can be difficult, much as finding a numerically stable algorithm for a specific computational task can require effort. We now define the setting for differential privacy, sometimes called the central model of differential privacy. We imagine there are n individuals, X1 through Xn, each holding their own data point. They send this data point to a "trusted curator": all individuals trust this curator with their raw data point, but no one else. Given the data, the curator runs an algorithm M and publicly outputs the result of this computation. Differential privacy is a property of the algorithm M, which says that no individual's data has a large impact on the output of the algorithm. The idea is shown in Figure 2 below:
  • 28. Figure 2: Model of differential privacy. More formally, suppose we have an algorithm $M : \mathcal{X}^n \to \mathcal{Y}$. Consider any two datasets $X, X' \in \mathcal{X}^n$ which differ in exactly one entry. We call these neighbouring datasets, sometimes denoted $X \sim X'$. We say that M is ε-(pure) differentially private (ε-DP) if, for all neighbouring $X, X'$ and all $T \subseteq \mathcal{Y}$, we have $$\Pr[M(X) \in T] \le e^{\varepsilon} \Pr[M(X') \in T] \quad (8)$$ where the randomness is over the choices made by M. This definition was given by Dwork, McSherry, Nissim, and Smith in their seminal 2006 paper [14] and is now widely accepted as a strong and rigorous notion of data privacy. Several technical points about differential privacy are listed below: • Differential privacy is quantitative in nature: a small ε means strong privacy, and the guarantee degrades as ε increases.
  • 29. • ε should be thought of as a small constant. Anything between 0.1 and 3 might be a reasonable privacy guarantee, and one should be skeptical of claims significantly outside this range. • This is a worst-case guarantee over all neighbouring datasets X and X'. Even if we expect our data to be randomly generated, we still require privacy for all possible datasets. • The definition bounds the multiplicative increase in the probability of M's output satisfying any event when a single point in the dataset is changed. • The use of a multiplicative factor $e^{\varepsilon}$ in the probability might seem unnatural. For small ε, a Taylor expansion allows us to treat it as approximately (1 + ε). The definition as given is mathematically convenient because $e^{\varepsilon_1} e^{\varepsilon_2} = e^{\varepsilon_1 + \varepsilon_2}$. • While the definition may not look symmetric, it is: one can simply swap the roles of X and X'. • Any non-trivial differentially private algorithm must be randomized. • We generally use "neighbouring datasets" to mean that one point of X is changed to obtain X'. This is sometimes called "bounded" differential privacy, in contrast to "unbounded" differential privacy, where a data point is added or removed. The former definition is usually more convenient mathematically.
  • 30. • It prevents many of the types of attacks we have seen before. The linkage attacks we observed are essentially ruled out: if such an attack were effective with your data in the dataset, it would be almost as effective without it. Reconstruction attacks are also prevented. • The definition of differential privacy is information theoretic in nature: even an adversary with unlimited computational power and background information is still unable to do harm. This is in contrast to cryptography, where the focus is on computationally bounded adversaries. 3.3 Properties of Differential Privacy 3.3.1 Sensitivity Sensitivity quantifies how much noise is required by a differential privacy mechanism. Two notions of sensitivity are used most often: global and local sensitivity. Global Sensitivity. Global sensitivity captures the maximal difference between the query results on neighbouring databases used in a differentially private mechanism. The formal definition: Definition 1: Let $f : \mathcal{X}^n \to \mathbb{R}^k$. The $\ell_1$-sensitivity of f is
  • 31. $$\Delta(f) = \max_{X, X'} \lVert f(X) - f(X') \rVert_1$$ where X and X' are neighbouring databases. Queries with low sensitivity, such as counts or sums over bounded values, work well with global sensitivity when releasing data; a count query, for example, has $\Delta f = 1$, which is typically small compared with the true answer. For queries such as the median or the average, however, the global sensitivity can be much higher. Local Sensitivity. The amount of noise added by the Laplace mechanism depends on GS(f) and the privacy parameter ε, but not on the database D itself. For many functions this yields far more noise than necessary, failing to reflect the function's typical insensitivity to any individual's input. Nissim therefore proposed local sensitivity, which adapts to the difference between query results on the neighbouring databases of the actual input. The formal definition: Definition 2 (Local Sensitivity) [15]: For $f : \mathcal{D}^n \to \mathbb{R}^k$ and $D_1 \in \mathcal{D}^n$, the local sensitivity of f at $D_1$ is $$LS_f(D_1) = \max_{D_2 \sim D_1} \lVert f(D_1) - f(D_2) \rVert_1$$ Observe that the global sensitivity of Definition 1 is recovered as $GS(f) = \max_{D_1} LS_f(D_1)$, so local sensitivity can allow less noise for queries whose global sensitivity is high. For queries such as count or range, the local sensitivity is identical to the global sensitivity. From the definition, one can observe that on many inputs every differentially private algorithm must add noise at least as large as the local sensitivity. However, finding algorithms whose error matches the local sensitivity is not straightforward: an algorithm that releases f with noise magnitude proportional to $LS_f(D_1)$ on input $D_1$ is not, in general, differentially private [15], since the noise magnitude itself can leak information.
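As a small worked illustration of Definition 1, the following sketch brute-forces the global sensitivity of two queries over a toy domain. The domain {0, ..., 10} and database size 3 are arbitrary assumptions, not part of the thesis; the point is only that the count has sensitivity 1 while the mean is far more sensitive, as stated above.

```python
# Sketch: brute-force check of Definition 1 on a tiny domain, using the "bounded"
# notion of neighbours (change one entry). Purely illustrative.
import itertools

DOMAIN = range(11)          # values 0..10
N = 3                       # database size

def global_sensitivity(f):
    worst = 0.0
    for db in itertools.product(DOMAIN, repeat=N):
        base = f(list(db))
        for i in range(N):                      # neighbours: change one entry
            for v in DOMAIN:
                nb = list(db)
                nb[i] = v
                worst = max(worst, abs(base - f(nb)))
    return worst

count_over_5 = lambda db: sum(1 for x in db if x > 5)
mean_value   = lambda db: sum(db) / len(db)

print(global_sensitivity(count_over_5))   # -> 1 (counting query)
print(global_sensitivity(mean_value))     # -> 10/3 (the mean is much more sensitive)
```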
  • 32. 3.4 Mechanism of Differential Privacy 3.4.1 Laplacian Mechanism Definition 1: The probability density function of the Laplace distribution with location parameter 0 and scale parameter b is $$p(x) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$$ The variance of this distribution is $2b^2$. The graph of the Laplace distribution is shown in the figure below. It is sometimes called the double exponential distribution because it can be seen as a symmetric counterpart of the exponential distribution, which is supported only on $x \in [0, \infty)$ with density proportional to $\exp(-cx)$, whereas the Laplace distribution is supported on all of $\mathbb{R}$ with density proportional to $\exp(-c|x|)$.
  • 33. Figure 3: The Laplace distribution. Definition 2: Let $f : \mathcal{X}^n \to \mathbb{R}^k$. The Laplacian mechanism is defined as $$M(X) = f(X) + (Y_1, \ldots, Y_k)$$ where the $Y_i$ are independent $\mathrm{Laplace}(\Delta(f)/\varepsilon)$ random variables.
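Below is a minimal sketch of Definition 2 using NumPy. The `scale` parameter of NumPy's Laplace sampler is the b of the density above, so scale = Δ(f)/ε realises the mechanism described here; this is an illustration, not the thesis source code.

```python
# Minimal sketch of the Laplacian mechanism: add Laplace(Delta(f)/epsilon) noise
# to each coordinate of the true query answer.
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return an epsilon-differentially private version of a numeric query answer."""
    answer = np.asarray(true_answer, dtype=float)
    return answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=answer.shape)

# A single counting query (sensitivity 1) released at epsilon = 1
print(laplace_mechanism(38, sensitivity=1, epsilon=1.0))
```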
  • 34. Theorem 1: The Laplacian mechanism is ε-differentially private. 3.4.2 Counting Queries Counting queries are one of the main applications of the Laplacian mechanism. A counting query asks "How many rows in the dataset have property P?" If we ask just one such question, the analysis goes as follows. Each individual holds a bit $X_i \in \{0, 1\}$ indicating whether the property is true of their row, and the function f we consider is the sum of these bits. The sensitivity is 1, and thus an ε-differentially private release of this statistic is $f(X) + \mathrm{Laplace}(1/\varepsilon)$. This introduces error on the order of $O(1/\varepsilon)$, independent of the size of the database. If we want to ask many queries, the Laplacian mechanism extends as follows. Suppose we have k counting queries $f = (f_1, \ldots, f_k)$. We output the vector $f(X) + Y$, where the $Y_i$ are independent, identically distributed Laplace random variables. The $\ell_1$-sensitivity in this scenario is k, and with the bound $\Delta = k$ in hand we add $Y_i \sim \mathrm{Laplace}(k/\varepsilon)$ noise to each coordinate, answering each counting query with error of magnitude $O(k/\varepsilon)$.
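The k-query case described above can be sketched as follows. The counts used in the example are made-up placeholders, and the naive Δ = k bound is exactly the one stated in the text, so the per-query error visibly grows with the number of queries.

```python
# Sketch: answering k counting queries together with the naive bound Delta = k,
# so each coordinate receives Laplace(k/epsilon) noise and the per-query error is
# of order k/epsilon. The counts below are placeholders, not thesis data.
import numpy as np

rng = np.random.default_rng()

def answer_counting_queries(true_counts, epsilon):
    counts = np.asarray(true_counts, dtype=float)
    k = counts.size                              # l1-sensitivity of the vector of counts
    return counts + rng.laplace(scale=k / epsilon, size=k)

true_counts = [38, 129, 267, 515]                # placeholder counts
for k_prefix in (1, 2, 4):
    noisy = answer_counting_queries(true_counts[:k_prefix], epsilon=1.0)
    print(k_prefix, np.abs(noisy - true_counts[:k_prefix]).mean())
```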
  • 35. 3.5 Challenges in differential privacy 3.5.1 Choosing the Privacy Parameter ε Ever since the privacy parameter was introduced, there has been the question of how to set ε, and choosing the right value has not been addressed adequately. Strictly speaking, the parameter in ε-differential privacy does not indicate what is revealed about a person; rather, it limits the influence an individual has on the outcome. For queries that retrieve general properties of the data, the effect of ε on an individual is less clear, whereas queries that ask for specific information, e.g. "Is Mr. Y in the database?", relate directly to the disclosure of that information. Lee and Clifton showed [16] that, for a given setting of ε, an adversary's ability to track a particular individual in a database varies with the values in the data, the queries, and even values that are not in the data. An improper value of ε can therefore cause a privacy breach, and even for the same value of ε the degree of protection provided by an ε-differentially private mechanism varies with the values of the domain attributes and the type of queries.
  • 36. Chapter 4 4 Implementation Framework In this chapter we present the framework used for experimenting with the Laplacian differential privacy mechanism and the tools we used: programming language, libraries, database, editors and data. 4.1 Scenario Description Differential privacy can currently be implemented in several ways and under various settings, so some assumptions must be made. This thesis follows a basic architecture in which a secured server is connected to a data store and provides a differential privacy mechanism (shown in the next section) while remaining efficient. The scenario is that a data owner places datasets in a secured system so that a data analyst can use the information for a particular purpose, while data privacy is provided by differential privacy methods. Of the various mechanisms of differential privacy, we have implemented one of the primary methods for protecting sensitive data, the Laplace mechanism; the algorithm used is described in the next section.
  • 37. 4.2 Architecture The architecture in the figure below depicts how we implement the Laplacian mechanism of differential privacy. The elements of the architecture are described in the following sections. Figure 4: Experimental framework.
  • 38. 4.2.1 User Interface The user interface is used by the data analyst to request the execution of a query against the database; the result returned to the analyst is the noisy version of the query result generated by the differential privacy mechanism. The user interface is thus how the data analyst does his job. 4.2.2 Privacy Preserving Mechanism The privacy preserving mechanism used in this experiment is the Laplacian mechanism discussed in the previous sections. As we saw for counting queries, when a single query is asked the sensitivity used by the algorithm is 1, i.e. Δ = 1. We take different values of epsilon, repeat the experiment 50 times for each, and report the resulting values in the results chapter (a sketch of this loop is given below). 4.2.3 Database The database used in the experiment is PostgreSQL, a free and open source relational database that lets us store and manage the data (the Adult dataset) used in the experiment.
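The experiment loop described above can be sketched as follows. The true count of 38 is a placeholder (in the actual setup it comes from the PostgreSQL database), while the epsilon values and the 50 repetitions match the description in this section.

```python
# Sketch of the experiment loop: for a counting query (Delta = 1), add
# Laplace(1/epsilon) noise to the true count and repeat 50 times for each epsilon.
import numpy as np

rng = np.random.default_rng()
EPSILONS = [0.001, 0.01, 0.1, 0.5, 1, 2]
REPEATS = 50
SENSITIVITY = 1

def run_experiment(true_count):
    results = {}
    for eps in EPSILONS:
        noisy = true_count + rng.laplace(scale=SENSITIVITY / eps, size=REPEATS)
        results[eps] = noisy                      # 50 noisy answers for this epsilon
    return results

for eps, noisy in run_experiment(true_count=38).items():
    print(eps, round(noisy.mean(), 3))            # summarise each run by its mean
```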
  • 39. 4.3 Dataset We used the well-known UCI Adult dataset, which was extracted from the 1994 US census and donated in 1996. It contains more than 350000 rows (individual records) with 15 columns, as follows:
  • 40. Table 3: Attributes of the UCI Adult dataset (attribute, type, values).
  Age (numerical).
  Workclass (nominal): Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  Fnlwgt (numerical).
  Education (nominal): Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  Education-num (numerical).
  Marital-status (nominal): Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  Occupation (nominal): Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv.
  Relationship (nominal): Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  Race (nominal): White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  Sex (nominal): Female, Male.
  Capital-gain (numerical).
  Capital-loss (numerical).
  Hours-per-week (numerical).
  Native-country (nominal): United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  Income (nominal): <=50K, >50K.

  4.4 Database Query The query used in the experiment is a counting query that returns, for each number of years spent in education, the number of people, broken down by income and race (a sketch of this step appears at the end of this chapter). The query is shown in the figure below: Figure 5: Counting query. 4.5 Programming Language and Editors The programming language used in this experiment is Python, together with libraries such as NumPy, pandas, matplotlib and psycopg2. The editor used is Sublime Text.
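Since Figure 5 is not reproduced in this text, the following sketch only approximates the query step from Section 4.4: the connection parameters, table name (adult) and column names (education_num, race, income) are assumptions that mirror the dataset description above, not the exact SQL used in the thesis.

```python
# Sketch of the counting-query step against PostgreSQL via psycopg2.
# Names and credentials are illustrative placeholders.
import psycopg2

conn = psycopg2.connect(dbname="adultdb", user="postgres",
                        password="secret", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT education_num, COUNT(*)
        FROM adult
        WHERE race = 'White'
        GROUP BY education_num
        ORDER BY education_num;
    """)
    true_counts = dict(cur.fetchall())   # {education_num: count}, the "true values" of Table 4
conn.close()
print(true_counts)
```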
  • 43. Chapter 5 5 Result 5.1 Utility of Data and Level of Privacy The distribution of the query results in original form, and after the Laplacian differential privacy mechanism is applied, is given in the two figures below. The epsilon value was taken as 1. As the figures show, the two distributions are almost the same, which supports the claim of differential privacy that the result of a query is essentially the same whether or not any particular individual is present in the database. Figure 6: Noisy distribution.
  • 44. Figure 7: Original distribution. The utility of the data and the level of privacy depend significantly on ε. With ε = 1 we saw only a small change in the distribution, hardly visible in the graph. Since, as shown in the previous sections, a smaller value of ε gives greater privacy, we now see this in action for the counting query used in our experiment, repeating the experiment for varying values of ε, specifically ε = 0.001, 0.01, 0.1, 0.5, 1 and 2. For the sake of accuracy the experiment is repeated 50 times for each value of ε. The table below shows the results collected:
  • 45. Table 4: Query results under the Laplacian mechanism for varying ε (last column: true counts).
  ε=0.001    ε=0.01     ε=0.1      ε=0.5      ε=1        ε=2        True value
  27.81249   35.39051   37.32668   38.00431   37.94904   38.14261   38
  114.259    129.8866   128.2117   128.9040   128.98781  128.9992   129
  376.4223   264.8106   265.7166   266.9909   267.0048   266.9134   267
  508.4103   521.4273   514.2764   515.4326   515.1102   515.0179   515
  318.2986   367.7279   380.4541   381.2337   381.0258   380.9488   381
  633.4579   710.7502   708.2796   707.9570   708.0018   707.9361   708
  905.9031   912.2191   927.7372   926.6723   927.0431   927.0012   927
  229.2557   310.3037   307.2509   308.2056   308.1023   308.0594   308
  7361.250   7363.948   7365.089   7361.968   7362.020   7361.984   7362
  4766.829   4986.021   4951.782   4951.468   4951.640   4951.942   4952
  999.5070   889.2168   874.2032   873.7415   874.0083   874.0150   874
  668.4595   680.4382   678.6779   679.9279   679.9711   680.0249   680
  2745.33    2675.99    2667.15    2666.96    2666.99    2666.95    2667
  643.886    672.841    665.517    666.025    666.140    666.002    666
  93.3477    141.3640   131.3123   131.9790   132.0313   132.035    132
  114.49     89.435     92.744     93.117     93.166     92.967     93

  In total there are 16 values of education number in the counting query described earlier. For each of these values, the true value in the table is the number of people of race White who have spent x years in education, where x is the education number, without applying Laplacian differential privacy. The other columns, one for each value of ε, show how the true value changes when differential privacy is applied. Observing the table, we see that
  • 46. as the value of ε becomes smaller, the noise added by the Laplacian mechanism becomes larger, verifying the claims of the theorists and experimentalists of differential privacy.
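The trend in Table 4 can be quantified as the mean absolute error of the noisy answers over the 50 repetitions for each ε. The sketch below does this for a single illustrative true count (38, the first row of the table) and shows the error shrinking as ε grows.

```python
# Sketch: mean absolute error of the Laplace-noised count over 50 repetitions for
# each epsilon, quantifying the utility loss visible in Table 4.
import numpy as np

rng = np.random.default_rng()

def mean_abs_error(true_count, epsilon, repeats=50, sensitivity=1):
    noisy = true_count + rng.laplace(scale=sensitivity / epsilon, size=repeats)
    return float(np.mean(np.abs(noisy - true_count)))

for eps in [0.001, 0.01, 0.1, 0.5, 1, 2]:
    print(eps, round(mean_abs_error(38, eps), 3))
```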
  • 47. Chapter 6 6 Conclusion Differential privacy is among the most widely used privacy preserving mechanisms in the world today. I have implemented one of its mechanisms, the Laplacian mechanism, in Python and, using a counting query, verified that the parameter ε is the most important factor in determining the utility of the data and the level of privacy preserved. We have shown how the result is affected by taking different values of ε, and we have verified the claim [15][16] that a smaller value of epsilon results in better privacy.
  • 48. Chapter 7 7 Future Work Although some work has been done on how to set the value of ε, there is no universally accepted protocol for determining the value of ε for particular types of situations. More work therefore needs to be done in this area. References [1] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Third Edition. University of Illinois at Urbana-Champaign and Simon Fraser University. [2] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models and algorithms," in Privacy-Preserving Data Mining. New York, NY, USA: Springer, 2008, pp. 11-52. [3] C. C. Aggarwal, Data Mining: The Textbook. New York, NY, USA: Springer, 2015. [4] M. Langheinrich, "Privacy in ubiquitous computing," in Ubiquitous Computing Fundamentals. Boca Raton, FL, USA: CRC Press, 2009, ch. 3, pp. 95-159. [5] Universal Declaration of Human Rights, United Nations General Assembly, New York, NY,
  • 49. USA, 1948, pp. 1-6. [Online]. Available: http://www.un.org/en/documents/udhr/ [6] D. Banisar et al., "Privacy and human rights: An international survey of privacy laws and practice," Global Internet Liberty Campaign, London, U.K., Tech. Rep., 1999. [7] A. F. Westin, "Privacy and freedom," Washington & Lee Law Review, vol. 25, no. 1, p. 166, 1968. [8] A. Blum and Y. Monsour. Learning, regret minimization, and equilibria, 2007. [9] A. Narayanan and V. Shmatikov. (2006). "How to break anonymity of the Netflix prize dataset." [Online]. Available: https://arxiv.org/abs/cs/0610105 [10] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '98, page 188, New York, NY, USA, 1998. ACM. [11] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 265-273, New York, NY, USA, 2008. ACM. [12] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. Pan-private streaming
  • 50. algorithms. In Proceedings of the International Conference on Super Computing, 2010. [13] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965. [14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265-284, Berlin, Heidelberg, 2006. Springer. [15] K. Nissim, S. Raskhodnikova, and A. Smith. "Smooth Sensitivity and Sampling in Private Data Analysis". In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, STOC '07, San Diego, California, USA: ACM, 2007, pp. 75-84. ISBN: 978-1-59593-631-8. [16] J. Lee and C. Clifton. "How Much Is Enough? Choosing ε for Differential Privacy". In Information Security, 14th International Conference, ISC 2011, Xi'an, China, October 26-29, 2011, Proceedings, pp. 325-340.