Implementation of Laplacian Differential Privacy with Varying Epsilon
A thesis presented for the degree of
Master of Technology in Communication
and Information Technology
by
Jaibran Mohammad
Enrollment Number: 2019MECECI001
under the supervision of
Associate Professor Farida Khursheed
Department of Electronics
National Institute of Technology Srinagar
2021
Certificate
Certified that the project dissertation work entitled "Implementation of Differential Privacy with Varying Epsilon" is bona fide work carried out by Jaibran Mohammad, Enrollment Number: 2019MECECI001, Class Roll Number: 001, in partial fulfillment of the requirements for the award of the Master of Technology degree in Communication and Information Technology from the National Institute of Technology, Srinagar.
Thesis Supervisor
Head of the Department
Acknowledgment
I would like to express my deep and sincere gratitude to my advisor, Associate Professor Farida Khursheed, for her relentless patience, motivation, and unending support, and most importantly for the many insightful conversations we had during the development of the ideas in this thesis.
List of Figures
1. Illustration of Netflix Attack
2. Model of Differential Privacy
3. Laplacian Distribution
4. Experimental Framework
5. Counting Query
6. Noisy Distribution
7. Original Distribution
Abstract
Significant advances in computing technology are bringing many benefits to societies, with changes and financial opportunities created in health care, transportation, education, law enforcement, security, commerce, and social interactions. Today it is a known fact that data is everywhere. The world has become information-centric, and this is not a surprise, especially at a time when data storage is cheap and accessible. Various areas of human endeavour use this data to conduct research, track the behavior of users, recommend products, evaluate national security risks, and much more. Many of these benefits, however, involve the use of sensitive data, and thus raise concerns about privacy. Methods that allow the extraction of knowledge from data while preserving the privacy of users are known as privacy-preserving data mining (PPDM) techniques. Differential privacy is one of the best privacy-preserving data mining techniques. In this thesis we implement one of the mechanisms of differential privacy, the Laplacian mechanism, and verify the claim made by differential privacy that the smaller the value of epsilon, the better the privacy.
Chapter 1
1 Introduction
More and more information is collected by electronic devices in electronic form and made available on the Web. With powerful data mining tools being developed and put into use, there are growing concerns that data mining poses a threat to our privacy and data security. However, it is very important to note that many data mining applications do not even touch personal data. Examples include applications involving natural resources, the prediction of floods, meteorology, astronomy, geography, geology, biology, and other scientific and engineering data. Moreover, most studies in data mining research focus on the development of reliable algorithms and do not involve accessing personal data. For the data mining applications that do involve personal data, in many cases simple methods such as removing sensitive attributes from the data may protect the privacy of most individuals. Nevertheless, privacy concerns still exist wherever personally identifiable information is collected and stored in digital form, and data mining programs are able to access such data, even during data preparation [1]. To prevent privacy violations, privacy preservation methods have been developed to protect against the exposure of data owners by modifying the original data [2], [3]. However, transforming the data into another form may also reduce its utility, resulting in inaccurate
extraction of knowledge through data mining. This is
the paradigm known as Privacy-Preserving Data Mining
(PPDM). Various PPDM models are designed to guaran-
tee some level of privacy, while maximising the utility of
the data, such that data mining can still be performed on
the transformed data efficiently. In this thesis we will discuss various privacy-preserving data mining techniques and the problems associated with them; we will show why differential privacy is the best PPDM technique, and we will implement one of the mechanisms of differential privacy, the Laplacian mechanism, on the Adult dataset. We will also look at different frameworks associated with differential privacy.
Chapter 2
2 Knowledge on Privacy Preserving
2.1 Privacy
Although everyone has some concept of privacy in their mind, there is no universally accepted standard definition [4]. Privacy has been recognized as a right in the Universal Declaration of Human Rights [5]. The difficulty in defining privacy is partly a consequence of the broadness of the areas to which privacy applies. The scope of privacy can be divided into four categories [6]: information, which concerns the collection and maintenance of personal data; bodily, which relates to physical harms from inappropriate procedures; communications, which refers to any form of communication; and territorial, which concerns the invasion of physical boundaries. In this thesis we will focus on the information category, which comprises systems that collect, analyse, and publish data. In
the information scope, Westin [7] defined privacy as “the
claim of individuals, groups, or institutions to determine
for themselves when, how, and to what extent informa-
tion about them is communicated to others”, or in other
words, the right to control the handling of one’s infor-
mation.Other authors define privacy as “the right of an
individual to be secure from unauthorised disclosure of
information about oneself that is contained in an elec-
tronic repository. Thus, one can conclude that the main
idea of information privacy is to have control over the
collection and handling of one’s personal data. An addi-
9
11. tional data privacy model developed by Dalenius in 1977
[8] articulated the privacy goal for databases: anything about a particular individual that can be learned from the database should be learnable without access to the database. An important component of this notion is ensuring that the difference between the adversary's beliefs about a particular individual before and after seeing the data is small. This type of privacy, however, cannot be achieved: Dwork demonstrated that such an assurance is impossible because of the existence of background knowledge. Thus, a new approach to privacy protection was adopted by Dwork: the risk to one's privacy, or in general any risk, such as the risk of being denied automobile insurance, should not substantially increase as a result of participating in a database [8].
Previously, much work has been done to protect privacy; in the next sections we will discuss the privacy models used in privacy-preserving data publishing, such as k-anonymity, l-diversity, and t-closeness, and then differential privacy.
2.2 Privacy Preserving Data publishing Techniques
PPDM at data publishing is also known as Privacy Pre-
serving Data Publishing (PPDP). It has been shown that
only removing attributes that explicitly identify users
(known as explicit identifiers) is not an effective method
[9]. Users can still be identified by pseudo or quasi-
identifiers (QIDs) and by sensitive attributes. A QID is a
non-sensitive attribute (or a set of attributes) that does not
explicitly identify a user, but can be combined with data
from other public sources to de-anonymize the owner of a record; these types of attacks are known as linkage attacks [9]. Sensitive attributes are person-specific private
attributes that must not be publicly disclosed, and that
may also be linked to identify individuals (e.g. diseases in medical records). Various techniques of PPDP are explained below.
2.2.1 k-anonymity
Sweeney and Samarati introduced a notion of data privacy known as k-anonymity [10]. Consider a dataset with attributes that differ in meaning. Identifiers, such as name and Social Security number, would be completely removed from the dataset. Next there are the quasi-identifiers: non-sensitive attributes that can nonetheless be used to identify an individual. In the example given below, these would include date of birth, ZIP code, and sex. Finally, we have the sensitive attributes, such as the medical condition; these should remain in the dataset as they are. A dataset is k-anonymous if, for any row, the quasi-identifiers of at least k−1 other rows contain the same content. Examples are given in Table 1 and Table 2 below, where Table 1 is 4-anonymous and Table 2 is 6-anonymous.
   Non-Sensitive                      Sensitive
   Zipcode   Age    Nationality      Condition
1 130** <30 * Aids
2 130** <30 * HeartDisease
3 130** <30 * ViralInfection
4 130** <30 * ViralInfection
5 130** >=40 * Cancer
6 130** >=40 * HeartDisease
7 130** >=40 * ViralInfection
8 130** >=40 * ViralInfection
9 130** 3* * Cancer
10 130** 3* * Cancer
11 130** 3* * Cancer
12 130** 3* * Cancer
Table 1
   Non-Sensitive                      Sensitive
   Zipcode   Age    Nationality      Condition
1 130** <35 * Aids
2 130** <35 * Tuberculosis
3 130** <35 * Flu
4 130** <35 * Tuberculosis
5 130** >=35 * Cancer
6 130** >=35 * Cancer
7 130** >=40 * Cancer
8 130** >=40 * Cancer
9 130** 3* * Cancer
10 130** 3* * Tuberculosis
11 130** 3* * ViralInfection
12 130** 3* * ViralInfection
Table 2
Several vulnerabilities exist in this strategy. Imagine a friend of mine who is 35 years old visited the hospital corresponding to Table 1. Then we would be able to conclude that he has cancer. Also, suppose someone I know who is 28 years old visited both hospitals and thus will be present in both tables: we can infer that they have AIDS. Such issues were highlighted by Kasiviswanathan, Ganta and Smith [11].
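The k-anonymity condition above is easy to check mechanically. Below is a minimal sketch (the helper name `k_anonymity` and the tuple encoding of quasi-identifier columns are ours, not from the thesis) that computes the largest k for which a table is k-anonymous, applied to the quasi-identifiers of Table 1:

```python
from collections import Counter

def k_anonymity(rows):
    """Return the largest k for which the rows are k-anonymous:
    the size of the smallest group of rows sharing identical
    quasi-identifier values."""
    groups = Counter(rows)
    return min(groups.values())

# Quasi-identifiers (zipcode, age range, nationality) from Table 1:
table1 = (
    [("130**", "<30", "*")] * 4
    + [("130**", ">=40", "*")] * 4
    + [("130**", "3*", "*")] * 4
)

print(k_anonymity(table1))  # -> 4, matching the 4-anonymity claimed for Table 1
```

Each equivalence class of Table 1 has exactly four rows, so the minimum group size, and hence k, is 4.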
2.2.2 l-diversity
The l-diversity principle extends the k-anonymity model by requiring that every equivalence class be l-diverse. An l-diverse equivalence class consists of a set of entries such that the sensitive attribute has at least l "well-represented" values. A table is l-diverse if all of its equivalence classes are l-diverse.
"Well-represented" values do not have a concrete meaning. Instead, there are different versions of the l-diversity principle, differing in this particular definition. A simple instantiation considers the well-represented sensitive attribute values to be those with at least l distinct values in an equivalence class, which is known as distinct l-diversity. Under these conditions, there are at least l records in an l-diverse equivalence class (since l distinct values are required), and the table satisfies k-anonymity with k = l. A stronger principle is entropy l-diversity, defined as follows.
Equivalence classes are entropy-l-diverse if the entropy of their sensitive-attribute value distribution is at least log(l). That is:

−Σ_{s∈S} P(QID, s) log(P(QID, s)) ≥ log(l)

where s is a possible value of the sensitive attribute S, and P(QID, s) is the fraction of records in a QID equivalence group that have the value s for the S attribute.
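The entropy condition above can be checked directly. The sketch below (the function name is ours) computes the entropy of the sensitive-attribute distribution in one equivalence class and compares it with log(l), using the "<30" class of Table 1 as input:

```python
import math
from collections import Counter

def entropy_l_diverse(sensitive_values, l):
    """Check entropy l-diversity for one equivalence class: the entropy
    of the sensitive-attribute distribution must be at least log(l)."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

# Conditions in the "<30" class of Table 1:
cls = ["Aids", "HeartDisease", "ViralInfection", "ViralInfection"]
print(entropy_l_diverse(cls, 2))  # -> True: entropy ~ 1.04 >= log(2) ~ 0.69
```

With probabilities (1/4, 1/4, 1/2) the entropy is about 1.04 nats, so the class is entropy-2-diverse but not entropy-3-diverse (log 3 ≈ 1.10).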
2.2.3 t-closeness
This model requires the distribution of sensitive values in each equivalence class to be as close as possible to the corresponding distribution in the original table, where "close" is bounded above by a threshold t. That is, the distance between the distribution of a sensitive attribute in the original table and the distribution of the same attribute in any equivalence class must be less than or equal to t (the t-closeness principle). Formally, using the notation found in [49], this principle may be written as follows. Let Q = (q1, q2, ..., qm) be the distribution of the values of the sensitive attribute in the original table and P = (p1, p2, ..., pm) be the distribution of the same attribute in an equivalence class. The class satisfies t-closeness if the following inequality holds:

Dist(P, Q) ≤ t
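The distance measure Dist is left abstract above (the t-closeness literature typically uses the Earth Mover's Distance). As a simple illustrative stand-in, the sketch below (names ours) uses the variational distance between the class distribution P and the whole-table distribution Q:

```python
from collections import Counter

def t_close(class_values, table_values, t):
    """Check t-closeness for one equivalence class, using variational
    distance as Dist (an assumption made here for simplicity; the
    original t-closeness proposal uses Earth Mover's Distance)."""
    def distribution(values):
        c = Counter(values)
        n = len(values)
        return {v: c[v] / n for v in c}

    p, q = distribution(class_values), distribution(table_values)
    support = set(p) | set(q)
    distance = 0.5 * sum(abs(p.get(v, 0) - q.get(v, 0)) for v in support)
    return distance <= t

table = ["Cancer"] * 6 + ["Flu"] * 6     # overall distribution Q = (1/2, 1/2)
cls = ["Cancer"] * 4 + ["Flu"] * 2       # class distribution P = (2/3, 1/3)
print(t_close(cls, table, 0.2))          # -> True: distance = 1/6 ~ 0.17 <= 0.2
```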
2.3 Privacy Failures
2.3.1 NYC Taxicab Data
In 2014, the NYC Taxi and Limousine Commission was quite active on Twitter, sharing visualizations of taxi usage statistics. This quickly caught the attention of several Internet users, who inquired where the data was sourced. The Commission responded that the data is available, but one must file a Freedom of Information Law (FOIL) request. Freedom of Information laws allow citizens to request data from government agencies. In the US, this is governed at the federal level. In Canada, there are similar laws, including the Freedom of Information
are similar laws, including the Freedom of Information
and Protection of Privacy Act (FIPPA) and the Munic-
ipal Freedom of Information and Protection of Privacy
Act (MFIPPA). Chris Whong submitted a FOIL request
and released the dataset online [Who14].
He obtained a dataset of all taxi fares and trips in NYC during 2013, totaling 19 GB. Thus, for New York City, we would know every taxi driver's location and income. This is information we generally consider sensitive, and that a taxi driver might prefer to keep private. It is not surprising, then, that the Commission used a form of anonymization to obscure this information.
A typical row in the trips dataset looks like the following:
6B111958A39B24140C973B262EA9FEA5, D3B035A03C8A34DA17488129DA581EE7, VTS, 5, , 2013-12-03 15:46:00, 2013-12-03 16:47:00, 1, 3660, 22.71, -73.813927, 40.698135, -74.093307, 40.829346
These fields are:
medallion, hack license, vendor id, rate code, store and
fwd flag, pickup datetime, dropoff datetime, passenger
count, trip time in secs, trip distance, pickup longitude,
pickup latitude, dropoff longitude, dropoff latitude
Although most of these fields are self-explanatory, such
as the time and location fields, we are primarily inter-
ested in the first two. In particular, they indicate a
taxi driver’s medallion and license number. However, the
standard format for these fields is rather different than
what is provided – it appears that the dataset has been
somehow anonymized to mask these values. Upon inspection, Jason Hall posted on Reddit [Hal14] that a driver with the medallion number CFCD208495D565EF66E7DFF9F98764DA had a number of unusually profitable days in the dataset, earning far more than a taxi driver could hope to make. Vijay Pandurangan dug a bit deeper and made the following discovery [Pan14]: feeding the string '0' to the MD5 hash algorithm yields exactly the identifier given above. Pandurangan hypothesized that this identifier corresponded to instances when the medallion number wasn't available, but he took this as a hint: all the medallion and license numbers were simply the plaintext values hashed via MD5. Because the license and medallion numbers are only a few characters long, he was able to compute the MD5 hashes of all possible combinations and obtain the pre-hashing values for the entire dataset.
Using other publicly available data, he was further able
to match these with driver names, thus matching real-life
names of drivers with incomes and locations: a massive
privacy violation! A side-information attack goes beyond
revealing information about only the drivers. Imagine saying goodbye to a co-worker as they leave for the day: as you wave them off, you note the location and time of their pickup. It is possible to reference this dataset later and discover their home address (as well as whether they're a generous tipper or not).
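Pandurangan's attack can be reproduced in a few lines: because the space of valid medallion strings is tiny, one can hash every candidate and invert the "anonymization" with a lookup table. The sketch below uses one real medallion format (digit, letter, digit, digit); the full attack simply enumerated every valid format the same way:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def md5_hex(s):
    """MD5 of a string, uppercase hex, matching the dataset's field format."""
    return hashlib.md5(s.encode()).hexdigest().upper()

# The sentinel value Hall spotted: MD5 of the literal string "0".
print(md5_hex("0"))  # -> CFCD208495D565EF66E7DFF9F98764DA

# Reverse lookup over the digit-letter-digit-digit medallion format,
# e.g. "5X41" (26,000 candidates -- hashing them all takes milliseconds).
lookup = {
    md5_hex(m): m
    for m in ("".join(t) for t in product(digits, ascii_uppercase, digits, digits))
}
print(lookup[md5_hex("5X41")])  # recovers the plaintext medallion "5X41"
```

This is why hashing a small identifier space is not anonymization: the entire preimage space can be enumerated.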
2.3.2 The Netflix Prize
The Netflix Prize competition is another case study involving data anonymization gone wrong. A big part of Netflix's strategy is data analysis and statistical thinking: their hit TV shows are conceived based on user data, and their recommendations are tuned to maximize engagement with users. Between 2006 and 2009, they held a contest challenging researchers to improve their recommendation engine. The prize was a highly publicized US $1,000,000; it was won by a team named BellKor's Pragmatic Chaos, based on matrix factorization techniques. Netflix provided a training dataset of user data to help teams design their strategies. Each datapoint consisted of an (anonymized) user ID, a movie ID, a rating, and a date. Netflix assured users that the data was anonymized to protect individual privacy. Indeed, the Video Privacy Protection Act of 1988 requires them to do this. Media consumption history is generally considered sensitive or private information,
because one might consume media associated with certain minority groups (including those of a political or sexual nature).
Sadly, Narayanan and Shmatikov demonstrated that this form of anonymization was insufficient to maintain user privacy [9]. Their approach is illustrated in Figure 1. They took the Netflix dataset and cross-referenced it with public information from the online movie database IMDb, which holds over a hundred million movie reviews. In particular, they attempted to match users between the two datasets by finding users who rated the same movies similarly at around the same time. A few weak matches turned out to be sufficient to reidentify many users; as a result, these users' movie-watching history was revealed, which they had not made public. A class action lawsuit was filed against Netflix in response to this discovery, and this resulted in the cancellation of a sequel competition. The example illustrates why this kind of anonymization is insufficient to guarantee privacy, particularly when side-information is present.
Figure 1
2.3.3 Massachusetts Group Insurance Commission
In the mid-1990s, the Massachusetts Group Insurance
Commission initiated a program in which researchers
could obtain hospital visit records for every state em-
ployee, at no cost. Due to the sensitive nature of this
information, the dataset was anonymized. During his
time as governor of Massachusetts, William Weld (later a 2020 Republican presidential candidate) promised that patient privacy was protected. Specifically, while
the original dataset included information such as an individual's SSN, name, sex, date of birth, ZIP code, and
condition, this was anonymized by removing identifying
features such as the individual's name and SSN. As a computer science graduate student, Latanya Sweeney bought the voter rolls for the city of Cambridge, which were then available for $20. These records contained every registered voter's address, name, ZIP code, sex, and date of birth. Sweeney had shown that 87% of the US population is uniquely identifiable by the combination of ZIP code, sex, and date of birth. Thus, by mapping these two datasets, large-scale reidentification of individuals in the hospital visit dataset is straightforward. Sweeney made a point about this by sending Governor Weld his own medical records.
2.4 Differential Privacy
Differential privacy is a powerful standard for data privacy proposed by Dwork [10]. It is based on the concept that the outcome of a statistical analysis is essentially independent of whether or not any individual joins the database; either way, one learns approximately the same thing [12]. It ensures that an adversary's ability to cause harm or benefit to any participant is essentially the same regardless of whether that participant is in the dataset or not. Differential privacy achieves this by adding noise to the query results, so that any difference in output due to the presence or absence of a single person is covered up [8].
In theory, differential privacy offers a robust privacy guarantee even against an adversary with the worst possible background knowledge [8]. As a result, all linkage attacks and statistical attacks are neutralized. Owing to this strong protection against worst-case background-knowledge attacks, differential privacy has been called an effective privacy preservation technique.
Therefore, throughout this thesis, we will describe its properties and analyze a selected case study on it.
Chapter 3
3 Literature survey on Differential privacy
In this section, we will introduce differential privacy. We
start with the first differentially private algorithm, due to Warner in 1965 [13].
3.1 Randomized response
We work in a very simple setting. Assume you are the instructor of a large class that has just taken an important exam. You suspect that many students in the class have cheated, but you are unsure. What is the best way to figure out how many students cheated? Naturally, students are unlikely to admit to cheating honestly.
Being more precise: there are n people, and individual
i has a sensitive bit Xi ∈ {0, 1}. Their goal is to prevent
anyone else from learning about Xi . The analyst receives
messages Yi from each person, which may depend on Xi
and some random numbers generated by an individual.
Based on these Yi's, the analyst would like to get an estimate of

p = (1/n) Σ_{i=1}^{n} Xi
We can first start with the most obvious approach: the individual sends Yi equal to the sensitive bit Xi:

Yi = { Xi       with probability 1
     { 1 − Xi   with probability 0        (1)
It is clear that the analyst can simply compute p = (1/n) Σ_{i=1}^{n} Yi. In other words, the result is perfectly accurate. However, the analyst sees Yi, which is equal to Xi, and thus
he learns the individual’s private bit exactly: there is no
privacy.
Consider an alternate strategy, as follows:
Yi = { Xi       with probability 1/2
     { 1 − Xi   with probability 1/2        (2)
In this case, Yi is perfectly private: in fact, it is a uniform random bit which does not depend on Xi at all, so the curator cannot infer anything about Xi. But at the same time, every bit of accuracy is lost in this approach: the estimate Z = (1/n) Σ_{i=1}^{n} Yi is distributed as (1/n)·Binomial(n, 1/2), which is completely independent of the underlying data.
At this point, we have two approaches: one which is perfectly private but not accurate, and one which is perfectly accurate but not private. The right approach is to choose a middle ground between these two extremes. Consider the following strategy, which we will call Randomized Response, parameterized by some γ ∈ [0, 1/2]:
Yi = { Xi       with probability 1/2 + γ
     { 1 − Xi   with probability 1/2 − γ        (3)
How private is this message Yi , with respect to the true
message Xi? Note that γ = 1/2 corresponds to the first
strategy, and γ = 0 is the second strategy. What if
we choose a γ in the middle, such as γ = 1/4? Then there will be a certain level of "plausible deniability" associated with the individual's disclosure of their private bit: while Yi = Xi with probability 3/4, it could be that their true bit was 1 − Yi, and this event happened with probability 1/4.
their response is corresponds to the level of privacy they
are afforded. In this way, they get a stronger privacy
guarantee as γ approaches 0. Observe that
E[Yi] = 2γXi + 1/2 − γ (4)
and thus
E[(1/(2γ)) (Yi − 1/2 + γ)] = Xi        (5)
This leads to the following natural estimator:
p̃ = (1/n) Σ_{i=1}^{n} (1/(2γ)) (Yi − 1/2 + γ)        (6)
It has been proven that the error |p − p̃| is as follows:
|p − p̃| ≤ 1/(γ√n)        (7)
As n → ∞, this error goes to 0. Equivalently, if we want additive error α, equation (7) says we require n = O(1/(γ²α²)) samples. Note that as γ gets closer to 0 (corresponding to stronger privacy), the error increases. This is natural: the stronger the privacy guarantee, the less accurate the estimate. In order to go further in quantifying the level of privacy, we must (finally) introduce differential privacy. Differential privacy is a formalization of the previously mentioned notion of "plausible deniability."
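The whole randomized-response pipeline is easy to simulate. The sketch below (variable names ours) draws reports according to equation (3) and applies the debiased estimator of equation (6); with n = 100,000 and γ = 1/4 the estimate lands within about 1/(γ√n) ≈ 0.013 of the true proportion:

```python
import random

def randomized_response(bits, gamma, rng):
    # Report the true bit with probability 1/2 + gamma, else flip it (eq. (3)).
    return [x if rng.random() < 0.5 + gamma else 1 - x for x in bits]

def estimate_p(ys, gamma):
    # Debiased estimator from eq. (6): each term has expectation Xi.
    return sum((y - 0.5 + gamma) / (2 * gamma) for y in ys) / len(ys)

rng = random.Random(0)
gamma = 0.25
true_bits = [1] * 30_000 + [0] * 70_000          # true p = 0.3
ys = randomized_response(true_bits, gamma, rng)
print(round(estimate_p(ys, gamma), 3))           # close to the true p = 0.3
```

Note that with γ = 1/2 the estimator reproduces p exactly, matching strategy (1): the reports are the raw bits.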
3.2 Differential privacy
"Differential privacy" refers to a promise, made by a data holder, or curator, to a data owner: "You will not be adversely affected by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available." At their best, differentially private database mechanisms make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data protection plans, data usage agreements, or restricted views.
Differential privacy addresses the paradox of learning useful information about a population while learning nothing about an individual. We may learn from a medical database that smoking causes cancer, which might affect an insurance company's view of an individual smoker's long-term medical costs. Has the analysis
harmed the smoker? Perhaps: he will see a rise in his insurance premium if the insurer knows he smokes. He may also be helped: learning of his health risks, he can enter a smoking-cessation program. Has the smoker's privacy been violated? It is true that we know more about him now than we did before, but was his information "leaked"? Differential privacy will argue that it was not, with the rationale that the effects are the same independent of whether or not the smoker participated in the study. It is the conclusions of the study that impact the smoker, not his presence or absence in the data set.
Differential privacy ensures that the same conclusions, for example that smoking causes cancer, will be reached independent of whether any individual takes part in the medical study. Specifically, it ensures that any sequence of outputs (responses to queries) is "essentially" equally likely to occur, independent of the presence or absence of any individual in the data set. Here, the probabilities are taken over the random choices made by the privacy mechanism (something controlled by the data curator), and the term "essentially" is captured by a parameter ε. A smaller ε gives better privacy (and less accurate responses).
Differential privacy is a definition, not an algorithm. There can be many differentially private algorithms for achieving a given computational task T in an ε-differentially private manner. Some will have better accuracy than others. When ε is small, finding a highly accurate ε-differentially private algorithm for T can be difficult, much as finding a numerically stable algorithm for a specific computational task can require effort.
We now define the environment for differential privacy,
sometimes called central differential privacy model. We
imagine there are n individuals, X1 through Xn, who
each have their own datapoint in the dataset. They send
this data-point to a “trusted curator” – all individuals
trust this curator with their raw datapoint, but no one
else. Given their data, the curator runs an algorithm M,
and publicly outputs the result of this computation. Dif-
ferential privacy is a property of this algorithm M, which
says that no individual’s data has a large impact on the
output of the algorithm. The idea is illustrated in Figure 2 below:
Figure 2
More formally, suppose we have an algorithm M : X^n → Y. Consider any two datasets X, X′ ∈ X^n which differ in exactly one entry. We call these neighbouring datasets, and sometimes denote this by X ∼ X′. We say that M is ε-(pure) differentially private (ε-DP) if, for all neighbouring X, X′ and all T ⊆ Y, we have

Pr[M(X) ∈ T] ≤ e^ε · Pr[M(X′) ∈ T]        (8)

where the randomness is over the choices made by M.
This definition was given by Dwork, McSherry, Nissim,
and Smith in their seminal paper in 2006 [14]. It is now
widely accepted as a strong and rigorous notion of data
privacy. Various technical points about Differential pri-
vacy are given below:
• Differential privacy is quantitative in nature. A small ε means strong privacy, and the guarantee degrades as ε increases.
• ε should be thought of as a small constant. Anything between 0.1 and 3 might be a reasonable level of privacy guarantee, and one should be skeptical of claims which are significantly outside this range.
• This is a worst-case guarantee, over all neighbouring datasets X and X′. Even if we expect our data to be randomly generated, we still require privacy for all possible datasets, no matter what.
• The definition bounds the multiplicative increase in the probability of M's output satisfying any event when a single point in the dataset is changed.
• The use of a multiplicative e^ε in the probability might seem unnatural. For small ε, a Taylor expansion allows us to treat this factor as ≈ (1 + ε). The definition as given is convenient because e^(ε1) · e^(ε2) = e^(ε1+ε2), which is useful mathematically.
• While the definition may not look symmetric, it is: one can simply swap the roles of X and X′.
• Any non-trivial differentially private algorithm must
be randomized.
• We will generally use the notion of "neighbouring datasets" in which one point in X is changed to obtain X′. Sometimes this is called "bounded" differential privacy, in contrast to "unbounded" differential privacy, where a data point is either added or removed. The former definition is usually more convenient mathematically.
• It prevents many of the types of attacks we have seen before. The linkage attacks that we have observed are essentially ruled out: if such an attack were effective with your data in the dataset, it would be almost as effective without it. Reconstruction attacks are also prevented.
• The definition of differential privacy is information-theoretic in nature. That is, even if an adversary has unlimited computational power and background information, he will still be unable to do any harm. This is in contrast to cryptography, where the focus is on computationally bounded adversaries.
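Definition (8) lets us quantify the randomized response mechanism of Section 3.1. For a single reported bit, the worst-case ratio of report probabilities between neighbouring inputs is (1/2 + γ)/(1/2 − γ), so randomized response with parameter γ is ε-DP with ε = ln((1/2 + γ)/(1/2 − γ)), a standard calculation. A quick numeric check (the helper name is ours):

```python
import math

def rr_epsilon(gamma):
    """Randomized response reports the true bit w.p. 1/2 + gamma. The
    largest ratio Pr[Y = y | X = x] / Pr[Y = y | X = x'] over outputs y
    and neighbouring inputs is (1/2 + gamma)/(1/2 - gamma), so the
    mechanism is ln((1/2 + gamma)/(1/2 - gamma))-differentially private."""
    return math.log((0.5 + gamma) / (0.5 - gamma))

print(rr_epsilon(0.25))  # ln(3) ~ 1.10: a reasonable epsilon
print(rr_epsilon(0.45))  # ln(19) ~ 2.94: much weaker privacy
```

As γ → 0 we get ε → 0 (perfect privacy, no accuracy), and as γ → 1/2 we get ε → ∞ (no privacy), matching the two extreme strategies of Section 3.1.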
3.3 Properties of Differential Privacy
3.3.1 Sensitivity
Sensitivity quantifies how much noise is required in a differential privacy mechanism. Two notions of sensitivity are most often used: global and local sensitivity.
Global Sensitivity: This is the maximal difference between the query results on neighbouring databases, and it determines the noise scale used in a differentially private mechanism. The formal definition:
Definition 1: Let f : X^n → R^k. The l1-sensitivity of f is:

Δ(f) = max_{X,X′} ‖f(X) − f(X′)‖₁

where X and X′ are neighbouring databases.
Queries with low sensitivity, such as count or sum, work well with global sensitivity when releasing data. Take the count query as an example: it has Δf = 1, which is much smaller than the true answer. However, for queries like median or average, the global sensitivity is much higher.
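The contrast between low- and high-sensitivity queries can be checked by brute force on a toy domain. The sketch below (function names ours) enumerates all neighbouring pairs of length-3 databases over the domain {0, 1, 2, 3}, confirming that a count has sensitivity 1 while the median's sensitivity spans most of the range:

```python
from itertools import product

def global_sensitivity(f, domain, n):
    """Brute-force the global sensitivity of a scalar query f over all
    pairs of neighbouring databases (length-n tuples over a tiny domain)."""
    best = 0.0
    for db in product(domain, repeat=n):
        for i in range(n):                       # change one entry
            for w in domain:
                nb = db[:i] + (w,) + db[i + 1:]
                best = max(best, abs(f(db) - f(nb)))
    return best

domain, n = (0, 1, 2, 3), 3
count_ones = lambda db: sum(1 for x in db if x == 1)
median = lambda db: sorted(db)[len(db) // 2]

print(global_sensitivity(count_ones, domain, n))  # -> 1: count has low sensitivity
print(global_sensitivity(median, domain, n))      # -> 3: median can jump across the range
```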
Local Sensitivity: The amount of noise added by the Laplace mechanism depends on GS(f) and the privacy parameter ε, but not on the database D itself. For many functions, calibrating noise to the global sensitivity yields far more noise than necessary, not reflecting the function's insensitivity to any individual's input on the actual database. Thus, Nissim proposed local sensitivity, which adjusts to the difference between query results on the databases neighbouring the actual input. The formal definition:
Definition 2 (Local Sensitivity) [15]: For f : D^n → R^k and D1 ∈ D^n, the local sensitivity of f at D1 is:

LS(f)(D1) = max_{D2 ∼ D1} ‖f(D1) − f(D2)‖₁

Here we observe that the global sensitivity of Definition 1 satisfies

GS(f) = max_{D1} LS(f)(D1)
which allows less noise for queries whose global sensitivity is high. For queries such as count or range, the local sensitivity is identical to the global sensitivity. From the above definition, one can observe that on many inputs, every differentially private algorithm must add noise at least as large as the local sensitivity. However, finding algorithms whose error matches the local sensitivity is not straightforward: an algorithm that releases f with noise magnitude proportional to LS(f) on input D1 is not, in general, differentially private [15], since the noise magnitude itself can leak information.
3.4 Mechanism of Differential Privacy
3.4.1 Laplacian Mechanism
Definition 1: The probability density function of the Laplace distribution with location and scale parameters 0 and b, respectively, is:
pr(x) = (1/2b) exp(−|x|/b)
The variance of this distribution is 2b². The graph of the Laplace distribution is shown in the figure below. It is sometimes called the double exponential distribution: the exponential distribution is supported only on x ∈ [0, ∞) with density proportional to exp(−cx), whereas the Laplace distribution is its symmetric counterpart, supported on x ∈ R with density proportional to exp(−c|x|).
Figure 3
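The variance claim can be sanity-checked numerically with NumPy's Laplace sampler (a quick sketch, not part of the thesis experiment):

```python
import numpy as np

# Draw many samples from Laplace(location=0, scale=b) and check that the
# sample variance is close to the theoretical value 2 * b**2.
rng = np.random.default_rng(0)
b = 2.0
samples = rng.laplace(loc=0.0, scale=b, size=1_000_000)
print(samples.var())  # close to 2 * b**2 = 8
```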
Definition 2: Let f : Xⁿ → Rᵏ. The Laplace mechanism is defined as:
M(X) = f(X) + (Y₁, …, Y_k)
where the Yᵢ are independent Laplace(Δ/ε) random variables.
Theorem 1: The Laplace mechanism is ε-differentially private.
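A minimal sketch of the Laplace mechanism in Python (the function name is mine; the noise is drawn with NumPy's `rng.laplace`):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    # Release f(X) + Y where Y ~ Laplace(0, sensitivity / epsilon).
    # By Theorem 1 this release is epsilon-differentially private.
    rng = rng if rng is not None else np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query (sensitivity 1) answered with epsilon = 1:
noisy = laplace_mechanism(true_answer=38, sensitivity=1, epsilon=1.0)
print(noisy)  # the true count plus Laplace(1) noise
```

Note that the noise scale grows as ε shrinks, which is exactly the utility/privacy trade-off studied in the results chapter.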
3.4.2 Counting Queries
Counting queries are one of the applications of the Laplace mechanism of differential privacy. A counting query asks "How many rows in the dataset have property P?" If we ask just one such question, the analysis goes as follows. Each individual has a bit Xᵢ ∈ {0, 1} indicating whether or not the property holds for their row, and the function f we consider is the sum of these bits. The sensitivity is 1, and thus an ε-differentially private release of this answer is f(X) + Laplace(1/ε). This introduces error to the query on the order of O(1/ε), independent of the size of the database.
If we want to ask many queries, the Laplace mechanism proceeds as follows. Suppose we have k counting queries f = (f₁, …, f_k). We simply output the vector f(X) + Y, where the Yᵢ are independent and identically distributed Laplace random variables. The l1-sensitivity in this scenario is k. With this sensitivity bound Δ = k in hand, we add Yᵢ ∼ Laplace(k/ε) noise to each coordinate, answering each counting query with error of magnitude O(k/ε).
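The vector case can be sketched the same way (hypothetical function name; the k/ε scale follows from the Δ = k bound above):

```python
import numpy as np

def answer_counting_queries(true_counts, epsilon, rng=None):
    # k counting queries have l1-sensitivity at most k, so each coordinate
    # gets independent Laplace(0, k / epsilon) noise.
    rng = rng if rng is not None else np.random.default_rng()
    counts = np.asarray(true_counts, dtype=float)
    k = counts.size
    return counts + rng.laplace(loc=0.0, scale=k / epsilon, size=k)

print(answer_counting_queries([38, 129, 267], epsilon=1.0))
```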
3.5 Challenges in differential privacy
3.5.1 Choosing the Privacy Parameter ε
Since the introduction of differential privacy, there has been the question of how to set the privacy parameter ε, and choosing the right value has not been addressed adequately. Strictly speaking, the parameter in ε-differential privacy does not indicate what is revealed about a person; rather, it limits the influence an individual has on the outcome.
For queries that retrieve general properties of the data, the influence of ε on an individual is less clear. For queries that ask for specific information, e.g. "Is Mr. Y in the database?", ε relates directly to the disclosure of that information. Lee and Clifton showed [16] that, for a given setting of ε, an adversary's ability to identify a particular individual in a database varies depending on the values in the data, the queries, and even on values that are not in the data. An improper value of ε can therefore cause a privacy breach, and even for the same value of ε the degree of protection provided by an ε-differentially private mechanism varies with the values of the domain attributes and the type of queries.
Chapter 4
4 Implementation Framework
In this chapter we present the framework used for experimenting with the Laplace differential privacy mechanism and the tools we used: programming language, libraries, database, editors, data, etc.
4.1 Scenario Description
Currently, differential privacy can be implemented
in several ways with various kinds of settings. Thus,
some assumptions must be made. This thesis aims to
follow a basic model or architecture where a secured
server is connected to a data store that provides dif-
ferential privacy mechanisms(which are going to be
shown in the next section.) while being efficient.
Actual scenario is when a data owner places datasets
in a secured system for the purpose of some data an-
alyst to use the information for a particular purpose
while providing data privacy by using differential pri-
vacy methods.
From various mechanisms of differential privacy, we
have implemented one of the primary methods of dif-
ferential privacy that is essential in protecting sensi-
tive data. The method is Laplace mechanism where
the algorithms used is described in the next section.
4.2 Architecture
The architecture in the figure below depicts how we implement the Laplace mechanism, one of the existing mechanisms of differential privacy. The elements of the architecture are described in the coming sections.
Figure 4
4.2.1 User Interface
The user interface is used by the data analyst to request execution of a query against the database; the result returned to the analyst is the noisy version of the query result generated by one of the mechanisms of differential privacy. This interface is thus the analyst's way of doing their job.
4.2.2 Privacy Preserving Mechanism
The privacy-preserving mechanism used in this experiment is the Laplace differential privacy mechanism discussed in the previous sections. As we saw for counting queries, if the number of queries is one, the sensitivity taken for the algorithm is 1, i.e. Δ = 1. We take different values of epsilon, repeat the experiment 50 times for each, and compute the resulting values, which are reported in the results section.
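The experiment loop just described can be sketched as follows. This is a simplified reconstruction, not the exact thesis code; the true count of 38 is taken from the first row of Table 4 purely as an example:

```python
import numpy as np

EPSILONS = [0.001, 0.01, 0.1, 0.5, 1, 2]
REPEATS = 50

def run_experiment(true_count, rng=None):
    # For each epsilon, add Laplace(1/epsilon) noise (sensitivity = 1)
    # REPEATS times and average the noisy answers.
    rng = rng if rng is not None else np.random.default_rng()
    return {eps: float(true_count + rng.laplace(scale=1 / eps, size=REPEATS).mean())
            for eps in EPSILONS}

print(run_experiment(true_count=38))
```

Averaging over 50 repetitions smooths out individual noise draws, so the reported values track the expected behaviour of each ε.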
4.2.3 Database
The database used in the experiment is PostgreSQL, a free and open-source relational database that lets us store and manage the data (the adult dataset) used in the experiment.
4.3 Dataset
We have used one of the well-known datasets, UCI's adult dataset, which was collected from the 1994 US census and donated in 1996. It contains more than 350000 rows (customer records) with 15 columns, as follows:
Attribute        Type       Values
Relationship     Nominal    Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race             Nominal    White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex              Nominal    Female, Male
Capital-gain     Numerical
Capital-loss     Numerical
Hours-per-week   Numerical
Native-country   Nominal    United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
Income           Nominal    <=50K, >50K
Table 3
4.4 Database Query
The query used in the experiment is a counting query that gives, by income and race, the number of people who have spent a given number of years in education. The query is shown in the figure below:
Figure 5
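The query can be issued from Python through psycopg2 along these lines. This is a hypothetical sketch: the table name `adult` and the column names `education_num` and `race` are my assumptions, and the actual SQL is the one shown in Figure 5.

```python
# Hypothetical sketch -- table and column names are assumptions,
# not taken from the thesis code; the real query is shown in Figure 5.
QUERY = """
    SELECT education_num, COUNT(*)
    FROM adult
    WHERE race = 'White'
    GROUP BY education_num
    ORDER BY education_num;
"""

def fetch_true_counts(conninfo):
    import psycopg2  # PostgreSQL driver used in the experiment
    with psycopg2.connect(conninfo) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            # {education_num: count} before any noise is added
            return dict(cur.fetchall())
```

The returned true counts are then perturbed by the Laplace mechanism before being shown to the analyst.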
4.5 Programming Language and Editors
The programming language used in this experiment is Python, together with various Python libraries such as NumPy, pandas, matplotlib and psycopg2. The editor used in this experiment is Sublime Text.
Chapter 5
5 Results
5.1 Utility Of Data and Level of Privacy
The query distribution in its original form and after the Laplace differential privacy mechanism is applied is given below in two figures; the epsilon value was taken as 1. As seen from the figures, the distributions are almost the same, which supports the claim of differential privacy that the result of a query is essentially the same whether or not any given individual is present in the database.
Figure 6
Figure 7
The utility of the data and the level of privacy depend significantly on ε. With ε = 1 above, the change in the distribution is small and not easily visible in the graph. As seen in the previous section, the smaller the value of ε, the greater the privacy. We demonstrate this for the counting query used in our experiment, repeating it for varying values of ε, specifically ε = 0.001, 0.01, 0.1, 0.5, 1 and 2. For the sake of accuracy, the experiment is repeated 50 times for each value of epsilon. The table below shows the results collected:
ε = 0.001   ε = 0.01   ε = 0.1   ε = 0.5   ε = 1   ε = 2   True value
27.81249 35.39051 37.32668 38.00431 37.94904 38.14261 38
114.259 129.8866 128.2117 128.9040 128.98781 128.9992 129
376.4223 264.8106 265.7166 266.9909 267.0048 266.9134 267
508.4103 521.4273 514.2764 515.4326 515.1102 515.0179 515
318.2986 367.7279 380.4541 381.2337 381.0258 380.9488 381
633.4579 710.7502 708.2796 707.9570 708.0018 707.9361 708
905.9031 912.2191 927.7372 926.6723 927.0431 927.0012 927
229.2557 310.3037 307.2509 308.2056 308.1023 308.0594 308
7361.250 7363.948 7365.089 7361.968 7362.020 7361.984 7362
4766.829 4986.021 4951.782 4951.468 4951.640 4951.942 4952
999.5070 889.2168 874.2032 873.7415 874.0083 874.0150 874
668.4595 680.4382 678.6779 679.9279 679.9711 680.0249 680
2745.33 2675.99 2667.15 2666.96 2666.99 2666.95 2667
643.886 672.841 665.517 666.025 666.140 666.002 666
93.3477 141.3640 131.3123 131.9790 132.0313 132.035 132
114.49 89.435 92.744 93.117 93.166 92.967 93
Table 4
In total we have 16 values of education number in the counting query seen earlier. For those 16 values, the true value in the table is the number of people of race 'white' who have spent x years in education, where x is the education number, without applying Laplace differential privacy. The other columns, one per value of epsilon, show how the true value changes when differential privacy is applied. Observing the table, as the value of ε becomes smaller the noise added by the Laplace mechanism becomes larger, verifying the claims of the theorists and experimentalists of differential privacy.
Chapter 6
6 Conclusion
Differential privacy is among the most widely used privacy-preserving mechanisms today. I have implemented one of its mechanisms, the Laplace mechanism, in Python, and using a counting query I have verified that the parameter ε is the most important factor in determining the utility of the data and the level of privacy preserved. We have shown how the result is affected by taking different values of ε, and we have verified the claim [15][16] that a smaller value of epsilon results in better privacy.
Chapter 7
7 Future Work
Although some work has been done on how the value of ε should be set, there is no universally accepted protocol for determining ε in particular types of situations. More work therefore needs to be done in this area.
References
[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann, 2011.
[2] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models and algorithms," in Privacy-Preserving Data Mining. New York, NY, USA: Springer, 2008, pp. 11–52.
[3] C. C. Aggarwal, Data Mining: The Textbook.
New York, NY, USA: Springer, 2015.
[4] M. Langheinrich, "Privacy in ubiquitous computing," in Ubiquitous Computing Fundamentals. Boca Raton, FL, USA: CRC Press, 2009, ch. 3, pp. 95–159.
[5] Universal Declaration of Human Rights, United Nations General Assembly, New York, NY, USA, 1948, pp. 1–6. [Online]. Available: http://www.un.org/en/documents/udhr/
[6] D. Banisar et al., “Privacy and human rights:
An international survey of privacy laws and
practice,” Global Internet Liberty Campaign,
London, U.K., Tech. Rep., 1999.
[7] A. F. Westin, “Privacy and freedom,” Washing-
ton Lee Law Rev., vol. 25, no. 1, p. 166, 1968.
[8] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria, 2007.
[9] A. Narayanan and V. Shmatikov. (2006). "How to break anonymity of the Netflix prize dataset." [Online]. Available: https://arxiv.org/abs/cs/0610105
[10] Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information. In Proceedings of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '98, page 188, New York, NY, USA, 1998. ACM.
[11] Srivatsava Ranjit Ganta, Shiva Prasad Ka-
siviswanathan, and Adam Smith. Composition
attacks and auxiliary information in data pri-
vacy. In Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Discov-
ery and Data Mining, KDD ’08, pages 265–273,
New York, NY, USA, 2008. ACM
[12] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. Pan-private streaming algorithms. In Proceedings of the 1st Symposium on Innovations in Computer Science (ICS), 2010.
[13] Stanley L. Warner. Randomized response: A
survey technique for eliminating evasive answer
bias. Journal of the American Statistical Asso-
ciation, 60(309):63–69, 1965.
[14] Cynthia Dwork, Frank McSherry, Kobbi Nissim,
and Adam Smith. Calibrating noise to sensi-
tivity in private data analysis. In Proceedings
of the 3rd Conference on Theory of Cryptogra-
phy, TCC ’06, pages 265–284, Berlin, Heidel-
berg, 2006. Springer.
[15] Kobbi Nissim, Sofya Raskhodnikova, and Adam
Smith. “Smooth Sensitivity and Sampling in
Private Data Analysis”. In: Proceedings of the
Thirty-ninth Annual ACM Symposium on The-
ory of Computing. STOC ’07. San Diego, Cali-
fornia, USA: ACM, 2007, pp. 75–84. ISBN: 978-
1-59593-631-8.
[16] Jaewoo Lee and Chris Clifton. "How Much Is Enough? Choosing ε for Differential Privacy". In: Information Security, 14th International Conference, ISC 2011, Xi'an, China, October 26-29, 2011. Proceedings. 2011, pp. 325–340.