Towards Statistical Queries over Distributed Private User Data

R.Chen, A.Reznichenko, P.Francis – MPI-SWS, Germany
J.Gehrke – Cornell University, USA
Serafeim Chatzopoulos
M1258
schatz@di.uoa.gr
MDE519 – Distributed Systems
Instructor: Mema Roussopoulou
May 31, 2013

User Privacy
 User Data is exposed to organizations in many
ways.
 Users are aware of their data being exposed.
 Make a purchase in an online store.
 Update a profile on a social network.
 Users are unaware of their data exposure.
 Third party trackers.
 Smart phone Apps.

The “user-owned and operated” principle
 Personal data should be stored on a local host or a
cloud device under the user's control and released
only in a controlled, limited, or noisy fashion.
Users must have the exclusive control of
their own data and must be able to share
data selectively or voluntarily.

Motivation and Problem
 Distributed private user data is important.
 Analysts could use such data to
 understand users' behaviors
 discover statistical patterns
 evaluate proposed enhancements.
 How to make statistical queries over such distributed
private user data while still preserving privacy?

Related Work
 Anonymization
Removes well-known personally identifiable
information (PII).
 Randomization
Adds random distortion values to user data.
 k-anonymity, l-diversity, t-closeness
 Differential Privacy

Differential Privacy
 Differential privacy adds noise to the output of a
computation (i.e., answer of query).
 Hides the presence or absence of a record in the
dataset.
 Makes no assumption about the adversary.
Some form of distributed differential privacy is
required…

Prior Distributed Differential Privacy Designs
 First design has a per-user computational load of
O(U).
Dwork et al., EUROCRYPT '06
 Poor scalability
 Following designs reduce per-user computational
load to O(1) by using expensive secret sharing
protocols.
Rastogi and Nath, SIGMOD '10; Shi et al., NDSS '11
 Do not tolerate churn
 Recent designs introduce two honest-but-curious
servers to collaboratively compute the query result.
Gotz and Nath, MSR-TR '11
 Even a single malicious user can substantially distort
the query result.

Practical Distributed Differential Privacy System
(PDDP)
 Goals:
 The differential privacy guarantee is always maintained for
every honest client.
 Places a tight bound on the extent to which a malicious user
can distort query results.
 The maximum absolute distortion in the final result is bounded
by the number of malicious users.
 Operates at a large scale.
 Millions of users.
 Tolerates churn.
 Churn does not prevent results from being produced.

PDDP Components
 Analyst
 Makes queries to the system
and collects answers.
 Proxy
 Adds differentially private noise
to clients' answers to preserve
privacy.
 Clients
 Locally maintain their own data
and answer queries.

Security Assumptions (1/2)
 General Assumptions
 Clients have the correct public keys for the analyst and the
proxy.
 The analyst and the proxy have the correct public keys for
each other.
 Corresponding private keys are kept secure.
 Analyst is potentially malicious (violating users'
privacy)
 Collude with other analysts.
 Pretend to be multiple distinct analysts.
 Take control of clients and use the PDDP protocol to reveal
information.
 Publish its collected answers.
 Intercept and modify all messages.

Security Assumptions (2/2)
 Proxy is honest but curious (HbC)
 Follows the specified protocol.
 Tries to exploit any additional information it can learn while
doing so.
 Does not collude with other components.
 Clients are potentially malicious (distorting the
statistical results learned by analysts)
 Have churn characteristics.
 Limited resources for computation and data transmission.
 Generate false or illegitimate answers.
 Act as Sybils.

PDDP Key insights – Binary answer
 How to limit query result distortion?
 Split the answer's value range into buckets.
 Enforce a binary answer in each bucket.
 Goldwasser-Micali (GM) bit-cryptosystem.
Example:
Query: "SELECT age FROM info WHERE gender='m'"
 4 buckets: 0~12, 13~20, 21~59, and ≥60.
 Answers: '1' or '0' per bucket
 Malicious clients cannot substantially distort the query
result.
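The bucket example above can be sketched in a few lines of Python. This is a minimal plaintext illustration, not the system's code: the names `bucketize` and `aggregate` are hypothetical, and encryption is left out. It shows why a client that sets every bucket to '1' shifts each per-bucket count by at most one.

```python
from bisect import bisect_right

# Lower edges of the four example buckets: 0~12, 13~20, 21~59, >=60.
BUCKET_LOWER_EDGES = [0, 13, 21, 60]


def bucketize(age, lower_edges=BUCKET_LOWER_EDGES):
    """Return one binary answer per bucket for a single age value."""
    if age < lower_edges[0]:
        raise ValueError("age is below the first bucket")
    index = bisect_right(lower_edges, age) - 1
    return [1 if i == index else 0 for i in range(len(lower_edges))]


def aggregate(answers):
    """Sum the binary answers of all clients, bucket by bucket."""
    return [sum(column) for column in zip(*answers)]


honest = [bucketize(10), bucketize(30)]
malicious = [1, 1, 1, 1]  # the worst a single client can do: '1' everywhere
totals = aggregate(honest + [malicious])
```

Because each bucket only accepts a bit, the malicious client adds at most 1 to any bucket, whatever it sends.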

PDDP Key insights – Blind noise
 How to achieve differential privacy ?
 Honest-but-curious proxy
Generates additional binary answers in each bucket as
differentially private noise.
 If the analyst publishes the final noisy result
 the proxy knows the noise added
 and can subtract it from the published result to get a noise-free
result.
 Solution: Proxy can only blindly add noise!
 Proxy knows that the added noise is enough to achieve
differential privacy
 Proxy does not know the exact noise added.

PDDP Workflow – Step 1
 Query Initialization
Analyst first issues
a query to the
Proxy.
 Message consists of 4 items:
 Query: SELECT age FROM info WHERE gender='m'
 Buckets: 0~12, 13~20, 21~59, and ≥60.
 # clients queried (c): 1000
 DP parameter (ε): 1.0
 Controls tradeoff between accuracy of computation and strength of
its privacy guarantee.

PDDP Workflow – Step 2
 Query Forwarding
Select clients and
send them the
query.
 Proxy:
 rejects the query if c is too low or too high.
 rejects the query if ε exceeds the maximum privacy level allowed.
 selects c unique clients and sends them the query under one
of the following policies:
 Select c clients randomly and wait for them to connect.
 Select the first c clients that connect.
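A rough sketch of the proxy-side checks and the "first c clients that connect" policy follows. The limits `MIN_CLIENTS`, `MAX_CLIENTS`, and `MAX_EPSILON` are hypothetical placeholders, since the slides give no concrete values, and `QueryRequest` is an illustrative container rather than the system's message format.

```python
from dataclasses import dataclass

# Hypothetical policy limits; the slides do not state the real ones.
MIN_CLIENTS = 10
MAX_CLIENTS = 1_000_000
MAX_EPSILON = 5.0


@dataclass(frozen=True)
class QueryRequest:
    sql: str
    buckets: tuple
    c: int          # number of clients to query
    epsilon: float  # differential privacy parameter


def validate(request):
    """Return None if the query is acceptable, else a rejection reason."""
    if request.c < MIN_CLIENTS:
        return "c is too low"
    if request.c > MAX_CLIENTS:
        return "c is too high"
    if request.epsilon > MAX_EPSILON:
        return "epsilon exceeds the maximum allowed"
    return None


def select_first_unique(connecting_ids, c):
    """Keep the first c distinct client ids in connection order."""
    selected, seen = [], set()
    for client_id in connecting_ids:
        if client_id not in seen:
            seen.add(client_id)
            selected.append(client_id)
            if len(selected) == c:
                break
    return selected
```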

PDDP Workflow – Step 3 (1/2)
 Client Response
Clients execute
the query and
send answers.
 Client executes query over its local data and produces
answer:
 '1' or '0' per bucket.
 More than one bucket may contain a '1'.
 Per-bucket answer value is individually encrypted with the
analyst's public key (GM cryptosystem).

PDDP Workflow – Step 3 (2/2)
 Goldwasser-Micali (GM) cryptosystem
 Single-bit cryptosystem
 Enforces binary answer in each bucket.
 Very Efficient
 XOR – homomorphic
 E(a) * E(b) = E(a XOR b)
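The XOR homomorphism can be checked with a toy Goldwasser-Micali implementation. This is a sketch under loud assumptions: the primes are tiny and insecure, and choosing both primes congruent to 3 mod 4 so that x = N - 1 serves as the public non-residue is a textbook shortcut, not something stated in the slides.

```python
import math
import secrets

# Toy, insecure parameters (assumption): both primes are 3 mod 4,
# so X = N - 1 is a quadratic non-residue modulo each prime.
P, Q = 499, 547
N = P * Q
X = N - 1


def encrypt(bit):
    """Encrypt one bit as y^2 * X^bit mod N for a random unit y."""
    if bit not in (0, 1):
        raise ValueError("GM encrypts single bits only")
    while True:
        y = secrets.randbelow(N - 2) + 2
        if math.gcd(y, N) == 1:
            break
    return (y * y * pow(X, bit, N)) % N


def decrypt(ciphertext):
    """Bit is 0 iff the ciphertext is a square mod P (Euler's criterion)."""
    return 0 if pow(ciphertext, (P - 1) // 2, P) == 1 else 1


def xor_ciphertexts(c1, c2):
    """Multiplying ciphertexts yields an encryption of the XOR of the bits."""
    return (c1 * c2) % N
```

Only the holder of the factors (here the analyst) can run `decrypt`; anyone with N can run `xor_ciphertexts`.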

PDDP Workflow – Step 4
 Blind noise
addition
 The proxy maintains a pool of additional binary
answers called coins and adds them as noise to
each bucket.
 Coins must be unbiased.
 Coins are encrypted with the analyst's public key.
 n coins must be added to each bucket; n depends on c and ε.
How to generate coins blindly?
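The expression for n did not survive on this slide. As an assumption, not a quotation, the sketch below uses n = floor(64 ln(2c) / ε²) + 1, which reproduces the 16 coins per bucket that the deployment slide reports for c = 250 and ε = 5.0. The function name `coins_per_bucket` is hypothetical.

```python
import math


def coins_per_bucket(c, epsilon):
    """Assumed formula: n = floor(64 * ln(2c) / epsilon^2) + 1."""
    if c < 1 or epsilon <= 0:
        raise ValueError("c must be >= 1 and epsilon must be positive")
    return math.floor(64 * math.log(2 * c) / epsilon ** 2) + 1


deployment_coins = coins_per_bucket(250, 5.0)
```

Under this assumed formula, a smaller ε (stronger privacy) or a larger c requires more coins per bucket.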

Coin pool generation
 Straightforward approaches
 Proxy generates coins
 Curious proxy could know noise-free result
 Clients generate coins
 Malicious clients could generate biased coins

Collaborative coin generation
 Paper's approach
 Each online client periodically generates an encrypted
unbiased coin E(oc) and sends it to the proxy.
 The proxy receives the coin and verifies its legitimacy.
 The proxy blindly re-flips the coin E(oc) by multiplying it,
modulo m, with its own locally generated unbiased coin E(op):
E(oc) * E(op) mod m = E(oc XOR op),
where m is part of the analyst's public key.
 The proxy stores the resulting unbiased coin in its locally maintained
pool.
 The proxy does not know the actual value of the resulting unbiased
coin.
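Why the re-flip helps can be seen from the plaintext probabilities alone. The sketch below, with the hypothetical helper `xor_bias`, computes P(oc XOR op = 1) for two independent coins: if the proxy's coin is fair, the result is fair however biased the client's coin is.

```python
from fractions import Fraction


def xor_bias(p_client, p_proxy):
    """Probability that (oc XOR op) equals 1 for independent coins,
    where p_client = P(oc = 1) and p_proxy = P(op = 1)."""
    return p_client * (1 - p_proxy) + (1 - p_client) * p_proxy


fair = Fraction(1, 2)
heavily_biased = Fraction(9, 10)
result = xor_bias(heavily_biased, fair)
```

The same argument applies symmetrically: a fair client coin hides the proxy's coin value from the proxy once the two are combined under encryption.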

PDDP Workflow – Step 5
 Noisy answers to
analyst
 Each bucket has clients' answers + coins (noise)
 After a random delay, the proxy shuffles the c + n values.
 Prevents identification of a client based on the vector of '1' and '0' in its answer.
 Finally, the analyst
 decrypts all encrypted binary values with its private key.
 sums the plaintext values obtained.
 obtains the noisy count of clients that fall within each bucket.
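The per-bucket arithmetic of this step can be simulated in plaintext. This is a sketch only: encryption is omitted, the function names are hypothetical, and it assumes the analyst's adjustment amounts to subtracting the expected coin sum n/2 from the decrypted total.

```python
import random


def noisy_bucket_sum(client_bits, n_coins, rng):
    """Proxy side: append n unbiased coins, shuffle, hand over c + n values.
    The sum stands in for what the analyst obtains after decryption."""
    coins = [rng.randint(0, 1) for _ in range(n_coins)]
    values = client_bits + coins
    rng.shuffle(values)
    return sum(values)


def estimate_true_count(noisy_sum, n_coins):
    """Analyst side: remove the expected contribution of the coins."""
    return noisy_sum - n_coins / 2


rng = random.Random(0)          # seeded only to make the sketch repeatable
bits = [1] * 40 + [0] * 60      # 40 of 100 clients fall in this bucket
noisy = noisy_bucket_sum(bits, 16, rng)
estimate = estimate_true_count(noisy, 16)
```

With 16 coins the estimate can never be more than 8 away from the true count of 40, and is usually much closer.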

Practical Considerations (1/2)
 Utility of aggregate result
 Depends on the amount of added noise.
 The sum of the n coins added by the proxy follows a binomial
distribution; after the analyst subtracts its mean n/2, the noise
approximates the normal distribution N(0, n/4).
 Example:
c = 10^6, ε = 1.0
Assuming normally distributed noise in each bucket:
 68% probability that the noisy answer is within ±15.24 of the true answer
 95% probability that the noisy answer is within ±30.48 of the true answer
 99.7% probability that the noisy answer is within ±45.72 of the true answer
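The three error bands are one, two, and three standard deviations of N(0, n/4). The sketch below recomputes them. The slide does not state n for this example; 929 is an assumed coin count chosen because it reproduces the 15.24 figure, while n = 16 is the value the deployment slide reports.

```python
import math


def noise_std(n_coins):
    """Standard deviation of the coin noise: sqrt(n / 4)."""
    return math.sqrt(n_coins / 4)


def error_bands(n_coins):
    """One, two, and three standard deviations (68%, 95%, 99.7%)."""
    sigma = noise_std(n_coins)
    return (sigma, 2 * sigma, 3 * sigma)


example_bands = tuple(round(b, 2) for b in error_bands(929))  # assumed n
deployment_bands = error_bands(16)
```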

Practical Considerations (2/2)
 Non-numeric Queries
 Map query into a numeric query.
Example:
“Which website do you visit most often?”
Map each website the analyst wishes to learn into a numeric
value.
Large number of buckets – limit the answer to 5000 buckets.
 Sybils
 The design is susceptible to Sybil attacks (a single client can
masquerade as multiple clients).
 Proxy can limit the number of clients selected at a single IP
address for a given query.
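A per-IP cap of the kind described above could look like the following sketch. The function name, the `(client_id, ip)` input shape, and the cap value are all hypothetical; the slides do not say how the proxy implements the limit.

```python
from collections import Counter


def limit_per_ip(candidates, max_per_ip):
    """Keep clients in arrival order, at most max_per_ip per IP address.

    candidates: iterable of (client_id, ip_address) pairs.
    """
    counts = Counter()
    kept = []
    for client_id, ip in candidates:
        if counts[ip] < max_per_ip:
            counts[ip] += 1
            kept.append(client_id)
    return kept


arrivals = [("a", "10.0.0.1"), ("b", "10.0.0.1"),
            ("c", "10.0.0.1"), ("d", "10.0.0.2")]
selected = limit_per_ip(arrivals, max_per_ip=2)
```

This bounds what one address can contribute to a query but does not stop an attacker who controls many addresses.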

Implementation and Deployment (1/2)
 Client
 Firefox add-on
 9600 lines of Java code
 Information is stored in local SQLite storage
 Web browsing activities
 Certain online shopping activities
 Certain ad interactions
 Can be extended to capture any online activity
 Connects to the proxy every 5 min to retrieve pending queries and to
return answers and periodically generated coins.

Implementation and Deployment (2/2)
 Proxy
 Web service on Tomcat 6.0.33
 3600 lines of code
 Proxy state in MySQL database.
 Analyst
 800 lines of code
Deployment
 Correctness verified on a set of local machines.
 600+ real clients

Comparison: “Paillier-based” design
 Honest-but-Curious Proxy
 Paillier Cryptosystem
 Additive homomorphism
 Proxy can directly sum up all clients' encrypted binary
answers to get the encrypted sum of each bucket.
 A single malicious client can substantially distort the
result.
 Use of zero-knowledge proofs (ZKPs) to ensure that
encrypted answers are '1' or '0'.
 Proxy knows exactly how much noise has been
added.

Evaluation (1/5)
 Client Performance
 Clients encrypt a binary value for each bucket.
 GM cryptosystem
 Paillier cryptosystem

Evaluation (2/5)
 Proxy - Analyst Performance
 Proxy
PDDP
 One encryption and one homomorphic XOR for one unbiased coin.
 Jacobi symbol checking on received coins and answer values
(faster than a decryption).
Paillier-based
 One ZKP for each client answer in each bucket.
 Homomorphically sum up all clients' answers per bucket.
 Add noise to each per-bucket total sum.
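The Jacobi symbol check mentioned for the PDDP proxy needs only the public modulus, which is why it is cheaper than decryption. A well-formed GM ciphertext has Jacobi symbol +1 with respect to that modulus, so values with any other symbol can be rejected. Below is the standard binary algorithm as a sketch; the toy modulus in the usage lines is an assumption for illustration, not a parameter from the paper.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for a positive odd integer n."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    a %= n
    result = 1
    while a != 0:
        # Pull out factors of two, using the value of (2/n).
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        # Quadratic reciprocity: swap, flipping sign if both are 3 mod 4.
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


TOY_MODULUS = 499 * 547                          # illustrative only
looks_valid = jacobi(TOY_MODULUS - 1, TOY_MODULUS) == 1
```

The check is necessary rather than sufficient: it filters malformed values but does not reveal the encrypted bit.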

Evaluation (3/5)
 Proxy - Analyst Performance
 Analyst
PDDP
 Decrypt all encrypted values in each bucket.
Paillier-based
 Decrypt one encrypted value in each bucket

Evaluation (4/5)
 Bandwidth overhead
 In both systems, a client transmits an encrypted answer to
each bucket.
 In PDDP, a client also transmits periodically generated coins to the
proxy.
 In Paillier-based, a client transmits a ZKP for each bucket.
 Storage overhead
 In PDDP, the proxy stores all clients' answer values for each
bucket plus the required number of coins.
 In Paillier-based, proxy stores only one answer value per
bucket.

Evaluation (5/5)
 Querying the client deployment
 Parameters
c = 250 (out of 600 clients)
ε = 5.0
Clients are selected as they connect until 250 unique clients are queried or
24 hours elapse.
These parameters result in 16 coins per bucket.
 Ensures that a per-bucket aggregate answer is within ±2, ±4, and ±6
of the noise-free answer with probability 68%, 95%, and 99.7%, respectively.

Future Work
 Support of statistical learning algorithms
 Scalability of non-numeric queries
 Bloom filters – map a large number of possible answers in
a small number of buckets.
 Gather statistical data for a large-scale experiment.
 Weaken proxy trust requirements.
 Use of trusted hardware (TPM)
 General: measure the actual privacy loss for
differential privacy.

Conclusion
 PDDP: Practical Distributed Differential Privacy
System
 Scales well
 Tolerates churn
 Places a tight bound on malicious users' capability.
 Key insights
 Binary answer in each bucket
 Blind noise addition

Questions?
