Towards Statistical Queries over Distributed Private User Data
Presentation Transcript

  • Towards Statistical Queries over Distributed Private User Data
    R. Chen, A. Reznichenko, P. Francis (MPI-SWS, Germany); J. Gehrke (Cornell University, USA)
    Serafeim Chatzopoulos, M1258, schatz@di.uoa.gr
    MDE519 – Distributed Systems. Instructor: Mema Roussopoulou
    May 31, 2013
  • User Privacy
    User data is exposed to organizations in many ways.
    Sometimes users are aware of the exposure: making a purchase in an online store, updating a profile on a social network.
    Sometimes users are unaware of it: third-party trackers, smartphone apps.
  • The "user-owned and operated" principle
    Personal data should be stored on a local host or a cloud device under the user's control, and released only in a controlled, limited, or noisy fashion.
    Users must have exclusive control of their own data and must be able to share it selectively or voluntarily.
  • Motivation and Problem
    Distributed private user data is important. An analyst could use such data to understand users' behavior, discover statistical patterns, and evaluate proposed enhancements.
    How can statistical queries be made over such distributed private user data while still preserving privacy?
  • Related Work
    Anonymization: removes well-known personally identifiable information (PII).
    Randomization: adds random distortion values to user data.
    k-anonymity, l-diversity, t-closeness.
    Differential privacy.
  • Differential Privacy
    Differential privacy adds noise to the output of a computation (i.e., the answer to a query).
    It hides the presence or absence of a record in the dataset.
    It makes no assumption about the adversary.
    Some form of distributed differential privacy is required.
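
For reference, the standard definition of ε-differential privacy can be written as below. The notation is not taken from the slides and is assumed here: M is a randomized computation, D and D' are datasets that differ in one record, and S is any set of possible outputs.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
```

A smaller ε forces the two output distributions closer together, so the presence or absence of a single record is better hidden, at the cost of a noisier answer.
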
  • Prior Distributed Differential Privacy Designs
    The first design has a per-user computational load of O(U) (Dwork et al., EUROCRYPT '06). Drawback: poor scalability.
    Later designs reduce the per-user computational load to O(1) by using expensive secret-sharing protocols (Rastogi and Nath, SIGMOD '10; Shi et al., NDSS '11). Drawback: they do not tolerate churn.
    Recent designs introduce two honest-but-curious servers that collaboratively compute the query result (Gotz and Nath, MSR-TR '11). Drawback: even a single malicious user can substantially distort the query result.
  • Practical Distributed Differential Privacy System (PDDP)
    Goals:
    The differential privacy guarantee is always maintained for every honest client.
    The extent to which a malicious user can distort query results is tightly bounded: the maximum absolute distortion in the final result is bounded by the number of malicious users.
    The system operates at large scale (millions of users).
    The system tolerates churn: churn does not prevent results from being produced.
  • PDDP Components
    Analyst: makes queries to the system and collects answers.
    Proxy: adds differentially private noise to clients' answers to preserve privacy.
    Clients: locally maintain their own data and answer queries.
  • Security Assumptions (1/2)
    General assumptions: clients have the correct public keys for the analyst and the proxy. The analyst and the proxy have the correct public keys for each other. The corresponding private keys are kept secure.
    The analyst is potentially malicious (it may try to violate users' privacy). It may collude with other analysts, pretend to be multiple distinct analysts, take control of clients and use the PDDP protocol to reveal information, publish its collected answers, and intercept and modify all messages.
  • Security Assumptions (2/2)
    The proxy is honest but curious (HbC). It follows the specified protocol, tries to exploit any additional information it can learn in doing so, and does not collude with other components.
    Clients are potentially malicious (they may try to distort the statistical results learned by analysts). They exhibit churn, have limited resources for computation and data transmission, may generate false or illegitimate answers, and may act as Sybils.
  • PDDP Key insights – Binary answer
    How to limit query result distortion?
    Split the range of answer values into buckets and enforce a binary answer in each bucket, using the Goldwasser-Micali (GM) bit cryptosystem.
    Example query: "SELECT age FROM info WHERE gender='m'"
    4 buckets: 0~12, 13~20, 21~59, and ≥60.
    Answers: '1' or '0' per bucket.
    Malicious clients cannot substantially distort the query result.
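
The bucketing in the slide above can be sketched as follows. This is a minimal illustration, not the authors' client code: the `record` dictionary layout, the function names, and the choice to return an all-zero answer when the `WHERE` predicate does not match are assumptions made here.

```python
from typing import List, Optional, Tuple

# Buckets from the slide: 0~12, 13~20, 21~59, and >=60.
# An upper bound of None marks the open-ended last bucket.
BUCKETS: List[Tuple[int, Optional[int]]] = [(0, 12), (13, 20), (21, 59), (60, None)]


def bucketize(age: int) -> List[int]:
    """Return one bit per bucket: 1 if the age falls in the bucket, else 0."""
    return [
        1 if low <= age and (high is None or age <= high) else 0
        for low, high in BUCKETS
    ]


def answer(record: dict) -> List[int]:
    """Answer "SELECT age FROM info WHERE gender='m'" over one local record."""
    if record.get("gender") != "m":
        # Assumption: a client that does not match the predicate sends all zeros.
        return [0] * len(BUCKETS)
    return bucketize(record["age"])
```

Because each bucket carries a single bit, a malicious client can shift any bucket count by at most one, whatever value it reports.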
  • PDDP Key insights – Blind noise
    How to achieve differential privacy?
    The honest-but-curious proxy generates additional binary answers in each bucket as differentially private noise.
    Problem: if the analyst publishes the final noisy result, a proxy that knows the noise it added can subtract that noise from the published result to get the noise-free result.
    Solution: the proxy can only add noise blindly. The proxy knows that the added noise is enough to achieve differential privacy, but it does not know the exact noise added.
  • PDDP Workflow – Step 1: Query Initialization
    The analyst first issues a query to the proxy. The message consists of 4 items:
    Query: SELECT age FROM info WHERE gender='m'
    Buckets: 0~12, 13~20, 21~59, and ≥60.
    Number of clients queried (c): 1000
    DP parameter (ε): 1.0. It controls the tradeoff between the accuracy of the computation and the strength of its privacy guarantee.
  • PDDP Workflow – Step 2: Query Forwarding
    The proxy selects clients and sends them the query. The proxy:
    rejects the query if c is too low or too high;
    rejects the query if ε exceeds the maximum privacy level allowed;
    selects c unique clients and sends them the query under one of the following policies: select c clients randomly and wait for them to connect, or select the first c clients that connect.
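
A sketch of the proxy-side checks and the two selection policies described above. The `Query` class and the limits `MIN_CLIENTS`, `MAX_CLIENTS`, and `MAX_EPSILON` are hypothetical; the slides do not give concrete thresholds.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class Query:
    sql: str            # e.g. "SELECT age FROM info WHERE gender='m'"
    buckets: List[str]  # e.g. ["0~12", "13~20", "21~59", ">=60"]
    c: int              # number of clients to query
    epsilon: float      # differential privacy parameter


# Hypothetical policy limits, chosen only for illustration.
MIN_CLIENTS = 10
MAX_CLIENTS = 1_000_000
MAX_EPSILON = 5.0


def accept(query: Query) -> bool:
    """Reject queries whose c is out of range or whose epsilon is too large."""
    return (MIN_CLIENTS <= query.c <= MAX_CLIENTS
            and 0 < query.epsilon <= MAX_EPSILON)


def select_random(known_clients: Sequence[str], c: int,
                  rng: random.Random) -> List[str]:
    """Policy 1: pick c distinct clients at random, then wait for them."""
    return rng.sample(list(known_clients), c)


def select_first(connections: Iterable[str], c: int) -> List[str]:
    """Policy 2: take the first c distinct clients in connection order."""
    selected: List[str] = []
    for client_id in connections:
        if len(selected) == c:
            break
        if client_id not in selected:
            selected.append(client_id)
    return selected
```

Random selection avoids favoring clients that connect often, while first-come selection completes a query sooner.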
  • PDDP Workflow – Step 3 (1/2): Client Response
    Clients execute the query and send answers.
    Each client executes the query over its local data and produces an answer: '1' or '0' per bucket. More than one bucket may contain a '1'.
    Each per-bucket answer value is individually encrypted with the analyst's public key (GM cryptosystem).
  • PDDP Workflow – Step 3 (2/2): Goldwasser-Micali (GM) cryptosystem
    Single-bit cryptosystem: enforces a binary answer in each bucket.
    Very efficient.
    XOR-homomorphic: E(a) * E(b) = E(a XOR b)
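
The XOR homomorphism can be checked with a toy Goldwasser-Micali implementation. This is a sketch for illustration only: the tiny fixed primes, the function names, and the use of `random.Random` (not a cryptographic generator) are assumptions made here, not part of PDDP.

```python
import math
import random


def keygen(p: int, q: int):
    """Build a toy GM key pair from two distinct odd primes p and q."""
    n = p * q
    # x must be a quadratic non-residue modulo both p and q (Euler's criterion).
    x = next(a for a in range(2, n)
             if pow(a, (p - 1) // 2, p) == p - 1
             and pow(a, (q - 1) // 2, q) == q - 1)
    return (n, x), (p, q)  # (public key, private key)


def encrypt(pub, bit: int, rng: random.Random) -> int:
    """Encrypt one bit as y^2 * x^bit mod n for a random y coprime to n."""
    if bit not in (0, 1):
        raise ValueError("GM encrypts a single bit")
    n, x = pub
    y = rng.randrange(2, n)
    while math.gcd(y, n) != 1:
        y = rng.randrange(2, n)
    return (y * y * pow(x, bit, n)) % n


def decrypt(priv, c: int) -> int:
    """Return 0 if c is a quadratic residue modulo p, else 1."""
    p, _q = priv
    return 0 if pow(c % p, (p - 1) // 2, p) == 1 else 1


def xor_ciphertexts(pub, c1: int, c2: int) -> int:
    """Multiply two ciphertexts; the product encrypts the XOR of the two bits."""
    n, _x = pub
    return (c1 * c2) % n


# Toy primes for illustration only; real keys need large random primes.
PUB, PRIV = keygen(499, 547)
```

Multiplying `encrypt(PUB, a, rng)` by `encrypt(PUB, b, rng)` modulo n and decrypting the product yields `a ^ b`, which is the property E(a) * E(b) = E(a XOR b) stated in the slide.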
  • PDDP Workflow – Step 4: Blind noise addition
    The proxy maintains a pool of additional binary answers, called coins, and adds them as noise to each bucket.
    Coins must be unbiased.
    Coins are encrypted with the analyst's public key.
    n coins must be added to each bucket.
    How to generate coins blindly?
  • Coin pool generation
    Straightforward approaches:
    The proxy generates the coins. Problem: a curious proxy could learn the noise-free result.
    The clients generate the coins. Problem: malicious clients could generate biased coins.
  • Collaborative coin generation
    The paper's approach:
    Each online client periodically generates an encrypted unbiased coin E(oc) and sends it to the proxy.
    The proxy receives the coin and verifies its legitimacy.
    The proxy blindly re-flips the coin E(oc) by multiplying it, modulo m, with a locally generated unbiased coin E(op):
    E(oc) * E(op) mod m = E(oc XOR op), where m is part of the analyst's public key.
    The proxy stores the resulting unbiased coin in its locally maintained pool.
    The proxy does not know the actual value of the generated coin.
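
The re-flip step can be sketched with a toy GM setup. The legitimacy check below uses the Jacobi symbol, which the evaluation slides mention for received coins; the toy primes, the function names, and the non-cryptographic `random.Random` generator are assumptions made for illustration.

```python
import math
import random


def keygen(p: int, q: int):
    """Toy GM key pair; n plays the role of the modulus m in the slide."""
    n = p * q
    x = next(a for a in range(2, n)
             if pow(a, (p - 1) // 2, p) == p - 1
             and pow(a, (q - 1) // 2, q) == q - 1)
    return (n, x), (p, q)


def encrypt(pub, bit: int, rng: random.Random) -> int:
    """Encrypt one bit as y^2 * x^bit mod n for a random y coprime to n."""
    n, x = pub
    y = rng.randrange(2, n)
    while math.gcd(y, n) != 1:
        y = rng.randrange(2, n)
    return (y * y * pow(x, bit, n)) % n


def decrypt(priv, c: int) -> int:
    """Return 0 if c is a quadratic residue modulo p, else 1."""
    p, _q = priv
    return 0 if pow(c % p, (p - 1) // 2, p) == 1 else 1


def jacobi(a: int, n: int) -> int:
    """Jacobi symbol (a/n) for odd n > 0, computed without factoring n."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def reflip(pub, client_coin: int, rng: random.Random) -> int:
    """Proxy step: check the client's coin, then XOR in a fresh proxy coin."""
    n, _x = pub
    # Every well-formed GM ciphertext has Jacobi symbol +1 modulo n.
    if jacobi(client_coin, n) != 1:
        raise ValueError("not a well-formed GM ciphertext")
    proxy_bit = rng.getrandbits(1)  # the proxy's own unbiased coin, op
    return (client_coin * encrypt(pub, proxy_bit, rng)) % n


PUB, PRIV = keygen(499, 547)  # toy primes, illustration only
```

The proxy knows its own bit but not the client's, and the client does not know the proxy's bit, so neither party alone learns the value of the stored coin. The result is unbiased as long as at least one of the two bits is.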
  • PDDP Workflow – Step 5: Noisy answers to analyst
    Each bucket holds the clients' answers plus the coins (noise).
    After a random delay, the proxy shuffles the c + n values. This prevents identification of a client based on the vector of '1's and '0's in its answer.
    Finally, the analyst decrypts all encrypted binary values with its private key, sums the plaintext values obtained, and so obtains the noisy count of clients that fall within each bucket.
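
The arithmetic of this step can be sketched on plaintext bits. Encryption is deliberately left out, so this shows only how the count is formed; the function name is hypothetical, and subtracting n/2 follows the mean adjustment mentioned under Practical Considerations.

```python
import random
from typing import Sequence


def noisy_bucket_count(client_bits: Sequence[int], n_coins: int,
                       rng: random.Random) -> float:
    """Simulate one bucket: add n unbiased coins, shuffle, sum, remove the mean."""
    coins = [rng.getrandbits(1) for _ in range(n_coins)]  # proxy's blind noise
    values = list(client_bits) + coins                    # c + n binary values
    rng.shuffle(values)        # proxy hides which value came from which client
    total = sum(values)        # analyst decrypts every value and sums them
    return total - n_coins / 2  # analyst subtracts the expected coin sum
```

With n coins the returned estimate differs from the true count by at most n/2 in either direction, and by about sqrt(n)/2 on average in the standard-deviation sense.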
  • Practical Considerations (1/2)
    Utility of the aggregate result: depends on the amount of added noise.
    The sum of the n coins added by the proxy, after the analyst adjusts for its mean of n/2, follows a binomial distribution, which approximates the normal distribution N(0, n/4).
    Example: c = 10^6, ε = 1.0. Given the normal distribution, in each bucket the noisy answer is within 15.24 of the true answer with probability 68%, within 30.48 with probability 95%, and within 45.72 with probability 99.7%.
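
The numbers in the example can be reproduced with the short sketch below. The transcript does not preserve the slide's expression for n, so the formula n = floor(64 ln(2c) / ε²) + 1 is an assumption here; it is consistent with the figures quoted in these slides (15.24 for c = 10^6, ε = 1.0, and 16 coins for c = 250, ε = 5.0), but should be checked against the paper.

```python
import math


def coins_per_bucket(c: int, epsilon: float) -> int:
    """Assumed formula for the number of coins n added to each bucket."""
    return math.floor(64 * math.log(2 * c) / epsilon ** 2) + 1


def noise_std(n: int) -> float:
    """Standard deviation of the centered coin sum, i.e. sqrt(n / 4)."""
    return math.sqrt(n / 4)


n = coins_per_bucket(10 ** 6, 1.0)
sigma = noise_std(n)
# One, two, and three standard deviations give the 68%, 95%, and 99.7% bounds.
bounds = [round(k * sigma, 2) for k in (1, 2, 3)]
```

Under this assumption, n grows only logarithmically with the number of clients c but quadratically as ε shrinks, so tightening the privacy parameter is what drives the noise.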
  • Practical Considerations (2/2)
    Non-numeric queries: map the query into a numeric query. Example: "Which website do you visit most often?" Map each website the analyst wishes to learn about to a numeric value. This can require a large number of buckets; the answer is limited to 5000 buckets.
    Sybils: the design is susceptible to Sybil attacks (a single client can masquerade as multiple clients). The proxy can limit the number of clients selected from a single IP address for a given query.
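
The per-IP limit could look like the sketch below. The function name, the `(client_id, ip)` connection tuples, and the `max_per_ip` parameter are hypothetical; the slides only state that such a limit is possible.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def select_with_ip_cap(connections: Iterable[Tuple[str, str]], c: int,
                       max_per_ip: int) -> List[str]:
    """Select up to c distinct clients, at most max_per_ip from any one IP."""
    per_ip: Counter = Counter()
    seen = set()
    selected: List[str] = []
    for client_id, ip in connections:
        if len(selected) == c:
            break
        if client_id in seen or per_ip[ip] >= max_per_ip:
            continue  # skip repeat clients and IPs that reached the cap
        seen.add(client_id)
        per_ip[ip] += 1
        selected.append(client_id)
    return selected
```

A cap of this kind only raises the cost of a Sybil attack; an attacker with many IP addresses can still register many clients.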
  • Implementation and Deployment (1/2)
    Client: Firefox add-on, 9600 lines of Java code.
    Information is stored in local SQLite storage: web browsing activities, certain online shopping activities, and certain ad interactions. It can be extended to capture any online activity.
    Every 5 minutes the client connects to the proxy to retrieve pending queries and to return answers and periodically generated coins.
  • Implementation and Deployment (2/2)
    Proxy: web service on Tomcat 6.0.33, 3600 lines of code, with proxy state kept in a MySQL database.
    Analyst: 800 lines of code.
    Deployment: correctness verified on a set of local machines; 600+ real clients.
  • Comparison: "Paillier-based" design
    Honest-but-curious proxy.
    Paillier cryptosystem: additive homomorphism. The proxy can directly sum up all clients' encrypted binary answers to get the encrypted sum for each bucket.
    A single malicious client can substantially distort the result, so zero-knowledge proofs (ZKPs) are used to ensure that encrypted answers are '1' or '0'.
    The proxy knows exactly how much noise has been added.
  • Evaluation (1/5): Client Performance
    Clients encrypt a binary value for each bucket.
    Compared for the GM cryptosystem and the Paillier cryptosystem.
  • Evaluation (2/5): Proxy and Analyst Performance (proxy)
    PDDP: one encryption and one homomorphic XOR per unbiased coin; Jacobi symbol checking on received coins and answer values (faster than a decryption).
    Paillier-based: one ZKP for each client answer in each bucket; homomorphically sum up all clients' answers per bucket; add noise to each per-bucket total sum.
  • Evaluation (3/5): Proxy and Analyst Performance (analyst)
    PDDP: decrypt all encrypted values in each bucket.
    Paillier-based: decrypt one encrypted value in each bucket.
  • Evaluation (4/5)
    Bandwidth overhead: in both systems, a client transmits an encrypted answer for each bucket. In PDDP, a client also transmits periodically generated coins to the proxy. In the Paillier-based design, a client also transmits a ZKP for each bucket.
    Storage overhead: in PDDP, the proxy stores all clients' answer values for each bucket plus the required number of coins. In the Paillier-based design, the proxy stores only one answer value per bucket.
  • Evaluation (5/5): Querying the client deployment
    Parameters: c = 250 (out of 600 clients), ε = 5.0. Clients are selected as they connect, until 250 unique clients have been queried or 24 hours have elapsed.
    These parameters result in 16 coins per bucket, which ensures that a per-bucket aggregate answer is within ±2, ±4, and ±6 of the noise-free answer with probability 68%, 95%, and 99.7%, respectively.
  • Future Work
    Support for statistical learning algorithms.
    Scalability of non-numeric queries: Bloom filters could map a large number of possible answers into a small number of buckets.
    Gather statistical data for a large-scale experiment.
    Weaken the proxy trust requirements, e.g., by using trusted hardware (TPM).
    General: measure the actual privacy loss for differential privacy.
  • Conclusion
    PDDP, the Practical Distributed Differential Privacy system: scales well, tolerates churn, and places a tight bound on a malicious user's capability.
    Key insights: a binary answer in each bucket, and blind noise addition.
  • Questions?