Towards Statistical Queries over Distributed Private User Data

User Privacy
- User data is exposed to organizations in many ways.
- Sometimes users are aware that their data is being exposed:
  - Making a purchase in an online store.
  - Updating a profile on a social network.
- Sometimes users are unaware of the exposure:
  - Third-party trackers.
  - Smartphone apps.

The "user-owned and operated" principle
- Personal data should be stored on a local host or a cloud device under the user's control, and released only in a controlled, limited, or noisy fashion.
- Users must have exclusive control of their own data and must be able to share it selectively or voluntarily.

Motivation and Problem
- Distributed private user data is important. Analysts could use such data to:
  - understand users' behavior,
  - discover statistical patterns,
  - evaluate proposed enhancements.
- How can statistical queries be made over such distributed private user data while still preserving privacy?

Related Work
- Anonymization: removes well-known personally identifiable information (PII).
- Randomization: adds random distortion values to user data.
- k-anonymity, l-diversity, t-closeness
- Differential privacy

Differential Privacy
- Differential privacy adds noise to the output of a computation (i.e., the answer to a query).
- It hides the presence or absence of a record in the dataset.
- It makes no assumption about the adversary.
- Some form of distributed differential privacy is required…

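For reference, the claim that differential privacy hides the presence or absence of a record is usually formalized as follows. This is the standard textbook definition of ε-differential privacy, not a formula taken from these slides; the symbols M (the randomized computation), D and D' (datasets differing in one record), and S (a set of possible outputs) are introduced here only for illustration.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\quad \text{for all neighboring datasets } D, D' \text{ and all output sets } S .
```

A smaller ε makes the two output distributions harder to tell apart, which means stronger privacy at the cost of noisier answers.
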
Prior Distributed Differential Privacy Designs
- The first design has a per-user computational load of O(U). (Dwork et al., EUROCRYPT '06)
  - Poor scalability.
- Subsequent designs reduce the per-user computational load to O(1) by using expensive secret-sharing protocols. (Rastogi and Nath, SIGMOD '10; Shi et al., NDSS '11)
  - They do not tolerate churn.
- Recent designs introduce two honest-but-curious servers that collaboratively compute the query result. (Götz and Nath, MSR-TR '11)
  - Even a single malicious user can substantially distort the query result.

Practical Distributed Differential Privacy System (PDDP)
Goals:
- The differential privacy guarantee is always maintained for every honest client.
- Places a tight bound on the extent to which a malicious user can distort query results.
  - The maximum absolute distortion in the final result is bounded by the number of malicious users.
- Operates at a large scale.
  - Millions of users.
- Tolerates churn.
  - Churn does not prevent results from being produced.

PDDP Components
- Analyst: makes queries to the system and collects answers.
- Proxy: adds differentially private noise to clients' answers to preserve privacy.
- Clients: locally maintain their own data and answer queries.

Security Assumptions (1/2)
General assumptions
- Clients have the correct public keys for the analyst and the proxy.
- The analyst and the proxy have the correct public keys for each other.
- The corresponding private keys are kept secure.
The analyst is potentially malicious (violating users' privacy). It may:
- collude with other analysts;
- pretend to be multiple distinct analysts;
- take control of clients and use the PDDP protocol to reveal information;
- publish its collected answers;
- intercept and modify all messages.

Security Assumptions (2/2)
The proxy is honest but curious (HbC). It:
- follows the specified protocol;
- tries to exploit additional information that can be learned in doing so;
- does not collude with other components.
Clients are potentially malicious (distorting the statistical results learned by analysts). They:
- have churn characteristics;
- have limited resources for computation and data transmission;
- may generate false or illegitimate answers;
- may act as Sybils.

PDDP Key insights – Binary answer
How to limit query result distortion?
- Split the answer's value range into buckets.
- Enforce a binary answer in each bucket.
  - Goldwasser-Micali (GM) bit cryptosystem.
- Example:
  - Query: "SELECT age FROM info WHERE gender='m'"
  - 4 buckets: 0~12, 13~20, 21~59, and ≥60.
  - Answers: '1' or '0' per bucket.
- Malicious clients cannot substantially distort the query result.

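The bucketing in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the PDDP client code: the names `AGE_BUCKETS`, `bucket_answer`, and `aggregate` are hypothetical, and the choice to answer all zeros when a client does not match the WHERE clause is an assumption.

```python
from typing import List, Optional, Tuple

# Buckets from the slide's example: 0-12, 13-20, 21-59, and 60 or older.
# An upper bound of None marks the open-ended last bucket.
AGE_BUCKETS: List[Tuple[int, Optional[int]]] = [(0, 12), (13, 20), (21, 59), (60, None)]


def bucket_answer(age: Optional[int],
                  buckets: List[Tuple[int, Optional[int]]] = AGE_BUCKETS) -> List[int]:
    """Return one bit per bucket.

    age=None stands for a client that does not match the query
    (assumption: such a client answers 0 in every bucket).
    """
    if age is None:
        return [0] * len(buckets)
    return [1 if low <= age and (high is None or age <= high) else 0
            for low, high in buckets]


def aggregate(answers: List[List[int]]) -> List[int]:
    """Per-bucket counts over all clients' bit vectors (no noise, no encryption)."""
    return [sum(column) for column in zip(*answers)]


if __name__ == "__main__":
    answers = [bucket_answer(a) for a in (8, 17, 34, 34, 71, None)]
    print(aggregate(answers))  # [1, 1, 2, 1]
```

Because every client contributes at most one bit per bucket, a single client can shift any bucket count by at most 1, which is the bound on distortion the slide refers to.
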
PDDP Key insights – Blind noise
How to achieve differential privacy?
- The honest-but-curious proxy generates additional binary answers in each bucket as differentially private noise.
- Problem: if the analyst publishes the final noisy result, a proxy that knows the added noise can subtract it from the published result to get the noise-free result.
- Solution: the proxy can only add noise blindly!
  - The proxy knows that the added noise is enough to achieve differential privacy.
  - The proxy does not know the exact noise added.

PDDP Workflow – Step 1
Query initialization: the analyst first issues a query to the proxy.
The message consists of 4 items:
- Query: SELECT age FROM info WHERE gender='m'
- Buckets: 0~12, 13~20, 21~59, and ≥60.
- Number of clients queried (c): 1000
- DP parameter (ε): 1.0
  - Controls the tradeoff between the accuracy of the computation and the strength of its privacy guarantee.

PDDP Workflow – Step 2
Query forwarding: the proxy selects clients and sends them the query.
The proxy:
- rejects the query if c is too low or too high;
- rejects the query if ε exceeds the maximum privacy level allowed;
- selects c unique clients and sends them the query, under one of the following policies:
  - select c clients randomly and wait for them to connect;
  - select the first c clients that connect.

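The four-item query message and the proxy's admission checks can be sketched as below. This is an illustrative sketch only: the `Query` class, `proxy_accepts`, and the limits `MIN_CLIENTS`, `MAX_CLIENTS`, and `MAX_EPSILON` are hypothetical, since the slides do not give concrete thresholds or a message format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Query:
    sql: str                            # the query text
    buckets: List[Tuple[int, float]]    # (low, high) per bucket; high may be inf
    c: int                              # number of clients to query
    epsilon: float                      # differential privacy parameter


# Hypothetical proxy policy limits (assumed values, not from the slides).
MIN_CLIENTS = 10
MAX_CLIENTS = 1_000_000
MAX_EPSILON = 5.0


def proxy_accepts(q: Query) -> bool:
    """Apply the two rejection rules from Step 2."""
    if not (MIN_CLIENTS <= q.c <= MAX_CLIENTS):
        return False                    # c too low or too high
    if q.epsilon <= 0 or q.epsilon > MAX_EPSILON:
        return False                    # epsilon outside the allowed range
    return True


if __name__ == "__main__":
    q = Query(
        sql="SELECT age FROM info WHERE gender='m'",
        buckets=[(0, 12), (13, 20), (21, 59), (60, float("inf"))],
        c=1000,
        epsilon=1.0,
    )
    print(proxy_accepts(q))  # True
```
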
PDDP Workflow – Step 3 (1/2)
Client response: clients execute the query and send their answers.
- Each client executes the query over its local data and produces an answer: '1' or '0' per bucket.
- More than one bucket may contain a '1'.
- Each per-bucket answer value is individually encrypted with the analyst's public key (GM cryptosystem).

PDDP Workflow – Step 3 (2/2)
Goldwasser-Micali (GM) cryptosystem
- Single-bit cryptosystem.
  - Enforces a binary answer in each bucket.
- Very efficient.
- XOR-homomorphic: E(a) * E(b) = E(a XOR b)

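The XOR-homomorphic property can be checked with a toy Goldwasser-Micali implementation. This is a sketch under stated assumptions, not the PDDP code: the primes are tiny and fixed for readability (real keys use large random primes), and the function names `encrypt`, `decrypt`, and `hxor` are chosen here for illustration.

```python
import math
import random

# Toy key material. Both primes are congruent to 3 mod 4, so -1 (that is, N - 1)
# is a quadratic non-residue modulo each prime and can serve as the public value X.
P, Q = 499, 547          # private key (analyst only)
N = P * Q                # public modulus
X = N - 1                # public non-residue


def encrypt(bit: int, rng=random) -> int:
    """E(bit) = y^2 * X^bit mod N for a random y coprime to N."""
    while True:
        y = rng.randrange(2, N)
        if math.gcd(y, N) == 1:
            break
    return (y * y * pow(X, bit, N)) % N


def decrypt(c: int) -> int:
    """Return 0 if c is a quadratic residue mod P (Euler's criterion), else 1."""
    return 0 if pow(c, (P - 1) // 2, P) == 1 else 1


def hxor(c1: int, c2: int) -> int:
    """Multiplying ciphertexts XORs the underlying bits."""
    return (c1 * c2) % N


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            assert decrypt(hxor(encrypt(a), encrypt(b))) == a ^ b
    print("E(a) * E(b) decrypts to a XOR b for all bit pairs")
```

Since a ciphertext can only ever decrypt to 0 or 1, a client has no way to encode a larger value in a bucket.
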
PDDP Workflow – Step 4
Blind noise addition
- The proxy maintains a pool of additional binary answers, called coins, and adds them as noise to each bucket.
- Coins must be unbiased.
- Coins are encrypted with the analyst's public key.
- n coins must be added to each bucket, where n is determined by c and ε.
- How can coins be generated blindly?

Coin pool generation
Straightforward approaches
- The proxy generates the coins.
  - A curious proxy could learn the noise-free result.
- The clients generate the coins.
  - Malicious clients could generate biased coins.

Collaborative coin generation
The paper's approach
- Each online client periodically generates an encrypted unbiased coin E(oc) and sends it to the proxy.
- The proxy receives the coin and verifies its legitimacy.
- The proxy blindly re-flips the coin E(oc) by multiplying it with a locally generated unbiased coin E(op), followed by a modulo operation:
  E(oc) * E(op) mod m = E(oc XOR op), where m is part of the analyst's public key.
- The proxy stores the resulting unbiased coin in its locally maintained pool.
- The proxy does not know the actual value of the generated unbiased coin.

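The re-flip step can be sketched with the same toy GM parameters. This is an illustration under assumptions rather than the deployed code: the legitimacy check is modeled as a Jacobi symbol test (the evaluation slides mention Jacobi symbol checking on received coins), the tiny primes are for readability only, and `is_legitimate` and `reflip` are hypothetical names.

```python
import math
import random

P, Q = 499, 547      # analyst's private toy primes (both congruent to 3 mod 4)
N = P * Q            # public modulus (the "m" on the slide)
X = N - 1            # public non-residue with Jacobi symbol +1


def encrypt(bit: int, rng=random) -> int:
    while True:
        y = rng.randrange(2, N)
        if math.gcd(y, N) == 1:
            break
    return (y * y * pow(X, bit, N)) % N


def decrypt(c: int) -> int:
    """Analyst only: needs the private prime P."""
    return 0 if pow(c, (P - 1) // 2, P) == 1 else 1


def jacobi(a: int, n: int) -> int:
    """Jacobi symbol (a/n) for odd n > 0; computable without knowing P and Q."""
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def is_legitimate(c: int) -> bool:
    """Every valid GM ciphertext has Jacobi symbol +1 with respect to N."""
    return 0 < c < N and jacobi(c, N) == 1


def reflip(client_coin: int, rng=random) -> int:
    """Proxy side: XOR the client's hidden bit with a fresh proxy bit."""
    if not is_legitimate(client_coin):
        raise ValueError("rejected: not a valid encrypted bit")
    proxy_bit = rng.randrange(2)
    return (client_coin * encrypt(proxy_bit, rng)) % N


if __name__ == "__main__":
    pool = [reflip(encrypt(random.randrange(2))) for _ in range(16)]
    print(sum(decrypt(c) for c in pool), "ones among 16 pooled coins")
```

The proxy knows its own bit but not the client's, and the client does not know the proxy's bit, so the pooled coin is unbiased as long as either party flipped honestly, and neither party alone knows its value.
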
PDDP Workflow – Step 5
Noisy answers to the analyst
- Each bucket holds the clients' answers plus the coins (noise).
- After a random delay, the proxy shuffles the c + n values.
  - This prevents identification of a client based on the vector of '1's and '0's in its answer.
- Finally, the analyst:
  - decrypts all encrypted binary values with its private key;
  - sums the plaintext values obtained;
  - obtains the noisy count of clients that fall within each bucket.

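The data flow of this step for a single bucket can be simulated on plaintext bits. This is a simplified sketch: the bits stand in for GM ciphertexts, encryption and the random delay are omitted, and the function names are hypothetical. Subtracting n/2 from the sum is the mean adjustment described under Practical Considerations.

```python
import random
from typing import List


def proxy_release(client_bits: List[int], coin_bits: List[int],
                  rng: random.Random) -> List[int]:
    """Mix the c client answers with the n coins and shuffle them together."""
    values = list(client_bits) + list(coin_bits)
    rng.shuffle(values)
    return values


def analyst_estimate(values: List[int], n: int) -> float:
    """Sum the (decrypted) bits and remove the expected coin contribution n/2."""
    return sum(values) - n / 2


if __name__ == "__main__":
    rng = random.Random()
    client_bits = [1] * 40 + [0] * 60                  # true count in this bucket: 40
    coin_bits = [rng.randint(0, 1) for _ in range(16)]  # n = 16 unbiased coins
    released = proxy_release(client_bits, coin_bits, rng)
    print("noisy estimate:", analyst_estimate(released, n=16))
```

After shuffling, the analyst sees only c + n indistinguishable bits per bucket, so it cannot tell which ones are coins or link a bit in one bucket to a bit in another.
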
Practical Considerations (1/2)
Utility of the aggregate result
- Depends on the amount of added noise.
- The sum of the n coins added by the proxy follows a binomial distribution; after the analyst subtracts its mean of n/2, the remaining noise is approximately normal, N(0, n/4).
- Example: c = 10^6, ε = 1.0. Under the normal approximation, in each bucket:
  - with 68% probability the noisy answer is within 15.24 of the true answer;
  - with 95% probability it is within 30.48 of the true answer;
  - with 99.7% probability it is within 45.72 of the true answer.

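The error figures above follow from the N(0, n/4) approximation, whose standard deviation is sqrt(n)/2. The sketch below reproduces them. Note that the expression used in `coins_needed` does not appear in these slides: it is an assumption, chosen because it reproduces both worked examples in the deck (about 929 coins for c = 10^6, ε = 1.0, and 16 coins for c = 250, ε = 5.0).

```python
import math
from typing import Tuple


def coins_needed(c: int, epsilon: float) -> int:
    """ASSUMED formula for n (not stated in the slides):
    n = floor(64 * ln(2c) / epsilon^2) + 1."""
    return math.floor(64 * math.log(2 * c) / epsilon ** 2) + 1


def noise_std(n: int) -> float:
    """Standard deviation of Binomial(n, 1/2), i.e. of N(0, n/4)."""
    return math.sqrt(n) / 2


def error_bounds(n: int) -> Tuple[float, float, float]:
    """Approximate 68% / 95% / 99.7% error bounds (1, 2, and 3 sigma)."""
    s = noise_std(n)
    return (s, 2 * s, 3 * s)


if __name__ == "__main__":
    n = coins_needed(10**6, 1.0)
    print(n, [round(b, 2) for b in error_bounds(n)])   # 929 [15.24, 30.48, 45.72]
    n = coins_needed(250, 5.0)
    print(n, error_bounds(n))                          # 16 (2.0, 4.0, 6.0)
```

The noise depends only logarithmically on c, so with a million clients an error of a few dozen per bucket is small relative to the counts.
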
Practical Considerations (2/2)
Non-numeric queries
- Map the query into a numeric query.
- Example: "Which website do you visit most often?"
  - Map each website the analyst wishes to learn about to a numeric value.
  - This can require a large number of buckets; the answer is limited to 5000 buckets.
Sybils
- The design is susceptible to Sybil attacks (a single client can masquerade as multiple clients).
- The proxy can limit the number of clients selected from a single IP address for a given query.

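Both ideas on this slide can be sketched briefly. This is an illustrative sketch, not the PDDP implementation: `website_answer`, `select_clients`, and the `max_per_ip` parameter are hypothetical names, and the example domains and addresses are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def website_answer(most_visited: str, sites: List[str]) -> List[int]:
    """Non-numeric query: one bucket per website the analyst asks about."""
    return [1 if most_visited == site else 0 for site in sites]


def select_clients(connections: Iterable[Tuple[str, str]],
                   c: int, max_per_ip: int) -> List[str]:
    """Pick the first c unique clients, taking at most max_per_ip per IP address.

    connections yields (client_id, ip_address) pairs in arrival order.
    """
    per_ip: Dict[str, int] = defaultdict(int)
    chosen: List[str] = []
    for client_id, ip in connections:
        if len(chosen) == c:
            break
        if client_id in chosen or per_ip[ip] >= max_per_ip:
            continue
        chosen.append(client_id)
        per_ip[ip] += 1
    return chosen


if __name__ == "__main__":
    print(website_answer("example.org", ["example.com", "example.org"]))  # [0, 1]
    conns = [("a", "192.0.2.1"), ("b", "192.0.2.1"),
             ("c", "192.0.2.1"), ("d", "198.51.100.7")]
    print(select_clients(conns, c=3, max_per_ip=2))  # ['a', 'b', 'd']
```

A per-IP cap only raises the cost of a Sybil attack; clients behind a shared address are also throttled, which is a tradeoff the proxy operator would have to tune.
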
Implementation and Deployment (1/2)
Client
- Firefox add-on.
- 9600 lines of Java code.
- Information is stored in local SQLite storage:
  - web browsing activities;
  - certain online shopping activities;
  - certain ad interactions.
- Can be extended to capture any online activity.
- Every 5 minutes, the client connects to the proxy to retrieve pending queries and to return answers and periodically generated coins.

Implementation and Deployment (2/2)
Proxy
- Web service on Tomcat 6.0.33.
- 3600 lines of code.
- Proxy state is kept in a MySQL database.
Analyst
- 800 lines of code.
Deployment
- Correctness verified on a set of local machines.
- 600+ real clients.

Comparison: "Paillier-based" design
- Honest-but-curious proxy.
- Paillier cryptosystem:
  - Additively homomorphic.
  - The proxy can directly sum all clients' encrypted binary answers to get the encrypted sum for each bucket.
- A single malicious client can substantially distort the result.
  - Zero-knowledge proofs (ZKPs) are used to ensure that encrypted answers are '1' or '0'.
- The proxy knows exactly how much noise has been added.

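A toy Paillier implementation makes the contrast with GM concrete: ciphertexts add up, but nothing in the ciphertext itself forces the plaintext to be a bit. This is a sketch under assumptions (tiny fixed primes, generator g = N + 1, illustrative function names) and requires Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random

P, Q = 499, 547                                       # toy private primes
N = P * Q                                             # public modulus
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)     # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)                                  # inverse of LAM mod N, private


def encrypt(m: int, rng=random) -> int:
    """E(m) = (1 + N)^m * r^N mod N^2, using (1 + N)^m = 1 + m*N mod N^2."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def add(*ciphertexts: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod N)."""
    total = 1
    for c in ciphertexts:
        total = (total * c) % N2
    return total


def decrypt(c: int) -> int:
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N


if __name__ == "__main__":
    honest = [encrypt(1), encrypt(0), encrypt(1)]
    print(decrypt(add(*honest)))                      # 2
    # Without a zero-knowledge proof, one client can submit E(1000):
    print(decrypt(add(*honest, encrypt(1000))))       # 1002
```

The second sum shows why the Paillier-based design needs a ZKP per answer, whereas a GM ciphertext can only ever hold 0 or 1.
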
Evaluation (1/5)
Client performance
- Clients encrypt a binary value for each bucket.
  - GM cryptosystem
  - Paillier cryptosystem

Evaluation (2/5)
Proxy and analyst performance: proxy
- PDDP
  - One encryption and one homomorphic XOR per unbiased coin.
  - Jacobi symbol checking on received coins and answer values (faster than a decryption).
- Paillier-based
  - One ZKP for each client answer in each bucket.
  - Homomorphically sums all clients' answers per bucket.
  - Adds noise to each per-bucket total.

Evaluation (3/5)
Proxy and analyst performance: analyst
- PDDP
  - Decrypts all encrypted values in each bucket.
- Paillier-based
  - Decrypts one encrypted value in each bucket.

Evaluation (4/5)
Bandwidth overhead
- In both systems, a client transmits an encrypted answer for each bucket.
- In PDDP, a client also transmits periodically generated coins to the proxy.
- In the Paillier-based design, a client also transmits a ZKP for each bucket.
Storage overhead
- In PDDP, the proxy stores all clients' answer values for each bucket plus the required number of coins.
- In the Paillier-based design, the proxy stores only one answer value per bucket.

Evaluation (5/5)
Querying the client deployment
- Parameters:
  - c = 250 (out of 600 clients)
  - ε = 5.0
  - Clients are selected as they connect, until 250 unique clients are queried or 24 hours expire.
- These parameters result in 16 coins per bucket.
- This ensures that a per-bucket aggregate answer is within ±2, ±4, and ±6 of the noise-free answer with probability 68%, 95%, and 99.7%, respectively.

Future Work
- Support for statistical learning algorithms.
- Scalability of non-numeric queries.
  - Bloom filters: map a large number of possible answers into a small number of buckets.
- Gather statistical data in a large-scale experiment.
- Weaken the proxy trust requirements.
  - Use of trusted hardware (TPM).
- General: measure the actual privacy loss for differential privacy.

Conclusion
PDDP: Practical Distributed Differential Privacy System
- Scales well.
- Tolerates churn.
- Places a tight bound on a malicious user's capability.
Key insights
- Binary answer in each bucket.
- Blind noise addition.

Questions?