Privacy-preserving Information
Sharing: Tools and Applications
Emiliano De Cristofaro
University College London
https://emilianodc.com
Koc University, Jan 2016
Prologue
Privacy-Enhancing Technologies (PETs):
Increase privacy of users, groups, and/or organizations
PETs often respond to privacy threats
Protect personally identifiable information
Support anonymous communications
Privacy-respecting data processing
Another angle: privacy as an enabler
Actively enabling scenarios otherwise impossible w/o
clear privacy guarantees
2
Sharing Information w/ Privacy
When parties with limited mutual trust are willing or
required to share information
Only the required minimum amount of information should
be disclosed in the process
3
Secure Computation
Alice (a)   ↔   Bob (b)
Both parties learn f(a,b), and nothing else
Map information sharing to f(·,·)?
Realize secure f(·,·) efficiently?
Quantify information disclosure from output of f(·,·)?
4
Private Set Intersection (PSI)
Server Client
S = {s1, …, sw} C = {c1, …, cv}
Private
Set Intersection
S ∩ C
5
PSI Cardinality (PSI-CA)
Server Client
S = {s1, …, sw} C = {c1, …, cv}
PSI-CA
|S ∩ C|
6
PSI w/ Data Transfer (PSI-DT)
Server Client
C = {c1, …, cv}
PSI-DT
 ),(),...,,( 11 ww datasdatasS 
SÇC = (sj,dataj ) $ci Î C :ci = sj{ }
7
PSI w/ Data Transfer
Client Server
8
Authorized Private Set Intersection
Server Client
S = {s1, …, sw} C = {c1, …, cv}
Private
Set Intersection
S ∩ C
What if the client
populates C with its best
guesses for S?
Client needs to prove that inputs satisfy
a policy or be authorized
Authorizations issued by appropriate authority
Authorizations need to be verified implicitly
9
Size-Hiding Private Set Intersection
Server Client
S = {s1, …, sw} C = {c1, …, cv}
Private
Set Intersection
S ∩ C
Client's set size v is hidden from the server
10
Garbled-Circuit Based PSI
Very fast implementations
[HKE12, DCW13, PSZ14], possibly more
Based on pipelining
Exploits advances in OT Extension
Do not support all functionalities
Still high communication overhead
11
Special-purpose PSI
[DT10]: scales efficiently to very large sets
First protocol with linear complexities and fast crypto
[DKT10]: extends to arbitrarily malicious adversaries
Works also for Authorized Private Set Intersection
[DJLLT11]: PSI-based database querying
Won IARPA APP challenge, basis for IARPA SPAR
[DT12]: optimized toolkit for PSI
Privately intersect sets – 2,000 items/sec
[ADT11]: size-hiding PSI
12
Oblivious Pseudo-Random Functions
Server (key k)   Client (input x)
OPRF
Client learns fk(x); the server learns nothing about x
13
OPRF-based PSI
Server (k, S = {sj})   Client (C = {ci})
OPRF
Client obtains fk(ci) for each of its items ci
Server sends fk(sj) for each sj
Unless sj is in the intersection, fk(sj) is indistinguishable from random
14
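The slide's intersection logic can be sketched as follows. This is a minimal simulation: the OPRF fk is stood in for by an HMAC under the server's key (the key and e-mail addresses are illustrative), so it shows the set logic only, not the oblivious evaluation.

```python
import hmac
import hashlib

def prf(key: bytes, item: str) -> bytes:
    # Stand-in for fk(.): a keyed hash. In the real protocol the client
    # evaluates this through an OPRF, so the server never sees its items.
    return hmac.new(key, item.encode(), hashlib.sha256).digest()

k = b"server-secret-key"
server_set = {"alice@example.com", "bob@example.com", "carol@example.com"}
client_set = {"bob@example.com", "dave@example.com"}

# Server sends fk(sj) for each of its items.
server_tags = {prf(k, s) for s in server_set}

# Client obtains fk(ci) for its items (via the OPRF) and intersects locally.
client_tags = {prf(k, c): c for c in client_set}
intersection = {c for tag, c in client_tags.items() if tag in server_tags}
print(intersection)  # {'bob@example.com'}
```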
OPRF from Blind-RSA Signatures
RSA signatures: public key (N = p·q, e), private key d, with e·d ≡ 1 mod (p−1)(q−1)
Sigd(x) = H(x)^d mod N
Ver(Sig(x), x) = 1 ⇔ Sig(x)^e = H(x) mod N
PRF: fd(x) = H(Sigd(x))   (H a one-way function)
Server (d)   Client (x)
Client → Server: a = H(x)·r^e mod N, with r random in ZN
Server → Client: b = a^d mod N   (= H(x)^d·r^(ed) = H(x)^d·r)
Client: Sigd(x) = b/r, then fd(x) = H(Sigd(x))
15
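The blind-RSA round trip above can be traced numerically. A minimal sketch with deliberately tiny, insecure RSA parameters and a simplified hash into ZN; the item name is illustrative.

```python
import hashlib
from math import gcd

# Toy RSA parameters: far too small for real use, illustration only.
p, q = 1000003, 1000033
N, e = p * q, 65537
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)  # server's private signing key

def H(x: str) -> int:
    # Simplified hash into Z_N (a full-domain hash in the real protocol).
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % N

x = "my-secret-item"

# Client: blind H(x) with a random r coprime to N.
r = 123456789
assert gcd(r, N) == 1
a = (H(x) * pow(r, e, N)) % N

# Server: sign the blinded value; it learns nothing about x.
b = pow(a, d, N)  # = H(x)^d * r mod N, since r^(e*d) = r

# Client: unblind to recover the signature, then hash it to get fd(x).
sig = (b * pow(r, -1, N)) % N
assert pow(sig, e, N) == H(x)  # the unblinded signature verifies
fd_x = hashlib.sha256(sig.to_bytes(16, "big")).hexdigest()
```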
Authorized Private Set Intersection (APSI)
Server Client
S = {s1, …, sw} C = {(c1, auth(c1)), …, (cv, auth(cv))}
Authorized Private
Set Intersection
S ∩ C =def { sj ∈ S | ∃ ci ∈ C : ci = sj ∧ auth(ci) is valid }
Court
16
OPRF w/ Implicit Signature Verification
Server (k)   Client (x, sig(x))
OPRF with ISV
Client obtains fk(x) if Ver(sig(x), x) = 1, ⊥ otherwise
17
A simple OPRF-like with ISV
Court issues authorizations: Sig(x) = H(x)^d mod N
OPRF: fk(x) = F(H(x)^(2k) mod N)
Server (k)   Client (H(x)^d)
Client → Server: a = H(x)^d·g^r, with r random in ZN
Server → Client: b = a^(2ek) and g^k
(b = H(x)^(2edk)·g^(2rek) = H(x)^(2k)·g^(2rek) mod N)
Client: H(x)^(2k) = b/g^(2erk), then fk(x) = F(H(x)^(2k))   (implicit verification)
18
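The blinded exchange above can also be traced numerically. A minimal sketch with toy, insecure parameters; the outer hash stands in for the function F, and all concrete values (k, g, r, the e-mail address) are illustrative.

```python
import hashlib

# Toy parameters, illustration only (insecure sizes).
p, q = 1000003, 1000033
N, e = p * q, 65537
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)  # court's signing key
k = 424242           # server's OPRF key
g = 5                # public element of Z_N*

def H(x: str) -> int:
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % N

x = "alice@example.com"
sig = pow(H(x), d, N)  # authorization issued by the court: H(x)^d mod N

# Client: blind the authorization with g^r.
r = 987654321
a = (sig * pow(g, r, N)) % N

# Server: raise to 2ek and reveal g^k.
b = pow(a, 2 * e * k, N)  # = H(x)^(2k) * g^(2rek) mod N
gk = pow(g, k, N)

# Client: unblind with (g^k)^(2er). A forged signature would leave
# a random-looking value here; that is the implicit verification.
val = (b * pow(pow(gk, 2 * e * r, N), -1, N)) % N
assert val == pow(H(x), 2 * k, N)
fk_x = hashlib.sha256(val.to_bytes(16, "big")).hexdigest()  # F = hash
```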
OPRF with ISV – Malicious Security
OPRF: fk(x) = F(H(x)^(2k))
Server (k)   Client (H(x)^d)
Client → Server: a = H(x)^d·g^r and a' = H(x)·(g')^r, with r random in ZN
π = ZKPK{ r : a^(2e)/a'^2 = (g^e/g')^(2r) }
Server → Client: b = a^(2ek) and g^k
π' = ZKPK{ k : b = a^(2ek) }
(b = H(x)^(2edk)·g^(2rek) mod N)
Client: H(x)^(2k) = b/g^(2erk), then fk(x) = F(H(x)^(2k))
19
Other Building Blocks
[DGT12]: Private Set Intersection Cardinality-only
[BDG12]: Private Sample Set Similarity
[DFT13]: Private Substring/Pattern Matching
20
Cool! So what?
21
22
Genomics…
23
24
25
26
The Bad News
Sensitivity of human genome:
Uniquely identifies an individual
Discloses ethnicity, disease predispositions (including mental)
Progress aggravates fears of discrimination
Once leaked, it cannot be “revoked”
De-identification and obfuscation are not effective
More info:
[ADHT13] Chills and Thrills of Whole Genome Sequencing. IEEE
Computer Magazine.
27
Secure Genomics?
Privacy:
Individuals remain in control of their genome
Allow doctors/clinicians/labs to run genomic tests, while
disclosing the required minimum amount of information, i.e.:
(1) Individuals don’t disclose their entire genome
(2) Testing facilities keep test specifics (“secret sauce”)
confidential
[BBDGT11]: Secure genomics via *-PSI
Most personalized medicine tests in < 1 second
Works on Android too
28
Private RFLP-based Paternity Test
Private Set Intersection Cardinality
Test result: number of fragments with the same length
29
individual (genome)   ↔   doctor or lab (test specifics)
Secure Function Evaluation
test result (learned by both)
• Private Set Intersection (PSI)
• Authorized PSI
• Private Pattern Matching
• Homomorphic Encryption
• Garbled Circuits
• […]
Output reveals nothing beyond
test result
• Paternity/Ancestry Testing
• Testing of SNPs/Markers
• Compatibility Testing
• Disease Predisposition […]
30
Open Problems
Where do we store genomes?
Encryption can’t guarantee security past 30-50 yrs
Reliability and availability issues?
Cryptography
Efficiency overhead
Data representation assumptions
How much understanding is required from users?
31
Collaborative Anomaly Detection
Anomaly detection is hard
Suspicious activities deliberately mimic normal behavior
But malevolent actors often use the same resources
Wouldn’t it be better if organizations
collaborated?
It’s a win-win, no?
“It is the policy of the United States
Government to increase the volume, timeliness,
and quality of cyber threat information shared
with U.S. private sector entities so that these
entities may better protect and defend
themselves against cyber attacks.”
Barack Obama
2013 State of the Union Address
32
Problems with Collaborations
Trust
Will others leak my data?
Legal Liability
Will I be sued for sharing customer data?
Will others find me negligent?
Competitive concerns
Will my competitors outperform me?
Shared data quality
Will data be reliable?
33
Solution Intuition [FDB15]
Company 1 (data up to tnow)   Company 2 (data up to tnow)
Sharing Information w/ Privacy
Securely assess the benefits of sharing
Securely assess the risks of sharing
→ Better Analytics
34
Training Machine Learning Models
The Big Data “Hype”
Large-scale collection of contextual information often
essential to gather statistics, train machine learning models,
and extract knowledge from data
Doing so privately…
35
Efficient Private Statistics [MDD16]
Real-world problems:
1. Recommender systems for online streaming services
2. Statistics about mass transport movements
3. Traffic statistics for the Tor Network
Available tools for computing private statistics are
impractical for large-scale stream collection
Intuition: Approximate statistics are acceptable in
some cases?
36
Preliminaries: Count-Min Sketch
Estimates an item’s frequency in a stream
Maps a stream of values (of length T) into a matrix of size
O(log T)
The sum of two sketches results in the sketch of the union of
the two data streams
37
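The three properties listed above (frequency estimation, compact size, merging by summation) can be sketched as a small class; the width/depth parameters and the TV-show stream are illustrative.

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; size is independent of the stream length."""
    def __init__(self, w=64, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _col(self, row, item):
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # Never underestimates; overestimates only on hash collisions.
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.d))

    def merge(self, other):
        # Cell-wise sum = the sketch of the union of the two streams.
        for row in range(self.d):
            for col in range(self.w):
                self.table[row][col] += other.table[row][col]

s1, s2 = CountMinSketch(), CountMinSketch()
for show in ["news", "news", "drama"]:
    s1.add(show)
for show in ["news", "comedy"]:
    s2.add(show)
s1.merge(s2)
print(s1.estimate("news"))  # at least 3; exact unless hashes collide
```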
ItemKNN Recommender Systems
Predict users’ favorite TV programs based on their own
ratings and those of “similar” users
Consider N users, M programs and binary ratings
Build a co-views matrix C, where Cab is the number of
views for the pair of programs (a,b)
Compute the Similarity Matrix
Identify K-Neighbors based on the Similarity Matrix
38
Private Recommender System
We build a global matrix of co-views for training
ItemKNN in a privacy-friendly way by relying on:
Private data aggregation based on [Kursawe et al. 2011]
Count-Min Sketch to reduce overhead
System Model
Users (in groups)
Tally Server (e.g., the BBC)
39
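The private data aggregation underlying the system can be sketched with pairwise-cancelling masks, the idea behind [Kursawe et al. 2011]. This is a simplified model: the real scheme derives the masks non-interactively from shared keys, while here they are plain random numbers.

```python
import random

# Each pair of users (i, j) shares a mask: user i adds it, user j
# subtracts it, so all masks cancel in the sum.
M = 2**31               # working modulus
n = 4                   # group size
values = [3, 1, 4, 1]   # each user's private contribution

pair_mask = {(i, j): random.randrange(M)
             for i in range(n) for j in range(i + 1, n)}

def blinded_report(i):
    b = values[i]
    for j in range(n):
        if i < j:
            b += pair_mask[(i, j)]
        elif j < i:
            b -= pair_mask[(j, i)]
    return b % M

# The tally server sees only blinded values, which look random...
reports = [blinded_report(i) for i in range(n)]
# ...yet their sum reveals exactly the aggregate and nothing else.
total = sum(reports) % M
print(total)  # 9
```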
Security & Implementation
Security
In the honest-but-curious model under the CDH assumption
Prototype implementation:
Tally as a Node.js web server
Users run in the browser or as a cross-platform mobile
application (Apache Cordova)
Transparency, ease of use, ease of deployment
40
41
(Performance plots: user side / server side)
Accuracy
42
Challenges Ahead…
43
This slide is intentionally left blank

Editor's Notes

  • #13  using pipelining and multi-threading
  • #16 There are a few possible ways to instantiate OPRFs and apply them to PSI. I’d like to show one that is relatively simple and, incidentally, turns out to yield the most efficient PSI protocol. It’s based on Blind RSA. First, notice that, in the random oracle model, the hash of a signature is a PRF. Therefore, we can construct an OPRF using Blind RSA Signatures, whereby the client obtains, from the server, a signature on a message x without disclosing x and without the server disclosing the RSA private signing key to the client. We can do so with this simple two-round protocol.
  • #20 ZKPK of equality…
  • #21  using pipelining and multi-threading
  • #24 So progress in sequencing can help researchers gain a better understanding of diseases and their relationship with genetic features, but it is also already bearing fruit in the clinical setting as we hear more and more news of how doctors have been able to diagnose and/or cure patients thanks to whole genome sequencing
  • #25 Another important point I’d like to make is that the genomics revolution is not only happening within the walls of our research labs and our hospitals. Take as an example the case of Angelina Jolie, who was diagnosed … her choice of undergoing … really put genetic testing in the spotlight, if only for the debate it generated
  • #26 Another important phenomenon is the emergence of direct-to-consumer genomics, i.e., genetic tests that are marketed directly to customers, without necessarily involving a doctor or a health professional. One of the most popular companies in this market is 23andme, a US company now operating in the UK as well, that provides individuals with reports on a number of genetic traits, inherited risk factors, etc.
  • #28 Not clear if cryptographic techniques can scale to (full) genomes. The ultimate ID: invariant, the most precise way of authentication; it survives time, even death
  • #30 Different sizes
  • #31 Users keep sequenced genomes (encrypted), but still allow doctors/clinicians to run tests. Only disclose the minimum amount of information, e.g., privacy-preserving versions of paternity, ancestry, genealogy, and personalized medicine tests