Privacy-Preserving Data Analysis
Adria Gascon
The Alan Turing Institute & Warwick University
Based on joint work with Borja Balle, Phillipp Schoppmann,
Mariana Raykova, Jack Doerner, Samee Zahur, David
Evans, Age Chapman, Alan Davoust, Peter Buneman
What analysis on what data?

Fine-grained private data, e.g. tracking for targeted
advertising, credit scoring...

Data held by several organisations, e.g. hospitals?

Data held by individuals, e.g. on their phones?
Who cares?

Data owners (of course)

Data controllers
Adria Gascon Phillipp Schoppmann Borja Balle
Mariana Raykova Jack Doerner Samee Zahur David Evans
Privacy Preserving
Distributed Linear Regression on
High-Dimensional Data
Motivation
Treatment
Outcome
Medical Data
Census Data
Financial Data
Outcome  Atr. 1  Atr. 2  …   Atr. 4  Atr. 5  …   Atr. 7  Atr. 8  …
-1.0     0       54.3    …   North   34      …   5       1       …
1.5      1       0.6     …   South   12      …   10      0       …
-0.3     1       16.0    …   East    56      …   2       0       …
0.7      0       35.0    …   Centre  67      …   15      1       …
3.1      1       20.2    …   West    29      …   7       1       …
Note: this is vertically partitioned data; similar problems arise with horizontally partitioned data
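As a toy illustration of the two partitioning styles (the column names and values here are hypothetical, not taken from any real dataset):

```python
# The same logical table, partitioned two ways.
# Records are linked by a shared row identifier.
rows = [
    {"id": 0, "treatment": 0, "income": 54.3, "outcome": -1.0},
    {"id": 1, "treatment": 1, "income": 0.6,  "outcome": 1.5},
]

# Vertical partitioning: each party holds different columns for all rows.
hospital = [{"id": r["id"], "treatment": r["treatment"]} for r in rows]
bank     = [{"id": r["id"], "income": r["income"]}       for r in rows]

# Horizontal partitioning: each party holds all columns for some rows.
site_a, site_b = rows[:1], rows[1:]
```

In the vertical case no single party can compute a model over all features on its own, which is exactly the setting the linear regression system below targets.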
Private Multi-Party Machine Learning
Assumptions
• Parameters of the model will be received by all parties
• Parties can engage in online secure communications
• External parties might be used to outsource
computation or initialize cryptographic primitives
Problem
• Two or more parties want to jointly learn a model of
their data
• But they can’t share their private data with other parties
The Trusted Party “Solution”
[Diagram: each party sends its data to the Trusted Party over a secure channel.]
The Trusted Party receives the plaintext data, runs the
algorithm, and returns the result to the parties
The Trusted Party assumption:
• Introduces a single point of failure
• Relies on weak incentives
• Requires agreement between all data providers
=> Useful but unrealistic. Maybe it can be simulated?
Secure Multi-Party Computation (MPC)
Public: a function f
Private: input xᵢ (held by party i)
Goal:
Compute y = f(x₁, …, xₙ) in a way that each party
learns y (and nothing else!)
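The simplest instance of this idea is additive secret sharing, sketched below for a sum (a minimal illustration, not the protocol used in any of the systems on these slides):

```python
import random

PRIME = 2 ** 61 - 1  # field modulus; the exact choice is illustrative

def share(secret, n_parties):
    """Split a secret into n_parties additive shares modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Each party secret-shares its input; shares are added position-wise,
# so reconstructing reveals only the sum, not any individual input.
inputs = [42, 17, 99]
all_shares = [share(x, 3) for x in inputs]
sum_shares = [sum(col) % PRIME for col in zip(*all_shares)]
print(reconstruct(sum_shares))  # 158
```

Any strict subset of the shares of a secret is uniformly random, so it leaks nothing about the underlying input.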
Our Contribution
A PMPML system for vertically partitioned linear regression
Features:
• Scalable to millions of records and hundreds of dimensions
• Formal privacy guarantees (semi-honest security)
• Open-source implementation
Tools:
• Combine standard MPC constructions (GC, OT, TI, …)
• Efficient private inner-product protocols
• Conjugate gradient descent robust to fixed-point encodings
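Fixed-point encodings matter here because MPC protocols operate over integers, not reals. A minimal sketch of the standard encode/multiply/truncate pattern (the scale is an illustrative choice, not the parameters used in the paper):

```python
SCALE = 2 ** 16  # number of fractional bits; an illustrative choice

def encode(x):
    """Encode a real number as a scaled integer."""
    return int(round(x * SCALE))

def decode(v):
    return v / SCALE

def fp_mul(a, b):
    # The raw product carries a factor of SCALE**2; truncating one
    # factor restores the encoding, at the cost of a small rounding error.
    return (a * b) // SCALE

product = fp_mul(encode(1.5), encode(2.25))
print(decode(product))  # 3.375
```

These truncation errors accumulate across iterations, which is why the gradient solver has to be made robust to them.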
FAQ: Why is PMPML…
Exciting?
Can provide access to previously ”locked” data
Hard?
Privacy is tricky to formalize, hard to implement,
and inherently interdisciplinary
Worth it?
Better models while avoiding legal risks and bad PR
Read It, Use It
https://github.com/schoppmp/linreg-mpc
http://eprint.iacr.org/2016/892 (PETS’17)
Adria Gascon Phillipp Schoppmann Borja Balle
Private Document Classification in
Federated Databases
Secure document classification
Adria Gascon James Bell Tejas Kulkarni
Privacy-Preserving Distributed
Hypothesis Testing
● Drop-off in Manhattan?
● Tip over 25%?
● Was it a short journey?
● Was the payment method credit card?
Drop-off in Manhattan and tip over 25%
are significantly correlated events.
But this result is differentially private, so I cannot easily tell
if a given journey was included in the training dataset or not.
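The membership protection alluded to above is the differential privacy guarantee. As a generic illustration (the Laplace mechanism on a counting query, not the specific hypothesis-testing protocol of this work):

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) via the inverse-CDF method."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    # A counting query has sensitivity 1, so noise with scale
    # 1/epsilon gives epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)
```

Because the released statistic is randomized, the presence or absence of any one record (here, one taxi journey) changes the output distribution only by a factor of at most e^ε.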
Problem: model-check security properties on
private source code.
Privacy-Preserving Model Checking
● Problem: check security properties on (private)
source code.
● “Public” equivalent: MOPS [1], and some others.
– Security property expressed as a regular expression over
sequences of instructions
– Find all paths in the control flow graph that match the property
● Application of Private Regular Path Queries
[1] Hao Chen and David Wagner. 2002. MOPS: an infrastructure for examining security properties of software.
In Proceedings of the 9th ACM conference on Computer and communications security (CCS '02), Vijay Atluri
(Ed.). ACM, New York, NY, USA, 235-244. DOI=http://dx.doi.org/10.1145/586110.586142
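The public (non-private) version of this check is a product construction: walk the control flow graph while driving the property FSA, and report whether an accepting ("bad") state is reachable. A small sketch, with a hypothetical CFG and property loosely modelled on the drop_priv example (not MOPS's actual data structures):

```python
# CFG edges are labelled with the call they execute ("" for no call).
cfg = {
    "entry":     [("getuid", "n1")],
    "n1":        [("", "call_exec"), ("seteuid", "n2")],  # may skip the drop!
    "n2":        [("", "call_exec")],
    "call_exec": [("exec", "exit")],
    "exit":      [],
}

# Property FSA: reaching state "bad" means exec() runs with root privilege.
fsa = {
    ("root", "seteuid"):    "dropped",
    ("root", "exec"):       "bad",
    ("dropped", "exec"):    "dropped",
}

def violating(cfg, fsa, start=("entry", "root")):
    """Depth-first search of the CFG x FSA product graph."""
    seen, stack = set(), [start]
    while stack:
        node, state = stack.pop()
        if state == "bad":
            return True
        if (node, state) in seen:
            continue
        seen.add((node, state))
        for label, succ in cfg[node]:
            nxt = fsa.get((state, label), state) if label else state
            stack.append((succ, nxt))
    return False

print(violating(cfg, fsa))  # True: a path reaches exec() without seteuid()
```

The privacy-preserving variant must run this same reachability query when the graph itself is secret-shared across code owners.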
Privacy-Preserving Model Checking
Secure queries on graph data
Simple Example
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <pwd.h>

void drop_priv()
{
    struct passwd *passwd;

    if ((passwd = getpwuid(getuid())) == NULL)
    {
        printf("getpwuid() failed\n");
        return;
    }
    printf("Drop user %s's privilege\n", passwd->pw_name);
    seteuid(getuid());
}

int main(int argc, char *argv[])
{
    drop_priv();
    printf("About to exec\n");
    /* ... */
}
hello.c
Simple Example
Control flow graph Security property FSA
(system call with root privilege)
Interesting case:
distributed private graph (code)
main.c library.c
Related Work
Verification Across Intellectual Property Boundaries [2]:
[2] Chaki, Sagar, Christian Schallhart, and Helmut Veith. "Verification across intellectual property boundaries."
ACM Transactions on Software Engineering and Methodology (TOSEM) 22.2 (2013): 15.
Related Work
Verification Across Intellectual Property Boundaries [2]
They also say...
“While we are aware of advanced methods such as secure multiparty computation
[Goldreich 2002] and zero-knowledge proofs [Ben-Or et al. 1988], we believe that they are
impracticable for our problem, as such methods cannot be easily wrapped over given
validation tools. Finally, we believe that any advanced method without an intuitive proof for
its secrecy will be heavily opposed by the supplier—and might therefore be hard to
establish in practice.”
Case study: thttpd
● Tiny HTTP server
● 2 main modules (thttpd.c and libhttpd.c)
thttpd.c (2k LOC)
libhttpd.c (4k LOC)
thttpd control flow graph...
● 2 main modules only
● Functions are disconnected
thttpd: next steps
● Adapt the private Regular Path Queries work for
pushdown automata
● Find some bugs.
● Write paper.
● Voilà!
Thanks!
