20 JULY 2021
BUILDING AI WITH SECURITY AND PRIVACY IN MIND
GEETA CHAUHAN
PyTorch Partner Engineering, Facebook AI
@CHAUHANG
CTO Connection 2021
AGENDA
01 PRIVACY CHALLENGES IN AI
02 PRIVACY PRESERVING ML
03 TOOLS & TECHNIQUES
04 STEPS FOR STARTING YOUR JOURNEY
Privacy Challenges in AI
Centralized AI is like Closed Source of the 90s
PRIVACY CHALLENGES IN AI
• Privacy tradeoff between protecting data privacy and training AI/ML models
• Tensions associated with data minimization and retention
• Proliferation of AI/ML models heightens the lack of public understanding
• As artificial intelligence evolves, it magnifies the ability to use information in ways that can intrude on privacy interests
• Increasingly sensitive nature of data used for research raises other privacy challenges
• Sourcing of data that is free of bias
PRIVACY CHALLENGES
Privacy Preserving ML Techniques
PRIVACY PRESERVING ML TECHNIQUES
HOMOMORPHIC ENCRYPTION
[Diagram labels: Data x; Function f(.); Encrypted Data c; Encrypted Output c']
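The slide names no particular scheme, so the following is only a minimal sketch of the idea that a function can be evaluated on encrypted data c to give an encrypted output c'. It assumes a toy Paillier cryptosystem, which is additively homomorphic only, with tiny hard-coded primes that are not secure. All function names are illustrative.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Toy Paillier keys. ASSUMPTION: demo-sized primes, NOT secure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)  # valid shortcut because g = n + 1
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)

def scalar_mul(pub, c, k):
    """Raising a ciphertext to k multiplies the plaintext by k."""
    n, _ = pub
    return pow(c, k, n * n)

if __name__ == "__main__":
    pub, priv = keygen()
    c1, c2 = encrypt(pub, 20), encrypt(pub, 22)
    print(decrypt(pub, priv, add(pub, c1, c2)))  # 42, computed without decrypting c1 or c2
```

Here f(.) is limited to sums and multiplication by known constants. Evaluating arbitrary functions needs a fully homomorphic scheme, which this sketch does not attempt.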
DIFFERENTIAL PRIVACY
Promise, made by a data holder, or a curator, to a data subject:
“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available”
~ Cynthia Dwork
DIFFERENTIAL PRIVACY
(ε, δ)-differential privacy:

$$\forall D, D' \text{ that differ in one person's data},\ \forall x:\quad \mathbb{P}[M(D) = x] \le \exp(\varepsilon) \cdot \mathbb{P}[M(D') = x] + \delta$$

The distribution of the output M(D) on database D is (nearly) the same as M(D′), where D and D′ differ in one person’s contributions.
• Parameter ε quantifies information leakage
• Parameter δ gives some slack
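One common way to satisfy the definition above with δ = 0 is the Laplace mechanism. The slides do not name a mechanism, so treat this as an illustrative sketch: a counting query changes by at most 1 when one person's data changes (sensitivity 1), so adding Laplace noise with scale 1/ε gives ε-differential privacy. The function names and the sample data are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Mechanism M: noisy count of the records matching `predicate`."""
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67, 29]  # hypothetical data set D
    # Smaller epsilon means more noise and less information leakage.
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_count(ages, lambda a: a >= 40, eps))
```

The released value is random, so repeated queries consume additional privacy budget. Tracking that budget is outside the scope of this sketch.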
DIFFERENTIAL PRIVACY
[Diagram labels: Data corrupted with noise; Function f(.); Data corrupted with noise]
SECURE MULTI-PARTY COMPUTATION
[Diagram labels: Secret share of Data x (×2); Jointly compute function f(.) (×2); Random numbers (×2); Trusted Third Party]
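The "secret share of Data x" in the diagram can be illustrated with additive secret sharing. This is a sketch under assumptions, not the protocol of any specific library: each private input is split into random shares that sum to the input modulo a fixed modulus, and each party only ever sees one share of every input. Here f(.) is a sum, and the modulus and function names are illustrative.

```python
import secrets

MODULUS = 2**61 - 1  # ASSUMPTION: any modulus larger than the true result works

def share(value, n_parties):
    """Split `value` into n shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

def secure_sum(private_inputs):
    """Jointly compute f(x_1..x_n) = sum, with one input per party."""
    n = len(private_inputs)
    # all_shares[i][j] is the share of party i's input that is sent to party j.
    all_shares = [share(x, n) for x in private_inputs]
    # Party j adds up the shares it received; a single share reveals nothing.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # Only the partial sums are published and combined.
    return reconstruct(partial_sums)

if __name__ == "__main__":
    print(secure_sum([10, 20, 12]))  # 42
```

Addition needs no help from a third party. Multiplication of shared values is where the pre-distributed random numbers from the diagram come in.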
ON DEVICE COMPUTATION
[Diagram labels: Function f(.); Data x; Train / Evaluate]
TRUSTED EXECUTION ENVIRONMENT
[Diagram labels: Data x; Encrypted Data c; Function f(.) in an enclave; Output and function attestation]
FEDERATED LEARNING
Federated Learning enables devices to collaboratively train global models with privacy by default
Federated Learning steps (one diagram per slide; Clients holding private data, and a Server):
1. Check-in: clients that meet the eligibility criteria ask the server "Need me?"
2. Selection: the server selects a subset of devices ("Yes!" / "Not now...")
3. Current model distribution: model weights and code are sent to the selected clients
4. On-device model training: each selected client trains and produces an updated model
5. Model update sharing: clients send a model delta; focused collection, deltas are ephemeral
6. Global model optimization: weighted delta aggregation; optimizing current model using weighted delta
7. Monitor and snapshot progress; repeat until model converges
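The training, update-sharing and aggregation steps of the federated learning round can be sketched as follows. It is a simplified illustration, assuming a one-parameter least-squares model y ≈ w·x, deltas weighted by each client's number of examples, and every client taking part in every round. The function names and data are hypothetical.

```python
def client_delta(global_w, data, lr=0.1, epochs=5):
    """On-device training: a few gradient steps, returning only the model delta."""
    w = global_w
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w - global_w  # the raw data never leaves the client

def aggregate(deltas, num_examples):
    """Weighted delta aggregation on the server."""
    total = sum(num_examples)
    return sum(d * n for d, n in zip(deltas, num_examples)) / total

def federated_training(clients, rounds=20, server_lr=1.0):
    global_w = 0.0
    for _ in range(rounds):
        deltas = [client_delta(global_w, data) for data in clients]
        counts = [len(data) for data in clients]
        # Optimize the current model using the weighted delta.
        global_w += server_lr * aggregate(deltas, counts)
    return global_w

if __name__ == "__main__":
    # Hypothetical private datasets, all drawn from y = 3x.
    clients = [
        [(1.0, 3.0), (2.0, 6.0)],
        [(1.0, 3.0)],
        [(0.5, 1.5), (1.5, 4.5), (1.0, 3.0)],
    ]
    print(federated_training(clients))  # approaches 3.0
```

A production system would also handle check-in, eligibility and client selection, and would stop on a convergence criterion instead of a fixed number of rounds.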
PRIVACY PRESERVING ML TECHNIQUES
Homomorphic Encryption
• Encrypted data, encrypted computations
• Zero-knowledge proof intermediate results
• May leak information when output is revealed
Differential Privacy
• Low-probability guarantees on the output
• Prevents linkage attacks
Secure MPC
• Zero-knowledge proof intermediate results
• No information is leaked through the transcript of a computation
• May leak information when other parties' output is revealed
On-device computation
• Local data privacy
• Limitations due to computation or memory on device
• Reduced ability to aggregate data from multiple devices
Trusted Execution Environments
• Hardware-isolated environment, limited in memory
• Securing data and models
• Remote attestation
Federated Learning
• Decentralized, training takes longer, heterogeneous
• Aggregate data from multiple devices / datasets, without revealing data
• Network transmission costs high for model downloads, gradient updates
WHAT CAN BE SEEN?
[Diagram labels: Training; Server; Inference; Intermediate model state; Encrypted model update; Ephemeral model update; Final model state; Intermediate model state; "Can we improve?"]
END TO END PRIVACY PRESERVING SYSTEM
|  | Federated Learning (FL) | FL + Secure Enclaves + DP + Secure Aggregation |
| Device | Intermediate Model State; still prone to remembering | Clients get the secure enclave private key; client clips the model updates before adding random mask; Model with DP noise |
| Server | Ephemeral Model Updates; compromised server, can leak details | Logic for computing the sum of the masks and the DP noise inside Secure enclaves with attestation; Only-In-Aggregate Model Updates |
| Network | Encrypted Model Updates | Randomly masked Encrypted Ephemeral Model Updates |
| Developer | Intermediate Model State | Intermediate Model State with DP |
| Consumer/World | Final Model State | Final Model State - Model with user level DP |
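The combined column (clients clip and mask their updates, and the sum of the masks and the DP noise is computed in an enclave) can be approximated in plain Python. This is a heavily simplified sketch under stated assumptions. The "enclave" is an ordinary function, with no attestation or encryption. Updates use a fixed-point encoding so that masks cancel exactly. The noise scale is passed in rather than calibrated. All names are hypothetical.

```python
import math
import random

SCALE = 10**6   # ASSUMPTION: fixed-point precision for encoding floats
MOD = 2**64     # masks and sums live in the integers modulo MOD

def clip(update, max_norm):
    norm = math.sqrt(sum(v * v for v in update))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in update]

def encode(vec):
    return [round(v * SCALE) % MOD for v in vec]

def decode(vec):
    return [((v - MOD) if v >= MOD // 2 else v) / SCALE for v in vec]

def client_message(update, max_norm, rng):
    """Client clips its update, then adds a random mask."""
    clipped = encode(clip(update, max_norm))
    mask = [rng.randrange(MOD) for _ in clipped]
    masked = [(c + m) % MOD for c, m in zip(clipped, mask)]
    return masked, mask  # masked goes to the server, mask goes to the enclave

def enclave_unmask_term(masks, noise_std, rng):
    """Stand-in for enclave logic: sum of the masks, minus fresh DP noise."""
    dim = len(masks[0])
    mask_sum = [sum(m[k] for m in masks) % MOD for k in range(dim)]
    noise = encode([rng.gauss(0.0, noise_std) for _ in range(dim)])
    return [(s - z) % MOD for s, z in zip(mask_sum, noise)]

def server_aggregate(masked_updates, unmask_term):
    """Server only learns the noisy sum, never an individual update."""
    dim = len(unmask_term)
    total = [sum(u[k] for u in masked_updates) % MOD for k in range(dim)]
    return decode([(t - s) % MOD for t, s in zip(total, unmask_term)])

if __name__ == "__main__":
    rng = random.Random()
    updates = [[3.0, 4.0], [0.1, -0.2], [-6.0, 8.0]]
    msgs = [client_message(u, max_norm=1.0, rng=rng) for u in updates]
    term = enclave_unmask_term([m for _, m in msgs], noise_std=0.05, rng=rng)
    print(server_aggregate([masked for masked, _ in msgs], term))
```

Each masked update on its own is uniformly random, so the server sees updates only in aggregate. Clipping bounds any single client's influence, which is what lets the added noise support a user-level DP guarantee.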
Tools & Techniques
PRIVACY BY DESIGN
AI Broad Guidelines and Considerations
• Opt-In vs. Opt-Out: Making opt-in the default approach.
• Data Minimization: Collecting only the data that is needed.
• Limited Data Retention: Limiting the amount of time that data is kept.
• Transparency and Education: Ensure consumers are aware of processes that use their data.
• Privacy Review Boards: Ensure consumer privacy is prioritized across the organization.
• Responsible AI Principles: Committing AI development to principles that the company abides by.
PRIVACY BY DESIGN
• Understand: How might the product’s goals, its policy, and its implementation affect users from different subgroups? Identify contextual definitions of fairness
• Align: Stakeholder conversations to find consensus and outline measurement and mitigation plans
• Measure: Analyze model performance, label bias, outcomes, and other relevant signals
• Mitigate: Address observed issues in dataset, models, policies, etc.
• Monitor: Monitor effect of mitigations on subgroups, and ensure fairness analysis holds as product adapts
TOOLS & LIBRARIES
CrypTen is a platform for research in machine learning + MPC
• BGW + Beaver triples
• PyTorch-based
• Reverse-mode autograd
• GPU support
• import torch → import crypten as torch
• Designed to expose MPC in an API familiar to ML researchers that use PyTorch
https://crypten.ai/
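The "Beaver triples" bullet refers to a standard trick for multiplying two secret-shared values. The sketch below is a two-party illustration in plain Python. It does not use the CrypTen API, and the modulus, the dealer function and all names are assumptions made for the example. A dealer hands out shares of random a, b and c = a·b. The parties then open only the masked values x − a and y − b, which reveal nothing about x or y.

```python
import secrets

P = 2**61 - 1  # ASSUMPTION: a prime modulus chosen for the demo

def share(x):
    """Split x into two additive shares modulo P."""
    s0 = secrets.randbelow(P)
    return s0, (x - s0) % P

def reveal(s0, s1):
    return (s0 + s1) % P

def beaver_triple():
    """Dealer samples a, b and returns shares of a, b and c = a * b."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)

def multiply(x_sh, y_sh):
    """Return shares of x * y given shares of x and y."""
    a_sh, b_sh, c_sh = beaver_triple()
    # Each party masks its shares locally; only the masked values are opened.
    eps = reveal((x_sh[0] - a_sh[0]) % P, (x_sh[1] - a_sh[1]) % P)  # x - a
    dlt = reveal((y_sh[0] - b_sh[0]) % P, (y_sh[1] - b_sh[1]) % P)  # y - b
    # x*y = c + eps*b + dlt*a + eps*dlt; the public term is added by party 0 only.
    z0 = (c_sh[0] + eps * b_sh[0] + dlt * a_sh[0] + eps * dlt) % P
    z1 = (c_sh[1] + eps * b_sh[1] + dlt * a_sh[1]) % P
    return z0, z1

if __name__ == "__main__":
    z = multiply(share(6), share(7))
    print(reveal(*z))  # 42
```

Each triple is used for one multiplication only. Reusing a triple would let the opened values leak information about the inputs.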
Opacus: library that enables training PyTorch models with Differential Privacy
• PyTorch-based
• Instantiate Privacy Engine and attach to Optimizer
• Vectorized per-sample gradient computation that is 10x faster than microbatching
• Cryptographically safe pseudo-random number generator
• Extensible API
https://opacus.ai/
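Per-sample gradients matter because differentially private SGD clips each example's gradient before adding noise. The following is a plain-Python sketch of one such step, not the Opacus API. The parameter names `max_grad_norm` and `noise_multiplier` are illustrative, and choosing them to hit a target (ε, δ) needs a privacy accountant that is not shown here.

```python
import math
import random

def clip_by_norm(grad, max_norm):
    """Scale a single example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_step(weights, per_sample_grads, lr, max_grad_norm,
                noise_multiplier, rng):
    """One DP-SGD update: clip per sample, sum, add Gaussian noise, average."""
    dim = len(weights)
    summed = [0.0] * dim
    for grad in per_sample_grads:
        clipped = clip_by_norm(grad, max_grad_norm)
        for k in range(dim):
            summed[k] += clipped[k]
    # Noise is scaled to the clipping bound, i.e. to one example's maximum influence.
    std = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    batch_size = len(per_sample_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]

if __name__ == "__main__":
    rng = random.Random()
    grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-sample gradients
    print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_grad_norm=1.0,
                      noise_multiplier=1.1, rng=rng))
```

Computing one gradient per example in a naive loop is slow, which is why a vectorized per-sample gradient implementation is a selling point of the library.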
“The mission of the OpenMined community is to create an accessible ecosystem of privacy tools and education. We do this by extending popular libraries like PyTorch with advanced techniques in cryptography and differential privacy.”
“With OpenMined, people and organizations can host private datasets, allowing data scientists to train or query on data they "cannot see". The data owners retain complete control: data is never copied, moved, or shared.”
Remote Execution, Encrypted Computation, Differential Privacy
PySyft, PyGrid, Duet, TenSEAL…
https://www.openmined.org/
OTHER INDUSTRY SOLUTIONS
Private Federated Learning
Azure Confidential Computing
Getting Started Resources
WHERE TO START YOUR JOURNEY?
Level 1: Just Starting
• Intro course from courses.openmined.org
• Simple sample with Confidential VMs
Level 2: Intermediate
• Experiment with server-side FL, DP, Secure MPC
• Use tools like OpenMined, CrypTen, Opacus
Level 3: Advanced
• Experiment with Secure Enclaves, combine multiple techniques
• Experiment with on-device training for decentralized distributed ML
Level 4: Mature
• Advanced techniques for large scale on-device training HSL, VSL
• Solutions for adversarial attacks
USE CASES
+ COVID-19 solutions
+ Cancer research
+ Integrity (e.g. PhotoDNA project)
+ Federated AI across enterprise silos
+ What problems will you solve?
PAPERS WITH CODE
• Reproducible Research - ArXiv integration
• Datasets
• Federated Learning task
https://paperswithcode.com/task/federated-learning
REFERENCES
• CrypTen: https://crypten.ai/
• CrypTen Tutorials: https://github.com/facebookresearch/CrypTen#how-crypten-works
• Opacus: https://ai.facebook.com/blog/introducing-opacus-a-high-speed-library-for-training-pytorch-mo with-differential-privacy/
• Opacus Tutorials: https://opacus.ai/tutorials/
• Papers w/ Code, FL task: https://paperswithcode.com/task/federated-learning
• OpenMined for Covid-19 Apps: https://blog.openmined.org/providing-opensource-privacy-for-covid19/
• Udacity Course: https://www.udacity.com/course/secure-and-private-ai--ud185
• Private AI Series, OpenMined: https://courses.openmined.org/
• Active Federated Learning Paper: https://arxiv.org/pdf/1909.12641.pdf
• Fair Resource allocation in FL: https://arxiv.org/pdf/1905.10497.pdf
• Ditto: Fair & Robust FL through Personalization: https://arxiv.org/pdf/2012.04221.pdf
• Resilient: Failure resilient inference: https://arxiv.org/pdf/2002.07386.pdf
• Owkin: https://owkin.com/
QUESTIONS?
Contact:
Email: gchauhan@fb.com
LinkedIn: https://www.linkedin.com/in/geetachauhan/
THANK YOU
Big thanks to Brian Knott, Dzmitry Huba, Selena Chan, Ilya Mironov, Laurens Van Der Maaten, Davide Testuggine, Joe Spisak, Shauna Keller, Christian Keller for inputs
