2. Privacy techniques: On Device / Anonymize / Obfuscate / Smash / Encrypt
• On Device: local calculations; only download large datasets
• Anonymize: random identifiers
• Obfuscate: add noise, quantize, differential privacy
• Smash: convert to lower dimensions or a model
• Encrypt: use 'shares' and multiple servers; ZKP (zero-knowledge proof)
Imaging devices, Distributed ML and Privacy
Guiding Question
How to create portable skin imaging device(s) (like a wristwatch) that collaborate and train to do better diagnostics without sharing sensitive data?
5. Assuming a penalty of just $6 per person = $480M.
Reality: per-person costs are far higher, in the hundreds or thousands of dollars.
Ethical, moral, legal, trust, economic, news and PR repercussions.
Follow-up effects.
6. What's new in DPC @ MIT: Distributed & Private Computation Projects
• Semantic Privacy: architectures to prevent empirical reconstruction attacks
• Formal Privacy + Images: Differentially Private Image Retrieval
• Iterative feedback / private data structures: improving DP histograms & set intersection verification
• Parallel combinatorial optimization without submodularity
• Distributed ML: Split Learning
• AirMixML PoC: wireless + formal Rényi DP + ML
• Splintering: a foundation for distributed scientific computation
17. Challenges in Federated Learning
• IoT: low compute/communications (cannot train models)
• Health data: few clients (non-homogeneous data)
• Complex models (large, unoptimized models)
• Too little data per client (unviable to train or send large models)
• Many untrusted parties (how to encourage 3rd-party developers)
21. NoPeek-Infer via Decorrelation
[Diagram: on the client, a smasher maps input data 𝐱 to smashed data 𝐳, which is sent to the FL server; labels are shown alongside the input data.]
NoPeek-Infer: Preventing reconstruction attacks in distributed predictive inference after on-premise training. Vepakomma, Singh, Zhang, Raskar, 2021.
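NoPeek-style decorrelation is typically realized by adding a distance-correlation penalty between the raw inputs and the smashed activations to the task loss. Below is a minimal PyTorch sketch of that loss; the weighting alpha, the cross-entropy task loss, and the estimator details are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distance_correlation(x, z, eps=1e-9):
    """Squared sample distance correlation between mini-batches x and z (0 = independent)."""
    def centered_dist(a):
        # Pairwise Euclidean distances between flattened samples, then double-center.
        d = torch.cdist(a.flatten(1), a.flatten(1), p=2)
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
    A, B = centered_dist(x), centered_dist(z)
    dcov2 = (A * B).mean()
    return dcov2 / (torch.sqrt((A * A).mean() * (B * B).mean()) + eps)

def nopeek_style_loss(x, smashed_z, logits, labels, alpha=0.1):
    """Task loss plus a penalty discouraging the smashed data 𝐳 from remaining
    statistically dependent on (and hence reconstructable from) the input 𝐱."""
    return F.cross_entropy(logits, labels) + alpha * distance_correlation(x, smashed_z)
```

During client-side training, the smasher is updated with this combined loss so 𝐳 stays useful for the task while correlating less with 𝐱.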
24. Colorectal histology image public dataset
[Figure: pairs of original images and reconstructions from activations, comparing a traditional split network against NoPeek; leakage decreases under NoPeek.]
27. Differential Privacy (Zoomed-out view)
Learning nothing about an individual while learning useful statistical information about a population.
• Is the provided (noisy) data enough for task accuracy? (Utility-privacy tradeoff)
(ε, δ)-Differential Privacy:
• The distribution of the output M(D) (a query) on database D is nearly the same as M(D′) for all adjacent databases D and D′:
∀S: Pr[M(D) ∈ S] ≤ exp(ε) ∙ Pr[M(D′) ∈ S] + δ.
Example mechanisms: the Laplace mechanism for ε-DP, the Gaussian mechanism for (ε, δ)-DP, contractive noisy iterations, and the list goes on.
2 key properties: post-processing (once DP, DP forever) & composition (loss of privacy across multiple queries).
How much noise to add? It depends on the global sensitivity of the query (not of the data).
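For the "how much noise" question, here is a minimal Laplace-mechanism sketch in Python; the counting-query example and parameter values are illustrative, not from the slides.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value under epsilon-DP by adding Laplace(sensitivity / epsilon) noise."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query has global sensitivity 1: adding or removing one person
# changes the count by at most 1, regardless of the data itself.
noisy_count = laplace_mechanism(true_value=1234, sensitivity=1.0, epsilon=0.5)
```

By post-processing, anything computed from noisy_count stays ε-DP; issuing further queries on the same data consumes additional budget under composition.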
29. Private Image Retrieval with Differential Privacy
[Diagram: the client sends a privatized image query; the server retrieves the nearest matches to the privatized query.]
Differentially Private Supervised Manifold Learning with Applications like Private Image Retrieval. Vepakomma, Balla, Raskar, 2021.
30. Lifecycle of PrivateMail
Key: perform retrieval after differentially private manifold learning on deep-learning features for image retrieval.
Big question: how to perform differentially private manifold learning?
35. Challenges/Choices in Decentralized & Private ML: Compute, Privacy, Bandwidth, Memory, Accuracy/Utility
Compute
• Compute or memory on clients (e.g., IoT)
• Unknown / large number of clients
• Unreliable nodes (slowdowns, faults)
• Adversarial clients
• Trusting services; unregulated/unethical use
• Ownership + control of the ecosystem
• Local vs. remote compute tradeoff
Communication
• Bandwidth
• Limited communication (few rounds, unreliable comms)
• Latency of training (queuing delays)
• Dynamic availability (streaming, time zones)
Data
• Too little or too much data per client
• Unbalanced (size, distribution, adversarial)
• Highly non-IID (heterogeneous data)
• Personalization
• Vertical partitions (instead of horizontal)
• Exposing labels
ML
• Validation on hidden data
• Parameter tuning
• Convergence guarantees
• Incremental updates (new features)
• Data leakage / invertibility
37. Split learning ported into PySyft & FedML.ai
Project Website: splitlearning.mit.edu
Annual Research Workshop: SLDML
Tutorials/Talk Videos
vepakom@mit.edu
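To make the split-learning handoff concrete, here is a minimal single-client PyTorch sketch: the client runs the network up to a cut layer and hands off smashed activations; the server finishes the forward pass, backpropagates, and returns the gradient at the cut. The layer sizes, optimizer, and in-process "send/receive" are simplifying assumptions; a real deployment would route tensors through the PySyft or FedML communication stack.

```python
import torch
import torch.nn as nn

# Client-side and server-side partitions of one model (sizes are illustrative).
client_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
server_net = nn.Sequential(nn.Linear(64, 10))
opt_c = torch.optim.SGD(client_net.parameters(), lr=0.1)
opt_s = torch.optim.SGD(server_net.parameters(), lr=0.1)

def split_training_step(x, y):
    # Client: forward to the cut layer; "send" the smashed data to the server.
    z = client_net(x)
    smashed = z.detach().requires_grad_(True)

    # Server: finish the forward pass, compute the loss, backprop to the cut.
    loss = nn.functional.cross_entropy(server_net(smashed), y)
    opt_s.zero_grad()
    loss.backward()
    opt_s.step()

    # Client: "receive" the gradient at the cut and finish backprop locally.
    opt_c.zero_grad()
    z.backward(smashed.grad)
    opt_c.step()
    return loss.item()

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
print(split_training_step(x, y))
```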
38. AirMixML
• Over-the-Air Data Mixup for Inherently Privacy-Preserving Edge Machine Learning
• Wireless communications + PPML
39. Splintering (Stochastic Splintering)
• Client creates secure splinters from raw data and sends the secure splinters to the server.
• Server computes on the splinters and returns intermediate results.
• Client unsplinters the intermediate results to obtain the final result.
Splintering, Vepakomma, Raskar et al., 2020
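The slide does not spell out how splinters are constructed, so the sketch below uses plain additive secret sharing of a sum query as a stand-in for the same client → servers → client pattern; it is not the actual splintering algorithm, and the function names are invented for illustration.

```python
import numpy as np

def make_shares(x, num_servers=2, rng=None):
    """Split vector x into additive shares: the shares sum to x, and each share alone looks random."""
    rng = rng or np.random.default_rng()
    shares = [rng.normal(size=x.shape) for _ in range(num_servers - 1)]
    shares.append(x - sum(shares))
    return shares

def server_compute(share):
    """Each server runs the query (here: a sum) on its share only."""
    return share.sum()

def client_recombine(partial_results):
    """Client recombines the servers' intermediate results into the final answer."""
    return sum(partial_results)

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
shares = make_shares(x)
print(client_recombine([server_compute(s) for s in shares]))  # ≈ 14.0
```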
43. DAMS: proposed private sketching data structure for set intersection verification queries
[Diagram: a client device issues a query and receives a set intersection verification result.]
• Key idea: the algorithm is run q times, using a different dictionary of hash functions in each run of the private sketching algorithm (a minimal sketch follows below).
• The final result is obtained as the average of the estimated counts. We refer to this option as our proposed private DAMS estimator.
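A hedged sketch of that construction: the helper below is a simplified noisy count-mean sketch (the DP calibration and exact estimator in the paper may differ), and the DAMS estimator simply reruns it q times with a fresh hash dictionary each run and averages the q estimates.

```python
import zlib
import numpy as np

def make_hash_dictionary(num_hashes, width, seed):
    """One 'dictionary' of simple keyed hash functions mapping items to [0, width)."""
    keys = np.random.default_rng(seed).integers(0, 2**31 - 1, size=num_hashes)
    return [lambda x, k=int(k): zlib.crc32(f"{k}:{x}".encode()) % width for k in keys]

def private_cms_estimate(items, query, num_hashes, width, epsilon, seed):
    """One run of a simplified count-mean sketch with Laplace noise on the counters."""
    hashes = make_hash_dictionary(num_hashes, width, seed)
    table = np.zeros((num_hashes, width))
    for x in items:
        for j, h in enumerate(hashes):
            table[j, h(x)] += 1
    rng = np.random.default_rng(seed + 10_000)
    table += rng.laplace(scale=num_hashes / epsilon, size=table.shape)
    n = len(items)
    # Debias each row and average across rows ("count-mean").
    per_row = [(table[j, h(query)] - n / width) / (1 - 1 / width)
               for j, h in enumerate(hashes)]
    return float(np.mean(per_row))

def dams_estimate(items, query, q, num_hashes=4, width=64, epsilon=1.0):
    """DAMS: q runs, a different hash dictionary in each run, average of the q estimates."""
    return float(np.mean([private_cms_estimate(items, query, num_hashes, width, epsilon, seed=r)
                          for r in range(q)]))

items = ["alice"] * 40 + ["bob"] * 10 + ["carol"] * 5
print(dams_estimate(items, "alice", q=8))
```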
44. Theoretical guarantees on utility
• Baseline 1: use eps = q·eps′, with the algorithm run once with one set of hash functions.
– By sequential composition, this matches the privacy level obtained when the same set of hash functions is used across q runs of the algorithm on the same dataset.
• Baseline 2: use eps = eps′, with the algorithm run q times using the same dictionary of hash functions in each run of the private count-mean-sketch algorithm.
– This baseline is important to confirm that changing the hash-function dictionary across the q runs is a better option than keeping it the same.
• DAMS (Diversified Averaging for Meta-estimation of Sketches):
– The algorithm is run q times using a different dictionary of hash functions in each run of the private count-mean-sketch algorithm.
– The final result is obtained as the average of the estimated counts.
– Proved guarantees:
• The private DAMS estimator is unbiased.
• Var(DAMS) < Var(Baseline 1) when eps > 2.
• Var(DAMS) < Var(Baseline 2) always.
DAMS: Meta-estimation of private sketch data structures for differentially private COVID-19 contact tracing. P. Vepakomma, S. N. Pushpita, R. Raskar (PriML and PPML Joint Edition, NeurIPS 2020).
45. DAMS: improved TPR/FPR and lower variance than traditional private data structures like the count-mean sketch.
46. Iterative Feedback to Improve the Utility-Privacy Tradeoff for DP Histograms
Key idea: use a subsample to privately estimate heavy hitters, and use that as feedback to distribute epsilon across the heavy-hitter (HH) and non-heavy-hitter (!HH) partitions (a sketch follows below).
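An illustrative sketch of that budget-splitting idea, with assumed parameters and splits rather than the published algorithm: a small budget flags heavy-hitter bins from a subsample, then the remaining budget is divided unevenly between the HH and !HH partitions of the full histogram.

```python
import numpy as np

def dp_histogram_with_feedback(data, bin_edges, eps_subsample=0.2, eps_hh=0.6,
                               eps_rest=0.2, subsample_frac=0.1, hh_frac=0.05, seed=0):
    """Sketch of iterative feedback for DP histograms (not the exact published method).

    Each bin count has sensitivity 1, so Laplace(1/eps) noise per bin suffices for
    each step; the two disjoint bin partitions compose in parallel (cost = max of
    their budgets), and the subsample step composes sequentially on top of that.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)

    # Step 1: noisy histogram of a small subsample to guess heavy-hitter (HH) bins.
    sub = rng.choice(data, size=max(1, int(subsample_frac * len(data))), replace=False)
    sub_counts, _ = np.histogram(sub, bins=bin_edges)
    noisy_sub = sub_counts + rng.laplace(scale=1.0 / eps_subsample, size=sub_counts.shape)
    hh = noisy_sub >= hh_frac * max(noisy_sub.sum(), 1.0)

    # Step 2: full histogram; HH bins get the larger budget (less noise).
    counts, _ = np.histogram(data, bins=bin_edges)
    noisy = counts.astype(float)
    noisy[hh] += rng.laplace(scale=1.0 / eps_hh, size=int(hh.sum()))
    noisy[~hh] += rng.laplace(scale=1.0 / eps_rest, size=int((~hh).sum()))
    return noisy, hh

vals = np.random.default_rng(1).normal(0, 1, 1000)
hist, hh_bins = dp_histogram_with_feedback(vals, bin_edges=np.linspace(-4, 4, 17))
```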
47. Parallel Quasi-concave set optimization: A new frontier that scales without needing submodularity
[Diagram: duality between quasi-concave set functions (exponential time) and induced quasi-concave set functions defined via monotone linkage functions (polynomial time); data is broadcast across workers, each computing a pi-series, which are combined into maximal/minimal pi-clusters; reported complexities: O(K² N log N) + O(log K), and O(log K).]
Parallel Quasi-concave set optimization: A new frontier that scales without needing submodularity. Vepakomma, Kempner, Raskar, 2021.