Everything you always wanted to know about Synthetic Data
1. Dr. Michael Platzer, MOSTLY AI
Oct 21, 2022
Guest Lecture at Imperial College London
Everything You Always Wanted to Know About Synthetic Data in 30 minutes 😳
2. TITLE OF DOCUMENT | PUBLIC/INTERNAL/CONFIDENTIAL
Agenda
1. Synthetic Data
2. How accurate is it?
3. How safe is it?
4. Give it a try!
3.
We’re MOSTLY AI, and we’re all about…
It’s smarter than using real data and we’ve made
structured synthetic data smarter to create, use and
share across your organization.
4.
Let’s try to anonymize this guy!
useful, but re-identifiable
private, but useless
5.
Let’s create some new faces!
[Images: random-generated, model-generated, AI-generated, self-generated]
6.
Structured Data
source: https://archive.ics.uci.edu/ml/datasets/census+income, ~49k records, 421 billion combinations possible
access denied 💣
7.
Let’s create some structured data!
→ https://www.mostly.ai/
1. Upload
2. Synthesize 💫
3. Done ✔ in ~50 secs
8.
Structured Synthetic Data
statistically representative, highly realistic, truly anonymous synthetic data - at any volume
9.
Synthetic Data - The Lingua Franca of Learning
Original, Privacy-sensitive Data
restricted, biased, incomplete
Smarter Synthetic Data
realistic, representative, truly anonymous
🤖
💡
Data Consumers
learn, collaborate, innovate
access denied 💣
10.
How good is it?
Original Data vs. Synthetic Data
How close is the synthetic data to the original?
What does it mean for datasets to be close?
And then, how close should it even be?
[There is surprisingly little consensus on how to answer this question.]
11.
Automated Empirical Quality Assurance
See also our paper on Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data and our blog post series on accuracy and privacy.
12.
How accurate is it?
1. Turing Tests - is it fake or not?
Measures the realism, i.e. the rule-adherence, of synthetic samples at record level. Can be performed by humans as well as machines, but it says nothing about statistical representativeness.
2. Compare Utility for Machine Learning
Train on synthetic data, test on real holdout data for a specific ML task, and compare the predictive accuracy to that of the same model trained on the original data. A strong test, but it only checks for one specific signal in the data.
3. Measure Deviations in Marginal Distributions
Calculate lower-order marginal empirical distributions (e.g. univariate and bivariate), and systematically measure how far the synthetic ones deviate from the original ones.
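The third approach, measuring deviations in marginal distributions, can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the choice of total variation distance as the deviation measure, the function names, and the toy records are all assumptions, and numeric columns would first need to be binned. The toy data is built so that every univariate distribution matches while the bivariate one does not.

```python
from collections import Counter
from itertools import combinations

def empirical_dist(rows, cols):
    """Relative frequency of each value combination of `cols`."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return {k: v / len(rows) for k, v in counts.items()}

def tvd(p, q):
    """Total variation distance between two discrete distributions (assumed metric)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def marginal_deviations(original, synthetic, columns, order=1):
    """Deviation per column subset of size `order` (1 = univariate, 2 = bivariate)."""
    return {
        cols: tvd(empirical_dist(original, cols), empirical_dist(synthetic, cols))
        for cols in combinations(columns, order)
    }

# Hypothetical toy records: same univariate marginals, different correlation.
original = [
    {"sex": "F", "income": "high"},
    {"sex": "F", "income": "low"},
    {"sex": "M", "income": "high"},
    {"sex": "M", "income": "low"},
]
synthetic = [
    {"sex": "F", "income": "high"},
    {"sex": "F", "income": "high"},
    {"sex": "M", "income": "low"},
    {"sex": "M", "income": "low"},
]

uni = marginal_deviations(original, synthetic, ["sex", "income"], order=1)
bi = marginal_deviations(original, synthetic, ["sex", "income"], order=2)
print(uni)
print(bi)
```

The univariate deviations are zero here while the bivariate deviation is not, which is why checking only one-dimensional distributions is not enough.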
13.
How accurate is it?
Univariate Distributions
14.
How accurate is it?
Bivariate Distributions
15.
How accurate is it?
16.
How accurate is it? - Train-Synthetic Test-Real
Split the original data into Training Data (80%) and Holdout Data (20%).
Training Data → Synthesize → Synthetic Data (e.g. 10x) → Train ML Model (on synthetic) → Evaluate on Holdout
Training Data → Train ML Model (on real) → Evaluate on Holdout
Compare
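The Train-Synthetic Test-Real flow can be sketched as follows. Everything here is a stand-in chosen so the example runs without dependencies: `bootstrap_synthesizer` merely resamples real rows (so it is not a privacy-preserving generator), `train_lookup` is a toy classifier, and the data is a hypothetical noiseless toy set. In practice a real synthesizer and a real ML model would be plugged in.

```python
import random
from collections import Counter, defaultdict

def split(rows, holdout_frac=0.2, seed=0):
    """Shuffle and split into (training, holdout)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    k = int(len(rows) * holdout_frac)
    return rows[k:], rows[:k]

def train_lookup(rows):
    """Toy classifier: predicts the majority label seen for each feature value."""
    by_x = defaultdict(Counter)
    for x, y in rows:
        by_x[x][y] += 1
    fallback = Counter(y for _, y in rows).most_common(1)[0][0]
    table = {x: c.most_common(1)[0][0] for x, c in by_x.items()}
    return lambda x: table.get(x, fallback)

def accuracy(model, rows):
    """Share of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def bootstrap_synthesizer(training, n, seed=1):
    """Stand-in only: resamples real rows, so it is NOT privacy-preserving."""
    rng = random.Random(seed)
    return [rng.choice(training) for _ in range(n)]

def tstr(rows, synthesize, train=train_lookup, factor=10):
    """Return (accuracy trained on real, accuracy trained on synthetic)."""
    training, holdout = split(rows)                       # 80% / 20%
    synthetic = synthesize(training, factor * len(training))
    acc_real = accuracy(train(training), holdout)         # train on real
    acc_synth = accuracy(train(synthetic), holdout)       # train on synthetic
    return acc_real, acc_synth

# Hypothetical toy data: one feature x, label is simply x >= 2.
rows = [(x, x >= 2) for x in range(5)] * 40
acc_real, acc_synth = tstr(rows, bootstrap_synthesizer)
print(acc_real, acc_synth, acc_real - acc_synth)
```

The quantity of interest is the gap between the two scores: the smaller it is, the better the synthetic data preserves the signal needed for this one task.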
17.
How accurate is it?
→ https://mostly.ai/blog/boost-machine-learning-accuracy-with-synthetic-data/
18.
How safe is it?
1. Run Shadow Models
a. for Membership Inference Attacks
b. for Attribute Disclosure Attacks
2. Compare Distances To Closest Records
a. Identical Match Share (IMS)
b. Distance to Closest Record (DCR)
c. Nearest Neighbour Distance Ratio (NNDR)
19.
How safe is it? - Shadow Models
[Diagram: the training data T is synthesized into ST. The subject X is excluded from T to give T’, which is synthesized into ST’. The subject’s attributes are inferred from ST and from ST’, and the two inferences are compared: Δ difference?]
We should not be able to infer anything more about an individual just because that person was included in the database used for synthesis.
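The shadow-model comparison can be sketched as a leave-one-out check for a single subject. This is an illustration under strong assumptions: `leaky_synthesizer` is a deliberately bad stand-in that copies its input, `fit_lookup_attacker` is a toy attacker, and the records and attribute names are hypothetical. A real assessment would plug in the actual generator and repeat this over many subjects.

```python
from collections import Counter

def fit_lookup_attacker(synthetic, target):
    """Toy attacker: exact match on the known attributes, else the majority value."""
    majority = Counter(r[target] for r in synthetic).most_common(1)[0][0]

    def infer(known):
        hits = [r[target] for r in synthetic
                if all(r[k] == v for k, v in known.items())]
        return Counter(hits).most_common(1)[0][0] if hits else majority

    return infer

def shadow_model_check(training, subject, synthesize, target):
    """Return (correct_with_subject, correct_without_subject) for one subject."""
    known = {k: v for k, v in subject.items() if k != target}
    without_subject = [r for r in training if r is not subject]      # T'
    outcomes = []
    for data in (training, without_subject):                         # T, then T'
        attacker = fit_lookup_attacker(synthesize(data), target)     # uses ST, ST'
        outcomes.append(attacker(known) == subject[target])
    return tuple(outcomes)

def leaky_synthesizer(data):
    """Deliberately bad stand-in that simply copies its training data."""
    return [dict(r) for r in data]

# Hypothetical records: one outlier subject with a rare sensitive value.
subject = {"age": 97, "zip": "9999", "diagnosis": "rare"}
training = [{"age": 30 + i % 5, "zip": "1010", "diagnosis": "none"}
            for i in range(20)]
training.append(subject)

with_x, without_x = shadow_model_check(
    training, subject, leaky_synthesizer, "diagnosis")
print(with_x, without_x)
```

With the copying stand-in, the attacker is right only when the subject was part of the training data, which is exactly the difference the check is meant to expose. For a safe generator the two outcomes should not differ systematically across subjects.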
20.
How safe is it? - Shadow Models
[Plot: accuracy scores for 50 randomly chosen subjects that were part of training, Target T vs. Synthetic ST]
[Plot: accuracy scores for 50 randomly chosen subjects that were NOT part of training, Target T’ vs. Synthetic ST’]
Training on synthetic data automatically fixed the privacy leak (memorization) of some downstream ML models, without negatively impacting the overall predictive accuracy.
21.
How safe is it? - Shadow Models
Important Note 1
The baseline for attribute disclosure is any inference based on a world that doesn’t know about an individual. Naturally, the privacy of a person who doesn’t exist cannot be violated. Yet some critics of synthetic data mix up attribute inference with privacy leakage.
22.
How safe is it? - Shadow Models
Important Note 2
Meta-data (= value ranges) needs to be protected to be safe against trivial membership inference attacks. Most open-source as well as custom-coded solutions don’t protect against that. Without such protection, the mere existence of a “US President” in a synthesized cancer dataset would already leak privacy.
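One simple way to protect categorical value ranges is to collapse rare categories into a placeholder before training the generator, so that a value held by only a handful of people can never appear in the output. The sketch below illustrates that idea only. The function name, the threshold of 5, and the placeholder token are assumptions, and this is not necessarily how any particular product implements it.

```python
from collections import Counter

def protect_rare_categories(values, min_count=5, placeholder="_RARE_"):
    """Replace categories seen fewer than `min_count` times (hypothetical threshold)."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]

# Hypothetical occupation column of a cancer dataset.
occupations = ["Nurse"] * 6 + ["Teacher"] * 5 + ["US President"]
protected = protect_rare_categories(occupations)
print(Counter(protected))
```

A generator trained on the protected column has never seen the rare value, so its presence in the original data cannot be read off the synthetic value ranges. Numeric columns would need an analogous treatment, for example clipping extreme values.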
23.
How safe is it? - Distance Measures
[Diagram: Training Data (T), Holdout Data (H), Synthetic Data (S)]
IMS(H, T) ≦ IMS(S, T) ?
DCR(H, T) ≦ DCR(S, T) ?
NNDR(H, T) ≦ NNDR(S, T) ?
Synthetic data shall be “as close as possible”, but not “too close”, to the original data. A holdout helps to set a benchmark for being too close, because an ideal synthetic data generator creates new samples that behave exactly like actual samples that haven’t been seen before (= holdout data).
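The three distance measures can be computed for both the synthetic data and the holdout, each against the training data, and then compared. The following is a minimal sketch under stated assumptions: records are tuples of categorical values, Hamming distance is the chosen metric, the metrics are summarized by their mean, NNDR is set to 1.0 when the two nearest neighbours are both at distance zero, and the toy records are hypothetical.

```python
def hamming(a, b):
    """Number of attributes in which two records differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))

def closeness(candidates, training):
    """IMS, mean DCR and mean NNDR of `candidates` with respect to `training`.

    Requires at least two training records.
    """
    dcr, nndr = [], []
    for rec in candidates:
        # Distances to the nearest and second-nearest training record.
        d1, d2 = sorted(hamming(rec, t) for t in training)[:2]
        dcr.append(d1)
        nndr.append(d1 / d2 if d2 > 0 else 1.0)   # convention for d1 == d2 == 0
    n = len(candidates)
    return {
        "IMS": sum(d == 0 for d in dcr) / n,      # share of identical matches
        "DCR_mean": sum(dcr) / n,
        "NNDR_mean": sum(nndr) / n,
    }

# Hypothetical toy records (sex, income, country).
training = [("F", "high", "AT"), ("M", "low", "DE"), ("F", "low", "FR")]
holdout = [("M", "high", "DE"), ("F", "low", "AT")]
synthetic = [("F", "high", "AT"), ("M", "low", "FR")]

s = closeness(synthetic, training)   # S vs. T
h = closeness(holdout, training)     # H vs. T

# Flag the synthetic data if it sits closer to the training data than the holdout does.
too_close = s["IMS"] > h["IMS"] or s["DCR_mean"] < h["DCR_mean"]
print(s, h, too_close)
```

In this toy case one synthetic record is an exact copy of a training record, so the synthetic data is closer to the training data than the holdout is and gets flagged. On real data the comparison would usually be made on the full distributions of the distances, not just their means.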
24.
How safe is it? - Distance Measures
25.
Learn More
→ https://blog.mostly.ai/
● Synthetic Behavioral Data
● Synthetic Geo Data
● Synthetic Text Data
● Fair Synthetic Data
● Synthetic Data Benchmarks
● JRC Report on Synthetic Data
● AI-based Re-Identification Attacks
● Privacy Assessment of Synthetic Data
● and so much more…
26.
Give it a try!
Sign up, get going and join our Discord Community.
→ https://synthetic.mostly.ai/