Everything you always wanted to know about Synthetic Data

Dr. Michael Platzer, MOSTLY AI
Oct 21, 2022
Guest Lecture at Imperial College London
Everything You Always Wanted to
Know About Synthetic Data
in 30 minutes 😳

Agenda
1. Synthetic Data
2. How accurate is it?
3. How safe is it?
4. Give it a try!

We’re MOSTLY AI, and we’re all about…
It’s smarter than using real data and we’ve made
structured synthetic data smarter to create, use and
share across your organization.

Let’s try to anonymize this guy!
useful, but re-identifiable
private, but useless

Let’s create some new faces!
[Four images, 400px x 300px each: random-generated, model-generated, AI-generated, self-generated]

Structured Data
source: https://archive.ics.uci.edu/ml/datasets/census+income, ~49k records, 421 billion combinations possible
access denied 💣

Let’s create some structured data!
→ https://www.mostly.ai/
1. Upload
2. Synthesize 💫
3. Done ✔ in ~50 secs

Structured Synthetic Data
statistically representative, highly realistic, truly anonymous synthetic data - at any volume

Synthetic Data - The Lingua Franca of Learning
Original, Privacy-sensitive Data (restricted, biased, incomplete)
access denied 💣
Smarter Synthetic Data (realistic, representative, truly anonymous)
Data Consumers (learn, collaborate, innovate)

How good is it?
Original Data vs. Synthetic Data
How close is the synthetic data to the original?
What does it mean for datasets to be close?
And then, how close should it even be?
[There is surprisingly little consensus on how to answer this question.]

Automated Empirical Quality Assurance
See also our paper on Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data and our blog post series on accuracy and privacy.

How accurate is it?
1. Turing Tests - is it fake or not?
Measures the realism, i.e. the rule-adherence, of synthetic samples at record level. Can be performed by humans as well as machines. But it says nothing about statistical representativeness.
2. Compare Utility for Machine Learning
Train on synthetic data, test on real holdout data for a specific ML task, and compare the predictive accuracy to that of the same model trained on the original data. A strong test, but it only checks for one specific signal in the data.
3. Measure Deviations in Marginal Distributions
Calculate lower-order marginal empirical distributions, and systematically measure any deviations between original and synthetic data.
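The third approach can be illustrated with a minimal sketch. It assumes purely categorical columns (numeric columns would need to be binned first) and uses the total variation distance as the deviation measure. Both choices, as well as the function names and the toy records, are assumptions made for illustration and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def marginal(rows, cols):
    """Empirical distribution of the value combinations found in `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def marginal_deviations(original, synthetic, order):
    """Deviation per `order`-way marginal: order=1 univariate, order=2 bivariate."""
    columns = sorted(original[0])
    return {
        cols: total_variation_distance(marginal(original, cols),
                                       marginal(synthetic, cols))
        for cols in combinations(columns, order)
    }


# Hypothetical toy records, for illustration only.
original = [
    {"sex": "F", "income": "high"},
    {"sex": "F", "income": "low"},
    {"sex": "M", "income": "high"},
    {"sex": "M", "income": "low"},
]
synthetic = [
    {"sex": "F", "income": "high"},
    {"sex": "F", "income": "high"},
    {"sex": "M", "income": "low"},
    {"sex": "M", "income": "low"},
]

univariate = marginal_deviations(original, synthetic, order=1)
bivariate = marginal_deviations(original, synthetic, order=2)
print(univariate)
print(bivariate)
```

In this toy case every univariate distribution matches exactly, while the bivariate distribution of income and sex deviates. This is why checking only single columns is not enough.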

How accurate is it? - Univariate Distributions

How accurate is it? - Bivariate Distributions


How accurate is it? - Train-Synthetic Test-Real
1. Split the original data into Training Data (80%) and Holdout Data (20%).
2. Synthesize the Training Data to get Synthetic Data (e.g. 10x).
3. Train one ML model on the synthetic data and one on the real training data.
4. Evaluate both models on the Holdout Data.
5. Compare.
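The train-synthetic test-real workflow can be sketched end to end as follows. Everything in this block is a hypothetical stand-in: `make_real_data` fabricates a toy dataset, `toy_synthesize` is a simple class-conditional Gaussian sampler rather than an actual synthetic data generator, and the "ML model" is a nearest-centroid classifier. Only the overall procedure (split, synthesize, train twice, evaluate on the same holdout, compare) follows the slide.

```python
import random
from statistics import mean, stdev


def make_real_data(n, seed):
    """Hypothetical 'real' data: one numeric feature and a binary label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(3.0 * label, 1.0), label))
    return rows


def split(rows, holdout_share, seed):
    """Shuffle a copy of the rows and return (training, holdout)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * holdout_share)
    return shuffled[cut:], shuffled[:cut]


def toy_synthesize(training, n, seed):
    """Stand-in generator: sample a label, then a feature from a fitted Gaussian."""
    rng = random.Random(seed)
    labels = [y for _, y in training]
    stats = {}
    for y in set(labels):
        xs = [x for x, label in training if label == y]
        stats[y] = (mean(xs), stdev(xs))
    rows = []
    for _ in range(n):
        y = rng.choice(labels)  # keeps the label frequencies of the training data
        mu, sigma = stats[y]
        rows.append((rng.gauss(mu, sigma), y))
    return rows


def train(rows):
    """Nearest-centroid classifier: one feature mean per class."""
    return {y: mean(x for x, label in rows if label == y)
            for y in {label for _, label in rows}}


def accuracy(model, rows):
    """Share of rows whose nearest class mean matches the true label."""
    hits = sum(min(model, key=lambda y: abs(x - model[y])) == label
               for x, label in rows)
    return hits / len(rows)


real = make_real_data(1000, seed=1)
training, holdout = split(real, holdout_share=0.2, seed=2)
synthetic = toy_synthesize(training, n=10 * len(training), seed=3)

acc_real = accuracy(train(training), holdout)    # train real, test real
acc_synth = accuracy(train(synthetic), holdout)  # train synthetic, test real
print(f"real: {acc_real:.3f}  synthetic: {acc_synth:.3f}  "
      f"gap: {acc_real - acc_synth:+.3f}")
```

The holdout is never shown to the generator or to either model, so both accuracy scores are measured on the same unseen records and the gap between them is directly comparable.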

How accurate is it?
→ https://mostly.ai/blog/boost-machine-learning-accuracy-with-synthetic-data/

How safe is it?
1. Run Shadow Models
a. for Membership Inference Attacks
b. for Attribute Disclosure Attacks
2. Compare Distances to Closest Records
a. Identical Match Share (IMS)
b. Distance to Closest Record (DCR)
c. Nearest Neighbour Distance Ratio (NNDR)

How safe is it? - Shadow Models
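[Diagram: a target dataset T that contains a subject X, and a copy T’ that excludes that subject, are each synthesized into ST and ST’. The same inference about X is then attempted on both, and the difference Δ between the two results is measured.]
We shall not be able to infer anything more about an individual just because that person was included in the database used for synthesis.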
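The with/without comparison can be sketched as a small experiment. This is an illustration under strong assumptions, not the attack setup used in the lecture: records are hypothetical `(quasi_identifier, sensitive_value)` pairs, the attacker's inference is a 1-nearest-neighbour lookup, and both "generators" are stand-ins. One simply memorizes its input (the worst case), the other resamples both columns independently, which breaks all correlations and would be useless in practice.

```python
import random


def infer(synthetic, qi):
    """Attacker's guess: sensitive value of the synthetic record closest in `qi`."""
    return min(synthetic, key=lambda rec: abs(rec[0] - qi))[1]


def copying_generator(rows, rng):
    """Worst-case stand-in: 'synthetic' data that memorizes its input."""
    return list(rows)


def independent_generator(rows, rng):
    """Stand-in that resamples both columns independently, with noise on `qi`."""
    return [(rng.choice(rows)[0] + rng.gauss(0.0, 5.0), rng.choice(rows)[1])
            for _ in rows]


def disclosure_delta(population, subject, generator, runs=200, seed=0):
    """Share of correct inferences with the subject in T, minus without (T')."""
    rng = random.Random(seed)
    qi, truth = subject
    target = population + [subject]   # T: includes the subject
    excluded = population             # T': subject excluded
    hits_with = hits_without = 0
    for _ in range(runs):
        hits_with += infer(generator(target, rng), qi) == truth       # via ST
        hits_without += infer(generator(excluded, rng), qi) == truth  # via ST'
    return (hits_with - hits_without) / runs


# Hypothetical population: the quasi-identifier cycles through 20..69,
# and every fifth record carries sensitive value 1.
population = [(20.0 + i % 50, 1 if i % 5 == 0 else 0) for i in range(200)]
subject = (99.0, 1)  # an outlier, i.e. the kind of record most at risk

delta_copy = disclosure_delta(population, subject, copying_generator)
delta_independent = disclosure_delta(population, subject, independent_generator)
print(f"delta (copying): {delta_copy:+.2f}")
print(f"delta (independent): {delta_independent:+.2f}")
```

A Δ close to zero means the attacker learns nothing extra from the subject's inclusion. The memorizing stand-in yields the maximum Δ, because the subject's own record is returned as the nearest neighbour.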

Training on synthetic data automatically fixed the privacy leak / memorization of some downstream ML models, without negatively impacting the overall predictive accuracy.
[Figure: accuracy scores for 50 randomly chosen subjects that were part of training, Target T vs. Synthetic ST]
[Figure: accuracy scores for 50 randomly chosen subjects that were NOT part of training, Target T’ vs. Synthetic ST’]

Important Note 1
The baseline for Attribute Disclosure is any inference based on a world that doesn’t know about an individual. Naturally, the privacy of a person who doesn’t exist cannot already be violated. Yet, some SD critics mix up attribute inference with privacy.

Important Note 2
Meta-data (= value ranges) needs to be protected to be safe against trivial membership inference attacks. Most open-source as well as custom-coded solutions don’t protect against that. Without such protection, the mere existence of a “US President” in a synthesized cancer dataset would already leak privacy.
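One conceivable way to protect value ranges before training a generator is sketched below. The slides do not say how this protection is implemented, so both helpers, the thresholds and the `_RARE_` placeholder are assumptions for illustration: categories that occur too rarely are masked, and numeric values are clipped so that no single extreme record defines the observed range.

```python
from collections import Counter


def protect_categories(values, min_count=5, placeholder="_RARE_"):
    """Mask categories that occur fewer than `min_count` times."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


def clip_range(values, k=5):
    """Clip to the k-th smallest and k-th largest value (needs len(values) >= 2*k)."""
    ordered = sorted(values)
    low, high = ordered[k - 1], ordered[-k]
    return [min(max(v, low), high) for v in values]


# Hypothetical toy columns, for illustration only.
occupations = ["Nurse"] * 6 + ["Teacher"] * 5 + ["US President"]
ages = [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 112]

protected_occupations = protect_categories(occupations)
protected_ages = clip_range(ages, k=2)
print(sorted(set(protected_occupations)))
print(min(protected_ages), max(protected_ages))
```

With such a step in place, the set of categories and the numeric bounds that a generator can emit no longer depend on any single individual.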

How safe is it? - Distance Measures
T = Training Data, H = Holdout Data, S = Synthetic Data
IMS(H, T) ≦ IMS(S, T) ?
DCR(H, T) ≦ DCR(S, T) ?
NNDR(H, T) ≦ NNDR(S, T) ?
Synthetic data shall be “as close as possible”, but not “too close” to the original data. A holdout helps to set a benchmark for being too close, as an ideal synthetic data generator creates new samples that behave exactly like actual samples that haven’t been seen before (= holdout data).
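The three measures can be sketched for numeric records as follows. The Euclidean distance, the median as aggregate and the toy points are assumptions made for this illustration; mixed-type data would need a different distance. IMS is taken as the share of records with an identical match in T, DCR as the distance to the nearest record in T, and NNDR as the ratio of the nearest to the second-nearest distance.

```python
import math
from statistics import median


def two_nearest_distances(record, reference):
    """Distances from `record` to its nearest and second-nearest reference record."""
    dists = sorted(math.dist(record, ref) for ref in reference)
    return dists[0], dists[1]


def ims(candidates, reference):
    """Identical Match Share: fraction of candidates that also occur in reference."""
    ref = set(reference)
    return sum(c in ref for c in candidates) / len(candidates)


def dcr(candidates, reference):
    """Median Distance to Closest Record."""
    return median(two_nearest_distances(c, reference)[0] for c in candidates)


def nndr(candidates, reference):
    """Median Nearest Neighbour Distance Ratio (nearest / second nearest)."""
    ratios = []
    for c in candidates:
        d1, d2 = two_nearest_distances(c, reference)
        ratios.append(d1 / d2 if d2 > 0 else 1.0)  # d2 == 0: both neighbours coincide
    return median(ratios)


# Hypothetical toy points: S simply repeats training records (memorization).
T = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
H = [(0.5, 0.5), (2.0, 2.0)]
S = [(0.0, 0.0), (1.0, 1.0)]

for name, measure in (("IMS", ims), ("DCR", dcr), ("NNDR", nndr)):
    print(f"{name}: holdout={measure(H, T):.3f}  synthetic={measure(S, T):.3f}")
```

In this constructed case the synthetic records sit closer to the training data than the holdout records do on all three measures, which is the pattern the holdout benchmark is meant to flag.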


Learn More
→ https://blog.mostly.ai/
● Synthetic Behavioral Data
● Synthetic Geo Data
● Synthetic Text Data
● Fair Synthetic Data
● Synthetic Data Benchmarks
● JRC Report on Synthetic Data
● AI-based Re-Identification Attacks
● Privacy Assessment of Synthetic Data
● and so much more…

Give it a try!
Sign up, get going and join our Discord Community.
→ https://synthetic.mostly.ai/
