Dr. Michael Platzer, MOSTLY AI
Nov 2023
Guest Lecture at Imperial College London
Everything You Always Wanted
to Know About Synthetic Data
in 45 minutes 😳
2
Synthetic Data
1. What is it?
2. How accurate is it?
3. How safe is it?
4. Give it a try!
3
We’re MOSTLY AI, and we’re all about…
useful, but re-identifiable
private, but useless
Let’s try to anonymize this guy!
4
Let’s create some new faces!
5
[face images: random-generated · model-generated · AI-generated · self-generated]
6
Privacy-Sensitive Structured Data
source: https://archive.ics.uci.edu/ml/datasets/census+income, ~49k records, 421 billion combinations possible
access denied 💣
7
Let’s create some Synthetic Data!
→ https://www.mostly.ai/
1. Upload
2. Synthesize 💫
3. Done ✔ in ~50 secs
8
Representative Synthetic Data
statistically representative, highly realistic, truly anonymous synthetic data - at any volume
Representative Synthetic Data for Everyone
Actual Data
privacy-restricted
biased
incomplete
Synthetic Data
realistic
representative
truly anonymous
granular level
Data Consumers
people & algorithms
Representative Synthetic Data for Everything
for Exploration
for Analytics
for AI Training
for AI Validation
How good is it?
11
vs.
How close is the synthetic data to the original?
What does it mean to be close for datasets?
And, then how close should it even be?
[there is surprisingly little consensus on answering this question]
Original Data Synthetic Data
Automated Empirical Quality Assurance
12
See also our paper on Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data and our blog post series on accuracy and privacy.
How accurate is it?
13
1. Turing Tests - is it fake or not?
Measures the realism, i.e. rule-adherence, of synthetic samples at record level. Can be performed
by humans as well as machines, but it doesn't inform about statistical representativeness.
2. Compare Utility for Machine Learning
Train on synthetic data, test on a real holdout for a specific ML task, and compare predictive accuracy
to the same model trained on the original data. A strong test, but it only checks for one specific
signal in the data.
3. Measure Deviations in Marginal Distributions
Calculate low-dimensional empirical marginal distributions (univariate, bivariate, …), and systematically
measure any deviations between original and synthetic data.
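The third approach can be sketched in a few lines. This is a minimal illustration, not MOSTLY AI's actual quality-assurance code: it assumes pandas Series for a single column of the real and synthetic data, and measures the total variation distance (TVD) between the two empirical marginals, binning numeric columns by the real data's quantiles. The function name `marginal_tvd` and the binning scheme are my own choices for illustration.

```python
import numpy as np
import pandas as pd

def marginal_tvd(real: pd.Series, synthetic: pd.Series, bins: int = 10) -> float:
    """Total variation distance between two empirical univariate marginals.

    Numeric columns are discretized into quantile bins derived from the real
    data; categorical columns are compared category by category.
    """
    if pd.api.types.is_numeric_dtype(real):
        edges = np.unique(np.quantile(real.dropna(), np.linspace(0, 1, bins + 1)))
        r = pd.cut(real, edges, include_lowest=True)
        s = pd.cut(synthetic, edges, include_lowest=True)
    else:
        r, s = real, synthetic
    p = r.value_counts(normalize=True)
    q = s.value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 0.5 * float(np.abs(p.reindex(cats, fill_value=0)
                              - q.reindex(cats, fill_value=0)).sum())

# Toy check: two samples from the same distribution should have a small TVD.
rng = np.random.default_rng(0)
real = pd.Series(rng.normal(size=5000))
syn = pd.Series(rng.normal(size=5000))
print(marginal_tvd(real, syn))
```

Averaging such per-column (and per-column-pair) deviations across the dataset yields a single accuracy score, which is the idea behind the univariate and bivariate charts on the following slides.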
How accurate is it?
14
Univariate Distributions
How accurate is it?
15
Bivariate Distributions
How accurate is it?
16
How accurate is it? - Train-Synthetic Test-Real
17
Holdout Data
(20%)
Training Data
(80%)
Synthesize
Synthetic Data
(e.g. 10x)
Train ML Model
(on synthetic)
Train ML Model
(on real)
Evaluate on Holdout
Evaluate on Holdout
Compare
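The Train-Synthetic-Test-Real (TSTR) flow above can be sketched as follows. Assumptions: scikit-learn is available, and the actual generator is stubbed out with bootstrap resampling of the training split (a placeholder only; a real synthesizer such as MOSTLY AI's would produce the synthetic records instead). The key discipline is that the 20% holdout is used exclusively for evaluation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# 80/20 split: the 20% holdout is only ever touched for evaluation.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Placeholder generator: bootstrap the training set (stand-in only!) ---
rng = np.random.default_rng(42)
idx = rng.integers(0, len(X_tr), size=len(X_tr))
X_syn, y_syn = X_tr[idx], y_tr[idx]

def holdout_auc(X_fit, y_fit):
    """Fit a fresh model and score it on the untouched holdout."""
    model = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)
    return roc_auc_score(y_ho, model.predict_proba(X_ho)[:, 1])

auc_real = holdout_auc(X_tr, y_tr)   # train-real baseline
auc_syn = holdout_auc(X_syn, y_syn)  # train-synthetic
print(f"AUC train-real: {auc_real:.3f}  AUC train-synthetic: {auc_syn:.3f}")
```

If the synthetic data has captured the predictive signal, the two AUC values should be close; a large gap indicates lost utility.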
How accurate is it?
18
→ https://mostly.ai/blog/boost-machine-learning-accuracy-with-synthetic-data/
Case Study: American Community Survey
19
→ Amp It Up: synthetic population data at scale
1.5 million actual records with 32 attributes ⚠
Case Study: American Community Survey
20
→ Amp It Up: synthetic population data at scale
1.5 million synthetic records with 32 attributes ✅
Case Study: American Community Survey
21
→ Amp It Up: synthetic population data at scale
1.5 million synthetic records with 32 attributes ✅
Identical
Model Performance
Case Study: American Community Survey
22
→ Amp It Up: synthetic population data at scale
1.5 million synthetic records with 32 attributes ✅
Identical
Driver Analysis
Case Study: American Community Survey
23
→ Amp It Up: synthetic population data at scale
1.5 million synthetic records with 32 attributes ✅
Identical
Driver Analysis
How safe is it?
24
1. Holdout-Based Comparisons
2. Shadow Model Comparisons
3. Distance-Based Comparisons
How safe is it? - Holdout-Based Comparisons
25
https://github.com/statice/anonymeter
Example: r_control = 50%, r_train = 55% → R = (0.55 − 0.50) / (1 − 0.50) = 0.10
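The slide's numbers follow the normalization used by anonymeter-style evaluators: the attacker's excess success rate on training records over a control baseline, rescaled to [0, 1]. A minimal reproduction (the function name is mine; consult the anonymeter repository linked above for the actual API):

```python
def privacy_risk(r_train: float, r_control: float) -> float:
    """Excess attacker success on training records over the control baseline,
    normalized by the maximum achievable excess (1 - r_control)."""
    return (r_train - r_control) / (1.0 - r_control)

# Slide example: 55% attack success on training records vs. 50% on controls.
print(privacy_risk(0.55, 0.50))  # ≈ 0.10
```

A risk near 0 means the attacker does no better on records that were in the training data than on records that were not, which is the desired outcome.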
How safe is it? - Shadow Models
26
[diagram: synthesize S_T from the full data T, and S_T' from T' with the target subject excluded; run the same inference attack against both synthetic datasets and compare the Δ difference]
We shall not be able to infer anything more about an individual, given that that person was included in the database used for synthesis.
Training on Actual Data T → Information Leak
Training on Synthetic Data S → No Information Leak (plus on-par performance)
https://mostly.ai/blog/truly-anonymous-synthetic-data-legal-definitions-part-ii
How safe is it? - Shadow Models for Attribute Inference
27
Accuracy scores for 50 randomly chosen subjects that were part of training:
Target T
Synthetic ST
Accuracy scores for 50 randomly chosen subjects that were NOT part of training:
Target T'
Synthetic ST'
T vs T' → 44% > 39%
S vs S' → 38% ~ 39%
How safe is it? - Shadow Models for Membership Inference
28
Successful attack?
How safe is it? - Distance-Based Comparisons
29
Idea: Synthetic subjects shall not be “too close” to actual subjects, i.e. shall be sufficiently distinct.
Distances between actual subjects serve as reference.
[diagram: Distance to Closest Record (DCR) - for each actual subject, compare the distance to its synthetic nearest neighbor (DCR_S) against the distance to its actual nearest neighbor (DCR_T) as reference]
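A minimal sketch of the DCR comparison, assuming purely numeric data and scikit-learn's `NearestNeighbors` (real evaluations would first encode categorical columns and normalize scales; the random `synthetic` array here is a stand-in for generator output):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
actual = rng.normal(size=(1000, 5))
synthetic = rng.normal(size=(1000, 5))  # stand-in for generator output

def dcr(queries: np.ndarray, reference: np.ndarray,
        exclude_self: bool = False) -> np.ndarray:
    """Distance of each query record to its closest reference record."""
    k = 2 if exclude_self else 1  # skip the zero self-distance if queries == reference
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(queries)
    return dist[:, -1]

dcr_s = dcr(synthetic, actual)                  # synthetic → closest actual
dcr_t = dcr(actual, actual, exclude_self=True)  # actual → closest other actual

# Red flag: synthetic records systematically *closer* to actual records than
# actual records are to each other (suggesting memorized near-copies).
print(f"median DCR_S = {np.median(dcr_s):.3f}, median DCR_T = {np.median(dcr_t):.3f}")
```

If the DCR_S distribution is not shifted toward zero relative to DCR_T, the synthetic records are no closer to real individuals than real individuals are to one another.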
How safe is it?
30
1. Holdout-Based Comparisons → requires holdout
2. Shadow Model Comparisons → heavy compute
3. Distance-Based Comparisons → easy compute & interpret
Learn More
31
→ https://mostly.ai/docs/tutorials
● Synthetic Behavioral Data
● Synthetic Geo Data
● Synthetic Text Data
● Fair Synthetic Data
● Synthetic Data Benchmarks
● JRC Report on Synthetic Data
● AI-based Re-Identification Attacks
● Privacy Assessment of Synthetic Data
● and so much more…
Give it a try!
32
#GoSynthetic
→ https://synthetic.mostly.ai/
33
Important Note 1
The baseline for Attribute Disclosure is any inference
based on a world that doesn't know about the individual.
Naturally, the privacy of a person who doesn't exist
cannot already be violated. Yet, some synthetic-data critics
mix up attribute inference with privacy violation.
Important Note 2
Meta-data (= value ranges) needs to be protected to be
safe against trivial membership inference attacks. Most
open-source as well as custom-coded solutions don't
protect against this. Otherwise, the mere existence of a "US President"
in a synthesized cancer dataset would already leak privacy.
34
Important Note 3
Privacy needs to be protected at the subject level, not at
the event level. No privacy check will ever be able to know by
itself whether that was the case. Thus, context
information about the data itself is always required to assess
privacy.
35