Dr. Michael Platzer, MOSTLY AI
Oct 21, 2022
Guest Lecture at Imperial College London
Everything You Always Wanted to Know About Synthetic Data in 30 minutes 😳
TITLE OF DOCUMENT | PUBLIC/INTERNAL/CONFIDENTIAL
Agenda
1. Synthetic Data
2. How accurate is it?
3. How safe is it?
4. Give it a try!
We’re MOSTLY AI, and we’re all about…
It’s smarter than using real data and we’ve made
structured synthetic data smarter to create, use and
share across your organization.
Let’s try to anonymize this guy!
useful, but re-identifiable
private, but useless
Let’s create some new faces!
[Figure: four face images labeled random-generated, model-generated, AI-generated, self-generated]
Structured Data
source: https://archive.ics.uci.edu/ml/datasets/census+income, ~49k records, 421 billion combinations possible
access denied 💣
Let’s create some structured data!
→ https://www.mostly.ai/
1. Upload
2. Synthesize 💫
3. Done ✔ in ~50 secs
Structured Synthetic Data
statistically representative, highly realistic, truly anonymous synthetic data - at any volume
Synthetic Data - The Lingua Franca of Learning
Original, Privacy-sensitive Data
restricted, biased, incomplete
Smarter Synthetic Data
realistic, representative, truly anonymous
Data Consumers
learn, collaborate, innovate
access denied 💣
How good is it?
Original Data vs. Synthetic Data
How close is the synthetic data to the original?
What does it mean to be close for datasets?
And then, how close should it even be?
[there is surprisingly little consensus on answering this question]
Automated Empirical Quality Assurance
See also our paper on Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data and our blog post series on accuracy and privacy.
How accurate is it?
1. Turing Tests - is it fake or not?
Measures realism, i.e. rule adherence, of synthetic samples at the record level. Can be performed by humans as well as machines. But it says nothing about statistical representativeness.
2. Compare Utility for Machine Learning
Train on synthetic data, test on real holdout data for a specific ML task, and compare the predictive accuracy to that of the same model trained on the original data. A strong test, but it only checks for one specific signal in the data.
3. Measure Deviations in Marginal Distributions
Calculate lower-order marginal empirical distributions (e.g. univariate and bivariate), and systematically measure any deviations between original and synthetic data.
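The third approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the QA procedure used in the talk: it assumes categorical columns and uses the total variation distance as the deviation measure, which is a choice made here. The function names and the toy rows are hypothetical.

```python
from collections import Counter


def empirical_dist(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}


def total_variation_distance(orig_values, syn_values):
    """Half the L1 distance between two empirical distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    p = empirical_dist(orig_values)
    q = empirical_dist(syn_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def marginal_deviations(orig_rows, syn_rows, columns):
    """Deviation for every univariate and every bivariate marginal."""
    result = {}
    for c in columns:
        result[(c,)] = total_variation_distance(
            [r[c] for r in orig_rows], [r[c] for r in syn_rows]
        )
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            result[(a, b)] = total_variation_distance(
                [(r[a], r[b]) for r in orig_rows],
                [(r[a], r[b]) for r in syn_rows],
            )
    return result


# Toy data (hypothetical): both columns match one by one,
# but the relationship between them does not.
orig = [
    {"sex": "F", "income": "high"},
    {"sex": "F", "income": "low"},
    {"sex": "M", "income": "low"},
    {"sex": "M", "income": "low"},
]
syn = [
    {"sex": "F", "income": "low"},
    {"sex": "F", "income": "low"},
    {"sex": "M", "income": "low"},
    {"sex": "M", "income": "high"},
]
devs = marginal_deviations(orig, syn, ["sex", "income"])
for marginal, deviation in devs.items():
    print(marginal, deviation)
```

In this toy case the univariate deviations are zero while the bivariate one is not, which is why checking single columns alone is not enough.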
How accurate is it? - Univariate Distributions
How accurate is it? - Bivariate Distributions
How accurate is it? - Train-Synthetic Test-Real
1. Split the original data into Training Data (80%) and Holdout Data (20%).
2. Synthesize the Training Data into Synthetic Data (e.g. 10x).
3. Train one ML model on the synthetic data and one on the real training data.
4. Evaluate both models on the Holdout.
5. Compare.
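The train-synthetic test-real setup above can be sketched as follows. Everything here is a stand-in chosen for illustration: `toy_synthesize` is a placeholder that samples from simple empirical frequencies and is not a real synthetic data generator, and `fit_lookup_classifier` is a deliberately trivial model. Only the 80/20 split, the 10x synthetic sample and the final comparison come from the slide.

```python
import random
from collections import Counter, defaultdict


def fit_lookup_classifier(rows):
    """Toy model: predict the majority label seen for each feature value."""
    by_feature = defaultdict(Counter)
    overall = Counter()
    for x, y in rows:
        by_feature[x][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]
    table = {x: c.most_common(1)[0][0] for x, c in by_feature.items()}
    return lambda x: table.get(x, default)


def accuracy(model, rows):
    """Share of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def toy_synthesize(rows, n, rng):
    """Placeholder generator (hypothetical): draw x from its empirical
    marginal, then y from the empirical conditional of y given x."""
    xs = [x for x, _ in rows]
    ys_by_x = defaultdict(list)
    for x, y in rows:
        ys_by_x[x].append(y)
    out = []
    for _ in range(n):
        x = rng.choice(xs)
        out.append((x, rng.choice(ys_by_x[x])))
    return out


def tstr(data, rng, holdout_share=0.2, syn_factor=10):
    """Return (accuracy trained on real, accuracy trained on synthetic),
    both evaluated on the same real holdout."""
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * holdout_share)
    holdout, train = data[:cut], data[cut:]
    synthetic = toy_synthesize(train, syn_factor * len(train), rng)
    acc_real = accuracy(fit_lookup_classifier(train), holdout)
    acc_syn = accuracy(fit_lookup_classifier(synthetic), holdout)
    return acc_real, acc_syn


# Toy dataset (hypothetical): one categorical feature, noisy binary label.
data = [("a", 1)] * 40 + [("a", 0)] * 10 + [("b", 0)] * 40 + [("b", 1)] * 10
real_acc, syn_acc = tstr(data, random.Random(42))
print("trained on real:", real_acc, "trained on synthetic:", syn_acc)
```

The closer the second number is to the first, the more of the predictive signal for this particular task has been retained in the synthetic data.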
How accurate is it?
→ https://mostly.ai/blog/boost-machine-learning-accuracy-with-synthetic-data/
How safe is it?
1. Run Shadow Models
a. for Membership Inference Attacks
b. for Attribute Disclosure Attacks
2. Compare Distances To Closest Records
a. Identical Match Share (IMS)
b. Distance to Closest Record (DCR)
c. Nearest Neighbour Distance Ratio (NNDR)
How safe is it? - Shadow Models
1. Start from training data T that includes a subject, and create T’ by excluding that subject.
2. Synthesize T into ST, and T’ into ST’.
3. Infer information about the subject from ST, and from ST’.
4. Is there a difference Δ between the two inferences?
We should not be able to infer anything more about an individual just because that person was included in the database used for synthesis.
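The with/without comparison described above can be sketched as a small harness. This is a simplified illustration, not the evaluation used in the talk: the attacker (`nearest_record_guess`), the two stand-in generators and the toy records are all hypothetical. A record is a pair of known attributes and one sensitive value.

```python
def nearest_record_guess(synthetic, known):
    """Attacker: copy the sensitive value of the synthetic record that is
    closest on the known attributes (squared Euclidean distance)."""
    closest = min(
        synthetic,
        key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], known)),
    )
    return closest[1]


def inference_gap(target, subject_ids, synthesize, infer=nearest_record_guess):
    """Attack accuracy with each subject included vs. excluded from training."""
    hits_with = hits_without = 0
    for i in subject_ids:
        known, secret = target[i]
        st = synthesize(target)                             # ST: subject included
        st_prime = synthesize(target[:i] + target[i + 1:])  # ST': subject excluded
        hits_with += infer(st, known) == secret
        hits_without += infer(st_prime, known) == secret
    n = len(subject_ids)
    return hits_with / n, hits_without / n


# Toy target data T (hypothetical): ((age, region), sensitive value)
target = [
    ((30, 1), "yes"),
    ((31, 1), "no"),
    ((50, 2), "no"),
    ((51, 2), "yes"),
]


def leaky_synthesizer(rows):
    """Stand-in for a generator that memorizes: it copies its input."""
    return list(rows)


def coarse_synthesizer(rows):
    """Stand-in for a generator whose output does not depend on any
    single record: it always emits the same generic record."""
    return [((40, 1), "no")]


leaky = inference_gap(target, range(len(target)), leaky_synthesizer)
coarse = inference_gap(target, range(len(target)), coarse_synthesizer)
print("leaky  (with, without):", leaky)
print("coarse (with, without):", coarse)
```

A gap between the two accuracies means that inclusion in the training data changed what can be inferred about a subject. Equal accuracies mean the attack learns nothing extra from inclusion, even if its baseline accuracy is above zero.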
How safe is it? - Shadow Models
Training on synthetic data automatically fixed the privacy leak (memorization) of some downstream ML models, without negatively impacting the overall predictive accuracy.
[Figure: accuracy scores for 50 randomly chosen subjects that were part of training; panels: Target T, Synthetic ST]
[Figure: accuracy scores for 50 randomly chosen subjects that were NOT part of training; panels: Target T’, Synthetic ST’]
How safe is it? - Shadow Models
Important Note 1
The baseline for Attribute Disclosure is any inference based on a world that doesn’t know about an individual. Naturally, the privacy of a person that doesn’t exist cannot already be violated. Yet, some synthetic data critics mix up attribute inference with privacy.
How safe is it? - Shadow Models
Important Note 2
Meta-data (= value ranges) needs to be protected to be safe against trivial membership inference attacks. Most open-source as well as custom-coded solutions don’t protect against that. Without such protection, the mere existence of a “US President” in a synthesized cancer dataset would already leak privacy.
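One possible way to protect the value range of a categorical column is to replace rare categories with a placeholder before the generator ever sees them. The slides do not describe a specific mechanism, so the function name, the threshold and the placeholder below are assumptions made for illustration only.

```python
from collections import Counter


def protect_rare_categories(values, min_count=5, placeholder="_RARE_"):
    """Replace every category that occurs fewer than `min_count` times.

    The threshold and the placeholder token are illustrative assumptions.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


# Toy column (hypothetical): one value identifies a single person.
occupations = ["nurse"] * 6 + ["teacher"] * 5 + ["US President"]
protected = protect_rare_categories(occupations)
print(Counter(protected))
```

A generator trained on the protected column cannot emit the unique value, so its presence in the output can no longer reveal that this person was in the training data.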
How safe is it? - Distance Measures
T: Training Data, H: Holdout Data, S: Synthetic Data
IMS(H, T) ≦ IMS(S, T) ?
DCR(H, T) ≦ DCR(S, T) ?
NNDR(H, T) ≦ NNDR(S, T) ?
Synthetic data shall be “as close as possible”, but not “too close” to the original data. A holdout helps to set a benchmark for being too close, as an ideal synthetic data generator creates new samples that behave exactly like actual samples that haven’t been seen before (= holdout data).
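The three distance measures can be computed for the holdout and for the synthetic data against the same training data, as in this minimal sketch. It assumes numeric records, Euclidean distance and the median as the summary statistic. These are choices made here for illustration and may differ from the setup behind the slides. The function names and toy records are hypothetical, and the training data needs at least two records.

```python
import math
from statistics import median


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_metrics(candidates, training):
    """IMS, DCR and NNDR of `candidates` measured against `training`."""
    dcr, nndr, identical = [], [], 0
    for c in candidates:
        # distances to the closest and second-closest training record
        d1, d2 = sorted(euclidean(c, t) for t in training)[:2]
        dcr.append(d1)
        # ratio near 0: far closer to one training record than to any other;
        # d2 == 0 implies d1 == 0, which is treated as 0.0 here (assumption)
        nndr.append(d1 / d2 if d2 > 0 else 0.0)
        identical += d1 == 0
    return {
        "IMS": identical / len(candidates),  # share of exact copies
        "DCR": median(dcr),
        "NNDR": median(nndr),
    }


# Toy data (hypothetical). S copies two training records, so it is "too close".
T = [(0, 0), (1, 0), (0, 1), (5, 5)]
H = [(1, 1), (4, 5)]
S = [(0, 0), (5, 5)]

h_vs_t = distance_metrics(H, T)
s_vs_t = distance_metrics(S, T)
print("holdout   vs training:", h_vs_t)
print("synthetic vs training:", s_vs_t)
```

A generator that is too close to its training data shows up as a larger IMS, or a smaller DCR or NNDR, for S than for H, which is the case for the copied records in this toy example.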
Learn More
→ https://blog.mostly.ai/
● Synthetic Behavioral Data
● Synthetic Geo Data
● Synthetic Text Data
● Fair Synthetic Data
● Synthetic Data Benchmarks
● JRC Report on Synthetic Data
● AI-based Re-Identification Attacks
● Privacy Assessment of Synthetic Data
● and so much more…
Give it a try!
Sign up, get going and join our Discord Community.
→ https://synthetic.mostly.ai/
