Data Synthesis: A Tool for Responsible Data Sharing
Khaled El Emam
16th June 2021
(c) Copyright 2019-2021 Replica Analytics Ltd.
Agenda
1 Introduction to Synthesis
General description of what synthetic data is and general use cases
2 Privacy and Utility
An examination of privacy risks and the utility of synthetic data
3 Methods
A brief look at methods for the generation of synthetic data
Synthetic Data Uses
• Data Sharing and Data Access
    • AI and data science projects
    • Software testing
    • Proof of concept and technology evaluations
    • Open data/open science
    • Hackathons and data competitions/challenges
• Data Amplification and Data Augmentation
    • Amplifying small datasets
    • Correcting bias
The Synthesis Process
Data Simulator
Allows generation of synthetic data without direct
access to real data
Simulator Exchange
Data Consumers
Two Synthesis Strategies
• Partial Synthesis: synthesize quasi-identifiers
• Full Synthesis: synthesize all variables
Identifiability Spectrum
Privacy Risks
Dataset                    Fully Synthetic Data   Original Data
Washington Hospital Data   0.0197                 0.098
Canadian COVID Data        0.0086                 0.034

A commonly used risk threshold = 0.09
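
The comparison on this slide can be sketched in a few lines. This is only an illustration: the numbers are copied from the slide, while the dictionary layout, the function name `below_threshold`, and the choice of a strict "less than" test are assumptions made for the example.

```python
# Risk values copied from the slide; 0.09 is the threshold quoted there.
RISK_THRESHOLD = 0.09

risks = {
    "Washington Hospital Data": {"synthetic": 0.0197, "original": 0.098},
    "Canadian COVID Data": {"synthetic": 0.0086, "original": 0.034},
}


def below_threshold(risk, threshold=RISK_THRESHOLD):
    """Return True when a risk value is under the threshold.

    Assumption: "acceptable" means strictly below the threshold.
    """
    return risk < threshold


# For each dataset, flag whether each version falls under the threshold.
summary = {
    name: {kind: below_threshold(value) for kind, value in values.items()}
    for name, values in risks.items()
}

for name, flags in summary.items():
    print(name, flags)
```

Both fully synthetic datasets fall under the threshold. The original Washington hospital data does not, and the original Canadian COVID data does.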
Privacy-Utility Tradeoff
Distribution Comparisons
Mortality Over Time
Mortality By Age
Utility Framework
• Data utility is an important concern of data users
• Utility has multiple dimensions
• Synthetic data may be optimized on multiple utility dimensions simultaneously to meet the needs of multiple users, or on single dimensions to address the needs of limited users
Risk-based Approach
Risk-based Approach
• Generalization
• Suppression
• Addition of noise
• Microaggregation
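
The four transformations listed above can be illustrated on toy values. This is a minimal sketch of the general ideas, not the method used in the talk: the function names, the 10-year age bands, the Gaussian noise, and the group size `k` are all assumptions chosen for the example.

```python
import random
from collections import Counter
from statistics import mean


def generalize(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(values, min_count=2):
    """Suppression: blank out values that occur fewer than min_count times."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else None for v in values]


def add_noise(values, scale=2.0, seed=0):
    """Addition of noise: perturb each value with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]


def microaggregate(values, k=3):
    """Microaggregation: sort, group into runs of at least k, use group means."""
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    # The last group absorbs any remainder so no group is smaller than k
    # (unless there are fewer than k values in total).
    starts = list(range(0, n - n % k, k)) or [0]
    out = [None] * n
    for j, start in enumerate(starts):
        end = n if j == len(starts) - 1 else start + k
        group = order[start:end]
        group_mean = mean(values[i] for i in group)
        for i in group:
            out[i] = group_mean
    return out


ages = [34, 36, 35, 71, 52, 55]
print([generalize(a) for a in ages])
print(suppress(["A", "A", "B"]))
print(microaggregate(ages, k=3))
```

Each function trades some detail for lower identifiability, which is why these transformations are tuned against a measured risk rather than applied blindly.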
Risk-based Approach
• Security controls
• Privacy controls
• Contractual controls
The Erosion of Trust
Skill Set
• The skills needed to create de-personalized datasets are very specialized, take time to develop, and are generally difficult to find cost-effectively
• This limits the ability to scale
• Synthesis requires minimal skills in practice – it is a computationally intensive process
Regulatory Questions
• Is synthetic data considered non-identifiable information?
• Does the act of converting identifiable information into non-identifiable synthetic information require additional consent or authorization?
• Can a data custodian outsource the creation of synthetic data?
• Can synthetic data be used for any purpose?
Sequential Synthesis
Variational Autoencoder (VAE)
Generative Adversarial Network (GAN)
TOOLBOX OF TECHNIQUES
cc: tunnelarmr - https://www.flickr.com/photos/27311060@N00
QUESTIONS
cc: an untrained eye - https://www.flickr.com/photos/26312642@N00


Editor's Notes

  1. Getting started with the generation of synthetic data: There are many ways to generate synthetic data, but at a high level, synthetic data is generated by fitting a model to real data and then sampling from the model to generate synthetic records. For example, these faces are actually synthetic images. The goal is to generate new records that look realistic, but for datasets that include personal information we also want to preserve the privacy of individuals in the real data. Synthetic data generation therefore needs to balance preserving privacy with generating high-utility data that looks realistic. This is especially true in the context of clinical trial data.
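
The "fit a model, then sample from it" idea in the note above can be sketched with a deliberately simple model. Everything here is an assumption made for illustration: the toy records are made up, the functions `fit` and `sample` are hypothetical names, and the model (a category frequency plus one normal distribution of age per category) stands in for the much richer generators named in the slides, such as sequential synthesis, VAEs, and GANs.

```python
import random
from collections import Counter, defaultdict
from statistics import NormalDist

# Made-up "real" records of (sex, age); not taken from any actual dataset.
real = [("F", 34), ("F", 41), ("M", 29), ("M", 52), ("F", 38), ("M", 47)]


def fit(records):
    """Fit a toy model: frequency of each sex, and a normal age model per sex.

    NormalDist.from_samples needs at least two ages in every group.
    """
    sex_counts = Counter(sex for sex, _ in records)
    ages = defaultdict(list)
    for sex, age in records:
        ages[sex].append(age)
    age_models = {sex: NormalDist.from_samples(vals) for sex, vals in ages.items()}
    return sex_counts, age_models


def sample(model, n, seed=0):
    """Draw n synthetic records: sample sex first, then age given sex."""
    sex_counts, age_models = model
    rng = random.Random(seed)
    categories = list(sex_counts)
    weights = [sex_counts[c] for c in categories]
    synthetic = []
    for _ in range(n):
        sex = rng.choices(categories, weights=weights)[0]
        dist = age_models[sex]
        age = rng.gauss(dist.mean, dist.stdev)
        synthetic.append((sex, round(age)))
    return synthetic


synthetic = sample(fit(real), n=5)
print(synthetic)
```

No synthetic record is copied from a real one, but the model only preserves what it was fitted to capture. That is the privacy and utility balance the note describes, and a sketch like this makes no privacy guarantee on its own.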