The Simulacrum, a Synthetic Cancer Dataset


This presentation describes how synthetic data can support cancer registries' efforts to improve understanding of cancer and enable research, while reducing privacy risks to cancer patients.
The Simulacrum imitates some of the data held securely by Public Health England’s National Cancer Registration and Analysis Service.
The data in the Simulacrum is entirely artificial. It does not contain data about real patients, so users can never identify a real person. It is free to use and allows anyone who wants to use record-level cancer data to do so, safe in the knowledge that while the data feels like the real thing, there is no danger of breaching patient confidentiality.

Published in: Healthcare

  1. Supporting precise data analysis without releasing patient records: the Simulacrum in action. Cong Chen, Paul Clarke, Lora Frayling, Sally Vernon, Brian Shand, Pesh Doubleday, Jem Rashbass
  2. Overview • Context and goals of this talk • Background: our motivating problem • What is synthetic data and how does it help? • What is the goal of the data exercise? • Building a synthetic data model in the Simulacrum • Results and applications • Conclusion
  3. Talk aims • Introduce and motivate concepts: • Synthetic data • The information governance environment • Externally guided analysis • Describe and explain: • The Simulacrum as synthetic data: what is it and how was it created? • Synthetic data-guided queries • How this has led to faster, more private answers
  4. Problems with sharing cancer data • Lots of data is available • Sharing it would enable researchers and industry to provide valuable insight into disease epidemiology, survival, clinical practice, resource utilisation and outcomes • Highly sensitive • Sharing data is an exercise in balancing risk against reward • Complex and intricate • Data dictionaries do not give a perfect view of what to expect, so analysis can be slow to converge
  5. Synthetic data • Data items which are not created by observations • This includes simulations (e.g. Synthea), partially synthetic data (generalised perturbation) and fully synthetic data • Does not represent individuals • Removes re-identification risk, but attribution risks remain
  6. Simulacrum project aims. Users should have direct access to a public resource: • Showing data as it looks to internal analysts • Be able to identify their cohort and the cohort size, data completeness and quality, and the codes/ranges used • Be able to prepare and code algorithms against the synthetic data. With a prepared analytical plan: • Engage PHE with the proposed study • Share code which runs on the real data • Be able to complete analysis without releasing row-level or other sensitive data. Take a data-driven approach where possible: • Use parameters • To adjust for differently sized or shaped datasets • To adjust to different privacy constraints/requirements
  7. Linked datasets • Data represents the course of patient treatment – we are interested in a coherent story and sensible timeline. • Patients can have multiple tumours, with very many treatment events – we need to capture this.
  8. How did we do it? • Key idea: sample from empirical conditional distributions. • Question: how do we keep from running out of data? • Use low-dimensional distributions. • Question: which variables do we condition on? • Use independence tests to find strongly associated variables.
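The key idea on this slide can be illustrated with a minimal sketch. It is not the Simulacrum's actual code: the column names (`site`, `stage`, `flag`), the toy rows and the choice of a Pearson chi-squared statistic as the association measure are all assumptions made for illustration. The sketch scores candidate conditioning variables by their association with a target, then samples the target from the empirical distribution observed for a given parent value.

```python
import random
from collections import Counter, defaultdict

# Toy rows with hypothetical columns; not real or Simulacrum data.
ROWS = (
    [{"site": "A", "stage": "early", "flag": "x"}] * 3
    + [{"site": "A", "stage": "early", "flag": "y"}] * 3
    + [{"site": "B", "stage": "late", "flag": "x"}] * 3
    + [{"site": "B", "stage": "late", "flag": "y"}] * 3
)


def chi_squared(rows, a, b):
    """Pearson chi-squared statistic for columns a and b (0 means no association)."""
    n = len(rows)
    joint = Counter((r[a], r[b]) for r in rows)
    count_a = Counter(r[a] for r in rows)
    count_b = Counter(r[b] for r in rows)
    stat = 0.0
    for x in count_a:
        for y in count_b:
            expected = count_a[x] * count_b[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat


def fit_conditional(rows, target, parents):
    """Count target values for each observed combination of parent values."""
    table = defaultdict(Counter)
    for row in rows:
        key = tuple(row[p] for p in parents)
        table[key][row[target]] += 1
    return dict(table)


def sample_conditional(table, parent_values, rng):
    """Draw one target value from the empirical distribution for these parents."""
    counts = table[tuple(parent_values)]  # KeyError if the combination was never observed
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Pick the candidate most strongly associated with the target, then sample.
scores = {c: chi_squared(ROWS, "stage", c) for c in ("site", "flag")}
best_parent = max(scores, key=scores.get)
table = fit_conditional(ROWS, "stage", [best_parent])
draw = sample_conditional(table, ["A"], random.Random(0))
```

Conditioning on one or two parents rather than on every column keeps each cell of the table well populated, which is the sense in which low-dimensional distributions avoid running out of data.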
  9. More details • Question: what do we do for linked tables? • Use all previous data (but in read-only mode). • Question: what about sequences of events? • Use information from the previous event (if it exists) and data in upstream tables – so a Markov model. • Question: what about sampling from small conditional distributions, which risk reflecting real individuals? • Cluster these distributions to meet accepted healthcare data standards.
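A first-order Markov model over events, with small conditional distributions merged, might look like the sketch below. This is an illustration under stated assumptions rather than the project's method: the event codes, the `min_count` threshold and the simple "pool every small cell together" rule (a stand-in for the clustering the slide mentions) are all hypothetical, and the upstream-table conditioning described on the slide is left out.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"

# Toy event sequences with hypothetical codes; not real treatment data.
SEQUENCES = [["A", "B"], ["A", "B"], ["A", "B", "C"], ["A", "C"]]


def fit_transitions(sequences):
    """Count next-event frequencies conditioned on the previous event."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        prev = START
        for event in seq:
            transitions[prev][event] += 1
            prev = event
        transitions[prev][END] += 1
    return dict(transitions)


def pool_small_cells(transitions, min_count):
    """Give every state seen fewer than min_count times one shared, pooled distribution.

    The pooled distribution can itself fall below min_count; a real
    implementation would need to check that and coarsen further.
    """
    small = [s for s, c in transitions.items() if sum(c.values()) < min_count]
    pooled = Counter()
    for state in small:
        pooled.update(transitions[state])
    result = {s: c for s, c in transitions.items() if s not in small}
    for state in small:
        result[state] = pooled
    return result


def sample_sequence(transitions, rng, max_len=20):
    """Generate one synthetic sequence by walking the chain from START."""
    seq, prev = [], START
    while len(seq) < max_len:
        counts = transitions[prev]
        events = list(counts)
        nxt = rng.choices(events, weights=[counts[e] for e in events], k=1)[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


model = pool_small_cells(fit_transitions(SEQUENCES), min_count=4)
synthetic = sample_sequence(model, random.Random(1))
```

With `min_count=4`, the states `B` (3 observations) and `C` (2 observations) share one pooled distribution, so no sampled transition out of either state is driven by only two or three source sequences.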
  10. What models look like (without the data)
  11. The Simulacrum as a dataset • Version 1, released 2018: 1.5 million tumours (corresponding to English incidence 2013-2015) with tumour, demographic and mortality data and chemotherapy treatment. • Representative at low dimensions (of variable combinations), not as good for complex detail. • Non-disclosive for public release. • Ongoing development.
  12. How does it look? [Figure: cumulative age distributions for breast and prostate cancer. Blue: synthetic, red: real. Age runs from 0 to 100 and cumulative proportion from 0 to 1.]
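A visual comparison of cumulative distributions like the one on this slide can also be summarised as a single number. The sketch below computes the largest vertical gap between two empirical cumulative distribution functions (the two-sample Kolmogorov-Smirnov statistic). The ages are made-up toy values, and nothing here is taken from the talk's own evaluation.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the fraction of the sample that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def max_cdf_gap(real, synthetic):
    """Largest absolute difference between the two empirical CDFs."""
    f, g = ecdf(real), ecdf(synthetic)
    points = sorted(set(real) | set(synthetic))
    return max(abs(f(x) - g(x)) for x in points)


# Toy ages, purely illustrative.
real_ages = [50, 60, 70, 80]
synthetic_ages = [50, 60, 70, 90]
gap = max_cdf_gap(real_ages, synthetic_ages)
```

A gap near 0 means the two curves nearly coincide. This only checks one variable at a time, which matches the deck's point that the data is representative at low dimensions.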
  13. Applications • Synthetic data used to back up a statistical query gateway (currently manual). • We’ve shared our synthetic data with partners to write queries against. Those queries have turned out to be robust, aware of the data formats and categories in our data, and able to run against our data. • Publications accepted for conferences and journal articles. • We then try to release non-disclosive aggregates and model parameters/diagnostics, without the personal data used to build those models.
  14. Current work • Better documentation of research and access process for less technical researchers • Model improvement, application in context of other datasets • More test-driven quality measures, automatic simulation with specific goals • Use other synthetic methodology within the data architecture • Fidelity isn’t objective – need to think about suitability for specific purpose
  15. Conclusions • Synthetic data is a game changer for supporting research and reducing risks • This opens understanding of the data and analysis to a wider audience while reducing workload and misunderstandings • Realistic understanding of aims and expectations helps a synthetic data project improve mutual understanding
  16. Acknowledgements • Analyses were based on anonymous aggregate patient data from the National Cancer Registration and Analysis Service. • Thank you to NCRAS and HDI, as well as everyone working on or who has worked on the Simulacrum. • Pick up the data at • is an amazing piece of work carried out by UCL students over 3 months with no reference to the real data.