Big Datasets and Highly Sensitive Data


Dr Bennet McComish, Menzies Institute of Medical Research, University of Tasmania, presented at the Research Integrity Advisors Data Management Workshop in Hobart 2017

Published in: Education

  1. Big Datasets and Highly Sensitive Data
     Bennet McComish, 31 July 2017
  2. Computational Genomics
     • The study of the structure, function, evolution, and mapping of genomes
     • Genes control our basic biology, how the body works, and how we respond to drugs
     • Changes in your genome make you who you are
     • They can also cause disease (such as cancer) or mean your cancer therapy doesn’t work (or works really well)
     • We study those changes to understand and improve your health
  3. What is the human genome?
     • The genome is basically a string of letters (A, T, C, G)
     • 1 human genome = 3.2 billion letters or ‘bases’ spread across 23 chromosomes
     • 3% of the genome (roughly 96 million bases) ‘coding’ for ~25,000 genes
     • Print version of one genome at the Wellcome Collection: 120 books of 1000 pages each, in 4.5 point text
  4. Genome sequencing
     • Technology now allows us to read the code of our genomes
     • We have a human ‘reference’ genome, made of the most common sequence (3.2 billion bases)
     • We compare a person’s genome with the reference to find all the sites that differ (~3 million per person, or 0.1%)
     • We then focus only on the places where there are differences
  5. Genome variation
  6. Approaching the "$1000 genome"
     • Exponential increase in the number of genomes being sequenced
     • Bottleneck has moved from data generation to data analysis
     • Cost of sequencing
  7. "Big data": HiSeq 200G run
     • Image data: 32 TB, discarded
     • Intensity data: 2 TB, usually discarded
     • Raw sequence and quality score data: 250 GB, backed up
     • Aligned sequence: 100 GB, aligned to the reference genome
     • Variation data: 1-10 GB, used in most analysis
     • Filtered variants of interest: 50-500 MB, depends on study
  8. Data overload?
     • One study: 254 samples from 5 large families
     • Don't try to drink from the fire hydrant!
     • Use smart study design
     • Filter the data:
       · changes that alter proteins
       · changes that run in families…
  9. Pipelines
     • Use fast parallelised analysis pipelines where possible
     • Even a parallelised pipeline takes several weeks to align 30 samples and call variants
     • This makes it difficult to use standard HPC queuing systems
  10. Menzies Computational Genomics Cluster
      Sunnydale
      • 4 compute nodes
      • 250 CPUs
      • 2 TB RAM
      • 214 TB working data
      • 200 TB secure archive storage
  11. Data storage requirements
      • The Australian Code for the Responsible Conduct of Research requires us to keep research data and primary materials
      • All raw sequence data and final filtered data must be kept
      • We can discard some intermediate files, but need a large amount of fast working storage
      • Data generation is now much cheaper and faster than data analysis
      • Data storage, transfer and analysis are now critical
  12. Indigenous genomes
      • High incidence of vulvar cancer in the East Arnhem Indigenous population
      • Ten years' work securing appropriate consent
      • Consent is strictly limited to the vulvar cancer study: Indigenous communities are often wary of genetic research
      • Risk management: public perception and trust is often the biggest risk identified, far worse than losing data
  13. Family studies
      • We infer family relationships from genetic data
      • These sometimes differ from those reported by the families
      • We can also infer information about family members not involved in the study
      • Full pedigrees can't always be published or shared
  14. Genomes technically identifiable
      • Privacy Act 1988: information is "personal" if identity "can reasonably be ascertained" from it
      • Identifying someone from their genome sequence is feasible and getting easier: Gymrek et al. (2013) Science 339:321
      • Shared/cloud resources are more challenging to use in terms of data privacy
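
The genome-size figures on slide 3 can be sanity-checked with a little arithmetic. The sketch below works out how many letters each printed page must hold, and how small one genome would be if each base were packed into 2 bits. The 2-bit packing is an assumption added here for illustration (four letters, no quality scores, no ambiguity codes); it is not something the slides describe.

```python
# Figures taken from slide 3.
GENOME_BASES = 3_200_000_000
BOOKS = 120
PAGES_PER_BOOK = 1000


def bases_per_page(total_bases, books, pages_per_book):
    """Average number of letters each printed page has to carry."""
    return total_bases / (books * pages_per_book)


def packed_size_bytes(total_bases, bits_per_base=2):
    """Size of the bare sequence if every base is stored in `bits_per_base` bits.

    Assumption: only A, T, C and G occur, so 2 bits per base are enough.
    """
    return total_bases * bits_per_base // 8


per_page = bases_per_page(GENOME_BASES, BOOKS, PAGES_PER_BOOK)
size_mb = packed_size_bytes(GENOME_BASES) / 1_000_000

print(f"Letters per page: {per_page:,.0f}")
print(f"Bare sequence at 2 bits per base: {size_mb:,.0f} MB")
```

The bare sequence is small; the large volumes on slide 7 come from images, quality scores and the many overlapping reads needed to reconstruct it.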
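
Slide 4's idea of comparing a genome with the reference and keeping only the sites that differ can be illustrated with a toy example. This is a minimal sketch, assuming two already-aligned sequences of equal length and substitutions only; real variant calling works from millions of short reads and must handle insertions, deletions and sequencing errors. The function name and the two short sequences are made up for illustration.

```python
def find_variant_sites(reference, sample):
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    have the same length (a simplifying assumption of this sketch).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up 10-base sequences, not real genomic data.
reference = "ATCGGATACC"
sample = "ATCGTATACG"

variants = find_variant_sites(reference, sample)
print(variants)  # [(5, 'G', 'T'), (10, 'C', 'G')]

# The slide's own figures: ~3 million differing sites out of 3.2 billion bases.
print(f"{3_000_000 / 3_200_000_000:.2%}")  # 0.09%, i.e. roughly 0.1%
```

Storing only the differing sites is why the variation data on slide 7 is so much smaller than the aligned sequence.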
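
The data volumes on slide 7 shrink by several orders of magnitude between the sequencer and the final analysis. The sketch below computes the reduction factor at each stage relative to the image data. It assumes 1 TB = 1000 GB and, where the slide gives a range, uses the upper bound; both choices are assumptions made for this example.

```python
# Sizes from slide 7 in GB, stored as (low, high).
# Assumption: 1 TB = 1000 GB and 1 GB = 1000 MB.
STAGES_GB = {
    "image data": (32_000, 32_000),
    "intensity data": (2_000, 2_000),
    "raw sequence and quality scores": (250, 250),
    "aligned sequence": (100, 100),
    "variation data": (1, 10),
    "filtered variants of interest": (0.05, 0.5),
}


def reduction_factors(stages):
    """Map each stage to how many times smaller its upper bound is than the first stage."""
    first_high = next(iter(stages.values()))[1]
    return {name: first_high / high for name, (_low, high) in stages.items()}


factors = reduction_factors(STAGES_GB)
for name, factor in factors.items():
    print(f"{name}: {factor:,.0f}x smaller than the image data")
```

Even at the upper bound, the filtered variants are tens of thousands of times smaller than the images the run started from.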
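
The two filters named on slide 8 (changes that alter proteins, changes that run in families) can be sketched as a simple predicate over variant records. Everything here is hypothetical: the `Variant` record, its fields, the sample IDs and the rule that a variant "runs in the family" when at least `min_fraction` of affected members carry it are assumptions made for illustration, not the study's actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record used only for this sketch."""
    site: str
    alters_protein: bool
    carriers: frozenset  # IDs of samples that carry the variant


def filter_variants(variants, affected, min_fraction=1.0):
    """Keep protein-altering variants shared by enough affected family members.

    `min_fraction` is an assumed threshold: 1.0 means every affected
    member must carry the variant.
    """
    if not affected:
        raise ValueError("need at least one affected family member")
    kept = []
    for variant in variants:
        if not variant.alters_protein:
            continue
        shared = len(variant.carriers & affected) / len(affected)
        if shared >= min_fraction:
            kept.append(variant)
    return kept


# Made-up family and variants.
affected = frozenset({"F1-01", "F1-02", "F1-03"})
variants = [
    Variant("site-A", True, frozenset({"F1-01", "F1-02", "F1-03"})),
    Variant("site-B", True, frozenset({"F1-01"})),
    Variant("site-C", False, frozenset({"F1-01", "F1-02", "F1-03"})),
]

kept = filter_variants(variants, affected)
print([v.site for v in kept])  # ['site-A']
```

Lowering `min_fraction` would keep variants that only some affected members carry, trading a shorter candidate list for tolerance of incomplete sharing.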
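
Slide 13 notes that family relationships can be inferred from genetic data. One simple illustration of why reported and genetic relationships can disagree is a Mendelian consistency check: a child should have one allele from each parent at every site. This is a deliberately simplified sketch, not the presenter's method; it assumes error-free genotypes at sites with two copies per person, and the genotypes below are made up. Real relationship inference uses genome-wide statistics and allows for genotyping error.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child's two alleles can be split as one from each parent.

    Each genotype is a pair of alleles, e.g. ("A", "G").
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_error_rate(sites):
    """Fraction of sites where the reported trio is genetically inconsistent.

    `sites` is a list of (child, mother, father) genotype triples.
    """
    if not sites:
        raise ValueError("need at least one site")
    errors = sum(
        1
        for child, mother, father in sites
        if not is_mendelian_consistent(child, mother, father)
    )
    return errors / len(sites)


# Made-up genotypes for a reported child/mother/father trio.
sites = [
    (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    (("A", "A"), ("G", "G"), ("A", "G")),  # inconsistent: mother has no A
    (("C", "T"), ("C", "T"), ("T", "T")),  # consistent
    (("C", "C"), ("C", "T"), ("C", "C")),  # consistent
]

print(mendelian_error_rate(sites))  # 0.25
```

A rate near zero across many sites supports the reported relationship; a high rate suggests it differs from the one the family reported, which is the kind of finding that cannot always be published or shared.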