Life sciences big data use cases

Big data use cases for sequencing and life sciences given at the Big Data Management workshop at Imperial College 27 June 2013

  • Speaker note: Sequencing is the start of most analysis. People = unmanaged data: data in the wrong place, duplicated, nobody can find anything. Inc systems: backups/security; capacity planning?

Transcript

  • 1. Big data and Life Sciences Guy Coates Wellcome Trust Sanger Institute gmpc@sanger.ac.uk
  • 2. The Sanger Institute Funded by Wellcome Trust. • 2nd largest research charity in the world. • ~700 employees. • Based in Hinxton Genome Campus, Cambridge, UK. Large scale genomic research. • Sequenced 1/3 of the human genome. (largest single contributor). • Large scale sequencing with an impact on human and animal health. Data is freely available. • Websites, ftp, direct database access, programmatic APIs. • Some restrictions for potentially identifiable data. My team: • Scientific computing systems architects.
  • 3. DNA Sequencing TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG TGCACTCCAGCTTGGGTGACACAG CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA ATGAAGTAAATCG ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC 250 Million * 75-108 base fragments. ~1 TByte / day / machine. Human Genome (3 GBases).
  • 4. Economic Trends: Cost of sequencing halves every 12 months. • Wrong side of Moore's Law. The Human genome project: • 13 years. • 23 labs. • $500 Million. A Human genome today: • 3 days. • 1 machine. • $8,000. Trend will continue: • $1000 genome is probable within 2 years. • Informatics not included.
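The halving trend on this slide can be turned into a back-of-envelope projection. A minimal sketch, assuming clean exponential decline; only the $8,000 starting figure and the 12-month halving period come from the slide, the function itself is mine:

```python
# Project sequencing cost under the slide's "halves every 12 months" trend.
def projected_cost(start_cost, years, halving_period_years=1.0):
    """Cost after `years`, halving every `halving_period_years`."""
    return start_cost / (2 ** (years / halving_period_years))

# From $8,000 today, the $1,000 genome is three halvings away.
print(projected_cost(8000, 3))  # 1000.0
```

On these assumptions the $1,000 genome arrives in three years rather than two; the slide's "probable within 2 years" presumably anticipates the trend accelerating.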
  • 5. The scary graph Peak yearly capillary sequencing: 30 Gbase. Current weekly sequencing: 7-10 Tbases. Data doubling time: 4-6 months.
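The quoted doubling time implies a startling annual growth factor, which a one-liner makes concrete (assuming clean exponential growth; the 4-6 month figures are from the slide):

```python
# Annual growth factor implied by a given data-doubling time.
def yearly_growth(doubling_time_months):
    return 2 ** (12 / doubling_time_months)

print(yearly_growth(6))  # 4.0 — fourfold per year at the slow end
print(yearly_growth(4))  # 8.0 — eightfold per year at the fast end
```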
  • 6. Gen III Sequencers this year?
  • 7. Pbytes! Sequencing data flow: Sequencer → Processing/QC → Comparative analysis → Archive → Internet. Data volumes at each stage: raw data (10 TB) → sequence (500 GB) → alignments (200 GB) → variation data (1 GB) → features (3 MB). Structured data (databases) vs unstructured data (flat files).
  • 8. A Sequencing Centre Today CPU • Generic x86_64 cluster. • (16,000 cores) Storage • ~1 TB per day per sequencer. • (15 PB disk) • (Lustre + NFS) Metadata driven data management • Only keep our important files. • Catalogue them, so we can find them! • Keep the number of copies we want, and no more. • (iRODS, in house LIMs). A solved problem; we know how to do this.
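The "keep the number of copies we want, and no more" policy above can be sketched in a few lines. This is a hypothetical illustration, not the Sanger implementation: the in-memory dictionary stands in for the iRODS/LIMS catalogue, and all file names and location labels are invented.

```python
# Toy metadata catalogue: each entry records where replicas live and how
# many copies policy says we want (0 = not worth keeping at all).
catalogue = {
    "run1234.bam": {"replicas": ["lustre", "irods"], "wanted": 2},
    "run1234.tmp": {"replicas": ["lustre"], "wanted": 0},
}

def replication_actions(catalogue):
    """Yield (path, action) pairs to reconcile replica counts with policy."""
    for path, meta in catalogue.items():
        have, want = len(meta["replicas"]), meta["wanted"]
        if have > want:
            yield path, f"drop {have - want} cop(ies)"
        elif have < want:
            yield path, f"add {want - have} cop(ies)"

for path, action in replication_actions(catalogue):
    print(path, action)  # only the .tmp file needs attention
```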
  • 9. This is not big data
  • 10. This is not big data either...
  • 11. Proper Big Data We want to compute across all the data. • Sequencing data (of course). • Patient records, treatment and outcomes. Why? • Cancer: tie in genetics, patient outcomes and treatments. • Pharma: high failure rate due to genetic factors in drug response. • Infectious disease epidemiology. • Rare genetic diseases. Many genetic effects are small • Million member cohorts to get good signal:noise.
  • 12. Translation: Genomics of drug sensitivity in Cancer BRAF inhibitors in malignant melanoma: pre-treatment → molecular diagnostic (BRAF mutation positive ✔) → BRAF inhibitor → 15 weeks of treatment. 70% response rate vs 10% for standard chemotherapy. Slide from Mathew Garnet (CGP).
  • 13. Current Data Archives EBI ERA / NCBI SRA store results of all sequencing experiments. • Public data availability: A Good Thing (tm). • 1.6 Pbases Problems • Archives are “dark”. • You can put data in, but you can't do anything with it. • In order to analyse the data, you need to download it all. • 100s of Tbytes Situation replicated at the local institute level too. • e.g. How does CRI get hold of their data currently held at Sanger?
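The "download it all" problem is easy to quantify. A back-of-envelope sketch; the 300 TB and 10 Gbit/s figures are illustrative choices of mine, not from the slides:

```python
# Time to pull data out of a "dark" archive over a given link,
# assuming full, sustained link utilisation (optimistic in practice).
def transfer_days(tbytes, gbit_per_s):
    seconds = (tbytes * 8 * 1e12) / (gbit_per_s * 1e9)
    return seconds / 86400

# 300 TB over a dedicated 10 Gbit/s link:
print(round(transfer_days(300, 10), 1))  # 2.8 days
```

Even under ideal conditions a few hundred TB ties up a dedicated 10 Gbit/s link for days, which is why moving the compute to the data is attractive.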
  • 14. The Vision Global Alliance for sharing genomic and clinical data • 70 research institutes & hospitals (including Sanger, Broad, EBI, BGI, Cancer Research UK) Million cancer genome warehouse • (UC Berkeley)
  • 15. To the Cloud! (Diagram: Institute A and Institute B, each with their own data and analysis pipelines, feeding a shared cloud that holds the data and runs the analysis pipelines.)
  • 16. How do we get there?
  • 17. Code & Algorithms Bioinformatics code: • Integer not FP heavy. • Single threaded. • Large memory footprints. • Interpreted languages. Not a good fit for future computing architectures. Expensive to run on public clouds. • Memory footprint leads to unused cores. Out of scope for a data talk, but still an important point.
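The "memory footprint leads to unused cores" point can be made concrete with a little arithmetic. The node shape and job size below are illustrative assumptions of mine, not figures from the talk:

```python
# How many single-threaded, memory-hungry jobs actually fit on a node
# once RAM, not core count, is the binding constraint.
def usable_cores(node_cores, node_ram_gb, job_ram_gb):
    """Cores you can occupy with one single-threaded job each."""
    return min(node_cores, node_ram_gb // job_ram_gb)

# A 16-core, 64 GB cloud node running 12 GB single-threaded jobs:
print(usable_cores(16, 64, 12))  # 5 — the other 11 cores sit idle
```

Paying for 16 cores while using 5 is why large-memory bioinformatics workloads are expensive on public clouds.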
  • 18. Architectural differences (Diagram: a global file system with many CPUs on a fast network and static nodes, vs an object store with CPUs on a slow network and dynamic nodes.)
  • 19. Whose Cloud? A VM is just a VM, right? • Clouds are supposed to be programmable. • Nobody wants to re-write a pipeline when they move clouds. Storage: • Posix: • (lustre/GPFS/EMC)? • Object: • Low level: AWS S3, Openstack SWIFT, Ceph/rados • High level: Data management layer (eg iRODS)? Cloud Interoperability? • Do we need more standards?! Pragmatic approach: • First person to make one that actually works, wins.
  • 20. Moving data Data still has to get from our instruments to the Cloud. Good news: • Lots of products out there for wide area data movement. Bad news: • We are currently using all of them(!) Network bandwidth still a problem. • Research institutes have fast data networks. • What about your GP's surgery? Tools: UDT / UDR, rsync / ssh, genetorrent.
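It is worth quantifying what the instrument-to-cloud stream actually demands of the network. The ~1 TB/day/machine figure is from slide 3; the machine counts below are illustrative:

```python
# Sustained bandwidth needed to stream sequencer output to the cloud,
# assuming output is spread evenly across the day.
def required_gbit(machines, tb_per_day_each):
    bits_per_day = machines * tb_per_day_each * 8e12
    return bits_per_day / 86400 / 1e9

print(round(required_gbit(1, 1), 2))   # 0.09 Gbit/s for one machine
print(round(required_gbit(30, 1), 1))  # 2.8 Gbit/s for a 30-machine centre
```

Trivial for a research institute's network; far beyond what a GP's surgery has.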
  • 21. Identity Access Unlikely that data archives are going to allow anonymous access. • Who are you? Federated identity providers. • Is everyone signed up to the same federation? • Does it include the right mix of cross-national co-operation? • Does your favourite bit of software support federated IDs? e.g. Janet Moonshot.
  • 22. The LAW Legal • Theory: anonymised data can be stored and accessed without jumping through hoops. • Practice: Risk of re-identification. Becomes easier the more data you have. • Medical records are hard to anonymise and still be useful. Ethical • Medical consent process adds more restrictions above data-protection law. • Limits data use & access even if anonymised. Controlled data access? • No ad-hoc analysis. • Access via restricted API only (“trusted intermediary model”). Policy development ongoing. • Cross-jurisdiction for added fun.
  • 23. Summary We know where we want to get to. • No shortage of Vision There are lots of interesting tools and technologies out there. • Getting them to work coherently together will be a challenge. • Prototyping efforts are underway. • Need to leverage expertise and experience in other fields. Not simply technical issues: • Significant policy issues need to be worked out. • We have to bring the public along.
  • 24. Acknowledgements ISG: • James Beal • Helen Brimmer • Pete Clapham Global Alliance whitepaper: http://www.sanger.ac.uk/about/press/assets/130605-white-paper.pdf Million Cancer Genome Warehouse whitepaper: http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-211.html