Future Architectures for genomics

This talk details some of the work we are doing looking at new CPU and storage architectures to support sequencing and bioinformatics.

  • Sequencing is the start of most analysis
    People = unmanaged data
    Data in the wrong place
    Duplicated
    Nobody can find anything
    Including systems: backups/security
    Capacity planning?

Transcript

  • 1. Future Architectures for genomics
    Guy Coates, Wellcome Trust Sanger Institute
    gmpc@sanger.ac.uk
  • 2. The Sanger Institute
    Funded by Wellcome Trust.
      • 2nd largest research charity in the world.
      • ~700 employees.
      • Based in Hinxton Genome Campus, Cambridge, UK.
    Large scale genomic research.
      • Sequenced 1/3 of the human genome (largest single contributor).
    Large scale sequencing with an impact on human and animal health.
    Data is freely available.
      • Websites, ftp, direct database access, programmatic APIs.
      • Some restrictions for potentially identifiable data.
    My team:
      • Scientific computing systems architects.
  • 3. The not quite so scary graph
    Peak yearly capillary sequencing: 30 Gbases.
    Current weekly sequencing: 7-10 Tbases.
    Not increasing as aggressively as the historical trend.
  • 4. New CPU Architectures
  • 5. MRSA Outbreak
    MRSA:
      • Antibiotic resistant bacteria.
      • Carried by many people with no symptoms.
      • But can cause serious diseases (boils, sepsis, necrotising fasciitis).
      • Bad news in hospitals (14 patients presented with abscesses during the outbreak).
    Infection control via traditional methods:
      • Compare antibiotic resistance profiles of samples.
      • Epidemiology detective work to find common factors.
      • Close wards for deep cleans.
    3 different outbreaks of MRSA over 6 months.
  • 6. Sequencing Advantages
    Sequence samples from infections:
      • Compare sequences and build phylogenetic trees.
      • Identified extra cases, missed by traditional screening.
      • Traced infection to a staff member (asymptomatic).
    “Real Time” information:
      • culture → sequence → informatics within 48 hrs.
    Aids, not replaces, current infection control protocols.
    [Figure: phylogenetic tree with nodes labelled P1 to P26 and H; one branch is annotated "Loss of ermC plasmid".]
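
The "compare sequences" step can be illustrated with a minimal sketch: counting single-base differences between pairs of aligned sequences, which gives the kind of distance table a tree-building method starts from. The sample names and sequences are invented for illustration, and this is not the pipeline used in the outbreak study.

```python
from itertools import combinations
from typing import Dict, Tuple


def snp_distance(a: str, b: str) -> int:
    """Count positions where two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(samples: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise distances, keyed by (name, name) in sorted name order."""
    return {
        (m, n): snp_distance(samples[m], samples[n])
        for m, n in combinations(sorted(samples), 2)
    }


# Hypothetical aligned sequences; real isolates are millions of bases long.
samples = {
    "P1": "ACGTACGTAC",
    "P2": "ACGTACGTAT",
    "P3": "ACGAACGTAT",
    "X1": "TCGAACCTAT",
}

distances = distance_matrix(samples)
closest = min(distances, key=distances.get)  # first pair with the fewest differences

for pair, d in distances.items():
    print(pair, d)
print("closest pair:", closest)
```

Isolates separated by very few differences are candidates for belonging to the same transmission chain; a real analysis would build a tree from the whole table rather than look at one pair.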
  • 7. Informatics in a hospital setting?
    Can we produce lab friendly informatics as well as sequencing?
      • Low power, low footprint compute.
    Informatics was run in our datacentre (20,000 cores, 20 PB storage).
      • Can we do it on something you can plug into a 13 amp socket in the lab?
  • 8. ARM for Bioinformatics
    Not as stupid as it first seems.
    Bioinformatics code:
      • Single threaded, integer, not floating point dominated.
    Bacteria have small genomes:
      • 3 Mbases (MRSA) vs 3 Gbases (Human).
      • May not need a 64-bit memory address space for some problems.
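
A rough back-of-envelope check of the address-space point, as a sketch only. It assumes one byte per base and a made-up `overhead_factor` standing in for indexes and working buffers; neither figure comes from the talk. At one byte per base a raw human genome (3 x 10^9 bytes) sits just under the 4 GiB limit, so any realistic overhead pushes it over, while a bacterial genome leaves plenty of room.

```python
ADDRESS_SPACE_32BIT = 2 ** 32  # bytes addressable with 32-bit pointers (4 GiB)

MRSA_BASES = 3_000_000        # ~3 Mbases, from the slide
HUMAN_BASES = 3_000_000_000   # ~3 Gbases, from the slide


def working_set_bytes(bases: int, bytes_per_base: int = 1,
                      overhead_factor: float = 1.0) -> float:
    """Estimated memory needed: sequence size times an assumed overhead."""
    return bases * bytes_per_base * overhead_factor


def fits_in_32bit(bases: int, bytes_per_base: int = 1,
                  overhead_factor: float = 1.0) -> bool:
    """True if the estimated working set fits in a 32-bit address space."""
    return working_set_bytes(bases, bytes_per_base, overhead_factor) < ADDRESS_SPACE_32BIT


# overhead_factor=5 is an illustrative assumption, not a measured value.
print("MRSA fits: ", fits_in_32bit(MRSA_BASES, overhead_factor=5))
print("Human fits:", fits_in_32bit(HUMAN_BASES, overhead_factor=5))
```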
  • 9. Compute Farm Memory footprint
    Some jobs do have large memory footprints:
      • But a large number of jobs are below 1.5 GB.
  • 10. Intel vs ARM
    ARM:
      • Calxeda Energy Core.
      • 4 core 1.0 GHz ARM A9, 4 GB RAM.
    Intel:
      • IBM HS22.
      • 6 core Xeon X5650 (Westmere) 2.67 GHz, 36 GB RAM.
      • (Not the latest or greatest, but what we had available.)
    OS: Ubuntu 12.04 LTS
      • Kernel 2.6.32 on Intel, 3.6 on ARM.
      • gcc 4.6.3.
    Porting code:
      • Surprisingly easy.
      • R / Python / Perl: interpreted languages worked in our favour!
      • C “just worked”.
  • 11. Bioinformatics Codes
    Standard set of bioinformatics code:
      • ARM ~5x slower than the Intel.
      • Right ballpark to be more efficient on performance/watt.
    Some exceptions:
      • Exonerate 2.2: 250x slower. Pointer chasing?
      • Hmmer v3: uses sse2 on Intel.
    [Chart: speedup, ARM vs Intel, for exonerate, blastn, blastp and tblastn.]
  • 12. SNP calling Pipeline
    Looks for single base changes in DNA sequence.
      • C, Java with Perl glue.
    Pipeline stages: Map (BWA) → Sort (samtools) → SNP call (mpileup).
    Data sizes:
      • 2 x 40 Mbyte sequence files.
      • 3 Mbyte reference.
    Intel is about 4.3x the speed of the ARM.
      • 310 vs 1326 seconds.
    [Chart: speedup, ARM vs Intel.]
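
The timings on this slide can be turned into a speed ratio and a rough energy-per-job comparison. This is a sketch under stated assumptions: the two run times are from the slide, but the wattages below are placeholders chosen for illustration, not measurements or datasheet values from the talk.

```python
ARM_SECONDS = 1326.0    # pipeline run time on ARM, from the slide
INTEL_SECONDS = 310.0   # pipeline run time on Intel, from the slide


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the quicker system finishes the same job."""
    return slow_seconds / fast_seconds


def energy_joules(watts: float, seconds: float) -> float:
    """Energy for one job at a constant power draw."""
    return watts * seconds


def arm_uses_less_energy(arm_watts: float, intel_watts: float) -> bool:
    """Compare energy per pipeline run for the two assumed power draws."""
    return (energy_joules(arm_watts, ARM_SECONDS)
            < energy_joules(intel_watts, INTEL_SECONDS))


ratio = speedup(ARM_SECONDS, INTEL_SECONDS)

# Placeholder power draws (assumptions, for illustration only).
ASSUMED_INTEL_WATTS = 95.0
ASSUMED_ARM_WATTS = 5.0

# ARM breaks even on energy when it draws intel_watts / ratio.
break_even_arm_watts = ASSUMED_INTEL_WATTS / ratio

print(f"Intel/ARM speed ratio: {ratio:.2f}")
print(f"ARM break-even power:  {break_even_arm_watts:.1f} W")
print("ARM wins on energy:", arm_uses_less_energy(ASSUMED_ARM_WATTS, ASSUMED_INTEL_WATTS))
```

The break-even figure shows why the performance/watt question depends on real system power: a slower node still wins on energy per job only if its draw is lower by more than the speed ratio.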
  • 13. Further Investigations
    Scaling tests:
      • ARM A9 showed poorer scaling than Intel as we loaded the cores up.
      • Kernel related?
    Explore compiler / JVM opts:
      • Built with the package defaults (typically gcc / -O2).
      • gcc vs icc etc.
    Lots of change in the ecosystem: ARM A15 vs ARM 64 vs Atom etc.
  • 14. Conclusions
    Initial tests look promising.
      • Code runs.
    Is ARM cost effective?
      • Numbers look to be in the right ballpark wrt performance / Watt.
      • Going from datasheet power numbers.
    We don't know yet on price/performance.
      • We need to see real production systems so we can get real power / $ numbers.
  • 15. Object Storage
  • 16. Storage for Sequencing
    Sequencing produces lots of data.
      • Lots of data + lots of people quickly becomes unmanageable.
    We use iRODS to manage data.
      • Stores data + *metadata* in a storage agnostic system.
  • 17. Sequencing data flow.
    [Diagram: boxes labelled Sequencer, Processing/QC, analysis, datastore and Internet.]
    Data ranges from unstructured (flat files) to structured (databases):
      • Raw data (10 TB)
      • Sequence (500 GB)
      • Alignments (200 GB)
      • Variation data (1 GB)
      • Feature (3 MB)
  • 18. Sequencing data flow.
    [Diagram: as the previous slide, with boxes labelled Sequencer, Processing/QC, analysis, datastore and Internet, plus the annotation "Unmanaged data: Pbytes!".]
    Data ranges from unstructured (flat files) to structured (databases):
      • Raw data (10 TB)
      • Sequence (500 GB)
      • Alignments (200 GB)
      • Variation data (1 GB)
      • Feature (3 MB)
  • 19. iRODS Data Management
    [Diagram of iRODS components:]
      • User interface: WebDAV, icommands, fuse.
      • iRODS servers, with data in S3, data in a database, or data on disk.
      • ICAT: catalogue database.
      • Rule engine: implements policies.
  • 20. Sanger Implementation
    Storage:
      • 5 PB storage (2 x 2.5 PB).
      • Data stored on standard POSIX filesystems at the backend.
      • Mixture of vendors and filesystem sizes.
      • 40 TB → 200 TB chunks.
    Database:
      • Oracle 10g RAC.
    Replicated:
      • One copy in each of two sections of our datacentre.
      • (Probably move one off site this year.)
    Federated:
      • Split system to isolate research teams from one another.
      • Still a single namespace.
    User interaction:
      • Via CLI tools (think command line ftp) or via C API.
      • Archival system, separate from our HPC Lustre systems.
  • 21. iRODS Filesystem Issues
    We have lots of filesystems behind iRODS.
      • 60 filesystems.
    Filesystems need TLC.
      • They get full.
      • Sometimes they go wrong.
      • Can you fsck a 200 TB filesystem?
    Is there an alternative storage backend that is simpler or cheaper?
  • 22. Object Stores
    System for storing objects (files):
      • “put” and “get” semantics.
    Not POSIX:
      • Fewer features, but simpler to implement; should be more scalable and robust.
    No directory structure; just a set of object IDs.
      • You need to implement your own organisational schema on top of the object store.
    Lots of alternatives:
      • Commercial, open source, hardware or software based.
      • Different approaches to data integrity / DR.
      • Replication vs erasure coding.
    No standard APIs.
      • Amazon S3 is the de facto API.
      • Lowest common denominator.
      • (S3 currently does not support seek operations, which is important if we are dealing with large structured files.)
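
The "replication vs erasure coding" trade-off can be put in numbers with a small sketch. The formulas are the standard ones (n full copies cost n times the usable capacity; a k+m erasure code costs (k+m)/k). The 3-copy and 10+4 layouts are illustrative assumptions, not configurations named in the talk, and 2.5 PB is borrowed from the "2 x 2.5 PB" figure on the Sanger Implementation slide.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per usable byte with full replication."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return float(copies)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per usable byte with a k+m erasure code."""
    if data_fragments < 1 or parity_fragments < 0:
        raise ValueError("invalid fragment counts")
    return (data_fragments + parity_fragments) / data_fragments


def raw_capacity_pb(usable_pb: float, overhead: float) -> float:
    """Raw capacity to buy for a given usable capacity."""
    return usable_pb * overhead


USABLE_PB = 2.5  # one copy of the archive, per the earlier slide

# Illustrative layouts (assumptions): 3 full copies vs a 10+4 erasure code.
three_copies = raw_capacity_pb(USABLE_PB, replication_overhead(3))
ten_plus_four = raw_capacity_pb(USABLE_PB, erasure_overhead(10, 4))

print(f"3x replication: {three_copies:.2f} PB raw, survives loss of 2 copies")
print(f"10+4 erasure:   {ten_plus_four:.2f} PB raw, survives loss of 4 fragments")
```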
  • 23. Object store and iRODS
    Object stores are interesting, but very different from our POSIX world.
    Conceptually, object storage is a good fit for iRODS.
    Transparency:
      • iRODS is storage back-end agnostic.
      • Putting an object store behind iRODS makes it transparent to the end user.
    Provides a good organisational schema.
      • Searchable metadata.
    (Potentially) simplifies storage administration.
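
To make the layering concrete, here is a minimal in-memory sketch of a catalogue sitting on top of a flat put/get store: the store knows only opaque object IDs, while the catalogue supplies logical paths and searchable metadata. The class names, paths and metadata keys are all hypothetical, and this is not the iRODS API, only an illustration of the idea.

```python
import uuid
from typing import Dict, List


class ObjectStore:
    """Flat put/get store: no directories, only opaque object IDs."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        object_id = uuid.uuid4().hex
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


class Catalogue:
    """Organisational schema on top: logical path -> object ID + metadata."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store
        self._ids: Dict[str, str] = {}
        self._metadata: Dict[str, Dict[str, str]] = {}

    def put(self, path: str, data: bytes, **metadata: str) -> None:
        self._ids[path] = self._store.put(data)
        self._metadata[path] = dict(metadata)

    def get(self, path: str) -> bytes:
        # The user never sees the object ID, only the logical path.
        return self._store.get(self._ids[path])

    def search(self, **criteria: str) -> List[str]:
        """Return sorted paths whose metadata matches every criterion."""
        return sorted(
            path for path, meta in self._metadata.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        )


catalogue = Catalogue(ObjectStore())
catalogue.put("/seq/run1/lane1.bam", b"alignment-bytes", study="MRSA", type="bam")
catalogue.put("/seq/run1/lane2.bam", b"more-bytes", study="MRSA", type="bam")
catalogue.put("/seq/run2/lane1.bam", b"other-bytes", study="human", type="bam")

print(catalogue.search(study="MRSA"))
print(catalogue.get("/seq/run1/lane1.bam"))
```

Because callers only use paths and metadata, the store behind the catalogue could be swapped without changing user-facing code, which is the transparency argument on the slide.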
  • 24. Questions we need to answer
    How does iRODS replication and object store replication interact?
      • iRODS knows how to replicate objects.
      • Most (all?) object stores have replication / erasure coding mechanisms.
      • What is the right level to do the replication at?
    How are seek operations handled?
      • We can currently pull records out of BAM files without having to download the entire file. Will this still work on an object store?
    Data locality:
      • Important in multi-site / federated iRODS installations.
      • If I get an object from iRODS, I'd like to talk to the storage elements “nearest” to me on the network.
    Many ways to potentially tackle this:
      • Load balancers, proxies and other network tricks.
      • Make iRODS aware of the object store topology.
      • Not clear what the best mechanism will be.
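
The seek question can be illustrated with a small sketch comparing a seekable file with a store that offers only whole-object get. The `WholeObjectStore` class and the object name are hypothetical; real BAM access uses an index to find the offset, whereas here the offset is simply given.

```python
import io
from typing import BinaryIO, Dict


def read_record_seek(f: BinaryIO, offset: int, length: int) -> bytes:
    """POSIX-style access: jump to the record and read only its bytes."""
    f.seek(offset)
    return f.read(length)


class WholeObjectStore:
    """Hypothetical store with put/get only; counts bytes it hands back."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self.bytes_transferred = 0

    def put(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data

    def get(self, object_id: str) -> bytes:
        data = self._objects[object_id]
        self.bytes_transferred += len(data)
        return data


def read_record_whole_object(store: WholeObjectStore, object_id: str,
                             offset: int, length: int) -> bytes:
    """Without seek: fetch the entire object, then slice out the record."""
    return store.get(object_id)[offset:offset + length]


data = bytes(range(256)) * 4096  # 1 MiB stand-in for a large structured file
offset, length = 1000, 16

via_seek = read_record_seek(io.BytesIO(data), offset, length)

store = WholeObjectStore()
store.put("example-object", data)
via_get = read_record_whole_object(store, "example-object", offset, length)

print("same record:", via_seek == via_get)
print("bytes read with seek:", len(via_seek))
print("bytes transferred without seek:", store.bytes_transferred)
```

Both paths return the same record, but the put/get-only path moves the whole object to deliver a few bytes, which is the cost the slide is worried about for large files.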
  • 25. Acknowledgements
    My Team:
      • Pete Clapham
      • James Beal
      • Helen Brimmer
      (ARM & iRODS)
    Calxeda:
      • Karl Freund
    Reference:
      • Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study. Lancet Infect Dis. 2013 February; 13(2): 130–136. doi: 10.1016/S1473-3099(12)70268-2