2014 05-27 - Opinion: Computing for genomics sucks.

Some thoughts on (1) why genomics bioinformaticians need hardware that differs from what traditional HPC providers offer, and (2) why it's challenging to get it.

With input from @bmpvieira, @yeban, @gawbul .

Video: https://www.youtube.com/watch?v=mmMQw2gIozI

  1. Why computing for genomics research sucks. y.wurm@qmul.ac.uk BaltiBio 2014-05-27
  2. Example Genomics Tasks

     | Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task |
     | --- | --- | --- | --- | --- |
     | Build 10,000 trees | 10,000x | low | low | short |
     | Trim FASTQ files | 40-400x | high | low | short |
     | One de novo genome assembly | 1 | high | high | long |
     | Many de novo genome assemblies | 20-1000x | high | high | long |
     | Determine which of 10 new tools that promise X can actually do X (once). “genome hacking” | 1 | depends | depends | depends |
  3. Traditional High Performance Computing (HPC)
     • Physics? Astronomy? Maths? Chemistry?
     • Traditional HPC infrastructures are great at small tasks:

     | Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task |
     | --- | --- | --- | --- | --- |
     | Build 10,000 trees | 10,000x | low | low | short |

     • And/or have mechanisms/tools that transform their challenges into many small tasks.
  4. “We have 9999 cores!” - central IT admin (but they are inadequate)
  5. Big Ass Servers
     • e.g.: 1.5TB RAM; 48 cores - SSH into it and do whatever you want.

     | Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task |
     | --- | --- | --- | --- | --- |
     | Build 10,000 trees | 10,000x | low | low | short |
     | Trim FASTQ files | 40-400x | high | low | short |
     | One de novo genome assembly | 1 | high | high | long |
     | Many de novo genome assemblies | 20-1000x | high | high | long |
     | Determine which of 10 new tools that promise X can actually do X | 1 | depends | depends | depends |

     Jeremy Leipzig
  6. Additional challenges for biologists
     • Datasets continue growing fast!
     • Generally:
       • We lack computational training.
       • Bioinformatics tools suck (badly written, badly tested, hard to install).
  7. So what do we need?
     • access to machines of all shapes and sizes
     • big and small machines
     • direct access via ssh (for hacking & doing things few times)
     • indirect access via queue (for doing things many times)
     • fast I/O - cheap archival.
     • single login: all files “feel” like they’re in one place
  8. Swiss Institute of Bioinformatics: Vital-IT
  9. So what do we need?
     • access to machines of all shapes and sizes
     • big and small machines
     • direct access via ssh (for hacking & doing things few times)
     • indirect access via queue (for doing things many times)
     • fast I/O - cheap archival.
     • single login; all files “feel” like they’re in one place
     • easily changeable software & OS versions
  10. Easily changeable OS & software versions
      https://www.docker.io
      >docker-switch bio-linux7
      # do stuff
      >docker-switch pacbio-assembly-vm
      # do other stuff
      >docker-switch antlab-ubuntu
      # do more stuff
      @bmpvieira
  11. Easily changeable OS & software versions
      https://www.docker.io
      >docker-switch bio-linux7
      # do stuff
      >docker-switch pacbio-assembly-vm
      # do other stuff
      >docker-switch antlab-ubuntu
      # do more stuff
      FAKE
      @bmpvieira
  12. What if Apple/Google made an idiot-proof cloud computing system for genomics?
  13. What if Apple/Google made an idiot-proof cloud computing system for genomics?
      • Always on - single place to connect to: ssh mylab.awskiller.co.uk
      • Dropbox-like shared directories & file checksumming.
      • Easily switchable OS version / “VM”.
      • Automagically & transparently migrates:
        • from small to huge machines (and back) as CPU and RAM demands change.
  14. What if Apple/Google made an idiot-proof cloud computing system for genomics?
      • Always on - single place to connect to: ssh mylab.awskiller.co.uk
      • Dropbox-like shared directories & file checksumming.
      • Easily switchable OS version / “VM”.
      • Automagically & transparently migrates:
        • from small to huge machines (and back) as CPU and RAM demands change.
        • from one physical site (huge dataset) to another
  15. Summary
      • Broad range of needs:
        • some similar to traditional HPC.
        • some very different!
      • Users are naive.
      • Tools are experimental.
      • Datasets are experimental.
      • IT people have difficulty understanding this.
      • Do not trust them when they say things will just work!
      • A lot of potential to make things not suck.
  16. Evolutionary Genetics group & Queen Mary U London
      Bruno Vieira - @bmpvieira
      Steve Moss - @gawbul
      Anurag Priyam - @yeban
      Richard Christie & ITS Research Support team @ Queen Mary U London
      Ioannis Xenarios & Vital-IT team @ Swiss Institute of Bioinformatics
      http://yannick.poulet.org
      y.wurm@qmul.ac.uk
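
Slide 3 says that traditional HPC users have tools that transform their challenges into many small tasks. A minimal sketch of that idea for the "build 10,000 trees" row: split the task indices into batches that a queue could run as independent jobs. The function name `split_into_batches` and the batch size of 250 are illustrative assumptions, not something from the talk.

```python
from typing import List


def split_into_batches(n_tasks: int, batch_size: int) -> List[range]:
    """Split task indices 0..n_tasks-1 into consecutive batches.

    Each batch is a range of task indices that one queue job could process.
    """
    if n_tasks < 0 or batch_size <= 0:
        raise ValueError("n_tasks must be >= 0 and batch_size must be > 0")
    return [
        range(start, min(start + batch_size, n_tasks))
        for start in range(0, n_tasks, batch_size)
    ]


# 10,000 trees comes from the slide; 250 trees per job is an assumed value.
batches = split_into_batches(10_000, 250)
print(f"{len(batches)} queue jobs")
print(f"first job handles trees {batches[0].start}..{batches[0].stop - 1}")
print(f"last job handles trees {batches[-1].start}..{batches[-1].stop - 1}")
```

This only works because each tree is small and independent. A single de novo assembly that is high on I/O and memory cannot be cut up this way, which is the deck's point about needing big machines too.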
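
The `docker-switch` command on slides 10 and 11 is marked FAKE. A rough sketch of how something close to it could be wrapped around the real `docker run` command is below. The helper name `docker_switch_command`, the `/data` mount point and the example project path are assumptions for illustration; `bio-linux7` is the image name used on the slide, and whether an image with that name exists is not something the deck states.

```python
import shlex
from typing import List


def docker_switch_command(image: str, host_dir: str, shell: str = "bash") -> List[str]:
    """Build a `docker run` command that opens a shell in `image`.

    The host directory is mounted at /data (an assumed mount point) so the
    same files are visible whichever image is chosen.
    """
    return [
        "docker", "run",
        "--rm",                      # remove the container on exit
        "-it",                       # interactive terminal
        "-v", f"{host_dir}:/data",   # mount the project directory
        "-w", "/data",               # start in the mounted directory
        image,
        shell,
    ]


# Hypothetical project path; image name taken from the slide.
cmd = docker_switch_command("bio-linux7", "/home/me/project")
print(shlex.join(cmd))
```

To actually start the container, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed and the image is available.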
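
Slides 13 and 14 ask for file checksumming alongside Dropbox-like shared directories. One common way to do this, sketched here with Python's standard `hashlib`, is to hash each file in fixed-size chunks so that large sequencing files never have to fit in memory, then compare digests after a copy or migration. The function names and the 1 MiB chunk size are assumptions, not part of the talk.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True if both files have identical SHA-256 digests."""
    return file_sha256(path_a) == file_sha256(path_b)


# Small demonstration with a made-up FASTQ record and a copy of it.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "reads.fastq")
    copy = os.path.join(tmp, "reads_copy.fastq")
    for path in (original, copy):
        with open(path, "wb") as handle:
            handle.write(b"@read1\nACGT\n+\nIIII\n")
    print("copy intact:", same_content(original, copy))
```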
