Langmead bosc2010 cloud-genomics
Published in: Technology

Transcript

  • 1. Cloud-scale genomics: examples and lessons
    • Ben Langmead
    Department of Biostatistics
  • 2. Cloud debate on 1 slide
    Why?
    • Cost?
    • Elastic supply
    • Not my hardware
    • Our only hope?
    Why not?
    • Cost?
    • Harder to program
    • Less user-friendly
    • Data movement
    • Loosely-coupled only
    • Privacy (e.g. IRB)
    1.6 Gbp/day [1], 5 Gbp/day [1], 25 Gbp/day [2]
    [1] http://www.politigenomics.com/next-generation-sequencing-informatics
    [2] http://www.politigenomics.com/2010/01/hiseq-2000.html
    Conclusion: let's try it but hedge our bets
  • 3. Crossbow
    [Figure: short reads aligned to a reference sequence and stacked by position, with mismatches (N) and gaps (-) marked]
    • Align: parallel by read
    • Aggregate: handled by Hadoop
    • Statistics: parallel by genome bin
    • Example call: HET A, G; p-value: 0.0023
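The three stages on this slide (align by read, aggregate, statistics by genome bin) can be sketched as a small map, sort, and reduce in Python. This is a toy illustration only, not Crossbow's code: the brute-force `align` function stands in for Bowtie, the `call_bin` function stands in for SOAPsnp and uses a simple read-count threshold instead of a statistical test, and `BIN_SIZE`, `crossbow_sketch`, and the threshold of two supporting reads are all hypothetical. The reference string is the first 54 bases shown on the slide.

```python
from collections import Counter, defaultdict
from itertools import groupby

# First 54 bases of the reference sequence shown on the slide.
REFERENCE = "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGG"
BIN_SIZE = 16  # hypothetical genome-bin width


def align(read, max_mismatches=1):
    """Align step (parallel by read): toy brute-force aligner, a stand-in for Bowtie.

    Returns one (bin, position, base) record per aligned base, or [] if no
    placement has at most max_mismatches mismatches.
    """
    for offset in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            return [((offset + i) // BIN_SIZE, offset + i, base)
                    for i, base in enumerate(read)]
    return []


def call_bin(bin_records, min_reads=2):
    """Statistics step (parallel by genome bin): toy caller, a stand-in for SOAPsnp.

    Reports positions where the alleles seen in at least min_reads reads
    differ from the reference base alone. No p-value is computed.
    """
    pile = defaultdict(Counter)
    for _, pos, base in bin_records:
        pile[pos][base] += 1
    calls = []
    for pos in sorted(pile):
        alleles = sorted(b for b, n in pile[pos].items() if n >= min_reads)
        if alleles and alleles != [REFERENCE[pos]]:
            calls.append((pos, REFERENCE[pos], alleles))
    return calls


def crossbow_sketch(reads):
    """Align every read, group records by bin (Hadoop's job), then call per bin."""
    records = [rec for read in reads for rec in align(read)]
    records.sort()  # the aggregate step: sort so each bin's records are adjacent
    calls = []
    for _, group in groupby(records, key=lambda rec: rec[0]):
        calls.extend(call_bin(group))
    return calls


if __name__ == "__main__":
    ref_a, ref_b = REFERENCE[10:26], REFERENCE[14:30]
    # Two reads carry a G at reference position 20 (reference base A).
    alt_a = ref_a[:10] + "G" + ref_a[11:]
    alt_b = ref_b[:6] + "G" + ref_b[7:]
    print(crossbow_sketch([ref_a, ref_b, alt_a, alt_b]))
```

Because `align` looks at one read at a time and `call_bin` looks at one bin at a time, only the sort in between needs a global view of the data, which is the part the slide assigns to Hadoop.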
  • 4. Myrna
    [Figure: reads from Sample A and Sample B aligned to a reference region containing Gene 1]
    • Align: parallel by read
    • Aggregate: handled by Hadoop
    • Overlap: parallel by genome bin
    • Aggregate: handled by Hadoop
    • Normalize: parallel by sample
    • Aggregate: handled by Hadoop
    • Statistics: parallel by gene
    • Example output: Gene 1 differentially expressed? YES; p-value: 0.0012
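The Overlap and Normalize stages named on this slide can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the `GENES` annotation, the function names, and the choice to normalize by the sample's total count are hypothetical and are not taken from Myrna. The Statistics stage, which produces the p-value, is left out.

```python
from collections import Counter

# Hypothetical gene annotation on one reference: name -> (start, end), end exclusive.
GENES = {"gene1": (0, 100), "gene2": (100, 300)}
READ_LENGTH = 35  # read length used in the Myrna experiment on the next slide


def overlap(alignment_start, read_length=READ_LENGTH):
    """Overlap step: list the genes that an aligned read overlaps."""
    end = alignment_start + read_length
    return [name for name, (start, stop) in GENES.items()
            if alignment_start < stop and end > start]


def count_sample(alignment_starts):
    """Aggregate step for one sample: count overlapping reads per gene."""
    counts = Counter()
    for start in alignment_starts:
        counts.update(overlap(start))
    return counts


def normalize(counts):
    """Normalize step (per sample): divide by the sample total.

    Total-count normalization is an assumption chosen for brevity.
    """
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()} if total else {}


if __name__ == "__main__":
    sample_a = normalize(count_sample([0, 10, 50, 150]))
    sample_b = normalize(count_sample([120, 150, 200, 250]))
    print(sample_a, sample_b)
```

Counting depends only on where a read landed, so it can run per genome bin, while normalization needs all counts for one sample, which matches the slide's change from "parallel by genome bin" to "parallel by sample".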
  • 5. Myrna
    Myrna Runtime, Cost for 1.1 billion reads from Pickrell et al study
    Table 1. Timing and cost for a Myrna experiment with 1.1 billion 35 bp unpaired reads from the Pickrell et al study as input. Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the Northern Virginia zone and $0.78 in other zones, plus a $0.12 per-node-per-hour surcharge for Elastic MapReduce in all zones. Times can vary subject to, for example, congestion and Internet traffic conditions.
    EC2 nodes                                   1 master, 10 workers   1 master, 20 workers   1 master, 40 workers
    Worker CPU cores                            80                     160                    320
    Wall clock time                             4h:20m                 2h:32m                 1h:38m
    Cluster setup                               4m                     4m                     3m
    Align                                       2h:56m                 1h:31m                 54m
    Overlap                                     52m                    31m                    16m
    Normalize                                   6m                     7m                     6m
    Statistics                                  9m                     6m                     6m
    Summarize & Postprocess                     13m                    14m                    13m
    Approximate cost (N. Virginia / Elsewhere)  $44.00 / $49.50        $50.40 / $56.70        $65.60 / $73.80
    Data transfer adds about 1hr:15m, $11
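The cost row of Table 1 can be reproduced from the prices in the caption with a few lines of Python. Two assumptions are made here that the slide does not state: each node is billed for whole hours, rounded up, and the master node is billed at the same rate as a worker. Under those assumptions the arithmetic matches all six figures in the table. The function and constant names are hypothetical.

```python
import math

# Prices from the Table 1 caption, in dollars per node per hour.
NODE_RATE = {"N. Virginia": 0.68, "Elsewhere": 0.78}
EMR_SURCHARGE = 0.12


def cluster_cost(workers, wall_clock_minutes, zone, masters=1):
    """Approximate cluster cost in dollars.

    Assumes whole-hour billing (rounded up) and that master and worker
    nodes cost the same; neither assumption is stated on the slide.
    """
    nodes = masters + workers
    billed_hours = math.ceil(wall_clock_minutes / 60)
    return round(nodes * billed_hours * (NODE_RATE[zone] + EMR_SURCHARGE), 2)


if __name__ == "__main__":
    # Wall clock times from Table 1: 4h:20m, 2h:32m, 1h:38m.
    for workers, minutes in [(10, 260), (20, 152), (40, 98)]:
        print(workers,
              cluster_cost(workers, minutes, "N. Virginia"),
              cluster_cost(workers, minutes, "Elsewhere"))
```

For example, the 10-worker run takes 4h:20m, which bills as 5 hours on 11 nodes at $0.80 per node-hour in Northern Virginia: 11 × 5 × 0.80 = $44.00.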
  • 6. Myrna 71% 55%
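The transcript does not say what the two percentages on this slide measure. One reading that is consistent with the wall clock times in Table 1 is the speedup gained each time the number of workers doubles. The sketch below checks that arithmetic; it is an interpretation, not something the slides state, and the names are hypothetical.

```python
# Wall clock times from Table 1, in minutes, keyed by number of workers.
WALL_CLOCK_MINUTES = {10: 260, 20: 152, 40: 98}


def speedup_percent(minutes_before, minutes_after):
    """Percentage speedup when runtime drops from minutes_before to minutes_after."""
    return round((minutes_before / minutes_after - 1) * 100)


if __name__ == "__main__":
    print(speedup_percent(WALL_CLOCK_MINUTES[10], WALL_CLOCK_MINUTES[20]))
    print(speedup_percent(WALL_CLOCK_MINUTES[20], WALL_CLOCK_MINUTES[40]))
```

A perfect doubling would give 100% each time, so under this reading the run gains less from each added batch of workers, which fits the fixed-length stages in Table 1 (cluster setup, Normalize, Summarize & Postprocess) that do not shrink as workers are added.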
  • 7. Bet-hedging architecture
    [Figure: three modes sharing the same stages (bowtie wrapper, soapsnp wrapper, postprocess)]
    • Cloud mode: cloud driver script; stages run on Hadoop
    • Hadoop mode: Hadoop driver script; stages run on Hadoop
    • Single-computer mode: singleton driver script; stages run with Perl, fork, sort
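The idea on this slide, the same stage code behind interchangeable drivers, can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation (the slide names Perl, fork, and sort for single-computer mode). The names `run_stage_locally` and `run_pipeline` are hypothetical, and only the single-computer driver is implemented: it imitates one Hadoop job by mapping, sorting by key, and reducing.

```python
from itertools import groupby
from operator import itemgetter


def run_stage_locally(mapper, reducer, records):
    """Single-computer stand-in for one Hadoop job: map, sort by key, reduce.

    mapper(record) returns a list of (key, value) pairs;
    reducer(key, values) returns a list of output records.
    """
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # plays the role of Hadoop's shuffle and sort
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def run_pipeline(stages, records, mode="single"):
    """Driver: feed records through each (mapper, reducer) stage in order.

    Only mode="single" is implemented in this sketch. A Hadoop or cloud
    driver would submit the same stages to a cluster instead.
    """
    if mode != "single":
        raise NotImplementedError("only single-computer mode is sketched here")
    for mapper, reducer in stages:
        records = run_stage_locally(mapper, reducer, records)
    return records


if __name__ == "__main__":
    # Toy stage: count reads by length.
    def by_length(read):
        return [(len(read), read)]

    def count(length, reads):
        return [(length, len(reads))]

    print(run_pipeline([(by_length, count)], ["ACG", "TTT", "AC"]))
```

Because the stages only see records and key groups, swapping the driver does not require touching them, which is what lets one codebase hedge between a laptop, a local Hadoop cluster, and the cloud.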
  • 8. Acknowledgements
    • Michael Schatz
    • Jimmy Lin
    • Mihai Pop
    • Steven Salzberg
    • Jeff Leek
    • Kasper Hansen
    • Hector Corrada Bravo
    • Rafael Irizarry
  • 9. Crossbow
    • Data transfer adds about 1hr:15m, $28
  • 10. Crossbow 43% 58%