• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Langmead bosc2010 cloud-genomics
 

Langmead bosc2010 cloud-genomics

on

  • 967 views

 

Statistics

Views

Total Views
967
Views on SlideShare
967
Embed Views
0

Actions

Likes
0
Downloads
10
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Langmead bosc2010 cloud-genomics Langmead bosc2010 cloud-genomics Presentation Transcript

    • Cloud-scale genomics: examples and lessons
      • Ben Langmead
      Department of Biostatistics
    • Why?
      • Cost?
      • Elastic supply
      • Not my hardware
      • Our only hope?
      Why not?
      • Cost?
      • Harder to program
      • Less user-friendly
      • Data movement
      • Loosely-coupled only
      • Privacy (e.g. IRB)
      Cloud debate on 1 slide 1.6 Gbp/day 1 5 Gbp/day 1 25 Gbp/day 2 1. http://www.politigenomics.com/next-generation-sequencing-informatics 2. http://www.politigenomics.com/2010/01/hiseq-2000.html Conclusion: let’s try it but hedge our bets
    • Crossbow GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT GTCGCAGTA N CTGTCT ||||||||| |||||| GTCGCAGTA T CTGTCT GGATCT G CGATATACC |||||| ||||||||| GGATCT - CGATATACC AATCTGATCTTATTTT |||||||||||||||| AATCTGATCTTATTTT ATATATATATATATAT |||||||||||||||| ATATATATATATATAT TCTCTCCCA NN AGAGC ||||||||| ||||| TCTCTCCCA GG AGAGC Align Aggregate Reference Call: HET A, G p-value: 0.0023 GTCGCAGTATCTGTCT GTCGCAGTATCTGT NN TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TAT A TCGCAGTATCT T TAT A TCGCAGTATCTG N AT A TCGCAGTAT N TG CCCTAT A TCGCAGTAT A CACCCTATGTCGCA A CACCCTAT C TCGCA A CACCCTATGTCGCA GA - CACCCTATGTCGC CCGGA - CACCCTAT A T CCGGA - CACCCTAT A T GCCGGA - CACCCTATG Statistics Parallel by read Handled by Hadoop Parallel by genome bin
    • Myrna Gene 1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG CCCTATATCGCAGTAT AGCACCCTATGTCGCA AGCACCCTATATCGCA AGCACCCTATGTCGCA GAGCACCCTATGTCGC CCGGAGCACCCTATAT CCGGAGCACCCTATAT GCCGGAGCACCCTATG GTCGCAGTA N CTGTCT ||||||||| |||||| GTCGCAGTA T CTGTCT GGATCT G CGATATACC |||||| ||||||||| GGATCT - CGATATACC AATCTGATCTTATTTT |||||||||||||||| AATCTGATCTTATTTT ATATATATATATATAT |||||||||||||||| ATATATATATATATAT TCTCTCCCA NN AGAGC ||||||||| ||||| TCTCTCCCA GG AGAGC Align Gene 1 differentially expressed?: YES p-value: 0.0012 TGTCGCAGTATCTGTC AGCACCCTATGTCGCA GCCGGAGCACCCTATG GTCGCAGTA N CTGTCT ||||||||| |||||| GTCGCAGTA T CTGTCT GGATCT G CGATATACC |||||| ||||||||| GGATCT - CGATATACC AATCTGATCTTATTTT |||||||||||||||| AATCTGATCTTATTTT ATATATATATATATAT |||||||||||||||| ATATATATATATATAT TCTCTCCCA NN AGAGC ||||||||| ||||| TCTCTCCCA GG AGAGC Sample A Sample B Align Aggregate Aggregate Overlap Aggregate Normalize Aggregate Normalize Aggregate Statistics Parallel by read Handled by Hadoop Parallel by genome bin Handled by Hadoop Parallel by sample Handled by Hadoop Parallel by gene
    • Myrna Table 1 . Timing and cost for a Myrna experiment with 1.1 billion 35 bp unpaired reads from the Pickrell et al study as input. Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the Northern Virginia zone and $0.78 in other zones, plus a $0.12 per-node-per-hour surcharge for Elastic MapReduce in all zones. Times can vary subject to, for example, congestion and Internet traffic conditions. Data transfer adds about 1hr:15m, $11 Myrna Runtime, Cost for 1.1 billion reads from Pickrell et al study EC2 Nodes 1 master, 10 workers 1 master, 20 workers 1 master, 40 workers Worker CPU cores 80 160 320 Wall clock time 4h:20m 2h:32m 1h:38m Cluster setup 4m 4m 3m Align 2h:56m 1h:31m 54m Overlap 52m 31m 16m Normalize 6m 7m 6m Statistics 9m 6m 6m Summarize & Postprocess 13m 14m 13m Approximate cost (N. Virginia / Elsewhere) $44.00 / $49.50 $50.40 / $56.70 $65.60 / $73.80
    • Myrna 71% 55%
    • Bet-hedging architecture Cloud driver script Wrapper bowtie Wrapper soapsnp Postprocess Hadoop Wrapper bowtie Wrapper soapsnp Postprocess Hadoop Singleton driver script Wrapper bowtie Wrapper soapsnp Postprocess Perl, fork, sort Hadoop driver script Cloud mode Hadoop mode Single-computer mode
    • Acknowledgements
      • Michael Schatz
      • Jimmy Lin
      • Mihai Pop
      • Steven Salzberg
      • Jeff Leek
      • Kasper Hansen
      • Hector Corrada Bravo
      • Rafael Irizarry
    • Crossbow Data transfer adds about 1hr:15m, $28
    • Crossbow 43% 58%