Files, Tools, and Bioinformatics in the Cloud


   Thomas Keane

   Vertebrate Resequencing Informatics
   WTSI
   thomas.keane@sanger.ac.uk




Vertebrate Resequencing Informatics   17th November, 2009
DATA is the problem!

 NGS means large volumes of raw data
     Previously SRF (~8-10 bytes per bp), now BAM (~1.6 bytes per bp)
 How much data can a sequencing machine produce?
     20 Gbp per lane, 16 lanes per run (1 run = 1.5 weeks) => 11 Tbp/year
     Small sequencing center: 4 machines?
     44 Tbp per year!
 Raw data in BAM: 70 Tbytes
 Processed calls much smaller
     1000G pilot VCF < 1 Gbyte
 [Figure: pipeline stages "Alignment + BAM improvement" and "SV Calling: SVMerge"]
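
The throughput figures on this slide can be reproduced with a short back-of-the-envelope calculation. The sketch below uses only the numbers quoted on the slide; the 52-week operating year and the function names are assumptions made for illustration, so the results land slightly above the rounded slide values.

```python
# Back-of-the-envelope check of the slide's data-volume figures.
GBP_PER_LANE = 20        # Gbp per lane (from the slide)
LANES_PER_RUN = 16       # lanes per run (from the slide)
WEEKS_PER_RUN = 1.5      # duration of one run (from the slide)
WEEKS_PER_YEAR = 52      # assumption: machines run back to back all year
MACHINES = 4             # "small sequencing center" (from the slide)
BAM_BYTES_PER_BP = 1.6   # approximate BAM footprint (from the slide)


def yearly_output_tbp(machines=1):
    """Sequenced bases per year, in Tbp, for a given number of machines."""
    runs_per_year = WEEKS_PER_YEAR / WEEKS_PER_RUN
    gbp_per_run = GBP_PER_LANE * LANES_PER_RUN
    return machines * runs_per_year * gbp_per_run / 1000  # Gbp -> Tbp


def bam_size_tbytes(tbp):
    """Approximate BAM storage in Tbytes for a given number of Tbp."""
    return tbp * BAM_BYTES_PER_BP


per_machine_tbp = yearly_output_tbp(1)
center_tbp = yearly_output_tbp(MACHINES)
center_bam_tbytes = bam_size_tbytes(center_tbp)

print(f"One machine:  {per_machine_tbp:.1f} Tbp/year")
print(f"Four machines: {center_tbp:.1f} Tbp/year")
print(f"Raw BAM:      {center_bam_tbytes:.0f} Tbytes/year")
```

The slide rounds the center total down to 44 Tbp before converting, which gives the quoted 70 Tbytes.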




Simplistic Model: Cloud as compute resource
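
 [Diagram: data flow between the Sequencing Center/Institute and the cloud]
     Sequencing Center/Institute to cloud: SRF/Fastq/BAM (2 Mbps)
     Processes in the cloud
         1. Align
         2. Variant calling (n x SNP callers, n indel callers, SV callers)
     Cloud to Sequencing Center/Institute: BAM + VCF (2 Mbps)
     Stored in the cloud: BAM, VCF
 3,240 days to upload!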
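
The upload estimate follows from the 70 Tbytes of raw BAM on the previous slide and the 2 Mbps link shown here. The sketch below assumes decimal units (1 Tbyte = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully saturated link with no protocol overhead; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_bits_per_sec):
    """Days needed to move size_bytes over a link running at full rate."""
    seconds = size_bytes * 8 / link_bits_per_sec  # bytes -> bits, then bits / rate
    return seconds / SECONDS_PER_DAY


RAW_BAM_BYTES = 70e12   # 70 Tbytes of raw BAM per year (from the slides)
LINK_BPS = 2e6          # 2 Mbps link (from the slide)

upload_days = transfer_days(RAW_BAM_BYTES, LINK_BPS)
print(f"Upload time: about {upload_days:,.0f} days ({upload_days / 365:.1f} years)")
```

Under these assumptions the upload takes roughly nine years, which is longer than the year it took to generate the data.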


Move the raw data generation to the compute

 [Diagram labels]
     Compute: variant calling (n x SNP callers, n indel callers, SV callers)
     Stored with the compute: BAM, VCF
     Returned to Sequencing Center/Institute: VCF
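
For contrast, a rough estimate of what it costs to return only the calls. This sketch assumes the returned VCF is about the size quoted earlier for the 1000G pilot (under 1 Gbyte, taken here as 10^9 bytes) and that the link to the institute is the same 2 Mbps; both are illustrative assumptions rather than figures from this slide.

```python
def transfer_hours(size_bytes, link_bits_per_sec):
    """Hours needed to move size_bytes over a link running at full rate."""
    return size_bytes * 8 / link_bits_per_sec / 3600


VCF_BYTES = 1e9   # assumption: upper bound from "1000G pilot VCF < 1 Gbyte"
LINK_BPS = 2e6    # assumption: same 2 Mbps link as in the simplistic model

vcf_hours = transfer_hours(VCF_BYTES, LINK_BPS)
print(f"VCF download: about {vcf_hours:.1f} hours")
```

Moving the processed calls takes on the order of an hour, so the slow link stops being the bottleneck once the raw data never has to cross it.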

Large Collaborative Projects: Cloud centric model

 [Diagram: VCF delivered from the cloud to the analysis groups]
