Large Scale Resequencing: Approaches and Challenges


    Thomas Keane
    Vertebrate Resequencing Informatics group
    Wellcome Trust Sanger Institute
    Hinxton, Cambridge, UK

    thomas.keane@sanger.ac.uk



AGBT Tutorial Workshop   15th February, 2012
Sanger total sequence (2007-2009)
 [Chart: total sequence output, in Gbp]




Sanger total sequence to date
 [Chart: total sequence output, in Gbp]




Vertebrate Resequencing Informatics Group

     Established in 2008 with Jim Stalker
         PIs: Richard Durbin and David Adams
     Initial projects
         1000 Genomes project (http://www.1000genomes.org)
               Data processing, releases, aligner evaluation, sequencing
               Pilot 2008-2009: ~5Tbp (Nature 2010;467)
               Phase 1 2009-2011: ~30Tbp
               Phase 2 2011-: ~36.9Tbp (LowCov ilmn only)
         Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
               Sequencing 17 laboratory mouse strains
               SNPs, indels, SVs, de novo assembly
               ~1.2Tbp (Nature 2011;477)


UK10K

 Investigating the role of rare genetic variants in health and disease
 Whole genome cohorts: 4,000 individuals across two well-established and deeply
 phenotyped UK cohorts with ongoing longitudinal phenotype collection:
     TWINSUK – 2,000
     ALSPAC – 2,000
     6x (18Gbp) per sample

 Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
    Neurodevelopmental diseases – 3,000
        e.g. schizophrenia, autism spectrum disorders
    Obesity – 2,000
        e.g. severe childhood onset obesity
    Rare diseases – 1,000
        e.g. severe insulin resistance, congenital heart disease, ciliopathies
    5Gbp per sample

 Expect to generate ~100Tbp by end 2012
    ~40Tbp from BGI
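
The ~100Tbp projection can be checked against the per-sample figures on this slide. The sketch below is only that arithmetic; the constant and function names are illustrative and not part of any UK10K tooling.

```python
# Cohort sizes and per-sample yields are taken from the slide.
WGS_SAMPLES = 2_000 + 2_000              # TWINSUK + ALSPAC
WGS_GBP_PER_SAMPLE = 18                  # 6x coverage
EXOME_SAMPLES = 3_000 + 2_000 + 1_000    # neurodevelopmental + obesity + rare
EXOME_GBP_PER_SAMPLE = 5


def uk10k_total_tbp():
    """Return the projected UK10K sequence total in Tbp."""
    wgs_gbp = WGS_SAMPLES * WGS_GBP_PER_SAMPLE
    exome_gbp = EXOME_SAMPLES * EXOME_GBP_PER_SAMPLE
    return (wgs_gbp + exome_gbp) / 1000


if __name__ == "__main__":
    print(f"Projected UK10K total: {uk10k_total_tbp():.0f} Tbp")
```

Whole genomes account for 72Tbp and exomes for 30Tbp, which is consistent with the ~100Tbp quoted.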


Current Status




 Recently passed the 1000 Genomes Project in terms of total Gbp

What are the challenges?



 [Figure: four challenges surrounding NGS]
     Storage
     Software/Workflows
     Compute
     Power


Data Production Workflow


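 [Figure: per-sample data production workflow, read bottom to top]
     Fastq (one per lane)
       -> Alignment (bwa, smalt etc) -> lane BAMs
       -> Improvement -> improved lane BAMs
       -> Library merge -> library BAMs
       -> Sample merge -> one BAM per sample/platform (e.g. NA34842, NA87465)
     Stage labels in the figure: Import + Improvement, Library Freeze,
       Sample/Platform Merge Up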
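
The lane -> library -> sample merge order in this workflow can be sketched as a simple grouping step. This is a minimal illustration only: the function name, the library names and the lane file names are hypothetical, and the actual merging would be done by a BAM tool rather than by this code.

```python
from collections import defaultdict


def plan_merges(lanes):
    """Group lane-level BAMs by library, then by sample.

    lanes: iterable of (sample, library, lane_bam) tuples.
    Returns {sample: {library: [lane_bam, ...]}} with lane BAMs sorted.
    """
    by_library = defaultdict(list)
    for sample, library, lane_bam in lanes:
        by_library[(sample, library)].append(lane_bam)

    by_sample = defaultdict(dict)
    for (sample, library), bams in by_library.items():
        by_sample[sample][library] = sorted(bams)
    return dict(by_sample)


if __name__ == "__main__":
    # Hypothetical lanes for the two sample names shown in the figure.
    lanes = [
        ("NA34842", "lib1", "5363_2.bam"),
        ("NA34842", "lib1", "5363_1.bam"),
        ("NA34842", "lib2", "5364_1.bam"),
        ("NA87465", "lib3", "5365_1.bam"),
    ]
    for sample, libraries in plan_merges(lanes).items():
        print(sample, libraries)
```

Each inner list is one library-level merge, and each outer entry is one sample-level merge of its library BAMs.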



Data Production Workflow

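 [Figure: cross-sample merging and variant calling]
     Merge across samples
       Per-sample BAMs (NA19294, NA18943, NA19305, ... NA19309) are split by
       chromosome (Chr1, Chr2, Chr3, ...) and merged into cross-sample BAMs,
       one read group per sample (RG:NA19294, RG:NA18943, RG:NA19305)
     Variant calling
       SNPs/indels: samtools, GATK -> VQSR -> BEAGLE/Impute2
       SVs: SVMerge, Genome STRiP
     VEP annotation
     Final VCF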

Storage Challenges

 Expect ~200Tbp of sequence in 2011-2012
   Working storage estimate, including processing, release, and variant calling:
     10 bytes per bp

 Storage considerations
   Scalability – can we easily add more storage units?
   Backup and disaster recovery – what do we really need to keep?
   Performance – sufficient I/O throughput to serve compute nodes
   Cost

 Data Formats
   Standardised formats – BAM & VCF 

 Minimise the number of copies
   Aim for two copies at most – original lanes + release (stripped) BAM
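
To put the working estimate in concrete terms, the sketch below multiplies the expected sequence volume by the per-base storage figure from this slide. The function name is illustrative, and decimal units (1PB = 10^15 bytes) are an assumption.

```python
def storage_petabytes(sequence_tbp, bytes_per_bp=10):
    """Estimate disk needed, in decimal petabytes, for sequence_tbp of data.

    bytes_per_bp defaults to the slide's working estimate, which covers
    processing, release and variant calling.
    """
    total_bytes = sequence_tbp * 1e12 * bytes_per_bp
    return total_bytes / 1e15


if __name__ == "__main__":
    print(f"{storage_petabytes(200):.1f} PB for ~200Tbp of sequence")
```

At 10 bytes per bp, ~200Tbp of sequence corresponds to roughly 2PB of storage, which is why minimising the number of copies matters.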

A Tiered Storage Solution


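 [Figure: three storage tiers, top to bottom, linked to the CPU farm]
     Tier      Cost rank   Size rank   Link
     Top       2           1           3Gb/sec to CPU farm
     Middle    1           3           800Mb/sec to CPU farm
     Bottom    2           2           off-site copies
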
       Level 1
           Data: Current release vertical BAMs
           Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
       Level 2
           Data: Lane level BAMs
           Processes: Alignment, recalibration, local realignment
       Level 3
           Data: Previous release BAMs + variant calls backup

Data release + archiving: iRODS

 Integrated Rule-Oriented Data System (iRODS)
     Open source – origins in particle physics world
     Most important feature of iRODS is the Rule Engine
     Akin to source control system
 Customise own application level metadata
     e.g. run, lane, plex, sample, library….
 Stores/searches key-value metadata on files:
     List all files from UK10K studies:
         imeta -z seq qu -d study like 'UK10K_%'
             /seq/5363/5363_1.bam
             /seq/5363/5363_2.bam (.....and a whole lot more)
     Get metadata about a file:
         imeta ls -d /seq/6534/6534_3#7.bam sample
             attribute: sample
             value: QTL191953
 [Figure: iRODS servers nfs01, nfs02, nfs03, nfs20 with an off-site copy]

 Sanger production: BAM files from runs per lane per plex deposited
      BMC Bioinformatics 2011, 12:361

 Recently adopted for UK10K internal data release and archiving
      Users use meta-data queries to find their data
      Files can be part of multiple releases
                                                                              http://www.irods.org
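
The two `imeta` calls on this slide can be wrapped so that scripts build them consistently. The sketch below only assembles the argument lists exactly as they appear on the slide; the helper names are hypothetical, and actually running the commands (for example with `subprocess.run`) assumes the iRODS icommands are installed and configured.

```python
import shlex


def imeta_query_args(zone, attribute, operator, value):
    """Arguments for a data-object metadata query, as on the slide:
    imeta -z <zone> qu -d <attribute> <operator> <value>
    """
    return ["imeta", "-z", zone, "qu", "-d", attribute, operator, value]


def imeta_ls_args(path, attribute):
    """Arguments for listing one metadata attribute of a data object:
    imeta ls -d <path> <attribute>
    """
    return ["imeta", "ls", "-d", path, attribute]


if __name__ == "__main__":
    # Print shell-quoted command lines rather than executing them.
    print(shlex.join(imeta_query_args("seq", "study", "like", "UK10K_%")))
    print(shlex.join(imeta_ls_args("/seq/6534/6534_3#7.bam", "sample")))
```

Passing a list of arguments instead of a single string avoids shell-quoting problems with characters such as `%` and `#` in values and paths.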

Compute Pipeline Management: VRPipe

 VRPipe
   Managed and automated execution of sequences of arbitrary
     software against massive datasets across large compute clusters
   Error handling, optimal memory requests, batching of jobs, retrying
     failures, failure reporting, highly extendable, detailed job statistics
 1000 Genomes Phase 2 processed through VRPipe
   Tracked ~1 million jobs
   Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
   bwa_aln_fastq: ~2443 days total serial wall time
   Mean memory: 941MB/job (max 5637MB)
 2012 plans (contact: sb10@sanger.ac.uk)
   Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV
     variant calling, and RNA-seq/ChIP-Seq pipelines)
   Management front-ends
   Create distributable VM for cloud rollout
 http://www.github.com/VertebrateResequencing/vr-pipe/wiki
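
The job statistics quoted for 1000 Genomes Phase 2 can be turned into per-job figures. The sketch below uses only the numbers on this slide; the names are illustrative, and the job count is the slide's rounded "~1 million", so the results are approximate.

```python
from datetime import timedelta

# Figures from the slide.
TOTAL_WALL = timedelta(days=9886, hours=3, minutes=43, seconds=25)
BWA_ALN_WALL = timedelta(days=2443)
JOBS = 1_000_000  # "~1 million jobs", so derived values are approximate


def mean_job_minutes(total=TOTAL_WALL, jobs=JOBS):
    """Mean serial wall time per job, in minutes."""
    return total.total_seconds() / jobs / 60


def stage_share(stage=BWA_ALN_WALL, total=TOTAL_WALL):
    """Fraction of total serial wall time spent in one pipeline stage."""
    return stage / total


if __name__ == "__main__":
    print(f"Mean job length: {mean_job_minutes():.1f} minutes")
    print(f"bwa_aln_fastq share: {stage_share():.1%}")
```

On these figures the average job runs for about a quarter of an hour, and the bwa_aln_fastq step accounts for roughly a quarter of all serial wall time.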

Even more scale up in 2012 – HiSeq 2500

 Currently takes 1-2 weeks to sequence a human genome
   High depth human genomes in a single day – Illumina HiSeq
     2500
   Caucasian family with a severe T-cell deficiency in affected
     sibling
   Single run on HiSeq 2500 by Illumina per individual

              Sample     PF yield (Gbp)   % Align   % ≥Q30   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
              Father     117.7            89        92.6     0.4               0.5               25.5
              Mother     125.7            90.2      92.8     0.4               0.5               25.5
              Affected   124.4            90.3      92.4     0.4               0.5               25.5
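
The yields in this table can be converted to approximate raw depth. The sketch below assumes a 3Gbp genome, the size implied by the UK10K figure of 6x = 18Gbp; the names are illustrative, and the result is depth from PF yield before alignment or duplicate filtering.

```python
GENOME_GBP = 3.0  # assumption: implied by "6x (18Gbp) per sample"

# PF yield in Gbp, from the table.
PF_YIELD_GBP = {"Father": 117.7, "Mother": 125.7, "Affected": 124.4}


def raw_depth(yield_gbp, genome_gbp=GENOME_GBP):
    """Approximate raw fold coverage from pass-filter yield."""
    return yield_gbp / genome_gbp


if __name__ == "__main__":
    for sample, gbp in PF_YIELD_GBP.items():
        print(f"{sample}: ~{raw_depth(gbp):.0f}x raw depth")
```

Each single run therefore gives on the order of 40x raw depth per individual in about a day.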




What does the data look like?




Upcoming Changes in 2012

 We cannot keep all of the data
   2007-2008: Keep everything including images from runs
   2009: BAM/Fastq – all of the base quality information
   2010-2011: Stripping original qualities and other unused tags
   2012-: Current formats contain lots of repetition
       Reference based compression
       Reducing quality information e.g. quality binning or quality
       budgets
       Potential formats: CRAM and/or Reduced BAM
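
Quality binning, one of the options listed above, replaces each base quality with a representative value for its range so that the quality string compresses better. The sketch below is a minimal illustration: the bin boundaries and representative values are hypothetical and are not those of CRAM, Reduced BAM or any sequencing vendor.

```python
# Hypothetical bins: (lowest quality, highest quality, representative value).
BINS = [(0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 93, 37)]


def bin_quality(q):
    """Map one Phred quality score to its bin's representative value."""
    for low, high, representative in BINS:
        if low <= q <= high:
            return representative
    raise ValueError(f"quality {q} is outside the supported range")


def bin_quality_string(qualities, offset=33):
    """Bin a Phred+33 encoded quality string, character by character."""
    return "".join(
        chr(bin_quality(ord(c) - offset) + offset) for c in qualities
    )


if __name__ == "__main__":
    print(bin_quality_string("IIIHG5+5II"))
```

With only four distinct values left, runs of identical characters become much more common, at the cost of losing the exact original qualities.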




CRAM Format
 [Figure: CRAM models for compression]
     Example read TGAGCTCTAAGTACC with per-base qualities 329183050298757
     Do nothing: qualities kept as they are
     Lossless
     Quality lossy, horizontal: 002020010022212
     Quality lossy, vertical:   -2---30---9---7

 [Figure: CRAM current performance, log-scale axis with ticks 100, 10, 1, 0.1]
     Untreated
     CRAM lossless
     CRAM combination model
     CRAM substitutions/insertions model

 CRAM v0.6 released 13.2.12:
     •  Pairing information preservation regardless of distance
     •  Revised and improved lossless mode
     •  Option to preserve all unmapped reads
     •  Performance and bug fixes
     •  Arbitrary tags

                                  http://www.ebi.ac.uk/ena/about/cram_toolkit
                                                                                         Source: Ewan Birney/Guy Cochrane, EBI
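
The idea behind reference-based compression is that an aligned read is stored as a position on the reference plus only the bases that differ. The sketch below illustrates the principle for substitutions only; it is not the CRAM encoding, the function names are hypothetical, and the reference string is made up around the read shown on the slide.

```python
def encode_read(reference, read, pos):
    """Encode an aligned read as (pos, length, [(offset, base), ...]).

    Only substitutions are handled; indels and clipping are out of scope.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    diffs = [
        (i, base)
        for i, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the read sequence from the reference and an encoded record."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "AATGAGCTCTAAGTACCGG"   # made-up reference
    read = "TGAGCTCTTAGTACC"            # slide read with one substitution
    record = encode_read(reference, read, 2)
    print(record)
    print(decode_read(reference, record) == read)
```

A read that matches the reference exactly needs no per-base storage at all, which is where most of the saving comes from; the qualities then dominate the remaining size.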

Any questions?




 Richard Durbin
 David Adams

 URLs
  •  VRPipe: https://github.com/VertebrateResequencing/vr-pipe
  •  iRODS@Sanger: BMC Bioinformatics 2011, 12:361
  •  http://www.slideshare.net/thomaskeane

