Large Scale Resequencing: Approaches and Challenges

Published in: Technology, Health & Medicine

  1. Large Scale Resequencing: Approaches and Challenges
     Thomas Keane, Vertebrate Resequencing Informatics group, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
     thomas.keane@sanger.ac.uk
     AGBT Tutorial Workshop, 15th February, 2012
  2. Sanger total sequence, 2007-2009 (chart, in Gbp)
  3. Sanger total sequence to date (chart, in Gbp)
  4. Vertebrate Resequencing Informatics Group
     • Established in 2008 with Jim Stalker
     • PIs: Richard Durbin and David Adams
     • Initial projects
       • 1000 Genomes project (http://www.1000genomes.org)
         • Data processing, releases, aligner evaluation, sequencing
         • Pilot 2008-2009: ~5Tbp (Nature 2010;467)
         • Phase 1 2009-2011: ~30Tbp
         • Phase 2 2011-: ~36.9Tbp (low-coverage Illumina only)
       • Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
         • Sequencing 17 laboratory mouse strains
         • SNPs, indels, SVs, de novo assembly
         • ~1.2Tbp (Nature 2011;477)
  5. UK10K: investigating the role of rare genetic variants in health and disease
     • Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection
       • TWINSUK – 2,000
       • ALSPAC – 2,000
       • 6x (18Gbp) per sample
     • Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
       • Neurodevelopmental diseases – 3,000 (e.g. schizophrenia, autism spectrum disorders)
       • Obesity – 2,000 (e.g. severe childhood onset obesity)
       • Rare diseases – 1,000 (e.g. severe insulin resistance, congenital heart disease, ciliopathies)
       • 5Gbp per sample
     • Expect to generate ~100Tbp by end 2012
       • ~40Tbp from BGI
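The ~100Tbp expectation on the UK10K slide can be cross-checked from the per-sample figures it quotes. The following is a back-of-the-envelope sketch only; it assumes every sample reaches exactly its target yield, and the variable names are illustrative.

```python
# Planned UK10K yield, using only the figures quoted on the slide.
GBP_PER_TBP = 1_000  # 1 Tbp = 1,000 Gbp

whole_genome_samples = 2_000 + 2_000    # TWINSUK + ALSPAC
exome_samples = 3_000 + 2_000 + 1_000   # neurodevelopmental + obesity + rare diseases

whole_genome_tbp = whole_genome_samples * 18 / GBP_PER_TBP  # 18 Gbp per sample (6x)
exome_tbp = exome_samples * 5 / GBP_PER_TBP                 # 5 Gbp per sample
total_tbp = whole_genome_tbp + exome_tbp

print(f"whole genomes: {whole_genome_tbp:.0f} Tbp")
print(f"exomes:        {exome_tbp:.0f} Tbp")
print(f"total:         {total_tbp:.0f} Tbp")
```

Under these assumptions the whole genomes contribute 72 Tbp and the exomes 30 Tbp, which is consistent with the ~100Tbp quoted.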
  6. Current Status
     • Recently passed 1000 genomes in terms of total Gbp
  7. What are the challenges?
     (diagram labels: NGS, Storage, Software/Workflows, Compute Power)
  8. Data Production Workflow
     (diagram, shown for samples NA34842 and NA87465; stages listed in processing order)
     • Import: Fastq files
     • Alignment (bwa, smalt etc): BAM files
     • Improvement: BAM files
     • Library merge: BAM files
     • Sample/Platform merge: BAM files
     (other labels in the diagram: Merge Up, LibraryFreeze)
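The merge hierarchy in the workflow slide (lane-level BAMs merged per library, then per sample) can be sketched as a grouping step. This is a minimal illustration, not the group's pipeline code: the `LaneBam` type, the `plan_merges` helper, and the file and library names are all hypothetical; only the sample names come from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class LaneBam(NamedTuple):
    """One aligned, improved lane-level BAM (hypothetical record type)."""
    path: str
    sample: str
    library: str


def plan_merges(lanes):
    """Group lane BAM paths by sample, then by library.

    Each inner list is one library merge; each outer key is one sample merge.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for lane in lanes:
        plan[lane.sample][lane.library].append(lane.path)
    return {sample: dict(libraries) for sample, libraries in plan.items()}


# Hypothetical lane files and library names for the two samples on the slide.
lanes = [
    LaneBam("lane1.bam", "NA34842", "libA"),
    LaneBam("lane2.bam", "NA34842", "libA"),
    LaneBam("lane3.bam", "NA34842", "libB"),
    LaneBam("lane4.bam", "NA87465", "libC"),
]

for sample, libraries in plan_merges(lanes).items():
    for library, paths in libraries.items():
        print(sample, library, paths)
```

Grouping first and merging afterwards keeps the plan inspectable before any large files are touched.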
  9. Data Production Workflow, continued
     (diagram; stages listed in processing order)
     • Merge across samples: per-sample BAMs (NA19294, NA18943, NA19305, NA19309, …) become cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, …), with read groups identifying each sample (RG:NA19294, RG:NA18943, RG:NA19305)
     • Variant calling
       • SNPs/indels: samtools, GATK
       • SVs: SVMerge, Genome STRiP
     • VQSR
     • BEAGLE/Impute2
     • Annotation: VEP
     • Final VCF
  10. Storage Challenges
      • Expect ~200Tbp of sequence in 2011-2012
        • Working estimate including processing, release, and variant calling: 10 bytes per bp
      • Storage considerations
        • Scalability – can we easily add more storage units?
        • Backup and disaster recovery – what do we really need to keep?
        • Performance – sufficient I/O throughput to serve compute nodes
        • Cost
      • Data formats
        • Standardised formats – BAM & VCF
        • Minimise the number of copies
        • Aim for two copies at most – original lanes + release (stripped) BAM
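The two figures on the storage slide combine into a rough capacity estimate. The sketch below simply multiplies them; it assumes decimal units (1 PB = 10^15 bytes), and the `storage_bytes` helper is illustrative.

```python
TBP = 10**12  # base pairs in one Tbp
PB = 10**15   # bytes in one petabyte (decimal units, an assumption)


def storage_bytes(sequence_bp, bytes_per_bp=10.0):
    """Estimated bytes on disk for a given amount of sequence.

    The 10 bytes/bp default is the slide's working estimate, covering
    processing, release, and variant calling.
    """
    return sequence_bp * bytes_per_bp


expected = storage_bytes(200 * TBP)
print(f"{expected / PB:.1f} PB for ~200 Tbp of sequence")
```

So ~200Tbp at 10 bytes per bp corresponds to roughly 2 PB of storage.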
  11. A Tiered Storage Solution
      (diagram: storage levels 1, 2 and 3 arranged by cost and size around the CPU farm; link labels 3Gb/sec and 800Mb/sec; off-site copies)
      • Level 1
        • Data: Current release vertical BAMs
        • Processes: BAM merging + splitting, variant calling (SNPs, indels, SVs)
      • Level 2
        • Data: Lane level BAMs
        • Processes: Alignment, recalibration, local realignment
      • Level 3
        • Data: Previous release BAMs + variant calls backup
  12. Data release + archiving: iRODS
      (diagram: servers nfs01, nfs02, nfs03, nfs20 and an off-site copy)
      • Rule-Oriented Data management systems: iRODS
        • Open source – origins in particle physics world
        • Most important feature of iRODS is the Rule Engine
        • Akin to source control system
      • Customise own application level metadata
        • e.g. run, lane, plex, sample, library…
      • Stores/searches key-value metadata on files
        • List all files from UK10K studies:
            imeta -z seq qu -d study like 'UK10K_%'
            /seq/5363/5363_1.bam
            /seq/5363/5363_2.bam
            (.....and a whole lot more)
        • Get metadata about a file:
            imeta ls -d /seq/6534/6534_3#7.bam sample
            attribute: sample
            value: QTL191953
      • Sanger production: BAM files from runs per lane per plex deposited
        • BMC Bioinformatics 2011, 12:361
      • Recently adopted for UK10K internal data release and archiving
        • Users use meta-data queries to find their data
        • Files can be part of multiple releases
      • http://www.irods.org
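The key-value metadata search on the iRODS slide can be illustrated with a toy in-memory catalogue. This is not the iRODS API: the `catalogue` dictionary, the `query` and `like_to_regex` helpers, the study names, and the `/seq/0000/0000_1.bam` path are all hypothetical. It assumes SQL-style `like` wildcards, where `%` matches any run of characters and `_` matches exactly one.

```python
import re


def like_to_regex(pattern):
    """Translate an SQL-style LIKE pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def query(catalogue, attribute, pattern):
    """Return sorted paths whose metadata value for `attribute` matches."""
    regex = like_to_regex(pattern)
    return sorted(
        path
        for path, metadata in catalogue.items()
        if regex.fullmatch(metadata.get(attribute, ""))
    )


# Toy catalogue: file path -> key-value metadata (all values hypothetical).
catalogue = {
    "/seq/5363/5363_1.bam": {"study": "UK10K_EXAMPLE_A", "sample": "S1"},
    "/seq/5363/5363_2.bam": {"study": "UK10K_EXAMPLE_B", "sample": "S2"},
    "/seq/0000/0000_1.bam": {"study": "OTHER_STUDY", "sample": "S3"},
}

for path in query(catalogue, "study", "UK10K_%"):
    print(path)
```

Because metadata lives beside the file rather than in its path, the same file can satisfy queries for several releases without being copied.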
  13. Compute Pipeline Management: VRPipe
      • VRPipe
        • Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
        • Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics
      • 1000 Genomes Phase 2 processed through VRPipe
        • Tracked ~1 million jobs
        • Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
        • bwa_aln_fastq: ~2443 days total serial wall time
        • Mean memory: 941MB/job (max 5637MB)
      • 2012
        • Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
        • Management front-ends
        • Create distributable VM for cloud rollout
      • Contact: sb10@sanger.ac.uk
      • http://www.github.com/VertebrateResequencing/vr-pipe/wiki
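The job statistics on the VRPipe slide are easier to interpret per job. The sketch below derives a mean wall time and the alignment step's share of the total; it treats the slide's "~1 million" and "~2443 days" as exact values, so the results are approximate.

```python
from datetime import timedelta

total_wall_time = timedelta(days=9886, hours=3, minutes=43, seconds=25)
bwa_wall_time = timedelta(days=2443)  # "~2443 days" on the slide
job_count = 1_000_000                 # "~1 million jobs" on the slide

mean_seconds = total_wall_time.total_seconds() / job_count
bwa_share = bwa_wall_time / total_wall_time  # timedelta / timedelta is a float

print(f"mean wall time per job: {mean_seconds / 60:.1f} minutes")
print(f"bwa_aln_fastq share of wall time: {bwa_share:.1%}")
print(f"total serial wall time: {total_wall_time.days / 365.25:.1f} years")
```

On these figures a typical job ran for roughly 14 minutes, and the bwa_aln_fastq step alone accounted for about a quarter of all serial wall time.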
  14. Even more scale up in 2012 – HiSeq 2500
      • Currently takes 1-2 weeks to sequence a human genome
      • High depth human genomes in a single day – Illumina HiSeq 2500
      • Caucasian family with a severe T-cell deficiency in affected sibling
      • Single run on HiSeq 2500 by Illumina per individual

      Sample     PF Yield (Gbp)   % ≥Q30   % Align   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
      Father     117.7            89       92.6      0.4               0.5               25.5
      Mother     125.7            90.2     92.8      0.4               0.5               25.5
      Affected   124.4            90.3     92.4      0.4               0.5               25.5
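The yields in the HiSeq 2500 table translate into approximate depth of coverage. This is a rough sketch: the 3 Gbp genome size is an assumption taken from the UK10K slide's "6x (18Gbp)" figure, the `depth` helper is illustrative, and scaling by % aligned ignores duplicates and uneven coverage.

```python
GENOME_SIZE_GBP = 3.0  # assumption: implied by "6x (18Gbp)" on the UK10K slide

# Sample -> (PF yield in Gbp, % aligned), copied from the table above.
runs = {
    "Father": (117.7, 92.6),
    "Mother": (125.7, 92.8),
    "Affected": (124.4, 92.4),
}


def depth(yield_gbp, percent_aligned=100.0, genome_gbp=GENOME_SIZE_GBP):
    """Mean depth of coverage: (aligned) bases divided by genome size."""
    return yield_gbp * (percent_aligned / 100.0) / genome_gbp


for sample, (yield_gbp, percent_aligned) in runs.items():
    raw = depth(yield_gbp)
    aligned = depth(yield_gbp, percent_aligned)
    print(f"{sample:<9} raw {raw:.1f}x, aligned {aligned:.1f}x")
```

Under that assumption each single-day run gives roughly 36x to 39x aligned coverage per individual.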
  15. What does the data look like?
  16. Upcoming Changes in 2012
      • We cannot keep all of the data
        • 2007-2008: Keep everything including images from runs
        • 2009: BAM/Fastq – all of the base quality information
        • 2010-2011: Stripping original qualities and other unused tags
        • 2012-: Current formats contain lots of repetition
          • Reference based compression
          • Reducing quality information e.g. quality binning or quality budgets
          • Potential formats: CRAM and/or Reduced BAM
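Quality binning, mentioned above as one way of reducing quality information, maps each base quality onto a small set of representative values so that the quality string compresses better. The sketch below is illustrative only: the bin boundaries in `BINS` are hypothetical rather than any published scheme, and it assumes Phred+33 encoded quality strings.

```python
# Hypothetical bins: (lowest quality, highest quality, representative value).
BINS = [(0, 9, 5), (10, 19, 15), (20, 29, 25), (30, 93, 35)]


def bin_quality(quality):
    """Map one Phred quality score to its bin's representative value."""
    for low, high, representative in BINS:
        if low <= quality <= high:
            return representative
    raise ValueError(f"quality {quality} is outside the supported range")


def bin_quality_string(qualities, offset=33):
    """Bin every score in a Phred+33 encoded quality string."""
    return "".join(
        chr(bin_quality(ord(symbol) - offset) + offset) for symbol in qualities
    )


original = "II5+#"  # Phred scores 40, 40, 20, 10, 2
binned = bin_quality_string(original)
print(original, "->", binned)
print("distinct symbols:", len(set(original)), "->", len(set(binned)))
```

The transformation is lossy by design: the binned string can never contain more distinct symbols than there are bins, which is what makes it cheaper to store.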
  17. CRAM Format
      (figure: CRAM models for compression, illustrated on the read TGAGCTCTAAGTACC with qualities 329183050298757; panels labelled Horizontal and Vertical show the quality strings 002020010022212 and -2---30---9---7; labels: Do nothing, Lossless, Quality lossy)
      (chart: CRAM current performance, axis ticks 100, 10, 1, 0.1; series include Untreated, CRAM lossless, and CRAM models for combination and substitutions/insertions)
      • CRAM v0.6 released 13.2.12:
        • Option to preserve all unmapped reads
        • Pairing information preservation regardless of distance
        • Performance and bug fixes
        • Revised and improved lossless mode
        • Arbitrary tags
      • http://www.ebi.ac.uk/ena/about/cram_toolkit
      • Source: Ewan Birney/Guy Cochrane, EBI
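The idea behind reference-based compression is that an aligned read mostly repeats the reference, so only its position, length, and differences need storing. The sketch below is a deliberately simplified illustration and not the CRAM format itself: it handles substitutions only, and the reference string, the `encode_read` and `decode_read` helpers, and the mismatching read are hypothetical (the embedded 15-base sequence is the example read from the slide).

```python
def encode_read(reference, position, read):
    """Store a read as (position, length, substitutions against the reference)."""
    segment = reference[position:position + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    differences = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if base != ref_base
    ]
    return position, len(read), differences


def decode_read(reference, record):
    """Rebuild the original read from the reference and a stored record."""
    position, length, differences = record
    bases = list(reference[position:position + length])
    for offset, base in differences:
        bases[offset] = base
    return "".join(bases)


# Hypothetical reference containing the slide's example read at position 4.
REFERENCE = "ACGT" + "TGAGCTCTAAGTACC" + "GGA"
read = "TGAGCTCTGAGTACC"  # differs from the reference at read offset 8 (A -> G)

record = encode_read(REFERENCE, 4, read)
print(record)
print(decode_read(REFERENCE, record) == read)
```

A read that matches the reference exactly yields an empty difference list, which is where most of the saving would come from in deep resequencing data.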
  18. Any questions?
      Richard Durbin, David Adams
      URLs
      • VRPipe: https://github.com/VertebrateResequencing/vr-pipe
      • iRODS@Sanger: BMC Bioinformatics 2011, 12:361
      • http://www.slideshare.net/thomaskeane
