Large Scale Resequencing: Approaches and Challenges


Thomas Keane
Vertebrate Resequencing Informatics group
Wellcome Trust Sanger Institute
Hinxton, Cambridge, UK

thomas.keane@sanger.ac.uk

AGBT Tutorial Workshop 15th February, 2012

Sanger total sequence (2007-2009)
[Figure: total sequence yield in Gbp]


Sanger total sequence to-date
[Figure: total sequence yield in Gbp]


Vertebrate Resequencing Informatics Group

 Established in 2008 with Jim Stalker
 PIs: Richard Durbin and David Adams
 Initial projects
 1000 Genomes project (http://www.1000genomes.org)
 Data processing, releases, aligner evaluation, sequencing
 Pilot 2008-2009: ~5Tbp (Nature 2010;467)
 Phase 1 2009-2011: ~30Tbp
 Phase 2 2011-: ~36.9Tbp (LowCov ilmn only)
 Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
 Sequencing 17 laboratory mouse strains
 SNPs, indels, SVs, de novo assembly
 ~1.2Tbp (Nature 2011;477)


UK10K

Investigating the role of rare genetic variants in health and disease
Whole genome cohorts: 4,000 individuals across two well-established and deeply
phenotyped UK cohorts with ongoing longitudinal phenotype collection:
  TWINSUK – 2,000
  ALSPAC – 2,000
  6x (18Gbp) per sample

Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
  Neurodevelopmental diseases – 3,000
 e.g. schizophrenia, autism spectrum disorders
  Obesity – 2,000
 e.g. severe childhood onset obesity
  Rare diseases – 1,000
 e.g. severe insulin resistance, congenital heart disease, ciliopathies
  5Gbp per sample

Expect to generate ~100Tbp by end 2012
  ~40Tbp from BGI
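The ~100Tbp figure can be checked against the per-sample numbers on this slide. A minimal sketch, assuming every planned sample is sequenced to exactly the stated yield (variable names are illustrative):

```python
# Planned UK10K sequence yield, using only the per-sample figures from the slide.
GBP_PER_TBP = 1000

whole_genome_samples = 2000 + 2000       # TWINSUK + ALSPAC
exome_samples = 3000 + 2000 + 1000       # neurodevelopmental + obesity + rare diseases

whole_genome_gbp = whole_genome_samples * 18   # 6x coverage is ~18 Gbp per sample
exome_gbp = exome_samples * 5                  # 5 Gbp per exome

total_tbp = (whole_genome_gbp + exome_gbp) / GBP_PER_TBP

print(f"Whole genomes: {whole_genome_gbp / GBP_PER_TBP:.0f} Tbp")
print(f"Exomes:        {exome_gbp / GBP_PER_TBP:.0f} Tbp")
print(f"Total:         {total_tbp:.0f} Tbp")
```

Under these assumptions the whole genomes account for roughly 72 Tbp and the exomes for roughly 30 Tbp, which is consistent with the ~100Tbp expectation.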


Current Status

UK10K recently passed the 1000 Genomes Project in total Gbp sequenced

What are the challenges?

NGS challenges:
 Storage
 Software/Workflows
 Compute Power


Data Production Workflow

[Diagram: merge-up of BAM files, from bottom to top]
 Import: Fastq files
 Alignment (bwa, smalt etc) + Improvement: lane BAMs
 Library merge: one BAM per library
 Sample/Platform merge: one BAM per sample and platform (e.g. NA34842, NA87465)
 Freeze

Data Production Workflow

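[Diagram: from sample BAMs to final calls]
 Sample BAMs (NA19294, NA18943, NA19305, … NA19309) are split by chromosome (Chr1, Chr2, Chr3, …)
 Merge across samples: cross-sample BAMs, with one read group per sample (RG:NA19294, RG:NA18943, RG:NA19305)
 Variant calling: SNPs/indels (samtools, GATK); structural variants (SVMerge, Genome STRiP)
 VQSR
 BEAGLE/Impute2
 Annotation: VEP
 Final VCF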


Storage Challenges

Expect ~200Tbp of sequence in 2011-2012
 Working storage estimate, including processing, release, and variant calling:
 10 bytes per bp
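As a worked example, the two figures above translate into a total storage requirement as follows. This is a back-of-the-envelope sketch that assumes decimal units (1 Tbp = 10^12 bp, 1 PB = 10^15 bytes).

```python
# Storage needed for the expected 2011-2012 sequence yield.
BYTES_PER_BP = 10          # working estimate from the slide
sequence_tbp = 200         # expected sequence, in terabases

total_bytes = sequence_tbp * 10**12 * BYTES_PER_BP
petabytes = total_bytes / 10**15

print(f"{sequence_tbp} Tbp at {BYTES_PER_BP} bytes/bp needs about {petabytes:.0f} PB")
```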

Storage considerations
 Scalability – can we easily add more storage units?
 Backup and disaster recovery – what do we really need to keep?
 Performance – sufficient I/O throughput to serve compute nodes
 Cost

Data Formats
 Standardised formats – BAM & VCF 

Minimise the number of copies
 Aim for two copies at most – original lanes + release (stripped) BAM


A Tiered Storage Solution

[Diagram: three storage tiers ranked by cost and size and connected to the CPU farm; link speeds shown are 3Gb/sec and 800Mb/sec; off-site copies are also shown]
Level 1
  Data: Current release vertical BAMs
  Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
Level 2
  Data: Lane level BAMs
  Processes: Alignment, recalibration, local realignment
Level 3
  Data: Previous release BAMs + variant calls backup


Data release + archiving: iRODS

Integrated Rule-Oriented Data System (iRODS)
  Open source – origins in particle physics world
  Most important feature of iRODS is the Rule Engine
  Akin to source control system
Customise own application level metadata
  e.g. run, lane, plex, sample, library….
Stores/searches key-value metadata on files:
  List all files from UK10K studies:
    imeta -z seq qu -d study like 'UK10K_%'
    /seq/5363/5363_1.bam
    /seq/5363/5363_2.bam (.....and a whole lot more)
  Get metadata about a file:
    imeta ls -d /seq/6534/6534_3#7.bam sample
    attribute: sample
    value: QTL191953
[Diagram: storage servers nfs01, nfs02, nfs03, nfs20 and an off-site copy]

Sanger production: BAM files from runs per lane per plex deposited
  BMC Bioinformatics 2011, 12:361

Recently adopted for UK10K internal data release and archiving
  Users use meta-data queries to find their data
  Files can be part of multiple releases
http://www.irods.org
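The idea of finding files through key-value metadata rather than paths can be illustrated with a small in-memory model. This is not the iRODS API: the catalogue, the study names, and the `query` helper are hypothetical, and the pattern matching assumes SQL-style `like` semantics (`%` matches any run of characters, `_` matches one character).

```python
import re

# Hypothetical catalogue: file path -> key-value metadata. Study names are invented.
catalog = {
    "/seq/5363/5363_1.bam": {"study": "UK10K_EXAMPLE_ONE"},
    "/seq/5363/5363_2.bam": {"study": "UK10K_EXAMPLE_TWO"},
    "/seq/6534/6534_3#7.bam": {"study": "OTHER_STUDY", "sample": "QTL191953"},
}


def like_to_regex(pattern):
    """Translate a SQL-style LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def query(files, attribute, pattern):
    """Return the sorted paths whose metadata value for `attribute` matches `pattern`."""
    regex = like_to_regex(pattern)
    return sorted(
        path
        for path, metadata in files.items()
        if attribute in metadata and regex.fullmatch(metadata[attribute])
    )


print(query(catalog, "study", "UK10K_%"))
print(catalog["/seq/6534/6534_3#7.bam"]["sample"])
```

Because files are found by metadata, the same file can satisfy the queries for several releases without being copied.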


Compute Pipeline Management: VRPipe

VRPipe
 Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
 Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics
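Retrying failures with adjusted memory requests can be sketched in a few lines. This is an illustration of the general pattern only and does not reflect VRPipe's actual code or interface: `run_with_retries`, the doubling policy, and `fake_alignment_job` are all assumptions.

```python
def run_with_retries(job, initial_memory_mb, max_attempts=3, growth=2.0):
    """Run `job(memory_mb)`, raising the memory request after each out-of-memory failure.

    Returns the job result and a report of the failed attempts.
    """
    memory_mb = initial_memory_mb
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            return job(memory_mb), failures
        except MemoryError as exc:
            # Record the failure for reporting, then ask for more memory next time.
            failures.append((attempt, memory_mb, str(exc)))
            memory_mb = int(memory_mb * growth)
    raise RuntimeError(f"job failed after {max_attempts} attempts: {failures}")


def fake_alignment_job(memory_mb):
    """Stand-in for a real job: it fails unless given at least 2000 MB."""
    if memory_mb < 2000:
        raise MemoryError(f"needed more than {memory_mb} MB")
    return "done"


result, failure_report = run_with_retries(fake_alignment_job, initial_memory_mb=941)
print(result, failure_report)
```

Starting from a modest request and escalating only on failure keeps most jobs small while still letting the occasional large job finish.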
1000 Genomes Phase 2 processed through VRPipe
 Tracked ~1 million jobs
 Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
 bwa_aln_fastq: ~2443 days total serial wall time
 Mean memory: 941MB/job (max 5637)
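The job statistics above imply a mean job length and a share of time spent in alignment. A short calculation, treating "~1 million jobs" as exactly 1,000,000 for the sake of the estimate:

```python
from datetime import timedelta

# Figures quoted on the slide for 1000 Genomes Phase 2.
total_wall_time = timedelta(days=9886, hours=3, minutes=43, seconds=25)
bwa_aln_wall_time = timedelta(days=2443)
job_count = 1_000_000      # "~1 million jobs", rounded for the estimate

mean_job_seconds = total_wall_time.total_seconds() / job_count
bwa_fraction = bwa_aln_wall_time / total_wall_time   # timedelta / timedelta gives a float

print(f"Mean serial wall time per job: {mean_job_seconds / 60:.1f} minutes")
print(f"Share of wall time in bwa_aln_fastq: {bwa_fraction:.1%}")
```

On these numbers the average job runs for about 14 minutes, and bwa_aln_fastq accounts for roughly a quarter of all serial wall time.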
2012 plans (contact: sb10@sanger.ac.uk)

 Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
 Management front-ends
 Create distributable VM for cloud rollout
http://www.github.com/VertebrateResequencing/vr-pipe/wiki


Even more scale up in 2012 – HiSeq 2500

Currently takes 1-2 weeks to sequence a human genome
 High depth human genomes in a single day – Illumina HiSeq 2500
 Caucasian family with a severe T-cell deficiency in affected sibling
 Single run on HiSeq 2500 by Illumina per individual

Sample     PF Yield (Gbp)   % ≥Q30   % Align   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
Father     117.7            89       92.6      0.4               0.5               25.5
Mother     125.7            90.2     92.8      0.4               0.5               25.5
Affected   124.4            90.3     92.4      0.4               0.5               25.5
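The yields in the table can be turned into an approximate depth of coverage. A rough sketch, assuming a 3 Gbp genome (the same ratio as the "6x (18Gbp)" figure quoted for UK10K) and ignoring unaligned and duplicate reads, so the values are upper bounds on usable depth:

```python
GENOME_SIZE_GBP = 3.0      # assumption: 18 Gbp per 6x sample implies ~3 Gbp per 1x

# PF yield in Gbp for each family member, from the table.
pf_yield_gbp = {"Father": 117.7, "Mother": 125.7, "Affected": 124.4}

coverage = {sample: gbp / GENOME_SIZE_GBP for sample, gbp in pf_yield_gbp.items()}

for sample, depth in coverage.items():
    print(f"{sample}: ~{depth:.1f}x raw coverage")
```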


What does the data look like?


Upcoming Changes in 2012

We cannot keep all of the data
 2007-2008: Keep everything including images from runs
 2009: BAM/Fastq – all of the base quality information
 2010-2011: Stripping original qualities and other unused tags
 2012-: Current formats contain lots of repetition
 Reference based compression
 Reducing quality information e.g. quality binning or quality budgets
 Potential formats: CRAM and/or Reduced BAM
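Quality binning replaces each base quality with a representative value for its range, so fewer distinct symbols need to be stored. A minimal sketch; the bin boundaries and representative values below are illustrative assumptions and are not the scheme used by CRAM or any sequencing vendor.

```python
import bisect

# Illustrative bins (an assumption): qualities 0-9 -> 5, 10-19 -> 15, 20-29 -> 25, 30+ -> 35.
BIN_LOWER_BOUNDS = [0, 10, 20, 30]
BIN_VALUES = [5, 15, 25, 35]


def bin_quality(quality):
    """Map one Phred quality score to the representative value of its bin."""
    index = bisect.bisect_right(BIN_LOWER_BOUNDS, quality) - 1
    return BIN_VALUES[index]


def bin_quality_string(qualities, offset=33):
    """Apply binning to a Phred+33 encoded quality string, as found in Fastq and BAM."""
    return "".join(chr(bin_quality(ord(ch) - offset) + offset) for ch in qualities)


original = "II5+"                       # Phred scores 40, 40, 20, 10
binned = bin_quality_string(original)
print(original, "->", binned, f"({len(set(binned))} distinct symbols)")
```

With only four possible values per base, the quality stream compresses far better than the original, at the cost of losing fine-grained quality information.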


CRAM Format
CRAM models for compression (example read and its qualities):
  Original:          TGAGCTCTAAGTACC
                     329183050298757
  Horizontal model:  TGAGCTCTAAGTACC
                     002020010022212
  Vertical model:    TGAGCTCTAAGTACC
                     -2---30---9---7

[Figure: CRAM current performance on a log scale (axis ticks 100, 10, 1, 0.1), running from "do nothing" through lossless to quality lossy; points shown: untreated, CRAM lossless model, CRAM combination model, CRAM substitutions/insertions model]

CRAM v0.6 released 13.2.12:
•  Pairing information preservation regardless of distance
•  Revised and improved lossless mode
•  Option to preserve all unmapped reads
•  Performance and bug fixes
•  Arbitrary tags

http://www.ebi.ac.uk/ena/about/cram_toolkit
Source: Ewan Birney/Guy Cochrane, EBI


Any questions?

Richard Durbin
David Adams

URLs
•  VRPipe: https://github.com/VertebrateResequencing/vr-pipe
•  iRODS@Sanger: BMC Bioinformatics 2011, 12:361
•  http://www.slideshare.net/thomaskeane

