1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing



  1. 1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing
     Thomas Keane, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK
     E: tk2@sanger.ac.uk
     Vertebrate Resequencing Informatics, 8th December, 2010
  2. 1000G Update
     Total number of base pairs               23,416GB
     Aligned base pairs                       13,527GB
     Number of samples                        1103
     Samples with > 10GB raw sequence         1078
     Samples with > 10GB aligned sequence     718
     (Laura Clarke)
  3. 1000G update – Raw Sequence Growth
     [Figure: plot of raw sequence growth]
     (Laura Clarke)
  4. UK10K
     Large scale population/medical based sequencing project
     UK10K project recently funded by WT
     • 4,000 cohort samples genome wide @ 6x
        - Deeply phenotyped TwinsUK and ALSPAC cohorts
     • 6,000 exomes from extreme samples
        - Protein coding exons from GENCODE
        - Extreme end of traits of medical interest, and from collections of familial cases
        - Accumulation of rare variants within genes or pathways
     • Utilise computational methods, data formats and workflows developed during 1000 genomes project
     • Data release via EGA under access control
     • Estimating 100Tbp of raw sequence data
     • http://www.uk10k.org
  5. 1000G BAM File Evolutions
     BAM
     • Until now BAMs included all raw data
     • Recently, tags have been removed
        - OQ: original qualities
        - Non-standard tags: XM, XG, XO
     • Also added BAQ differences to indicate non-confidently aligned bases
     • Space saving of about 30%
        - E.g. NA19625: 1.45 vs 0.98 bytes per bp
        - Primary gain is from removal of original qualities
     Further proposals
     • Replace base calls with '=' sign to indicate agreement with reference
     • Rejected due to lack of tool support
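To make the tag removal and the quoted saving concrete, here is a minimal sketch that works on a single SAM text record rather than a binary BAM. The helper names (`strip_tags`, `fractional_saving`) and the example record are hypothetical and only illustrate the idea; a production pipeline would use a BAM library instead.

```python
# Hypothetical helpers, for illustration only.
DROP_TAGS = {"OQ", "XM", "XG", "XO"}  # the tags named on the slide


def strip_tags(sam_line, drop=DROP_TAGS):
    """Return a SAM record with the given optional tags removed."""
    fields = sam_line.rstrip("\n").split("\t")
    # A SAM record has 11 mandatory fields; optional TAG:TYPE:VALUE fields follow.
    mandatory, optional = fields[:11], fields[11:]
    kept = [f for f in optional if f.split(":", 1)[0] not in drop]
    return "\t".join(mandatory + kept)


def fractional_saving(before, after):
    """Fraction of space saved when going from `before` to `after`."""
    return (before - after) / before


# Made-up record: 11 mandatory fields plus four optional tags.
record = "\t".join([
    "read1", "0", "1", "100", "60", "4M", "*", "0", "0",
    "ACGT", "IIII", "OQ:Z:FFFF", "RG:Z:lane1", "XM:i:0", "NM:i:0",
])

print(strip_tags(record).split("\t")[11:])      # ['RG:Z:lane1', 'NM:i:0']
print(round(fractional_saving(1.45, 0.98), 3))  # 0.324
```

The NA19625 figures therefore correspond to a saving of roughly 32%, consistent with the "30%" quoted on the slide.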
  6. Population/Transposed BAM
     Traditionally BAM files have been produced per sample with all of the lanes/libraries merged
     • Lanes -> Library -> Platform -> Sample (1 per individual)
     Problem: population based SNP calling needs to be aware of the reads across multiple samples at the same loci
     • Problems with opening hundreds/thousands of file handles simultaneously
     • Distributed/parallel file systems prefer reading a few large striped files
     Solution: Transposed BAMs
     • Genome slices with multiple samples within a single BAM
        - E.g. entire CEU population
     • Header information to separate read groups into samples
        - Samtools mpileup, GATK etc. support this functionality
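The "header information to separate read groups into samples" refers to the `@RG` header lines of the SAM/BAM format, where each read group carries an `ID` tag and a sample (`SM`) tag. A minimal sketch of how a tool might recover the sample grouping from a multi-sample header follows; the function name and the header contents (lane and library names) are hypothetical.

```python
from collections import defaultdict


def read_groups_by_sample(header_lines):
    """Map sample name (SM) -> list of read group IDs, using @RG header lines."""
    samples = defaultdict(list)
    for line in header_lines:
        if not line.startswith("@RG"):
            continue  # ignore @HD, @SQ, @PG and other header records
        # Each field after the record type is TAG:VALUE.
        tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        samples[tags["SM"]].append(tags["ID"])
    return dict(samples)


# Made-up header for a transposed BAM holding two samples.
header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:lane1\tSM:NA19294\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:NA19294\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane3\tSM:NA18943\tLB:lib2\tPL:ILLUMINA",
]

print(read_groups_by_sample(header))
```

Each read then carries an `RG` tag pointing at one of these IDs, so a caller can assign every read in the slice to its sample while holding only one file handle open.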
  7. Horizontal/Transposed BAM
     [Diagram: samples NA19294, NA18943, ..., NA19305 as rows, each spanning Chr1, Chr2, ...; labelled "Transposed BAMs"]
     Key questions
     • Slice size: chromosome? 1Mbp, 10Mbp or 100Mbp?
     • Size of individual groupings: 10, 50, 100, 500 individuals?
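The slice-size question can be explored with a small sketch that cuts chromosomes into fixed-size windows and counts the resulting files per sample grouping. The function name and the toy chromosome lengths are hypothetical; real lengths would come from the reference.

```python
def genome_slices(chrom_lengths, slice_size):
    """Yield (chrom, start, end) windows, 1-based and inclusive."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, slice_size):
            yield chrom, start, min(start + slice_size - 1, length)


# Toy reference with two made-up chromosomes (lengths in bp).
toy_reference = {"chrA": 25_000_000, "chrB": 10_000_000}

for size in (1_000_000, 10_000_000, 100_000_000):
    n = sum(1 for _ in genome_slices(toy_reference, size))
    print(f"{size:>11,} bp slices -> {n} files per sample grouping")
```

Smaller slices mean more, smaller files and finer-grained parallelism; larger slices mean fewer, larger files, which suits striped parallel file systems better.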
  8. VCF Format
     Fully adopted by 1000G group as interchange format for variant calls
     • SNPs, indels, and recently SVs
     • Genotyping calls for all samples
     • Annotation of variants via user-defined tags
     • VCF APIs and tools via http://vcftools.sourceforge.net
     • Scaling issues with VCF: BCF format in development
     (Petr Danecek)
  9. VCF (useful) Bloat
     Every release of 1000G adds more tags to VCF files
        ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
        ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
        ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
        ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
        ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
        ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
        ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
        ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
        ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
        ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
        ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
        ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
     UK10K propose rich annotation of VCF files
     • Known SNPs/indels
        - RS IDs, G1K unaccessioned SNPs
     • Geographical information
        - Ensembl annotation (coding, exonic, intronic, UTR, splice..)
        - microRNA, eQTL, known disease loci
     • Coding consequences
        - Synonymous/non-synonymous, splice, stop, GERP score
     • Functional interpretation
        - PolyPhen, SIFT, PANTHER
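Because every new tag is declared in a `##INFO` header line, a consumer can discover the annotations in a file without prior knowledge of the release. The following is a minimal sketch of parsing such lines with the standard library; the regular expression covers only the four fields shown on the slide, and the function name is hypothetical. It is not a full VCF header parser.

```python
import re

# Matches the ID/Number/Type/Description layout used in the header lines above.
INFO_RE = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>.*)">'
)


def parse_info_header(line):
    """Return a dict describing one ##INFO line, or None if it does not match."""
    match = INFO_RE.match(line.strip())
    return match.groupdict() if match else None


lines = [
    '##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, '
    'for each ALT allele, in the same order as listed">',
    '##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">',
    '##fileformat=VCFv4.0',
]

for line in lines:
    print(parse_info_header(line))
```

Note that the description may itself contain commas, which is why it is matched as a quoted string rather than by splitting the line on commas.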
  10. Storage Challenges
     Storage
     • Try to reduce the proportion of raw data we keep (e.g. images, OQ in BAM, remove base calls in BAM etc.)
     • However there's still a LOT of data to store and analyse!
     • Estimation for our group based on ~200Tbp of sequencing data over next 2-3 years
        - 1.5 Pbytes
     • Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly releases, backup of lane BAMs, variant calls
     • Transient: Library BAMs, local assemblies
     • Storage type optimality criteria
        - Cost per Tbyte
        - Proximity to compute resources
        - Scalability: room for expansion/future proofing
        - I/O throughput
        - Disaster recovery
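As a rough sanity check on the estimate, the two quoted figures imply an overall storage footprint per sequenced base. The sketch below assumes decimal units (1 Tbp = 10^12 bases, 1 Pbyte = 10^15 bytes), which the slide does not state, and the function name is hypothetical.

```python
TBP = 10**12    # bases per Tbp (decimal units assumed)
PBYTE = 10**15  # bytes per Pbyte (decimal units assumed)


def bytes_per_base(storage_bytes, bases):
    """Average storage footprint per sequenced base."""
    return storage_bytes / bases


footprint = bytes_per_base(1.5 * PBYTE, 200 * TBP)
print(footprint)  # 7.5
```

Under these assumptions the plan budgets about 7.5 bytes per base across all permanent and transient copies, several times the 0.98 bytes per bp quoted for a single tag-stripped BAM (NA19625), which is what one would expect when lane, transposed and horizontal BAMs, releases and backups are all held.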
  11. A Tiered Solution
     3 tiered storage model
     Trade off cost, quantity, I/O throughput
     Similar to caching strategies in computer design
     • Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g. local assemblies, reference files
     • Level 1: High-performance, highly parallel, close proximity to compute, expensive, suitable for high I/O tasks
     • Level 2: Mid-tier storage, some type of NFS technology, discrete units with some local compute, suitable for low I/O tasks that are compute intensive, scalable by adding more discrete units
     • Level 3: High latency storage, warehouse storage, not suitable to compute against, occasional access e.g. old data releases
        - (Level 3a: Off-site replication of data in level 3)
  12. A Tiered Solution
     [Diagram: storage tiers ranked by Cost and Size, with the CPU Farm]
     • Level 1: High performance, 3Gb/sec
     • Level 2: Middle tier/NFS, 800Mb/sec
     • Level 3: Backup/warehouse
     • Level 3a: Off-site replication
     Level 1
     • Data: Current release horizontal + transposed BAMs
     • Processes: BAM merging + splitting, variant calling (SNPs, indels, SVs)
     Level 2
     • Data: Lane level BAMs
     • Processes: Alignment, recalibration, local realignment
     Level 3
     • Data: Old release BAMs + variant calls backup
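The placement rules on these two slides can be written down as a small lookup table. This is only a sketch: the class, dictionary and function names are hypothetical, and the throughput figures are taken at face value in a common unit ("3Gb/sec" as 3000 and "800Mb/sec" as 800), since the slide does not say whether bits or bytes are meant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    level: str
    description: str
    throughput: Optional[float]  # in "Mb/sec" as quoted; None where not given


TIERS = {
    "1": Tier("1", "High performance", 3000.0),
    "2": Tier("2", "Middle tier/NFS", 800.0),
    "3": Tier("3", "Backup/warehouse", None),
}

# Data kinds and their levels as listed on the slide.
PLACEMENT = {
    "current release BAMs": "1",
    "lane BAMs": "2",
    "old release BAMs": "3",
    "variant calls backup": "3",
}


def tier_for(data_kind):
    """Look up the storage tier for a kind of data."""
    return TIERS[PLACEMENT[data_kind]]


def relative_read_time(slow_level, fast_level):
    """How many times longer a read takes on the slower tier."""
    return TIERS[fast_level].throughput / TIERS[slow_level].throughput


print(tier_for("lane BAMs").description)  # Middle tier/NFS
print(relative_read_time("2", "1"))       # 3.75
```

Read this way, streaming the same data from Level 2 takes about 3.75 times as long as from Level 1, which is why the I/O-heavy merging, splitting and variant calling steps are assigned to Level 1.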
  13. Compute Challenges
     Compute
     • New algorithms continually developed for more accurate variant calling
     • In 2010 several new processes were added into the production pipeline
        - BAM Improvement
           · Local realignment around indels to correct mapping biases (e.g. GATK)
           · Adding BAQ differences up front
        - Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
        - Local reassembly of SV breakpoints
     • Easy to estimate runtime for known processes (e.g. mapping, recalibration, duplicate removal)
        - Challenge to estimate runtime for next 2-3 years for new algorithms
        - E.g. more use of assembly methods, more complex references?
     I/O has become a significant bottleneck and is the most difficult thing to measure
     • All computations need to minimise I/O
        - E.g. transforming BAM files to different sort orders
  14. Project Data Release
     Do we need to release BAMs?
     Large scale human phenotype driven sequencing projects going forward
     • Participants are more interested in the variants than the raw data
     BAM files may contain too much data and be too large to ship around amongst project members
     UK10K proposals
     • Lane BAM files submitted to the archives
     • BAM files not released via project FTP
     • Project data release to consist solely of annotated VCF files
     • Raw data can be obtained from the archives