1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing
Transcript

  • 1. 1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing
      Thomas Keane, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK
      E: tk2@sanger.ac.uk
      Vertebrate Resequencing Informatics, 8th December, 2010
  • 2. 1000G Update
      • Total number of base pairs: 23,416GB
      • Aligned base pairs: 13,527GB
      • Number of samples: 1103
      • Samples with > 10GB raw sequence: 1078
      • Samples with > 10GB aligned sequence: 718
      Laura Clarke
  • 3. 1000G update – Raw Sequence Growth
      [Figure: raw sequence growth over time, one series per population (CEU, YRI, JPT, TSI, CHB, ASW, LWK, MXL, GBR, CHS, FIN, PUR, CLM, IBS); y-axis from 0 to 25000, x-axis in monthly steps]
      Laura Clarke
  • 4. UK10K
      • Large scale population/medical based sequencing project
      • UK10K project recently funded by WT
          • 4,000 cohort samples genome wide @ 6x
              • Deeply phenotyped TwinsUK and ALSPAC cohorts
          • 6,000 exomes from extreme samples
              • Protein coding exons from GenCode
              • Extreme end of traits of medical interest, and from collections of familial cases
              • Accumulation of rare variants within genes or pathways
          • Utilise computational methods, data formats and workflows developed during 1000 genomes project
          • Data release via EGA under access control
          • Estimating 100Tbp of raw sequence data
          • http://www.uk10k.org
  • 5. 1000G BAM File Evolutions
      • BAM
          • Until now BAMs included all raw data
          • Recently tag removal
              • OQ: original qualities
              • Non-standard tags: XM, XG, XO
          • Also added BAQ differences to indicate non-confidently aligned bases
          • Space saving of 30%
              • E.g. NA19625: 1.45 vs 0.98 bytes per bp
              • Primary gain is from removal of original qualities
      • Further proposals
          • Replace base calls with '=' sign to indicate agreement with reference
          • Rejected due to lack of tool support
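The tag removal on this slide can be sketched on the text (SAM) form of a record, where the first 11 tab-separated fields are mandatory and everything after them is an optional `TAG:TYPE:VALUE` field. This is a minimal illustration, not the 1000G pipeline's code: the helper names and the example record are hypothetical, and only the tag names and the bytes-per-bp figures come from the slide.

```python
# Tags named on the slide as removed from the 1000G BAMs.
DROPPED_TAGS = {"OQ", "XM", "XG", "XO"}


def strip_tags(sam_line, dropped=DROPPED_TAGS):
    """Return a SAM alignment line with the given optional tags removed."""
    fields = sam_line.rstrip("\n").split("\t")
    mandatory, optional = fields[:11], fields[11:]
    # Optional fields look like TAG:TYPE:VALUE; keep those whose TAG is not dropped.
    kept = [f for f in optional if f.split(":", 1)[0] not in dropped]
    return "\t".join(mandatory + kept)


def saving(before_bytes_per_bp, after_bytes_per_bp):
    """Fractional space saving when going from 'before' to 'after'."""
    return 1.0 - after_bytes_per_bp / before_bytes_per_bp


# Hypothetical record: 11 mandatory fields followed by four optional tags.
record = "\t".join([
    "read1", "0", "1", "100", "60", "5M", "*", "0", "0",
    "ACGTA", "IIIII", "RG:Z:lane1", "OQ:Z:HHHHH", "XM:i:0", "NM:i:0",
])

print(strip_tags(record))
print(f"{saving(1.45, 0.98):.1%}")  # NA19625 figures from the slide
```

With the NA19625 figures the saving comes out at about 32%, in line with the rounded 30% quoted on the slide.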
  • 6. Population/Transposed BAM
      • Traditionally BAM files have been produced per sample with all of the lanes/libraries merged
          • Lanes -> Library -> Platform -> Sample (1 per individual)
      • Problem: population based SNP calling needs to be aware of the reads across multiple samples at the same loci
          • Problems with opening hundreds/thousands of file handles simultaneously
          • Distributed/parallel file systems perform best when reading a few large striped files
      • Solution: Transposed BAMs
          • Genome slices with multiple samples within a single BAM
              • E.g. entire CEU population
          • Header information to separate read groups into samples
              • Samtools mpileup, GATK etc. support this functionality
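The "header information to separate read groups into samples" point relies on the SAM header convention that each `@RG` line carries an `ID` tag and an `SM` (sample) tag, and that each read names its read group in an `RG:Z:` field. A minimal sketch of that lookup on SAM text follows; the function names, lane IDs and toy records are hypothetical, and a real pipeline would use a BAM library rather than string splitting.

```python
from collections import defaultdict


def read_group_to_sample(header_lines):
    """Map each @RG ID to its SM (sample) value."""
    mapping = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        mapping[tags["ID"]] = tags["SM"]
    return mapping


def reads_per_sample(alignment_lines, rg_to_sample):
    """Count reads per sample using the RG:Z: tag on each alignment line."""
    counts = defaultdict(int)
    for line in alignment_lines:
        # Optional fields start after the 11 mandatory SAM columns.
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith("RG:Z:"):
                counts[rg_to_sample[field[5:]]] += 1
                break
    return dict(counts)


# Toy header for a multi-sample slice; lane IDs are invented for illustration.
header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:lane1\tSM:NA19294\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:NA19294\tPL:ILLUMINA",
    "@RG\tID:lane3\tSM:NA18943\tPL:ILLUMINA",
]
base = ["r", "0", "1", "100", "60", "5M", "*", "0", "0", "ACGTA", "IIIII"]
reads = ["\t".join(base + [f"RG:Z:{rg}"]) for rg in ("lane1", "lane2", "lane3")]

rg_map = read_group_to_sample(header)
print(reads_per_sample(reads, rg_map))
```

Because the sample identity travels with each read, a caller can open one file per genome slice instead of one file handle per individual.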
  • 7. Horizontal/Transposed BAM
      [Diagram: transposed BAMs; one row per sample (NA19294, NA18943, NA19305, ...), each row divided into Chr1, Chr2, ... columns]
      • Key questions
          • Slice size – chromosome? 1Mbp, 10Mbp or 100Mbp?
          • Size of individual groupings – 10, 50, 100, 500 individuals?
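The two key questions trade file count against file size. A rough way to explore that trade-off is to count how many transposed BAMs a given slice size and grouping size would produce. The sketch below is only an illustration: the function names are hypothetical and the chromosome lengths are round toy values, not a real reference.

```python
import math


def slices(chrom_lengths, slice_size):
    """Yield (chrom, start, end) windows, 1-based and inclusive."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, slice_size):
            yield chrom, start, min(start + slice_size - 1, length)


def file_count(chrom_lengths, slice_size, n_samples, group_size):
    """Number of transposed BAMs: one per (genome slice, sample group)."""
    n_slices = sum(math.ceil(length / slice_size)
                   for length in chrom_lengths.values())
    n_groups = math.ceil(n_samples / group_size)
    return n_slices * n_groups


# Toy lengths for two chromosomes (illustrative round numbers only).
toy_genome = {"1": 250_000_000, "2": 243_000_000}

for slice_size in (1_000_000, 10_000_000, 100_000_000):
    for group_size in (10, 50, 100, 500):
        n = file_count(toy_genome, slice_size, 1103, group_size)
        print(f"slice={slice_size:>11,} bp  group={group_size:>3}  files={n}")
```

The sample count of 1103 is the figure from the 1000G update slide; everything else in the loop is a parameter to vary.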
  • 8. VCF Format
      • Fully adopted by 1000G group as interchange format for variant calls
          • SNPs, indels, and recently SVs
          • Genotyping calls for all samples
          • Annotation of variants via user-defined tags
          • VCF APIs and tools via http://vcftools.sourceforge.net
          • Scaling issues with VCF – BCF format in development
      Petr Danecek
  • 9. VCF (useful) Bloat
      • Every release of 1000G adds more tags to VCF files
          ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
          ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
          ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
          ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
          ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
          ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
          ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
          ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
          ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
          ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
          ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
          ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
      • UK10K propose rich annotation of VCF files
          • Known SNPs/indels
              • RS IDs, G1K unaccessioned SNPs
          • Geographical information
              • Ensembl annotation (coding, exonic, intronic, UTR, splice..)
              • microRNA, eQTL, known disease loci
          • Coding consequences
              • Synonymous/non-synonymous, splice, stop, GERP score
          • Functional interpretation
              • PolyPhen, SIFT, PANTHER
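Every tag added to a release is declared by one `##INFO` header line and then appears as a `KEY=VALUE` entry (or a bare flag) in the semicolon-separated INFO column of each record. As a minimal sketch of how a consumer might read both, assuming header lines shaped exactly like the ones on this slide, the following uses hypothetical helper names and a made-up INFO string; it is not the VCFtools API.

```python
import re

# Matches header lines of the form shown on the slide; the quoted
# description may itself contain commas.
INFO_RE = re.compile(
    r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">'
)


def parse_info_header(line):
    """Parse one ##INFO header line into a dict, or return None."""
    match = INFO_RE.match(line.strip())
    if match is None:
        return None
    info_id, number, value_type, description = match.groups()
    return {"ID": info_id, "Number": number,
            "Type": value_type, "Description": description}


def parse_info_field(info):
    """Split a record's INFO column; tags without '=' are flags (True)."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


header_line = ('##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count '
               'in genotypes, for each ALT allele, in the same order as listed">')
print(parse_info_header(header_line))
print(parse_info_field("AC=2;AN=4;DS"))  # illustrative values only
```

Values are left as strings here; a fuller parser would convert them using the declared `Type` and `Number`.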
  • 10. Storage Challenges
      • Storage
          • Try to reduce the proportion of raw data we keep (e.g. images, OQ in BAM, remove base calls in BAM etc.)
          • However there's still a LOT of data to store and analyse!
          • Estimation for our group based on ~200Tbp of sequencing data over next 2-3 years
              • 1.5 Pbytes
          • Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly releases, backup of lane BAMs, Variant calls
          • Transient: Library BAMs, Local assemblies
          • Storage type optimality criteria
              • Cost per Tbyte
              • Proximity to compute resources
              • Scalability – room for expansion/future proofing
              • I/O throughput
              • Disaster recovery
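The two headline numbers on this slide imply an overall storage footprint per sequenced base, which can be compared with the per-BAM figure quoted for NA19625 on the BAM slide (0.98 bytes per bp after tag removal). The short calculation below is only a back-of-envelope check: it assumes decimal units (1 Tbp = 10^12 bp, 1 Pbyte = 10^15 bytes) and the constant names are hypothetical.

```python
# Figures taken from the slides; decimal units are an assumption.
SEQUENCED_BP = 200e12      # ~200 Tbp over the next 2-3 years
STORAGE_BYTES = 1.5e15     # 1.5 Pbytes estimated
BAM_BYTES_PER_BP = 0.98    # NA19625 after tag removal


def bytes_per_bp(storage_bytes, sequenced_bp):
    """Total bytes stored per sequenced base pair."""
    return storage_bytes / sequenced_bp


overall = bytes_per_bp(STORAGE_BYTES, SEQUENCED_BP)
copies = overall / BAM_BYTES_PER_BP

print(f"{overall:.2f} bytes stored per sequenced bp")
print(f"equivalent to about {copies:.1f} slimmed BAM copies")
```

The estimate works out to 7.5 bytes per sequenced base, several times a single slimmed BAM. That is plausible given that the permanent set lists lane alignments, transposed and horizontal BAMs, releases and backups, though the slide does not break the total down.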
  • 11. A Tiered Solution
      • 3 tiered storage model
      • Trade off cost, quantity, I/O throughput
      • Similar to caching strategies in computer design
          • Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g. local assemblies, reference files
          • Level 1: High-performance, highly parallel, close proximity to compute, expensive, suitable for high I/O tasks
          • Level 2: Mid-tier storage, some type of NFS technology, discrete units with some local compute, suitable for low I/O tasks that are compute intensive, scalable by adding more discrete units
          • Level 3: High latency storage, warehouse storage, not suitable to compute against, occasional access e.g. old data releases
              • (Level 3a: Off-site replication of data in level 3)
  • 12. A Tiered Solution
      [Diagram: CPU farm attached to Level 1 (high performance, 3Gb/sec) and Level 2 (middle tier/nfs, 800Mb/sec); Level 3 (backup/warehouse) and Level 3a (off-site replication) behind them; tiers ranked by cost and size]
      • Level 1
          • Data: Current release horizontal + transposed BAMs
          • Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
      • Level 2
          • Data: Lane level BAMs
          • Processes: Alignment, recalibration, local realignment
      • Level 3
          • Data: Old release BAMs + variant calls backup
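The process-to-tier assignment on this slide amounts to a small lookup table, and the two throughput figures give a rough ratio between the tiers. The sketch below encodes both; the dictionary layout and function names are hypothetical, and the throughput figures are taken as written on the slide, with 3Gb/sec read as 3000 Mb/sec (the slide does not say whether these are bits or bytes).

```python
# Assignments copied from the slide; the dictionary layout is hypothetical.
PROCESS_TIER = {
    "bam_merging": 1,
    "bam_splitting": 1,
    "variant_calling": 1,
    "alignment": 2,
    "recalibration": 2,
    "local_realignment": 2,
}

# Throughput per tier in Mb/sec, in the slide's own units
# (3Gb/sec is taken as 3000 Mb/sec, which is an assumption).
TIER_THROUGHPUT = {1: 3000.0, 2: 800.0}


def tier_for(process):
    """Return the storage level a process should run against."""
    return PROCESS_TIER[process]


def streaming_seconds(size_mb, tier):
    """Time to stream size_mb (same units as the throughput) from a tier."""
    return size_mb / TIER_THROUGHPUT[tier]


def speedup(fast_tier, slow_tier):
    """Throughput ratio between two tiers."""
    return TIER_THROUGHPUT[fast_tier] / TIER_THROUGHPUT[slow_tier]


print(tier_for("variant_calling"), tier_for("alignment"))
print(f"Level 1 streams {speedup(1, 2):.2f}x faster than Level 2")
```

On these figures Level 1 is a little under four times faster than Level 2, which fits placing the I/O-heavy merging, splitting and variant calling steps there.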
  • 13. Compute Challenges
      • Compute
          • New algorithms continually developed for more accurate variant calling
          • In 2010 several new processes were added into the production pipeline
              • BAM Improvement
                  • Local realignment around indels to correct mapping biases (e.g. GATK)
                  • Adding BAQ differences up front
              • Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
              • Local reassembly of SV breakpoints
          • Easy to estimate runtime for known processes (e.g. mapping, recalibration, duplicate removal)
              • Challenge to estimate runtime for next 2-3 years for new algorithms
              • E.g. more use of assembly methods – more complex references?
      • I/O has become a significant bottleneck and is the most difficult thing to measure
          • All computations need to minimise I/O
              • E.g. transforming BAM files to different sort orders
  • 14. Project Data Release
      • Do we need to release BAMs?
      • Large scale human phenotype driven sequencing projects going forward
          • Participants are more interested in the variants than the raw data
      • BAM files may contain too much data and be too large to ship around among project members
      • UK10K proposals
          • Lane BAM files submitted to the archives
          • Do not release BAM files via project FTP
          • Project data release to consist solely of annotated VCF files
          • Raw data can be obtained from the archives
