Aug2014 giab status update and wg charge

Published in: Health & Medicine

Transcript

  • 1. Genome in a Bottle: Reference Materials to Enable Translation August 2014 Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
  • 2. NIST Human Genome RMs in the pipeline • All 10 ug samples of DNA isolated from multistage large growth cell cultures – all are intended to act as stable, homogeneous references suitable for use in regulated applications – all genomes also available from Coriell repository • Pilot Genome – ~8400 tubes • Ashkenazim Jewish Trio – ~10000 son; ~2500 each parent • Asian Trio – ~10000 son; parents not yet planned as NIST RM
  • 3. Homogeneity Analysis • Sequence multiple libraries from multiple vials • Use somatic mutation callers to detect differences in SNPs and CNVs • First and last vial: 3 libraries sequenced to ~33x each – Use Varscan to detect differences in allele fraction of SNPs and indels between vials: significant differences only found in regions prone to alignment errors – Use BIC-seq to detect differences in copy number between vials and libraries: no consistent differences between comparisons of different libraries between vials • 4 random vials: 2 libraries sequenced to 12.5x each – Use BIC-seq to detect differences in copy number between vials and libraries: only one difference with p<10^-8, which is in a region prone to mapping errors
  • 4. [Stability study diagram: time points Time = 0 and Time = 8 weeks; conditions labeled 2week and 8week; treatments: Freeze Thaw 2x, Freeze Thaw 5x, Vortex (10sec), Vigorous Pipetting (full vol 10x), Shipping cross-country] • Run multiple gels for each condition • Blinded qualitative analysis of gel by 5 NIST staff • Consensus that only vials stored at 37° C for 8 weeks had significantly decreased size
  • 5. Example Gel Images
  • 6. Goals for Data to Accompany RM • ~0 false positive AND false negative calls in confident regions • Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection) • Avoid bias towards any particular platform – take advantage of strengths of each platform • Avoid bias towards any particular bioinformatics algorithms
  • 7. Integrate 14 Datasets from 5 platforms
  • 8. Integration of Data to Form Highly Confident Genotype Calls • Candidate variants: Find all possible variant sites • Concordant variants: Find concordant sites across multiple datasets • Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias • Arbitrate using evidence of bias: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree • Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats
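The arbitration step on slide 8 can be illustrated with a minimal Python sketch. This is not the consortium's implementation: the per-dataset `atypicality` score, the `typical_cutoff` threshold, and the `min_typical` count are hypothetical names and values introduced only for illustration. The sketch drops the most atypical dataset first until the remaining datasets agree, then flags the site as uncertain if too few of the agreeing datasets look typical.

```python
def arbitrate_site(calls, min_typical=2, typical_cutoff=1.0):
    """Arbitrate one site across datasets (illustrative only).

    calls: list of (dataset_name, genotype, atypicality) tuples, where
    atypicality is a hypothetical score (higher = more evidence of bias).
    Returns (genotype, status) with status "confident" or "uncertain".
    """
    # Sort so the most typical datasets come first.
    remaining = sorted(calls, key=lambda c: c[2])
    # Drop the most atypical dataset until all remaining genotypes agree.
    while remaining and len({g for _, g, _ in remaining}) > 1:
        remaining.pop()
    if not remaining:
        return None, "uncertain"
    genotype = remaining[0][1]
    # Even when datasets agree, require enough of them to look typical.
    n_typical = sum(1 for _, _, a in remaining if a <= typical_cutoff)
    status = "confident" if n_typical >= min_typical else "uncertain"
    return genotype, status
```

A real pipeline would also mark sites in known segmental duplications, SVs, or long repeats as uncertain, which this sketch leaves out.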
  • 9. Integration Methods to Establish Reference Variant Calls Candidate variants Concordant variants Find characteristics of bias Arbitrate using evidence of bias Confidence Level Zook et al., Nature Biotechnology, 2014.
  • 10. Pedigree calls • RTG and Illumina Platinum Genomes developed these • Sequence NA12878, husband, and 11 children to identify high confidence variants – Identify cross-over events – Determine if genotypes are consistent with inheritance • Integrated these with NIST high-confidence genotypes • Should we find larger families for future genomes? Source: Mike Eberle, Illumina
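The "consistent with inheritance" check on slide 10 can be sketched as a simple Mendelian test. This is a simplification for illustration, not the RTG or Illumina method: it assumes autosomal diploid genotypes written as 2-tuples of alleles and ignores cross-over phasing and de novo mutations. The function name is hypothetical.

```python
def is_mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be formed by taking
    one allele from each parent (autosomal, diploid assumption).

    Genotypes are 2-tuples of alleles, e.g. ("A", "G").
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)
```

With many children, as in the 11-child pedigree, a site could be checked by applying this test to every child and requiring all of them to pass.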
  • 11. Assigning confidence to genotypes High-confidence sites • Sequencing/bioinformatics methods agree or we understand the biases causing disagreement • At least some methods have no evidence of bias • Inherited as expected Less confident sites • In a region known to be difficult for current technologies • State reasons for lower confidence • If a site is near a low confidence site, make it low confidence
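The last rule on slide 11 (a site near a low-confidence site becomes low confidence) can be sketched as follows. The slides do not say how near counts as "near", so `window` is a hypothetical parameter, and the function name and single-chromosome input are assumptions made for illustration.

```python
import bisect

def propagate_low_confidence(sites, window):
    """Downgrade confident sites lying within `window` bp of a
    low-confidence site (single chromosome, illustrative only).

    sites: list of (position, is_confident) tuples.
    Returns a new list in the same order.
    """
    low = sorted(pos for pos, ok in sites if not ok)
    out = []
    for pos, ok in sites:
        if ok and low:
            # Find the nearest low-confidence positions on either side.
            i = bisect.bisect_left(low, pos)
            near_right = i < len(low) and low[i] - pos <= window
            near_left = i > 0 and pos - low[i - 1] <= window
            ok = not (near_right or near_left)
        out.append((pos, ok))
    return out
```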
  • 12. Performance Metrics Specification • Goal is to standardize performance metrics measured with respect to NIST RMs • Licensing • Definitions • Input formats • User interface • Accuracy outputs – FP, FN, Sens, Spec, etc. – Stratification • by variant type • by genome context • by functional regions • Characteristics of FP/FN • Working with Global Alliance for Genomics and Health • See draft at genomeinabottle.org
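The accuracy outputs on slide 12 can be illustrated with a minimal sketch that counts TP, FP, and FN per variant type. It does not follow the draft specification: the variant key `(chrom, pos, ref, alt)`, the length-based SNP/indel rule, and the function names are assumptions. Precision is reported instead of specificity because this simplification does not count true negatives.

```python
from collections import defaultdict

def variant_type(ref, alt):
    """Classify as 'SNP' or 'indel' by allele length (simplified)."""
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "indel"

def stratified_metrics(truth, calls):
    """Per-type TP/FP/FN counts with sensitivity and precision.

    truth, calls: sets of (chrom, pos, ref, alt) tuples, assumed to be
    already restricted to the confident regions.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for v in truth | calls:
        kind = variant_type(v[2], v[3])
        if v in truth and v in calls:
            counts[kind]["TP"] += 1
        elif v in calls:
            counts[kind]["FP"] += 1
        else:
            counts[kind]["FN"] += 1
    results = {}
    for kind, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]
        results[kind] = {
            **c,
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "precision": tp / (tp + fp) if tp + fp else None,
        }
    return results
```

Stratifying by genome context or functional region would follow the same pattern, with a different function choosing the stratum for each variant.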
  • 13. Working Group Charges RM Selection and Design • Derivative products based on NIST RMs • RMs for cancer and somatic variant calling? • Do we need another large family and/or more diversity? • What is the priority of transcriptome RMs? Characterization/Bioinformatics • What are the barriers to submitting data via SRA? • How should we use long read technologies? • How should we call structural variants? • Do we need targeted confirmation/validation of SNPs, indels, or SVs? • Integration of data for PGP trios
  • 14. Working Group Charges Performance Metrics • How should we coordinate with the Global Alliance for Genomics and Health Benchmarking group? • Feedback about Performance Metrics Specification – Stratification of performance by type of error, variant type, genome context, and functional region
