140127 GIAB update and NIST high-confidence calls

  • Meeting Notes (5/28/13 17:05): ask Heng for decoy

    1. 1. Genome in a Bottle Consortium Progress Update. January 27, 2014. Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
    2. 2. Whole Genome RMs vs. Current Validation Methods • Sanger confirmation – Limited by number of sites (and sometimes it’s wrong) • High depth NGS confirmation – May have same systematic errors • Genotyping microarrays – Limited to known (easier) variants – Problems with neighboring “complex” variants, duplications • Mendelian inheritance – Can’t account for some systematic errors • Simulated data – Generally not very representative of errors in real data • Ti/Tv – Varies by region of genome, and only gives overall statistic
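As a small illustration of the Ti/Tv statistic mentioned on slide 2, the sketch below counts transitions (A<->G, C<->T) and transversions in a list of single-base substitutions. The function names and the example SNP list are hypothetical and are not taken from the talk; the point is only that the result is one overall number, which is why the slide calls it a limited validation method.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Return Ti/Tv for an iterable of (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # identical bases are not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


# Hypothetical SNP list: three transitions and two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = ti_tv_ratio(example_snps)
```

Computing the ratio separately per region (for example, exome versus whole genome) would be one way to account for the regional variation the slide points out.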
    3. 3. Goals for Data to Accompany RM • ~0 false positive AND false negative calls in confident regions • Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection) • Avoid bias towards any particular platform – take advantage of strengths of each platform • Avoid bias towards any particular bioinformatics algorithms
    4. 4. Integrate 12 14 Datasets from 5 platforms
    5. 5. Integration of Data to Form Highly Confident Genotype Calls • Candidate variants – Find all possible variant sites • Concordant variants – Find concordant sites across multiple datasets • Find characteristics of bias – Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias • Arbitrate using evidence of bias – For each site, remove datasets with decreasingly atypical characteristics until all datasets agree • Confidence Level – Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats
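The arbitration and confidence steps on slide 5 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the NIST integration pipeline: the `atypicality` score, the `min_supporting` threshold, and the `in_difficult_region` flag are hypothetical stand-ins for the bias annotations and region lists the slide describes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DatasetCall:
    dataset: str        # dataset name (hypothetical labels in the example)
    genotype: str       # e.g. "0/1"
    atypicality: float  # hypothetical score; higher means more evidence of bias


def arbitrate(
    calls: List[DatasetCall],
    min_supporting: int = 2,
    in_difficult_region: bool = False,
) -> Tuple[Optional[str], bool]:
    """Return (genotype, confident) for one site.

    Datasets are dropped one at a time, most atypical first, until the
    remaining datasets agree on a genotype.
    """
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while remaining and len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # remove the most atypical dataset still present
    if not remaining:
        return None, False
    genotype = remaining[0].genotype
    # Agreement alone is not enough: require several typical datasets and
    # exclude sites in known difficult regions (segmental duplications, SVs,
    # long repeats), as the slide's "Confidence Level" step describes.
    confident = len(remaining) >= min_supporting and not in_difficult_region
    return genotype, confident


site_calls = [
    DatasetCall("platform_A", "0/1", 0.1),
    DatasetCall("platform_B", "0/1", 0.2),
    DatasetCall("platform_C", "1/1", 0.9),
    DatasetCall("platform_D", "0/1", 0.3),
]
result = arbitrate(site_calls)
```

In this toy site, `platform_C` is the most atypical dataset and the only one that disagrees, so it is removed and the remaining three datasets support a confident `0/1` call.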
    6. 6. Verification of “Highly Confident” Genotype accuracy • Sanger sequencing – 100% accuracy but only 100s of sites • X Prize Fosmid sequencing – Sometimes call only part of a complex variant • Microarrays – Differences appear to be FP or FN in arrays • Broad 250bp HaplotypeCaller – Very highly concordant • Platinum genomes pedigree SNPs – Some systematic errors are inherited; different representations of complex variants • Real Time Genomics SNPs and indels – Some interesting sites called by RTG complex caller
    7. 7. GCAT – Interactive Performance Metrics • NIST is working with GCAT to use our highly confident variant calls • Assess performance of many combinations of mappers and variant callers • www.bioplanet.com/gcat • [Figure: Improvement of FreeBayes over 1 year with indels]
    8. 8. Why do calls differ from our highly confident genotypes? Apparent False Positives: • Platform-specific systematic sequencing errors for SNPs • Analysis-specific • Difficult to map regions • Indels in long homopolymers Apparent False Negatives: • Different complex variant representation • Near indels • Inside repeats
    9. 9. Complex variants have multiple correct unphased representations [Figure: alignments of one region against Ref by BWA, ssaha2, CGTools, and Novoalign; labels include “T insertion” and “TCTCT insertion”] • Traditional comparison: FP SNPs 0.38% (610), FP MNPs 100% (915), FP indels 6.5% (733) • Comparison with realignment: FP SNPs 0.15% (249), FP MNPs 4.2% (38), FP indels 2.6% (298) • ~225,000 highly confident variants are within 10bp of another variant • FPs and FNs are significantly enriched for complex variants • RTG vcfeval can fix this issue!
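To illustrate why a position-by-position comparison miscounts complex variants (slide 9), the sketch below applies two different variant representations to the same reference and compares the resulting sequences. It is only a toy, single-haplotype illustration of the idea; it does not reproduce how RTG vcfeval or the realignment comparison on the slide works, and the reference strings and variants are made up.

```python
def apply_variants(reference: str, variants) -> str:
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    Positions are 0-based. Raises ValueError if a REF allele does not match
    the reference or if two variants overlap.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # alternate allele replaces REF
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def same_haplotype(reference: str, calls_a, calls_b) -> bool:
    """True if both call sets produce the same sequence when applied."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


# One MNP versus two adjacent SNPs: different records, same sequence.
mnp_reference = "GACGTT"
as_mnp = [(2, "CG", "TA")]
as_snps = [(2, "C", "T"), (3, "G", "A")]

# The same 2-bp insertion placed at either end of a short repeat.
repeat_reference = "GCACACT"
insertion_left = [(1, "C", "CAC")]
insertion_right = [(5, "C", "CAC")]
```

A record-by-record comparison would report the MNP as a false positive and the two SNPs as false negatives, even though `same_haplotype` shows the calls are equivalent. Real comparisons also have to handle heterozygous genotypes and phasing, which this sketch ignores.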
    10. 10. Reasons we exclude regions from high-confidence set
    11. 11. Reasons we exclude regions from high-confidence set
    12. 12. Structural variant analytical approach • Depth of coverage (DOC) – Control-FREEC, CnD • Paired-end mapping (PEM) – Breakdancer • Split read (SR) – Pindel • Assembly based (AS) – Velvet, ABySS • Combination – Genome-STRiP, SVMerge • Output: List of structural variant calls
    13. 13. Validation parameters for each SV • Coverage (mean and standard deviation) • Paired-end distance/insert size (mean and standard deviation) • # of discordant paired-ends • Soft clipping of the reads (mean and standard deviation) • Mapping quality (mean and standard deviation) • # of heterozygous and homozygous SNP genotype calls
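The per-SV validation parameters on slide 13 are mostly means and standard deviations of read-level values. A minimal sketch of collecting them is shown below; the function names, the dictionary keys, and the input numbers are hypothetical, and extracting the values from alignments (for example, from a BAM file) is out of scope here.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    return statistics.mean(values), statistics.stdev(values)


def sv_validation_summary(coverage, insert_sizes, mapping_quals, n_discordant_pairs):
    """Collect summary statistics for one candidate structural variant."""
    return {
        "coverage": summarize(coverage),
        "insert_size": summarize(insert_sizes),
        "mapping_quality": summarize(mapping_quals),
        "n_discordant_pairs": n_discordant_pairs,
    }


# Hypothetical values for a single candidate SV region.
summary = sv_validation_summary(
    coverage=[30, 32, 28, 30],
    insert_sizes=[300, 310, 290, 900],
    mapping_quals=[60, 60, 20, 60],
    n_discordant_pairs=1,
)
```

Soft-clipping lengths and SNP genotype counts from the slide could be added to the same dictionary in the same way.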
    14. 14. Challenges with assessing performance • All variant types are not equal • All regions of the genome are not equal – Homopolymers, STRs, duplications – Can be similar or different in different genomes • Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance • Genotypes fall in 3+ categories (not positive/negative) – standard diagnostic accuracy measures not well posed
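The last point on slide 14 can be made concrete with a genotype confusion table: with homozygous reference, heterozygous, homozygous variant, and no-call categories, a single sensitivity or specificity number no longer describes the result. The sketch below is one hypothetical way to tabulate this, assuming diploid VCF-style genotype strings; the category names and example calls are not from the talk.

```python
from collections import Counter


def classify(genotype: str) -> str:
    """Map a diploid VCF-style genotype string to a coarse category."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "no_call"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_var"
    return "het"


def confusion_table(truth_genotypes, test_genotypes) -> Counter:
    """Count (truth category, test category) pairs over matched sites."""
    table = Counter()
    for truth, test in zip(truth_genotypes, test_genotypes):
        table[(classify(truth), classify(test))] += 1
    return table


def genotype_concordance(table: Counter) -> float:
    """Fraction of sites where the test category equals the truth category."""
    total = sum(table.values())
    agree = sum(n for (truth, test), n in table.items() if truth == test)
    return agree / total


# Hypothetical genotypes at four matched sites.
truth = ["0/1", "1/1", "0/0", "0/1"]
test = ["0/1", "0/1", "0/0", "./."]
table = confusion_table(truth, test)
concordance = genotype_concordance(table)
```

Here the `hom_var` site called as `het` is neither a clean false positive nor a clean false negative, which is the difficulty the slide describes.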
    15. 15. Pedigree calls • RTG and Illumina Platinum Genomes working on this • Sequence NA12878, husband, and 11 children to identify high confidence variants – Identify cross-over events – Determine if genotypes are consistent with inheritance • Should we integrate these with the NIST high-confidence genotypes? • Should we find larger families for future genomes? • See afternoon presentations! Source: Mike Eberle, Illumina
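The "consistent with inheritance" check on slide 15 can be sketched for a single trio as below. This is a simplified illustration, not the RTG or Platinum Genomes method: it assumes diploid autosomal sites with no missing alleles, ignores de novo mutations, and does not use cross-over information from the 11 children.

```python
from itertools import product


def alleles(genotype: str):
    """Split a VCF-style genotype such as "0/1" or "0|1" into its alleles."""
    return genotype.replace("|", "/").split("/")


def is_mendelian_consistent(child: str, mother: str, father: str) -> bool:
    """True if the child could inherit one allele from each parent."""
    child_alleles = sorted(alleles(child))
    return any(
        sorted(pair) == child_alleles
        for pair in product(alleles(mother), alleles(father))
    )


def count_violations(trio_sites) -> int:
    """Count sites whose (child, mother, father) genotypes violate inheritance."""
    return sum(
        not is_mendelian_consistent(child, mother, father)
        for child, mother, father in trio_sites
    )


# Hypothetical (child, mother, father) genotypes at four sites.
sites = [
    ("0/1", "0/0", "1/1"),  # consistent
    ("1/1", "0/0", "0/1"),  # violation: mother cannot contribute allele 1
    ("0/0", "0/1", "0/1"),  # consistent
    ("0/1", "0/0", "0/0"),  # violation: neither parent carries allele 1
]
violations = count_violations(sites)
```

As slide 2 notes, a check like this cannot catch systematic errors that are themselves inherited consistently across the family.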
    16. 16. Pedigree Calls in Uncertain Regions
    17. 17. GIAB Characterization of pilot RM • NIST – 300x 150x150bp HiSeq (from 6 vials) • NIST – 100x 75bp ECC SOLiD 5500W • Illumina – 50x 100x100bp HiSeq • Complete Genomics – Normal and LFR (non-RM) • Garvan Institute – Illumina exome • NCI – Ion Proton whole genome • INOVA – Infinium SNP/CNV array
    18. 18. Homogeneity and Stability Homogeneity: • Multiplex First and last vial – 3 libraries x 33x HiSeq each • Multiplex 4 Random vials – 2 libraries x 12.5x HiSeq each • Compare variability due to: – vial – library – day – flow cell – lane – sampling • Run PFGE on each vial for size Stability: • Run PFGE to detect DNA degradation • Freeze-thaw 2 and 5 times • Vortex for 10s • 4°C for 2 and 8 weeks • 37°C for 2 and 8 weeks
    19. 19. FTP site and Amazon S3 • NCBI is hosting fastq, bam, and vcf files on the giab ftp site • These data are mirrored to Amazon S3, so we encourage you to take advantage of this!
    20. 20. Pilot Reference Material • High-confidence calls are available on the ftp site and are already being used • NIST plans to release this as a NIST Reference Material in the next couple months
    21. 21. Future Directions • Characterize more “difficult” regions/variants • Structural variants • Compare to pedigree calls • Examine potentially clinically relevant regions/variants in RMs • Use long-read technologies – Moleculo – CG LFR – PacBio – BioNano Genomics – future technologies?? • Use glia/platypus to realign reads to candidate variants • Analyze interlaboratory study data • Characterize PGP genomes – Ashkenazim trio – son in Asian trio – DNA at NIST in Jan-Feb 2014 – Volunteers to sequence? • Select future genomes • Tumor-normal?
    22. 22. Topic #1: Moving beyond the easy regions/variants Presentations: • Emerging Technologies – PacBio – Complete Genomics LFR – Moleculo – BioNano Genomics • Structural Variants – Bina Technologies Topics: • Structural Variants • Phasing • Validation • Where should we set the threshold(s) for confidence?
    23. 23. Topic #2: Cancer and Future Genomes Cancer: • Spike-ins • Mixtures of normal cell lines • Tumor-normal cell line pair • Transcriptome controls Priorities for Future Genomes: • Diverse ancestry groups • Larger families • Recruitment with consent for commercialization • How many genomes? • Should the parents be NIST Reference Materials, or only the child?
    24. 24. Working Group Questions RM Selection & Design: • Spike-in controls • FFPE • Commercial RMs • ABRF interlaboratory study • Should we prioritize one or two genomes? RM Characterization: • Production mode for new trios – Pilot was characterized by Illumina, SOLiD, Ion Proton, and Complete Genomics – What resources should we invest in measurements for each new family?
    25. 25. Working Group Questions Bioinformatics: • Storing data/pipelines – Suggestions for ftp structure – Data submission/accessioning process – Data model for genomic data – Archiving pipelines and reproducible research • GRCh38 • How to use pedigree calls for pilot genome? • Clones for targeted regions (hard regions if not whole genome) • In which difficult regions should we focus our characterization? Performance Metrics: • Target audience • Requirements for user interface – Establishing truth set(s) – Inputs/Outputs – Visualization • Integration with GeT-RM