Tools for Using NIST Reference Materials

Published in: Health & Medicine

  1. 1. Genome in a Bottle: Tools for Using NIST Reference Materials Next Generation Diagnostics Summit Short Course August 2014 Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
  2. 2. Learning Objectives • How can Genome in a Bottle Reference Materials help with validating NGS assays? • Comparing your variant calls to high-confidence calls • Tools available for understanding potential false positives and false negatives • Examples of how labs are using our high-confidence calls
  3. 3. NIST-hosted Genome in a Bottle Consortium • Infrastructure for performance assessment of NGS – support science-based regulatory oversight • No widely accepted set of metrics to characterize the fidelity of variant calls from NGS… • Genome in a Bottle Consortium is developing standards to address this… – human genomes as Reference Materials (RMs) • characterize and disseminate by NIST – tools and methods to use these RMs • common sequencing instruments • bioinformatics workflows. http://genomeinabottle.org
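  4. 4. Whole genome sequencing technologies disagree about hundreds of thousands of variants
      [Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3; each region is labeled with # SNPs (% of SNPs detected by any platform)]
      Region counts, in the order extracted from the slide: 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)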
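
A breakdown like the Venn diagram on this slide can be computed from per-platform call sets. The following is a minimal sketch, not the consortium's method: the function name `venn_partition`, the platform labels, and the toy variants are all hypothetical, and each variant is assumed to be a hashable tuple such as (chromosome, position, ref, alt).

```python
from collections import Counter


def venn_partition(calls_by_platform):
    """Count sites by the exact set of platforms that detected them.

    calls_by_platform maps a platform name to a set of hashable variant keys.
    Returns {frozenset_of_platforms: (count, percent_of_all_detected_sites)}.
    """
    all_sites = set().union(*calls_by_platform.values())
    counts = Counter()
    for site in all_sites:
        detected_by = frozenset(
            name for name, calls in calls_by_platform.items() if site in calls
        )
        counts[detected_by] += 1
    total = len(all_sites)
    return {region: (n, 100.0 * n / total) for region, n in counts.items()}


# Toy data (hypothetical): four distinct sites across three platforms.
toy_calls = {
    "P1": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "P2": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "P3": {("chr1", 100, "A", "G"), ("chr3", 10, "T", "C")},
}
partition = venn_partition(toy_calls)
for region, (n, pct) in sorted(partition.items(), key=lambda kv: -kv[1][0]):
    print(sorted(region), n, f"{pct:.2f}%")
```

Percentages are taken relative to the union of all sites, matching the slide's "% of SNPs detected by any platform" convention.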
  5. 5. Bioinformatics programs also disagree O’Rawe et al. Genome Medicine 2013, 5:28
  6. 6. Measurement Process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis • gDNA reference materials will be developed to characterize performance of a part of process – materials will be certified for their variants against a reference sequence, with confidence estimates
  7. 7. NIST Human Genome RMs in the pipeline • All are 10 µg samples of DNA isolated from multistage large-growth cell cultures – all are intended to act as stable, homogeneous references suitable for use in regulated applications – all genomes also available from Coriell repository • Pilot Genome – ~8400 tubes • Ashkenazim Jewish Trio – ~10000 son; ~2500 each parent • Asian Trio – ~10000 son; parents not yet planned as NIST RM
  8. 8. Goals for Data to Accompany RM • ~0 false positive AND false negative calls in confident regions • Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection) • Avoid bias towards any particular platform – take advantage of strengths of each platform • Avoid bias towards any particular bioinformatics algorithms
  9. 9. Integration Methods to Establish Reference Variant Calls: Candidate variants → Concordant variants → Find characteristics of bias → Arbitrate using evidence of bias → Confidence Level. Zook et al., Nature Biotechnology, 2014.
  10. 10. Assigning confidence to genotypes. High-confidence sites: • Sequencing/bioinformatics methods agree or we understand the biases causing disagreement • At least some methods have no evidence of bias • Inherited as expected. Less confident sites: • In a region known to be difficult for current technologies • State reasons for lower confidence • If a site is near a low confidence site, make it low confidence
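
The last rule on this slide (a site near a low-confidence site becomes low confidence) can be illustrated with a short sketch. The slide does not say how near is "near", so the 50 bp `window` default below is purely an assumption, as are the function name and the restriction to positions on a single chromosome.

```python
from bisect import bisect_left


def demote_near_low_confidence(high_conf, low_conf, window=50):
    """Split candidate high-confidence positions into (kept, demoted).

    A position is demoted if any low-confidence position lies within
    `window` bp of it. `window` is an assumed value, not one from the slides.
    All positions are assumed to be on the same chromosome.
    """
    low_sorted = sorted(low_conf)
    kept, demoted = [], []
    for pos in high_conf:
        # First low-confidence position at or after the left edge of the window.
        i = bisect_left(low_sorted, pos - window)
        near = i < len(low_sorted) and low_sorted[i] <= pos + window
        (demoted if near else kept).append(pos)
    return kept, demoted


kept, demoted = demote_near_low_confidence([100, 500, 1000], [120, 990])
print("kept:", kept)
print("demoted:", demoted)
```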
  11. 11. Reasons we exclude regions from high-confidence set
  12. 12. Challenges with assessing performance • Not all variant types are equal • Not all regions of the genome are equal – Homopolymers, STRs, duplications – Can be similar or different in different genomes • Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance • Genotypes fall in 3+ categories (not positive/negative) – standard diagnostic accuracy measures not well posed
  13. 13. Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878 • NIST has released several versions of high-confidence genotypes for its pilot RM • These data are presently being used for benchmarking – prior to release of RMs – SNPs & indels • ~77% of the genome
  14. 14. NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer November 20, 2013
  15. 15. Integrating NIST Call Sets into a Validation Workflow
      Validation Report:
      • False Positive Ratio: FPR = FP/(FP + TN)
      • False Discovery Rate: FDR = FP/(FP + TP)
      • Sensitivity: Sens. = TP/(TP + FN)
      • Specificity: Spec. = TN/(FP + TN)
      • Balanced Accuracy: (Sens. + Spec.)/2
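
The validation-report formulas on this slide translate directly into code. This is a minimal sketch under the assumption that the four counts have already been obtained by comparing a call set against the high-confidence calls; the function name, the dictionary keys, and the example counts are hypothetical.

```python
def validation_metrics(tp, fp, tn, fn):
    """Compute the slide's validation-report metrics from confusion counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, fp + tn)
    return {
        "FPR": ratio(fp, fp + tn),          # False Positive Ratio
        "FDR": ratio(fp, fp + tp),          # False Discovery Rate
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = validation_metrics(tp=90, fp=10, tn=990, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because TN in genome-scale comparisons is typically very large relative to FP, FPR and specificity tend to sit near 0 and 1; FDR and sensitivity are usually the more informative numbers.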
  16. 16. GCAT – Interactive Performance Metrics • NIST is working with GCAT to use our highly confident variant calls • Assess performance of many combinations of mappers and variant callers • Currently assesses only exome sequencing • www.bioplanet.com/gcat
  17. 17. GCAT Tests
  18. 18. GCAT Variant Calling Tests Pre-run Tests Upload your own variant calls
  19. 19. GCAT – Upload your own exome calls
  20. 20. Freebayes SNP calls changed very little in 2013 http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/snp/group-quality
  21. 21. Freebayes indel calls improved in 2013 http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/indel/group-quality
  22. 22. Analytical Validation of Clinical Whole-Exome Sequencing (CWES) Test
      Background
      • Clinical laboratory: Division of Genomic Diagnostics, certified by regulatory agencies (CAP)
      • CWES test requires stringent validation per CAP criteria to establish performance metrics of the test
      Utilizing NIST data in validation of CWES test
      • Sequence and call variants of NA12878 at CHOP
      • CHOP ROI: Agilent SureSelect V5+ (SSV5+) baits file
      • Compare CHOP dataset to NIST data set for concordance
      NIST data set details
      • High-quality reference data set on NA12878 (Dec. 2013)
      • NIST’s highly confident regions of interest (ROI)
      • Variants called in 219,222 regions on hg19 assembly
      NIST: National Institute of Standards and Technology
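
The comparison step described on this slide (restrict both call sets to the confident regions, then check concordance) can be sketched as follows. This is an illustration only, not CHOP's or NIST's pipeline: the function names and toy data are hypothetical, regions are assumed to be BED-style (0-based start, end-exclusive) while variant positions are VCF-style (1-based), and variants are matched by exact (chrom, pos, ref, alt) equality. The sketch ignores genotype and does not reconcile alternative representations of the same variant.

```python
def in_regions(variant, regions):
    """True if a 1-based variant position falls in any BED-style region."""
    chrom, pos = variant[0], variant[1]
    # BED intervals are 0-based and half-open, so 1-based pos is inside
    # when start < pos <= end.
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def compare_calls(test_calls, reference_calls, confident_regions):
    """Classify calls inside the confident regions as TP, FP, or FN."""
    test = {v for v in test_calls if in_regions(v, confident_regions)}
    ref = {v for v in reference_calls if in_regions(v, confident_regions)}
    return {"TP": len(test & ref), "FP": len(test - ref), "FN": len(ref - test)}


# Toy data (hypothetical): one confident region, three calls per set.
regions = [("chr1", 0, 1000)]
reference = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 5000, "G", "A")}
lab_calls = {("chr1", 100, "A", "G"), ("chr1", 300, "T", "C"), ("chr1", 6000, "A", "C")}
summary = compare_calls(lab_calls, reference, regions)
print(summary)
```

The linear scan over regions keeps the sketch short; with hundreds of thousands of regions an interval index or sorted-interval search would be the more practical choice.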
  23. 23. SENSITIVITY / SPECIFICITY: RefGene +/- 15bp (SSV5+), CHOP vs. NIST
      Counts                              SNVs      INDELs
      TP                                  18480     396
      FP                                  26        3
      FN                                  63        30
      Metric                              SNVs      INDELs
      Sensitivity (TP/(TP+FN))            99.66%    92.96%
      Specificity (TN/(TN+FP))            ~100%     ~100%
      FDR (FP/(FP+TN))                    0.02%     0.08%
      Accuracy ((TP+TN)/(TP+TN+FP+FN))    ~100%     ~100%
      TN = NIST highly confident regions – CHOP ROIs
      FP: False Positive; TP: True Positive; FN: False Negative; TN: True Negative
      The FDR formula is given as printed on this slide; slide 15 defines FDR as FP/(FP + TP).
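
As a quick arithmetic check, the sensitivity figures on this slide can be reproduced from the TP and FN counts it reports. The sketch below uses only those counts; the helper name `sensitivity_percent` is hypothetical.

```python
# TP and FN counts as reported on the slide.
counts = {
    "SNVs": {"TP": 18480, "FN": 63},
    "INDELs": {"TP": 396, "FN": 30},
}


def sensitivity_percent(tp, fn):
    """Sensitivity = TP / (TP + FN), expressed as a percentage."""
    return 100.0 * tp / (tp + fn)


for kind, c in counts.items():
    print(f"{kind}: sensitivity = {sensitivity_percent(c['TP'], c['FN']):.2f}%")
# SNVs: sensitivity = 99.66%
# INDELs: sensitivity = 92.96%
```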
  24. 24. Further analysis on presumptive 93 FNs (63 SNVs, 30 INDELs) and 29 FPs (26 SNVs, 3 INDELs)
  25. 25. Using the GeT-RM Browser • http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/ • Allows visualization of questionable calls
  26. 26. GeT-RM Load alignments for visualization
  27. 27. Chr6:151669820 and Chr6:151669828: difficult site in homopolymer in intron of gene AKAP12
  28. 28. Chr1:1666303: SNP in gene SLC35E2, which is also in a pseudogene and a segmental duplication
  29. 29. Segmental Duplication; Pseudogene; Structural Variant
  30. 30. Feedback from MoCha lab in NCI • We built a targeted amplicon NGS assay for detecting mutations in clinical tumor specimens • To assess the assay’s specificity, we compared 84 runs of CEPH NA12878 data from our assay with NIST’s consensus variant list (VCF v2.15) • We observed a high overall concordance, with a few FP variants in homopolymeric regions unique to our platform • We concluded that NIST GIAB is a useful reference standard to evaluate assay specificity
  31. 31. Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine “We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”
  32. 32. Benchmarking somatic variant calling at Qiagen
  33. 33. HSPH – Brad Chapman: Comparing variant callers http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/
  34. 34. NextSeq: New Chemistry – Does it work?
      Whole Genome Metrics                                 NextSeq500          HiSeq2500
      % Genome Covered (>= 10X in Q20 bases)               96%                 96%
      Mean Coverage in Q20 Bases                           28.3X               31.8X
      SNPs Called (% dbSNP 129)                            3,643,998 (89%)     3,664,014 (88%)
      InDels Called (% dbSNP 129)                          646,907 (65.7%)     686,547 (64.5%)
      Genome in a Bottle SNP Sensitivity | Precision       99.07% | 99.04%     99.25% | 99.90%
      Genome in a Bottle Indel Sensitivity | Precision     86.90% | 98.85%     93.29% | 97.54%
      [Figure: NextSeq 500: Genomic Coverage in High Quality Bases. x-axis: Coverage in Bases with MQ>=20 and Q>=20; y-axis: Proportion of Genome at Coverage. Mean: 28.33X; Fraction at 2/3 Mean: 0.9]
      [Figure: HiSeq 2000: Genomic Coverage in High Quality Bases. x-axis: Coverage in Bases with MQ>=20 and Q>=20; y-axis: Proportion of Genome at Coverage. Mean: 31.86X; Fraction at 2/3 Mean: 0.91]
      [Figure: Normalized Coverage vs. GC Content, by platform (HiSeq 2000, NextSeq 500)]
  35. 35. Ion Benchmarking I
  36. 36. Ion Benchmarking II
  37. 37. Command-line tools for variant benchmarking • USeq VCFComparator – http://sourceforge.net/projects/useq/ • RTG vcfeval – ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/ • bcbio.variation – http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/ • SMaSH – http://smash.cs.berkeley.edu/
  38. 38. How Can I Get Involved? • Use our integrated SNP/indel genotypes for NA12878 and give us feedback – Cells and DNA currently available from Coriell – NIST RM available late 2014 • Sequence and analyze the new Genome in a Bottle samples • Help with Structural Variant calls • Help with analyzing data from long-read technologies • Attend our biannual workshops (January in CA, August in MD) • Help develop methods to measure performance using our well-characterized genomes http://genomeinabottle.org Email: Justin Zook – jzook@nist.gov; Marc Salit – salit@nist.gov Slides on SlideShare at: http://www.slideshare.net/GenomeInABottle
