New methods PacBio evaluation of draft v4alpha

Published in: Health & Medicine

  1. For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Reviewing benchmark v3.3.2 PacBio CCS discrepancies in benchmark v4α. Billy Rowell, @nothingclever
  2. OVERALL BENCHMARK CHANGES
     Comparing GATK4HC calls on Sequel CCS 13.5 kb library

     Type   Metric  v3.3.2     v4α        Δ
     SNP    TP      3,034,399  3,404,145  369,746
     SNP    FN         13,438     29,593   16,155
     SNP    FP         16,248     17,351    1,103
     INDEL  TP        377,613    431,725   54,112
     INDEL  FN         87,151    104,232   17,081
     INDEL  FP        103,619    106,628    3,009

     Type   Metric     v3.3.2  v4α     Δ
     SNP    Recall     99.56%  99.14%  -0.42%
     SNP    Precision  99.47%  99.49%   0.02%
     SNP    F1 Score   99.51%  99.32%  -0.19%
     INDEL  Recall     81.25%  80.55%  -0.70%
     INDEL  Precision  78.98%  80.68%   1.70%
     INDEL  F1 Score   80.10%  80.61%   0.51%
  3. SUPPLEMENTARY TABLE 9. Manual curation of small variant discrepancies between the CCS callset and the Genome in a Bottle benchmark. In the “Discrepancy” column, “AM” means genotype difference, “FN” means false negative (in benchmark but not callset), and “FP” means false positive (in callset but not benchmark). The “Repeat family” column is from the RepeatMasker track of the UCSC Genome Browser. The “Correct Call” column is “GIAB” when the benchmark was deemed correct by expert curators, and “CCS” when the CCS callset was deemed correct. Rows where the correct call is from the CCS callset are colored blue. bioRxiv 519025, doi:10.1101/519025
  4. UPDATES TO DISCREPANCIES BETWEEN CCS AND V3.3.2

     CHR  POS        Discrepancy
     4    11468804   AM
     5    42740225   AM
     2    5143996    AM
     13   48291499   AM
     8    5930728    FN
     15   41943823   FN
     6    9737425    FN
     7    157385671  FN
     17   32064214   FN
     1    94256825   FP
     2    153864971  FP
     4    112819087  FP
     4    165026074  FP
     11   23338682   FP
     1    35034071   FP
     3    79181734   FP
     4    94532444   FP
     8    46873565   FP
     9    22350168   FP
     21   42288851   FP
  5. 13/20 SITES WERE REMOVED FROM HIGH CONFIDENCE REGION

     CHR  POS        Discrepancy  high conf
     4    11468804   AM           BORDER
     5    42740225   AM           FALSE
     2    5143996    AM           FALSE
     13   48291499   AM           FALSE
     8    5930728    FN           FALSE
     15   41943823   FN           FALSE
     6    9737425    FN           FALSE
     7    157385671  FN           TRUE
     17   32064214   FN           FALSE
     1    94256825   FP           TRUE
     2    153864971  FP           FALSE
     4    112819087  FP           FALSE
     4    165026074  FP           TRUE
     11   23338682   FP           TRUE
     1    35034071   FP           FALSE
     3    79181734   FP           TRUE
     4    94532444   FP           FALSE
     8    46873565   FP           FALSE
     9    22350168   FP           TRUE
     21   42288851   FP           FALSE
  6. 7/20 SITES WERE MODIFIED IN V4α TO MATCH GATK CALLS

     CHR  POS        Discrepancy  high conf
     4    11468804   AM           BORDER
     5    42740225   AM           FALSE
     2    5143996    AM           FALSE
     13   48291499   AM           FALSE
     8    5930728    FN           FALSE
     15   41943823   FN           FALSE
     6    9737425    FN           FALSE
     7    157385671  FN           TRUE
     17   32064214   FN           FALSE
     1    94256825   FP           TRUE
     2    153864971  FP           FALSE
     4    112819087  FP           FALSE
     4    165026074  FP           TRUE
     11   23338682   FP           TRUE
     1    35034071   FP           FALSE
     3    79181734   FP           TRUE
     4    94532444   FP           FALSE
     8    46873565   FP           FALSE
     9    22350168   FP           TRUE
     21   42288851   FP           FALSE
  7. SOME OF THESE REGIONS MAY BE ADDED BACK TO HIGH CONFIDENCE AFTER CLOSER INSPECTION

     CHR  POS        Discrepancy  high conf  Notes
     4    11468804   AM           BORDER     fixed
     5    42740225   AM           FALSE      complex variant; L1PA2
     2    5143996    AM           FALSE      mis-mapped short reads; HERVH-int
     13   48291499   AM           FALSE      highly variable region; L1PA3
     8    5930728    FN           FALSE      long reads identify long insertion
     15   41943823   FN           FALSE      simple repeat with some variability causes alignment issues
     6    9737425    FN           FALSE      segmental dup
     7    157385671  FN           TRUE       segmental dup, fixed
     17   32064214   FN           FALSE
     1    94256825   FP           TRUE       fixed; L1PA2
     2    153864971  FP           FALSE      supported by CCS, mate pairs, 10X, ONT; L1HS
     4    112819087  FP           FALSE      L1HS
     4    165026074  FP           TRUE       fixed; L1PA2
     11   23338682   FP           TRUE       fixed; L1P1
     1    35034071   FP           FALSE      L1HS
     3    79181734   FP           TRUE       fixed; L1HS
     4    94532444   FP           FALSE      supported by CCS, mate pairs, ONT; L1HS
     8    46873565   FP           FALSE      ALR/Alpha
     9    22350168   FP           TRUE       fixed; L1PA2
     21   42288851   FP           FALSE      supported by CCS, mate pairs, ONT; L1PA2
  8. CHR6:9737425
     - v3.3.2: HET
     - v4: removed from high confidence region
     - segmental duplication
     [Figure tracks: 15 kb CCS, 2x250, 6 kb mate pair, linked-read, ultralong, seg dup, v3.3.2 high conf, v4 high conf]
  9. CHR13:48291499
     - v3.3.2: HOMALT
     - v4: removed from high confidence region
     - Support for HET in CCS, linked-reads, mate pairs, and ultralong reads
     - L1PA3
     [Figure tracks: 15 kb CCS, 2x250, 6 kb mate pair, linked-read, ultralong, seg dup, v3.3.2 high conf, v4 high conf]
  10. CONCLUSION
      - v4α increases TP variant calls in the CCS-GATK4HC call set by > 400k
      - slight decreases in recall, but increases in precision
      - v4α resolves all discrepancies reported in Wenger & Peluso et al.
      - There may be some cases where the high confidence region could be further expanded, based on agreement between CCS, 10X, ONT, and 6 kb mate pairs.
  11. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. All other trademarks are the sole property of their respective owners. www.pacb.com
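
The headline numbers on slides 2, 5, 6, and 10 can be cross-checked from the tabulated values. The following is a minimal sketch, not the author's workflow: the names `COUNTS`, `HIGH_CONF`, `recall_pct`, `tp_gain`, and `summarize` are hypothetical, and the data are copied by hand from the slides. It assumes recall is TP / (TP + FN), which reproduces the recall figures on slide 2. Precision is left out on purpose: TP / (TP + FP) with these counts matches the SNP rows but not the INDEL rows, presumably because the comparison tool counts truth-side and query-side matches separately (an assumption; the slides do not say).

```python
from collections import Counter

# (v3.3.2, v4 alpha) counts copied by hand from slide 2.
COUNTS = {
    "SNP": {"TP": (3_034_399, 3_404_145), "FN": (13_438, 29_593)},
    "INDEL": {"TP": (377_613, 431_725), "FN": (87_151, 104_232)},
}

# "high conf" column for the 20 curated sites, in slide 5 row order.
HIGH_CONF = [
    "BORDER", "FALSE", "FALSE", "FALSE", "FALSE",
    "FALSE", "FALSE", "TRUE", "FALSE", "TRUE",
    "FALSE", "FALSE", "TRUE", "TRUE", "FALSE",
    "TRUE", "FALSE", "FALSE", "TRUE", "FALSE",
]


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage, assuming recall = TP / (TP + FN)."""
    return round(100 * tp / (tp + fn), 2)


def tp_gain(counts: dict) -> int:
    """Total increase in true positives (SNP + INDEL) between versions."""
    return sum(new - old for old, new in (m["TP"] for m in counts.values()))


def summarize():
    """Return per-type recall pairs, total TP gain, and site status tallies."""
    recalls = {
        vtype: tuple(recall_pct(m["TP"][i], m["FN"][i]) for i in range(2))
        for vtype, m in COUNTS.items()
    }
    return recalls, tp_gain(COUNTS), Counter(HIGH_CONF)


if __name__ == "__main__":
    recalls, gain, status = summarize()
    print("recall % (v3.3.2, v4 alpha):", recalls)
    print("total TP gain:", gain)
    print("high conf status:", dict(status))
```

Under these assumptions the total TP gain is 369,746 + 54,112 = 423,858, consistent with the "> 400k" claim on slide 10. The tally gives 13 FALSE sites (removed from the high confidence region) and 7 TRUE or BORDER sites, which are the same rows marked "fixed" in the notes on slide 7.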
