140127 rtg vcfeval vcf comparison tool
Published in: Technology
Transcript

  • 1. Comparing Variant Calls. Genome-in-a-Bottle Workshop. Francisco M. De La Vega, D.Sc., Visiting Scholar, Department of Genetics, Stanford University School of Medicine. In collaboration with Real Time Genomics, Inc.
  • 2. rtgTools v1.0: a toolkit to compare and analyze VCFs
      • vcfeval – comparison of VCFs for ROC curves
      • rocplot – draw ROC curves from vcfeval output
      • mendelian – counts of Mendelian inheritance errors in pedigrees
      • vcfstats – basic statistics of VCF files
      • vcffilter – filtering of VCFs by scores, etc.
      • vcfannotate – annotation of VCF files
      • vcfmerge – merge VCF files
    Java compiled code freely available at GiaB repository: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
  • 3. Issues in representation of complex calls
      Indel in homopolymer:
        Reference  CAAAAAAG
        Baseline   C..AAAAG
        Called     CAAAA..G
        After replay:
        Baseline   CAAAAG
        Called     CAAAAG
      MNPs:
        Reference  CAACGTAAG
        Baseline   CAATGTCAG
        Called     CAATGTCAG
  • 4. Issues in representation of complex calls
      Dinucleotide repeat:
        Reference  ACGTACCAGATATCACAACATATATATA
        Baseline   ACGGACCAG..ATCACAACATATATATATA
        Called     ACGGACCAGAT..CACAACATATATATATA
        After replay:
        Baseline   ACGGACCAGATCACAACATATATATATA
        Called     ACGGACCAGATCACAACATATATATATA
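The "after replay" comparison on slides 3 and 4 can be illustrated with a minimal Python sketch. It is not the vcfeval implementation: the `replay` helper, the 0-based `(pos, ref, alt)` tuples, and the split of the MNP into one block substitution versus two SNPs are all assumptions made for illustration. Only the sequences come from slide 3.

```python
def replay(reference, variants):
    """Apply non-overlapping (pos, ref, alt) edits to a reference string.

    Positions are 0-based. Edits are applied right to left so that the
    offsets of edits further left stay valid.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF mismatch at {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


# Slide 3, indel in homopolymer: the same 2-base deletion written at
# either end of the run of A's.
homopolymer = "CAAAAAAG"
baseline_del = [(1, "AA", "")]   # C..AAAAG
called_del = [(5, "AA", "")]     # CAAAA..G

# Slide 3, MNP: one block substitution versus two SNPs
# (this decomposition is assumed, not given on the slide).
mnp_ref = "CAACGTAAG"
baseline_mnp = [(3, "CGTA", "TGTC")]
called_snps = [(3, "C", "T"), (6, "A", "C")]

print(replay(homopolymer, baseline_del), replay(homopolymer, called_del))
print(replay(mnp_ref, baseline_mnp), replay(mnp_ref, called_snps))
```

Comparing the replayed strings rather than the VCF records is what lets differently written but equivalent calls count as the same variant.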
  • 5. Comparison of variant call set with baseline set
      Basic rules
      • Match the baseline and called sequences so as to maximize true positives and minimize false positives and false negatives.
      • True positives + false negatives = total calls in the baseline.
      • Heterozygous calls match only when both the zygosity and the alleles agree.
      [Diagram labels: Best path, Link mutations, ROC]
      Path creation
      • A path is a selection of a subset of calls.
      • Best path: the path that maximizes true positives and minimizes errors.
      • In theory there is an exponential number of paths; in practice this can be solved by dynamic programming.
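To make the idea of a "path" concrete, here is a small brute-force sketch for the homozygous case. It enumerates every subset of baseline and called variants, which is the exponential search the slide mentions, not the dynamic programming that vcfeval uses. The function names, the tuple format, and the example sequence are hypothetical.

```python
from itertools import combinations


def replay(reference, variants):
    """Apply (pos, ref, alt) edits; return None if they overlap or mismatch."""
    seq, limit = reference, len(reference)
    for pos, ref, alt in sorted(variants, reverse=True):
        end = pos + len(ref)
        if end > limit or seq[pos:end] != ref:
            return None
        seq, limit = seq[:pos] + alt + seq[end:], pos
    return seq


def subsets(items):
    """Yield every subset of items, largest first."""
    for size in range(len(items), -1, -1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Exhaustively pick the subsets whose replayed haplotypes agree.

    The score is the number of included variants; everything excluded is
    a false negative (baseline) or a false positive (called).
    """
    best = None
    for base_kept in subsets(baseline):
        base_hap = replay(reference, base_kept)
        if base_hap is None:
            continue
        for call_kept in subsets(called):
            if replay(reference, call_kept) != base_hap:
                continue
            score = len(base_kept) + len(call_kept)
            if best is None or score > best[0]:
                best = (score, base_kept, call_kept)
    _, base_kept, call_kept = best
    return {
        "tp_baseline": list(base_kept),
        "tp_called": list(call_kept),
        "fn": [v for v in baseline if v not in base_kept],
        "fp": [v for v in called if v not in call_kept],
    }


# Hypothetical example: equivalent deletions plus one unmatched SNP each.
reference = "CAAAAAAGTCC"
baseline = [(1, "AA", ""), (8, "T", "G")]
called = [(5, "AA", ""), (9, "C", "A")]
result = best_path(reference, baseline, called)
print(result)
```

The two empty subsets always replay to the reference itself, so a best path always exists. Ties between equally scored paths are broken by enumeration order here, which a real implementation would need to handle deliberately.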
  • 6. Path creation - simple homozygous case [Figure: reference with baseline and called variants, labeled a to h]
  • 7. Path creation - simple homozygous case [Figure: best path over variants a to h; one baseline variant is excluded as a false negative and one called variant is excluded as a false positive]
  • 8. Path creation - simple heterozygous case (non-phased) [Figure: reference with baseline and called variants, labeled a to f]
  • 9. Path creation - simple heterozygous case (non-phased) [Figure: best path over variants a to f; one baseline variant is excluded as a false negative and one called variant is excluded as a false positive]
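For the non-phased heterozygous case, the matching rule from slide 5 (zygosity and alleles must both agree) can be sketched for two records at the same site. This is a simplified per-site check, not the haplotype replay that vcfeval performs; the helper names and the `(ref, alts, gt)` tuple layout are assumptions, and missing genotypes (`.`) are not handled.

```python
import re


def genotype_alleles(ref, alts, gt):
    """Return the sorted allele sequences named by a VCF-style GT string.

    Sorting discards phase, so "0/1", "1/0", and "1|0" compare equal.
    """
    alleles = [ref] + list(alts)
    return tuple(sorted(alleles[int(i)] for i in re.split(r"[/|]", gt)))


def genotypes_match(baseline, called):
    """Non-phased match: zygosity and both alleles must agree."""
    return genotype_alleles(*baseline) == genotype_alleles(*called)


het = ("A", ["G"], "0/1")
print(genotypes_match(het, ("A", ["G"], "1|0")))  # same alleles, phase ignored
print(genotypes_match(het, ("A", ["G"], "1/1")))  # zygosity differs
print(genotypes_match(het, ("A", ["T"], "0/1")))  # alternate allele differs
```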
  • 10. Why is weighting needed? TP + FN = Total_baseline
        Reference  CAACAACTATCCTC....ATCT....GC
        Baseline   CAACAACTATCCTCATCTATCTATCTGC
        Called     CAACAACTATCCTCATCTATCTATCTGC
  • 11. Sync points
        Reference  ACAGTCACGG
        Baseline   ACGGTCACTG
        Called     ACGGTTACGG
      Split at sync points:
        Reference  AC  AGT  CAC  GG
        Baseline   AC  GGT  CAC  TG
        Called     AC  GGT  TAC  GG
  • 12. Weighting [Formula shown on the slide; not captured in the transcript], where B is the number of baseline variants between the current (S_n) and previous (S_{n-1}) sync points, and C is the number of called variants between the current and previous sync points.
  • 13. Simple homozygous weighting
        Weights between sync points: a 1, b 1, c 1, d 1, e 1, f 1
        False negative (excluded from baseline): 1
        False positive (excluded from called): 1
        Type  Weighted total
        TP    6
        FP    1
        FN    1
  • 14. Simple heterozygous case (non-phased) weighting
        Weights between sync points: a 1, b 1, c 1, d 1 (e and f are listed without weights)
        False negative (excluded from baseline): 2
        False positive (excluded from called): 1
        Type  Weighted total
        TP    4
        FP    1
        FN    2
  • 15. Complex weighting
        Weights between sync points: a 1, b 1, c 1, d 1, e 0.5, f 0.5
        Type  Weighted total
        TP    5
        FP    0
        FN    0
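The weighted totals on slides 13 and 15 are consistent with giving each called variant between two sync points a weight of B / C. The formula itself is missing from the transcript, so that rule is an assumption in the sketch below, as are the function name and the reading of each slide as a list of (B, C) pairs.

```python
from fractions import Fraction


def called_weights(intervals):
    """Return one weight per called true-positive variant.

    intervals: list of (B, C) pairs, one per stretch between sync points,
    where B counts baseline variants and C counts called variants.
    Assumed rule (not stated in the transcript): each called variant
    in the stretch weighs B / C.
    """
    weights = []
    for b, c in intervals:
        if c == 0:
            continue  # nothing called here, so nothing to weight
        weights.extend([Fraction(b, c)] * c)
    return weights


# Slide 13 reading: six one-to-one matches (a to f).
simple = called_weights([(1, 1)] * 6)

# Slide 15 reading: four one-to-one matches, then one baseline variant
# represented by two called variants (e and f at 0.5 each).
complex_case = called_weights([(1, 1)] * 4 + [(1, 2)])

print(sum(simple), sum(complex_case))
```

Under this rule the weights in each stretch sum to B, so the weighted true positives plus the false negatives still add up to the number of baseline variants, which is the identity slide 10 asks for.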
  • 16. ROC Plot
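A ROC curve of this kind can be built by sorting the called variants by score and accumulating weighted true and false positives as the threshold is lowered. The sketch below assumes each call carries a score, a true-positive flag, and a weight, and that the axes are cumulative counts. The `roc_points` name and the input layout are hypothetical and do not reflect the vcfeval output format.

```python
def roc_points(scored_calls):
    """Return cumulative (false positives, true positives) per call.

    scored_calls: iterable of (score, is_true_positive, weight) tuples.
    Calls are visited from highest to lowest score.
    """
    points = [(0.0, 0.0)]
    fp = tp = 0.0
    for score, is_tp, weight in sorted(scored_calls, key=lambda c: -c[0]):
        if is_tp:
            tp += weight
        else:
            fp += weight
        points.append((fp, tp))
    return points


# Hypothetical scored calls: (score, is_true_positive, weight).
calls = [
    (50, True, 1.0),
    (45, True, 0.5),
    (40, True, 0.5),
    (30, False, 1.0),
    (20, True, 1.0),
]
curve = roc_points(calls)
print(curve)
```

Calls with tied scores each get their own point here; a plotting tool would more likely emit one point per distinct score.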
  • 17. http://biorxiv.org/content/early/2014/01/24/001958
  • 18. Acknowledgements
      RTG, Hamilton, New Zealand
      • John Cleary
      • Len Trigg
      • Mehul Rathoud
      Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)
      This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA. © 2014 Real Time Genomics, Inc. All rights reserved.
