140127 rtg vcfeval vcf comparison tool

Comparing Variant Calls
GENOME-IN-A-BOTTLE WORKSHOP

Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine
In collaboration with Real Time Genomics, Inc.
rtgTools v1.0
A toolkit to compare and analyze VCFs

• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files

Java compiled code freely available at GiaB repository:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/

Issues in representation of complex calls

Indel in homopolymer

Reference  CAAAAAAG
Baseline   C..AAAAG
Called     CAAAA..G

After replay:
Baseline   CAAAAG
Called     CAAAAG

MNPs

Reference  CAACGTAAG
Baseline   CAATGTCAG
Called     CAATGTCAG
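
The homopolymer indel and MNP examples above can be checked with a small replay sketch in Python. This is only an illustration, not the vcfeval code: the `replay` helper, the 0-based `(pos, ref, alt)` tuples, and the choice of which call set uses one MNP versus two SNPs are all assumptions made for the example.

```python
def replay(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    pos is 0-based (an assumption of this sketch; VCF itself is 1-based).
    """
    out = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        out.append(reference[cursor:pos])  # untouched reference bases
        out.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    out.append(reference[cursor:])
    return "".join(out)


# Homopolymer: the same 2-base deletion written at two different positions.
homopolymer = "CAAAAAAG"
baseline_del = [(0, "CAA", "C")]  # deletion anchored at the left end
called_del = [(4, "AAA", "A")]    # deletion anchored further right

# MNP: one multi-nucleotide substitution versus two separate SNPs.
mnp_reference = "CAACGTAAG"
as_one_mnp = [(3, "CGTA", "TGTC")]
as_two_snps = [(3, "C", "T"), (6, "A", "C")]

print(replay(homopolymer, baseline_del) == replay(homopolymer, called_del))
print(replay(mnp_reference, as_one_mnp) == replay(mnp_reference, as_two_snps))
```

Comparing replayed sequences rather than VCF records is what lets differently written calls count as the same variant.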
Issues in representation of complex calls
Dinucleotide repeat

Reference  ACGTACCAGATATCACAACATATATATA
Baseline   ACGGACCAG..ATCACAACATATATATATA
Called     ACGGACCAGAT..CACAACATATATATATA

After replay:
Baseline   ACGGACCAGATCACAACATATATATATA
Called     ACGGACCAGATCACAACATATATATATA
Comparison of variant call set with baseline set
Basic rules
• Match the baseline and called sequences so as to maximize true
positives and minimize false positives and false negatives.
• True positives + false negatives = total calls in the baseline
• Heterozygous calls match only if both the zygosity and the alleles agree

[Diagram: Best path, Link mutations, ROC]

Path creation
• A path is a selection of a subset of calls
• Best path: the path that maximizes true positives and minimizes errors
• In theory there is an exponential number of paths; in practice the best
path can be found by dynamic programming
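
The idea of a path as a selection of calls can be sketched by brute force in Python. This is deliberately the exponential version mentioned above, not the dynamic programming used in practice, and it covers only the haploid (homozygous) case. The function names, the 0-based `(pos, ref, alt)` tuples, and the example variants are all hypothetical.

```python
from itertools import combinations


def replay(reference, variants):
    """Apply (pos, ref, alt) variants (0-based pos) to the reference.

    Returns None if the variants overlap or a REF allele does not match.
    """
    out, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor or reference[pos:pos + len(ref)] != ref:
            return None
        out.append(reference[cursor:pos] + alt)
        cursor = pos + len(ref)
    return "".join(out) + reference[cursor:]


def subsets(items):
    """Yield every subset of items, largest first."""
    for size in range(len(items), -1, -1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Pick the baseline and called subsets that replay to the same sequence
    while including as many calls as possible (fewest FN plus FP)."""
    best = None
    for b in subsets(baseline):
        hap_b = replay(reference, b)
        if hap_b is None:
            continue
        for c in subsets(called):
            if replay(reference, c) != hap_b:
                continue
            score = len(b) + len(c)
            if best is None or score > best[0]:
                best = (score, b, c)
    _, b, c = best  # the two empty subsets always match, so best is set
    return {
        "tp_baseline": list(b),
        "fn": [v for v in baseline if v not in b],
        "tp_called": list(c),
        "fp": [v for v in called if v not in c],
    }


reference = "CAAAAAAGTC"
baseline = [(0, "CAA", "C"), (8, "T", "G")]  # deletion plus a SNP
called = [(4, "AAA", "A"), (9, "C", "A")]    # same deletion, different SNP
result = best_path(reference, baseline, called)
print(result)
```

Here the two deletions are matched even though they are written differently, the baseline SNP is excluded as a false negative, and the called SNP is excluded as a false positive.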
Path creation - simple homozygous case
[Diagram: Reference track with Baseline and Called variants labeled a to h.]
[Diagram: the same variants with the Best Path drawn through the Baseline and
Called tracks; a baseline variant is marked "False negative (excluded)" and a
called variant is marked "False positive (excluded)".]
Path creation - simple heterozygous case (non-phased)
[Diagram: Reference track with Baseline and Called variants labeled a to f.]
[Diagram: the same variants with the Best Path drawn through the Baseline and
Called tracks; a baseline variant is marked "False negative (excluded)" and a
called variant is marked "False positive (excluded)".]
Why is weighting needed?
TP + FN = Total baseline

Reference  CAACAACTATCCTC....ATCT....GC
Baseline   CAACAACTATCCTCATCTATCTATCTGC
Called     CAACAACTATCCTCATCTATCTATCTGC
Sync points

Reference  ACAGTCACGG
Baseline   ACGGTCACTG
Called     ACGGTTACGG

Split at sync points:

Reference  AC | AGT | CAC | GG
Baseline   AC | GGT | CAC | TG
Called     AC | GGT | TAC | GG
Weighting

[Equation: weight as a function of B and C]

where B is the number of baseline variants between the current
(S_n) and previous (S_{n-1}) sync points and C is the number of called
variants between the current and previous sync points.
Simple homozygous weighting
[Diagram: Baseline and Called tracks divided by sync points.
Weights: a 1, b 1, c 1, d 1, e 1, f 1.
False negative (excluded): 1. False positive (excluded): 1.]

Type  Weighted total
TP    6
FP    1
FN    1
Simple heterozygous case (non-phased) weighting
[Diagram: Baseline and Called tracks divided by sync points.
Weights: a 1, b 1, c 1, d 1 (e and f are shown without weights).
False negative (excluded): 2. False positive (excluded): 1.]

Type  Weighted total
TP    4
FP    1
FN    2
Complex weighting
[Diagram: Baseline and Called tracks divided by sync points.
Weights: a 1, b 1, c 1, d 1, e 0.5, f 0.5.]

Type  Weighted total
TP    5
FP    0
FN    0
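
The weighted totals above can be reproduced with a short Python sketch. It assumes the weight of each called true positive between two sync points is B / C. That formula is not given in this text; it is inferred from the 1 and 0.5 weights shown and from the constraint TP + FN = total baseline. The grouping of e and f into one region holding a single baseline variant is likewise an assumption.

```python
from fractions import Fraction


def called_weight(baseline_count, called_count):
    """Assumed weight of one called true positive in a sync region: B / C."""
    return Fraction(baseline_count, called_count)


def weighted_tp_total(regions):
    """Sum the weighted true positives over (B, C) pairs, one per sync region.

    B and C count the matched baseline and called variants in that region.
    """
    total = Fraction(0)
    for baseline_count, called_count in regions:
        if called_count == 0:
            continue  # no called variants here, so nothing to weight
        total += called_count * called_weight(baseline_count, called_count)
    return total


# Simple homozygous slide: six regions, one baseline and one called variant each.
simple_total = weighted_tp_total([(1, 1)] * 6)

# Complex slide: a to d match one-to-one; e and f (two called variants) are
# assumed to correspond to a single baseline variant.
complex_total = weighted_tp_total([(1, 1)] * 4 + [(1, 2)])

print(simple_total, complex_total, called_weight(1, 2))
```

Under this assumption each region contributes exactly B to the weighted TP total, which is why the weighted TP count plus the false negatives always equals the number of baseline calls.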
ROC Plot
http://biorxiv.org/content/early/2014/01/24/001958
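
As a rough Python illustration of how ROC points could be derived from scored calls once each call has been labeled true or false positive, the sketch below accumulates weighted true positives and false positives while lowering the score threshold. The tuple layout, the scores, and the weights are hypothetical, tied scores are not grouped, and this is not a description of the vcfeval or rocplot output format.

```python
def roc_points(calls):
    """Return (score, cumulative_fp, cumulative_tp) from best score to worst.

    calls: iterable of (score, is_true_positive, weight); the weight is used
    for true positives only, and each false positive counts as 1.
    """
    points = []
    tp_total = 0.0
    fp_total = 0.0
    for score, is_tp, weight in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_tp:
            tp_total += weight
        else:
            fp_total += 1.0
        points.append((score, fp_total, tp_total))
    return points


# Hypothetical labeled calls: (score, is_true_positive, weight)
example_calls = [
    (50, True, 1.0),
    (20, True, 0.5),
    (40, True, 0.5),
    (30, False, 1.0),
]
points = roc_points(example_calls)
for score, fp, tp in points:
    print(score, fp, tp)
```

Plotting cumulative true positives against cumulative false positives then gives one curve per call set, which makes pipelines comparable across all score thresholds at once.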
Acknowledgements
RTG, Hamilton, New Zealand
• John Cleary
• Len Trigg
• Mehul Rathoud

Data and tools to compare with phased standard released publicly at NIST
Genome-in-a-Bottle repository (s3://giab)
This work was done while the presenter was employed by Real Time Genomics
Inc., San Bruno, CA.

© 2014 Real Time Genomics, Inc. All rights reserved.