140127 rtg vcfeval vcf comparison tool
1. Comparing Variant Calls
GENOME-IN-A-BOTTLE WORKSHOP
Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine
In collaboration with Real Time Genomics, Inc.
2. rtgTools v1.0
A toolkit to compare and analyze VCFs
• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files
Java compiled code freely available at GiaB repository:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
3. Issues in representation of complex calls
Indel in homopolymer
Reference  CAAAAAAG
Baseline   C..AAAAG
Called     CAAAA..G
After replay:
Baseline   CAAAAG
Called     CAAAAG
MNPs
Reference  CAACGTAAG
Baseline   CAATGTCAG
Called     CAATGTCAG
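The "replay" idea on this slide can be sketched in a few lines of Python: apply each call set to the reference and compare the resulting sequences instead of the call records. This is a minimal illustration, not the vcfeval implementation. The `replay` helper and the VCF-style `(pos, ref, alt)` tuples are assumptions made for the example, and the MNP encodings (one multi-base substitution versus two SNPs) are a guess at how the two call sets might differ.

```python
def replay(reference, variants):
    """Apply VCF-style (pos, ref, alt) variants to a reference string.

    pos is 1-based; variants are assumed not to overlap.
    """
    seq = reference
    # Work right to left so that earlier coordinates remain valid.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match the reference at {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# Homopolymer indel: the same 2-base deletion written at either end of the run.
homopolymer_ref = "CAAAAAAG"
baseline_indel = [(1, "CAA", "C")]   # C..AAAAG
called_indel = [(5, "AAA", "A")]     # CAAAA..G

# MNP: one multi-base substitution versus two separate SNPs (assumed encodings).
mnp_ref = "CAACGTAAG"
baseline_mnp = [(4, "CGTA", "TGTC")]
called_mnp = [(4, "C", "T"), (7, "A", "C")]

indel_haplotypes = (replay(homopolymer_ref, baseline_indel),
                    replay(homopolymer_ref, called_indel))
mnp_haplotypes = (replay(mnp_ref, baseline_mnp), replay(mnp_ref, called_mnp))
print(indel_haplotypes, mnp_haplotypes)
```

In both cases the two call sets differ record by record but replay to the same sequence, which is why the comparison is done on replayed sequences.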
4. Issues in representation of complex calls
Dinucleotide repeat
Reference  ACGTACCAGATATCACAACATATATATA
Baseline   ACGGACCAG..ATCACAACATATATATATA
Called     ACGGACCAGAT..CACAACATATATATATA
After replay:
Baseline   ACGGACCAGATCACAACATATATATATA
Called     ACGGACCAGATCACAACATATATATATA
5. Comparison of variant call set with baseline set
Basic rules
• Match the baseline and called sequences so as to maximize true
positives and minimize false positives and false negatives.
• True positives + false negatives = total calls in the baseline
• Heterozygous calls match only if both the zygosity and the alleles agree
[Diagram: Best path, Link mutations, ROC]
Path creation
• A path is a selection of a subset of the calls
• Best path: the path that maximizes true positives and minimizes errors
• In theory there is an exponential number of paths; in practice the problem can be solved by
dynamic programming
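To make the notion of a path concrete, the sketch below enumerates every subset of baseline and called variants for a tiny homozygous example and keeps the pair of subsets that replays to the same sequence while including the most calls. It is a brute-force illustration of the exponential search space only. It is not the dynamic programming approach the slide refers to, and the helper names (`replay`, `subsets`, `best_path`) and the example variants are hypothetical.

```python
from itertools import combinations


def replay(reference, variants):
    """Apply non-overlapping VCF-style (pos, ref, alt) variants, 1-based pos."""
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


def subsets(items):
    """Yield every subset of items as a tuple (2**len(items) subsets in total)."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Return the (baseline subset, called subset) pair that replays to the
    same sequence and includes the largest number of calls."""
    best = ((), ())
    for b in subsets(baseline):
        haplotype = replay(reference, b)
        for c in subsets(called):
            agrees = replay(reference, c) == haplotype
            if agrees and len(b) + len(c) > len(best[0]) + len(best[1]):
                best = (b, c)
    return best


reference = "CAAAAAAG"
baseline = [(1, "CAA", "C")]                 # deletion written at the left
called = [(5, "AAA", "A"), (8, "G", "T")]    # same deletion plus a spurious SNP

kept_baseline, kept_called = best_path(reference, baseline, called)
tp = len(kept_baseline)
fn = len(baseline) - len(kept_baseline)
fp = len(called) - len(kept_called)
print(tp, fp, fn)
```

The excluded called SNP is counted as a false positive, and TP + FN still equals the number of baseline calls. The nested subset loops are what becomes infeasible on real call sets, hence the need for dynamic programming.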
6. Path creation - simple homozygous case
[Figure: reference with baseline and called variant tracks; variants labeled a to h]
7. Path creation - simple homozygous case
[Figure: reference, baseline and called variant tracks (variants labeled a to h) with the best path; one baseline variant is excluded as a false negative and one called variant is excluded as a false positive]
8. Path creation - simple heterozygous case (non-phased)
[Figure: reference with baseline and called variant tracks; variants labeled a to f]
9. Path creation - simple heterozygous case (non-phased)
[Figure: reference, baseline and called variant tracks (variants labeled a to f) with the best path; one baseline call is excluded as a false negative and one called variant is excluded as a false positive]
10. Why is weighting needed?
TP + FN = Total_baseline
Reference  CAACAACTATCCTC....ATCT....GC
Baseline   CAACAACTATCCTCATCTATCTATCTGC
Called     CAACAACTATCCTCATCTATCTATCTGC
12. Weighting
The weight formula on this slide is expressed in terms of B and C, where B is the number of
baseline variants between the current sync point (S_n) and the previous sync point (S_{n-1}),
and C is the number of called variants between the same two sync points.
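The slide's formula did not survive in the text, so the sketch below rests on an assumption: each called true positive between two sync points is given weight B / C, which is one simple choice that keeps TP + FN equal to the baseline total when the two call sets split the same change into different numbers of records. The function name `called_tp_weight` and the example counts are hypothetical.

```python
from fractions import Fraction


def called_tp_weight(b, c):
    """Weight per called true positive between two sync points.

    Assumption for illustration: weight = B / C, with B baseline variants
    and C called variants in the region.
    """
    if c == 0:
        raise ValueError("no called variants between the sync points")
    return Fraction(b, c)


# Hypothetical region: one baseline insertion matched by two called insertions
# that replay to the same sequence.
b_count, c_count = 1, 2
weight = called_tp_weight(b_count, c_count)
weighted_tp = weight * c_count   # sum of weights over the called true positives
print(weight, weighted_tp)
```

Under this assumption the two called records together contribute one true positive, matching the single baseline record they reproduce.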
13. Simple homozygous weighting
[Figure: baseline and called variant tracks divided by sync points. Variants a to f each have weight 1; the excluded false negative has weight 1 and the excluded false positive has weight 1.]
Type   Weighted total
TP     6
FP     1
FN     1
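14. Simple heterozygous case (non-phased) weighting
[Figure: baseline and called variant tracks divided by sync points. Variants a to d each have weight 1 (e and f are shown without a weight); the excluded false negative has weight 2 and the excluded false positive has weight 1.]
Type   Weighted total
TP     4
FP     1
FN     2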