GIAB Sep2016 Lightning chen sun varmatch
1. VarMatch:
robust matching of small variant datasets
using flexible scoring schemes
Chen Sun, Paul Medvedev
Penn State
2. Variant Matching
• Different pipelines tend to report the same variants in different
representations
• Need to compare VCF files
• Evaluate variant callers
• Find overlapping calls to use as high-confidence variants
• Add variants into database
• Two variant sets are equivalent if applying them separately to the
reference genome results in the same donor genome.
• Variant Matching Problem: given two call sets, identify the largest
equivalent subsets.
3. The Variant Matching problem
Seq:    A G C C G G
Set  REF   ALT
1    GCCG  CCGA
2    G     C
     C     G
     G     A
3    AG    A
     G     GA
Donor:  A C C G A G
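The equivalence in this example can be checked mechanically by applying each set to the reference. The sketch below is illustrative only: `apply_variants` is a hypothetical helper, and the 0-based positions are inferred from the reference and donor sequences, since the slide does not list them.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants; pos is 0-based."""
    donor, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        # Each REF allele must match the reference and must not
        # overlap the previously applied variant.
        if pos < cursor or reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"cannot apply variant at {pos}: {ref}>{alt}")
        donor.append(reference[cursor:pos])  # unchanged stretch
        donor.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    donor.append(reference[cursor:])
    return "".join(donor)


reference = "AGCCGG"
# Positions below are assumptions inferred from the slide's sequences.
set1 = [(1, "GCCG", "CCGA")]
set2 = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A")]
set3 = [(0, "AG", "A"), (4, "G", "GA")]

donors = [apply_variants(reference, s) for s in (set1, set2, set3)]
print(donors)
```

Under these assumed positions, all three sets produce the donor ACCGAG, so they are equivalent even though no two sets share a single identical record.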
• Naïve approach
• Match two variants only if their locations and alleles are exactly
the same
• Normalization (Tan et al 15)
• Guaranteed to match equivalent singletons
• Complex Variants
• One variant matches multiple variants
• Multiple variants match multiple variants
• Decomposition (Li 14, Zook et al 14)
• Creates fractional matches
• Does not always work (see example)
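A minimal sketch of the trim-and-left-align idea behind normalization, assuming biallelic variants and 0-based positions. `normalize` is a hypothetical helper written for illustration, not the vt implementation, and the repeat sequence used in the first demo is made up.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant (pos is 0-based)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            ref = reference[pos] + ref
            alt = reference[pos] + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two ways of writing the same CA deletion inside a CACACA repeat.
repeat_ref = "GTCACACAG"
right_form = normalize(repeat_ref, 5, "ACA", "A")
left_form = normalize(repeat_ref, 1, "TCA", "T")
print(right_form, left_form)

# The complex variant from the example has nothing to trim or shift,
# so it can never become identical to any single SNP record.
complex_form = normalize("AGCCGG", 1, "GCCG", "CCGA")
print(complex_form)
```

The two singleton deletions collapse to the same record, while the complex variant is left unchanged. That is the case where exact comparison after normalization still misses an equivalent match.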
4. VarMatch Algorithm Overview
• Separators on the reference genome sequence
• Variants to the left of a separator cannot be equivalent to variants to the right
• Linear scan of the reference genome to identify separators
• Solve each independent small problem
• Branch-and-bound method for each small problem
• Similar to the algorithm of Cleary et al., 2015
• Problem size is small
• Requires less memory and time
• Theorem for identifying separators
Software: https://github.com/medvedevgroup/varmatch
Preprint: VarMatch: robust matching of small variant datasets using flexible
scoring schemes (bioRxiv)
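The small problem inside one separator-bounded region can be illustrated by exhaustive search. This is not the branch-and-bound method named above, only a brute-force stand-in for a tiny haploid example that ignores genotypes. `apply_variants`, `best_match`, the 0-based positions, and the extra G>T query call are all hypothetical.

```python
from itertools import combinations


def apply_variants(reference, variants):
    """Return the donor sequence, or None if variants overlap or mismatch."""
    donor, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor or reference[pos:pos + len(ref)] != ref:
            return None
        donor.append(reference[cursor:pos])
        donor.append(alt)
        cursor = pos + len(ref)
    donor.append(reference[cursor:])
    return "".join(donor)


def subsets(items):
    """Yield every non-empty subset, preserving input order."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def best_match(reference, baseline, query):
    """Find the largest pair of subsets that yield the same donor."""
    best_score, best_pair = 0, ((), ())
    for b in subsets(baseline):
        donor = apply_variants(reference, b)
        if donor is None:
            continue
        for q in subsets(query):
            if apply_variants(reference, q) == donor:
                score = len(b) + len(q)  # total matched variants
                if score > best_score:
                    best_score, best_pair = score, (b, q)
    return best_score, best_pair


reference = "AGCCGG"
baseline = [(1, "GCCG", "CCGA")]
query = [(1, "G", "C"), (3, "C", "G"), (4, "G", "A"), (5, "G", "T")]
score, pair = best_match(reference, baseline, query)
print(score, pair)
```

The search pairs the single baseline call with the three query SNPs and leaves the unrelated G>T call unmatched. Enumeration is exponential in the number of variants, which is why keeping each subproblem small matters.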
5. VarMatch supports flexible scoring schemes
• Maximize the total number of matched variants, or only those in the baseline?
• Maximize the number of calls or the total edit distance?
• e.g. one call changing 10 bases vs. 10 calls each changing 1 base.
• Require genotypes to match, or just detect that a variant is present?
• Other schemes possible?
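The first two choices can be made concrete with a small scoring function. This is a sketch under assumptions: the `score` function, its `unit` and `side` parameters, and the use of Levenshtein distance between REF and ALT as the per-call edit distance are illustrative choices, not VarMatch's actual options.

```python
def edit_distance(a, b):
    """Levenshtein distance between two allele strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def score(matched_baseline, matched_query, unit="count", side="total"):
    """Score a match; variants are (pos, ref, alt) tuples.

    unit: "count" weighs each call as 1, "edit" weighs it by edit distance.
    side: "total" sums both sets, "baseline" sums the baseline only.
    """
    def weight(variant):
        return 1 if unit == "count" else edit_distance(variant[1], variant[2])

    total = sum(weight(v) for v in matched_baseline)
    if side == "total":
        total += sum(weight(v) for v in matched_query)
    return total


# One call changing 10 bases vs. 10 calls each changing 1 base.
one_call = [(0, "A" * 10, "C" * 10)]
ten_calls = [(i, "A", "C") for i in range(10)]

by_count = (score(one_call, [], "count", "baseline"),
            score(ten_calls, [], "count", "baseline"))
by_edit = (score(one_call, [], "edit", "baseline"),
           score(ten_calls, [], "edit", "baseline"))
print(by_count, by_edit)
```

Counting calls favors the ten-SNP representation (1 vs. 10), while edit distance treats both representations equally (10 vs. 10).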
6. Benchmark
Number of Matched Variants
              CHM1 + bowtie (Li 14)     NA12878 + bowtie (Li 14)
              Freebayes   GATK-HC       Freebayes   GATK-UG
Vt normalize  2,778,372   2,778,372     4,092,161   4,092,161
RTG Tools     2,843,396   2,912,641     4,197,070   4,321,997
VarMatch      2,843,396   2,912,641     4,197,138   4,322,083

Memory and Running Time Evaluation
              RAM (GB)    Time (s)
RTG Tools     48          456
VarMatch      5           302
7. Matching in low-complexity regions
• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 callsets (Li 14)
• Using Bowtie2+GATK as baseline
• Focus on low-complexity regions
• 12% more equivalent variants identified with VarMatch than with normalization
[Figure: two panels, "Results of Vt-normalize" and "Results of VarMatch"]
8. Matching in dense regions
• Comparison of Freebayes vs. Platypus NA12878 callsets (Li 14)
• Using GIAB Gold Standard (Zook et al 14) as baseline
• Focus on “dense regions”
• 10-base regions that contain an INDEL and another variant
• Genome-wide assessment differs from assessment in dense regions
Number of Matched Variants in Baseline
                Freebayes    Platypus
Genome-wide     2,896,841    2,891,849
Dense regions   24,188       24,522
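One way to flag such dense regions is sketched below. It assumes that "a 10-base region containing an INDEL and another variant" means an INDEL whose start lies fewer than 10 bases from the start of another call on the same chromosome. That reading, the `dense_indels` helper, and the sample calls are all hypothetical.

```python
def is_indel(variant):
    """A call is an INDEL when REF and ALT differ in length."""
    _, ref, alt = variant
    return len(ref) != len(alt)


def dense_indels(variants, window=10):
    """Return INDELs with another call starting fewer than `window` bases away.

    variants: (pos, ref, alt) tuples on a single chromosome.
    """
    ordered = sorted(variants)
    dense = []
    for i, variant in enumerate(ordered):
        if not is_indel(variant):
            continue
        # In sorted order, the closest calls are the immediate neighbours.
        neighbours = ordered[max(0, i - 1):i] + ordered[i + 1:i + 2]
        if any(abs(n[0] - variant[0]) < window for n in neighbours):
            dense.append(variant)
    return dense


calls = [(100, "A", "AT"), (105, "G", "C"), (300, "C", "T"), (500, "GA", "G")]
flagged = dense_indels(calls)
print(flagged)
```

In this made-up input, only the insertion at 100 is flagged, because the deletion at 500 has no other call nearby.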
11. VarMatch Highlights
• Uses less memory and running time
• Better performance matching complex variants
• Better performance in low-complexity regions
• Better performance in dense regions
• Flexible scoring schemes