Project out of the AMP-X group at UC Berkeley
Talwalkar et al., 2014, Bioinformatics
Create a unified way of benchmarking germline variant calling pipelines.
Codebase for comparing VCF callsets
Reads and ground truth datasets
Metrics for accuracy and computational performance
For benchmarking purposes, we compare a predicted callset against a
ground truth callset
Comparing two predicted callsets works the same way.
Indel (less than 50 base pairs)
SNPs and indels are evaluated strictly: position and alleles must match exactly.
Structural variants are evaluated on:
Same type (insertion/deletion/other)
Length same as true variant within specified tolerance
Position same as true variant within specified tolerance
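The matching rule above can be sketched as a simple predicate. This is a hypothetical illustration: `sv_matches`, the tuple layout, and the default tolerances are assumptions, not SMaSH's actual API.

```python
def sv_matches(pred, truth, pos_tol=100, len_tol=100):
    """Return True if a predicted SV matches a true SV.

    pred and truth are (type, position, length) tuples,
    e.g. ("DEL", 12345, 480). Tolerances are in base pairs.
    """
    pred_type, pred_pos, pred_len = pred
    true_type, true_pos, true_len = truth
    return (pred_type == true_type                    # same type
            and abs(pred_pos - true_pos) <= pos_tol   # position within tolerance
            and abs(pred_len - true_len) <= len_tol)  # length within tolerance

# A deletion called 30 bp away with nearly the same length matches:
print(sv_matches(("DEL", 12375, 475), ("DEL", 12345, 480)))  # True
```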
Evaluate variants as true positive, false positive, false negative
Evaluate accuracy of genotyping
Calculated from our confidence in the ground truth calls
Choose some upper bound on the ground truth call error rate
E.g., 2 out of every 1000 SNPs is wrong
Use this error rate to calculate upper/lower bounds on precision and recall
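One way such interval bounds could be computed is sketched below. This is a hedged illustration: the function name and the exact worst-case/best-case accounting are assumptions, not necessarily the paper's formula.

```python
def bounded_precision_recall(tp, fp, fn, truth_error_rate):
    """Widen precision/recall into intervals given a ground-truth error rate.

    Assumes up to truth_error_rate of the ground-truth calls may be wrong.
    """
    n_truth = tp + fn               # total ground-truth calls
    e = truth_error_rate * n_truth  # truth calls that may be wrong
    # Worst case: e of our "true positives" matched erroneous truth calls;
    # best case: up to e of our false positives were right and the truth wrong.
    prec = (max(tp - e, 0) / (tp + fp),
            min(tp + min(e, fp), tp + fp) / (tp + fp))
    # Analogous worst/best cases for recall over the truth set.
    rec = (max(tp - e, 0) / n_truth,
           min(tp + min(e, fn), n_truth) / n_truth)
    return prec, rec

# E.g., assume 2 out of every 1000 ground-truth SNPs is wrong:
prec, rec = bounded_precision_recall(tp=9500, fp=300, fn=500, truth_error_rate=0.002)
print(prec, rec)
```

With these numbers the point estimates (precision 9500/9800, recall 0.95) widen into intervals about ±0.002 wide.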
The VCF format is ambiguous: the same variant can be represented in many different ways.
SMaSH addresses this problem with two strategies:
Guiding principle: metrics should never be worse after
normalization/rescue than they were without them.
A single variant may be plausibly placed at many different positions and still
describe the same change.
First, we remove the longest shared proper suffix of the ref and alt alleles.
Then, we "slide" the variants by adding a base from the reference to the
head and removing a base from the tail, until the last bases on both
alleles are no longer the same.
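A minimal sketch of this normalization, folding the suffix trimming and the leftward slide into one loop over 0-based positions. `left_align` and its interface are illustrative assumptions, not SMaSH's actual code.

```python
def left_align(pos, ref, alt, reference):
    """Normalize a variant: trim the shared suffix, then slide it left.

    pos is the 0-based position of the ref allele on the reference string.
    """
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            # Last bases agree: trim the shared suffix.
            ref, alt = ref[:-1], alt[:-1]
        elif (not ref or not alt) and pos > 0:
            # An allele emptied: add the preceding reference base to the head.
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            pos -= 1
        else:
            break
    return pos, ref, alt

# Two representations of the same 2 bp deletion normalize identically:
print(left_align(2, "ATAT", "AT", "GCATATG"))  # (1, 'CAT', 'C')
```

Running it on an already-normalized variant leaves it unchanged, so normalization is idempotent.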
The same underlying haplotype can be represented by different sets of variants.
For every false negative, we attempt rescue:
Build up a window around the variant positions for the true and
predicted callsets.
For all sets of non-overlapping variants, expand the underlying
haplotypes for the variants within those windows.
If the haplotypes match, mark all false negatives/false positives involved as rescued.
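The rescue check boils down to applying each variant set to the same reference window and comparing the resulting haplotype strings. A hedged sketch follows; the function names, the (pos, ref, alt) tuples, and 0-based coordinates are assumptions, and the real implementation additionally enumerates non-overlapping variant subsets and handles genotypes.

```python
def apply_variants(window_seq, window_start, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference window."""
    out, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        offset = pos - window_start
        out.append(window_seq[cursor:offset])  # untouched reference bases
        out.append(alt)                        # substitute the alt allele
        cursor = offset + len(ref)
    out.append(window_seq[cursor:])
    return "".join(out)

def haplotypes_match(window_seq, window_start, truth_vars, called_vars):
    """True if both variant sets yield the same haplotype over the window."""
    return (apply_variants(window_seq, window_start, truth_vars)
            == apply_variants(window_seq, window_start, called_vars))

# Differently-placed deletions that spell the same haplotype get rescued:
print(haplotypes_match("GCATATG", 0, [(1, "CAT", "C")], [(2, "ATAT", "AT")]))  # True
```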
Statistics, including counts for all categories, in plain text, TSV, and other formats
Calculations for precision and recall, including error bars
VCF containing variants from both callsets, annotated with the callset
they came from and their categorization (TP/FP/FN/rescued)
Try it and let us know what you think!
git clone https://github.com/amplab/smash.git
Complete documentation available at smash.cs.berkeley.edu
Post feedback at the Google Group smash-benchmarking
Open source and BSD-licensed; pull requests and issues very welcome
The SMaSH paper proposed eight datasets, including synthetic, sampled
human, and mouse.
Other data to use as ground truth?
NIST pedigree calls for NA12878
The Illumina Platinum Genomes