Aug 2014: SMaSH performance metrics tool



  1. SMaSH: A Benchmarking Toolkit for Variant Calling
  2. SMaSH and GIAB: A Good Match. High overlap with the features in the Performance Metrics Specifications doc; many of the features not currently supported are ones we'd like to integrate.
  3. About me. Worked at the UC Berkeley AMPLab for about a year; currently the primary SMaSH developer; starting a CS PhD in Programming Languages at Berkeley this fall.
  4. About this talk. SMaSH as it is now; SMaSH in the future; SMaSH and GIAB.
  5. SMaSH as it is now
  6. SMaSH. A project out of the AMP-X group at UC Berkeley (Talwalkar et al., 2014, Bioinformatics).
  7. Initial goal. Create a unified way of benchmarking germline variant-calling pipelines.
  8. SMaSH components. A codebase for comparing VCF callsets; reads and ground-truth datasets; metrics for accuracy and computational performance.
  9. Codebase. For benchmarking purposes, we compare a predicted callset against a ground-truth callset; comparing two predicted callsets works exactly the same way.
  10. Variant classification. SNPs; indels (less than 50 base pairs); structural variants.
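The three-way split above can be sketched as a simple length-based rule. This is an illustrative function, not SMaSH's actual API; the 50 bp cutoff comes from the slide, and treating same-length multi-base substitutions as indel-class is a simplification.

```python
def classify_variant(ref, alt, sv_threshold=50):
    """Classify a variant by its REF/ALT allele lengths.

    SNP: single-base substitution; indel: length change under the
    50 bp threshold; otherwise a structural variant.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if abs(len(ref) - len(alt)) < sv_threshold:
        return "INDEL"
    return "SV"
```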
  11. Evaluation. SNPs and indels are evaluated strictly. Structural variants are evaluated on: same type (insertion/deletion/other); length within a specified tolerance of the true variant's; position within a specified tolerance of the true variant's.
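The tolerance-based SV match could look like the sketch below. The field names and the default tolerances are hypothetical placeholders; SMaSH lets the tolerances be specified by the user.

```python
def sv_match(true_sv, pred_sv, pos_tol=100, len_tol=100):
    """Check whether a predicted structural variant matches a true one
    under the criteria above: same type, and position and length each
    within a tolerance of the true variant's."""
    same_type = true_sv["type"] == pred_sv["type"]
    close_pos = abs(true_sv["pos"] - pred_sv["pos"]) <= pos_tol
    close_len = abs(true_sv["len"] - pred_sv["len"]) <= len_tol
    return same_type and close_pos and close_len
```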
  12. Accuracy metrics. Classify variants as true positive, false positive, or false negative; evaluate the accuracy of genotyping.
  13. Error bars. Calculated from confidence in the ground-truth calls: choose an upper bound on the ground-truth call error rate based on the validation methodology (e.g., 2 out of every 1,000 SNP calls are wrong), then use this error rate to compute upper and lower bounds on precision and recall.
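One crude way to turn such an error rate into bounds is to assume that, in the worst and best cases, up to that fraction of the ground-truth calls flip a comparison outcome. This is an illustrative bound, not necessarily SMaSH's exact formula:

```python
def precision_recall_bounds(tp, fp, fn, truth_error_rate=0.002):
    """Point estimates plus upper/lower bounds on precision and recall,
    assuming up to `truth_error_rate` of ground-truth calls may
    themselves be wrong (e.g. 2 in 1,000)."""
    n_truth = tp + fn
    max_bad = truth_error_rate * n_truth  # truth calls that may be wrong
    precision = tp / (tp + fp)
    recall = tp / n_truth
    # Best case: our apparent errors were actually errors in the truth
    # set; worst case: some apparent matches were.
    recall_bounds = (max(0.0, (tp - max_bad) / n_truth),
                     min(1.0, (tp + max_bad) / n_truth))
    precision_bounds = (max(0.0, (tp - max_bad) / (tp + fp)),
                        min(1.0, (tp + max_bad) / (tp + fp)))
    return {"precision": (precision_bounds[0], precision, precision_bounds[1]),
            "recall": (recall_bounds[0], recall, recall_bounds[1])}
```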
  14. The VCF format is ambiguous! SMaSH addresses this problem with two strategies: normalization and rescue. Guiding principle: metrics should never be worse after normalization/rescue than they were without them.
  15. Normalization. A single variant may plausibly be placed at many different positions while describing the same change.
  16. For example, we normalize this variant:
  17. First, we remove the longest shared proper suffix from the ref and alt alleles.
  18. Then we "slide" the variant by adding a base from the reference to the head and removing a base from the tail, until the last bases of the two alleles are no longer the same.
  19. Rescue. The same underlying haplotype can be represented by different sets of variants in the true callset and the predicted callset.
  20. Rescue algorithm. For every false negative, we attempt rescue: build a window around the variant position for the true and predicted callsets; for all sets of non-overlapping variants, expand the underlying haplotypes for the variants within those windows; if the haplotypes match, mark the corresponding false negatives and false positives as true positives.
  21. Rescue example
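The core of the rescue check, expanding each variant set into the haplotype it implies over a window and comparing the results, can be sketched as follows. This simplified version ignores genotypes and phasing, and the function names are hypothetical:

```python
def apply_variants(window_ref, window_start, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference
    window, producing the haplotype string they imply."""
    out, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        offset = pos - window_start
        out.append(window_ref[cursor:offset])  # reference up to the variant
        out.append(alt)                        # substituted allele
        cursor = offset + len(ref)             # skip past the REF allele
    out.append(window_ref[cursor:])
    return "".join(out)

def rescued(window_ref, window_start, true_vars, pred_vars):
    """A false negative is rescued if the true and predicted variant
    sets spell out the same haplotype over the window."""
    return (apply_variants(window_ref, window_start, true_vars)
            == apply_variants(window_ref, window_start, pred_vars))
```

For example, a true two-base MNP and a predicted lone SNP next to a matching reference base describe the same haplotype, so the apparent FN/FP pair is reclassified as a true positive.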
  22. Outputs. Statistics, including counts for all categories, in plain-text, TSV, and JSON formats; precision and recall, including error bars; a VCF containing variants from both callsets, annotated with the callset they came from and their categorization (TP/FP/FN/rescued).
  23. Where is SMaSH headed?
  24. Global Alliance for Genomics & Health. The benchmarking task force includes Illumina, Amazon, Google, UC Berkeley, UC Santa Cruz, and NIST.
  25. Development continues under GA4GH. The chief maintainers will be Kelly Westbrooks and Cassie Doll (Google).
  26. Feature roadmap. New variant types (complex variants, compound heterozygous variants, etc.); phasing evaluation; better handling of known false positives.
  27. SMaSH and GIAB
  28. Try it and let us know what you think! git clone the repository; complete documentation is available online; post feedback at the smash-benchmarking Google Group.
  29. Code contributions. Open source and BSD-licensed; pull requests and issues are very welcome.
  30. Datasets. The SMaSH paper proposed eight datasets, including synthetic, sampled human, and mouse data. Other data to use as ground truth? NIST pedigree calls for NA12878, the Illumina Platinum Genomes, others?
  31. Interpretation of results
  32. Tools for downstream analysis? Visualizations? Compatibility with genome browsers? Other?
  33. Thanks!