The document proposes a statistical framework for benchmarking variant calling pipelines, using a benchmark call set as truth against which the calls of a test pipeline are compared. Accuracy is measured by the concordance between called variants, taking allele frequencies into account. The framework models a pipeline as sampling sequences from a mixture with fixed but unknown true variant frequencies and introducing errors into the calls, and it defines a max error probability for the pipeline from its error matrix. This max error probability can then be used to obtain confidence intervals for the true variant frequencies from the frequencies the pipeline calls.
1. A statistical framework for benchmarking variant calling pipelines
Jiayong Li and Alexander Wait Zaranek
Curoverse
Jan 28, 2016
2. Use benchmark call set as truth to compare test pipeline
• Benchmark for germline variant calling
Allele frequency 50/50 (het) or 100 (hom), sufficient to count concordance
• Benchmark for somatic variant calling
Allele frequency unknown
Introduction
3. Example: fix a variant site of a person, ref: “T”
• Germline: hom ref
• Somatic:
Test:
“C”: 10%, “G”: 30%, “TAC”: 60%
Benchmark:
“C”: 70%, “G”: 5%, “TAC”: 25%
Same called variants
Q: How good is the test pipeline?
Q: What is “good”?
What is “good”?
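The example on this slide can be checked numerically. The sketch below is only an illustration: the set-overlap definition of concordance and the names `concordance` and `freq_gap` are assumptions, not taken from the slides. It shows that the two call sets agree completely while their allele frequencies differ widely.

```python
# Called variants and their allele frequencies at the example site (ref: "T").
test = {"C": 0.10, "G": 0.30, "TAC": 0.60}
benchmark = {"C": 0.70, "G": 0.05, "TAC": 0.25}

# Concordance of the called variant sets: shared calls / all calls
# (an assumed, set-based definition used only for illustration).
shared = set(test) & set(benchmark)
concordance = len(shared) / len(set(test) | set(benchmark))

# Per-variant gap in allele frequency, which set concordance ignores.
freq_gap = {v: abs(test[v] - benchmark[v]) for v in sorted(shared)}

print(concordance)  # 1.0
print(freq_gap)
```

Counting concordance alone would score the test pipeline as perfect here, which is why a frequency-aware notion of "good" is needed for somatic calls.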
4. Pipeline: randomly sampling the bottle, make calls
• Bottle content: $z = (z_{\text{"C"}}, z_{\text{"G"}}, z_{\text{"TAC"}})$
Fixed, unknown probability vector that describes sequence proportions, e.g., $z = (60\%, 30\%, 10\%)$
• $X$: random variable of sampling the bottle
$p(X = x \mid z) = z_x, \quad x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
Sampling the bottle
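Sampling the bottle can be sketched in a few lines. This is a minimal illustration using the example proportions from the slide; the function name `sample_bottle` and the seed are hypothetical choices.

```python
import random

SEQUENCES = ["C", "G", "TAC"]

def sample_bottle(z, n, rng):
    """Draw n independent sequences X with p(X = x | z) = z[x]."""
    weights = [z[x] for x in SEQUENCES]
    return rng.choices(SEQUENCES, weights=weights, k=n)

# Example bottle content from the slide: z = (60%, 30%, 10%).
z = {"C": 0.60, "G": 0.30, "TAC": 0.10}
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = sample_bottle(z, 10_000, rng)

# Empirical proportions approach z as the number of draws grows.
empirical = {x: draws.count(x) / len(draws) for x in SEQUENCES}
print(empirical)
```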
5. Pipeline error matrix $A$:
        "C"    "G"    "TAC"
"C"     0.92   0.07   0.02
"G"     0.04   0.90   0.03
"TAC"   0.04   0.03   0.95
Pipeline error: what’s sampled ≠ what’s called
• $\tilde{X}$: random variable of sampling the bottle through the pipeline
• Pipeline error matrix: stochastic matrix $A = (A_{x,y})$
$p(\tilde{X} = x \mid X = y, A) = A_{x,y}, \quad x, y \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
Fixed, unknown
Example: probability of “C” being called as “G”
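The error matrix can be written down and queried directly. The sketch below assumes that rows index the called sequence $x$ and columns index the sampled sequence $y$. That orientation is inferred from the definition $A_{x,y} = p(\tilde{X} = x \mid X = y, A)$ and from the fact that each column of the example matrix sums to 1. The helper names are hypothetical.

```python
SEQUENCES = ["C", "G", "TAC"]

# A[x][y] = p(called x | sampled y): the outer key is the called sequence (row),
# the inner key is the sampled sequence (column). Orientation is an assumption.
A = {
    "C":   {"C": 0.92, "G": 0.07, "TAC": 0.02},
    "G":   {"C": 0.04, "G": 0.90, "TAC": 0.03},
    "TAC": {"C": 0.04, "G": 0.03, "TAC": 0.95},
}

def column_sums(A):
    """Sum over called sequences for each sampled sequence; each should be 1."""
    return {y: sum(A[x][y] for x in SEQUENCES) for y in SEQUENCES}

def call_probability(A, sampled, called):
    """Probability that a sampled sequence is reported as `called`."""
    return A[called][sampled]

print(column_sums(A))
print(call_probability(A, sampled="C", called="G"))  # 0.04
```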
6. Max error probability
$A$ =
        "C"    "G"    "TAC"
"C"     0.92   0.07   0.02
"G"     0.04   0.90   0.03
"TAC"   0.04   0.03   0.95
Max error probability: $r(A) = \max_{x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}} (1 - A_{x,x})$
• Max probability of a sequence being called something else
• Measures how good a pipeline is
Example 1: $A = \mathrm{Id}$, $r(A) = 0$, error-free pipeline
Example 2: matrix $A$ above, $r(A) = 0.1$
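The max error probability depends only on the diagonal of the error matrix, so it is short to compute. A minimal sketch follows; the function name `max_error_probability` is hypothetical and the matrix is the example one from the slide.

```python
def max_error_probability(A):
    """r(A) = max over x of 1 - A[x][x], the worst chance of a miscall."""
    return max(1.0 - A[x][x] for x in A)

# Example 1: identity matrix, an error-free pipeline.
identity = {
    "C":   {"C": 1.0, "G": 0.0, "TAC": 0.0},
    "G":   {"C": 0.0, "G": 1.0, "TAC": 0.0},
    "TAC": {"C": 0.0, "G": 0.0, "TAC": 1.0},
}

# Example 2: the error matrix shown on the slide.
A = {
    "C":   {"C": 0.92, "G": 0.07, "TAC": 0.02},
    "G":   {"C": 0.04, "G": 0.90, "TAC": 0.03},
    "TAC": {"C": 0.04, "G": 0.03, "TAC": 0.95},
}

print(max_error_probability(identity))     # 0.0
print(round(max_error_probability(A), 2))  # 0.1
```

In Example 2 the maximum is attained at "G", whose diagonal entry 0.90 is the smallest.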
7. Sampling through the pipeline $\tilde{X}$ → adjusted bottle content $\tilde{z}$:
$\tilde{z}_x = p(\tilde{X} = x \mid A, z), \quad x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
A perturbed version of the actual bottle content $z$:
$z_x = p(X = x \mid z), \quad x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
Related by
$\tilde{z} = A z$
Pipeline:
• Sampling the bottle $n$ times independently
• Count sequences $M = (M_{\text{"C"}}, M_{\text{"G"}}, M_{\text{"TAC"}})$ out of $n$ samples
$M \mid A, z \sim \mathrm{Multinomial}(n, \tilde{z})$
$n$ = coverage – number of duplicates
Error matrix in pipeline
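The relation between the actual and the adjusted bottle content, and the multinomial counts that follow from it, can be sketched as below. The sketch reuses the example matrix and the example bottle content from earlier slides and assumes rows of $A$ index the called sequence. The names `adjust` and `sample_counts` are hypothetical.

```python
import random

SEQUENCES = ["C", "G", "TAC"]

# A[x][y] = p(called x | sampled y); example matrix from the slides.
A = {
    "C":   {"C": 0.92, "G": 0.07, "TAC": 0.02},
    "G":   {"C": 0.04, "G": 0.90, "TAC": 0.03},
    "TAC": {"C": 0.04, "G": 0.03, "TAC": 0.95},
}

def adjust(A, z):
    """Adjusted bottle content: z_tilde[x] = sum over y of A[x][y] * z[y]."""
    return {x: sum(A[x][y] * z[y] for y in SEQUENCES) for x in SEQUENCES}

def sample_counts(z_tilde, n, rng):
    """Draw M ~ Multinomial(n, z_tilde) as counts of n independent calls."""
    weights = [z_tilde[x] for x in SEQUENCES]
    calls = rng.choices(SEQUENCES, weights=weights, k=n)
    return {x: calls.count(x) for x in SEQUENCES}

z = {"C": 0.60, "G": 0.30, "TAC": 0.10}
z_tilde = adjust(A, z)
print({x: round(v, 3) for x, v in z_tilde.items()})
# {'C': 0.575, 'G': 0.297, 'TAC': 0.128}

counts = sample_counts(z_tilde, 30, random.Random(0))
print(counts)
```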
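8. Q: What can we do with max error probability?
A: Obtain a confidence interval for the bottle content
Example: $n = 30$, $(M_{\text{"C"}}, M_{\text{"G"}}, M_{\text{"TAC"}}) = (3, 9, 18)$, $r(A) \le 0.05$
Hoeffding’s inequality ⇒
$p\left(\frac{9}{30} - \varepsilon \le \tilde{z}_{\text{"G"}} \le \frac{9}{30} + \varepsilon\right) \ge 1 - 2e^{-2 \cdot 30 \varepsilon^2}$
Definition of max error ⇒
$\frac{\tilde{z}_{\text{"G"}} - r(A)}{1 - r(A)} \le z_{\text{"G"}} \le \frac{\tilde{z}_{\text{"G"}}}{1 - r(A)}$
Set $\varepsilon$ as 0.2, combine above
⇒ $p(0.05 \le z_{\text{"G"}} \le 0.53) \ge 0.81$
Applying max error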
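The arithmetic of this example can be reproduced with a short script. It is a sketch of the two steps on the slide: a Hoeffding bound on the adjusted frequency, then the max error bound to get back to the actual frequency. It evaluates the max error bound at the largest allowed value of $r(A)$, which gives the widest interval. Function names are hypothetical.

```python
import math

def hoeffding_confidence(n, eps):
    """Lower bound on p(|M_x / n - z_tilde_x| <= eps) for n independent samples."""
    return 1.0 - 2.0 * math.exp(-2.0 * n * eps ** 2)

def true_frequency_interval(count, n, eps, r_max):
    """Interval for z_x from the interval for z_tilde_x, assuming r(A) <= r_max."""
    lo_tilde = max(0.0, count / n - eps)
    hi_tilde = min(1.0, count / n + eps)
    # (z_tilde - r) / (1 - r) <= z <= z_tilde / (1 - r), worst case at r = r_max.
    lower = max(0.0, (lo_tilde - r_max) / (1.0 - r_max))
    upper = min(1.0, hi_tilde / (1.0 - r_max))
    return lower, upper

# Slide example: n = 30, M_"G" = 9, eps = 0.2, r(A) <= 0.05.
confidence = hoeffding_confidence(30, 0.2)
lower, upper = true_frequency_interval(9, 30, 0.2, 0.05)
print(round(confidence, 3), round(lower, 3), round(upper, 3))
# 0.819 0.053 0.526
```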
9. Test:
$n^t = 30$, $(M^t_{\text{"C"}}, M^t_{\text{"G"}}, M^t_{\text{"TAC"}}) = (3, 9, 18)$, $r(A^t) \le 0.05$
Benchmark:
$n^b = 100$, $(M^b_{\text{"C"}}, M^b_{\text{"G"}}, M^b_{\text{"TAC"}}) = (70, 5, 25)$, $r(A^b) = 0$
How accurate is $r(A^t) \le 0.05$?
Take Bayesian approach
• View bottle content $Z$ and error matrices $A^t$, $A^b$ as random variables
• Choose their prior distributions
• Compute
$p(r(A^t) \le 0.05 \mid M^t = (3, 9, 18), M^b = (70, 5, 25), r(A^b) = 0)$
Benchmark max error
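One way to approximate this posterior probability is Monte Carlo with likelihood weighting. The slides do not specify the priors or the computation, so everything below is an assumption made for illustration: a uniform Dirichlet prior on $Z$, independent Dirichlet priors on the columns of $A^t$ concentrated on the diagonal (hypothetical parameters `diag_alpha` and `off_alpha`), and $r(A^b) = 0$ read as an error-free benchmark whose counts are multinomial in $Z$ itself. The estimate depends strongly on these choices.

```python
import math
import random

K = 3  # sequences "C", "G", "TAC", in that order

def dirichlet(alphas, rng):
    """Sample a probability vector from Dirichlet(alphas) via gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def log_kernel(counts, probs):
    """Multinomial log-likelihood, dropping the coefficient (constant in probs)."""
    total = 0.0
    for m, p in zip(counts, probs):
        if m > 0:
            if p <= 0.0:
                return float("-inf")
            total += m * math.log(p)
    return total

def posterior_prob_small_error(m_test, m_bench, threshold, n_draws, rng,
                               diag_alpha=50.0, off_alpha=1.0):
    """Estimate p(r(A^t) <= threshold | M^t, M^b, r(A^b) = 0) under assumed priors."""
    log_weights, hits = [], []
    for _ in range(n_draws):
        z = dirichlet([1.0] * K, rng)  # assumed prior on bottle content Z
        # Assumed prior on A^t: column y (sampled y) is Dirichlet, heavy on the diagonal.
        cols = [dirichlet([diag_alpha if x == y else off_alpha for x in range(K)], rng)
                for y in range(K)]
        # z_tilde[x] = sum over y of A[x][y] * z[y], with A[x][y] = cols[y][x].
        z_tilde = [sum(cols[y][x] * z[y] for y in range(K)) for x in range(K)]
        # Test counts follow z_tilde; error-free benchmark counts follow z.
        log_weights.append(log_kernel(m_test, z_tilde) + log_kernel(m_bench, z))
        hits.append(max(1.0 - cols[x][x] for x in range(K)) <= threshold)
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    return sum(w for w, h in zip(weights, hits) if h) / sum(weights)

posterior = posterior_prob_small_error((3, 9, 18), (70, 5, 25), 0.05,
                                       20_000, random.Random(0))
print(posterior)
```

Using the prior as the proposal keeps the sketch short but wastes most draws when the data are informative; a real analysis would likely need a better sampler and a justified prior.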
10. Decompose a reference genome into tiles
Implementation using tiling
ref
Tag (24-mer, unique)
Tile (at least 250 base pairs)
Tags → coordinate system → carry out computation on each tile variant
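A toy version of the decomposition can make the tag and tile constraints concrete. The slide gives only the tag length (24) and the minimum tile length (250). The greedy rule used here for choosing tags, the function names, and the random reference string are all assumptions for illustration, not the authors' implementation.

```python
import random
from collections import Counter

TAG_LEN = 24        # tag length from the slide (24-mer)
MIN_TILE_LEN = 250  # minimum tile length from the slide

def choose_tags(ref, tag_len=TAG_LEN, min_tile_len=MIN_TILE_LEN):
    """Greedily pick start positions of unique tags, at least min_tile_len apart."""
    last_start = len(ref) - tag_len
    kmer_counts = Counter(ref[i:i + tag_len] for i in range(last_start + 1))
    tags, next_allowed = [], 0
    for i in range(last_start + 1):
        # A tag must occur exactly once in the reference (assumed uniqueness rule).
        if i >= next_allowed and kmer_counts[ref[i:i + tag_len]] == 1:
            tags.append(i)
            next_allowed = i + min_tile_len
    return tags

def tiles_from_tags(ref, tags):
    """Cut the reference at tag starts; tile k runs from tag k to tag k + 1."""
    bounds = tags + [len(ref)]
    return [ref[bounds[k]:bounds[k + 1]] for k in range(len(tags))]

# Hypothetical toy reference: 2000 random bases.
rng = random.Random(1)
ref = "".join(rng.choice("ACGT") for _ in range(2000))
tags = choose_tags(ref)
tiles = tiles_from_tags(ref, tags)
print(len(tiles), [len(t) for t in tiles])
```

With tag positions acting as coordinates, per-site computations such as the frequency estimates above could then be carried out tile by tile. In this toy version the last tile may be shorter than the minimum length.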