A statistical framework for benchmarking variant calling pipelines
Jiayong Li and Alexander Wait Zaranek
Curoverse
Jan 28, 2016
Introduction
Use the benchmark call set as truth against which the test pipeline is compared
• Benchmark for germline variant calling
  Allele frequency is 50/50 (het) or 100 (hom), so counting concordance is sufficient
• Benchmark for somatic variant calling
  Allele frequency is unknown
What is “good”?
Example: fix a variant site of a person, ref: “T”
• Germline: hom ref
• Somatic:
  Test: “C”: 10%, “G”: 30%, “TAC”: 60%
  Benchmark: “C”: 70%, “G”: 5%, “TAC”: 25%
Same called variants, different allele frequencies
Q: How good is the test pipeline?
Q: What is “good”?
Sampling the bottle
Pipeline: randomly sample the bottle, then make calls
• Bottle content: $z = (z_{\text{"C"}}, z_{\text{"G"}}, z_{\text{"TAC"}})$
  Fixed, unknown probability vector that describes sequence proportions, e.g., $z = (60\%, 30\%, 10\%)$
• $X$: random variable of sampling the bottle
  $p(X = x \mid z) = z_x, \quad x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
Pipeline error matrix
Pipeline error: what's sampled ≠ what's called
• $\tilde{X}$: random variable of sampling the bottle through the pipeline
• Pipeline error matrix: stochastic matrix $A = (A_{x,y})$, fixed, unknown
  $p(\tilde{X} = x \mid X = y, A) = A_{x,y}, \quad x, y \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
Example (rows: called sequence $x$; columns: sampled sequence $y$; each column sums to 1):
  A =
            "C"     "G"     "TAC"
    "C"     0.92    0.07    0.02
    "G"     0.04    0.90    0.03
    "TAC"   0.04    0.03    0.95
  Probability of “C” being called as “G”: $A_{\text{"G"},\text{"C"}} = 0.04$
Max error probability
Max error probability: $r(A) = \max_{x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}} (1 - A_{x,x})$
• Max probability of a sequence being called something else
• Measures how good a pipeline is
Example 1: $A = \mathrm{Id}$, $r(A) = 0$, error-free pipeline
Example 2:
  A =
            "C"     "G"     "TAC"
    "C"     0.92    0.07    0.02
    "G"     0.04    0.90    0.03
    "TAC"   0.04    0.03    0.95
  $r(A) = 0.1$
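
As a quick check of the definition, the sketch below recomputes the max error probability for the example matrix and confirms that each column sums to 1. It assumes rows index the called sequence and columns index the sampled sequence, which is what $p(\tilde{X} = x \mid X = y, A) = A_{x,y}$ implies; the names `SEQS`, `column_sums` and `max_error` are illustrative and do not come from the slides.

```python
# Sequences at the example site; the same order indexes rows and columns.
SEQS = ["C", "G", "TAC"]

# Example error matrix from the slide.
# Assumed layout: A[i][j] = p(called SEQS[i] | sampled SEQS[j]).
A = [
    [0.92, 0.07, 0.02],
    [0.04, 0.90, 0.03],
    [0.04, 0.03, 0.95],
]


def column_sums(matrix):
    """Sum of each column; every sum should be 1 for a column-stochastic matrix."""
    size = len(matrix)
    return [sum(row[j] for row in matrix) for j in range(size)]


def max_error(matrix):
    """r(A): the largest probability that a sampled sequence is called as something else."""
    return max(1.0 - matrix[i][i] for i in range(len(matrix)))


col_sums = column_sums(A)
r = max_error(A)

# Probability that a sampled "C" is called as "G": row "G", column "C".
p_c_called_g = A[SEQS.index("G")][SEQS.index("C")]

print("column sums:", [round(s, 6) for s in col_sums])
print("max error probability:", round(r, 6))
print('p("C" called as "G"):', p_c_called_g)
```

The worst diagonal entry is the one for "G" (0.90), which is where the value 0.1 in Example 2 comes from.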
Error matrix in pipeline
Sampling through the pipeline $\tilde{X}$ → adjusted bottle content $\tilde{z}$:
  $\tilde{z}_x = p(\tilde{X} = x \mid A, z), \quad x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
A perturbed version of the actual bottle content $z$:
  $z_x = p(X = x \mid z), \quad x \in \{\text{"C"}, \text{"G"}, \text{"TAC"}\}$
Related by
  $\tilde{z} = A z$
Pipeline:
• Sample the bottle $n$ times independently
• Count sequences $M = (M_{\text{"C"}}, M_{\text{"G"}}, M_{\text{"TAC"}})$ out of $n$ samples
  $M \mid A, z \sim \mathrm{Multinomial}(n, \tilde{z})$
  $n$ = coverage − number of duplicates
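
To make the relation between the actual and the adjusted bottle content concrete, here is a minimal sketch that applies the example error matrix to the example bottle content (60%, 30%, 10%) and then draws counts from the resulting multinomial. The seed, the choice of n = 30 and the function names are assumptions made only for illustration.

```python
import random
from collections import Counter

SEQS = ["C", "G", "TAC"]

# Example error matrix; assumed layout A[i][j] = p(called SEQS[i] | sampled SEQS[j]).
A = [
    [0.92, 0.07, 0.02],
    [0.04, 0.90, 0.03],
    [0.04, 0.03, 0.95],
]

# Example bottle content from the slides (in practice this vector is unknown).
z = [0.60, 0.30, 0.10]


def adjust(matrix, probs):
    """Adjusted bottle content: the matrix-vector product A z."""
    return [sum(a * p for a, p in zip(row, probs)) for row in matrix]


def sample_counts(probs, n, rng):
    """Draw n independent calls and count them, i.e. one Multinomial(n, probs) draw."""
    draws = rng.choices(SEQS, weights=probs, k=n)
    counts = Counter(draws)
    return [counts[s] for s in SEQS]


z_tilde = adjust(A, z)
rng = random.Random(0)  # fixed seed so the run is repeatable
M = sample_counts(z_tilde, 30, rng)

print("adjusted bottle content:", [round(p, 3) for p in z_tilde])
print("counts out of 30 samples:", dict(zip(SEQS, M)))
```

For this input the adjusted content works out to roughly (0.575, 0.297, 0.128), a small perturbation of the actual content, as the slide describes.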
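Applying max error
Q: What can we do with max error probability?
A: Obtain a confidence interval for the bottle content
Example: $n = 30$, $(M_{\text{"C"}}, M_{\text{"G"}}, M_{\text{"TAC"}}) = (3, 9, 18)$, $r(A) \le 0.05$
Hoeffding's inequality ⇒
  $p\left(\frac{9}{30} - \varepsilon \le \tilde{z}_{\text{"G"}} \le \frac{9}{30} + \varepsilon\right) \ge 1 - 2e^{-2 \cdot 30 \varepsilon^2}$
Definition of max error ⇒
  $\frac{\tilde{z}_{\text{"G"}} - r(A)}{1 - r(A)} \le z_{\text{"G"}} \le \frac{\tilde{z}_{\text{"G"}}}{1 - r(A)}$
Set $\varepsilon = 0.2$ and combine the above
  ⇒ $p(0.05 \le z_{\text{"G"}} \le 0.52) \ge 0.81$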
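
The two steps of this example can be reproduced numerically. The bound from the definition of max error follows because the adjusted proportion of "G" is at least $(1 - r(A))\, z_{\text{"G"}}$ (the diagonal entry is at least $1 - r(A)$) and at most $z_{\text{"G"}} + r(A)(1 - z_{\text{"G"}})$ (each off-diagonal entry in a column is at most $r(A)$). The sketch below is one way to carry out the arithmetic; the function names are illustrative, and plugging in the upper bound 0.05 for $r(A)$ is the conservative choice because both endpoints widen as $r(A)$ grows.

```python
import math


def hoeffding_interval(count, n, eps):
    """Two-sided Hoeffding interval for the adjusted proportion of one sequence.

    Returns (low, high, confidence), where confidence = 1 - 2 exp(-2 n eps^2).
    """
    p_hat = count / n
    low = max(0.0, p_hat - eps)
    high = min(1.0, p_hat + eps)
    confidence = 1.0 - 2.0 * math.exp(-2.0 * n * eps ** 2)
    return low, high, confidence


def unadjust(low_tilde, high_tilde, r):
    """Map an interval for the adjusted proportion to one for the actual proportion.

    Uses (z_tilde - r) / (1 - r) <= z <= z_tilde / (1 - r), with r an upper
    bound on the max error probability.
    """
    low = max(0.0, (low_tilde - r) / (1.0 - r))
    high = min(1.0, high_tilde / (1.0 - r))
    return low, high


# Example from the slide: 9 of 30 samples are "G", eps = 0.2, r(A) <= 0.05.
low_tilde, high_tilde, confidence = hoeffding_interval(9, 30, 0.2)
low, high = unadjust(low_tilde, high_tilde, 0.05)

print("interval for adjusted proportion:", (round(low_tilde, 4), round(high_tilde, 4)))
print("interval for actual proportion:", (round(low, 4), round(high, 4)))
print("confidence at least:", round(confidence, 4))
```

The unrounded endpoints are about 0.0526 and 0.5263 with confidence about 0.8186, which matches the slide's figures up to rounding.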
Benchmark max error
Test: $n_t = 30$, $(M^t_{\text{"C"}}, M^t_{\text{"G"}}, M^t_{\text{"TAC"}}) = (3, 9, 18)$, $r(A^t) \le 0.05$
Benchmark: $n_b = 100$, $(M^b_{\text{"C"}}, M^b_{\text{"G"}}, M^b_{\text{"TAC"}}) = (70, 5, 25)$, $r(A^b) = 0$
How accurate is $r(A^t) \le 0.05$?
Take a Bayesian approach
• View bottle content $Z$ and error matrices $A^t$, $A^b$ as random variables
• Choose their prior distributions
• Compute
  $p\left(r(A^t) \le 0.05 \mid M^t = (3, 9, 18),\ M^b = (70, 5, 25),\ r(A^b) = 0\right)$
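
The slides do not say which priors are used, so the following is only a sketch of how such a posterior probability could be estimated, under assumptions chosen here for convenience: a uniform Dirichlet prior on the bottle content, and an independent Dirichlet prior on each column of the test error matrix that favours the diagonal (the `diag` and `off` parameters are hypothetical). Because a benchmark max error of 0 means the benchmark error matrix is the identity, the benchmark counts update the bottle content by Dirichlet conjugacy; the test counts then enter as importance weights.

```python
import math
import random


def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def log_multinomial_kernel(counts, probs):
    """Multinomial log-likelihood without the coefficient, which cancels in the ratio."""
    return sum(m * math.log(p) for m, p in zip(counts, probs))


def posterior_prob_max_error(m_test, m_bench, r0, diag=18.0, off=1.0,
                             n_draws=20000, seed=0):
    """Self-normalised importance-sampling estimate of p(r(A_t) <= r0 | data).

    Assumed priors (not from the slides): Z ~ Dirichlet(1, ..., 1); each column
    of A_t ~ Dirichlet with weight `diag` on the diagonal and `off` elsewhere.
    """
    rng = random.Random(seed)
    k = len(m_test)
    log_weights, hits = [], []
    for _ in range(n_draws):
        # Posterior of the bottle content given the error-free benchmark counts.
        z = dirichlet([1.0 + m for m in m_bench], rng)
        # cols[j][i] = A_t[i][j]: how a sampled sequence j is distributed over calls i.
        cols = [dirichlet([diag if i == j else off for i in range(k)], rng)
                for j in range(k)]
        z_tilde = [sum(cols[j][i] * z[j] for j in range(k)) for i in range(k)]
        log_weights.append(log_multinomial_kernel(m_test, z_tilde))
        hits.append(max(1.0 - cols[j][j] for j in range(k)) <= r0)
    top = max(log_weights)  # subtract the maximum to avoid underflow
    weights = [math.exp(lw - top) for lw in log_weights]
    return sum(w for w, h in zip(weights, hits) if h) / sum(weights)


estimate = posterior_prob_max_error([3, 9, 18], [70, 5, 25], 0.05)
print("estimated p(r(A_t) <= 0.05 | data):", estimate)
```

With test proportions this far from the benchmark proportions, the estimate should come out very small under these priors; a different prior choice would change the number, which is why the choice of prior is listed as an explicit step above.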
Implementation using tiling
Decompose a reference genome into tiles
• Tag: 24-mer, unique
• Tile: at least 250 base pairs
Tags → coordinate system → carry out computation on each tile variant
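
The slide gives only the constraints on tags and tiles, not the procedure for choosing them. As a rough illustration of those constraints, the sketch below greedily picks unique 24-mers as tags so that consecutive tags are at least 250 bases apart, and treats the stretch from one tag to the next as a tile. The greedy rule, the tile boundaries and the function names are assumptions made here; an actual tiling implementation may define them differently.

```python
import random
from collections import Counter


def choose_tag_starts(ref, k=24, min_tile=250):
    """Greedily pick start positions of unique k-mers spaced at least min_tile apart."""
    n_kmers = len(ref) - k + 1
    counts = Counter(ref[i:i + k] for i in range(n_kmers))
    starts = []
    for i in range(n_kmers):
        if counts[ref[i:i + k]] != 1:
            continue  # this k-mer occurs more than once, so it cannot serve as a tag
        if not starts or i - starts[-1] >= min_tile:
            starts.append(i)
    # If the final tile would be too short, merge it into the previous one.
    if len(starts) > 1 and len(ref) - starts[-1] < min_tile:
        starts.pop()
    return starts


def tiles_from_tags(ref, starts):
    """Half-open (start, end) intervals: each tile runs from one tag to the next."""
    bounds = starts + [len(ref)]
    return list(zip(bounds, bounds[1:]))


# Toy reference: a random 2000-base sequence, used only to exercise the code.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(2000))

tag_starts = choose_tag_starts(ref)
tiles = tiles_from_tags(ref, tag_starts)

for start, end in tiles:
    print(f"tile [{start}, {end}) length {end - start}, tag {ref[start:start + 24]}")
```

Each tile is then addressed by the position of its tag in the tag sequence rather than by base offset, which is the sense in which tags provide a coordinate system.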
