Remaining benchmarking challenges even with “Q100” genomes
Benchmarking small and large variants in tandem repeats
Accurate detection of variants is important for clinical and research use. NIST hosts the Genome in a Bottle
Consortium, which develops metrology infrastructure for characterization of human whole genome variant
detection. GIAB has characterized increasingly challenging variants and regions since it was formed in 2012.
Consortium products include:
• Benchmarks and extensive WGS for seven broadly-consented human genomes, including 2 son-mother-father trios, released as NIST Reference Materials (RMs)
• Benchmarking tools for robust and standardized variant comparison
Overview
Genome in a Bottle benchmarks in the era of complete human genomes
Nathan D. Olson (1), Justin Wagner (1), Nathan Dwarshuis (1), Jennifer McDaniel (1), Adam English (2), Fritz Sedlazeck (2), Justin M. Zook (1), and the Genome in a Bottle Consortium
1: Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
2: Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Ongoing and Future work
How different repeat types cause challenges with variant calling
Led by Adam English and Fritz Sedlazeck at BCM
• Catalog of STR and VNTR regions derived from multiple sources, annotated in a standardized way
• New truvari refine module to compare different representations of complex variants in TRs
• v1.0 benchmark for indels and SVs >=5bp in TRs
• 124,728 small and 17,988 large variants
• ~8% of the genome, but ~25% of variants in HG002
Benchmark development: https://github.com/ACEnglish/adotto
Benchmarking tool: https://github.com/ACEnglish/truvari
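As a rough worked example of the tandem repeat figures above, the approximate shares quoted on this poster (~8% of the genome, ~25% of variants in HG002) imply that variant density inside the TR regions is several-fold higher than outside them. The function name below is illustrative, and the inputs are the rounded poster values, so the result is only an order-of-magnitude estimate.

```python
def density_enrichment(genome_fraction: float, variant_fraction: float) -> float:
    """Ratio of variant density inside a region set to the density outside it.

    genome_fraction:  share of the genome covered by the regions (0..1)
    variant_fraction: share of all variants that fall in the regions (0..1)
    """
    if not (0 < genome_fraction < 1 and 0 < variant_fraction < 1):
        raise ValueError("fractions must be strictly between 0 and 1")
    inside = variant_fraction / genome_fraction
    outside = (1 - variant_fraction) / (1 - genome_fraction)
    return inside / outside


# Rounded values from the poster: ~8% of the genome, ~25% of variants.
tr_enrichment = density_enrichment(genome_fraction=0.08, variant_fraction=0.25)
print(f"Variant density in TR regions is about {tr_enrichment:.1f}x the rest of the genome")
```

With these rounded inputs the ratio comes out near 3.8, which is one way to see why tandem repeats get a dedicated benchmark.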
Genome in a Bottle Consortium
• New broadly-consented Tumor-Normal pair (see poster PB5090)
• Expand use of genome stratifications (see poster PB3519 from Nate Dwarshuis)
• New GIABv3 GRCh38 reference with masked false duplications and new decoy sequences
• GIAB data portal under development to make it easier to find data
New collaborators welcome! Follow us via email lists at www.genomeinabottle.org and @GenomeInABottle
Recruiting experts in any variant calling method to evaluate benchmarks - please email: jzook@nist.gov
T2T X&Y Chromosomes Benchmark
Towards a “Q100” Benchmark with T2T Consortium
Olson et al, Variant calling and benchmarking in an era of complete human genome sequences, Nat Rev Genetics 2023
Homopolymers and Tandem Repeats Segmental Duplications
GIAB data for other omics (RNA-seq and methylation)
• Based on complete X and Y assemblies of HG002 from T2T Consortium
• Benchmark excludes homopolymers >30bp and some shorter homopolymers
• Working with T2T-Q100 effort below to correct these in the assembly
• Curated differences between 11 short and long read callsets and benchmark to ensure it reliably identifies errors
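The homopolymer exclusion above can be illustrated with a small sketch that scans a sequence for perfect homopolymer runs longer than 30 bp and reports them as 0-based, half-open (BED-style) intervals. This is not the procedure used to build the benchmark: the function name and the example sequence are hypothetical, and the sketch ignores imperfect homopolymers and any flanking padding.

```python
import re
from typing import Iterator, Tuple


def long_homopolymers(seq: str, min_len: int = 31) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, base) for perfect homopolymer runs of >= min_len bases.

    Coordinates are 0-based and half-open, as in BED files.
    The default min_len=31 corresponds to "longer than 30 bp".
    """
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    for match in pattern.finditer(seq.upper()):
        yield match.start(), match.end(), match.group(1)


# Hypothetical sequence: a 35 bp A run (reported) and a 30 bp T run (not reported).
example = "ACGT" + "A" * 35 + "CG" + "T" * 30 + "G"
runs = list(long_homopolymers(example))
for start, end, base in runs:
    print(f"example\t{start}\t{end}\t{base}x{end - start}")
```

Intervals like these could then be subtracted from candidate benchmark regions with any interval tool.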
• Pilot RNA-seq experiment (led by Miten Jain at Northeastern and Fritz Sedlazeck at BCM)
• Cell lines: Lymphoblastoid cell line (LCL) and 2 iPSCs from HG002; LCL from HG004 and HG005
• Public Data: Illumina mRNA and total RNA; PacBio Iso-seq; ONT cDNA and direct RNA
• Analysis: LCL vs iPSC comparison and possible isoform benchmark
• Methylation data generated with bisulfite-seq, EM-seq, HiFi, and ONT from GIAB LCLs (Foox et al)
Expanding GIAB Benchmarks
• 2023+: T2T-based benchmarks for whole genome and new cancer genomes
• 2023: v1.0 assembly-based benchmark for tandem repeats
• 2023: v1.0 assembly-based benchmark for chromosomes X & Y
• 2023: Draft benchmark for mosaic variants
• 2022: Challenging medically relevant gene assembly-based benchmark for small variants and SVs
• 2022: Small variant benchmarks (v4.2.1), from mapping short+long reads
| Bases in Benchmark Regions | SNVs | INDELs |
|---|---|---|
| 161,549,546 | 87,452 | 24,273 |
T2T diploid assembly
• Start with curated trio-based verkko assembly of HiFi and ultralong ONT (Sergey Koren, Nancy Hansen, Adam Phillippy, et al)
Polishing
• Align short reads to individual haplotypes and long reads to combined haplotypes of the assembly
• Use parental assemblies to phase collapsed heterozygous variants in homopolymers in homozygous regions
• Trio-based variant calling to identify and phase errors with Element and Onso (see Fleharty poster PB3462)
Curation
• Structural errors due to assembly errors or low HiFi coverage
• “False heterozygous” variants in assembly, mostly in homopolymers and diTRs (Element & Onso help correct homopolymers)
• “Collapsed heterozygous” variants in assembly, mostly in highly homozygous regions
Benchmark development
• Assemblies as "genome benchmarks"
• Curated alignments of assemblies to reference to benchmark reference-based small variant and SV calls
| Version | Illumina & HiFi kmer QV | kmer errors | kmer switch errors (mat/pat) | Genotype errors vs GIABv4.2.1 (SNV/indel) |
|---|---|---|---|---|
| v0.7 | 66.9 | 27,142 | 0.027/0.022 | 811/1762 |
| v0.9 | 71.8 | 8,239 | 0.0053/0.0019 | 84/353 |
| v1.0 | 75.1 | 3,906 | 0.0037/0.0011 | TBD |
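To put the QV column in perspective, the sketch below converts between a quality value and a per-base error rate, assuming the standard Phred relation QV = -10 log10(error rate). How the kmer-based QV itself was estimated is not described on this poster, so this only illustrates the scale, including what a "Q100" target would mean.

```python
import math


def phred_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_phred(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


# QV values copied from the table; Q100 added as the stated target.
for label, qv in [("v0.7", 66.9), ("v0.9", 71.8), ("v1.0", 75.1), ("Q100", 100.0)]:
    rate = phred_to_error_rate(qv)
    print(f"{label}: QV {qv} is about 1 error per {1 / rate:,.0f} bases")
```

Under this assumption, each 10-point gain in QV corresponds to a tenfold lower error rate.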
GRCh38
HG002
Mosaic variants (see draft GIAB mosaic benchmark in posters PB3382/PB5114)
https://www.nature.com/articles/s41586-023-06457-y/figures/2
How to represent TSPY2 moving 4Mbp and TSPY array copy number as variants?
New v3.3 stratifications for repeats in GRCh37/38 and T2T-CHM13
https://doi.org/10.1101/2023.10.27.563846
Complex gene conversion-like events included in XY benchmark:
Assemblies and data at: https://github.com/marbl/HG002
Joint benchmarking of SNVs, indels, and SVs: exploratory work with Tim Dunn using https://github.com/TimD1/vcfdist
| Variant Type | Region | Short read (2020 pFDA): v4.2.1 | Short read: CMRG | Short read: XY | Long read (2020 pFDA): v4.2.1 | Long read: CMRG | Long read: XY |
|---|---|---|---|---|---|---|---|
| SNV | All benchmark | 0.997 | 0.977 | 0.899 | 1.000 | 0.981 | 0.932 |
| SNV | Segmental duplications | 0.951 | 0.835 | 0.600 | 0.991 | 0.893 | 0.785 |
| INDEL | All benchmark | 0.997 | 0.963 | 0.815 | 0.996 | 0.967 | 0.738 |
| INDEL | TRs | 0.993 | 0.915 | 0.721 | 0.997 | 0.955 | 0.645 |
| INDEL | Homopolymers >11bp | 0.998 | 0.972 | 0.789 | 0.990 | 0.959 | 0.677 |
| INS >15 | All benchmark | 0.960 | 0.821 | 0.538 | 0.997 | 0.919 | 0.505 |
F1 decreases as new benchmarks include more challenging variants
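For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. A minimal sketch of the calculation from true positive, false positive, and false negative counts follows; the counts in the example are hypothetical and are not taken from any GIAB comparison.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example_f1 = f1_score(tp=900, fp=50, fn=100)
print(f"precision={900 / 950:.3f} recall={900 / 1000:.3f} F1={example_f1:.3f}")
```

Because F1 penalizes both missed and false calls, harder benchmark regions lower it whether they hurt recall, precision, or both.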

GIAB_ASHG_JZook_2023.pdf
