Nathan D Olson
AMP RM Forum 2023
GIAB Update
New Benchmark Set, Samples, and Data Types
GIAB SAMPLES AND REFERENCE MATERIALS
● Genome in a Bottle Consortium develops metrology
infrastructure for benchmarking human whole genome variant
detection
● Characterization of seven broadly-consented human genomes
including two son-mother-father trios with several as NIST
RMs
2
OPEN CONSENT ENABLES SECONDARY REFERENCE SAMPLES TO
MEET SPECIFIC CLINICAL NEEDS
● >50 products now available based on broadly-consented,
well-characterized GIAB PGP cell lines
● Genomic DNA + DNA spike-ins
○ Clinical variants
○ Somatic variants
○ Difficult variants
● Clinical matrix (FFPE)
● Circulating tumor DNA
● Stem cells (iPSCs)
● Genome editing
● …
3
GIAB IMPROVES CONFIDENCE IN
GENOME SEQUENCING AND VARIANT CALLING
● REFERENCE MATERIALS AND SAMPLES
● CHARACTERIZATIONS (BENCHMARK SETS)
● REFERENCE DATA
● BENCHMARKING METHODS
4
GIAB BENCHMARK SETS
5
BENCHMARK DEVELOPMENT PROCESS
6
MOSAIC BENCHMARK
Led by Camille Daniels and Adetola Abdulkadir at MDIC
7
MOSAIC BENCHMARK SET: OBJECTIVE
• Identify and characterize low-frequency variants in the well-characterized HG002 genome (NIST reference
material 8391/NA24385) and create a mosaic benchmark
• Trio-based approach using Genome In A Bottle (GIAB) Ashkenazi Jewish genomes
- HG002 (son) - tumor
- HG003 + HG004 (combined parents) - normal
• In-silico mixtures of real data from HG002 and HG003 were analyzed to determine the theoretical limit of
detection (LOD): 5% variant allele frequency (VAF)
• New mosaic benchmark to include variants between 5% and 30% VAF
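The in-silico mixture idea can be illustrated with a small calculation. This is a minimal sketch, not the MDIC/GIAB pipeline: it assumes a site that is heterozygous in HG002 (alt fraction 0.5) and absent in HG003, and it treats the 5%-30% range as inclusive. The function names are hypothetical.

```python
def expected_vaf(hg002_fraction, hg002_alt_fraction=0.5, hg003_alt_fraction=0.0):
    """Expected variant allele frequency when HG002 reads make up
    `hg002_fraction` of an HG002 + HG003 read mixture.

    Assumption: reads are sampled in proportion to the mixture fraction.
    """
    return (hg002_fraction * hg002_alt_fraction
            + (1.0 - hg002_fraction) * hg003_alt_fraction)


def in_mosaic_range(vaf, low=0.05, high=0.30):
    """True if the VAF falls in the benchmark range (inclusive bounds assumed)."""
    return low <= vaf <= high


# Mixture fractions chosen only for illustration.
for fraction in (0.02, 0.20, 0.40, 1.00):
    vaf = expected_vaf(fraction)
    print(f"HG002 fraction {fraction:.2f}: expected VAF {vaf:.3f}, "
          f"in 5%-30% range: {in_mosaic_range(vaf)}")
```

Under these assumptions, a heterozygous HG002-only variant only reaches the 5% LOD once HG002 contributes about 10% of the reads.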
8
HG002 MOSAIC BENCHMARK GENERATION
Workflow (flowchart on slide):
● Input: GIAB AJ trio, Illumina PCR free, HiSeq 2500, 300X each (.fastqs)
● Alignment: Novoalign (GRCh38)
○ HG002 (son) .bam - tumor
○ HG003 + HG004 (combined parents) .bam - normal
● Somatic calling: Strelka2 (somatic) .vcf
● Comparison: vcfeval, callset against GIAB v4.2.1 benchmark (squash-ploidy), false positive .vcf
● Filtering: vcf reformat, normals excluded; custom scripts using AJ trio benchmark
and mosaic intersections to exclude complex variants
● Output: list of potential mosaic variants (.vcf)
Variant counts:
● Strelka2: 1,273,474
● True positives: 389,494
● False positives: 425,679
● Potential mosaics: 366,728
Potential mosaic variants overview
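The candidate-selection logic (Strelka2 calls that are not in the v4.2.1 benchmark and not seen in the normals) can be sketched with plain set operations. This is a deliberate simplification: vcfeval performs representation-aware matching, whereas the sketch below assumes variants are already normalized and compares exact (chrom, pos, ref, alt) keys. Ignoring genotype here loosely mirrors squash-ploidy. All variant records are made up.

```python
def potential_mosaics(somatic_calls, benchmark_variants, normal_variants):
    """Return calls that are absent from both the benchmark and the normals.

    Each variant is a (chrom, pos, ref, alt) tuple; genotype is ignored.
    Exact-key matching is an assumption of this sketch, not how vcfeval works.
    """
    benchmark = set(benchmark_variants)
    normals = set(normal_variants)
    return [v for v in somatic_calls if v not in benchmark and v not in normals]


# Toy records, for illustration only.
calls = [
    ("chr1", 1000, "A", "G"),   # germline variant, present in the benchmark
    ("chr1", 2000, "C", "T"),   # also seen in the combined parents
    ("chr2", 3000, "G", "A"),   # candidate mosaic
]
benchmark = [("chr1", 1000, "A", "G")]
normals = [("chr1", 2000, "C", "T")]

print(potential_mosaics(calls, benchmark, normals))
```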
● 1,930 candidate variants
(passing)
- 1,915 SNVs
- 15 indels
- 178 [5%-30%] VAF
● 364,798 putative variants
(non-passing)
- 364,792 SNVs
- 6 indels
- 15,743 [5%-30%] VAF
• 125 potential mosaic variants passed
decision tree heuristics for manual curation
- 105 easy-to-map with combined orthogonal
lower confidence interval >=5%
- 20 not easy-to-map, non-homopolymer,
PacBio lower confidence interval >=5%
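The "lower confidence interval >=5%" criterion can be illustrated as follows. The slide does not say how the interval is computed, so this sketch assumes a 95% Wilson score interval on the alt-read fraction; the read counts are invented.

```python
import math


def wilson_lower_bound(alt_reads, total_reads, z=1.96):
    """Lower bound of the Wilson score interval for a binomial proportion.

    Assumption: z=1.96 (about 95% confidence). The interval method actually
    used for the mosaic benchmark is not stated in the slides.
    """
    if total_reads == 0:
        return 0.0
    p = alt_reads / total_reads
    z2 = z * z
    denominator = 1.0 + z2 / total_reads
    centre = p + z2 / (2.0 * total_reads)
    margin = z * math.sqrt(p * (1.0 - p) / total_reads
                           + z2 / (4.0 * total_reads ** 2))
    return (centre - margin) / denominator


# Hypothetical sites at roughly 300X depth.
for alt, depth in ((30, 300), (18, 300)):
    lower = wilson_lower_bound(alt, depth)
    print(f"{alt}/{depth} alt reads: lower bound {lower:.3f}, "
          f"passes >=5%: {lower >= 0.05}")
```

The second site shows why the bound matters: an observed VAF of 6% is above the LOD, but its lower bound is not.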
9
TANDEM REPEAT BENCHMARK
Led by Adam English and Fritz Sedlazeck at Baylor College of Medicine
10
HG002 TANDEM REPEAT BENCHMARK
● Catalog of STR and VNTR regions
○ derived from multiple sources, annotated in a
standardized way
● v1.0 benchmark for indels and SVs >=5bp in TRs
○ 124,728 small and 17,988 large variants
○ ~8% of the genome, but ~25% of variants in
HG002
11
Preprint: https://doi.org/10.1101/2023.10.29.564632
Benchmark development: https://github.com/ACEnglish/adotto
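A quick worked calculation of what "~8% of the genome, but ~25% of variants" implies, using only the approximate figures on the slide. The variable names are illustrative.

```python
# Counts and approximate shares taken from the slide.
small_variants = 124_728
large_variants = 17_988
total_benchmark_variants = small_variants + large_variants

genome_share = 0.08    # ~8% of the genome is in the TR catalog
variant_share = 0.25   # ~25% of HG002 variants fall in those regions

# How over-represented variants are relative to a uniform spread.
enrichment = variant_share / genome_share

# Variant density inside TR regions relative to density outside them.
relative_density = enrichment / ((1 - variant_share) / (1 - genome_share))

print(f"Total v1.0 TR benchmark variants: {total_benchmark_variants:,}")
print(f"Enrichment vs. uniform: {enrichment:.2f}x")
print(f"Density inside vs. outside TRs: {relative_density:.2f}x")
```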
X AND Y BENCHMARK
Led by Justin Wagner and NIST-GIAB Team
12
ASSEMBLY-BASED DRAFT BENCHMARK DEVELOPMENT PIPELINE
13
WHAT WE EXCLUDE FROM THE ASSEMBLY-BASED BENCHMARK
● Regions without the expected one contig aligned per
haplotype
○ Derived from dipcall bed file
● Large repeats if they are partially aligned
○ Segmental duplications
○ Long VNTRs and satellites
○ Assembly gaps
○ VDJ
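Excluding regions amounts to subtracting intervals from the set of confidently assembled regions. Below is a minimal single-chromosome sketch of that subtraction with half-open (start, end) coordinates as in BED. It is not the GIAB pipeline, and the coordinates are invented.

```python
def subtract_intervals(keep, exclude):
    """Remove `exclude` intervals from `keep` intervals on one chromosome.

    Intervals are half-open (start, end) tuples, as in BED files.
    Inputs may be unsorted; exclude intervals may overlap each other.
    """
    exclude = sorted(exclude)
    result = []
    for start, end in sorted(keep):
        cursor = start
        for ex_start, ex_end in exclude:
            if ex_end <= cursor:
                continue            # exclusion lies entirely before the cursor
            if ex_start >= end:
                break               # exclusion lies entirely after this interval
            if ex_start > cursor:
                result.append((cursor, ex_start))
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


# Hypothetical regions: dipcall-style confident regions minus excluded repeats.
confident = [(0, 100), (200, 300)]
excluded = [(10, 20), (90, 210), (250, 260)]
print(subtract_intervals(confident, excluded))
```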
14
HG002 CHROMOSOME XY V1.0 BENCHMARK NOW RELEASED
● Telomere-to-Telomere (T2T) consortium generated a complete assembly of HG002 X and Y for the first complete human genome
○ First T2T Y chromosome described in https://www.nature.com/articles/s41586-023-06457-y
● Preprint describing the benchmark https://doi.org/10.1101/2023.10.31.564997
                                       Short read (2020 pFDA)    Long read (2020 pFDA)
Variant type  Region                   v4.2.1  CMRG   XY         v4.2.1  CMRG   XY
SNV           All benchmark            0.997   0.977  0.899      1.000   0.981  0.932
SNV           Segmental duplications   0.951   0.835  0.600      0.991   0.893  0.785
INDEL         All benchmark            0.997   0.963  0.815      0.996   0.967  0.738
INDEL         TRs                      0.993   0.915  0.721      0.997   0.955  0.645
INDEL         Homopolymers >11bp       0.998   0.972  0.789      0.990   0.959  0.677
INS >15       All benchmark            0.960   0.821  0.538      0.997   0.919  0.505
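One way to read the table is to look at how far each score falls between the v4.2.1 benchmark and the new XY benchmark. The sketch below transcribes the values from the slide and computes that difference. The slide does not name the metric, so the numbers are treated only as generic scores between 0 and 1.

```python
# Scores transcribed from the slide; tuple order is (v4.2.1, CMRG, XY).
# The metric behind these scores is not stated in the source.
SCORES = {
    ("SNV", "All benchmark"): {"short": (0.997, 0.977, 0.899), "long": (1.000, 0.981, 0.932)},
    ("SNV", "Segmental duplications"): {"short": (0.951, 0.835, 0.600), "long": (0.991, 0.893, 0.785)},
    ("INDEL", "All benchmark"): {"short": (0.997, 0.963, 0.815), "long": (0.996, 0.967, 0.738)},
    ("INDEL", "TRs"): {"short": (0.993, 0.915, 0.721), "long": (0.997, 0.955, 0.645)},
    ("INDEL", "Homopolymers >11bp"): {"short": (0.998, 0.972, 0.789), "long": (0.990, 0.959, 0.677)},
    ("INS >15", "All benchmark"): {"short": (0.960, 0.821, 0.538), "long": (0.997, 0.919, 0.505)},
}


def drop_v421_to_xy(read_type):
    """Score difference (v4.2.1 minus XY) per table row for one read type."""
    return {
        row: round(values[read_type][0] - values[read_type][2], 3)
        for row, values in SCORES.items()
    }


for read_type in ("short", "long"):
    drops = drop_v421_to_xy(read_type)
    largest = max(drops, key=drops.get)
    print(read_type, "largest drop:", largest, drops[largest])
```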
15
FUTURE BENCHMARKS
16
HG002 “Q100” PROJECT
- T2T-GIAB collaboration to create near-perfect
“Q100” diploid assembly and
associated benchmarks
- https://github.com/marbl/HG002
- T2T team just released HG002-T2Tv1.0
- Developing draft benchmarks for small
variants and SVs, which we’ll be evaluating
- How do we benchmark the extremely
complex regions and variants?
17
BENCHMARKING METHODS
18
19
CONSIDERATIONS WHEN GENERATING AND USING
BENCHMARK SETS FOR EVALUATING VARIANT-CALLING
METHODS.
Olson et al. Nature Reviews Genetics volume 24, pages 464–483 (2023) https://rdcu.be/dqA5m
20
BENCHMARKING METHODS
● Small variants
○ hap.py - able to handle different variant
representations, can quantify performance across
multiple stratifications; no longer under active
development, with only minimal maintenance updates
○ rtgtools vcfeval - able to handle different variant
representations
● Structural variants
○ Hap_eval - new tool, able to handle different variant
representations and complex SVs
○ Truvari - New functionality to better handle complex
SVs, companion package laytr for benchmarking
performance interpretation
● Both small and structural variants
○ vcf_dist - able to simultaneously benchmark small
and structural variants, new tool under active
development
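All of these tools ultimately report counts of true positives, false positives, and false negatives, from which the usual summary metrics follow. Here is a minimal sketch of those formulas. It uses a single TP count, whereas some tools count true positives separately on the truth and query sides; the counts are invented.

```python
def benchmark_metrics(tp, fp, fn):
    """Precision, recall, and F1 from benchmarking counts.

    tp: calls matching the benchmark
    fp: calls not in the benchmark (inside benchmark regions)
    fn: benchmark variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one stratification.
precision, recall, f1 = benchmark_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```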
21
TRUVARI REFINE AND LAYTR FOR SV BENCHMARKING AND
BENCHMARKING REPORT INTERPRETATION
22
NEW V3.3 STRATIFICATIONS FOR
T2T-CHM13V2.0 (AND GRCH37/38)
● NEW for CHM13: Mappability, MHC, KIR,
rDNA, telomere, GC content, coding regions
● NEW for all references: A/T vs. G/C
homopolymers
● Preprint now posted on bioRxiv
https://doi.org/10.1101/2023.10.27.563846
● Snakemake pipeline developed for
automating stratification generation process:
https://github.com/ndwarshuis/giab-stratifications
Credit: Nate Dwarshuis
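To make the A/T vs. G/C homopolymer split concrete, the sketch below finds homopolymer runs in a sequence and labels each by base class. It is an illustration only, not the giab-stratifications pipeline, and the `min_len` threshold is a hypothetical parameter rather than a GIAB definition.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_len=4):
    """Return (start, end, base, base_class) for each homopolymer run.

    Coordinates are 0-based, half-open. `min_len` is an assumed threshold
    chosen for illustration. Runs of non-ACGT characters (e.g. N) are skipped.
    """
    runs = []
    position = 0
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_len and base in "ACGT":
            base_class = "AT" if base in "AT" else "GC"
            runs.append((position, position + length, base, base_class))
        position += length
    return runs


# Toy sequence with one T run and one G run.
print(homopolymer_runs("ACGTTTTTGGGGGGCA"))
```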
23
NEW SAMPLES
24
SOMATIC BENCHMARKS
Tumor/Normal “21st Century Cell lines”
● Develop matched tumor and normal cell lines with
explicit consent for genomic data sharing
● Initial Illumina and HiC data for a pancreatic ductal
adenocarcinoma cell line developed by Andy Liss at
MGH
● Normal pancreatic and duodenal tissue from same
patient, but no normal cell line
● Liss lab grew ~70M cell batch of tumor cell line ->
distributed for sequencing in Aug 2023
● New data manifest at https://www.nist.gov/programs-projects/cancer-genome-bottle
and data available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_somatic
25
Credit: Jenny McDaniel, Justin Wagner, and Vaidehi Patel
SEQUENCING IN PROGRESS/ PLANNED FOR LARGE BATCH OF TUMOR CELL LINE
Range Tech T N-D N-P Read length Coverage
Long
ONT (UL) X X X ~100 - 300 kb pending
ONT (duplex) X ~10 - 100 kb pending
ONT (std) X ~35 kb 45X
PacBio HiFi (Revio) X X X ~10 - 20 kb pending
Bionano Optical Mapping X 150 kb - multi Mb 400X
Arima and PhaseGenomics HiC-Illumina X X 2x150 bp pending
Karyologic karyotyping X chromosomal NA
Short
Illumina WGS X X X 2x150 bp 180X (T), 150X (N)
Element X X 75 - 150 bp pending
PacBio Onso X X 100 - 200 bp pending
Bioskryb-Illumina single-cell WGS X 2x50 bp <1X (120 cells)
26
NEW DATA TYPES
27
RNA-SEQ
led by Miten Jain at Northeastern and Fritz Sedlazeck at BCM
○ Public Data:
■ Illumina mRNA and total RNA;
■ PacBio Iso-Seq and Kinnex;
■ ONT cDNA and direct RNA
○ Analysis Plans:
■ LCL vs iPSC comparison and possible isoform benchmark
                   Cell Line
GIAB Sample        LCL       iPSC derived from LCL  iPSC derived from PBMC
AJ Son - HG002     GM24385   GM26105                GM27730
AJ Mother - HG004  GM24143
HC Son - HG005     GM24631
28
WHOLE GENOME SEQUENCING RMS (HG001-HG008)
DATA FROM NEW TECHNOLOGIES
NIH Hosted FTP Site https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/
NIH SRA https://www.ncbi.nlm.nih.gov/bioproject/200694
HPRC S3 Bucket https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0
29
30
WORK IN PROGRESS - DATA REGISTRY
31
Queryable database with pointers to
publicly available GIAB data along
with summary statistics
Data Types
Sample
FASTQs
BAMs
VCFs
Capturing methods and linking
datasets for data provenance
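As a rough illustration of what a queryable registry entry could look like, here is a hypothetical sketch. The registry is described as work in progress, so every field name, the example URLs, and the summary statistics below are assumptions and not the actual GIAB schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Hypothetical pointer to one public dataset (not the real GIAB schema)."""
    sample: str                 # e.g. "HG002"
    data_type: str              # "FASTQ", "BAM", or "VCF"
    url: str                    # where the data lives
    method: str                 # how the data was generated (provenance)
    derived_from: list = field(default_factory=list)   # URLs of parent datasets
    summary: dict = field(default_factory=dict)         # summary statistics


def query(entries, sample=None, data_type=None):
    """Return entries matching the given sample and/or data type."""
    return [
        entry for entry in entries
        if (sample is None or entry.sample == sample)
        and (data_type is None or entry.data_type == data_type)
    ]


# Placeholder entries; URLs and statistics are invented.
registry = [
    RegistryEntry("HG002", "FASTQ", "https://example.org/hg002.fastq.gz",
                  "Illumina WGS", summary={"coverage": 30}),
    RegistryEntry("HG002", "BAM", "https://example.org/hg002.bam",
                  "Aligned to GRCh38",
                  derived_from=["https://example.org/hg002.fastq.gz"]),
    RegistryEntry("HG005", "VCF", "https://example.org/hg005.vcf.gz",
                  "Variant calls"),
]

for entry in query(registry, sample="HG002"):
    print(entry.data_type, entry.url, entry.derived_from)
```

Storing `derived_from` links alongside the method is one simple way to capture provenance between FASTQs, BAMs, and VCFs.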
32
TAKE-HOME MESSAGES
● REFERENCE MATERIALS AVAILABLE FOR 5 INDIVIDUALS WITH NEW TUMOR SAMPLE
● MULTIPLE NEW BENCHMARK SETS
● NEW BENCHMARKING METHODS ENABLING COMPLEX SV BENCHMARKING
● NEW DATA TYPES – RNA AND DNA SEQ
ACKNOWLEDGEMENTS
This work is not possible without our amazing collaborations
● Chromosome X/Y benchmark evaluators and Arang Rhie and many
others for the assembly
● Tandem repeat benchmark - led by Adam English and Fritz Sedlazeck
● HG002 T2T “Q100” collaboration with Nancy Hansen, Adam Phillippy
group, and many others
● RNAseq project led by Miten Jain who also contributes lots of ONT data
● New tumor cell line developed by Andy Liss and being characterized by
many labs
● Mosaic variants with Camille Daniels, Adetola Abdulkadir, and
Maryellen De Mars at MDIC
33
NIST Genome In A Bottle Team
• Justin Zook (Group Leader)
• Justin Wagner
• Jenny McDaniel
• Nate Dwarshuis
• Vaidehi Patel
• Sierra Miller (NIST Genome Editing Team)
34
INTERESTED IN GETTING INVOLVED?
www.genomeinabottle.org - sign up for general
GIAB and Analysis Team google groups
GIAB slides:
www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
github.com/genome-in-a-bottle
Looking for Post-Docs:
Data Management, Machine learning, diploid assembly,
cancer genomes, data science, other ‘omics, …
Email: nolson@nist.gov
