September 17, 2019
Genome in a Bottle: Developing
Benchmarks for Challenging
Variants With Long Reads
www.slideshare.net/genomeinabottle
NIST Human Genomics Team
• Purpose: Inspire trust in
human genome
measurements to enable
– Technology innovation
– Clinical translation
– Science-based regulatory
oversight
– Human health
• Values:
– Understand stakeholder
needs
– Collaborate with experts and
synthesize results
• Sequencing technologies
• Informatics developers
– Open science
• Open data
• Open analyses
• Open samples
Why start Genome in a Bottle?
• A map of every individual’s
genome will soon be possible, but
how will we know if it is correct?
• Diagnostics and precision
medicine require high levels of
confidence
• Well-characterized, broadly
disseminated genomes are needed
to benchmark performance of
sequencing
• NIST and FDA funding for the work
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
Human Genome Sequencing needed a new class of
Reference Materials with billions of reference values
By Russ London at English Wikipedia, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=9923576
Many diverse contributors to GIAB
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
GIAB has characterized 7 human
genomes
• Pilot genome
– NA12878
• PGP Human
Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also
characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Design of our human genome reference values
Benchmark variant calls (Reference Values)
Variants from any method being evaluated
Benchmark regions (Reference Values)
Goal for our human genome reference values
Benchmark variant calls (Reference Values)
Variants from any method being evaluated
Benchmark regions (Reference Values)
Variants outside benchmark regions are not assessed
Majority of variants unique to method should be false positives (FPs)
Majority of variants unique to benchmark should be false negatives (FNs)
Matching variants assumed to be true positives
Goal for our human genome reference values
This does not directly give the accuracy of the reference values, but rather shows that they are fit for purpose.
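The comparison design on these slides can be sketched in a few lines. This is a minimal illustration only, not the GIAB or GA4GH implementation: the variant tuples, the `regions` dictionary, and the helper names `in_regions` and `compare` are all hypothetical, positions are assumed to be 0-based to match BED-style half-open intervals, and variants are matched by exact equality. Real comparison tools also have to reconcile different representations of the same variant, which this sketch ignores.

```python
from bisect import bisect_right

def in_regions(regions, chrom, pos):
    """Return True if 0-based `pos` falls in a half-open [start, end) interval.

    `regions` maps chromosome -> sorted, non-overlapping list of (start, end).
    """
    intervals = regions.get(chrom, [])
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

def compare(benchmark, query, regions):
    """Count TP/FP/FN inside the benchmark regions; everything else is not assessed."""
    bench = {v for v in benchmark if in_regions(regions, v[0], v[1])}
    qry = {v for v in query if in_regions(regions, v[0], v[1])}
    return {
        "TP": len(bench & qry),   # matching variants
        "FP": len(qry - bench),   # unique to the method being evaluated
        "FN": len(bench - qry),   # unique to the benchmark
        "not_assessed": len(set(query)) - len(qry),  # query calls outside the regions
    }

# Toy data (hypothetical): (chrom, pos, ref, alt)
regions = {"chr1": [(100, 200), (500, 600)]}
benchmark = [("chr1", 150, "A", "G"), ("chr1", 550, "C", "T"), ("chr1", 900, "G", "A")]
query = [("chr1", 150, "A", "G"), ("chr1", 180, "T", "C"), ("chr1", 300, "G", "C")]

result = compare(benchmark, query, regions)
print(result)
```

The query call at position 300 lies outside the benchmark regions, so it is counted as not assessed rather than as a false positive.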
GIAB Recently Published Resources for
“Easier” Small Variants
GIAB has extensive public,
unembargoed data
Short reads
• BGISEQ
• Complete
Genomics
• Illumina
• Ion Torrent
• SOLiD
Linked reads
• 10x Genomics
• BGISEQ stLFR
• Illumina 6kb
mate-pair
• HiC
Long reads
• PacBio
• PacBio CCS
• PromethION
• Ultralong Oxford
Nanopore
Optical/electronic
mapping
• BioNano
• Nabsys
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/
Now using linked and long reads for
difficult variants and regions
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
• Long Reads
– PacBio Continuous Long Reads
– PacBio Circular Consensus Seq
– Oxford Nanopore “ultralong”
– PromethION
GIAB Use Cases
• Expand small variant
benchmark
• Develop structural variant
benchmark
• Diploid assembly of difficult
regions like MHC
Expand small variant benchmark set to difficult-to-map regions
Justin Wagner, NIST
Long+Linked Reads expand small variant benchmark
GRCh37 GRCh38
                                        v3.3.2           v4beta
Base pairs                              2,353,170,731    2,509,269,277
Reference covered                       85.4%            91.03%
SNPs                                    3,028,458        3,314,941
Indels                                  476,514          519,494
Base pairs in Segmental Duplications    5,382,891        73,819,342

                                        v3.3.2           v4beta
Base pairs                              2,358,060,765    2,504,027,936
Reference covered                       87.8%            93.2%
SNPs                                    3,046,933        3,323,773
Indels                                  465,670          519,152
Base pairs in Segmental Duplications    13,722,546       64,300,499
Benchmark includes more bases, variants, and segmental duplications in v4α
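The "Base pairs" and "Reference covered" rows come from the benchmark regions bed file. As a rough sketch of how such numbers are obtained, the snippet below sums interval lengths in a BED stream and divides by a reference size. The function name `bed_total_bp`, the toy intervals, and the `reference_bp` denominator are hypothetical; the sketch assumes the intervals do not overlap, and the slides do not say which reference size was used as the denominator.

```python
import io

def bed_total_bp(handle):
    """Sum interval lengths in a BED stream (0-based, half-open; assumes no overlaps)."""
    total = 0
    for line in handle:
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        total += int(end) - int(start)
    return total

# Toy BED content (hypothetical intervals, not GIAB data)
example_bed = io.StringIO("chr1\t100\t200\nchr1\t500\t650\nchr2\t0\t50\n")
covered = bed_total_bp(example_bed)

reference_bp = 1000  # hypothetical denominator for illustration only
fraction = covered / reference_bp
print(f"{covered} bp covered, {fraction:.1%} of the reference")
```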
Small variant performance metrics
decrease vs. new benchmark
Comparison of Illumina GATK4 VCF against benchmark sets
• SNP FN rate increases by a factor of 10
– almost entirely due to new benchmark variants in difficult to
map regions
Subset                   v3.3.2 Recall   v4 Recall   v3.3.2 Precision   v4 Precision
All SNPs                 0.9995          0.9914      0.9981             0.9941
Difficult to map SNPs    0.9474          0.4916      0.8911             0.7171
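The false negative rate is one minus recall, so the change between benchmark versions can be worked out directly from the table. The sketch below does that arithmetic; the dictionary layout and the helper names `fn_rate` and `fold_change` are illustrative assumptions. The recall values on the slide are rounded to four decimals, so the resulting ratios are approximate.

```python
# Recall values copied from the table on this slide
recall = {
    "All SNPs": {"v3.3.2": 0.9995, "v4": 0.9914},
    "Difficult to map SNPs": {"v3.3.2": 0.9474, "v4": 0.4916},
}

def fn_rate(recall_value):
    """False negative rate = FN / (TP + FN) = 1 - recall."""
    return 1.0 - recall_value

def fold_change(subset):
    """Ratio of the v4 FN rate to the v3.3.2 FN rate for one subset."""
    old = fn_rate(recall[subset]["v3.3.2"])
    new = fn_rate(recall[subset]["v4"])
    return new / old

for subset in recall:
    old = fn_rate(recall[subset]["v3.3.2"])
    new = fn_rate(recall[subset]["v4"])
    print(f"{subset}: FN rate {old:.4f} -> {new:.4f} ({fold_change(subset):.1f}x)")
```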
Want to help us evaluate the
benchmark?
• Compare your small variants to the v4
benchmark
• Manually curate some FPs and FNs
– Pre-configured IGV sessions available!
• Are they actually FPs and FNs?
https://groups.google.com/forum/#!forum/giab-analysis-team
Develop sequence-resolved
structural variant benchmark set
GIAB Analysis Team and Nate Olson, NIST
[Figure panels: 50 to 1000 bp (Alu); 1kbp to 10kbp (LINE)]
Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
Filter complex: 12745 SVs not within 1kb of another SV
Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6
tinyurl.com/GIABSV06
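To see how much each stage of this pipeline narrows the call set, the counts on the slide can be turned into step-by-step retention fractions. This is only arithmetic on the slide's numbers; the `steps` list and the `retention` helper are illustrative names. The Discovery counts are left out because they are raw calls before clustering and are not directly comparable to the clustered SV counts.

```python
# SV counts copied from the slide, in pipeline order (after clustering)
steps = [
    ("Compare SVs", 128715),
    ("Discovery Support", 30062),
    ("Evaluate/genotype", 19748),
    ("Filter complex", 12745),
    ("Regions", 9641),
]

def retention(steps):
    """Return (name, count, fraction kept from the previous step) for each step."""
    rows = []
    previous = None
    for name, count in steps:
        kept = None if previous is None else count / previous
        rows.append((name, count, kept))
        previous = count
    return rows

rows = retention(steps)
for name, count, kept in rows:
    kept_text = "n/a" if kept is None else f"{kept:.1%}"
    print(f"{name:<20}{count:>8}  kept from previous step: {kept_text}")

overall = steps[-1][1] / steps[0][1]
print(f"Overall fraction of clustered SVs in the v0.6 benchmark: {overall:.1%}")
```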
Our benchmark sets are useful in evaluating SVs
from multiple technologies
Goal: When comparing any callset
to our vcf within the bed, most
putative FPs and FNs should be
errors in the tested callset
github.com/spiralgenetics/truvari
github.com/nhansen/SVanalyzer
Resolve MHC regions from
HG002
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
Justin Wenger, Justin Zook, Mikko Rautiainen, Jason Chin, Tobias Marschall, Qian Zeng,
Erik Garrison, Shilpa Garg
Mar. 25-27, UCSC, The Human Pangenomics Hackathon
Goals
• Make the best haplotype-correct assemblies for
the MHC regions of HG002 from all available data
• Correct phasing for small and large variants
• Create GIAB small and structural variant
benchmarks for this complicated but
medically important region
• Used in latest v4.0 draft small variant benchmark
Integrating assembly- and mapping-based calls gives best MHC benchmark
• MHC assembly-based bed includes 23187 variants in 4.64/4.97 Mbp, excluding:
– CYP21A2 and pseudogene
– Homopolymers >10bp
– SVs in assembly
– Very dense variants
• v4.0 mapping-based bed includes 13964 variants in 4.16/4.97 Mbp, excluding:
– Short read callsets
– Conflicts between callers
– SVs from all methods
– Homopolymers >10bp
– Many clusters of variants, including some HLA genes
• Only 11 differences between assembly and mapping based calls in both beds
– 2 genotyping errors in assembly-based
– 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based
• Merged benchmark includes 23229 variants in 4.67/4.97 Mbp
– Covers most HLA genes and CYP21A2/TNXA/TNXB
Open consent enables secondary reference samples to
meet specific clinical needs
• >50 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
• Clinical variants
• Somatic variants
• Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …
The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
...
Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://rdcu.be/bqpDT
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!
Diploid assembly, cancer genomes, other ‘omics, …


Editor's Notes

  • #6 This is a good slide for 644: give a clinical anecdote. Also numbers: attendance, publications, data, RM unit sales. Reference sample distributors. How much money from IAA? (sustained funding). Quantify collaborators' input. GIAB steering committee. Examples of others contributing data, analyses. How to describe emails.
  • #16 False negatives (FNs): variants present in the truth set but missed in the query.