October 15, 2019
Genome in a Bottle: Developing Benchmarks for Challenging Variants With Linked & Long Reads
www.slideshare.net/genomeinabottle
NIST Human Genomics Team
• Purpose: Inspire trust in human genome measurements to enable
– Technology innovation
– Clinical translation
– Science-based regulatory oversight
– Human health
• Values:
– Understand stakeholder needs
– Collaborate with experts and synthesize results
• Sequencing technologies
• Informatics developers
– Open science
• Open data
• Open analyses
• Open samples
Why start Genome in a Bottle?
• A map of every individual’s genome will soon be possible, but how will we know if it is correct?
• Diagnostics and precision medicine require high levels of confidence
• Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
• NIST and FDA funding for the work
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
Human Genome Sequencing needed a new class of Reference Materials with billions of reference values
By Russ London at English Wikipedia, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=9923576
Many diverse contributors to GIAB
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
GIAB has characterized 7 human genomes
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Open consent enables secondary reference samples to meet specific clinical needs
• >50 products now available based on broadly-consented, well-characterized GIAB PGP cell lines
• Genomic DNA + DNA spike-ins
• Clinical variants
• Somatic variants
• Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …
Reference Genomes vs. Benchmark Genomes
Reference genomes
• Primary uses: mapping and annotation
• De novo assembly without reference
• Traditionally not diploid
• Combination of individuals that often aren’t public samples
Benchmark genomes
• Primary use: benchmarking and optimization
• Variant calls and regions on reference genome
• Diploid-aware is essential
• Widely available individual samples
Design of our human genome reference values
Reference values:
• Benchmark Variant Calls
• Benchmark Regions: regions in which the benchmark contains (almost) all the variants
Query Variants: variants from any method being evaluated
• Variants outside benchmark regions are not assessed
• Matching variants assumed to be true positives
• Majority of variants unique to method should be false positives (FPs)
• Majority of variants unique to benchmark should be false negatives (FNs)
This does not directly give the accuracy of the reference values, but rather that they are fit for purpose.
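The comparison described on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: variants are plain `(chrom, pos, ref, alt)` tuples matched exactly, benchmark regions are BED-style intervals, and the function names `in_regions` and `compare` are hypothetical. Real comparison tools also have to reconcile different representations of the same variant and match genotypes, which this sketch does not attempt.

```python
from bisect import bisect_right


def in_regions(regions, chrom, pos):
    """Return True if a 1-based VCF position lies inside a benchmark region.

    `regions` maps chromosome -> sorted, non-overlapping (start, end) pairs
    in BED convention (0-based start, end exclusive).
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= the position.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def compare(benchmark, query, regions):
    """Count TP/FP/FN for variants given as (chrom, pos, ref, alt) tuples.

    Variants outside the benchmark regions are dropped before counting,
    so they are not assessed at all.
    """
    bench = {v for v in benchmark if in_regions(regions, v[0], v[1])}
    quer = {v for v in query if in_regions(regions, v[0], v[1])}
    tp = len(bench & quer)   # matching variants
    fp = len(quer - bench)   # unique to the method being evaluated
    fn = len(bench - quer)   # unique to the benchmark
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Toy data (made up for illustration only).
regions = {"chr1": [(100, 200), (300, 400)]}
benchmark = [("chr1", 150, "A", "G"), ("chr1", 350, "C", "T"), ("chr1", 250, "G", "A")]
query = [("chr1", 150, "A", "G"), ("chr1", 360, "T", "C"), ("chr1", 500, "A", "T")]
print(compare(benchmark, query, regions))
```

In the toy data, the benchmark call at position 250 and the query call at position 500 fall outside the regions and therefore do not count as a false negative or a false positive.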
GIAB Recently Published Resources for “Easier” Small Variants
GIAB has extensive public, unembargoed data
Short reads
• BGISEQ
• Complete Genomics
• Illumina
• Ion Torrent
• SOLiD
Linked reads
• 10x Genomics
• BGISEQ stLFR
• Illumina 6kb mate-pair
• HiC
• Strand-seq
Long reads
• PacBio
• PacBio CCS
• PromethION
• Ultralong Oxford Nanopore
Optical/electronic mapping
• BioNano
• Nabsys
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/
Now >100x public 10-15kb CCS data for HG002
Extensive “ultralong” ONT Data
3 PromethION flow cells
110 MinION flow cells
4x >250kb
7x >100kb
David Catoe
Nate Olson
Noah Spies
Marc Salit
Matt Loose
Nick Loman
Josh Quick
Now using linked and long reads for difficult variants and regions
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
– Hi-C
– Strand-seq (underway)
• Long Reads
– PacBio Continuous Long Reads
– PacBio Circular Consensus Seq
– Oxford Nanopore “ultralong”
– PromethION
GIAB Use Cases
• Develop structural variant benchmark
• Diploid assembly of difficult regions like MHC
• Expand small variant benchmark
[Figure: SV size distributions in two panels, 50 to 1000 bp and 1 kbp to 10 kbp, with features labeled Alu and LINE]
• Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12745 SVs not within 1kb of another SV
• Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6: tinyurl.com/GIABSV06
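To see how strongly each step narrows the candidate set, the counts above can be turned into retention fractions. The short Python sketch below uses only the numbers from the slide. It starts from the clustered call set, because the Discovery counts are raw calls before clustering, and the names `STAGES` and `retention` are illustrative, not part of any GIAB tool.

```python
# Stage counts copied from the slide (SVs >=50bp, starting after clustering).
STAGES = [
    ("Compare SVs (clustered)", 128715),
    ("Discovery Support", 30062),
    ("Evaluate/genotype", 19748),
    ("Filter complex", 12745),
    ("Regions (v0.6 benchmark)", 9641),
]


def retention(stages):
    """Return (name, count, fraction kept vs previous, fraction kept vs first)."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows


for name, count, step_frac, total_frac in retention(STAGES):
    print(f"{name:28s} {count:7d}  step {step_frac:6.1%}  overall {total_frac:6.1%}")
```

Roughly 7.5% of the clustered calls end up in the v0.6 benchmark, which fits the emphasis on restricting the benchmark to well-supported, isolated SVs inside regions backed by diploid assembly.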
Reference genomes and benchmark genomes are converging
• Reference genomes that are polished diploid assemblies of open cell lines
• Benchmark genomes and tools to stratify by genome context and variant type
• New diploid assembly-derived benchmarks
• New tools to assess diploid assembly quality
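As a rough illustration of what stratifying by variant type can look like, the Python sketch below tallies benchmarking outcomes per variant class. It is an assumption-laden toy: the classification rules are simplified (based only on REF and ALT lengths, with the 50 bp SV threshold used elsewhere in these slides), the names `variant_type`, `stratify`, and `recall` are hypothetical, and stratification by genome context (which would need region annotations) is not covered.

```python
from collections import Counter, defaultdict

SV_MIN_SIZE = 50  # size threshold for structural variants used in these slides


def variant_type(ref, alt):
    """Classify a variant from its REF and ALT allele strings (simplified)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "other"  # same-length multi-base substitution
    return "SV" if size >= SV_MIN_SIZE else "indel"


def stratify(labelled):
    """Tally outcomes per variant type.

    `labelled` is an iterable of (ref, alt, outcome) with outcome in
    {"TP", "FP", "FN"}, e.g. produced by a benchmarking comparison.
    """
    counts = defaultdict(Counter)
    for ref, alt, outcome in labelled:
        counts[variant_type(ref, alt)][outcome] += 1
    return counts


def recall(counter):
    """Recall = TP / (TP + FN); NaN when the stratum has no benchmark calls."""
    total = counter["TP"] + counter["FN"]
    return counter["TP"] / total if total else float("nan")


# Toy labelled calls (made up for illustration only).
calls = [
    ("A", "G", "TP"), ("C", "T", "TP"), ("G", "A", "FN"),
    ("A", "ATT", "TP"), ("CAG", "C", "FN"),
    ("A", "A" + "T" * 60, "FN"),
]
for vtype, counter in sorted(stratify(calls).items()):
    print(vtype, dict(counter), f"recall={recall(counter):.2f}")
```

Reporting recall per stratum, rather than one pooled number, makes it visible when a method does well on SNPs but poorly on indels or larger variants.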
The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
• Diploid assembly benchmarking
...
Acknowledgment of many GIAB contributors
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://rdcu.be/bqpDT
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available! Diploid assembly, cancer genomes, other ‘omics, …

GIAB update for GRC GIAB workshop 191015


Editor's Notes

  • #6, #25: This is a good slide for 644: give a clinical anecdote. Also numbers: attendance, publications, data, RM unit sales. Reference sample distributors. How much money from IAA? (sustained funding). Quantify collaborators' input. GIAB steering committee. Examples of others contributing data, analyses. How to describe emails.