April 17, 2019
Long Read Sequencing and the
Genome in a Bottle Consortium
What’s Genome in a Bottle?
• Authoritative Characterization of Human Genomes
– enduring commitment to resource availability
• Samples
• Data
– widely available open resources
– all data made available without embargo
• Enable technology and tool-building with benchmark samples and methods for…
– development
– optimization
– demonstration
• Germline samples available now
• Developing capacity for somatic sample development
GIAB Recently Published Resources for “Easier” Small Variants
Now using linked and long reads
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
• Long Reads
– PacBio continuous long reads
– PacBio circular consensus seq
– Oxford Nanopore “ultralong”
GIAB Use Cases
• Expand small variant benchmark
• Develop structural variant benchmark
• Diploid assembly of difficult regions like MHC
Linked Reads
• Short reads, but barcodes give long-range information >100kb
• Most useful for:
– Phasing variants & reads
– Difficult-to-map regions
– De novo assembly
https://dx.doi.org/10.1038%2Fnbt.3432
PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
Double-stranded DNA
Ligate adapters
Anneal primer and bind DNA polymerase
Sequence
Generate consensus HiFi read
[Figure: subreads (passes) with subread errors; plot of accuracy (Phred) versus number of passes]
Wenger, Peluso, et al. (2018). bioRxiv. doi:10.1101/519025
Read accuracy improves with more passes
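The accuracy axis on this slide uses the Phred scale, where Q = -10 * log10(p) for an error probability p. A minimal Python sketch of that conversion follows; the function names are illustrative and do not come from the slides or from any PacBio tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to Phred Q = -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# Print the error probability and accuracy for a few Phred values.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.5f}, accuracy {1 - p:.3%}")
```

On this scale every 10 points is a tenfold drop in error probability, so Q30 corresponds to 1 error in 1,000 bases.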
Oxford Nanopore Can Produce “Ultralong” Reads
15X coverage by reads >100 kb
Expand small variant benchmark set to difficult-to-map regions
Justin Wagner, NIST
Long+Linked Reads expand small variant benchmark set
Benchmark includes more bases, variants, and segmental duplications in v4⍺
                                      v3.3.2         v4⍺            In v4⍺ not in v3.3.2  In v3.3.2 not in v4⍺
Base pairs covered                    2,358,060,765  2,572,421,057  225,990,474           11,630,182
Percent of GRCh37 covered             87.84%         95.82%         8.42%                 0.43%
SNPs                                  3,046,933      3,432,698      385,765               25,219
Indels                                465,670        537,035        71,365                15,382
Base pairs in Segmental Duplications  13,722,546     116,687,703    103,466,431           501,274
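The "Percent of GRCh37 covered" row can be reproduced from the base-pair row. The sketch below assumes the denominator is the non-N base count for GRCh37 given in the editor's notes (2684573005); the slides do not state the formula, so treat this as a consistency check, not the authors' method. The dictionary and function names are illustrative.

```python
# Non-N reference bases in GRCh37, as listed in the editor's notes (assumed denominator).
NON_N_BASES_GRCH37 = 2_684_573_005

# Base pairs covered, copied from the table.
BASES_COVERED = {
    "v3.3.2": 2_358_060_765,
    "v4⍺": 2_572_421_057,
    "In v4⍺ not in v3.3.2": 225_990_474,
    "In v3.3.2 not in v4⍺": 11_630_182,
}


def percent_covered(bases: int, total: int = NON_N_BASES_GRCH37) -> float:
    """Return the percentage of the non-N reference covered by `bases`."""
    return 100 * bases / total


for label, bases in BASES_COVERED.items():
    print(f"{label}: {percent_covered(bases):.2f}%")

# The net gain in covered bases should equal bases gained minus bases lost.
net_gain = BASES_COVERED["v4⍺"] - BASES_COVERED["v3.3.2"]
gained_minus_lost = (
    BASES_COVERED["In v4⍺ not in v3.3.2"] - BASES_COVERED["In v3.3.2 not in v4⍺"]
)
print("net gain consistent:", net_gain == gained_minus_lost)
```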
Small variant performance metrics decrease vs. new benchmark
Comparison of Illumina GATK4 VCF against benchmark sets
• SNP FN rate increases by a factor of 10
– almost entirely due to new benchmark variants in difficult-to-map regions (lowmap) and segmental duplications (segdups)
Subset                     v3.3.2 Recall  v4⍺ Recall  v3.3.2 Precision  v4⍺ Precision
All SNPs                   0.9995         0.9914      0.9981            0.9941
Lowmap 100 bp              0.9799         0.7911      0.9623            0.8582
Lowmap 250 bp no mismatch  0.9474         0.4916      0.8911            0.7171
Segdups                    0.9982         0.9103      0.9910            0.9014
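The false-negative (FN) rate follows from recall as FN rate = 1 - recall. The sketch below applies that to the recall columns of the table. The recall values are rounded to four decimals on the slide, so the fold changes it prints are approximate; the variable and function names are illustrative.

```python
# Recall per subset as (v3.3.2, v4⍺), copied from the table (rounded to 4 decimals).
RECALL = {
    "All SNPs": (0.9995, 0.9914),
    "Lowmap 100 bp": (0.9799, 0.7911),
    "Lowmap 250 bp no mismatch": (0.9474, 0.4916),
    "Segdups": (0.9982, 0.9103),
}


def false_negative_rate(recall: float) -> float:
    """FN rate = FN / (TP + FN) = 1 - recall."""
    return 1.0 - recall


def fn_fold_change(old_recall: float, new_recall: float) -> float:
    """Ratio of the FN rate under the new benchmark to that under the old one."""
    return false_negative_rate(new_recall) / false_negative_rate(old_recall)


for subset, (old, new) in RECALL.items():
    print(
        f"{subset}: FN rate {false_negative_rate(old):.4f} -> "
        f"{false_negative_rate(new):.4f} "
        f"({fn_fold_change(old, new):.1f}x)"
    )
```

Computed from these rounded values, the "All SNPs" FN rate rises by more than tenfold, in line with the order-of-magnitude increase stated on the slide.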
Error in current benchmark excluded in new benchmark
[Figure: panels labeled v4⍺, v3.3.2, Illumina, PacBio CCS, 10X, ONT]
Develop sequence-resolved structural variant benchmark set
GIAB Analysis Team
[Figure: SV size distributions for 50 to 1000 bp (Alu peaks) and 1 kbp to 10 kbp (LINE peaks)]
• Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12745 SVs not within 1kb of another SV
• Regions: 11869 SVs inside 2.69 Gbp benchmark regions supported by diploid assembly
v0.6: tinyurl.com/GIABSV06
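The "Filter complex" step keeps only SVs that are not within 1kb of another SV. The sketch below shows one way such a filter could work. It is not the GIAB implementation: the `SV` record, the function name, and the choice to treat a gap of exactly 1000 bp as "within 1kb" are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class SV(NamedTuple):
    """Hypothetical minimal SV record: chromosome plus half-open interval."""
    chrom: str
    start: int
    end: int


def isolated_svs(svs: List[SV], min_gap: int = 1000) -> List[SV]:
    """Return SVs with no other SV on the same chromosome within min_gap bp.

    Overlapping SVs have a negative gap and are therefore dropped as well.
    """
    ordered = sorted(svs, key=lambda s: (s.chrom, s.start, s.end))
    kept: List[SV] = []
    prev_chrom: Optional[str] = None
    max_prev_end: Optional[int] = None  # furthest end among earlier SVs on this chromosome
    for i, sv in enumerate(ordered):
        if sv.chrom != prev_chrom:
            prev_chrom, max_prev_end = sv.chrom, None
        near_left = max_prev_end is not None and sv.start - max_prev_end <= min_gap
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        near_right = (
            nxt is not None
            and nxt.chrom == sv.chrom
            and nxt.start - sv.end <= min_gap
        )
        if not (near_left or near_right):
            kept.append(sv)
        max_prev_end = sv.end if max_prev_end is None else max(max_prev_end, sv.end)
    return kept


example = [
    SV("chr1", 1000, 1100),
    SV("chr1", 1500, 1600),   # 400 bp from the previous SV, so both are dropped
    SV("chr1", 5000, 5300),   # more than 1 kb from any other SV
    SV("chr2", 1200, 1250),   # only SV on chr2
]
print(isolated_svs(example))
```

Sorting by position means each SV only needs to be compared with the furthest-reaching earlier SV and the next SV on the same chromosome, which avoids an all-pairs comparison.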
Resolve MHC regions from HG002
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
Justin Wagner, Justin Zook, Mikko Rautiainen, Jason Chin, Tobias Marschall, Qian Zeng, Erik Garrison, Shilpa Garg
Mar. 25-27, UCSC, The Human Pangenomics Hackathon
Goals
• Make the best haplotype-correct assemblies for the MHC regions of HG002 from all available data
• Fewest gaps
• Correct phasing for both SNPs and SVs
• Provide the best genomic sequences for future GIAB SNP and SV benchmarks for this complicated but medically important region
Preliminary MHC Diploid Assembly Results
[Figure: MHC region, one panel per haplotype]
Haplotype I (2 contigs spanning the region)
Haplotype II (3 contigs spanning the region)
A loop in the assembly graph
Missing Sequence?
The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
• ...
Acknowledgment of many GIAB contributors
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
For More Information
www.genomeinabottle.org - sign up for the general GIAB and Analysis Team Google groups
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!

GIAB and long reads for bio it world 190417


Editor's Notes

  • #10 Non RefN bases in GRCh37 HG002: 2684573005
  • #11 False negatives (FN): variants present in the truth set but missed in the query.
  • #19 This is a good slide for 644: give a clinical anecdote. Also numbers: attendance, publications, data, RM unit sales. Reference sample distributors. How much money from IAA? (sustained funding). Quantify collaborators' input. GIAB steering committee. Examples of others contributing data, analyses. How to describe emails.