Genome in a Bottle: Integrating human sequence data sets provides
a resource of benchmark SNP and indel genotype calls
Justin Zook1, Brad Chapman2, Oliver Hofmann2, Winston Hide2, Jason Wang3, David Mittelman3
1National Institute of Standards and Technology, Gaithersburg, MD
2Harvard School of Public Health, Cambridge, MA; 3Arpeggi, Inc., Austin, TX

Integrating SNPs & indels

Genome in a Bottle Consortium
• As sequencing moves to clinical applications, assessing accuracy becomes very important.
• With the Genome in a Bottle Consortium, NIST is developing methods to characterize whole genome Reference Materials that can be used to assess the performance of whole genome sequencing.
• Data from multiple sequencing platforms and runs can be used to understand and compensate for errors and biases of each method

[Figure: measurement process. Recoverable box labels: Samples; Spike-ins; Sample Preparation]

[Figure: integration workflow. Recoverable box labels, shown for two datasets separated by an ellipsis: Unified Genotyper; Haplotype Caller; Force calls with Unified Genotyper; Force de novo assembly with Haplotype Caller]

[NA12878 Data sets panel: contents not recovered]

www.bioplanet.com/gcat
Interactive comparison of bioinformatics methods to our integrated calls

• Using microarrays to assess performance underestimates the false negative (FN) rate
• Integrated calls have a >20x higher percentage of sites in low complexity regions than microarrays

[Figure: integration workflow, continued. Recoverable labels: SNPs; indels; Find high-confidence SNP & indel sites; one VQSR box per genotype class (HomRef SNP, Het SNP, HomVar SNP, HomRef indel, Het indel, HomVar indel), shown for two datasets separated by an ellipsis; Arbitrate using characteristics of mapping and alignment bias and systematic sequencing errors to find consensus SNP & indel sites]
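The "Find high-confidence SNP & indel sites" step can be pictured as splitting candidate sites into those where every dataset reports the same genotype and those that need arbitration. This is a minimal sketch under assumptions: the site keys, dataset names, genotype strings, and the function name are illustrative and are not taken from the poster's pipeline.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]   # (chromosome, position); hypothetical site key
Calls = Dict[str, str]   # dataset name -> genotype string, e.g. "0/1"


def split_by_concordance(
    calls_by_site: Dict[Site, Calls],
) -> Tuple[List[Site], List[Site]]:
    """Return (concordant, discordant) site lists.

    A site is concordant when all datasets report the same genotype;
    every other site is handed on for arbitration.
    """
    concordant: List[Site] = []
    discordant: List[Site] = []
    for site, calls in sorted(calls_by_site.items()):
        if len(set(calls.values())) == 1:
            concordant.append(site)
        else:
            discordant.append(site)
    return concordant, discordant


# Toy genotypes, invented for illustration only.
example = {
    ("chr1", 1000): {"d1": "0/1", "d2": "0/1", "d3": "0/1"},
    ("chr1", 2000): {"d1": "0/1", "d2": "0/0", "d3": "0/1"},
}
concordant_sites, discordant_sites = split_by_concordance(example)
```

In this toy input the first site is concordant and the second would go to arbitration.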

[Figure: integration workflow, final box label: Filter sites if <2 datasets are free of bias]

Indels/Complex Variants
• Multiple correct representations of complex variants often exist (figure example: CAGTGA > TCTCT complex variant)
• Comparing complex variants is difficult. Try RTG’s vcfeval!
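To illustrate why representation matters, the sketch below replays two different records sets for the CAGTGA > TCTCT example onto the same reference and checks that they yield the same haplotype. The flanking bases, the 0-based coordinates, and the helper name `apply_variants` are assumptions made for this example; it shows the general idea of comparing at the haplotype level and is not vcfeval's algorithm.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]   # (0-based position, REF allele, ALT allele)


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Replay non-overlapping variants onto a reference string."""
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:pos])   # untouched bases before the variant
        pieces.append(alt)                     # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])          # remaining reference tail
    return "".join(pieces)


reference = "AACAGTGATT"   # invented flanks (AA, TT) around CAGTGA

# Representation 1: a single complex record.
as_complex = [(2, "CAGTGA", "TCTCT")]
# Representation 2: four SNPs plus one two-base-to-one-base replacement.
as_pieces = [(2, "C", "T"), (3, "A", "C"), (4, "G", "T"),
             (5, "T", "C"), (6, "GA", "T")]

haplotype_complex = apply_variants(reference, as_complex)
haplotype_pieces = apply_variants(reference, as_pieces)
same_haplotype = haplotype_complex == haplotype_pieces
```

A position-by-position comparison of the two record sets would report them as different calls even though the resulting sequence is identical.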

• We propose a method using 14 datasets for CEPH/HapMap sample NA12878 to find characteristics of highly confident genotype calls and use these characteristics to arbitrate between discordant calls

Overlap of SNP calls for NA12878 between three variant call files.
(a) The three variant calls come from: (1) Illumina HiSeq reads mapped with bwa and with variants called by GATK; (2) the same Illumina HiSeq reads mapped with bwa but with variants called by samtools; (3) Complete Genomics called with CGTools 2.0.
(b) The samtools calls are replaced by SOLiD 4 reads called with GATK.
The gray numbers in parentheses are the numbers of variants that are not filtered in the other datasets.

[Figure: measurement process. Recoverable box labels: Sequencing; Bioinformatics; Variant list, Performance metrics]

[Figure: integration workflow. Recoverable box labels: Dataset #1 … Dataset #14; Haplotype Caller; Cortex; Integrate UG and HC calls for dataset #1; Integrate UG and HC calls for dataset #11; Candidate SNP & indel sites; Het SNP VQSR]

[Figure labels, source figure unclear: Complete Genomics; Illumina HiSeq]

Marc Salit1 (co-author; affiliation 1)

Performance assessment using integrated calls
• Freebayes has significantly improved its indel calls over the past year (comparison figure not recovered)
• Calls hosted on GCAT website

Characteristics of bias used for arbitration
• Systematic sequencing errors (SSEs)
  • Strand bias
  • Base Quality Rank Sum
• Local Alignment
  • Distance from end of read
  • Mean position within read
  • Read Position Rank Sum
  • HaplotypeScore
  • Length of aligned reads
• Mapping problems
  • Mapping Quality
  • Abnormal coverage – CNV
  • Length of aligned reads
• Abnormal allele balance
  • Allele Balance
  • Quality/Depth
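The arbitration idea, including the workflow rule "Filter sites if <2 datasets are free of bias", might be sketched as follows. This is a simplification under stated assumptions: the workflow labels refer to VQSR, whereas this sketch swaps in fixed cut-offs; the three annotations, their threshold values, and the rule that all unbiased datasets must agree are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cut-offs; the poster does not state any values.
MAX_STRAND_BIAS = 60.0
MIN_MAPPING_QUALITY = 30.0
MAX_ALLELE_BALANCE_DEVIATION = 0.3


@dataclass
class DatasetCall:
    dataset: str
    genotype: str                    # e.g. "0/1"
    strand_bias: float               # higher means stronger bias
    mapping_quality: float           # lower suggests mapping problems
    allele_balance_deviation: float  # distance from the expected balance

    def shows_bias(self) -> bool:
        """True if any annotation crosses its (hypothetical) cut-off."""
        return (
            self.strand_bias > MAX_STRAND_BIAS
            or self.mapping_quality < MIN_MAPPING_QUALITY
            or self.allele_balance_deviation > MAX_ALLELE_BALANCE_DEVIATION
        )


def arbitrate(calls: List[DatasetCall]) -> Optional[str]:
    """Return a consensus genotype, or None when the site is filtered."""
    unbiased = [c for c in calls if not c.shows_bias()]
    if len(unbiased) < 2:
        return None                  # fewer than 2 datasets free of bias
    genotypes = {c.genotype for c in unbiased}
    if len(genotypes) != 1:
        return None                  # assumption: unresolved disagreement is filtered
    return genotypes.pop()


# Toy calls, invented for illustration only.
calls = [
    DatasetCall("d1", "0/1", 5.0, 60.0, 0.05),
    DatasetCall("d2", "0/1", 8.0, 55.0, 0.10),
    DatasetCall("d3", "0/0", 90.0, 58.0, 0.02),   # strong strand bias, ignored
]
consensus = arbitrate(calls)
filtered = arbitrate(calls[1:])   # only one unbiased dataset remains
```

Here the discordant `d3` call is set aside because of its strand bias, so the two remaining datasets decide the genotype.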

Performance Assessment
• Within “highly confident” regions, all datasets are highly sensitive and specific
• Most “false” positives and negatives appear to be microarray errors
Pedigree Methods
• Real Time Genomics and Illumina Platinum Genomes have developed methods to use the 11 children of NA12878
• High-confidence variants are in haplotypes that are properly inherited in the children
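The simplest form of an inheritance check is genotype-level Mendelian consistency in a trio, sketched below. This is a deliberate simplification: it uses unphased autosomal genotypes for one child at a time, whereas the pedigree methods named above work with haplotypes across all the children. The function name and genotype encoding are assumptions for this example.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[int, int]   # unphased allele indices, e.g. (0, 1)


def is_mendelian_consistent(mother: Genotype, father: Genotype,
                            child: Genotype) -> bool:
    """True if the child can take one allele from each parent."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


het_from_mother = is_mendelian_consistent((0, 1), (0, 0), (0, 1))
impossible_het = is_mendelian_consistent((0, 0), (0, 0), (0, 1))
impossible_hom = is_mendelian_consistent((1, 1), (0, 0), (1, 1))
```

A site that fails this check in any child is a candidate error in at least one family member's call, though the check alone cannot say which one.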

Structural Variants
• Can we use similar methods for SVs?
• Arbitrate using coverage, insert size, discordant paired ends, mapping quality, soft clipping, heterozygous/homozygous ratio, allele fraction, …
• How to use long-read technologies?

Discussion

• Genome in a Bottle Consortium
• New members welcome!
• www.genomeinabottle.org

a http://genomeinabottle.org/blog-entry/existing-and-future-na12878-datasets.

2014 agbt giab data integration poster 140206
