March 2013 NIST Reference Material Program and Data Integration

NIST Program for Human
Genome Reference Materials
Marc Salit and Justin Zook
NIST

Some use cases for a well-characterized, stable RM
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]

Measurement Process
• gDNA reference materials will be developed to characterize performance of a part of process
– materials will be certified for their variants against a reference sequence, with confidence estimates
[Figure: generic measurement process: Sample, gDNA isolation, Library Prep, Sequencing, Alignment/Mapping, Variant Calling, Confidence Estimates, Downstream Analysis]

Variants of Interest
• SNPs (and larger polymorphisms)
• Indels
• Longer insertions/deletions
• Inversions
• Rearrangements
• CNV (different lengths)
– Deletions, tandem and dispersed dups
– duplications with SNPs/indels
• Mobile Element Insertions
[Figure: double-stranded sequence diagrams (5’/3’) comparing a reference sequence with variant examples, including an inversion and an insertion]

Putting “Genomes” in Bottles
• NIST working with GiaB to select genomes
• Current plan
– NA12878 HapMap sample as Pilot sample
• part of 17-member pedigree
– trios from PGP as more complete set
• 8 trios, focus on children
• varying biogeographic ancestry
[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]

Consenting Genomes for use as
Reference Materials
• Risk of re-identification
– this is a real risk
– privacy
– implications for family members
• Meaning of possibility of
withdrawal
• Commercial application
– indirect, research
– direct, derived products
• PGP project currently state-of-art
– broad and direct
– test to demonstrate understanding
• “Wild West”

Characterization Methods
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
– including LFR
• Emerging technologies
– Ion Proton
– nanopore?
• 3x replication of sequencing (3 library preps)
• …
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Figure: pedigree showing Father, Mother, Husband, NA12878, Son, Daughter]

Timeline
Consortium Activity
• WG Telecons
– Starting up in April
– Info to be posted on www.genomeinabottle.org
• schedules
• agendas
• summaries
• Website forums
– general and supporting each WG
• Upcoming Workshops
– Proposed 8/2013
• NIST, Gaithersburg, MD
NIST RM Activity
• 80 mg gDNA for NA12878 expected @ NIST 4/2013
– 8000 samples
– available for characterization within GiaB immediately
– target for release as NIST RM 2/2014
• SNPs, small indels
• PGP Samples coming
• IRB Status
– working to establish policy
• looks good for release of NA12878 as pilot RM
• PGP samples expected to gain approval

Artificial Constructs
• useful as spike-ins
– QC on clinical samples
• a panel of druggable targets in development at NCI
– pDNA with a mutation insert
• ‘barcoded’ adjacent to mutation of interest
• large-scale constructs may be useful for SV and specific contexts
• recapitulate “difficult” sequence contexts
– simple sequence
– duplications
[Figure: double-stranded sequence diagrams (5’/3’) of variant examples, including an inversion and an insertion]

Microbial Genome RMs
[Figure: workflow from Reference Samples (Extracted DNA) through Sample Preparation, Sequencing, and Bioinformatics to Variant List and Performance Metrics]

With multiple data sets, both opportunity for integration and question of
just how to do it.

DATA INTEGRATION

Datasets
• 9 whole genome – Illumina, CG, 454, SOLiD
• 3 whole exome – Illumina, Ion Torrent

Integration of Data to Form “Gold Standard” Genotype Calls
• Candidate variants: Find all possible variant sites
• Confident variants: Find highly confident sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics
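
The arbitration and confidence-level steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each dataset reports one genotype per site plus a single hypothetical "atypicality" score, and the names `Call` and `arbitrate`, the cutoff, and the minimum count are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score: larger = more atypical characteristics


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one site from a non-empty list of calls."""
    if not calls:
        raise ValueError("need at least one call")
    # Order datasets from most typical to most atypical.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Arbitration: drop the most atypical dataset until the rest agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    genotype = remaining[0].genotype
    # Confidence level: agreement alone is not enough; require that enough
    # of the agreeing datasets have typical characteristics.
    typical = [c for c in remaining if c.atypicality <= typical_cutoff]
    status = "confident" if len(typical) >= min_typical else "uncertain"
    return genotype, status


site = [Call("A", "0/1", 0.2), Call("B", "0/1", 0.5), Call("C", "0/0", 3.1)]
print(arbitrate(site))
```

In this sketch a single score stands in for the several characteristics the slides list; a real pipeline would presumably weigh them separately.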

Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local Alignment problems
– Distance from end of read
– Mean position within read
– Read Position Rank Sum
– HaplotypeScore
– Mean length of aligned reads
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage – CNV
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth

Example of Arbitration: SSE suspected from strand bias
[Figure: reads from Platform A and Platform B near a homopolymer, with strand bias (SNP overrepresented on reverse strands)]
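
One common way to quantify strand bias at a site is Fisher's exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. The sketch below uses only the standard library. Whether this matches the strand-bias measure used for these slides is an assumption, and the read counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Here: a, b = reference reads on forward, reverse strand;
          c, d = alternate reads on forward, reverse strand.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: alternate allele seen only on the reverse strand.
biased = fisher_exact_two_sided(20, 20, 0, 15)
# Hypothetical counts: alternate allele balanced across strands.
balanced = fisher_exact_two_sided(20, 20, 10, 10)
print(biased, balanced)
```

A small p-value for the first table flags the site as a possible systematic sequencing error; the balanced table gives a p-value near 1.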

Performance Assessment of Genotype Calling
• For our purposes, we consider three categories of genotype calls
– homozygous reference
– heterozygous
– homozygous variant
• by convention
– Negative: homozygous reference
– Positive: anything else
• our approach looks at 3x3 matrix of call concordance
• Fourth category: Uncertain Genotype
– developing
• Three performance assessments:
– Individual dataset and Consensus calls against Omni SNP Array
– Individual dataset against Omni SNP Array and Consensus
– Individual dataset with two different genotype callers against Consensus
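
The 3x3 concordance matrix and its collapse to the positive/negative convention can be sketched as below. The category labels and function names are hypothetical, and the example calls are made up for illustration.

```python
from collections import Counter

CATEGORIES = ("hom_ref", "het", "hom_var")


def concordance_matrix(assessed, truth):
    """3x3 counts: rows = method being assessed, columns = method as truth."""
    counts = Counter(zip(assessed, truth))
    return [[counts[(a, t)] for t in CATEGORIES] for a in CATEGORIES]


def positive_negative_summary(matrix):
    """Collapse to the convention: negative = hom_ref, positive = anything else."""
    tn = matrix[0][0]
    fn = matrix[0][1] + matrix[0][2]  # assessed hom_ref, truth variant
    fp = matrix[1][0] + matrix[2][0]  # assessed variant, truth hom_ref
    tp = sum(matrix[i][j] for i in (1, 2) for j in (1, 2))
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


assessed = ["hom_ref", "het", "het", "hom_var", "hom_ref"]
truth = ["hom_ref", "het", "hom_var", "hom_var", "het"]
m = concordance_matrix(assessed, truth)
print(m)                             # [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
print(positive_negative_summary(m))  # {'TP': 3, 'FP': 0, 'FN': 1, 'TN': 1}
```

Note that the collapsed summary counts a het call at a hom_var site as a true positive, which is one reason to keep the full 3x3 matrix.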

Genotype Comparison Tables
Columns (Method as “Truth”): Hom. Ref., Heterozygous, Hom. Variant, Uncertain
Rows (Method being Assessed): Hom. Ref., Het., Hom. Var., Uncertain
[Table template: cells marked “?” on the slide]
* current state of research: only consensus process has “Uncertain” category

Consensus has lower FN rate than individual datasets

HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns)
                                  Hom. Reference   Heterozygous     Hom. Variant     Uncertain
Homozygous Reference/No Call      1.45M            7.24k (1.34%)    5.28k (0.65%)    N/A
Heterozygous                      196 (0.03%)      411k (60.7%)     133 (0.02%)      N/A
Homozygous Variant                154 (0.02%)      150 (0.02%)      249k (37.0%)     N/A

Integrated Consensus Genotypes (rows) vs. Illumina Omni SNP Array (columns)
                                  Hom. Reference   Heterozygous     Hom. Variant     Uncertain
Homozygous Reference              1.45M            613 (0.09%)      977 (0.15%)      N/A
Homozygous Variant                152 (0.02%)      61 (0.01%)       249k (36.9%)     N/A
Uncertain                         5458 (0.81%)     3421 (0.51%)     4808 (0.71%)     N/A

“FNs”: reference-row calls at sites the array calls variant. “FPs*”: variant-row calls at sites the array calls homozygous reference.
* Note that most or all of the putative FPs seem to actually be FNs on the microarray

SNP arrays overestimate performance

HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns)
                                  Hom. Reference   Heterozygous     Hom. Variant     Uncertain
Reference/No Call                 1.45M            7.24k (1.34%)    5.28k (0.65%)    N/A
Homozygous Variant                154 (0.02%)      150 (0.02%)      249k (37.0%)     N/A

HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns)
                                  Hom. Reference   Heterozygous     Hom. Variant     Uncertain
Reference/No Call                 1.52M            157k (4.68%)     30.3k (0.90%)    4.17M
Heterozygous                      47 (0.00%)       1.90M (56.4%)    34 (0.00%)       16.9k (0.50%)
Homozygous Variant                1 (0.00%)        298 (0.01%)      1.19M (35.3%)    73.3k (2.18%)

Samtools has higher FP and lower FN than GATK

HiSeq – samtools (rows) vs. Integrated Consensus Genotypes (columns)
                                  Hom. Reference   Heterozygous     Hom. Variant     Uncertain
Reference/No Call                 1.51M            49.6k (1.47%)    6.74k (0.20%)    3.93M
Heterozygous                      3141 (0.09%)     2.00M (59.6%)    74 (0.00%)       175k (5.19%)
Homozygous Variant                21 (0.00%)       777 (0.02%)      1.21M (36.0%)    192k (5.71%)

HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns)
                                  Hom. Reference   Heterozygous     Hom. Variant     Uncertain
Reference/No Call                 1.52M            157k (4.68%)     30.3k (0.90%)    4.17M
Heterozygous                      47 (0.00%)       1.90M (56.4%)    34 (0.00%)       16.9k (0.50%)
Homozygous Variant                1 (0.00%)        298 (0.01%)      1.19M (35.3%)    73.3k (2.18%)

Performance Metrics: Characteristics of Mis-calls
[Figure: characteristics of mis-calls (QUAL/Depth of Coverage, Strand Bias, ...) by HiSeq/GATK call (Hom. Ref./No call, Heterozygous, Hom. Variant) and Consensus Genotype (Hom. Ref., Heterozygous, Hom. Variant, Uncertain)]

Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
• Data from multiple platforms and library preparations
– when characterizing a Reference Material
– when assessing performance of a test platform
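
The effect of an "uncertain" label on apparent accuracy can be shown with a small calculation. The counts below are hypothetical and chosen only to illustrate the arithmetic; the function name is likewise invented.

```python
def apparent_accuracy(n_correct, n_wrong, n_uncertain, include_uncertain=False):
    """Fraction of sites called correctly.

    With include_uncertain=False, sites labeled uncertain are left out of
    the denominator, as happens when only confident sites are compared.
    """
    denominator = n_correct + n_wrong
    if include_uncertain:
        denominator += n_uncertain
    return n_correct / denominator


# Hypothetical site counts.
correct, wrong, uncertain = 9500, 100, 400
print(apparent_accuracy(correct, wrong, uncertain))                          # about 0.9896
print(apparent_accuracy(correct, wrong, uncertain, include_uncertain=True))  # 0.95
```

Moving difficult sites into the uncertain category shrinks the denominator, so the same set of calls looks more accurate.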

Genome-in-a-Bottle Consortium
• Genome-in-a-Bottle
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome!
– targeting pilot reference material availability in 2013
– working to identify best practice for consent of subject genome as a whole-genome reference material
• Developing genomic DNA reference materials for small number of microbial species
– to enable performance assessment of sequencing platforms
– range of GC
– range of complexity

Microbial Reference Material
Considerations
• Variation in GC Content
– Genomes with a range of GC to
challenge platforms
– Within genome variation to challenge
analytical process to define mobile
genetic and insertion elements
• Structural variations to challenge the
ability to recognize
– Repetitive sequences (e.g. palindromic
repeats)
– Homopolymers (>14 bases)
– Insertion elements
– Chromosomal rearrangements
– SNP calls (e.g. variant silencing due to
motifs)
• Reference data available on multiple
platforms
• Pedigree/phylogeny of strains
• Phenotypic characterization

Interesting work on assessing
performance for microbial sequencing
• Quail et al. at Sanger report on using
4 different microbial genomes to
characterize sequencer performance
– ~20% - ~68% GC overall
– Bordetella pertussis
• 67.7 % GC, with some regions in excess
of 90 % GC content
– Salmonella Pullorum
• 52 % GC
– Staphylococcus aureus
• 33 % GC
– Plasmodium falciparum
• 19.3 % GC, with some regions close to 0
% GC content
• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
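
GC content, both overall and within windows along a genome, can be computed with a few lines of code. This is a minimal sketch: the function names and window sizes are hypothetical, and the toy sequences stand in for real genome data.

```python
def gc_fraction(seq):
    """GC fraction over unambiguous bases; None if the sequence has none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window=100, step=100):
    """GC fraction for each full window along the sequence."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


print(gc_fraction("GGCCAT"))                     # 4 of 6 bases are G or C
print(windowed_gc("GGGGAAAA", window=4, step=4)) # [1.0, 0.0]
```

Windowed values are what reveal local extremes, such as the regions above 90 % or near 0 % GC noted for Bordetella pertussis and Plasmodium falciparum.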
