How GiaB fits in the rest of the world SEQC2 tumor normal
March 2013 NIST Reference Material Program and Data Integration
1. NIST Program for Human
Genome Reference Materials
Marc Salit and Justin Zook
NIST
2. Some use cases for a well-characterized, stable RM
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
4. Measurement Process
• gDNA reference materials will be developed to characterize performance of a part of process
– materials will be certified for their variants against a reference sequence, with confidence estimates
[Figure: generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis]
6. Putting “Genomes” in Bottles
• NIST working with GiaB to select genomes
• Current plan
– NA12878 HapMap sample as Pilot sample
• part of 17-member pedigree
– trios from PGP as more complete set
• 8 trios, focus on children
• varying biogeographic ancestry
[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
7. Consenting Genomes for use as
Reference Materials
• Risk of re-identification
– this is a real risk
– privacy
– implications for family members
• Meaning of possibility of
withdrawal
• Commercial application
– indirect, research
– direct, derived products
• PGP project currently state of the art
– broad and direct
– test to demonstrate understanding
• “Wild West”
8. Characterization Methods
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
– including LFR
• Emerging technologies
– Ion Proton
– nanopore?
• 3x replication of sequencing (3 library preps)
• …
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Figure: pedigree diagram labeled Father, Mother, Husband, NA12878, Son, Daughter]
9. Timeline
Consortium Activity
• WG Telecons
– Starting up in April
– Info to be posted on www.genomeinabottle.org
• schedules
• agendas
• summaries
• Website forums
– general and supporting each WG
• Upcoming Workshops
– Proposed 8/2013
• NIST, Gaithersburg, MD
NIST RM Activity
• 80 mg gDNA for NA12878 expected @ NIST 4/2013
– 8000 samples
– available for characterization within GiaB immediately
– target for release as NIST RM 2/2014
• SNPs, small indels
• PGP Samples coming
• IRB Status
– working to establish policy
• looks good for release of NA12878 as pilot RM
• PGP samples expected to gain approval
10. Artificial Constructs
• useful as spike-ins
– QC on clinical samples
• a panel of druggable targets in development at NCI
– pDNA with a mutation insert
• ‘barcoded’ adjacent to mutation of interest
• large-scale constructs may be useful for SV and specific contexts
• recapitulate “difficult” sequence contexts
– simple sequence
– duplications
[Figure: schematic double-stranded sequences comparing a Reference with Inversion and Insertion constructs, plus further construct schematics]
14. Integration of Data to Form “Gold Standard” Genotype Calls
• Candidate variants: Find all possible variant sites
• Confident variants: Find highly confident sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics
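The arbitration and confidence-level steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the consortium's implementation: the `Call` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Call:
    dataset: str        # hypothetical label for a platform/caller combination
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score: higher means more atypical characteristics


def arbitrate(calls: List[Call], typical_cutoff: float = 1.0,
              min_typical: int = 2) -> Tuple[Optional[str], str]:
    """Return (genotype, status) for a single site."""
    # Sort most atypical first, so removing from the front drops the worst dataset.
    remaining = sorted(calls, key=lambda c: c.atypicality, reverse=True)
    # Remove datasets in order of decreasing atypicality until all that remain agree.
    while remaining and len({c.genotype for c in remaining}) > 1:
        remaining.pop(0)
    if not remaining:
        return None, "uncertain"
    # Even when the datasets agree, flag the site if few look typical.
    typical = [c for c in remaining if c.atypicality < typical_cutoff]
    status = "confident" if len(typical) >= min_typical else "uncertain"
    return remaining[0].genotype, status
```

For example, `arbitrate([Call("A", "0/1", 0.2), Call("B", "0/1", 0.3), Call("C", "0/0", 2.5)])` drops dataset C first and returns `("0/1", "confident")`. A real pipeline would use several characteristics per dataset, not one score.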
15. Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local Alignment problems
– Distance from end of read
– Mean position within read
– Read Position Rank Sum
– HaplotypeScore
– Mean length of aligned reads
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage – CNV
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth
16. Example of Arbitration: SSE suspected from strand bias
[Figure: read alignments from Platform A and Platform B; annotations: Strand Bias (SNP overrepresented on reverse strands), Homopolymer]
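One way to quantify the strand bias shown here is a Fisher exact test on the counts of reference and variant reads per strand. The slides do not say which statistic was used, so this is an assumed stand-in, and the function and parameter names are hypothetical.

```python
from math import comb


def strand_bias_p(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 allele-by-strand table."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt
    if total == 0:
        return 1.0
    denom = comb(total, col_fwd)

    def prob(x: int) -> float:
        # Hypergeometric probability that x forward-strand reads carry the reference allele.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / denom

    observed = prob(ref_fwd)
    lo = max(0, col_fwd - row_alt)
    hi = min(row_ref, col_fwd)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)
```

A small p-value, for instance when all variant reads sit on the reverse strand while reference reads are split evenly, suggests a systematic sequencing error, not a true variant.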
17. Performance Assessment of Genotype Calling
• For our purposes, we consider three categories of genotype calls
– homozygous reference
– heterozygous
– homozygous variant
• by convention
– Negative: homozygous reference
– Positive: anything else
• our approach looks at 3x3 matrix of call concordance
• Fourth category: Uncertain Genotype
– developing
• Three performance assessments:
– Individual dataset and Consensus calls against Omni SNP Array
– Individual dataset against Omni SNP Array and Consensus
– Individual dataset with two different genotype callers against Consensus
18. Genotype Comparison Tables
Rows: Method being Assessed; columns: Method as “Truth”
          | Hom. Ref. | Heterozygous | Hom. Variant | Uncertain
Hom. Ref. |           |              |              | ?
Het.      |           |              |              | ?
Hom. Var. |           |              |              | ?
Uncertain | ?         | ?            | ?            | ?
* current state of research: only consensus process has “Uncertain” category
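A comparison table of this shape can be tallied from paired genotype calls. The sketch below is illustrative only: the category labels and function names are assumptions, and the collapse to positive/negative follows the convention stated earlier (homozygous reference is negative, anything else is positive) while skipping uncertain sites.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

CATEGORIES = ("hom_ref", "het", "hom_var", "uncertain")
VARIANT = ("het", "hom_var")


def concordance(pairs: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count (assessed, truth) genotype pairs into a 4x4 table keyed by category."""
    counts = Counter(pairs)
    return {(a, t): counts.get((a, t), 0) for a in CATEGORIES for t in CATEGORIES}


def binary_counts(table: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """Collapse the table to TP/FP/FN/TN, ignoring any cell involving 'uncertain'."""
    return {
        # Note: a het call against a hom_var truth still counts as TP here,
        # which is one reason binary measures fit genotype data poorly.
        "TP": sum(table[(a, t)] for a in VARIANT for t in VARIANT),
        "FP": sum(table[(a, "hom_ref")] for a in VARIANT),
        "FN": sum(table[("hom_ref", t)] for t in VARIANT),
        "TN": table[("hom_ref", "hom_ref")],
    }
```

Keeping the full table alongside the collapsed counts preserves genotype-level discordance that the binary view hides.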
19. Consensus has lower FN rate than individual datasets
Rows: HiSeq – GATK; columns: Illumina Omni SNP Array
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M                | 7.24k (1.34%) | 5.28k (0.65%)      | N/A
Heterozygous                 | 196 (0.03%)          | 411k (60.7%)  | 133 (0.02%)        | N/A
Homozygous Variant           | 154 (0.02%)          | 150 (0.02%)   | 249k (37.0%)       | N/A
Rows: Integrated Consensus Genotypes; columns: Illumina Omni SNP Array
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference         | 1.45M                | 613 (0.09%)   | 977 (0.15%)        | N/A
Heterozygous                 | 241 (0.04%)          | 414k (61.5%)  | 173 (0.03%)        | N/A
Homozygous Variant           | 152 (0.02%)          | 61 (0.01%)    | 249k (36.9%)       | N/A
Uncertain                    | 5458 (0.81%)         | 3421 (0.51%)  | 4808 (0.71%)       | N/A
“FNs”: Homozygous Reference row, where the array reports a variant. “FPs*”: Homozygous Reference column, where the assessed method calls a variant.
* Note that most or all of the putative FPs seem to actually be FNs on the microarray
23. Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
• Data from multiple platforms and library preparations
– when characterizing a Reference Material
– when assessing performance of a test platform
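The point about labeling difficult variants as uncertain can be seen with a small worked calculation. The counts below are made up purely for illustration and the function names are hypothetical; the sketch only shows that excluding a set of sites enriched for errors raises the measured accuracy.

```python
def accuracy(correct: int, wrong: int) -> float:
    """Fraction of assessed calls that match the reference genotype."""
    assessed = correct + wrong
    return correct / assessed if assessed else float("nan")


def accuracy_excluding_uncertain(correct: int, wrong: int,
                                 uncertain_correct: int, uncertain_wrong: int) -> float:
    """Accuracy after sites labeled uncertain are dropped from the assessment."""
    return accuracy(correct - uncertain_correct, wrong - uncertain_wrong)


# Made-up example: 1000 sites, 50 of them called wrongly.
before = accuracy(950, 50)
# Suppose 60 difficult sites are labeled uncertain, 40 of which were wrong calls.
after = accuracy_excluding_uncertain(950, 50, 20, 40)
```

Here `before` is 0.95 and `after` is 930/940, so the test platform looks better even though none of its calls changed.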
24. Genome-in-a-Bottle Consortium
• Genome-in-a-Bottle
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome!
– targeting pilot reference material availability in 2013
– working to identify best practice for consent of subject genome as a whole-genome reference material
• Developing genomic DNA reference materials for small number of microbial species
– to enable performance assessment of sequencing platforms
– range of GC
– range of complexity
26. Microbial Reference Material
Considerations
• Variation in GC Content
– Genomes with a range of GC to
challenge platforms
– Within genome variation to challenge
analytical process to define mobile
genetic and insertion elements
• Structural variations to challenge the
ability to recognize
– Repetitive sequences (e.g. palindromic
repeats)
– Homopolymers (>14 bases)
– Insertion elements
– Chromosomal rearrangements
– SNP calls (e.g. variant silencing due to
motifs)
• Reference data available on multiple
platforms
• Pedigree/phylogeny of strains
• Phenotypic characterization
27. Interesting work on assessing
performance for microbial sequencing
• Quail et al. at Sanger report on using
4 different microbial genomes to
characterize sequencer performance
– ~20% - ~68% GC overall
– Bordetella pertussis
• 67.7 % GC, with some regions in excess
of 90 % GC content
– Salmonella Pullorum
• 52 % GC
– Staphylococcus aureus
• 33 % GC
– Plasmodium falciparum
• 19.3 % GC, with some regions close to 0
% GC content
• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
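The overall and regional GC figures quoted above are the kind of values a short script could compute from a genome sequence. This is a minimal sketch with hypothetical function names; ambiguous bases such as N are left out of the denominator, which is an assumption and not something the slides specify.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def windowed_gc(seq: str, window: int, step: int) -> List[float]:
    """GC fraction in each full window of `window` bases, advancing by `step`."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

Scanning with `windowed_gc` is one way to locate regions with extreme GC content inside a genome whose overall GC fraction is moderate.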
Editor's Notes
focus on the diagnostic power of these characteristic analyses; useful for identifying problems and optimizing, as well as to identify characteristics that are more prone to mis-calls