Aug2013 NIST highly confident genotype calls for NA12878


Published in: Technology

  1. Using Highly Confident Genotype Calls for NA12878 to understand sequencing accuracy
     Genome in a Bottle Consortium
     Justin Zook, Ph.D. and Marc Salit, Ph.D.
     National Institute of Standards and Technology
  2. Why create a set of highly confident genotypes for a genome?
     • Current validation methods have limited purview or accuracy
     • Sanger confirmation
       – Limited by number of sites (and sometimes it’s wrong)
     • High depth NGS confirmation
       – May have same systematic errors
     • Genotyping microarrays
       – Limited to known (easier) variants
       – Problems with neighboring variants, homopolymers, duplications
     • Mendelian inheritance
       – Can’t account for some systematic errors
     • Simulated data
       – Generally not very representative of errors in real data
     • Ti/Tv
       – Varies by region of genome, and only gives overall statistic
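
As a small illustration of the Ti/Tv statistic mentioned above, here is a minimal Python sketch. The function name `ti_tv_ratio` and the `(ref, alt)` tuple input are assumptions made for this example, not part of the presentation. Because it returns one number for a whole call set, it shows why the statistic cannot point to individual wrong calls.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return the transition/transversion ratio for a list of (ref, alt) SNPs.

    Hypothetical helper: assumes every entry is a single-base substitution
    with ref != alt and both bases in A, C, G, T.
    """
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snps)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
    print(ti_tv_ratio(example))
```
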
  3. Goals for Data Integration
     • Carefully define highly confident regions of the genome
       – distinguish between Hom Ref and Uncertain
     • ~0 false positive AND false negative calls in confident regions
     • Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
     • Avoid bias towards any particular platform
     • Avoid bias towards any particular bioinformatics algorithms
  4. Integrate 12 Datasets from 5 platforms
  5. Integration of Data to Form Highly Confident Genotype Calls
     • Candidate variants: Find all possible variant sites
     • Confident variants: Find highly confident sites across multiple datasets
     • Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
     • Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
     • Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications or long repeats
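
The arbitration and confidence-level steps can be sketched roughly as follows. This is an illustrative simplification, not the consortium's implementation: the `Call` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical stand-ins for the per-characteristic evidence the slides describe.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # name of the sequencing dataset
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score: higher means more evidence of bias


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one candidate site.

    Datasets are dropped from most to least atypical until the remaining
    ones agree. The site is then labeled "uncertain" unless at least
    `min_typical` of the agreeing datasets look typical.
    """
    if not calls:
        return None, "uncertain"
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # discard the most atypical dataset still present
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    status = "confident" if n_typical >= min_typical else "uncertain"
    return remaining[0].genotype, status


if __name__ == "__main__":
    site = [
        Call("platformA", "0/1", 0.2),
        Call("platformB", "0/1", 0.5),
        Call("platformC", "0/0", 3.0),  # disagrees and looks biased
    ]
    print(arbitrate(site))
```

A real pipeline would also mark sites in known segmental duplications or long repeats as uncertain regardless of agreement, as the last step above states; that lookup is omitted here.
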
  6. Characteristics of Sequence Data/Genotype associated with bias
     • Systematic sequencing errors
       – Strand bias
       – Base Quality Rank Sum Test
     • Local Alignment problems
       – Distance from end of read
       – Read Position Rank Sum
       – HaplotypeScore
     • Mapping problems
       – Mapping Quality
       – Higher (or lower) than expected coverage – CNV
       – Length of aligned reads
     • Abnormal allele balance or Quality/Depth
       – Allele Balance
       – Quality/Depth
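
To make one of these characteristics concrete, the sketch below scores strand bias with a two-sided Fisher's exact test on a 2x2 table of reference/alternate reads by forward/reverse strand. The slides do not say which statistic was used, so Fisher's exact test is an assumption here (it is a common choice for this purpose), and the function name and argument order are hypothetical.

```python
from math import comb


def strand_bias_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table.

    A small p-value means the alternate allele is distributed across
    strands differently from the reference allele.
    """
    row1 = ref_fwd + ref_rev
    row2 = alt_fwd + alt_rev
    col1 = ref_fwd + alt_fwd
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x forward-strand reference reads.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(ref_fwd)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    print(strand_bias_pvalue(10, 10, 10, 10))  # balanced strands
    print(strand_bias_pvalue(20, 20, 0, 15))   # alternate allele only on reverse reads
```
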
  7. Regions excluded as uncertain
     More recently, we also exclude homopolymers and long STRs, and 30 bp on each side of uncertain heterozygous and homozygous variant positions
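
The 30 bp padding rule can be illustrated with a short interval-merging sketch. The function name `padded_exclusions`, the 0-based positions, and the BED-style half-open output are assumptions for this example; the slides do not specify the coordinate convention used.

```python
def padded_exclusions(positions, pad=30):
    """Pad each uncertain position by `pad` bp per side and merge overlaps.

    `positions` is an iterable of (chrom, pos) with 0-based positions.
    Returns sorted (chrom, start, end) tuples with half-open ends,
    as in BED files.
    """
    intervals = sorted(
        (chrom, max(0, pos - pad), pos + pad + 1) for chrom, pos in positions
    )
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    uncertain = [("chr1", 100), ("chr1", 140), ("chr1", 500), ("chr2", 10)]
    for interval in padded_exclusions(uncertain):
        print(*interval, sep="\t")
```
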
  8. Example of Arbitration: systematic sequencing error (SSE) suspected from strand bias
     [Figure: reads from Platform A and Platform B at a homopolymer, showing strand bias (SNP overrepresented on reverse strands)]
  9. Verification of “Highly Confident” Genotype accuracy
     • Sanger sequencing
       – 100% accuracy but only 100s of sites
     • X Prize Fosmid sequencing
       – Artifacts at end of fosmids
     • Microarrays
       – Differences appear to be FP or FN in arrays
     • Broad 250bp HaplotypeCaller
       – Very highly concordant, except a few systematic errors and homopolymers
     • Platinum genomes pedigree SNPs
       – Some systematic errors are inherited; different representations of complex variants
     • Real Time Genomics Trio SNPs and indels
       – Some interesting sites called by RTG complex caller but have no evidence in mapped reads
  10. GCAT – Interactive Performance Metrics
      • NIST is working with GCAT to use our highly confident variant calls
      • Assess performance of many combinations of mappers and variant callers
      • www.bioplanet.com/gcat
  11. Why do calls differ from our highly confident genotypes?
      Calls not in Integration
      • Platform-specific systematic sequencing errors for SNPs
      • Analysis-specific
      • Difficult to map regions
      • Indels in long homopolymers
      Calls specific to Integration
      • Different complex variant representation
      • Some are incorrectly filtered as suspected FPs
  12. Illumina-specific Systematic Sequencing Errors
  13. Complex variants have multiple correct representations
      [Figure: the same site as represented by BWA, ssaha2, CGTools, and Novoalign against the reference (Ref); annotations: “T insertion”, “TCTCT insertion”]

                                    FP SNPs       FP MNPs      FP indels
      Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
      Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
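
One way to see why differently written calls can both be correct is to apply each representation to the reference and compare the resulting sequences. The sketch below does that on a toy reference; it is not the realignment-based comparison reported in the table, and the function names and the `(pos, ref_allele, alt_allele)` tuple format (0-based positions) are assumptions for illustration.

```python
def apply_variants(ref, variants):
    """Return the haplotype produced by applying non-overlapping variants.

    `variants` is a list of (pos, ref_allele, alt_allele) with 0-based pos.
    """
    pieces, cursor = [], 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError(f"overlapping variant at {pos}")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(ref[cursor:pos])  # unchanged reference up to the variant
        pieces.append(alt_allele)       # substituted sequence
        cursor = pos + len(ref_allele)
    pieces.append(ref[cursor:])
    return "".join(pieces)


def same_haplotype(ref, calls_a, calls_b):
    """True if two call sets describe the same resulting sequence."""
    return apply_variants(ref, calls_a) == apply_variants(ref, calls_b)


if __name__ == "__main__":
    ref = "ATCTCTG"
    # The same 2 bp insertion in a repeat, placed at either end of the repeat.
    print(same_haplotype(ref, [(0, "A", "ATC")], [(5, "T", "TCT")]))
    # One MNP versus two adjacent SNPs.
    print(same_haplotype(ref, [(1, "TC", "GA")], [(1, "T", "G"), (2, "C", "A")]))
```

A position-by-position comparison would count each of these pairs as a mismatch, which is consistent with the higher apparent FP rates in the "traditional comparison" row.
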
  14. Uncertain variants: Difficult to map regions
  15. Uncertain variants: Indels in long homopolymers
  16. Uncertain variants: Regions with “decoy sequence”
  17. Challenges with assessing performance
      • All variant types are not equal
      • Nearby variants are often difficult to align
        – Multiple representations
      • All regions of the genome are not equal
        – Homopolymers, STRs, duplications
        – Can be similar or different in different genomes
      • Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
      • Genotypes fall in 3+ categories (not positive/negative)
        – standard diagnostic accuracy measures not well posed
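
The last point can be made concrete with a genotype confusion matrix. In the minimal sketch below, the category labels and the function name are assumptions for illustration; a site called homozygous variant where the truth is heterozygous lands in an off-diagonal cell that is neither a clean false positive nor a clean false negative.

```python
from collections import Counter

# Hypothetical category labels; an "uncertain" class could be added as a fourth.
CATEGORIES = ("hom_ref", "het", "hom_var")


def genotype_confusion(truth, calls):
    """Return a matrix with rows = truth category and columns = called category."""
    counts = Counter(zip(truth, calls))
    return [[counts[(t, c)] for c in CATEGORIES] for t in CATEGORIES]


if __name__ == "__main__":
    truth = ["het", "het", "hom_var", "hom_ref"]
    calls = ["het", "hom_var", "hom_var", "het"]
    for label, row in zip(CATEGORIES, genotype_confusion(truth, calls)):
        print(f"{label:8s}", row)
```
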
  18. How to incorporate inheritance in multi-platform integration
      • Adding confidence
        – Site follows expected inheritance pattern (and not all homozygous)
      • Identifying errors
        – Mendelian inheritance errors
        – Sites where all family members are heterozygous
        – Some CNVs
      • Limitations of inheritance
        – All homozygous sites can still be systematic errors
        – Some errors can follow inheritance pattern (e.g., incorrect alignment around indel, some CNVs)
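
The inheritance rules above can be sketched for a single trio as follows. This is a simplified illustration for diploid autosomal sites only; the function names, the allele-tuple genotype encoding, and the returned labels are hypothetical, and CNV-related cases are not modeled.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child could have received one allele from each parent.

    Genotypes are 2-tuples of allele indices, e.g. (0, 1) for a het call.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def classify_site(child, mother, father):
    """Label a trio site following the rules listed on the slide."""
    trio = (child, mother, father)
    if not is_mendelian_consistent(child, mother, father):
        return "mendelian_error"
    if all(g[0] != g[1] for g in trio):
        return "suspicious_all_het"      # flagged on the slide as a possible error
    if all(g[0] == g[1] for g in trio):
        return "uninformative_all_hom"   # can still be a systematic error
    return "supports_call"


if __name__ == "__main__":
    print(classify_site((0, 1), (0, 0), (1, 1)))
    print(classify_site((1, 1), (0, 0), (0, 1)))
    print(classify_site((0, 1), (0, 1), (0, 1)))
```
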
  19. Availability of data, genotype calls, and methods
      • Data for NA12878 is available on NCBI GIAB ftp site (see blogs on genomeinabottle.org)
        – mirrored to Amazon today
      • Highly confident genotype calls and bed files available on GIAB ftp site
      • Pre-print of manuscript available on arxiv.org
      • See genomeinabottle.org blog posts for more information
  20. Acknowledgements
      • GCAT – David Mittelman and Jason Wang
      • FDA HPC – Mike Mikailov, Brian Fitzgerald, et al.
      • HSPH – Brad Chapman, Oliver Hofmann, Win Hide
      • Genome in a Bottle Consortium – www.genomeinabottle.org
      • newsletters, blogs, forums, announcements
        – new partners welcome! Open to anyone
        – targeting pilot reference material availability in early 2014
