March 2013 NIST Reference Material Program and Data Integration

  • Focus on the diagnostic power of these characteristic analyses: they are useful for identifying problems and optimizing, and for identifying characteristics that are more prone to mis-calls.
  • March 2013 NIST Reference Material Program and Data Integration

    1. 1. NIST Program for Human Genome Reference Materials. Marc Salit and Justin Zook, NIST
    2. 2. Some use cases for a well-characterized, stable RM
        • Obtain metrics for validation, QC, QA, PT
        • Determine sources and types of bias/error
        • Learn to resolve difficult structural variants
        • Improve reference genome assembly
        • Optimization
          – integration of data from multiple platforms
          – sequencing and analysis
        • Enable regulated applications
        [Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
    4. 4. Measurement Process
        • gDNA reference materials will be developed to characterize performance of a part of process
          – materials will be certified for their variants against a reference sequence, with confidence estimates
        [Diagram: generic measurement process: Sample, gDNA isolation, Library Prep, Sequencing, Alignment/Mapping, Variant Calling, Confidence Estimates, Downstream Analysis]
    5. 5. Variants of Interest
        • SNPs (and larger polymorphisms)
        • Indels
        • Longer insertions/deletions
        • Inversions
        • Rearrangements
        • CNV (different lengths)
          – Deletions, tandem and dispersed dups
          – duplications with SNPs/indels
        • Mobile Element Insertions
        [Figure: double-stranded sequence diagrams comparing the Reference with Inversion and Insertion examples]
    6. 6. Putting “Genomes” in Bottles
        • NIST working with GiaB to select genomes
        • Current plan
          – NA12878 HapMap sample as Pilot sample
            • part of 17-member pedigree
          – trios from PGP as more complete set
            • 8 trios, focus on children
            • varying biogeographic ancestry
        [Figure: CEPH Utah Pedigree 1463, members 12889, 12890, 12891, 12892; 12877, 12878; 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
    7. 7. Consenting Genomes for use as Reference Materials
        • Risk of re-identification
          – this is a real risk
          – privacy
          – implications for family members
        • Meaning of possibility of withdrawal
        • Commercial application
          – indirect, research
          – direct, derived products
        • PGP project currently state-of-art
          – broad and direct
          – test to demonstrate understanding
        • “Wild West”
    8. 8. Characterization Methods
        Whole Genome Sequencing
        • ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
        • Illumina
        • Complete Genomics
          – including LFR
        • Emerging technologies
          – Ion Proton
          – nanopore?
        • 3x replication of sequencing (3 library preps)
        • …
        Other
        • Genotyping microarrays
        • Array CGH
        • Targeted sequencing
        • Fosmid sequencing?
        • Optical Mapping?
        [Figure: pedigree diagram labeled Father, Mother, Husband, NA12878, Son, Daughter]
    9. 9. Timeline
        Consortium Activity
        • WG Telecons
          – Starting up in April
          – Info to be posted on www.genomeinabottle.org
            • schedules
            • agendas
            • summaries
        • Website forums
          – general and supporting each WG
        • Upcoming Workshops
          – Proposed 8/2013
            • NIST, Gaithersburg, MD
        NIST RM Activity
        • 80 mg gDNA for NA12878 expected @ NIST 4/2013
          – 8000 samples
          – available for characterization within GiaB immediately
          – target for release as NIST RM 2/2014
            • SNPs, small indels
        • PGP Samples coming
        • IRB Status
          – working to establish policy
            • looks good for release of NA12878 as pilot RM
            • PGP samples expected to gain approval
    10. 10. Artificial Constructs
        • useful as spike-ins
          – QC on clinical samples
        • a panel of druggable targets in development at NCI
          – pDNA with a mutation insert
            • ‘barcoded’ adjacent to mutation of interest
        • large-scale constructs may be useful for SV and specific contexts
        • recapitulate “difficult” sequence contexts
          – simple sequence
          – duplications
        [Figure: double-stranded sequence diagrams comparing the Reference with Inversion and Insertion examples]
    11. 11. Microbial Genome RMs
        [Diagram: Reference Samples, Extracted DNA, Sample Preparation, Sequencing, Bioinformatics; outputs: Variant List, Performance Metrics]
    12. 12. DATA INTEGRATION. With multiple data sets, there is both an opportunity for integration and the question of just how to do it.
    13. 13. Datasets
        • 9 whole genome
          – Illumina, CG, 454, SOLiD
        • 3 whole exome
          – Illumina, Ion Torrent
    14. 14. Integration of Data to Form “Gold Standard” Genotype Calls
        • Candidate variants: Find all possible variant sites
        • Confident variants: Find highly confident sites across multiple datasets
        • Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
        • Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
        • Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics
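
The arbitration and confidence steps on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Call` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label
    genotype: str       # "hom_ref", "het", or "hom_var"
    atypicality: float  # hypothetical score; higher means more evidence of bias


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one site.

    Datasets are dropped from most to least atypical until the remaining
    ones agree. The site is flagged "uncertain" when too few of the
    agreeing datasets look typical.
    """
    if not calls:
        raise ValueError("need at least one call")
    # Ascending sort, so pop() removes the most atypical remaining dataset.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    genotype = remaining[0].genotype
    typical = [c for c in remaining if c.atypicality <= typical_cutoff]
    status = "confident" if len(typical) >= min_typical else "uncertain"
    return genotype, status


site = [Call("A", "het", 0.2), Call("B", "het", 0.4), Call("C", "hom_ref", 3.1)]
result = arbitrate(site)  # dataset C is removed first, leaving A and B in agreement
```

A real pipeline would judge several characteristics per dataset (strand bias, mapping quality, allele balance, and so on) rather than one combined score; collapsing them into a single number is a simplification made here for brevity.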
    15. 15. Characteristics of Sequence Data/Genotype associated with bias
        • Systematic sequencing errors
          – Strand bias
          – Base Quality Rank Sum Test
        • Local Alignment problems
          – Distance from end of read
          – Mean position within read
          – Read Position Rank Sum
          – HaplotypeScore
          – Mean length of aligned reads
        • Mapping problems
          – Mapping Quality
          – Higher (or lower) than expected coverage – CNV
          – Length of aligned reads
        • Abnormal allele balance or Quality/Depth
          – Allele Balance
          – Quality/Depth
    16. 16. Example of Arbitration: SSE suspected from strand bias
        [Figure: read alignments for Platform A and Platform B at a homopolymer; Platform A shows strand bias (SNP overrepresented on reverse strands)]
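
One common way to quantify the kind of strand bias shown on this slide is Fisher's exact test on a 2x2 table of allele by strand. The slides do not say which statistic was used, so the following is a minimal standard-library sketch under that assumption; the function name and the read counts are hypothetical.

```python
from math import comb


def strand_bias_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact p-value for a 2x2 allele-by-strand table."""
    n_ref = ref_fwd + ref_rev      # reads supporting the reference allele
    n_alt = alt_fwd + alt_rev      # reads supporting the alternate allele
    n_fwd = ref_fwd + alt_fwd      # reads on the forward strand
    total = n_ref + n_alt
    denom = comb(total, n_fwd)

    def prob(k):
        # Hypergeometric probability that k forward reads carry the reference allele.
        return comb(n_ref, k) * comb(n_alt, n_fwd - k) / denom

    p_obs = prob(ref_fwd)
    lo = max(0, n_fwd - n_alt)
    hi = min(n_ref, n_fwd)
    # Sum over all tables at least as extreme as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: the alternate allele appears almost only on reverse reads.
biased = strand_bias_pvalue(ref_fwd=20, ref_rev=22, alt_fwd=1, alt_rev=19)
balanced = strand_bias_pvalue(ref_fwd=20, ref_rev=22, alt_fwd=10, alt_rev=10)
```

A small p-value for one platform but not another at the same site is the pattern that would lead arbitration to discount the biased dataset.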
    17. 17. Performance Assessment of Genotype Calling
        • For our purposes, we consider three categories of genotype calls
          – homozygous reference
          – heterozygous
          – homozygous variant
        • by convention
          – Negative: homozygous reference
          – Positive: anything else
        • our approach looks at 3x3 matrix of call concordance
        • Fourth category: Uncertain Genotype
          – developing
        • Three performance assessments:
          – Individual dataset and Consensus calls against Omni SNP Array
          – Individual dataset against Omni SNP Array and Consensus
          – Individual dataset with two different genotype callers against Consensus
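
The 3x3 concordance matrix and the negative/positive convention described on this slide can be sketched as follows. The genotype labels, function names, and example calls are hypothetical and chosen only for illustration.

```python
from collections import Counter

GENOTYPES = ("hom_ref", "het", "hom_var")


def concordance_matrix(assessed, truth):
    """3x3 counts: rows are the assessed calls, columns the calls taken as truth."""
    counts = Counter(zip(assessed, truth))
    return [[counts[(a, t)] for t in GENOTYPES] for a in GENOTYPES]


def collapse_to_binary(matrix):
    """Apply the convention: negative = homozygous reference, positive = anything else."""
    tn = matrix[0][0]
    fn = matrix[0][1] + matrix[0][2]   # assessed hom_ref, truth is a variant
    fp = matrix[1][0] + matrix[2][0]   # assessed variant, truth is hom_ref
    # Note: het vs. hom_var disagreements still count as "positive" here,
    # which is information the full 3x3 matrix keeps and the binary view loses.
    tp = sum(matrix[r][c] for r in (1, 2) for c in (1, 2))
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


assessed = ["hom_ref", "het", "het", "hom_var", "hom_ref", "het"]
truth = ["hom_ref", "het", "hom_var", "hom_var", "het", "hom_ref"]
matrix = concordance_matrix(assessed, truth)
summary = collapse_to_binary(matrix)
```

Adding the fourth "Uncertain" category would extend `GENOTYPES` and the matrix to 4x4; how uncertain sites should enter the binary summary is left open here, as it is on the slide.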
    18. 18. Genotype Comparison Tables
        [Table layout: columns are the method taken as “Truth” (Hom. Ref., Heterozygous, Hom. Variant, Uncertain); rows are the method being assessed (Hom. Ref., Het., Hom. Var., Uncertain); several cells are marked “?”]
        * current state of research: only consensus process has “Uncertain” category
    19. 19. Consensus has lower FN rate than individual datasets

        HiSeq – GATK (rows) against Illumina Omni SNP Array (columns)
        | | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
        |---|---|---|---|---|
        | Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A |
        | Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A |
        | Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A |

        Integrated Consensus Genotypes (rows) against Illumina Omni SNP Array (columns)
        | | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
        |---|---|---|---|---|
        | Homozygous Reference | 1.45M | 613 (0.09%) | 977 (0.15%) | N/A |
        | Heterozygous | 241 (0.04%) | 414k (61.5%) | 173 (0.03%) | N/A |
        | Homozygous Variant | 152 (0.02%) | 61 (0.01%) | 249k (36.9%) | N/A |
        | Uncertain | 5458 (0.81%) | 3421 (0.51%) | 4808 (0.71%) | N/A |

        “FNs”: Homozygous Reference row, variant columns. “FPs*”: Homozygous Reference column, variant rows.
        * Note that most or all of the putative FPs seem to actually be FNs on the microarray
    20. 20. SNP arrays overestimate performance

        HiSeq – GATK (rows) against Illumina Omni SNP Array (columns)
        | | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
        |---|---|---|---|---|
        | Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A |
        | Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A |
        | Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A |

        HiSeq – GATK (rows) against Integrated Consensus Genotypes (columns)
        | | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
        |---|---|---|---|---|
        | Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M |
        | Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%) |
        | Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%) |

        “FNs”: Homozygous Reference/No Call row, variant columns. “FPs”: Homozygous Reference column, variant rows (marked “FPs*” in the array comparison).
    21. 21. Samtools has higher FP and lower FN than GATK

        HiSeq – samtools (rows) against Integrated Consensus Genotypes (columns)
        | | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
        |---|---|---|---|---|
        | Homozygous Reference/No Call | 1.51M | 49.6k (1.47%) | 6.74k (0.20%) | 3.93M |
        | Heterozygous | 3141 (0.09%) | 2.00M (59.6%) | 74 (0.00%) | 175k (5.19%) |
        | Homozygous Variant | 21 (0.00%) | 777 (0.02%) | 1.21M (36.0%) | 192k (5.71%) |

        HiSeq – GATK (rows) against Integrated Consensus Genotypes (columns)
        | | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain |
        |---|---|---|---|---|
        | Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M |
        | Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%) |
        | Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%) |

        “FNs”: Homozygous Reference/No Call row, variant columns. “FPs”: Homozygous Reference column, variant rows.
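
The headline of this slide can be checked with simple arithmetic on the counts it reports. The sketch below uses the slide's rounded figures (for example 49.6k entered as 49,600), so the totals are approximate; the dictionary layout and the `fn_fp` helper are hypothetical, and the Uncertain column is left out.

```python
# Rows: the caller's genotype. Tuple order: consensus hom. ref., het., hom. var.
# Values are the rounded counts shown on the slide.
samtools = {
    "hom_ref": (1_510_000, 49_600, 6_740),
    "het": (3_141, 2_000_000, 74),
    "hom_var": (21, 777, 1_210_000),
}
gatk = {
    "hom_ref": (1_520_000, 157_000, 30_300),
    "het": (47, 1_900_000, 34),
    "hom_var": (1, 298, 1_190_000),
}


def fn_fp(table):
    """Count FNs and FPs with negative = hom. ref. and positive = anything else."""
    fn = table["hom_ref"][1] + table["hom_ref"][2]  # called ref/no call, consensus is variant
    fp = table["het"][0] + table["hom_var"][0]      # called variant, consensus is hom. ref.
    return fn, fp


samtools_fn, samtools_fp = fn_fp(samtools)
gatk_fn, gatk_fp = fn_fp(gatk)
```

With these inputs samtools gives about 56.3k FNs and 3,162 FPs, against about 187.3k FNs and 48 FPs for GATK, which matches the slide title.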
    22. 22. Performance Metrics: Characteristics of Mis-calls
        [Figure: HiSeq/GATK calls (Hom. Ref./No call, Heterozygous, Hom. Variant) against Consensus Genotypes (Hom. Ref., Heterozygous, Hom. Variant, Uncertain), annotated with QUAL/Depth of Coverage, Strand Bias, ...]
    23. 23. Challenges with assessing performance
        • All variant types are not equal
        • Nearby variants are often difficult to align
        • All regions of the genome are not equal
          – Homopolymers, STRs, duplications
          – Can be similar or different in different genomes
        • Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
        • Genotypes fall in 3+ categories (not positive/negative)
          – standard diagnostic accuracy measures not well posed
        • Data from multiple platforms and library preparations
          – when characterizing a Reference Material
          – when assessing performance of a test platform
    24. 24. Genome-in-a-Bottle Consortium
        • Genome-in-a-Bottle
          – www.genomeinabottle.org
            • newsletters, blogs, forums, announcements
          – new partners welcome!
          – targeting pilot reference material availability in 2013
          – working to identify best practice for consent of subject genome as a whole-genome reference material
        • Developing genomic DNA reference materials for small number of microbial species
          – to enable performance assessment of sequencing platforms
          – range of GC
          – range of complexity
    25. 25. QUESTIONS?
    26. 26. Microbial Reference Material Considerations
        • Variation in GC Content
          – Genomes with a range of GC to challenge platforms
          – Within genome variation to challenge analytical process to define mobile genetic and insertion elements
        • Structural variations to challenge the ability to recognize
          – Repetitive sequences (e.g. palindromic repeats)
          – Homopolymers (>14 bases)
          – Insertion elements
          – Chromosomal rearrangements
          – SNP calls (e.g. variant silencing due to motifs)
        • Reference data available on multiple platforms
        • Pedigree/phylogeny of strains
        • Phenotypic characterization
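
As a small illustration of the "Homopolymers (>14 bases)" criterion, the sketch below scans a sequence for single-base runs of 15 or more. The function name and the toy sequence are hypothetical, and a real genome would normally be read from a FASTA file rather than a string literal.

```python
import re


def homopolymer_runs(seq, min_len=15):
    """Return (start, base, length) for each run of one base at least min_len long."""
    # ([ACGT]) captures a base; \1{n,} requires n or more further copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]


# Toy sequence: one 15-base A run (reported) and one 14-base T run (too short).
toy = "ACGT" + "A" * 15 + "CG" + "T" * 14
runs = homopolymer_runs(toy)
```

Positions are 0-based offsets into the input string.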
    27. 27. Interesting work on assessing performance for microbial sequencing
        • Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance
          – ~20% - ~68% GC overall
          – Bordetella pertussis
            • 67.7 % GC, with some regions in excess of 90 % GC content
          – Salmonella Pullorum
            • 52 % GC
          – Staphylococcus aureus
            • 33 % GC
          – Plasmodium falciparum
            • 19.3 % GC, with some regions close to 0 % GC content
        • “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
        Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
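
The overall and regional GC figures quoted on this slide correspond to two simple calculations, sketched here. The function names, the window and step sizes, and the toy sequence are hypothetical; ambiguous bases such as N are ignored, which is an assumption of this sketch and not something the slides specify.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases, or None if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None
    return gc / (gc + at)


def windowed_gc(seq, window=100, step=50):
    """GC fraction in sliding windows, as (start, fraction) pairs."""
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


# Toy sequence with an AT-only half and a GC-only half.
toy = "AAAA" + "GGGG"
overall = gc_fraction(toy)
by_window = windowed_gc(toy, window=4, step=4)
```

Windowed values are what reveal regions "in excess of 90 % GC" or "close to 0 % GC" inside a genome whose overall GC content looks unremarkable.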
