NIST Program for Human
Genome Reference Materials
    Marc Salit and Justin Zook
               NIST
Some use cases for a well-characterized, stable RM
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
    – integration of data from multiple platforms
    – sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
Measurement Process
• gDNA reference materials will be developed to characterize performance of a part of process
  – materials will be certified for their variants against a reference sequence, with confidence estimates
[Figure: generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis]
Variants of Interest
• SNPs (and larger polymorphisms)
• Indels
• Longer insertions/deletions
• Inversions
• Rearrangements
• CNV (different lengths)
   – Deletions, tandem and dispersed dups
   – duplications with SNPs/indels
• Mobile Element Insertions
[Figure: schematic double-stranded sequence diagrams comparing a reference sequence with variant forms, including an inversion and an insertion]
Putting “Genomes” in Bottles
• NIST working with GiaB to select genomes
• Current plan
  – NA12878 HapMap sample as Pilot sample
     • part of 17-member pedigree
  – trios from PGP as more complete set
     • 8 trios, focus on children
     • varying biogeographic ancestry
[Figure: CEPH Utah Pedigree 1463. First generation: 12889, 12890, 12891, 12892. Second generation: 12877, 12878. Third generation: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
Consenting Genomes for use as Reference Materials
• Risk of re-identification
    – this is a real risk
    – privacy
    – implications for family members
• Meaning of possibility of
  withdrawal
• Commercial application
    – indirect, research
    – direct, derived products
• PGP project currently state-of-art
    – broad and direct
    – test to demonstrate understanding
• “Wild West”
Characterization Methods
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
   – including LFR
• Emerging technologies
   – Ion Proton
   – nanopore?
• 3x replication of sequencing (3 library preps)
• …
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Figure: family pedigree around NA12878: Father, Mother; Husband, NA12878; Son, Daughter]
Timeline
Consortium Activity
• WG Telecons
   – Starting up in April
   – Info to be posted on www.genomeinabottle.org
       • schedules
       • agendas
       • summaries
• Website forums
   – general and supporting each WG
• Upcoming Workshops
   – Proposed 8/2013
       • NIST, Gaithersburg, MD
NIST RM Activity
• 80 mg gDNA for NA12878 expected @ NIST 4/2013
   – 8000 samples
   – available for characterization within GiaB immediately
   – target for release as NIST RM 2/2014
       • SNPs, small indels
• PGP Samples coming
• IRB Status
   – working to establish policy
       • looks good for release of NA12878 as pilot RM
       • PGP samples expected to gain approval
Artificial Constructs
• useful as spike-ins
   – QC on clinical samples
• a panel of druggable targets in development at NCI
   – pDNA with a mutation insert
       • ‘barcoded’ adjacent to mutation of interest
• large-scale constructs may be useful for SV and specific contexts
• recapitulate “difficult” sequence contexts
   – simple sequence
   – duplications
[Figure: schematic double-stranded sequence diagrams comparing a reference sequence with variant forms, including an inversion and an insertion]
Microbial Genome RMs
[Figure: workflow: Reference Samples, Extracted DNA → Sample Preparation → Sequencing → Bioinformatics → Variant List, Performance Metrics]
DATA INTEGRATION
With multiple datasets comes both the opportunity for integration and the question of just how to do it.
Datasets
• 9 whole genome – Illumina, CG, 454, SOLiD
• 3 whole exome – Illumina, Ion Torrent
Integration of Data to Form “Gold Standard” Genotype Calls
• Candidate variants: find all possible variant sites
• Confident variants: find highly confident sites across multiple datasets
• Find characteristics of bias: identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: for each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: even if all datasets agree, identify them as uncertain if few have typical characteristics
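The arbitration and confidence-level steps can be sketched for a single site as below. This is only an illustration under stated assumptions, not the consortium's pipeline: the `Call` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical stand-ins for the several bias characteristics the slides describe.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label, e.g. "A"
    genotype: str       # "0/0", "0/1" or "1/1"
    atypicality: float  # hypothetical score: larger means more evidence of bias


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one site.

    Datasets are removed from most to least atypical until the remaining
    ones agree. The site is labeled "uncertain" if too few of the agreeing
    datasets look typical (cutoff and count are assumed values).
    """
    if not calls:
        raise ValueError("need at least one call")
    remaining = sorted(calls, key=lambda c: c.atypicality)
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # drop the currently most atypical dataset
    genotype = remaining[0].genotype
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    status = "confident" if n_typical >= min_typical else "uncertain"
    return genotype, status


site_1 = [Call("A", "0/1", 0.2), Call("B", "0/1", 0.4), Call("C", "0/0", 3.0)]
site_2 = [Call("A", "0/1", 2.0), Call("B", "0/1", 2.5)]
print(arbitrate(site_1))
print(arbitrate(site_2))
```

In `site_1` the disagreeing dataset is also the most atypical one, so it is dropped and the remaining calls agree. In `site_2` all datasets agree, but none looks typical, so the site is still flagged as uncertain.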
Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
   – Strand bias
   – Base Quality Rank Sum Test
• Local Alignment problems
   – Distance from end of read
   – Mean position within read
   – Read Position Rank Sum
   – HaplotypeScore
   – Mean length of aligned reads
• Mapping problems
   – Mapping Quality
   – Higher (or lower) than expected coverage – CNV
   – Length of aligned reads
• Abnormal allele balance or Quality/Depth
   – Allele Balance
   – Quality/Depth
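As one concrete example of these characteristics, strand bias is commonly summarized with a Fisher exact test on the 2x2 table of reference/alternate reads by forward/reverse strand, reported as a Phred-scaled p-value. The sketch below is a plain-Python illustration of that idea, not the code of any particular variant caller; the function names and the example read counts are assumptions.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def strand_bias_phred(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled strand-bias score: larger means stronger bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return max(0.0, -10 * log10(max(p, 1e-300)))


# Hypothetical read counts: alternate allele seen only on reverse strands.
print(strand_bias_phred(20, 20, 0, 15))
# Hypothetical read counts: alternate allele balanced across strands.
print(strand_bias_phred(20, 20, 10, 10))
```

A site whose alternate allele appears almost exclusively on one strand scores high and would be treated as atypical for that dataset.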
Example of Arbitration: SSE suspected from strand bias
[Figure: read alignments from Platform A and Platform B at a homopolymer; annotation: Strand Bias (SNP overrepresented on reverse strands)]
Performance Assessment of Genotype Calling
• For our purposes, we consider three categories of genotype calls
   – homozygous reference
   – heterozygous
   – homozygous variant
• by convention
   – Negative: homozygous reference
   – Positive: anything else
• our approach looks at 3x3 matrix of call concordance
• Fourth category: Uncertain Genotype
   – developing
• Three performance assessments:
   – Individual dataset and Consensus calls against Omni SNP Array
   – Individual dataset against Omni SNP Array and Consensus
   – Individual dataset with two different genotype callers against Consensus
Genotype Comparison Tables
Rows: Method being Assessed. Columns: Method as “Truth”.

              Hom. Ref   Heterozygous   Hom. Variant   Uncertain
Hom. Ref.                                                  ?
Het.                                                       ?
Hom. Var.                                                  ?
Uncertain        ?            ?              ?             ?

* current state of research: only consensus process has “Uncertain” category
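A comparison table of this shape can be tallied from two sets of per-site genotype calls. The following is a minimal sketch, assuming each method's calls are available as a dictionary from site identifier to one of four category labels; the labels, the function names and the toy sites are all hypothetical. FN and FP follow the convention that homozygous reference is negative and anything else is positive, with cells involving “uncertain” left out of both counts.

```python
from collections import Counter

CATEGORIES = ("hom_ref", "het", "hom_var", "uncertain")


def concordance(assessed, truth):
    """Count (assessed, truth) genotype pairs over sites present in both."""
    table = Counter()
    for site in assessed.keys() & truth.keys():
        table[(assessed[site], truth[site])] += 1
    return table


def fn_fp(table):
    """FN: assessed hom_ref where truth is a variant. FP: the reverse."""
    variant = ("het", "hom_var")
    fn = sum(n for (a, t), n in table.items() if a == "hom_ref" and t in variant)
    fp = sum(n for (a, t), n in table.items() if a in variant and t == "hom_ref")
    return fn, fp


# Toy calls at made-up sites, for illustration only.
assessed = {"s1": "hom_ref", "s2": "het", "s3": "het", "s4": "hom_var"}
truth = {"s1": "het", "s2": "het", "s3": "hom_ref", "s4": "uncertain", "s5": "het"}

table = concordance(assessed, truth)
for row in CATEGORIES:
    print(row, [table[(row, col)] for col in CATEGORIES])
print(fn_fp(table))
```

Site `s5` is skipped because only one method has a call there; a real comparison would need an explicit rule for no-calls, which the slides fold into the homozygous reference row.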
Consensus has lower FN rate than individual datasets

Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.
                          Hom. Reference   Heterozygous    Hom. Variant    Uncertain
Hom. Reference/No Call    1.45M            7.24k (1.34%)   5.28k (0.65%)   N/A
Heterozygous              196 (0.03%)      411k (60.7%)    133 (0.02%)     N/A
Hom. Variant              154 (0.02%)      150 (0.02%)     249k (37.0%)    N/A

Rows: Integrated Consensus Genotypes. Columns: Illumina Omni SNP Array.
                          Hom. Reference   Heterozygous    Hom. Variant    Uncertain
Hom. Reference            1.45M            613 (0.09%)     977 (0.15%)     N/A
Heterozygous              241 (0.04%)      414k (61.5%)    173 (0.03%)     N/A
Hom. Variant              152 (0.02%)      61 (0.01%)      249k (36.9%)    N/A
Uncertain                 5458 (0.81%)     3421 (0.51%)    4808 (0.71%)    N/A

“FNs”: Hom. Reference row, Heterozygous and Hom. Variant columns. “FPs*”: Hom. Reference column, Heterozygous and Hom. Variant rows.
* Note that most or all of the putative FPs seem to actually be FNs on the microarray
SNP arrays overestimate performance

Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.
                          Hom. Reference   Heterozygous    Hom. Variant    Uncertain
Hom. Reference/No Call    1.45M            7.24k (1.34%)   5.28k (0.65%)   N/A
Heterozygous              196 (0.03%)      411k (60.7%)    133 (0.02%)     N/A
Hom. Variant              154 (0.02%)      150 (0.02%)     249k (37.0%)    N/A

Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.
                          Hom. Reference   Heterozygous    Hom. Variant    Uncertain
Hom. Reference/No Call    1.52M            157k (4.68%)    30.3k (0.90%)   4.17M
Heterozygous              47 (0.00%)       1.90M (56.4%)   34 (0.00%)      16.9k (0.50%)
Hom. Variant              1 (0.00%)        298 (0.01%)     1.19M (35.3%)   73.3k (2.18%)

“FNs”: Hom. Reference/No Call row, Heterozygous and Hom. Variant columns. “FPs”: Hom. Reference column, Heterozygous and Hom. Variant rows.
Samtools has higher FP and lower FN than GATK

Rows: HiSeq – samtools. Columns: Integrated Consensus Genotypes.
                          Hom. Reference   Heterozygous    Hom. Variant    Uncertain
Hom. Reference/No Call    1.51M            49.6k (1.47%)   6.74k (0.20%)   3.93M
Heterozygous              3141 (0.09%)     2.00M (59.6%)   74 (0.00%)      175k (5.19%)
Hom. Variant              21 (0.00%)       777 (0.02%)     1.21M (36.0%)   192k (5.71%)

Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.
                          Hom. Reference   Heterozygous    Hom. Variant    Uncertain
Hom. Reference/No Call    1.52M            157k (4.68%)    30.3k (0.90%)   4.17M
Heterozygous              47 (0.00%)       1.90M (56.4%)   34 (0.00%)      16.9k (0.50%)
Hom. Variant              1 (0.00%)        298 (0.01%)     1.19M (35.3%)   73.3k (2.18%)

“FNs”: Hom. Reference/No Call row, Heterozygous and Hom. Variant columns. “FPs”: Hom. Reference column, Heterozygous and Hom. Variant rows.
Performance Metrics: Characteristics of Mis-calls
[Figure: grid of panels with HiSeq/GATK calls (Hom. Ref./No call, Heterozygous, Hom. Variant) as rows and Consensus Genotypes (Hom. Ref., Heterozygous, Hom. Variant, Uncertain) as columns; characteristics shown: QUAL/Depth of Coverage, Strand Bias, ...]
Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
   – Homopolymers, STRs, duplications
   – Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
   – standard diagnostic accuracy measures not well posed
• Data from multiple platforms and library preparations
   – when characterizing a Reference Material
   – when assessing performance of a test platform
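The point about uncertain labels raising apparent accuracy can be seen with a toy calculation. The counts below are invented purely for illustration and are not taken from any dataset in this talk; `apparent_accuracy` is a hypothetical helper.

```python
def apparent_accuracy(calls, exclude_uncertain):
    """Fraction of correct calls.

    calls: list of (is_correct, is_uncertain) pairs, one per site.
    When exclude_uncertain is True, sites labeled uncertain in the
    reference are dropped before computing accuracy.
    """
    kept = [ok for ok, unc in calls if not (exclude_uncertain and unc)]
    return sum(kept) / len(kept)


# Toy data: 90 easy sites, all called correctly, plus 10 difficult sites
# labeled uncertain, of which the test method gets 5 wrong.
calls = [(True, False)] * 90 + [(True, True)] * 5 + [(False, True)] * 5

print(apparent_accuracy(calls, exclude_uncertain=False))
print(apparent_accuracy(calls, exclude_uncertain=True))
```

Excluding the uncertain sites removes exactly the places where the test method struggles, so the second figure is higher even though nothing about the method changed.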
Genome-in-a-Bottle Consortium
• Genome-in-a-Bottle
   – www.genomeinabottle.org
       • newsletters, blogs, forums, announcements
   – new partners welcome!
   – targeting pilot reference material availability in 2013
   – working to identify best practice for consent of subject genome as a whole-genome reference material
• Developing genomic DNA reference materials for small number of microbial species
   – to enable performance assessment of sequencing platforms
   – range of GC
   – range of complexity
QUESTIONS?
Microbial Reference Material Considerations
• Variation in GC Content
   – Genomes with a range of GC to challenge platforms
   – Within genome variation to challenge analytical process to define mobile genetic and insertion elements
• Structural variations to challenge the ability to recognize
   – Repetitive sequences (e.g. palindromic repeats)
   – Homopolymers (>14 bases)
   – Insertion elements
   – Chromosomal rearrangements
   – SNP calls (e.g. variant silencing due to motifs)
• Reference data available on multiple platforms
• Pedigree/phylogeny of strains
• Phenotypic characterization
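Two of these considerations, GC content and long homopolymers, are simple to compute when screening candidate genomes. The sketch below is one possible approach in plain Python; the function names, window sizes and example sequence are assumptions, and `min_len=15` encodes the ">14 bases" criterion from the list above.

```python
import re


def gc_fraction(seq):
    """Fraction of G and C among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def windowed_gc(seq, window, step):
    """GC fraction in sliding windows, to expose within-genome variation."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def long_homopolymers(seq, min_len=15):
    """Return (start, base, length) for runs of at least min_len identical bases."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


example = "ACGT" + "A" * 15 + "GC"  # made-up sequence for illustration
print(gc_fraction(example))
print(windowed_gc("GGCCAATT", window=4, step=4))
print(long_homopolymers(example))
```

Genome-wide GC gives the overall range used to challenge platforms, while the windowed values and homopolymer list point at the local regions most likely to cause trouble.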
Interesting work on assessing performance for microbial sequencing
• Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance
   – ~20% - ~68% GC overall
   – Bordetella pertussis
       • 67.7 % GC, with some regions in excess of 90 % GC content
   – Salmonella Pullorum
       • 52 % GC
   – Staphylococcus aureus
       • 33 % GC
   – Plasmodium falciparum
       • 19.3 % GC, with some regions close to 0 % GC content
• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).

March 2013 NIST Reference Material Program and Data Integration


Editor's Notes

  • #24 focus on the diagnostic power of these characteristic analyses; useful for identifying problems and optimizing, as well as to identify characteristics that are more prone to mis-calls