NIST Program for Human Genome Reference Materials
Marc Salit and Justin Zook
NIST
Some use cases for a well-characterized, stable RM
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
    – integration of data from multiple platforms
    – sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
Measurement Process
• gDNA reference materials will be developed to characterize performance of part of the process
  – materials will be certified for their variants against a reference sequence, with confidence estimates
[Figure: generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis]
Variants of Interest
• SNPs (and larger polymorphisms)
• Indels
• Longer insertions/deletions
• Inversions
• Rearrangements
• CNV (different lengths)
   – Deletions, tandem and dispersed dups
   – duplications with SNPs/indels
• Mobile Element Insertions
[Figure: sequence schematics labeled Reference, Inversion, and Insertion, plus unlabeled schematics for the other variant types]
Putting “Genomes” in Bottles
• NIST working with GiaB to select genomes
• Current plan
  – NA12878 HapMap sample as Pilot sample
     • part of 17-member pedigree
  – trios from PGP as more complete set
     • 8 trios, focus on children
     • varying biogeographic ancestry
[Figure: CEPH Utah Pedigree 1463. Top row: 12889, 12890, 12891, 12892. Middle row: 12877, 12878. Bottom row: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
Consenting Genomes for use as Reference Materials
• Risk of re-identification
    – this is a real risk
    – privacy
    – implications for family members
• Meaning of possibility of
  withdrawal
• Commercial application
    – indirect, research
    – direct, derived products
• PGP project currently state-of-art
    – broad and direct
    – test to demonstrate understanding
• “Wild West”
Characterization Methods
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
   – including LFR
• Emerging technologies
   – Ion Proton
   – nanopore?
• 3x replication of sequencing (3 library preps)
• …
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Figure: pedigree with Father and Mother; Husband and NA12878; Son and Daughter]
Timeline
Consortium Activity
• WG Telecons
   – Starting up in April
   – Info to be posted on www.genomeinabottle.org
       • schedules
       • agendas
       • summaries
• Website forums
   – general and supporting each WG
• Upcoming Workshops
   – Proposed 8/2013
       • NIST, Gaithersburg, MD
NIST RM Activity
• 80 mg gDNA for NA12878 expected @ NIST 4/2013
   – 8000 samples
   – available for characterization within GiaB immediately
   – target for release as NIST RM 2/2014
       • SNPs, small indels
• PGP Samples coming
• IRB Status
   – working to establish policy
       • looks good for release of NA12878 as pilot RM
       • PGP samples expected to gain approval
Artificial Constructs
• useful as spike-ins
   – QC on clinical samples
• a panel of druggable targets in development at NCI
   – pDNA with a mutation insert
       • ‘barcoded’ adjacent to mutation of interest
• large-scale constructs may be useful for SV and specific contexts
• recapitulate “difficult” sequence contexts
   – simple sequence
   – duplications
[Figure: sequence schematics labeled Reference, Inversion, and Insertion, plus unlabeled schematics for the other variant types]
Microbial Genome RMs
[Figure: Reference Samples (Extracted DNA) → Sample Preparation → Sequencing → Bioinformatics → Variant List, Performance Metrics]
With multiple data sets, there is both an opportunity for integration and the question of just how to do it.

DATA INTEGRATION
Datasets
• 9 whole genome – Illumina, CG, 454, SOLiD
• 3 whole exome – Illumina, Ion Torrent
Integration of Data to Form “Gold Standard” Genotype Calls
• Candidate variants: Find all possible variant sites
• Confident variants: Find highly confident sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics
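The arbitration and confidence steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the consortium's implementation: the `Call` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` parameters are all hypothetical names introduced here. The slides describe several separate characteristics rather than one combined score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    dataset: str        # hypothetical label for a platform/library dataset
    genotype: str       # "hom_ref", "het", or "hom_var"
    atypicality: float  # hypothetical score: higher means more atypical


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Arbitrate one site; return (genotype, names of datasets kept)."""
    # Most typical first, so the most atypical call sits at the end.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Remove the most atypical dataset until the remaining ones agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    kept = [c.dataset for c in remaining]
    # Even unanimous calls are uncertain if too few look typical.
    n_typical = sum(c.atypicality <= typical_cutoff for c in remaining)
    if not remaining or n_typical < min_typical:
        return "uncertain", kept
    return remaining[0].genotype, kept


site = [
    Call("platform_A", "het", 0.2),
    Call("platform_B", "het", 0.5),
    Call("platform_C", "hom_ref", 3.0),  # e.g. a call with strong strand bias
]
genotype, kept = arbitrate(site)
print(genotype, kept)
```

Here the discordant, most atypical dataset is dropped first and the two remaining datasets agree. A site where every agreeing dataset scores above the cutoff would instead be reported as uncertain.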
Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
   – Strand bias
   – Base Quality Rank Sum Test
• Local Alignment problems
   – Distance from end of read
   – Mean position within read
   – Read Position Rank Sum
   – HaplotypeScore
   – Mean length of aligned reads
• Mapping problems
   – Mapping Quality
   – Higher (or lower) than expected coverage – CNV
   – Length of aligned reads
• Abnormal allele balance or Quality/Depth
   – Allele Balance
   – Quality/Depth
Example of Arbitration: SSE suspected from strand bias
[Figure: Platform A and Platform B at a homopolymer site, with strand bias (SNP overrepresented on reverse strands)]
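One common way to quantify strand bias is Fisher's exact test on the 2x2 table of allele (reference or alternate) by read strand (forward or reverse). The sketch below uses only the Python standard library. The function names and the read counts are illustrative assumptions, and the slides do not say which statistic was actually used.

```python
from math import comb, log10


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for an allele-by-strand table."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    denom = comb(row_ref + row_alt, col_fwd)

    def prob(x):
        # Hypergeometric probability that x reference reads are forward.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / denom

    p_obs = prob(ref_fwd)
    lo = max(0, col_fwd - row_alt)
    hi = min(row_ref, col_fwd)
    # Sum every table that is no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def phred(p):
    """Phred-scale a p-value; larger values mean stronger evidence of bias."""
    return -10 * log10(max(p, 1e-300))


# Hypothetical counts: the alternate allele is seen only on reverse reads.
biased = fisher_exact_two_sided(ref_fwd=20, ref_rev=20, alt_fwd=0, alt_rev=15)
# Hypothetical counts: the alternate allele is seen on both strands.
balanced = fisher_exact_two_sided(ref_fwd=20, ref_rev=20, alt_fwd=8, alt_rev=7)
print(phred(biased), phred(balanced))
```

A small p-value (a high Phred-scaled score) for one platform at a site, with no such signal on another platform, is the kind of atypical characteristic that would make the first platform's call a candidate for removal during arbitration.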
Performance Assessment of Genotype Calling
• For our purposes, we consider three categories of genotype calls
   – homozygous reference
   – heterozygous
   – homozygous variant
• by convention
   – Negative: homozygous reference
   – Positive: anything else
• our approach looks at 3x3 matrix of call concordance
• Fourth category: Uncertain Genotype
   – developing
• Three performance assessments:
   – Individual dataset and Consensus calls against Omni SNP Array
   – Individual dataset against Omni SNP Array and Consensus
   – Individual dataset with two different genotype callers against Consensus
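A minimal Python sketch of the 3x3 concordance matrix, and of collapsing it to positive/negative counts under the convention above, might look like this. The site identifiers, category strings, and function names are hypothetical, and sites carrying any other label (such as an uncertain genotype) are simply left out of the matrix here.

```python
from collections import Counter

CATEGORIES = ("hom_ref", "het", "hom_var")


def concordance_matrix(assessed, truth):
    """Rows: method being assessed. Columns: method treated as truth."""
    shared = assessed.keys() & truth.keys()
    counts = Counter((assessed[s], truth[s]) for s in shared)
    return [[counts[(a, t)] for t in CATEGORIES] for a in CATEGORIES]


def collapse_to_binary(matrix):
    """Negative = homozygous reference; positive = anything else."""
    return {
        "TN": matrix[0][0],
        "FN": matrix[0][1] + matrix[0][2],
        "FP": matrix[1][0] + matrix[2][0],
        "TP": sum(matrix[i][j] for i in (1, 2) for j in (1, 2)),
    }


# Toy genotypes keyed by hypothetical site identifiers.
assessed = {"s1": "hom_ref", "s2": "het", "s3": "hom_var", "s4": "hom_ref", "s5": "het"}
truth = {"s1": "hom_ref", "s2": "het", "s3": "het", "s4": "het", "s5": "hom_ref"}

matrix = concordance_matrix(assessed, truth)
print(matrix)
print(collapse_to_binary(matrix))
```

Note that the binary collapse counts site `s3` (called homozygous variant where the truth is heterozygous) as a true positive, even though the genotype is wrong. Keeping the full 3x3 matrix preserves that distinction.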
Genotype Comparison Tables
[Figure: blank comparison table. Rows (Method being Assessed): Hom. Ref., Het., Hom. Var., Uncertain. Columns (Method as “Truth”): Hom. Ref, Heterozygous, Hom. Variant, Uncertain. The Uncertain row and column are marked “?”.]
* current state of research: only consensus process has “Uncertain” category
Consensus has lower FN rate than individual datasets

HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns)
                                 Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call     1.45M                  7.24k (1.34%)   5.28k (0.65%)        N/A
Heterozygous                     196 (0.03%)            411k (60.7%)    133 (0.02%)          N/A
Homozygous Variant               154 (0.02%)            150 (0.02%)     249k (37.0%)         N/A
“FNs”: Homozygous Reference/No Call row, Heterozygous and Homozygous Variant columns
“FPs*”: Heterozygous and Homozygous Variant rows, Homozygous Reference column

Integrated Consensus Genotypes (rows) vs. Illumina Omni SNP Array (columns)
                                 Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference             1.45M                  613 (0.09%)     977 (0.15%)          N/A
Heterozygous                     241 (0.04%)            414k (61.5%)    173 (0.03%)          N/A
Homozygous Variant               152 (0.02%)            61 (0.01%)      249k (36.9%)         N/A
Uncertain                        5458 (0.81%)           3421 (0.51%)    4808 (0.71%)         N/A
“FNs”: Homozygous Reference row, Heterozygous and Homozygous Variant columns
“FPs*”: Heterozygous and Homozygous Variant rows, Homozygous Reference column

* Note that most or all of the putative FPs seem to actually be FNs on the microarray
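To make the comparison concrete, the false-negative rate at array-variant sites can be recomputed from the counts printed on the slide. This is a rough sketch under assumptions: the counts are the rounded values shown (for example 411k is taken as 411,000), the definition FN / (FN + TP) and the handling of uncertain calls are choices made here, and the results need not match however the slide's own percentages were derived.

```python
def fn_rate(rows, uncertain_as_missed=(0, 0)):
    """FN / (FN + TP) over sites the array calls het or hom. variant.

    rows maps the assessed call to (count at array-het sites,
    count at array-hom-variant sites).
    """
    missed = sum(rows["hom_ref"]) + sum(uncertain_as_missed)
    found = sum(rows["het"]) + sum(rows["hom_var"])
    return missed / (missed + found)


# Counts copied from the slide (rounded values such as "411k" expanded).
hiseq_gatk = {"hom_ref": (7_240, 5_280), "het": (411_000, 133), "hom_var": (150, 249_000)}
consensus = {"hom_ref": (613, 977), "het": (414_000, 173), "hom_var": (61, 249_000)}
consensus_uncertain = (3_421, 4_808)

rate_gatk = fn_rate(hiseq_gatk)
rate_consensus = fn_rate(consensus)
rate_consensus_strict = fn_rate(consensus, uncertain_as_missed=consensus_uncertain)
print(f"{rate_gatk:.2%} {rate_consensus:.2%} {rate_consensus_strict:.2%}")
```

With these rounded inputs the rates come out at roughly 1.9% for HiSeq – GATK and 0.2% for the consensus with uncertain sites excluded. Counting every uncertain site as missed raises the consensus figure to about 1.5%, which shows how much the treatment of uncertain calls affects apparent accuracy.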
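SNP arrays overestimate performance

HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns)
                                 Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call     1.45M                  7.24k (1.34%)   5.28k (0.65%)        N/A
Heterozygous                     196 (0.03%)            411k (60.7%)    133 (0.02%)          N/A
Homozygous Variant               154 (0.02%)            150 (0.02%)     249k (37.0%)         N/A
“FNs”: Homozygous Reference/No Call row, Heterozygous and Homozygous Variant columns
“FPs*”: Heterozygous and Homozygous Variant rows, Homozygous Reference column

HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns)
                                 Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call     1.52M                  157k (4.68%)    30.3k (0.90%)        4.17M
Heterozygous                     47 (0.00%)             1.90M (56.4%)   34 (0.00%)           16.9k (0.50%)
Homozygous Variant               1 (0.00%)              298 (0.01%)     1.19M (35.3%)        73.3k (2.18%)
“FNs”: Homozygous Reference/No Call row, Heterozygous and Homozygous Variant columns
“FPs”: Heterozygous and Homozygous Variant rows, Homozygous Reference column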
Samtools has higher FP and lower FN than GATK

HiSeq – samtools (rows) vs. Integrated Consensus Genotypes (columns)
                                 Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call     1.51M                  49.6k (1.47%)   6.74k (0.20%)        3.93M
Heterozygous                     3141 (0.09%)           2.00M (59.6%)   74 (0.00%)           175k (5.19%)
Homozygous Variant               21 (0.00%)             777 (0.02%)     1.21M (36.0%)        192k (5.71%)
“FNs”: Homozygous Reference/No Call row, Heterozygous and Homozygous Variant columns
“FPs”: Heterozygous and Homozygous Variant rows, Homozygous Reference column

HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns)
                                 Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call     1.52M                  157k (4.68%)    30.3k (0.90%)        4.17M
Heterozygous                     47 (0.00%)             1.90M (56.4%)   34 (0.00%)           16.9k (0.50%)
Homozygous Variant               1 (0.00%)              298 (0.01%)     1.19M (35.3%)        73.3k (2.18%)
“FNs”: Homozygous Reference/No Call row, Heterozygous and Homozygous Variant columns
“FPs”: Heterozygous and Homozygous Variant rows, Homozygous Reference column
Performance Metrics: Characteristics of Mis-calls
[Figure: HiSeq/GATK calls (Hom. Ref./No call, Heterozygous, Hom. Variant) against Consensus Genotypes (Hom. Ref., Heterozygous, Hom. Variant, Uncertain), with mis-calls characterized by QUAL/Depth of Coverage, Strand Bias, ...]
Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
   – Homopolymers, STRs, duplications
   – Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
   – standard diagnostic accuracy measures not well posed
• Data from multiple platforms and library preparations
   – when characterizing a Reference Material
   – when assessing performance of a test platform
Genome-in-a-Bottle Consortium
• Genome-in-a-Bottle
   – www.genomeinabottle.org
       • newsletters, blogs, forums, announcements
   – new partners welcome!
   – targeting pilot reference material availability in 2013
   – working to identify best practice for consent of subject genome as a whole-genome reference material
• Developing genomic DNA reference materials for small number of microbial species
   – to enable performance assessment of sequencing platforms
   – range of GC
   – range of complexity
QUESTIONS?
Microbial Reference Material Considerations
• Variation in GC Content
   – Genomes with a range of GC to challenge platforms
   – Within genome variation to challenge analytical process to define mobile genetic and insertion elements
• Structural variations to challenge the ability to recognize
   – Repetitive sequences (e.g. palindromic repeats)
   – Homopolymers (>14 bases)
   – Insertion elements
   – Chromosomal rearrangements
   – SNP calls (e.g. variant silencing due to motifs)
• Reference data available on multiple platforms
• Pedigree/phylogeny of strains
• Phenotypic characterization
Interesting work on assessing performance for microbial sequencing
• Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance
   – ~20% - ~68% GC overall
   – Bordetella pertussis
       • 67.7 % GC, with some regions in excess of 90 % GC content
   – Salmonella Pullorum
       • 52 % GC
   – Staphylococcus aureus
       • 33 % GC
   – Plasmodium falciparum
       • 19.3 % GC, with some regions close to 0 % GC content
• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
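Overall GC content and its variation within a genome are straightforward to compute from sequence. The sketch below is illustrative only: the function names, the window size, and the toy sequence are assumptions made here, and a real genome would be read from a FASTA file rather than typed inline.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C; None if none."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, window=100, step=None):
    """GC fraction in successive windows, as (start, fraction) pairs."""
    step = step or window
    last_start = max(len(seq) - window, 0)
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


toy = "GGGGAAAA"  # hypothetical sequence: one GC-rich half, one AT-rich half
print(gc_fraction(toy))
print(windowed_gc(toy, window=4))
```

The toy sequence is 50% GC overall, yet its two windows sit at opposite extremes. That is the same pattern described for the genomes above, where regions of B. pertussis exceed 90% GC and regions of P. falciparum approach 0%.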

March 2013 NIST Reference Material Program and Data Integration

  • 1. NIST Program for Human Genome Reference Materials Marc Salit and Justin Zook NIST
  • 2. Some use cases for a well-characterized, stable RM • Obtain metrics for validation, QC, QA, PT • Determine sources and types of bias/error • Learn to resolve difficult structural variants • Improve reference genome assembly • Optimization – integration of data from multiple platforms – sequencing and analysis • Enable regulated applications [Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
  • 4. Measurement Process • gDNA reference materials will be developed to characterize performance of a part of process – materials will be certified for their variants against a reference sequence, with confidence estimates [Figure: generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis]
  • 5. Variants of Interest • SNPs (and larger polymorphisms) • Indels • Longer insertions/deletions • Inversions • Rearrangements • CNV (different lengths) – Deletions, tandem and dispersed dups – duplications with SNPs/indels • Mobile Element Insertions [Figure: sequence diagrams comparing a reference with inversion, insertion, and duplication examples]
  • 6. Putting “Genomes” in Bottles • NIST working with GiaB to select genomes • Current plan – NA12878 HapMap sample as Pilot sample • part of 17-member pedigree – trios from PGP as more complete set • 8 trios, focus on children • varying biogeographic ancestry [Figure: CEPH Utah Pedigree 1463, by row: 12889, 12890, 12891, 12892; 12877, 12878; 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
  • 7. Consenting Genomes for use as Reference Materials • Risk of re-identification – this is a real risk – privacy – implications for family members • Meaning of possibility of withdrawal • Commercial application – indirect, research – direct, derived products • PGP project currently state-of-art – broad and direct – test to demonstrate understanding • “Wild West”
  • 8. Characterization Methods Whole Genome Sequencing: • ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries) • Illumina • Complete Genomics – including LFR • Emerging technologies – Ion Proton – nanopore? • 3x replication of sequencing (3 library preps) • … Other: • Genotyping microarrays • Array CGH • Targeted sequencing • Fosmid sequencing? • Optical Mapping? [Figure: pedigree labeled Father, Mother, Husband, NA12878, Son, Daughter]
  • 9. Timeline Consortium Activity: • WG Telecons – Starting up in April – Info to be posted on www.genomeinabottle.org • schedules • agendas • summaries • Website forums – general and supporting each WG • Upcoming Workshops – Proposed 8/2013 • NIST, Gaithersburg, MD NIST RM Activity: • 80 mg gDNA for NA12878 expected @ NIST 4/2013 – 8000 samples – available for characterization within GiaB immediately – target for release as NIST RM 2/2014 • SNPs, small indels • PGP Samples coming • IRB Status – working to establish policy • looks good for release of NA12878 as pilot RM • PGP samples expected to gain approval
  • 10. Artificial Constructs • useful as spike-ins – QC on clinical samples • a panel of druggable targets in development at NCI – pDNA with a mutation insert • ‘barcoded’ adjacent to mutation of interest • large-scale constructs may be useful for SV and specific contexts • recapitulate “difficult” sequence contexts – simple sequence – duplications [Figure: sequence diagrams comparing a reference with inversion, insertion, and duplication examples]
  • 11. Microbial Genome RMs [Figure labels: Reference Samples; Extracted DNA; Sample Preparation; Sequencing; Bioinformatics; Variant List, Performance Metrics]
  • 12. DATA INTEGRATION: With multiple data sets, there is both an opportunity for integration and the question of just how to do it.
  • 13. Datasets • 9 whole genome – Illumina, CG, 454, SOLiD • 3 whole exome – Illumina, Ion Torrent
  • 14. Integration of Data to Form “Gold Standard” Genotype Calls
      – Candidate variants: Find all possible variant sites
      – Confident variants: Find highly confident sites across multiple datasets
      – Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
      – Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
      – Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics
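The Arbitration and Confidence Level steps on slide 14 can be illustrated with a minimal sketch. It is not the NIST pipeline: the `Call` record, the single `atypicality` score, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical stand-ins for the bias characteristics the slides describe.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # dataset label, e.g. "platform_A" (hypothetical)
    genotype: str       # "hom_ref", "het", or "hom_var"
    atypicality: float  # hypothetical bias score; higher = more atypical


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, supporting dataset names) for a single site."""
    if not calls:
        return "uncertain", []
    # Sort most typical first, so pop() removes the most atypical dataset.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Arbitration: drop datasets until the remaining genotypes agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    supporters = [c.dataset for c in remaining]
    # Confidence level: agreement alone is not enough; require that enough
    # of the agreeing datasets look typical (assumed thresholds).
    typical = [c for c in remaining if c.atypicality <= typical_cutoff]
    if len(typical) < min_typical:
        return "uncertain", supporters
    return remaining[0].genotype, supporters


site = [
    Call("platform_A", "het", 0.2),
    Call("platform_B", "het", 0.5),
    Call("platform_C", "hom_ref", 3.0),
]
print(arbitrate(site))  # ('het', ['platform_A', 'platform_B'])
```

Collapsing all bias characteristics into one score is a simplification made for brevity; a fuller version would presumably rank datasets per characteristic, as slide 15 suggests.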
  • 15. Characteristics of Sequence Data/Genotype associated with bias • Systematic sequencing errors – Strand bias – Base Quality Rank Sum Test • Local Alignment problems – Distance from end of read – Mean position within read – Read Position Rank Sum – HaplotypeScore • Mapping problems – Mapping Quality – Higher (or lower) than expected coverage – CNV – Length of aligned reads • Abnormal allele balance or Quality/Depth – Allele Balance – Quality/Depth – Mean length of aligned reads
  • 16. Example of Arbitration: SSE (systematic sequencing error) suspected from strand bias [Figure labels: Platform A; Platform B; Strand Bias (SNP overrepresented on reverse strands); Homopolymer]
  • 17. Performance Assessment of Genotype Calling • For our purposes, we consider three categories of genotype calls – homozygous reference – heterozygous – homozygous variant • by convention – Negative: homozygous reference – Positive: anything else • our approach looks at 3x3 matrix of call concordance • Fourth category: Uncertain Genotype – developing • Three performance assessments: – Individual dataset and Consensus calls against Omni SNP Array – Individual dataset against Omni SNP Array and Consensus – Individual dataset with two different genotype callers against Consensus
  • 18. Genotype Comparison Tables [Table template: rows are the Method being Assessed (Hom. Ref., Het., Hom. Var., Uncertain); columns are the Method as “Truth” (Hom. Ref., Heterozygous, Hom. Variant, Uncertain); several cells are marked “?”] * current state of research: only consensus process has “Uncertain” category
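A comparison table of the kind described on slides 17 and 18 can be tallied as follows. This is a sketch under stated assumptions: the category labels, the dictionary-per-site input format, and the choice to count a missing call as homozygous reference (mirroring the “Homozygous Reference/No Call” row used on later slides) are illustrative, not taken from the authors' code.

```python
from collections import Counter

# Row/column order follows the slides; the label strings are assumptions.
CATEGORIES = ["hom_ref", "het", "hom_var", "uncertain"]


def comparison_table(assessed, truth):
    """Rows: method being assessed. Columns: method taken as "truth"."""
    counts = Counter()
    for site, truth_call in truth.items():
        # A site with no call is counted as homozygous reference.
        assessed_call = assessed.get(site, "hom_ref")
        counts[(assessed_call, truth_call)] += 1
    return [[counts[(a, t)] for t in CATEGORIES] for a in CATEGORIES]


def false_negatives(table):
    # "Truth" is positive (het or hom_var) but the assessed call is hom_ref.
    return table[0][1] + table[0][2]


def false_positives(table):
    # "Truth" is hom_ref but the assessed call is positive (het or hom_var).
    return table[1][0] + table[2][0]


truth = {"s1": "het", "s2": "hom_var", "s3": "hom_ref", "s4": "het"}
assessed = {"s1": "het", "s2": "het", "s3": "het"}  # s4 has no call
table = comparison_table(assessed, truth)
for label, row in zip(CATEGORIES, table):
    print(f"{label:10s} {row}")
print("FN:", false_negatives(table), "FP:", false_positives(table))
```

Note that the het-versus-hom_var disagreement at `s2` counts as neither an FN nor an FP under the positive/negative convention, which is one reason the slides keep the full matrix rather than a 2x2 summary.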
  • 19. Consensus has lower FN rate than individual datasets
      HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns)
      Call | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
      Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A
      Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A
      Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A
      Integrated Consensus Genotypes (rows) vs. Illumina Omni SNP Array (columns)
      Call | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
      Homozygous Reference | 1.45M | 613 (0.09%) | 977 (0.15%) | N/A
      Heterozygous | 241 (0.04%) | 414k (61.5%) | 173 (0.03%) | N/A
      Homozygous Variant | 152 (0.02%) | 61 (0.01%) | 249k (36.9%) | N/A
      Uncertain | 5458 (0.81%) | 3421 (0.51%) | 4808 (0.71%) | N/A
      “FNs” marks the Homozygous Reference row; “FPs*” marks the Heterozygous and Homozygous Variant rows.
      * Note that most or all of the putative FPs seem to actually be FNs on the microarray
  • 20. SNP arrays overestimate performance
      HiSeq – GATK (rows) vs. Illumina Omni SNP Array (columns)
      Call | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
      Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A
      Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A
      Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A
      HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns)
      Call | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
      Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M
      Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%)
      Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%)
      “FNs” marks the Homozygous Reference/No Call row; “FPs” marks the Heterozygous and Homozygous Variant rows.
  • 21. Samtools has higher FP and lower FN than GATK
      HiSeq – samtools (rows) vs. Integrated Consensus Genotypes (columns)
      Call | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
      Homozygous Reference/No Call | 1.51M | 49.6k (1.47%) | 6.74k (0.20%) | 3.93M
      Heterozygous | 3141 (0.09%) | 2.00M (59.6%) | 74 (0.00%) | 175k (5.19%)
      Homozygous Variant | 21 (0.00%) | 777 (0.02%) | 1.21M (36.0%) | 192k (5.71%)
      HiSeq – GATK (rows) vs. Integrated Consensus Genotypes (columns)
      Call | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
      Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M
      Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%)
      Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%)
      “FNs” marks the Homozygous Reference/No Call row; “FPs” marks the Heterozygous and Homozygous Variant rows.
  • 22. Performance Metrics: Characteristics of Mis-calls [Figure: grid of HiSeq/GATK calls (Hom. Ref./No call, Heterozygous, Hom. Variant) against Consensus Genotypes (Hom. Ref., Heterozygous, Hom. Variant, Uncertain), labeled with QUAL/Depth of Coverage, Strand Bias, ...]
  • 23. Challenges with assessing performance • All variant types are not equal • Nearby variants are often difficult to align • All regions of the genome are not equal – Homopolymers, STRs, duplications – Can be similar or different in different genomes • Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance • Genotypes fall in 3+ categories (not positive/negative) – standard diagnostic accuracy measures not well posed • Data from multiple platforms and library preparations – when characterizing a Reference Material – when assessing performance of a test platform
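Slide 23's point that labeling difficult variants as uncertain raises apparent accuracy can be shown with a small worked example. The site counts below are invented purely for illustration and do not come from the presentation.

```python
def apparent_accuracy(sites, exclude_uncertain):
    """Fraction of scored sites called correctly.

    When exclude_uncertain is True, sites flagged as uncertain in the
    benchmark are left out of both numerator and denominator.
    """
    scored = [s for s in sites if not (exclude_uncertain and s["uncertain"])]
    if not scored:
        raise ValueError("no sites left to score")
    return sum(s["correct"] for s in scored) / len(scored)


# Illustrative numbers only: 90 easy sites, all called correctly, plus
# 10 difficult sites flagged uncertain, of which half are called correctly.
sites = (
    [{"correct": True, "uncertain": False}] * 90
    + [{"correct": True, "uncertain": True}] * 5
    + [{"correct": False, "uncertain": True}] * 5
)

print(apparent_accuracy(sites, exclude_uncertain=False))  # 0.95
print(apparent_accuracy(sites, exclude_uncertain=True))   # 1.0
```

The method under test has not changed between the two lines; only the denominator has, which is why the fraction of sites labeled uncertain should be reported alongside any accuracy figure.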
  • 24. Genome-in-a-Bottle Consortium • Genome-in-a-Bottle – www.genomeinabottle.org • newsletters, blogs, forums, announcements – new partners welcome! – targeting pilot reference material availability in 2013 – working to identify best practice for consent of subject genome as a whole-genome reference material • Developing genomic DNA reference materials for small number of microbial species – to enable performance assessment of sequencing platforms – range of GC – range of complexity
  • 26. Microbial Reference Material Considerations • Variation in GC Content – Genomes with a range of GC to challenge platforms – Within genome variation to challenge analytical process to define mobile genetic and insertion elements • Structural variations to challenge the ability to recognize – Repetitive sequences (e.g. palindromic repeats) – Homopolymers (>14 bases) – Insertion elements – Chromosomal rearrangements – SNP calls (e.g. variant silencing due to motifs) • Reference data available on multiple platforms • Pedigree/phylogeny of strains • Phenotypic characterization
  • 27. Interesting work on assessing performance for microbial sequencing • Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance – ~20% - ~68% GC overall – Bordetella pertussis • 67.7 % GC, with some regions in excess of 90 % GC content – Salmonella Pullorum • 52 % GC – Staphylococcus aureus • 33 % GC – Plasmodium falciparum • 19.3 % GC, with some regions close to 0 % GC content • “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.” Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
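The GC figures quoted on slide 27, both genome-wide and for extreme local regions, are simple to compute from sequence. The sketch below is illustrative only: the function names, the non-overlapping default window, and the toy sequence are assumptions, and a real genome would be read from a FASTA file.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / (gc + at)


def windowed_gc(seq, window, step=None):
    """Return (start, GC fraction) for each full window along seq.

    Windows made up entirely of ambiguous bases (e.g. N) are skipped.
    """
    step = step or window  # non-overlapping windows by default
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window].upper()
        if any(base in chunk for base in "ACGT"):
            results.append((start, gc_content(chunk)))
    return results


toy = "GGGGAAAA"  # toy sequence, not real genomic data
print(gc_content(toy))      # 0.5
print(windowed_gc(toy, 4))  # [(0, 1.0), (4, 0.0)]
```

The toy sequence has a moderate overall GC fraction while its two windows sit at the extremes, which is the same contrast the slide draws between genome-wide GC and local regions near 0 % or above 90 %.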

Editor's Notes

  1. focus on the diagnostic power of these characteristic analyses; useful for identifying problems and optimizing, as well as to identify characteristics that are more prone to mis-calls