NIST program to develop genomic reference materials
 


    Presentation Transcript

    • NIST Program to Develop Genomic Reference Materials
      Justin Zook and Marc Salit
    • Scope of NIST work
      • Human Whole Genome RMs
      • Synthetic DNA constructs
      • Microbial Whole Genome RMs
    • RM Development Process
      1. Select and procure materials
      2. Characterize materials
      3. Process and integrate data from multiple platforms
      4. Confirm selected genotypes
      5. Write Report of Analysis
      6. Develop methods for end users to obtain performance metrics from the materials
    • Proposed Timeline for Human RMs
    • Proposed Timeline for Synthetic Structures
      [Gantt chart; columns: Title, Effort; time axis 2011 to 2015]
      1) Human RMs: 535w
        1.1) Select/Procure human DNA for RM: 32w
        1.2) **NIST receives packaged DNA for RM/SRM
        1.3) Develop bioinformatics pipeline for data integration: 97w
        1.4) Human Primary Sequencing: 147w
        1.5) Human Homogeneity assessment: 8w
        1.6) Analyze homogeneity data and produce preliminary SNP calls for RM: 10w
        1.7) Write human RM Report of Analysis: 10w
        1.8) Process Human RM for release: 24w
        1.9) **Human RM officially released
        1.10) Human Sequencing data integration: 25w
        1.11) Human Validation: 20w
        1.12) Human other characterization methods: 48w
        1.13) Analyze validation data and refine sequencing calls: 12w
        1.14) Develop pipeline for SVs and test: 40w
        1.15) Write Human SRM Report of Analysis: 8w
        1.16) Process Human SRM for release: 24w
        1.17) **Human SRM officially released
        1.18) Procure local data storage: 10w
        1.19) Procure Bioinformatics data analysis tools: 10w
        1.20) Procure Automated sample prep instrumentation: 10w
      2) Microbial RMs: 279w
        2.1) Select/Procure microbial DNA for RMs: 31w
        2.2) Microbial Primary Sequencing: 124w
        2.3) Microbial Homogeneity assessment: 6w
        2.4) Microbial Sequencing data integration: 40w
          2.4.1) Mapping/Alignment: 10w
          2.4.2) Variant calling: 12w
          2.4.3) Form consensus variant calls: 12w
    • Proposed Characterization Methods for Whole Genomes
      Whole Genome Sequencing
      • ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
      • Illumina
      • Complete Genomics
      • Upcoming technologies?
        – Ion Proton?
        – Oxford Nanopore?
      Other
      • Genotyping microarrays
      • Array CGH
      • Targeted sequencing
      • Fosmid sequencing?
      • Optical Mapping?
      • 3x replication of sequencing (3 library preps)
      [Figure: pedigree with Father, Mother, Husband, NA12878, Son, Daughter]
    • Integration of Existing Data to Form Consensus Genotype Calls
      1. Find all possible variant sites
      2. Find sites where all datasets agree
      3. Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
      4. For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
      5. Even if all datasets agree, identify them as uncertain if few have typical characteristics
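The integration steps above can be sketched for a single site as follows. This is only an illustration under stated assumptions, not the NIST pipeline: the `Call` record, the single `atypicality` score, and the `typical_cutoff` and `min_typical` thresholds are hypothetical stand-ins for whatever characteristics (for example strand bias or mapping quality) a real integration would use.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label
    genotype: str       # "hom_ref", "het", or "hom_var"
    atypicality: float  # hypothetical score; higher means more atypical


def consensus_genotype(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one site.

    status is "confident" or "uncertain". Thresholds are illustrative.
    """
    if not calls:
        return "no_call", "uncertain"
    # Order calls from most typical to most atypical.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Remove the most atypical dataset until the remaining datasets agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    genotype = remaining[0].genotype
    # Even when datasets agree, flag the site if few of them look typical.
    n_typical = sum(c.atypicality <= typical_cutoff for c in remaining)
    status = "confident" if n_typical >= min_typical else "uncertain"
    return genotype, status


site = [
    Call("dataset_a", "het", 0.1),
    Call("dataset_b", "het", 0.4),
    Call("dataset_c", "hom_var", 3.2),  # atypical dataset that disagrees
]
print(consensus_genotype(site))
```

Sorting once and popping from the end keeps the removal order explicit: the dataset with the most atypical characteristics is always dropped first.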
    • Consensus has lower FN rate than individual datasets
      Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.
                                       Homozygous Reference   Heterozygous     Homozygous Variant   Uncertain
        Homozygous Reference/No Call   1.45M                  7.24k (1.34%)    5.28k (0.65%)        N/A
        Heterozygous                   196 (0.03%)            411k (60.7%)     133 (0.02%)          N/A
        Homozygous Variant             154 (0.02%)            150 (0.02%)      249k (37.0%)         N/A
      Rows: Integrated Consensus Genotypes. Columns: Illumina Omni SNP Array.
                                       Homozygous Reference   Heterozygous     Homozygous Variant   Uncertain
        Homozygous Reference/No Call   1.45M                  613 (0.09%)      977 (0.15%)          N/A
        Heterozygous                   241 (0.04%)            414k (61.5%)     173 (0.03%)          N/A
        Homozygous Variant             152 (0.02%)            61 (0.01%)       249k (36.9%)         N/A
        Uncertain                      5458 (0.81%)           3421 (0.51%)     4808 (0.71%)         N/A
      Slide labels in both tables: "FNs" on the Homozygous Reference/No Call row, "FPs*" on the Homozygous Reference column.
      * Note that most or all of the putative FPs seem to actually be FNs on the microarray
    • SNP arrays overestimate performance
      Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.
                                       Homozygous Reference   Heterozygous     Homozygous Variant   Uncertain
        Homozygous Reference/No Call   1.45M                  7.24k (1.34%)    5.28k (0.65%)        N/A
        Heterozygous                   196 (0.03%)            411k (60.7%)     133 (0.02%)          N/A
        Homozygous Variant             154 (0.02%)            150 (0.02%)      249k (37.0%)         N/A
      Slide labels: "FNs" on the Homozygous Reference/No Call row, "FPs*" on the Homozygous Reference column.
      Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.
                                       Homozygous Reference   Heterozygous     Homozygous Variant   Uncertain
        Homozygous Reference/No Call   1.52M                  157k (4.68%)     30.3k (0.90%)        4.17M
        Heterozygous                   47 (0.00%)             1.90M (56.4%)    34 (0.00%)           16.9k (0.50%)
        Homozygous Variant             1 (0.00%)              298 (0.01%)      1.19M (35.3%)        73.3k (2.18%)
      Slide labels: "FNs" on the Homozygous Reference/No Call row, "FPs" on the Homozygous Reference column.
    • Samtools has higher FP and lower FN than GATK
      Rows: HiSeq – samtools. Columns: Integrated Consensus Genotypes.
                                       Homozygous Reference   Heterozygous     Homozygous Variant   Uncertain
        Homozygous Reference/No Call   1.51M                  49.6k (1.47%)    6.74k (0.20%)        3.93M
        Heterozygous                   3141 (0.09%)           2.00M (59.6%)    74 (0.00%)           175k (5.19%)
        Homozygous Variant             21 (0.00%)             777 (0.02%)      1.21M (36.0%)        192k (5.71%)
      Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.
                                       Homozygous Reference   Heterozygous     Homozygous Variant   Uncertain
        Homozygous Reference/No Call   1.52M                  157k (4.68%)     30.3k (0.90%)        4.17M
        Heterozygous                   47 (0.00%)             1.90M (56.4%)    34 (0.00%)           16.9k (0.50%)
        Homozygous Variant             1 (0.00%)              298 (0.01%)      1.19M (35.3%)        73.3k (2.18%)
      Slide labels in both tables: "FNs" on the Homozygous Reference/No Call row, "FPs" on the Homozygous Reference column.
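Concordance tables like those on the preceding slides can be tallied from two sets of genotype calls. The sketch below is a minimal illustration, assuming each call set is a plain dictionary from a site identifier to a genotype label; the label strings, the function names, and the rule that a site missing from the test set counts as "no call" are assumptions made here, not details taken from the slides.

```python
from collections import Counter

NO_CALL = "hom_ref_or_no_call"  # hypothetical label for the combined row


def concordance_table(test_calls, reference_calls):
    """Count sites by (test genotype, reference genotype).

    Only sites present in reference_calls are tallied; a site missing
    from test_calls is treated as homozygous reference / no call.
    """
    table = Counter()
    for site, ref_gt in reference_calls.items():
        test_gt = test_calls.get(site, NO_CALL)
        table[(test_gt, ref_gt)] += 1
    return table


def fn_fp_counts(table):
    """Putative FNs: variant in reference, not called in test.
    Putative FPs: variant in test, homozygous reference in reference."""
    fn = sum(table[(NO_CALL, col)] for col in ("het", "hom_var"))
    fp = sum(table[(row, "hom_ref")] for row in ("het", "hom_var"))
    return fn, fp


# Toy data, invented for illustration only.
reference = {"s1": "het", "s2": "het", "s3": "hom_var",
             "s4": "hom_ref", "s5": "uncertain"}
test = {"s1": "het", "s3": "hom_var", "s4": "het", "s5": "het"}

table = concordance_table(test, reference)
print(fn_fp_counts(table))
```

Sites the reference marks "uncertain" land in their own column and are counted as neither FNs nor FPs, which mirrors the Uncertain column in the tables.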
    • Performance Metrics: Characteristics of Mis-calls
      [Figure: HiSeq/GATK genotypes (Hom. Ref./No call, Heterozygous, Hom. Variant) versus Consensus Genotypes (Hom. Ref., Heterozygous, Hom. Variant, Uncertain); characteristics listed: QUAL/Depth of Coverage, Strand Bias, . . .]
    • Challenges with assessing performance
      • All variant types are not equal
      • Nearby variants are often difficult to align
      • All regions of the genome are not equal
        – Homopolymers, STRs, duplications
        – Can be similar or different in different genomes
      • Labeling difficult variants as “uncertain” in the Reference Material leads to higher apparent accuracy when assessing performance
      • Genotypes fall in 3+ categories (not positive/negative)
      • It’s important to consider data from multiple platforms and library preparations when characterizing a Reference Material
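The point about "uncertain" labels raising apparent accuracy can be made concrete with a small calculation. All counts below are invented for illustration and do not come from the slides; the split into "easy" and "hard" sites is likewise a hypothetical simplification.

```python
def fn_rate(false_negatives, true_positives):
    """Fraction of true variant sites that the test call set missed."""
    return false_negatives / (false_negatives + true_positives)


# Hypothetical counts: "hard" sites (e.g. homopolymers) are missed more often.
easy_tp, easy_fn = 9_800, 200
hard_tp, hard_fn = 600, 400

# Assessed against every true variant site.
rate_all_sites = fn_rate(easy_fn + hard_fn, easy_tp + hard_tp)

# Assessed only where the reference is confident, assuming the hard
# sites were all labeled "uncertain" and therefore excluded.
rate_confident_only = fn_rate(easy_fn, easy_tp)

print(f"all sites:      {rate_all_sites:.2%}")       # 5.45%
print(f"confident only: {rate_confident_only:.2%}")  # 2.00%
```

Under these made-up numbers the apparent FN rate drops from 5.45% to 2.00% simply because the difficult sites were excluded from the assessment.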