Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

    Presentation Transcript

    • Striving for Perfection: The Platinum Genomes Project
        Elliott H. Margulies, Ph.D., Director, Scientific Research
        COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE
        © 2011 Illumina, Inc. All rights reserved. Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
    • From Sample to Answer
        Workflow: Sample, Sequence, Analyse, Annotate, Interpret, Answer
        Enabling clinical use of WGS
        • Fast sequencing from low-input and FFPE samples
        • Improved accuracy and utility of detected variants
        • Integrated “push button” analyses – from sequence to annotated variants
        • Focus on genome exploration
    • The truth is hard to find…
        • Sequencing the same genome twice does not give you the identical answer
          [Figure: variants called the first time vs. the second time]
        • We identify many more Mendelian conflicts than actually exist
          [Figure: trio with parental genotypes A/A and T/T (Dad, Mom) and child T/T]
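A Mendelian conflict, as used on this slide, is a child genotype that cannot be assembled from one paternal and one maternal allele. The check below is a minimal sketch for autosomal, diploid genotypes written as `"A/T"` strings; the function name and the string format are assumptions made for illustration, not the tooling used in the talk. The trio genotypes are read from the figure residue as Dad A/A, Mom T/T, Child T/T.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be formed by taking one
    allele from the father and one from the mother (autosomal, diploid)."""
    child_alleles = sorted(child.split("/"))
    for paternal, maternal in product(father.split("/"), mother.split("/")):
        # Compare as unordered pairs: "A/T" and "T/A" are the same genotype.
        if sorted((paternal, maternal)) == child_alleles:
            return True
    return False


# Trio as read from the slide figure (assumed: Dad A/A, Mom T/T, Child T/T).
print(is_mendelian_consistent("T/T", "A/A", "T/T"))  # False: a conflict
# A heterozygous child would be consistent with the same parents.
print(is_mendelian_consistent("A/T", "A/A", "T/T"))  # True
```

Because a true violation of inheritance is rare, most conflicts found this way point to a genotyping error in one of the three samples, which is why the deck uses the conflict count as an accuracy measure.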
    • Summary of increased accuracy
        Eland+CASAVA
        Filter              Sensitivity   Mendelian conflicts   Accuracy
        unfiltered          96.62         13,032                99.9995%
        + gVCF filters      96.10          8,383                99.9997%
        + score:coverage    95.25          5,309                99.9998%
        Net effect of filtering: 1.43% loss in sensitivity, 59.26% loss in conflicts

        Method              Sensitivity   Conflicts             Accuracy
        BWA+MPG*            95.90         4,928                 99.9998%

        NB: Accuracy is expressed here as % of total filtered calls that are Mendelian concordant
        * Accurate and comprehensive sequencing of personal genomes. S.S. Ajay, S.C.J. Parker, H. Ozel Abaan, Karin V. Fuentes Fajardo, and E.H. Margulies. Genome Res. 2011 21: 1498-1505
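The two "loss" figures on this slide are relative changes between the unfiltered row and the fully filtered row. The arithmetic can be sketched as follows; `relative_loss` is a hypothetical helper name, and the inputs are the rounded values printed in the table.

```python
def relative_loss(before, after):
    """Percentage drop from `before` to `after`, relative to `before`."""
    return 100.0 * (before - after) / before


# Unfiltered vs. "+ score:coverage" rows of the Eland+CASAVA table.
conflict_loss = relative_loss(13032, 5309)
sensitivity_loss = relative_loss(96.62, 95.25)

print(f"{conflict_loss:.2f}% fewer Mendelian conflicts")   # 59.26%
print(f"{sensitivity_loss:.2f}% lower sensitivity")        # 1.42%
```

The conflict figure reproduces the slide's 59.26%. The sensitivity figure comes out at about 1.42% from the rounded table values, against the 1.43% stated on the slide.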
    • A critical assessment of whole-genome sequencing…
        • Where are we doing well?
        • What parts of the genome are still inaccessible or less accurately called – and most importantly, why?
        GOALS:
        • Maximum utility for use in research and medical applications
        • Determine key areas for improvement and assess progress
        • Assess performance in real-life situations
    • Platinum genomes: the proposal
        • Select a small set of well-known and accessible genomes
        • Generate initial WGS datasets using best current practices
        • Make it freely available in a database by "open source" principles
        • Perform analyses to define high and low quality regions and variant calls
        • Examine low quality regions and calls and validate with additional evidence (methods)
        • Maintain a database with revised data and evidence to provide a long term benchmark
        • Develop improved methods (analysis, chemistry, sample prep)
    • CEPH/Utah Pedigree 1463
        [Pedigree figure, sample IDs by row: 12889, 12890, 12891, 12892; 12877, 12878; 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
        • Three generation family, extensively sequenced by the genomics community
        • Focus on the trio shaded in gray (12877, 12878 and 12882)
        • Sourced ~200µg for the initial trio (shaded) and ~50µg for all others
    • Initial dataset
        Sample     Depth (x)   Q30 (%)   Genotype coverage (%)   Genotype concordance (%)
        NA12877    219.63      91.3      99.79                   99.25
        NA12878    211.88      93.6      99.8                    99.25
        NA12882    217.95      93.2      99.8                    99.24
        NA12881    46.67       91.7      99.84                   99.28
        NA12880    48.37       91.4      99.74                   99.28
        NA12879    48.01       92        99.75                   99.29
        NA12883    54.73       94.2      99.6                    99.27
        NA12884    43.76       93.2      99.7                    99.27
        NA12885    54.56       94        99.8                    99.28
        NA12886    64.98       91        99.8                    99.28
        NA12887    48.33       92.4      99.81                   99.29
        NA12888    47.61       92.2      99.81                   99.28
        NA12889    49.99       91        99.49                   99.28
        NA12890    59.34       88        99.8                    99.29
        NA12891    45.49       93        99.75                   99.28
        NA12892    50.32       93.4      99.67                   99.29
        NA12893    47.69       92.7      99.79                   99.28
        Slide label: "Technical Replicate" (placed next to the NA12882 row)
    • NA12882
        [Figure: Technical Replicate A and Technical Replicate B. Builds shown: two at 200x (18 lanes each), four at 100x (8 lanes each), eight at 50x]
        • Callability and reproducibility among pairs of replicates
            – 50x vs 100x vs 200x
            – Between technical replicates
    • Pair-wise comparisons of genome builds
        Concordance at variant positions where both genomes PASSed basic quality filters
        Coverage   Library     SNPs     Indels   Combined
        50x        different   99.34%   90.94%   98.52%
        50x        same        99.36%   90.83%   98.52%
        100x       different   99.47%   90.60%   98.57%
        100x       same        99.47%   90.54%   98.56%
        200x       different   99.53%   90.23%   98.55%
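The concordance figures in this table compare two builds of the same genome at variant positions where both builds passed filtering. A minimal sketch of that calculation is shown below. The data layout (a dict from position to genotype string, containing only PASSing calls), the `"0/0"` reference encoding, and the rule that a site counts as variant when either build is non-reference are all assumptions for illustration; the slide does not spell out these details.

```python
def pairwise_concordance(build_a, build_b, ref="0/0"):
    """Percent of shared variant positions with identical genotypes.

    build_a, build_b: dict mapping position -> genotype string, holding
    only calls that passed quality filters (assumed layout).
    Returns None when the builds share no variant positions.
    """
    # Positions called (PASS) in both builds.
    shared = build_a.keys() & build_b.keys()
    # Assumption: a position is "variant" if either build is non-reference.
    variant_sites = [p for p in shared if build_a[p] != ref or build_b[p] != ref]
    if not variant_sites:
        return None
    same = sum(build_a[p] == build_b[p] for p in variant_sites)
    return 100.0 * same / len(variant_sites)


# Toy example with made-up positions and genotypes.
a = {1: "0/1", 2: "1/1", 3: "0/0", 4: "0/1"}
b = {1: "0/1", 2: "0/1", 3: "0/0", 5: "1/1"}
print(pairwise_concordance(a, b))  # 50.0
```

In the toy data, positions 1, 2 and 3 are shared, positions 1 and 2 are variant, and only position 1 agrees. Positions 4 and 5 are ignored because they pass in only one build, which is why a high concordance value says nothing about how much of the genome was callable.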
    • NA12882
        [Figure: Technical Replicate A and Technical Replicate B. Builds shown: two at 200x (18 lanes each), four at 100x (8 lanes each), eight at 50x]
        • Consistency across all the replicates
            – How many replicates were able to be called at a given position?
            – How many different genotypes were present at that position?
    • Consistency among technical replicates
        Rows (Reps): number of replicates PASSing genotype quality filter. Columns: number of different genotypes. "-" marks an empty cell.
        Reps        0        1        2        3        4        5        6        7        8        9       10       11       12       13       14
           0     1.96
           1        -     0.23
           2        -     0.21   0.0005
           3        -     0.18   0.0006  3.5E-05
           4        -     0.16   0.0007  4.2E-05  8.7E-06
           5        -     0.15   0.0007  4.5E-05  1.3E-05  3.5E-06
           6        -     0.15   0.0008  4.6E-05  1.6E-05  6.1E-06  1.4E-06
           7        -     0.16   0.0008  4.9E-05  1.8E-05  8.8E-06  3.0E-06  8.2E-07
           8        -     0.16   0.0007  5.5E-05  1.9E-05  9.0E-06  4.3E-06  1.9E-06  4.1E-07
           9        -     0.17   0.0007  5.6E-05  2.0E-05  1.1E-05  5.2E-06  2.5E-06  1.4E-06  3.7E-07
          10        -     0.20   0.0006  6.1E-05  2.1E-05  1.1E-05  7.4E-06  3.8E-06  1.9E-06  7.1E-07  1.9E-07
          11        -     0.24   0.0006  6.9E-05  2.6E-05  1.4E-05  9.4E-06  6.4E-06  3.7E-06  1.5E-06  3.7E-07  7.4E-08
          12        -     0.32   0.0007  8.5E-05  3.2E-05  1.9E-05  1.2E-05  8.6E-06  5.5E-06  2.8E-06  1.3E-06  4.8E-07  7.4E-08
          13        -     0.61   0.0010  1.2E-04  4.3E-05  2.8E-05  1.9E-05  1.5E-05  1.1E-05  7.4E-06  4.6E-06  2.0E-06  6.7E-07  2.2E-07
          14        -    95.07   0.0025  2.3E-04  8.6E-05  5.3E-05  4.0E-05  3.6E-05  3.3E-05  3.0E-05  2.3E-05  1.4E-05  7.6E-06  2.1E-06  6.0E-07

        “Metal”   Genome   SNVs from a 50x build
        Gold      95.1%    94.80%   3,030,777
        Silver    2.95%    4.15%      132,579
        Copper    0.01%    1.05%       33,679
        Lead      1.96%
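The four "metal" classes appear to follow directly from the replicate table: the Gold percentage matches the cell for 14 passing replicates with one genotype, Lead matches the cell for zero passing replicates, the one-genotype cells for 1 to 13 replicates add up to roughly the Silver figure, and a later slide describes Copper as positions with more than one genotype. The sketch below encodes that reading. The definitions are inferred from the numbers rather than stated on the slide, and the function name and input layout are hypothetical.

```python
def classify_position(calls, n_replicates=14):
    """Assign a "metal" class to one genomic position.

    calls: one entry per replicate build; a genotype string if the call
    passed the genotype quality filter, or None if it did not.
    Class definitions are inferred from the slide's table (assumption).
    """
    if len(calls) != n_replicates:
        raise ValueError("expected one call per replicate")
    passing = [g for g in calls if g is not None]
    if not passing:
        return "lead"      # no replicate could be called
    if len(set(passing)) > 1:
        return "copper"    # replicates disagree on the genotype
    if len(passing) == n_replicates:
        return "gold"      # every replicate called, all agree
    return "silver"        # some replicates missing, the rest agree


print(classify_position(["0/1"] * 14))                # gold
print(classify_position(["0/1"] * 10 + [None] * 4))   # silver
print(classify_position(["0/1"] * 13 + ["1/1"]))      # copper
print(classify_position([None] * 14))                 # lead
```

Checking for disagreement before checking completeness matters: a position with 14 passing calls and two genotypes belongs to Copper, not Gold.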
    • Genomic features overlapping with “metal” regions
        Metal    Genome   SNVs     CDS      medCDS
        gold     95.07%   94.80%   96.91%   97.87%
        silver   2.95%    4.15%    1.35%    1.11%
        copper   0.01%    1.05%    0.003%   0.002%
        lead     1.96%    0.00%    1.74%    1.02%
    • A closer examination of “Copper” regions: those that had more than one genotype
        86% of copper regions had just two different genotypes
        Type of inconsistency   Percentage
        REF / het SNV           37.40
        REF / het DEL           21.89
        REF / het INS           15.11
        het SNV / hom SNV        5.38
        het DEL / hom DEL        0.42
        het INS / hom INS        1.43
        Remaining               18.38
    • Concordance in “metal” regions
        SNP concordance from two builds generated from different libraries
                 50x      100x     200x
        ALL      99.34%   99.47%   99.53%
        Gold     99.80%   99.94%   99.94%
        Silver   85.00%   89.81%   93.80%
        Copper   53.85%   67.85%   82.12%
        Lead*    519      6,589    22,164
        * Absolute values more revealing
        Non-gold regions of the genome point to areas that are not comprehensively/accurately assessed
    • Concordance in “metal” regions
        Concordance of variants between two 100x builds from the same library
                  SNPs     Indels   Both
        Overall   99.47%   90.54%   98.56%
        Gold      99.92%   96.77%   99.65%
        Silver    90.65%   68.18%   86.32%
        Copper    77.13%   57.11%   61.00%
        Lead      73.44%   74.73%   73.88%
        Indels need more attention
    • Practical/Clinical/Medical Relevance
        200x build comparison in medically-relevant CDS regions
        Metal      ALL     Same    Different   Percent the Same   Percent in Metal
        Combined   1,187   1,182   5           99.58%
        Gold       1,151   1,151   0           100.00%            96.97%
        Silver     29      26      3           89.66%             2.44%
        Copper     2       2       0           100.00%            0.17%
        Lead       5       3       2           60.00%             0.42%
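Both percentage columns in this table can be recomputed from the ALL and Same counts: "Percent the Same" is Same divided by ALL within a row, and "Percent in Metal" is a row's ALL count divided by the combined total. The short sketch below reproduces the slide's figures; the variable and function names are illustrative only.

```python
# Counts from the slide: metal -> (ALL, Same).
COUNTS = {
    "Gold": (1151, 1151),
    "Silver": (29, 26),
    "Copper": (2, 2),
    "Lead": (5, 3),
}


def summarize(counts):
    """Return (total, combined percent same, per-metal percentages)."""
    total = sum(n_all for n_all, _ in counts.values())
    total_same = sum(n_same for _, n_same in counts.values())
    table = {}
    for metal, (n_all, n_same) in counts.items():
        pct_same = round(100 * n_same / n_all, 2)     # agreement within the class
        pct_in_metal = round(100 * n_all / total, 2)  # share of all compared calls
        table[metal] = (pct_same, pct_in_metal)
    return total, round(100 * total_same / total, 2), table


total, combined_same, table = summarize(COUNTS)
print(total, combined_same)
for metal, (pct_same, pct_in_metal) in table.items():
    print(f"{metal:<7}{pct_same:>8.2f}{pct_in_metal:>8.2f}")
```

Seen this way, all five differences fall outside Gold, even though Silver, Copper and Lead together hold only about 3% of the compared calls.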
    • Future Plans
        • Classify inconsistent parts of the genome into:
            – Alignment or read length issues
                § Paralogous/repetitive/CNV regions
                § Missed or wrong indel calls
            – Depth of coverage
            – Platform-specific artifacts
        • Disseminate data/analyses to the research community
        • Platform for developing better indel detection
        • Error correction via haplotyping efforts
        • Independent validation efforts
        • Develop a database of variants and associated evidence
    • Acknowledgements
        David Bentley, Klaus Maisinger, Sean Humphray, Russell Grocock, Mark Ross, Peter Saffrey, Nick Kerry, Brad Sickler, Nondas Fritzilas, Pedro Cruz, Phil Tedder, Shankar Ajay, Mike Eberle, Marc Laurant, Lisa Murray, Semyon Kruglyak
    • END
    • [Slide: first page of the Genome Research article, with a table overlaid]
        Overlaid table (50x vs 50x):
        Filter                                  hg19 callable in both   Discordant
        No extra filters                        98.33%                  46,580
        With alignment and genotype filters     93.13%                  1,673
        No q20 evidence (MapQ1)                                         267

        Article page:
        Accurate and comprehensive sequencing of personal genomes
        Subramanian S. Ajay,1 Stephen C.J. Parker,1 Hatice Ozel Abaan,1 Karin V. Fuentes Fajardo,2 and Elliott H. Margulies1,3,4
        1 Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; 2 Undiagnosed Diseases Program, Office of the Clinical Director, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
        Genome Res. published online July 19, 2011. doi:10.1101/gr.123638.111. Downloaded from genome.cshlp.org on July 20, 2011. Published by Cold Spring Harbor Laboratory Press. Supplemental Material: http://genome.cshlp.org/content/suppl/2011/06

        As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ~30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAIIx and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a ‘‘sequencing guide’’ for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported. [Supplemental material is available for this article.]

        Whole-genome sequencing and analysis is becoming part of a translational research toolkit (Lupski et al. 2010; Sobreira et al. 2010) to investigate small-scale changes such as single-nucleotide variants (SNVs) and indels (Bentley et al. 2008; Wang et al. 2008; Kim et al. 2009; McKernan et al. 2009; Fujimoto et al. 2010; Lee et al. 2010; Pleasance et al. 2010) in addition to large-scale events such as chromosomal rearrangements (Campbell et al. 2008; Chen et al. 2008) and copy-number variation (Chiang et al. 2009; Park et al. 2010). For both basic genome biology and clinical diagnostics, the trade-offs of data quality and quantity will determine what constitutes a ‘‘comprehensive and accurate’’ whole-genome analysis, especially for detecting SNVs. As whole-genome sequencing becomes commoditized, it will be important to determine quantitative metrics to assess and describe the comprehensiveness of an individual’s genome sequence. No such standards currently exist.

        … a question that is extremely important as whole-genome sequencing and analysis of individual genomes transitions from primarily research-based projects to being used for clinical and diagnostic applications. Additionally, we seek to understand the relationship between the amount of sequence data generated and the resulting proportion of the genome where confident genotypes can be derived; we refer to this as the ‘‘callable’’ portion, a term that is roughly equivalent to the 1000 Genomes Project’s ‘‘accessible’’ portion. Using these sequencing metrics and genotype-calling filters will help obviate the need for costly and time-consuming validation efforts. Currently, no empirically derived data sets exist for determining how much sequence data is needed to enable accurate detection of SNVs. To address this issue, we sequenced a blood sample from a male individual with an undiagnosed clinical condition on two related platforms (Illumina’s GAIIx and HiSeq 2000) to a total of