The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule

Published in: Technology, Health & Medicine
  1. The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule
     Justin H. Johnson, Director of Bioinformatics, EdgeBio, Washington DC, USA
  2. Agenda
     • Who We Are
     • NGS at 30K
     • The Challenges
       – Even Before We Get to the Platforms
       – When We Get to the Platforms
  3. Who We Are
  4. Life Tech Service Provider
  5. Contract Research Division
     • Five SOLiD4 sequencing platforms
     • One Life Technologies 5500XL
     • Two Ion Torrent PGMs
     • Bioinformatics consulting on Illumina, 454, and PacBio
     • Automation thru Caliper Sciclone & Biomek FX
     • Commercial partnerships with companies such as CLCBio, DNANexus and Genologics
     • MD/PhD & Masters Level Scientists and Bioinformaticians
     • IT Infrastructure of >100 CPUs and >100TB storage
  6. Edge BioServ Scientific Advisory Board
     • Elaine Mardis, Ph.D.: Co-Director, Genome Sequencing Center, Washington University School of Medicine
     • Steven Salzberg, Ph.D.: Director, Center for Bioinformatics and Computational Biology, University of Maryland
     • Sam Levy, Ph.D.: Director of Genome Sciences, Scripps Translational Science Institute, Scripps Genomic Medicine
     • Gabor Marth, Ph.D.: Professor of Bioinformatics, Boston College
     • Michael Zody, M.S.: Chief Technologist, Broad Institute
     • Elliott Margulies, Ph.D.: Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health
     • Ken Dewar, Ph.D.: Assistant Professor, McGill University and Genome Quebec
  7. NGS @ 30K Feet
  8. Machines and Vendors
  9. Obligatory NGS Exponential Growth Slide
     Nature Biotechnology, Volume 26, Number 10, October 2008
  10. Ultra High Throughput + Lower Cost = Broader Applications
      • RNA-Seq / Whole Transcriptome
        – mRNA Expression & Discovery
        – Alternative Splicing
        – Allele-Specific Expression
        – microRNA Expression & Discovery
      • Epigenome
        – Transcriptionally Active Sites
        – Protein-DNA Interactions
        – Methylation Analysis
      • Genome
        – De Novo
        – Resequencing / Mutation Discovery & Profiling
        – Exome Sequencing
        – Copy Number Variation
        – Ancient DNA
      • Metagenome
        – Microbial Diversity
        – Heterogeneous Samples
  11. Challenges
  12. Challenges: Technical Expertise
  13. Experimental Design Considerations
      • Sequencing Platform in Use
      • Choice of Library Construction
      • Depth of coverage
      • Re$ources
      • Number of Replicates
      • Number of Samples and Control
      • Etc…
  14. Challenges: Flexibility w/ Standards
  15. Flexibility with Standards and Scale
      • Then (CE) – The Norm
        – 10 Machines, 30 – 360 Days, 1 Project
      • Now (Illumina/SOLiD/454) – Scale
        – 1 machine, 14 Days, 30 Projects
      • Now (Ion Torrent) – Flexibility
        – 1 machine, 1 Day, 1 Project
      • Standardization of analysis (Details Later)
  16. Challenges: Sample Preparation
  17. Sample Sourcing for RNA Projects
      – Blood: Large quantities of sample available, but with limited utility in transcriptome analysis
      – Tissue: Needle biopsy most common, but sample quantity very low
      – Surgical section: Larger quantities available, but limited utility; need laser capture microdissection to provide useful results, sample quantity very low
      – FFPE Slides: Very useful in clinical research, but amount of sample and quality low
  18. Unamplified vs Amplified
      • Prostate Cancer Cell Line (Vcap) from CPDR
        – Well characterized
        – Differential expression upon the addition of androgens
        – Compared transcriptome from a single pool of RNA
          • Unamplified, ribosomally depleted (Ribominus™)
          • Amplified, no ribosomal depletion required
          • Two pipelines for analysis
  19. Amplification Gives Different Results
      • Gene Expression in Unstimulated Cells: 14,075
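  20. Spearman’s Correlation from 2 Pipelines
      Pipeline A
      | Sample (androgen) | Unamplified + | Unamplified - | Amplified + | Amplified - |
      | Unamplified +     | …             | 0.930         | 0.904       | 0.892       |
      | Unamplified -     | …             | …             | 0.896       | 0.900       |
      | Amplified +       | …             | …             | …           | 0.928       |
      | Amplified -       | …             | …             | …           | …           |
      Pipeline B
      | Sample (androgen) | Unamplified + | Unamplified - | Amplified + | Amplified - |
      | Unamplified +     | …             | 0.853         | 0.757       | 0.701       |
      | Unamplified -     | …             | …             | 0.720       | 0.712       |
      | Amplified +       | …             | …             | …           | 0.848       |
      | Amplified -       | …             | …             | …           | …           |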
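For readers unfamiliar with the statistic on this slide, here is a minimal sketch of Spearman's rank correlation in plain Python. It ranks each expression vector (ties get the average rank) and takes the Pearson correlation of the ranks. The function names are illustrative only and the slides do not say which implementation either pipeline used.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-gene expression values for two library preps.
    unamplified = [5.0, 120.0, 33.0, 0.5, 48.0]
    amplified = [9.0, 300.0, 20.0, 1.0, 60.0]
    print(round(spearman(unamplified, amplified), 3))
```

Because the statistic depends only on rank order, it is insensitive to the overall scale shift that amplification introduces, which makes the drop from about 0.93 within a prep to about 0.90 (Pipeline A) or 0.70 to 0.76 (Pipeline B) across preps notable.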
  21. Challenges: Sample Analysis
  22. Exome Seq Ultimately About Variants
      • Coverage
      • Project Design
        – Cohorts
        – Cancer
      • Algorithms a Solved Problem?
        – Single open source pipelines
        – Single commercial pipelines
        – Proprietary internal algorithms
        – A mixture?
  23. Ultimately Comes to Variation
      • Coverage
      • Project Design
        – Cohorts
        – Cancer
      • Algorithms Solved Problem?
        – Single open source pipelines
        – Single commercial pipelines
        – Proprietary internal algorithms
        – A mixture?
  24. Digging Deep with an Exome
      Genetic variation in an individual human exome. Ng PC, Levy S, Huang J, Stockwell TB, Walenz BP, Li K, Axelrod N, Busam DA, Strausberg RL, Venter JC. PLoS Genet. 2008 Aug 15;4(8):e1000160.
  25. Venter Genome - Algorithms
      • PLoS Genetics 2008, vol 4, issue 8, e1000160
      • ~21K SNP in exons (29MB Targeted)
      • 36,206 expected SNPs for 50MB Kit
      % Difference, Homozygous
      | Software | TP   | TN | FP   | FN   | Sensitivity | Pos.pred.val |
      | B        | 1%   | 0% | -39% | -1%  | 1%          | 4%           |
      | A        | 31%  | 0% | 88%  | -41% | 31%         | -6%          |
      | C        | -32% | 0% | -49% | 42%  | -32%        | 2%           |
      % Difference, Heterozygous
      | Software | TP   | TN | FP   | FN   | Sensitivity | Pos.pred.val |
      | B        | 0%   | 0% | 16%  | 0%   | 0%          | -9%          |
      | A        | -15% | 0% | -44% | 21%  | -15%        | 16%          |
      | C        | 15%  | 0% | 28%  | -20% | 15%         | -7%          |
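  26. 3 Tools and Associated SNP Counts
      • Software A – 45,511
      • Software B – 29,814
      • Software C – 40,964
  27. Software B v. Software A (Venn diagram, not to scale)
      • B total: 29,814; A total: 45,511
      • B only: 8,564; shared: 21,250; A only: 24,261
      • Union: 54,075; Intersection: 21,250
  28. Software B v. Software C (Venn diagram)
      • B total: 29,814; C total: 40,964
      • B only: 6,358; shared: 23,456; C only: 17,508
      • Union: 47,322; Intersection: 23,456
  29. Software A v. Software C (Venn diagram)
      • A total: 45,511; C total: 40,964
      • A only: 14,738; shared: 30,773; C only: 10,191
      • Union: 55,702; Intersection: 30,773
  30. Software A, B and C (three-way Venn diagram)
      • Totals: B 29,814; A 45,511; C 40,964
      • B only: 4,750; A and B only: 1,608; A only: 13,130
      • B and C only: 3,814; A and C only: 11,131; C only: 6,377
      • All three: 19,642
      • Union: 60,452; Intersection: 19,642; Voting Scheme (2/3): 36,195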
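The 2-of-3 voting scheme on the three-way slide can be sketched as follows. This is an illustrative reconstruction, not the presenter's code: `consensus_calls` and `count_with_min_votes` are hypothetical names, and the assignment of the seven Venn regions to callers is inferred from the pairwise intersections shown on the preceding slides.

```python
from collections import Counter


def consensus_calls(call_sets, min_votes=2):
    """Keep variants reported by at least `min_votes` of the callers."""
    votes = Counter()
    for calls in call_sets:
        votes.update(set(calls))  # each caller votes at most once per variant
    return {variant for variant, n in votes.items() if n >= min_votes}


# Region sizes from the three-way Venn slide, keyed by the callers that
# reported the SNPs in that region (region labels inferred, see lead-in).
VENN_REGIONS = {
    frozenset("B"): 4_750,
    frozenset("A"): 13_130,
    frozenset("C"): 6_377,
    frozenset("AB"): 1_608,
    frozenset("BC"): 3_814,
    frozenset("AC"): 11_131,
    frozenset("ABC"): 19_642,
}


def count_with_min_votes(regions, min_votes):
    """Number of SNPs supported by at least `min_votes` callers."""
    return sum(n for callers, n in regions.items() if len(callers) >= min_votes)


if __name__ == "__main__":
    print("union:", count_with_min_votes(VENN_REGIONS, 1))
    print("2 of 3:", count_with_min_votes(VENN_REGIONS, 2))
    print("all 3:", count_with_min_votes(VENN_REGIONS, 3))

    # Toy call sets; variants are (chrom, pos, ref, alt) tuples.
    a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
    b = {("chr1", 100, "A", "G")}
    c = {("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
    print(sorted(consensus_calls([a, b, c])))
```

Summing the regions reproduces the slide's figures (union 60,452, 2-of-3 vote 36,195, three-way intersection 19,642), which also confirms that Software A's total is 45,511.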
  31. Challenges: Platforms
  32. The weight in…
      | Platform        | Yield/Day         | Read Error Rates | Read Lengths |
      | Illumina MiSeq  | 2.0 Gb            | 1.3% (V4)        | 150          |
      | Ion Torrent PGM | 0.5 Gb (316 Chip) | 1.5% (316 Chip)  | 120 - 240    |
      | PacBio RS       | 3.0 Gb            | 2-15%            | 430-2900     |
      • Illumina and PacBio numbers from Vendor Sequencing
      • Ion Torrent from EdgeBio Sequencing
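To relate the yield column to the "depth of coverage" design question raised earlier, here is a rough back-of-the-envelope sketch. The yields are copied from the table; the genome size and the function names are assumptions made for illustration, and the calculation ignores read filtering, mapping rate and duplicates.

```python
import math

# Yield per day in bases, copied from the slide's table.
YIELD_PER_DAY = {
    "Illumina MiSeq": 2.0e9,
    "Ion Torrent PGM (316 Chip)": 0.5e9,
    "PacBio RS": 3.0e9,
}


def mean_coverage(yield_bases, genome_size):
    """Average raw depth: total sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return yield_bases / genome_size


def days_to_reach(target_coverage, genome_size, yield_per_day):
    """Whole days of sequencing needed to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / yield_per_day)


if __name__ == "__main__":
    # Assumed genome size for illustration only (roughly bacterial scale).
    genome = 4_700_000
    for platform, bases in YIELD_PER_DAY.items():
        print(f"{platform}: about {mean_coverage(bases, genome):.0f}x raw depth per day")
```

Raw depth is only a starting point: the error rates and read lengths in the other two columns determine how much of that depth is usable for a given application.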
  33. Illumina MiSeq: Mid-Range Length, Accurate Reads, Large Throughput
      • All Resequencing
      • All De novo Applications
      • Transcriptome
      • Methylation
  34. Ion Torrent PGM: Long, Mostly Accurate Reads in 2.5 Hours
      • Microbial & Viral Resequencing
      • Microbial & Viral De novo Applications
      • Eukaryotic Amplicon Sequencing
      • Metagenomics
        – WGS
        – 16S Surveys
  35. PacBio RS: Ultra Long, Less Accurate Reads & Rapid Sequence
      • Microbial & Viral De novo Applications
      • Structural Variation / Haplotyping
  36. Ion Torrent PGM
      | Name             | Total # Reads | Mean Read Length | Longest Read | Total # (Mbp) | Q20 Mb | A20 Mean Read Length |
      | HG19-01          | 2,660,176     | 139              | 203          | 369.91        | 124.00 | 74                   |
      | HG19-02          | 2,321,405     | 121              | 202          | 281.43        | 116.43 | 75                   |
      | HG19-03          | 2,471,922     | 134              | 203          | 331.54        | 124.17 | 77                   |
      | Microbe (37% GC) | 2,869,789     | 122              | 202          | 350.23        | 160.48 | 82                   |
      | Microbe (30% GC) | 2,866,851     | 122              | 202          | 350.16        | 141.31 | 81                   |
  37. Ion Torrent PGM
      | Name          | Total # Reads | # Aligned / Assembled Reads | % Aligned / Assembled Reads | # Contigs | N50 Contig | Largest Contig | Percent of Aligned Genome Covered | Consensus Accuracy (AQ40) |
      | DH10B Mapping | 1,384,863     | 1,334,138                   | 96.34%                      | 90        | 107,749    | 326,368        | 99.51%                            | 99.97%                    |
      | DH10B Denovo  | 1,384,863     | 1,335,604                   | 96.44%                      | 216       | 42,499     | 146,899        | 99.53%                            | 1.73%                     |
      On Similar Illumina Data Set
      • Normalizing for coverage and removing Paired Ends
      • N50 of 94,926 and Largest Contig of 236,274
      • Removing normalization improved numbers
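The N50 figures compared on this slide follow the usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch, with a made-up list of contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, not taken from the DH10B assemblies.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 rewards contiguity only; it says nothing about correctness, which is why the slide reports it alongside the fraction of reads assembled and the genome covered.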
  38. Why the Difference? Quality?
  39. Quality?
      Q-Q plots of the DH10B Ion Torrent 316 chip data, expected vs empirical quality, before recalibration (left) and after recalibration (right).
  40. Quality
      Q-Q plots of DH10B MiSeq data, expected vs empirical quality, before recalibration (left) and after recalibration (right).
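The comparison behind these Q-Q plots rests on the Phred scale, Q = -10 log10(p), where p is the probability that a base call is wrong. The sketch below groups base calls by their reported quality and converts the observed mismatch rate in each group back to an empirical Q. It is a simplified illustration: the function names, the input format and the cap at Q60 are assumptions, and real recalibration tools also condition on covariates such as machine cycle and sequence context.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def empirical_quality(errors, total, max_q=60.0):
    """Phred-scaled observed error rate, capped at max_q (cap is an assumption)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if errors == 0:
        return max_q  # no observed errors: cap rather than return infinity
    return min(max_q, -10 * math.log10(errors / total))


def empirical_by_reported(observations, max_q=60.0):
    """Map each reported Q to the empirical Q of the bases carrying it.

    `observations` is an iterable of (reported_q, is_error) pairs, where
    is_error is True when the aligned base disagrees with the reference.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [errors, total]
    for reported_q, is_error in observations:
        counts[reported_q][0] += int(is_error)
        counts[reported_q][1] += 1
    return {q: empirical_quality(e, n, max_q) for q, (e, n) in sorted(counts.items())}


if __name__ == "__main__":
    # Toy data: bases reported as Q20 that are wrong 1 time in 10.
    obs = [(20, True)] + [(20, False)] * 9
    print(empirical_by_reported(obs))
```

Points on the diagonal of such a plot mean the instrument's reported qualities match what alignment to a known reference shows; points below it mean the qualities are overstated.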
  41. Empirical Quality
  42. Empirical Quality - Long Reads
  43. Then Why?
      • De Bruijn Graphs adversely affected by more frequent INDEL characteristics of Ion Torrent
      • Higher Average Quality reads are less abundant in Ion Torrent
  44. Does this matter in Resequencing?
      • Depends on the tools used!
        – If you understand error profile, you can correct for it…
      • Ran Simulated DH10B mutation experiment
        1. Make mutated E. coli reference (fakemut)
        2. Align data to mutated reference (clc, tmap, or other mappers)
        3. Calculate per base coverage on the BAM file (genomeCoverageBed)
        4. Run samtools/mpileup/vcffilter (or CLC SNP/INDEL calling) to call variants
           – Run various settings to compare variant calling
        5. Calculate false positives, true positives, and false negatives
        6. Calculate number of variants missed due to low coverage
        7. Calculate PPV and corrected sensitivity
        8. Graph PPV and corrected sensitivity
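Steps 5 to 7 of the simulated mutation experiment can be sketched as set arithmetic. The slides do not define "corrected sensitivity"; the version below assumes it means sensitivity after excluding planted variants that fall in positions with too little coverage to call, and labels that choice in the code. All names are illustrative.

```python
def score_calls(truth, called, low_coverage=frozenset()):
    """Score called variants against the variants planted in the reference.

    truth        -- variants introduced into the mutated reference
    called       -- variants reported by the variant caller
    low_coverage -- planted variants at positions too shallow to call
    """
    truth, called, low_coverage = set(truth), set(called), set(low_coverage)
    tp = len(truth & called)
    fp = len(called - truth)
    missed = truth - called
    fn = len(missed)
    fn_low_cov = len(missed & low_coverage)

    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # ASSUMPTION: "corrected" sensitivity drops misses explained by low coverage.
    callable_truth = tp + fn - fn_low_cov
    corrected = tp / callable_truth if callable_truth else 0.0

    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "fn_low_coverage": fn_low_cov,
        "ppv": ppv,
        "sensitivity": sensitivity,
        "corrected_sensitivity": corrected,
    }


if __name__ == "__main__":
    # Toy example with variants identified by position only.
    planted = {101, 202, 303, 404, 505}
    calls = {101, 202, 303, 999}
    shallow = {505}
    for key, value in score_calls(planted, calls, shallow).items():
        print(key, value)
```

Separating coverage-driven misses from caller-driven misses keeps a shallow region of one run from being counted against the aligner or caller being compared.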
  45. Resequencing
      • Ion claims substitution issues with MiSeq (1)
      • Illumina claims INDEL issues with Ion (2)
      1. http://www.iontorrent.com/lib/images/PDFs/co23743_pgm_app_note.pdf
      2. http://www.illumina.com/Documents/products/appnotes/appnote_miseq_ecoli.pdf
  46. Resequencing
      | Pipeline                                    | Variants Identified | Specificity | Sensitivity | PPV                            |
      | Ion/TMAP/SamTools (Mod)                     | 460                 | 100%        | 76.957%     | 97.676%                        |
      | Ion/TMAP/SamTools (Default)                 | 459                 | 99.895%     | 91.939%     | 6.014% (~6500 False Negatives) |
      | MiSeq/Eland/SamTools (Default – SNPs ONLY)  | 220                 | 99.99996%   | 95.91%      | 99.06%                         |
      | MiSeq/CLC/SamTools (Default)                | 459                 | 95.464%     | 99.998%     | 83.871 (~65 False Negatives)   |
      • MiSeq SubSampled on DH10B (TMAP/Samtools): 9 total variants identified, 8 SNPs and 1 INDEL
      • Ion SubSampled on DH10B (TMAP/Samtools): 16 total variants identified, 0 SNPs and 16 INDEL
  47. Ion Data: PPV and Sensitivity of Samtools Analyses (chart)
      • y-axis: 0% to 100%
      • Series: Total PPV; SNPs PPV; INDELs PPV; Total Corrected Sensitivity; SNPs Corrected Sensitivity; INDELs Corrected Sensitivity
      • x-axis (variant calling settings): Default Samtools; Q4, h100, o20, e27, m1, H1; Q14, h75, o20, e21, m4, H2; Q7, h50, o10, e17, m4, H1; Q14, h50, o10, e17, m4, H1; Q14, h50, o10, e17, m4, H2
  48. MiSeq Data: PPV and Sensitivity of Samtools Analyses of MiSeq Data (chart)
      • y-axis: 0% to 100%
      • Series: Total PPV; SNPs PPV; INDELs PPV; Total Corrected Sensitivity; SNPs Corrected Sensitivity; INDELs Corrected Sensitivity
      • x-axis: MiSeq CLC with Default Samtools; MiSeq CLC Map with Variant Analysis; MiSeq TMAP Map with Variant Analysis
  49. Resequencing Conclusion
      Using appropriate aligners and variant callers, we show both platforms have high accuracy, each with strengths and weaknesses…
  50. What About PacBio?
      • We have less experience with PacBio
      • We (EdgeBio) think PacBio may have a niche, but given the large initial investment, we are waiting.
      • Many conferences and posters – only results seen are for de novo sequencing and finishing (Broad).
      • Will be here all week and would love to hear why you love it.
  51. Take This Home
      • There are many challenges before we even get to picking a platform
        – Technical Expertise
        – Standards in Prep and Analysis
      With Great NGS Power Comes Great Responsibility
  52. Acknowledgements
      • CPDR (Center for Prostate Disease Research) Collaboration
        – Shyh-Han Tan, Ph.D.
      • EdgeBio Sequencing: Joy Adigun, Elyse Nagle, Jennifer Sheffield, Rossio Kersey, Ryan Mease
      • EdgeBio IFX: John Seed, Anjana Varadarajan, David Jenkins, Phil Dagosto, Quang Tri Nguyen
  53. Questions
      Twitter: @Bioinfo
      jjohnson@edgebio.com
