The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule: Presentation Transcript
The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule. Justin H. Johnson, Director of Bioinformatics, EdgeBio, Washington DC, USA
Agenda• Who We Are• NGS at 30K• The Challenges – Even Before We Get to the Platforms – When We Get to the Platforms
Who We Are
Life Tech Service Provider
Contract Research Division• Five SOLiD4 sequencing platforms• One Life Technologies 5500XL• Two Ion Torrent PGMs• Bioinformatics consulting on Illumina, 454, and PacBio• Automation through Caliper Sciclone & Biomek FX• Commercial partnerships with companies such as CLCBio, DNANexus and Genologics• MD/PhD & Masters Level Scientists and Bioinformaticians• IT Infrastructure of >100 CPUs and >100TB storage
Edge BioServ Scientific Advisory Board• Elaine Mardis, Ph.D., Co-Director, Genome Sequencing Center, Washington University School of Medicine• Steven Salzberg, Ph.D., Director, Center for Bioinformatics and Computational Biology, University of Maryland• Sam Levy, Ph.D., Director of Genome Sciences, Scripps Translational Science Institute, Scripps Genomic Medicine• Gabor Marth, Ph.D., Professor of Bioinformatics, Boston College• Michael Zody, M.S., Chief Technologist, Broad Institute• Elliott Margulies, Ph.D., Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health• Ken Dewar, Ph.D., Assistant Professor, McGill University and Genome Quebec
Ultra High Throughput + Lower Cost = Broader Applications• RNA-Seq/Whole Transcriptome – mRNA Expression & Discovery – Alternative Splicing – Allele-Specific Expression – microRNA Expression & Discovery• Epigenome – Transcriptionally Active Sites – Protein-DNA Interactions – Methylation Analysis• Genome – De Novo – Resequencing/Mutation Discovery & Profiling – Exome Sequencing – Copy Number Variation• Metagenome – Microbial Diversity – Heterogeneous Samples – Ancient DNA
Experimental Design Considerations• Sequencing Platform in Use• Choice of Library Construction• Depth of Coverage• Re$ources• Number of Replicates• Number of Samples and Controls• Etc…
Challenges: Flexibility w/ Standards
Flexibility with Standards and Scale• Then (CE) – The Norm – 10 Machines, 30 – 360 Days, 1 Project• Now (Illumina/SOLiD/454) – Scale – 1 machine, 14 Days, 30 Projects• Now (Ion Torrent) - Flexibility – 1 machine, 1 Day, 1 Project.• Standardization of analysis (Details Later)
Sample Sourcing for RNA Projects – Blood: Large quantities of sample available, but with limited utility in transcriptome analysis – Tissue: Needle biopsy most common, but sample quantity very low – Surgical section: Larger quantities available, but limited utility; need laser capture microdissection to provide useful results, sample quantity very low – FFPE Slides: Very useful in clinical research, but sample amount and quality are low.
Unamplified vs Amplified• Prostate Cancer Cell Line (VCaP) from CPDR – Well characterized – Differential expression upon the addition of androgens – Compared transcriptome from a single pool of RNA • Unamplified, ribosomally depleted (Ribominus™) • Amplified, no ribosomal depletion required • Two pipelines for analysis
Amplification Gives Different Results• Gene Expression in Unstimulated Cells 14,075
Exome Seq Ultimately About Variants• Coverage• Project Design – Cohorts – Cancer• Algorithms a Solved Problem? – Single open source pipelines – Single commercial pipelines – Proprietary internal algorithms. – A mixture?
Ultimately Comes to Variation• Coverage• Project Design – Cohorts – Cancer• Algorithms a Solved Problem? – Single open source pipelines – Single commercial pipelines – Proprietary internal algorithms. – A mixture?
Digging Deep with an Exome. Genetic variation in an individual human exome. Ng PC, Levy S, Huang J, Stockwell TB, Walenz BP, Li K, Axelrod N, Busam DA, Strausberg RL, Venter JC. PLoS Genet. 2008 Aug 15;4(8):e1000160.
3 Tools and Associated SNP Counts• Software A – 45,511• Software B – 29,814• Software C – 40,964
Software B v. Software A (Venn diagram, not to scale)• Software B: 29,814 total, 8,564 unique• Software A: 45,511 total, 24,261 unique• Intersection: 21,250• Union: 54,075
Software B v. Software C (Venn diagram)• Software B: 29,814 total, 6,358 unique• Software C: 40,964 total, 17,508 unique• Intersection: 23,456• Union: 47,322
Software A v. Software C (Venn diagram)• Software A: 45,511 total, 14,738 unique• Software C: 40,964 total, 10,191 unique• Intersection: 30,773• Union: 55,702
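Software A, B, and C (three-way Venn diagram)• Software B: 29,814 total, 4,750 unique• Software A: 45,511 total, 13,130 unique• Software C: 40,964 total, 6,377 unique• A and B only: 1,608• B and C only: 3,814• A and C only: 11,131• Intersection (all three): 19,642• Union: 60,452• Voting Scheme (2/3): 36,195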
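The union, intersection, and 2-of-3 voting figures on the Venn slides can be sketched with plain set operations. This is a minimal illustration, not the pipeline used in the talk: the function name `consensus_calls` and the toy variant keys are hypothetical, and only the seven region counts are taken from the three-way slide.

```python
from collections import Counter

def consensus_calls(call_sets, min_votes=2):
    """Return the variants reported by at least `min_votes` of the callers."""
    votes = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in votes.items() if n >= min_votes}

# Toy variant keys (chrom, pos, ref, alt); hypothetical, not data from the talk.
calls_a = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
calls_b = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A")}
calls_c = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr3", 7, "T", "C")}

union = calls_a | calls_b | calls_c
intersection = calls_a & calls_b & calls_c
voted = consensus_calls([calls_a, calls_b, calls_c], min_votes=2)

# Region counts from the three-way Venn slide, keyed by which tools made the call.
regions = {"A": 13130, "B": 4750, "C": 6377,
           "AB": 1608, "AC": 11131, "BC": 3814, "ABC": 19642}
slide_union = sum(regions.values())
slide_voted = sum(n for tools, n in regions.items() if len(tools) >= 2)

print(len(union), len(intersection), len(voted))
print(slide_union, slide_voted)
```

Summing the regions this way reproduces the slide's union and voting totals, which is a quick consistency check on the per-tool counts.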
The Weigh-In…
Platform | Yield/Day | Read Error Rate | Read Length (bp)
Illumina MiSeq | 2.0 Gb | 1.3% (V4) | 150
Ion Torrent PGM | 0.5 Gb (316 Chip) | 1.5% (316 Chip) | 120 - 240
PacBio RS | 3.0 Gb | 2-15% | 430-2900
• Illumina and PacBio numbers from Vendor Sequencing• Ion Torrent from EdgeBio Sequencing
Illumina MiSeq: Mid-Range Length, Accurate Reads, Large Throughput• All Resequencing• All De novo Applications• Transcriptome• Methylation
Ion Torrent PGM: Long, Mostly Accurate Reads in 2.5 Hours• Microbial & Viral Resequencing• Microbial & Viral De novo Applications• Eukaryotic Amplicon Sequencing• Metagenomics – WGS – 16S Surveys
PacBio RS: Ultra Long, Less Accurate Reads & Rapid Sequence• Microbial & Viral De novo Applications• Structural Variation / Haplotyping
Ion Torrent PGM
Name | Total # Reads | Mean Read Length | Longest Read | Total (Mbp) | Q20 Mb | A20 Mean Read Length
HG19-01 | 2,660,176 | 139 | 203 | 369.91 | 124.00 | 74
HG19-02 | 2,321,405 | 121 | 202 | 281.43 | 116.43 | 75
HG19-03 | 2,471,922 | 134 | 203 | 331.54 | 124.17 | 77
Microbe (37% GC) | 2,869,789 | 122 | 202 | 350.23 | 160.48 | 82
Microbe (30% GC) | 2,866,851 | 122 | 202 | 350.16 | 141.31 | 81
Ion Torrent PGM
Name | Total # Reads | # Aligned / Assembled Reads | % Aligned / Assembled Reads | # Contigs | N50 Contig | Largest Contig | Percent of Aligned Genome Covered | Consensus Accuracy (AQ40)
DH10B Mapping | 1,384,863 | 1,334,138 | 96.34% | 90 | 107,749 | 326,368 | 99.51% | 99.97%
DH10B De novo | 1,384,863 | 1,335,604 | 96.44% | 216 | 42,499 | 146,899 | 99.53% | 1.73%
On Similar Illumina Data Set• Normalizing for coverage and removing Paired Ends• N50 of 94,926 and Largest Contig of 236,274• Removing normalization improved numbers
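For readers less familiar with the N50 column, the sketch below shows the usual definition: the contig length at which the running total of contigs, sorted longest first, reaches half of the assembled bases. The function name and the toy contig lengths are hypothetical; they are not taken from the DH10B assemblies.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of contigs,
    sorted longest first, reaches at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")

# Hypothetical toy assembly: 300 bases in five contigs.
toy_contigs = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

Because N50 depends on the total assembled length, comparisons between platforms are most meaningful at similar coverage, which is why the slide normalizes the Illumina data set first.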
Why the Difference? Quality?
Quality? Q-Q plots of the DH10B Ion Torrent 316 chip data, expected vs empirical quality, before recalibration (left) and after recalibration (right).
Quality: Q-Q plots of DH10B MiSeq data, expected vs empirical quality, before recalibration (left) and after recalibration (right).
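The expected vs empirical comparison behind these plots can be sketched as follows: group base calls by their reported Phred score, count how many were actually mismatches against the reference, and convert the observed error rate back to a Phred score with Q = -10 log10(p). This is a minimal sketch, not the recalibration tool used for the slides; the function name, the simple pseudocount, and the toy observations are all assumptions.

```python
import math
from collections import defaultdict

def empirical_quality(observations):
    """Map each reported Phred score to the empirical Phred score.

    observations: iterable of (reported_q, is_error) pairs, one per base call.
    A pseudocount of (errors + 1) / (bases + 2) avoids log10(0); this
    smoothing choice is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        errors[reported_q] += 1 if is_error else 0
    result = {}
    for q in sorted(totals):
        rate = (errors[q] + 1) / (totals[q] + 2)
        result[q] = -10 * math.log10(rate)
    return result

# Hypothetical toy data: bases reported as Q20 behave like Q20,
# bases reported as Q30 behave like Q10 (overconfident scores).
toy_observations = [(20, i < 9) for i in range(998)] + [(30, i < 9) for i in range(98)]
toy_empirical = empirical_quality(toy_observations)
for reported, observed in toy_empirical.items():
    print(reported, round(observed, 2))
```

Points on the diagonal of a Q-Q plot correspond to reported scores that match the empirical ones; recalibration aims to move the off-diagonal bins back toward it.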
Empirical Quality - Long Reads
Then Why?• De Bruijn graphs are adversely affected by the more frequent INDELs characteristic of Ion Torrent• Reads with higher average quality are less abundant in Ion Torrent
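To illustrate why read errors hurt de Bruijn graph assembly, the sketch below counts the spurious k-mers (graph nodes) that a single deleted base adds to a read. The sequence, the value of k, and the helper name `kmers` are hypothetical and chosen only for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (de Bruijn graph nodes) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 5
reference = "ACGTTGCAAGCTAGGCTTACGATCCGTA"  # hypothetical toy sequence

# Simulate a read that covers the whole sequence but drops one base (index 12).
read_with_deletion = reference[:12] + reference[13:]

reference_nodes = kmers(reference, k)
novel_nodes = kmers(read_with_deletion, k) - reference_nodes

print(sorted(novel_nodes))
print(len(novel_nodes))
```

In this toy case one deletion adds k - 1 k-mers that do not exist in the true sequence. Substitutions add spurious k-mers as well, so the effect on the graph scales with how often any error type occurs; a platform with more frequent INDELs creates correspondingly more false branches for the assembler to resolve.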
Does this matter in Resequencing?
• Depends on the tools used!
 – If you understand the error profile, you can correct for it…
• Ran simulated DH10B mutation experiment
 1. Make mutated E. coli reference (fakemut)
 2. Align data to mutated reference (clc, tmap, or other mappers)
 3. Calculate per-base coverage on the BAM file (genomeCoverageBed)
 4. Run samtools/mpileup/vcffilter (or CLC SNP/INDEL calling) to call variants; run various settings to compare variant calling
 5. Calculate false positives, true positives, and false negatives
 6. Calculate number of variants missed due to low coverage
 7. Calculate PPV and corrected sensitivity
 8. Graph PPV and corrected sensitivity
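Steps 5 through 7 above can be sketched in a few lines once the simulated mutations and the called variants are available as sets. The talk does not give its exact formula, so "corrected sensitivity" is assumed here to mean sensitivity computed after excluding false negatives at positions below a coverage threshold. The function name, the `min_depth=10` threshold, and the toy data are hypothetical.

```python
def evaluate_calls(truth, called, depth, min_depth=10):
    """Compare called variants with the known simulated mutations.

    truth, called: sets of (pos, ref, alt) tuples.
    depth: dict mapping position -> per-base coverage (e.g. parsed from
           genomeCoverageBed output; the parsing is not shown here).
    """
    true_pos = truth & called
    false_pos = called - truth
    false_neg = truth - called
    # False negatives that could not have been called because coverage was too low.
    low_cov_missed = {v for v in false_neg if depth.get(v[0], 0) < min_depth}
    callable_fn = len(false_neg) - len(low_cov_missed)

    nan = float("nan")
    ppv = len(true_pos) / len(called) if called else nan
    sensitivity = len(true_pos) / len(truth) if truth else nan
    denom = len(true_pos) + callable_fn
    corrected = len(true_pos) / denom if denom else nan
    return {
        "TP": len(true_pos), "FP": len(false_pos), "FN": len(false_neg),
        "missed_low_coverage": len(low_cov_missed),
        "PPV": ppv, "sensitivity": sensitivity,
        "corrected_sensitivity": corrected,
    }

# Hypothetical toy example: four simulated mutations, three calls.
toy_truth = {(10, "A", "G"), (20, "C", "T"), (30, "G", "A"), (40, "T", "C")}
toy_called = {(10, "A", "G"), (20, "C", "T"), (55, "A", "C")}
toy_depth = {10: 30, 20: 25, 30: 3, 40: 28, 55: 31}

toy_metrics = evaluate_calls(toy_truth, toy_called, toy_depth)
print(toy_metrics)
```

Separating low-coverage misses from genuine misses keeps a shallow run from being mistaken for a weak variant caller, which matters when comparing platforms sequenced to different depths.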
Resequencing• Ion claims substitution issues with MiSeq [1]• Illumina claims INDEL issues with Ion [2]
1. http://www.iontorrent.com/lib/images/PDFs/co23743_pgm_app_note.pdf
2. http://www.illumina.com/Documents/products/appnotes/appnote_miseq_ecoli.pdf
Resequencing
Pipeline | Variants Identified | Specificity | Sensitivity | PPV
Ion/TMAP/SamTools (Mod) | 460 | 100% | 76.957% | 97.676%
Ion/TMAP/SamTools (Default) | 459 | 99.895% | 91.939% | 6.014% (~6500 False Negatives)
MiSeq/Eland/SamTools (Default – SNPs ONLY) | 220 | 99.99996% | 95.91% | 99.06%
MiSeq/CLC/SamTools (Default) | 459 | 95.464% | 99.998% | 83.871 (~65 False Negatives)
• MiSeq SubSampled on DH10B (TMAP/Samtools): 9 total variants identified; 8 SNPs and 1 INDEL
• Ion SubSampled on DH10B (TMAP/Samtools): 16 total variants identified; 0 SNPs and 16 INDELs
MiSeq Data: PPV and Sensitivity of Samtools Analyses of MiSeq Data [Bar chart, y-axis 0% to 100%. Series: Total PPV, SNPs PPV, INDELs PPV, Total Corrected Sensitivity, SNPs Corrected Sensitivity, INDELs Corrected Sensitivity. Categories: MiSeq CLC with Default Samtools, MiSeq CLC Map with Variant Analysis, MiSeq TMAP Map with Variant Analysis]
Resequencing Conclusion: Using appropriate aligners and variant callers, we show both platforms have high accuracy, each with strengths and weaknesses…
What About PacBio?• We have less experience with PacBio• We (EdgeBio) think PacBio may have a niche, but given the large initial investment, we are waiting.• Many conferences and posters – only results seen are for de novo sequencing and finishing (Broad).• Will be here all week and would love to hear why you love it.
Take This Home• There are many challenges before we even get to picking a platform – Technical Expertise – Standards in Prep and Analysis• With Great NGS Power Comes Great Responsibility
Acknowledgements• CPDR (Center for Prostate Disease Research) Collaboration – Shyh-Han Tan, Ph.D.• EdgeBio Sequencing – Joy Adigun – Elyse Nagle – Jennifer Sheffield – Rossio Kersey – Ryan Mease• EdgeBio IFX – John Seed – Anjana Varadarajan – David Jenkins – Phil Dagosto – Quang Tri Nguyen