Next-Gen Sequencing: 4 years in the trenches
C. Titus Brown, Asst Prof, CSE and Microbiology; BEACON NSF STC, Michigan State University
email@example.com
These slides are available online: search "titus brown slideshare". You can also e-mail me: firstname.lastname@example.org
Also note that these are my opinions and observations, culled from personal experience, online material, and reading. I'm happy to cite/explain further upon request, but: Your Mileage May Vary.
Things I won't talk about
Don't work on/with/have anything useful to say about:
- Exome sequencing
- Ancient DNA
- ChIP-seq (protein-DNA interactions)
Work on, but you're probably not interested in:
- Metagenomics (sequencing uncultured microbial communities)
- Bioinformatics data structures and algorithms
Overview
- Shotgun sequencing basics
- Things everyone wants to know: how much $$...
- Various current problems & challenges
- Technology, now and future
- Some papers and projects worth looking at; & our own experiences
Two specific concepts:
First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)
Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot), then any linear increase in density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
These two concepts underlie the recent stunning increases in sequencing capacity.
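The second concept is just the geometry of area scaling; a minimal sketch (the numbers below are illustrative only, not real flowcell specs):

```python
# Reads on a 2-D substrate scale with features per unit area,
# which is the square of the linear feature density.
# (Illustrative numbers; not any real instrument's specs.)

def reads_per_run(features_per_mm, area_mm2=100):
    return features_per_mm ** 2 * area_mm2

base = reads_per_run(1000)     # 1000 features/mm linear density
doubled = reads_per_run(2000)  # halve the pitch: 2x linear density

print(doubled / base)  # -> 4.0: a 2x density increase gives 4x the reads
```

So every incremental improvement in well size or imaging resolution pays off quadratically, which is why per-run yields have grown so fast.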
What are current costs for Illumina?
Approximate costs from the MSU sequencing center, a few months ago, including labor:
- RNAseq: $200 prep / sample
- Single-ended 1x50: $1100/lane, 100-150 mn reads
- Paired-end 2x100: $2500/lane, 200-300 mn reads (/ 2)
Barcoding samples, etc., gets complicated. Discuss the biology, etc., with a sequencing geek before going forward!
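For planning, it can help to reduce lane prices to a cost per million reads; a back-of-the-envelope sketch using the quoted numbers (your center's prices will differ):

```python
# Rough cost per million reads from the lane prices quoted above
# (MSU figures from a few months ago; labor included, prep excluded).

def cost_per_million_reads(lane_cost_usd, reads_per_lane_millions):
    return lane_cost_usd / reads_per_lane_millions

se = cost_per_million_reads(1100, 125)  # 1x50, ~100-150 mn reads/lane
pe = cost_per_million_reads(2500, 250)  # 2x100, ~200-300 mn reads/lane

print(round(se, 2), round(pe, 2))  # roughly 8.8 and 10.0 $/million reads
```

Note that paired-end reads cost a bit more per read but carry long-range linkage information, which matters for assembly and isoform work.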
What does this data really give you?
- With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100m reads/sample is probably "enough" (see http://blog.fejes.ca/?p=607 for a good discussion).
- With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth.
- De novo assembly of complex vertebrate genomes is not casual: cheap short-read sequencing does not yet deliver good long-range contiguity; repeats and heterozygosity get in the way. The assembly & scaffolding process itself is still evolving.
Why so much data?
Why do we need 10-20x coverage (resequencing) or 50-100m reads (mRNAseq) with Illumina?
Two (linked) reasons:
- Shotgun sequencing is random
- Counting/sampling variation
1. Useful minimum coverage depends on high average coverage
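The sampling argument can be made concrete with the idealized Lander-Waterman picture, in which per-base coverage is Poisson-distributed around the average; a minimal sketch:

```python
# Under a Poisson model of shotgun coverage, even a decent average
# depth leaves a visible fraction of bases at low coverage.

from math import exp, factorial

def frac_below(avg_cov, min_cov):
    """Expected fraction of bases with coverage < min_cov."""
    return sum(exp(-avg_cov) * avg_cov ** k / factorial(k)
               for k in range(min_cov))

print(frac_below(10, 5))  # ~3% of bases sit below 5x at 10x average
print(frac_below(20, 5))  # essentially none at 20x average
```

This is why variant calling at a useful minimum depth per site requires an average depth well above that minimum.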
Coverage conclusions
- More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample).
- Your desired coverage numbers should be driven by sensitivity considerations.
Problems and challenges
- Systematic bias in sequencing and software
- Genome assembly: scaffolding and sensitivity
- Gene references
- mRNAseq isoform construction
Resequencing: bias and error
Calling SNPs by mapping -- U. Colorado
http://genomics-course.jasondk.org/?p=395
Both sequencing and bioinformatics yield many low-frequency artifacts!
- "Obvious" things like misalignments to paralogous/repeat sequences.
- Indels are handled badly by current tools (up to 60% false positive rate?!)
- Oxidation of DNA during the library prep step (acoustic shearing) generated 8-oxoguanine "lesions" responsible for artifacts involving C>A/G>T triplets.
=> With any data set, especially big ones, there will be both random and systematic error and bias.
http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
Genome assembly: scaffolding & sensitivity
Everyone wants two things from a genome assembly:
- Long/correct scaffolds (see http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon)
- Complete genome content
[Figure: sequence data. Reads are the sequenced ends of the original DNA fragments. http://www.cbcb.umd.edu/research/assembly_primer.shtml]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
[Figure: scaffolds are ordered, oriented contigs, linked by mate pairs spanning the gaps, with a gap size estimate for each gap. http://dx.doi.org/10.6084/m9.figshare.100940]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
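The gap size estimate comes from the known insert size of the mate-pair library: whatever part of the insert is not accounted for on the two contigs must be gap. A simplified sketch (the function name and arguments are mine, not any scaffolder's API; real tools average over many pairs and handle orientation and insert-size variance):

```python
# One mate pair spans insert_size bases; if bases_on_contig1 of that
# span land on the left contig and bases_on_contig2 on the right one,
# the remainder is the estimated gap between the contigs.
# (Simplified single-pair version for illustration.)

def estimate_gap(insert_size, bases_on_contig1, bases_on_contig2):
    return insert_size - bases_on_contig1 - bases_on_contig2

# A 3 kb insert with 1.2 kb on one contig and 1.0 kb on the other
# suggests an ~800 bp gap:
print(estimate_gap(3000, 1200, 1000))  # -> 800
```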
Longer reads!
[Figure: long reads can span repeats (repeat copy 1, repeat copy 2) and heterozygous regions (polymorphic contigs), joining shorter contigs.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Cod: PacBio results
[Figure: 10.6-11.4 kbp subreads mapped to the published genome.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Sensitivity: does your genome include everything?
Generally not! For example, the chick genome is missing a substantial number of genes from microchromosomes: 723 genes from HSA19q are missing from chicken galGal4, even though ESTs and RNAseq transcripts exist for many or most of them.
Approach: digital normalization (a computational version of library normalization)
Digital normalization "smooths out" coverage from different loci, and can "recover" low-coverage regions for assembly.
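The core idea can be sketched in a few lines: estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only while that estimate is below a cutoff. This is a toy sketch with a plain dictionary and a tiny k; the real implementation (khmer) uses memory-efficient probabilistic counting and a realistic k:

```python
# Toy digital normalization: keep a read only while its estimated
# coverage (the median count of its k-mers so far) is below a cutoff.
# (k=4 and a Counter are for illustration only; khmer uses k around 20
# and a probabilistic counting structure.)

from collections import Counter
from statistics import median

def diginorm(reads, k=4, cutoff=3):
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept

# 10 identical reads collapse to the few needed to reach the cutoff:
print(len(diginorm(["ACGTACGT"] * 10)))
```

Because high-coverage loci stop accumulating reads once they hit the cutoff while low-coverage loci keep theirs, the retained data set is much smaller but still covers the low-coverage regions.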
Applying diginorm to increase sensitivity
- Reassembled the chick genome from 70x Illumina -> normalized reads in ~24 hours.
- The contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq.
- Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), and Jerry Dodgson (MSU), proposing to apply PacBio and normalization to improve the chick genome; should be a generalizable approach.
Existing chick gene models lack exons, isoforms
[Figure: our data vs. existing models; this gene contains at least 4 isoforms.]
Likit Preeyanon
(Exon detection is pretty good.) Likit Preeyanon
Gene Modeler Pipeline ("gimme"?)
- Merge transcripts together based on transcript mapping to the genome; can include existing gene predictions, iterate.
- Construct gene models.
- Remove redundant sequences.
- Predict strands and ORFs.
Likit Preeyanon
Some thoughts on bioinformatics
- Software is evolving very fast. Don't worry about using the latest, but keep an eye on possible artifacts/problems with what you do use.
- In NGS, online information (seqanswers, biostar, Twitter) generally runs far ahead of the publications.
Technology: where next?
Most slides taken from Lex Nederbragt:
http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond
High-throughput sequencing. Phase 1: more is better
- 2005, GS20: 200,000 reads, 100 bp, 0.02 Gb/run
- 2011, GS FLX+: 1.2 million reads, 750 bp, 0.7 Gb/run
- 2006, GA: 28 million reads, 25 bp, 0.7 Gb/run
- 2011, HiSeq 2000: 3 billion reads, 2x100 bp, 600 Gb/run
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
High-throughput sequencing. Phase 2: smaller is better
- GS Junior from Roche/454: 0.04 GB/run, 400 bp reads (vs. 0.7 GB/run, 700 bp reads full-size)
- MiSeq from Illumina: 4.5 GB/run, 2x150 bp reads (vs. 600 GB/run, 2x100 bp reads full-size)
- PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 GB/run, 100 or 200 bp reads
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
High-throughput sequencing. Why benchtop sequencing instruments?
- Diagnostics
- Affordable price per instrument
- Small projects
- Fast turnaround time
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Which instrument to choose? slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
High-throughput sequencing: real-time sequencing
[Figure: Pacific Biosciences' four-colour real-time sequencing method. Phospholinked hexaphosphate nucleotides (G, A, T, C) are detected as fluorescence pulses (intensity over time) within the limit of the detection zone. From Nature Reviews Genetics.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Need to combine Illumina + PacBio still.
[Figure: error-correction pipeline. Raw PacBio reads plus Illumina reads yield error-corrected reads; 93% of reads recovered; alignments of at least 1 kb to the published cod assembly; coverage figures of 2.7x and 23x; ran on 24 CPUs for 4.5 days with 100 Gb RAM.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
My perspective on tech:
- Illumina HiSeq + benchtop sequencers (MiSeq) are currently the most reliable choice for data generation: data in hand, decent quality.
- PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).
Two final pieces of advice
Should you work with genome centers? Maybe. Genome centers are good at large, well-funded projects. Their default pipelines are reliable but not always cutting edge. "Weird" problems (high heterozygosity, or complex repeats) may require more attention than they can give. They also have their own schedules and incentives.
Where should you go for contract sequencing? I get asked this a lot! My best recommendation is UC Davis. "Cheaper" is not always "better"; data quality can vary immensely.
Advertisement: next-gen sequence course
http://bioinformatics.msu.edu/ngs-summer-course-2013
June 10-June 20, Kellogg Biological Station; < $500. Hands-on exposure to data and analysis tools.
Acknowledgements
- I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab.
- Hans Cheng is the primary collaborator on the chick work.
- The USDA funded our technology development.
- Lex Nederbragt for his slides :)