2013 pag-equine-workshop
Transcript

  • 1. Next-Gen Sequencing: 4 years in the trenches. C. Titus Brown, Asst Prof, CSE and Microbiology; BEACON NSF STC, Michigan State University. ctb@msu.edu
  • 2. These slides are available online: search for “titus brown slideshare”. You can also e-mail me: ctb@msu.edu. Also note that these are my opinions and observations, culled from personal experience, online material, and reading. I’m happy to cite/explain further upon request, but: Your Mileage May Vary.
  • 3. Things I won’t talk about. Don’t work on/with/have anything useful to say about: exome sequencing; ancient DNA; ChIP-seq (protein-DNA interactions). Work on but you’re probably not interested in: metagenomics (sequencing uncultured microbial communities); bioinformatics data structures and algorithms.
  • 4. Overview Shotgun sequencing basics Things everyone wants to know: how much $$... Various current problems & challenges Technology, now and future Some papers and projects worth looking at; & our own experiences
  • 5. Two specific concepts. First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.) Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in linear density (smaller wells, or better imaging) leads to a squared increase in the number of sequences. These two concepts underlie the recent stunning increases in sequencing capacity.
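The density argument on slide 5 can be made concrete with a small sketch. The square substrate, its side length, and the feature pitch below are illustrative assumptions, not figures from the talk.

```python
def features_on_square(side_um: float, pitch_um: float) -> int:
    """Count features on a square substrate given the center-to-center pitch."""
    per_side = int(side_um // pitch_um)   # features that fit along one edge
    return per_side * per_side            # 2-D layout: edge count squared


# Hypothetical 1 cm x 1 cm substrate (10,000 um per side).
base = features_on_square(10_000, 2.0)    # 2 um pitch
denser = features_on_square(10_000, 1.0)  # pitch halved, same substrate
print(denser / base)                      # ratio of sequences per run
```

Halving the pitch doubles the features along each edge, so the number of sequences per run grows fourfold.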
  • 6. What are current costs for Illumina? Approximate costs from the MSU sequencing center, a few months ago, including labor: RNAseq: $200 prep / sample. Single-ended 1x50: $1100/lane, 100-150 million reads. Paired-end 2x100: $2500/lane, 200-300 million reads (/ 2). Barcoding samples, etc., gets complicated. Discuss biology, etc. with a sequencing geek before going forward!
  • 7. What does this data really give you? With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100 million reads/sample is probably “enough” (see http://blog.fejes.ca/?p=607 for a good discussion). With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth. De novo assembly of complex vertebrate genomes is not casual: cheap short-read sequencing does not yet deliver good long-range contiguity; repeats and heterozygosity get in the way. The assembly & scaffolding process itself is still evolving.
  • 8. Why so much data? Why do we need 10-20x coverage (resequencing) or 50-100 million reads (mRNAseq) with Illumina? Two (linked) reasons: shotgun sequencing is random; counting/sampling variation.
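As a rough guide to what "10-20x coverage" means in reads, average coverage is simply total sequenced bases divided by genome size. The sketch below assumes a hypothetical 2.7 Gb vertebrate genome and 100 bp reads; both numbers are placeholders to be replaced with your own.

```python
import math


def mean_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size


def reads_needed(target_coverage: float, read_len: int, genome_size: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_len)


GENOME_SIZE = 2_700_000_000   # assumption: hypothetical 2.7 Gb genome
READ_LEN = 100                # assumption: 100 bp reads

n = reads_needed(20, READ_LEN, GENOME_SIZE)
print(n, mean_coverage(n, READ_LEN, GENOME_SIZE))
```

This only gives the average; the next two slides explain why the average has to be well above the minimum depth you actually need.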
  • 9. 1. Useful minimum coverage depends on high average coverage
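One way to see why slide 9 holds is an idealized model in which reads land uniformly at random, so the depth at any base is approximately Poisson-distributed with mean equal to the average coverage. This model is an assumption made here for illustration; real libraries are usually more uneven than Poisson, which makes the problem worse.

```python
import math


def frac_covered_at_least(mean_cov: float, k: int) -> float:
    """P(depth >= k) when depth ~ Poisson(mean_cov)."""
    below = sum(
        math.exp(-mean_cov) * mean_cov**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Fraction of bases expected to reach a minimum depth of 5
# at several average coverages.
for c in (5, 10, 20):
    print(c, frac_covered_at_least(c, 5))
```

Under this model, an average coverage equal to the desired minimum leaves a large share of bases below that minimum; the average has to be several times higher before almost every base clears it.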
  • 10. 2. mRNAseq quantitation: must overcome sampling variation
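The sampling-variation point can be sketched with a simple counting model: if read counts for a transcript are Poisson-distributed (an idealization assumed here), the coefficient of variation is one over the square root of the expected count. The transcript abundance of one read per million is a hypothetical example.

```python
import math


def expected_count(total_reads: int, reads_per_million: float) -> float:
    """Expected reads from a transcript at a given abundance."""
    return total_reads * reads_per_million / 1_000_000


def poisson_cv(expected: float) -> float:
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected)


# Hypothetical low-abundance transcript: 1 read per million sequenced.
for total in (5_000_000, 50_000_000):
    n = expected_count(total, 1)
    print(total, n, poisson_cv(n))
```

Going from 5 million to 50 million reads takes the expected count for this transcript from 5 to 50 and shrinks the relative counting noise by about a factor of three, which is why rare transcripts drive the read-depth requirement.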
  • 11. Coverage conclusions. More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample). Your desired coverage numbers should be driven by sensitivity considerations.
  • 12. Problems and challenges: systematic bias in sequencing and software; genome assembly (scaffolding and sensitivity); gene references; mRNAseq isoform construction.
  • 13. Resequencing: bias and error Calling SNPs by mapping -- U. Colorado http://genomics-course.jasondk.org/?p=395
  • 14. Both sequencing and bioinformatics yield many low-frequency artifacts! “Obvious” things like misalignments to paralogous/repeat sequences. Indels are handled badly by current tools (up to 60% false positive rate?!). Oxidation of DNA during the library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets. => With any data set, especially big ones, there will be both random and systematic error and bias. http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
  • 15. Suggestion: Cortex variant caller Iqbal et al., Nat Genet. 2012, pmid 22231483
  • 16. Genome assembly: scaffolding & sensitivity. Everyone wants two things from a genome assembly: (1) long/correct scaffolds (see http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon); (2) complete genome content.
  • 17. Sequence data. [Figure labels: reads; original DNA fragments; sequenced ends.] http://www.cbcb.umd.edu/research/assembly_primer.shtml slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 18. Contigs: building contigs. Aligned reads: ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG. Consensus contig: ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
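The aligned reads on slide 18 can be merged into the consensus contig with a very small overlap-based sketch. It assumes the reads are error-free, all on the same strand, and already in left-to-right order; real assemblers have to work out the order, handle sequencing errors, and deal with reverse complements, none of which is attempted here.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_in_order(reads: list[str]) -> str:
    """Extend a contig read by read, appending only the non-overlapping tail."""
    contig = reads[0]
    for read in reads[1:]:
        contig += read[overlap(contig, read):]
    return contig


# Reads copied from the slide, in the order shown.
reads = [
    "ACGCGATTCAGGTTACCACG", "GCGATTCAGGTTACCACGCG", "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC", "CAGGTTACCACGCGTAGCGC", "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA", "ACCACGCGTAGCGCATTACA", "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA", "CGTAGCGCATTACACAGATT", "TAGCGCATTACACAGATTAG",
]
contig = merge_in_order(reads)
print(contig)
```

Each 20 bp read overlaps its predecessor by 18 bp, so every merge step adds two new bases to the contig.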
  • 19. Scaffolds: ordered, oriented contigs. [Figure labels: mate pairs; contigs; gap size estimate; scaffold; contig; gap.] slides from http://slideshare.net/flxlex/ ; Lex Nederbragt http://dx.doi.org/10.6084/m9.figshare.100940
  • 20. Longer reads! Long reads can span repeats and heterozygous regions. [Figure labels: repeat copy 1; repeat copy 2; contig 1; polymorphic contig 2; polymorphic contig 3; contig 4.] slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 21. Cod: PacBio results Mapping to the published genome 11.4 kbp subread 10.6 kbp subread 10.9 kbp subread slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 22. Sensitivity: does your genome include everything? Generally not! For example, the chick genome is missing a substantial number of genes from microchromosomes: 723 genes from HSA19q are missing from chicken galGal4. ESTs and RNAseq transcripts exist for many or most.
  • 23. Approach: digital normalization (a computational version of library normalization). Digital normalization “smooths out” coverage from different loci, and can “recover” low-coverage regions for assembly.
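A minimal sketch of the idea behind digital normalization: stream through the reads, estimate each read's coverage as the median abundance of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. This toy version uses an exact in-memory counter and ignores reverse complements; those simplifications, the function names, and the default parameter values are assumptions for illustration, not the production implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k: int = 20, cutoff: int = 20) -> list[str]:
    """Keep a read only if its estimated coverage so far is below `cutoff`."""
    counts = Counter()   # exact k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:       # read shorter than k: nothing to estimate from
            continue
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept


# Toy data: one heavily sampled locus and one rare locus.
toy = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
print(digital_normalization(toy, k=4, cutoff=3))
```

In the toy run the over-sampled read is kept only until its k-mers reach the cutoff, while the single low-coverage read is retained, which is the "smoothing" behaviour described on the slide.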
  • 24. Applying diginorm to increase sensitivity. Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours. Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq. Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), and Jerry Dodgson (MSU), proposing to apply PacBio and normalization to improve the chick genome; should be a generalizable approach.
  • 25. Mapping => mRNAseq quantitation Reference transcriptome required.
  • 26. Existing chick gene models lack exons, isoforms. [Figure tracks: our data; models.] *This gene contains at least 4 isoforms. Likit Preeyanon
  • 27. (Exon detection is pretty good.) Likit Preeyanon
  • 28. Gene Modeler Pipeline (“gimme”?): (1) Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate. (2) Construct gene models. (3) Remove redundant sequences. (4) Predict strands and ORFs. Likit Preeyanon
  • 29. Some thoughts on bioinfo. Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use. In NGS, online information (seqanswers, biostar, Twitter) is generally far more up to date than publications.
  • 30. Technology: where next? Most slides taken from Lex Nederbragt: http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond
  • 31. High-throughput sequencing. Phase 1: more is better. 2005 GS20: 200 000 reads, 100 bp, 0.02 Gb/run. 2011 GS FLX+: 1.2 million reads, 750 bp, 0.7 Gb/run. 2006 GA: 28 million reads, 25 bp, 0.7 Gb/run. 2011 HiSeq 2000: 3 billion reads, 2x100 bp, 600 Gb/run. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 32. High-throughput sequencing. Phase 2: smaller is better. GS Junior from Roche/454: 0.04 Gb/run, 400 bp reads (versus 0.7 Gb/run, 700 bp reads for the full-size instrument). MiSeq from Illumina: 4.5 Gb/run, 2x150 bp reads (versus 600 Gb/run, 2x100 bp reads for the full-size instrument). PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 Gb/run, 100 or 200 bp reads. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 33. High-throughput sequencing. Why benchtop sequencing instruments? Diagnostics; affordable price per instrument; small projects; fast turnaround time. http://pennystockalerts.com/ http://www.highqualitylinkbuildingservice.com/ http://www.vetlearn.com/ http://vanillajava.blogspot.com slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 34. Which instrument to choose? slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 35. High-throughput sequencing. Phase 3: single-molecule. C2 (current) chemistry: average read length 2500 bp; 36 000 reads; 90 Mb per ‘run’. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 36. High-throughput sequencing: real-time sequencing technology. [Figure labels: phospholinked hexaphosphate nucleotides (G, A, T, C); limit of detection zone; fluorescence pulse; intensity vs. time.] Figure 4 | Real-time sequencing. Pacific Biosciences’ four-colour real-time sequencing method is shown. Nature Reviews | Genetics. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 37. Need to combine Illumina + PacBio still. P_errorCorrection pipeline from [name lost in extraction]. [Figure labels: 93% of reads recovered; 2.7x; alignments of at least 1 kb to cod published assembly; error-corrected reads; 23x; raw reads; 24 cpus; 4.5 days; 100 Gb RAM.] slides from http://slideshare.net/flxlex/ ; Lex
  • 38. My perspective on tech: Illumina HiSeq + benchtop sequencers (MiSeq) are currently the most reliable for data generation: data in hand, decent quality. PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).
  • 39. Two final pieces of advice. Should you work with genome centers? Maybe. Genome centers are good at large, well funded projects. Their default pipelines are reliable but not always cutting edge. “Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give. They also have their own schedules and incentives. Where should you go for contract sequencing? I get asked this a lot! My best recommendation is UC Davis. “Cheaper” is not always “better”; data quality can vary immensely.
  • 40. Advertisement: next-gen sequence course. http://bioinformatics.msu.edu/ngs-summer-course-2013 June 10-June 20, Kellogg Biological Station; < $500. Hands-on exposure to data, analysis tools.
  • 41. Acknowledgements. I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab. Hans Cheng is primary collaborator on chick work. USDA funded our technology development. Lex Nederbragt for his slides :)
