Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologies Presentation Transcript

  • 1. [I0D51A] Bioinformatics: High-Throughput Analysis
    Next-generation sequencing. Part 1: Technologies
    Prof Jan Aerts, Faculty of Engineering - ESAT/SCD, jan.aerts@esat.kuleuven.be
    TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
  • 2. Announcements
    May 27th (9am-noon): evaluation, open book
  • 3. Note to self...
    Upload s_1_sequence.txt and s_2_sequence.txt to Galaxy first...
  • 4. Overview
    • linux refresher (6/5)
    • next-generation sequencing technologies and applications (6/5)
    • sequence mapping (13/5)
    • variant calling - SNPs (20/5)
    • variant calling - structural variation (20/5)
  • 5. Linux Refresher...
  • 6. Next-generation sequencing technologies
  • 7. General principle
  • 8. Big data...
  • 9. First vs second generation sequencing
    Sanger sequencing (1st gen) vs 2nd/next gen sequencing (Shendure & Ji, 2008)
  • 10. Paired-end sequencing (Korbel et al, 2007)
  • 11. General approaches
    • 2nd generation: clonally amplified single molecules
        • Roche 454: pyrosequencing
        • Illumina Genome Analyzer -> HiSeq: reversible terminator technology
        • ABI SOLiD: ligation-based extension
    • Next-next-generation/3rd generation: true single molecule
        • Helicos: HeliScope
        • Pacific Biosciences: SMRT
  • 12. Mardis, 2011
  • 13. Steps
    • genome enrichment
    • template preparation
    • sequencing and imaging
    • data analysis
  • 14. A. Genome enrichment
  • 15. Sequencing costs
  • 16. What?
    Only sequence relevant parts of the genome instead of the whole genome, e.g.:
    • specific Mb-scale regions known to be involved in a particular disease (e.g. based on GWAS)
    • specific candidate genes belonging to a disease pathway
    • exome (= all exons)
    => how to isolate these from non-target sequence? “pulldown”
  • 17. Pulldown: on-array (Turner et al, 2009)
  • 18. Pulldown: in-solution (Turner et al, 2009)
  • 19. Performance metrics
    • fold-enrichment: ratio of abundance of target sequences post-enrichment vs pre-enrichment
    • capture specificity: fraction of sequence reads that map to target
    • uniformity: relative abundance of individual targets after enrichment
    • completeness: fraction of target bases detectably captured
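The four metrics on slide 19 can be sketched as small functions. This is only an illustration under stated assumptions: the function names, the use of read fractions as the measure of "abundance", the mean-normalised depth as the measure of "relative abundance", and the `min_depth` threshold for "detectably captured" are choices made here, not definitions from the slides. The example numbers are made up.

```python
def fold_enrichment(target_fraction_post, target_fraction_pre):
    """Ratio of target abundance after vs before enrichment.

    Assumption: abundance is expressed as the fraction of sequence
    that is on target (e.g. 0.01 if the target is 1% of the genome).
    """
    return target_fraction_post / target_fraction_pre


def capture_specificity(on_target_reads, total_mapped_reads):
    """Fraction of mapped reads that fall on the target."""
    return on_target_reads / total_mapped_reads


def relative_abundance(target_depths):
    """Depth of each target divided by the mean depth over all targets.

    Values near 1.0 for every target indicate uniform capture.
    """
    mean_depth = sum(target_depths) / len(target_depths)
    return [depth / mean_depth for depth in target_depths]


def completeness(per_base_depth, min_depth=1):
    """Fraction of target bases covered by at least min_depth reads."""
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


if __name__ == "__main__":
    # Hypothetical numbers: target is 1% of the genome, 60% of reads on target.
    print(fold_enrichment(0.6, 0.01))
    print(capture_specificity(600, 1000))
    print(relative_abundance([10, 20, 30]))
    print(completeness([0, 3, 5, 0]))
```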
  • 20. B. Template preparation
  • 21. Problem: most imaging systems are not designed to detect a single fluorescent event
    => need amplified templates
    Aim: to produce a representative, non-biased source of nucleic acid material from the genome under investigation
    => population of identical templates
    Steps:
    1. shear DNA
    2. amplify templates
    Options: emulsion PCR (emPCR) or solid-phase amplification
  • 22. Amplification by emulsion PCR
    emulsion = mixture of two or more immiscible (unblendable) liquids; e.g. mayonnaise, vinaigrette
    emPCR: thousands of microreactors/micro-eppendorfs
    one bead + one DNA molecule per microreactor => PCR to 1000s of copies
  • 23. Williams et al, 2006; Metzker, 2010
  • 24. Solid-phase amplification
    http://bit.ly/6JYIUz
    http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1
    Metzker, 2010
  • 25. C. Sequencing and imaging
  • 26. Sequencing and imaging
    Technologies:
    1. cyclic reversible termination
    2. sequencing by ligation
    3. pyrosequencing
    4. real-time sequencing
  • 27. Cyclic reversible termination
    DNA synthesis is terminated after adding a single nucleotide: start/stop/start/stop/start/stop/...
    Illumina: 4-colour
    [Figure panels: sequencing steps; sequencing result. Metzker, 2010]
  • 28. Helicos: 1-colour
    [Figure panels: sequencing steps; sequencing result. Metzker, 2010]
  • 29. Sequencing by ligation
    [Figure: sequencing steps. http://bit.ly/fPh22X]
  • 30. [Figure: sequencing result. http://bit.ly/fPh22X]
  • 31. Pyrosequencing (Metzker, 2010)
  • 32. Real-time sequencing
    “ZMW” (zero-mode waveguide); DNA polymerase; “strobe sequencing”
  • 33. Platform     Run time   Gb/run
        Roche 454    8.5 hr     0.45
        Illumina     9 days     35
        SOLiD        14 days    50
        Helicos      8 days     37
        PacBio       ?          ?
  • 34. Accuracy - base calling error
    Sanger > SOLiD > Illumina > 454 > Helicos
    • base quality drops along read (“dephasing” within clusters)
    • base calling errors
  • 35. Accuracy - homopolymer runs
    Issue for Roche 454: 39% of errors are homopolymers
    • A5 motifs: 3.3% error rate
    • A8 motifs: 50% error rate
    Reason: signal intensity is used as a measure of homopolymer length
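To see why using signal intensity as a measure of homopolymer length is error-prone, here is a toy base caller for a flowgram. It is a minimal sketch, not the 454 algorithm: it assumes signals are already normalised so that 1.0 corresponds to one incorporated base, that nucleotides are flowed in a repeating "TACG" order, and that the run length is simply the signal rounded to the nearest integer. The function name and example signals are hypothetical.

```python
def call_flowgram(flow_order, signals):
    """Turn normalised flow signals into a base sequence.

    Assumption: a signal of about n means a homopolymer run of n bases
    of the nucleotide flowed at that step.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest integer run length (never negative).
        run_length = max(0, int(signal + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)


if __name__ == "__main__":
    # The last flow has a signal close to 4.5: a tiny amount of noise
    # flips the call between a run of 4 and a run of 5.
    print(call_flowgram("TACG", [1.02, 0.05, 2.10, 0.0, 0.0, 4.48]))
    print(call_flowgram("TACG", [1.02, 0.05, 2.10, 0.0, 0.0, 4.52]))
```

For short runs the gap between neighbouring integer signals is large relative to the noise, but the same absolute noise on a long run is enough to change the call, which fits the rising error rate from A5 to A8 motifs on the slide.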
  • 37. Ronaghi, Genome Res 11:3-11 (2001)
  • 38. http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
  • 39. Is it 4? Is it 5? Is it 4?
    http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
  • 40. Consensus accuracy
    Increase accuracy for SNP calling by increasing coverage:
    • Illumina: 20X
    • SOLiD: 12X
    • 454: 7.4X
    • Sanger: 3X
    Factors: raw accuracy + read length
    How deep do you have to sequence? => Poisson distribution: “If you sequence at an average of 10X, how much of the genome will be covered at least 5X?”
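The Poisson question on slide 40 can be answered numerically. The sketch below assumes reads land uniformly at random, so that per-base depth follows a Poisson distribution with the average coverage as its mean; real data is usually more dispersed than this. The function name is made up for illustration.

```python
import math


def fraction_covered_at_least(mean_coverage, min_depth):
    """P(depth >= min_depth) when depth ~ Poisson(mean_coverage)."""
    # Sum the probabilities of depths 0 .. min_depth - 1, then take the complement.
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Average 10X, at least 5X: roughly 97% of the genome under this model.
    print(round(fraction_covered_at_least(10, 5), 4))
```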
  • 41. Bentley et al, Nature 456:53-59 (2008)
  • 42. FASTQ file format
    [Figure: example FASTA entries (n=2); example FASTQ entries (n=2); phred quality score encoding. Source: Wikipedia]
    FASTQ entry:
    • “@” + identifier
    • sequence
    • “+” + identifier (optional)
    • phred-based quality scores
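The four-line FASTQ layout on slide 42 can be read with a few lines of code. This is a minimal sketch that assumes each record is exactly four lines (no wrapped sequence or quality lines) and that the caller knows the quality offset: 33 for Sanger-style encoding, 64 for the Illumina 1.3+ style encoding. The function names and the two example reads are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from a 4-line FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to numeric phred scores."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!5?I\n")
    for identifier, sequence, quality in read_fastq(example):
        print(identifier, sequence, phred_scores(quality))
```

A phred score Q corresponds to an error probability of 10^(-Q/10), so Q=20 means a 1 in 100 chance that the base call is wrong.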
  • 43. Sequence quality control
    Is this good sequence? (essential!)
    E.g.: using the FastQC tool (Babraham Institute, UK; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
  • 44. Sequence quality control: per base sequence quality (good vs bad)
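The idea behind the per base sequence quality plot can be sketched as follows. This is a simplification and not FastQC's code: it computes only the mean phred score at each read position, whereas the real plot shows a box plot per position. It assumes phred+33 quality strings by default; the function name and example strings are hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean phred score at each read position across all reads."""
    totals = []
    counts = []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):
                # First read that reaches this position.
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Two toy reads of different length; quality drops towards the 3' end.
    print(mean_quality_per_position(["II5", "I5"]))
```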
  • 45. Sequence quality control: per sequence quality scores (good vs bad)
  • 46. Sequence quality control: per base sequence content (good vs bad)
  • 47. Sequence quality control: per base GC content (good vs bad)
  • 48. Sequence quality control: per sequence GC content (good vs bad)
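The per sequence GC content check boils down to a histogram of GC percentages over all reads. The sketch below is an illustration only: ignoring uncalled bases (N) and binning to whole percentages are assumptions made here, and the function names are invented.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read consists only of uncalled bases
    return (sequence.count("G") + sequence.count("C")) / called


def gc_percent_histogram(sequences):
    """Count reads per GC percentage, rounded to whole percent."""
    return Counter(round(100 * gc_fraction(seq)) for seq in sequences)


if __name__ == "__main__":
    reads = ["ATGC", "GGCC", "atgc", "ATGN"]
    print(gc_percent_histogram(reads))
```

In a good run the histogram is expected to be a single smooth peak around the genome's GC content; extra peaks can hint at contamination or adapter sequence.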
  • 49. Sequence quality control: k-mer content (good vs bad)
  • 50. Intermezzo: Galaxy
  • 51. Online genome analysis
    http://galaxy.psu.edu/
    “Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”
  • 54. Applications of next-generation sequencing
  • 55. Kahvejian et al, 2008
  • 56. DNA-seq, ChIP-seq, RNA-seq (Kahvejian et al, 2008)
  • 57. DNA-seq, ChIP-seq, RNA-seq: identify sequence variations; identify pathogens (Kahvejian et al, 2008)
  • 58. Exercises
  • 59. Try to log in to the server mentioned on Toledo with the username and password provided there.
    There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and s_2_sequence.txt (= paired ends)
    • How many sequences are in s_1_sequence.txt?
    • What encoding was used for the quality score? Illumina? Sanger?
    • What are the numerical quality scores for the first sequence in s_1_sequence.txt (i.e. 7172283/1)?
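One possible way to approach the first two questions, sketched in Python rather than with linux commands. The assumptions are stated loudly: records are exactly four lines each, and the encoding guess is only a heuristic based on the lowest quality character seen (characters below ";" can only occur with a phred+33 offset, otherwise an offset of 64 is assumed). The function names are made up.

```python
def guess_quality_offset(quality_strings):
    """Heuristic: return 33 (Sanger) or 64 (Illumina/Solexa style)."""
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    # ';' (ASCII 59) is the lowest character possible with a +64 offset.
    return 33 if lowest < 59 else 64


def summarise_fastq(lines):
    """Return (number of sequences, guessed quality offset)."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    quality_lines = lines[3::4]  # every 4th line, starting at the 4th
    return len(lines) // 4, guess_quality_offset(quality_lines)


if __name__ == "__main__":
    toy = ["@r1\n", "ACGT\n", "+\n", "hhhf\n", "@r2\n", "GGCA\n", "+\n", "ffbb\n"]
    print(summarise_fastq(toy))
```

On the server this could be applied to the real file with `with open("s_1_sequence.txt") as handle: print(summarise_fastq(handle))`. A file with only high-quality Sanger-encoded reads would be misclassified by this heuristic, so it is worth checking the answer against the FASTQ encoding table.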
  • 60. Exercises (continued)
    • Create an account on the Galaxy server
    • Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload them into Galaxy. These files are also available on the linux server
    • Have a look at the contents of s_1_sequence.txt
    • Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ Groomer”)
    • Draw the quality score boxplot for s_1_sequence.txt
    • Draw the nucleotide distribution chart for s_1_sequence.txt
  • 61. References
    • Bentley DR et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59 (2008)
    • Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could sequence everything? Nature Biotechnology 26: 1125-1133 (2008)
    • Korbel JO et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318: 420-426 (2007)
    • Mardis ER. A decade’s perspective on DNA sequencing technology. Nature 470: 198-203 (2011)
    • Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics 11: 31-46 (2010)
    • Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology 26: 1135-1145 (2008)
    • Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics and Human Genetics 10 (2009)