Next-generation sequencing course, part 1: technologies

Published in: Education, Technology, Business

  1. [I0D51A] Bioinformatics: High-Throughput Analysis
     Next-generation sequencing. Part 1: Technologies
     Prof Jan Aerts, Faculty of Engineering - ESAT/SCD, jan.aerts@esat.kuleuven.be
     TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
  2. Announcements
     May 27th (9am-noon): evaluation, open book
  3. Note to self...
     Upload s_1_sequence.txt and s_2_sequence.txt to Galaxy first...
  4. Overview
     • linux refresher (6/5)
     • next-generation sequencing technologies and applications (6/5)
     • sequence mapping (13/5)
     • variant calling - SNPs (20/5)
     • variant calling - structural variation (20/5)
  5. Linux Refresher...
  6. Next-generation sequencing technologies
  7. General principle
  8. Big data...
  9. First vs second generation sequencing
     Sanger sequencing (1st gen) vs 2nd/next gen sequencing (Shendure & Ji, 2008)
  10. Paired-end sequencing (Korbel et al, 2007)
  11. General approaches
      • 2nd generation: clonally amplified single molecules
        • Roche 454 pyrosequencing
        • Illumina Genome Analyzer -> HiSeq: reversible terminator technology
        • ABI SOLiD: ligation-based extension
      • Next-next-generation/3rd generation: true single molecule
        • Helicos: HeliScope
        • Pacific Biosciences: SMRT
  12. Mardis, 2011
  13. Steps
      genome enrichment -> template preparation -> sequencing and imaging -> data analysis
  14. A. Genome enrichment
  15. Sequencing costs
  16. What?
      Only sequence relevant parts of the genome instead of whole genome, e.g.:
      • specific Mb-scale regions known to be involved in particular disease (e.g. based on GWAS)
      • specific candidate genes belonging to disease pathway
      • exome (= all exons)
      => how to isolate these from non-target sequence? “pulldown”
  17. Pulldown: on-array (Turner et al, 2009)
  18. Pulldown: in-solution (Turner et al, 2009)
  19. Performance metrics
      • fold-enrichment: ratio of abundance of target sequences post-enrichment vs pre-enrichment
      • capture specificity: fraction of sequence reads that map to target
      • uniformity: relative abundance of individual targets after enrichment
      • completeness: fraction of target bases detectably captured
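The four metrics on the performance-metrics slide can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names and input shapes (read counts, per-target mean depths, per-base depths) are assumptions, not part of the course material. Fold-enrichment is approximated here as the on-target read fraction divided by the fraction of the genome that is targeted, which matches the post/pre abundance ratio only if the pre-enrichment library covers the genome evenly.

```python
def capture_specificity(on_target_reads, total_mapped_reads):
    """Fraction of mapped reads that fall on a target region."""
    return on_target_reads / total_mapped_reads


def fold_enrichment(on_target_reads, total_mapped_reads, target_bp, genome_bp):
    """On-target fraction after enrichment relative to the expected
    on-target fraction before enrichment (assumed = target_bp / genome_bp)."""
    post = on_target_reads / total_mapped_reads
    pre = target_bp / genome_bp
    return post / pre


def relative_abundance(per_target_mean_depth):
    """Uniformity: each target's mean depth relative to the average target.
    Assumes at least one target has non-zero depth."""
    mean = sum(per_target_mean_depth) / len(per_target_mean_depth)
    return [depth / mean for depth in per_target_mean_depth]


def completeness(per_base_depth, min_depth=1):
    """Fraction of target bases covered by at least min_depth reads."""
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)
```

With hypothetical numbers, 600 on-target reads out of 1000 mapped reads on a 30 Mb target in a 3 Gb genome gives a specificity of 0.6 and a fold-enrichment of roughly 60.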
  20. B. Template preparation
  21. Problem: most imaging systems not designed to detect single fluorescent event
      => need amplified templates
      Aim: to produce a representative, non-biased source of nucleic acid material from the genome under investigation
      => population of identical templates
      Steps:
      1. shear DNA
      2. amplify templates
         Options: emulsion PCR (emPCR) or solid phase amplification
  22. Amplification by emulsion PCR
      emulsion = mixture of two or more immiscible (unblendable) liquids; e.g. mayonnaise, vinaigrette
      emPCR: thousands of microreactors/micro-eppendorfs
      one bead + one DNA molecule per microreactor => PCR to 1000s of copies
  23. Williams et al, 2006; Metzker et al, 2010
  24. Solid-phase amplification
      http://bit.ly/6JYIUz
      http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1
      Metzker et al, 2010
  25. C. Sequencing and imaging
  26. Sequencing and imaging
      Technologies:
      1. cyclic reversible termination
      2. sequencing by ligation
      3. pyrosequencing
      4. real-time sequencing
  27. Cyclic reversible termination
      DNA synthesis is terminated after adding single nucleotide
      start/stop/start/stop/start/stop/...
      Illumina: 4-colour
      Figure panels: sequencing steps, sequencing result (Metzker et al, 2010)
  28. Helicos: 1-colour
      Figure panels: sequencing steps, sequencing result (Metzker et al, 2010)
  29. Sequencing by ligation
      Figure panel: sequencing steps (http://bit.ly/fPh22X)
  30. Figure panel: sequencing result (http://bit.ly/fPh22X)
  31. Pyrosequencing (Metzker et al, 2010)
  32. Real-time sequencing
      “ZMW” zero-mode waveguide; DNA polymerase; “strobe sequencing”
  33. Platform     Run time   Gb/run
      Roche 454    8.5 hr     0.45
      Illumina     9 days     35
      SOLiD        14 days    50
      Helicos      8 days     37
      PacBio       ?          ?
  34. Accuracy - base calling error
      • base quality drops along read (“dephasing” within clusters)
      • base calling errors
      Sanger > SOLiD > Illumina > 454 > Helicos
  35. Accuracy - homopolymer runs
      Issue for Roche 454: 39% of errors are homopolymers
      A5 motifs: 3.3% error rate
      A8 motifs: 50% error rate
      Reason: use signal intensity as a measure for homopolymer length
  36. (no text)
  37. Ronaghi, Genome Res 11:3-11 (2001)
  38. http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
  39. Is it 4? Is it 5? Is it 4?
      http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
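The "Is it 4? Is it 5?" problem follows directly from calling homopolymer length out of signal intensity. The toy base caller below is a sketch under stated assumptions: signals are taken to be already normalised so that one incorporated base gives about 1.0, the cyclic flow order TACG is assumed, and the function name and example values are made up for illustration. It is not the vendor's base-calling algorithm.

```python
def call_flowgram(signals, flow_order="TACG"):
    """Turn normalised flow signals into a base sequence.

    Each flow adds one nucleotide species; the signal is rounded to the
    nearest integer and read as the length of the homopolymer run.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = max(0, int(signal + 0.5))  # round half up, never negative
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Short runs are well separated: 1.02 vs 2.10 is an easy call.
easy = call_flowgram([1.02, 0.05, 2.10, 0.98])

# Long runs are not: 4.45 and 4.55 differ by noise but give 4 vs 5 bases.
four_a = call_flowgram([0.0, 4.45])
five_a = call_flowgram([0.0, 4.55])
```

The absolute noise on the signal tends to grow with run length while the gap between adjacent integer lengths stays at 1.0, which is consistent with the much higher error rate quoted for A8 than for A5 motifs.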
  40. Consensus accuracy
      Increase accuracy for SNP calling by increasing coverage:
      Illumina: 20X; SOLiD: 12X; 454: 7.4X; Sanger: 3X
      Factors: raw accuracy + read length
      How deep do you have to sequence? => Poisson distribution: “If you sequence at average of 10X, how much of the genome will be covered at least 5X”?
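The Poisson question on the consensus-accuracy slide can be answered numerically. The sketch below assumes reads land uniformly at random, so that per-base depth follows a Poisson distribution with the average coverage as its mean; real libraries are usually more uneven, so this is an optimistic estimate. The function names are illustrative.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a base is covered exactly k times."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def fraction_covered_at_least(min_depth, mean_depth):
    """P(depth >= min_depth) = 1 - P(depth < min_depth)."""
    below = sum(poisson_pmf(k, mean_depth) for k in range(min_depth))
    return 1.0 - below


# The slide's example: average 10X, at least 5X.
fraction = fraction_covered_at_least(5, 10)
print(f"{fraction:.1%} of bases expected at >= 5X")
```

Under this idealised model, about 97% of the genome is covered at least 5X when the average depth is 10X.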
  41. Bentley et al, Nature 456:53-59 (2008)
  42. FASTQ file format
      Figure: example fasta entries (n=2); example fastq entries (n=2)
      FASTQ entry layout:
      • “@” + identifier
      • sequence
      • “+” + identifier (optional)
      • phred-based quality scores
      Figure: phred quality score encoding (Wikipedia)
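The four-line FASTQ layout and the phred encoding can be illustrated with a minimal reader. This is a sketch, not a robust parser: it assumes exactly four lines per record (no wrapped sequences), and the example record and the default offset of 33 (Sanger-style encoding) are assumptions. A phred score Q corresponds to an error probability of 10^(-Q/10).

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # "+" line, optionally repeating the identifier
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base with this phred score is wrong."""
    return 10 ** (-score / 10)


# Made-up record for illustration only.
example = "@read1\nGATTACA\n+\nIIIIII#\n"
for name, seq, qual in read_fastq(io.StringIO(example)):
    print(name, seq, phred_scores(qual))
```

Older Illumina pipelines used an offset of 64 instead of 33, so the same quality string decodes to different scores depending on which encoding is assumed.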
  43. Sequence quality control
      Is this good sequence? (essential!)
      E.g.: using FastQC tool (Babraham Institute, UK; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
  44. Sequence quality control: per base sequence quality (good vs bad)
  45. Sequence quality control: per sequence quality scores (good vs bad)
  46. Sequence quality control: per base sequence content (good vs bad)
  47. Sequence quality control: per base GC content (good vs bad)
  48. Sequence quality control: per sequence GC content (good vs bad)
  49. Sequence quality control: k-mer content (good vs bad)
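Two of the quality-control views above, per base sequence quality and per sequence GC content, reduce to simple summaries over the reads. The sketch below shows the idea only; it is not how FastQC itself is implemented, and the function names, the offset of 33 and the choice to ignore N bases are assumptions.

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Mean phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / called
```

A "good" run in the per-base view keeps a high mean quality across the read, while a "bad" run shows the mean dropping towards the 3' end; plotting the list returned by `per_base_mean_quality` gives a rough version of that picture.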
  50. Intermezzo: Galaxy
  51. Online genome analysis
      http://galaxy.psu.edu/
      “Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”
  52. (no text)
  53. (no text)
  54. Applications of next-generation sequencing
  55. Kahvejian et al, 2008
  56. DNA-seq, ChIP-seq, RNA-seq (Kahvejian et al, 2008)
  57. DNA-seq, ChIP-seq, RNA-seq: identify sequence variations, identify pathogens (Kahvejian et al, 2008)
  58. Exercises
  59. Try to log in to the server mentioned on Toledo with the username and password provided there.
      There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and s_2_sequence.txt (= paired ends)
      • How many sequences are in s_1_sequence.txt?
      • What encoding was used for the quality score? Illumina? Sanger?
      • What are the numerical quality scores for the first sequence in s_1_sequence.txt (i.e. 7172283/1)?
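One possible way to approach the first two questions is sketched below. It assumes four lines per FASTQ record, and the encoding guess is a common heuristic rather than a guarantee: characters with an ASCII code below 59 cannot occur in the older Illumina (offset 64) encodings, so seeing one points to Sanger-style offset 33. The function names are hypothetical.

```python
def count_records(lines):
    """Number of FASTQ records, assuming exactly 4 lines per record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


def quality_lines(lines):
    """Every 4th line (the quality string) of a 4-line-per-record FASTQ."""
    return [line.rstrip("\n") for i, line in enumerate(lines) if i % 4 == 3]


def guess_offset(quality_strings):
    """Heuristic: 33 if any character is below ASCII 59, otherwise 64."""
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    return 33 if lowest < 59 else 64
```

On the server this could be applied with `lines = open("s_1_sequence.txt").readlines()`, then `count_records(lines)` and `guess_offset(quality_lines(lines))`. A file containing only high-quality reads can look like offset 64 even when it is not, so the guess is worth checking against the full range of characters in the file.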
  60. • Create an account on the Galaxy server
      • Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload them into Galaxy. These files are also available on the linux server
      • Have a look at the contents of s_1_sequence.txt.
      • Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ Groomer”)
      • Draw the quality score boxplot for s_1_sequence.txt
      • Draw the nucleotide distribution chart for s_1_sequence.txt
  61. References
      Bentley DR et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59 (2008)
      Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could sequence everything? Nature Biotechnology 26: 1125-1133 (2008)
      Korbel JO et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318: 420-426 (2007)
      Mardis ER. A decade’s perspective on DNA sequencing technology. Nature 470: 198-203 (2011)
      Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics 11: 31-46 (2010)
      Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology 26: 1135-1145 (2008)
      Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics and Human Genetics 10 (2009)
