Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

4,167 views

Published on

Introduction to de novo genome assembly (2016 edition)

Published in: Science
  • Be the first to comment

De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

  1. 1. De novo Genome Assembly A/Prof Torsten Seemann Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016
  2. 2. Introduction
  3. 3. The human genome has 47 pieces MT (or XY)
  4. 4. The shortest piece is 48,000,000 bp Total haploid size 3,200,000,000 bp (3.2 Gbp) Total diploid size 6,400,000,000 bp (6.4 Gbp) ∑ = 6,400,000,000 bp
  5. 5. Human DNA iSequencer ™ 46 chromosomal and1 mitochondrial sequences In an ideal world ... AGTCTAGGATTCGCTATAG ATTCAGGCTCTGATATATT TCGCGGCATTAGCTAGAGA TCTCGAGATTCGTCCCAGT CTAGGATTCGCTAT AAGTCTAAGATTC...
  6. 6. The real world ( for now ) Short fragments Reads are stored in “FASTQ” files Genome Sequencing
  7. 7. Assemble Compare
  8. 8. Genome assembly (the red pill)
  9. 9. De novo genome assembly Ideally, one sequence per replicon. Millions of short sequences (reads) A few long sequences (contigs) Reconstruct the original genome from the sequence reads only
  10. 10. De novo genome assembly “From scratch”
  11. 11. An example
  12. 12. A small “genome” Friends, Romans, countrymen, lend me your ears; I’ll return them tomorrow!
  13. 13. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me Shakespearomics Whoops! I dropped them.
  14. 14. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; Shakespearomics I am good with words.
  15. 15. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; • Majority consensus Friends, Romans, countrymen, lend me your ears; (1 contig) Shakespearomics We have reached a consensus !
  16. 16. Overlap - Layout - Consensus Amplified DNA Shear DNA Sequenced reads Overlaps Layout Consensus ↠ “Contigs”
  17. 17. Assembly graphs
  18. 18. Overlap graph
  19. 19. Another example is this
  20. 20. Overlaps find
  21. 21. Do the graph traverse Size matters not. Look at me. Judge me by my size, do you? Size matters not. Look at me. Judge me by my soze, do you? 2 supporting reads 1 supporting reads
  22. 22. So far, so good.
  23. 23. Why is it so hard?
  24. 24. What makes a jigsaw puzzle hard? Repetitive regions Lots of pieces Missing pieces Multiple copies No corners (circular genomes) No box Dirty pieces Frayed pieces : .,
  25. 25. What makes genome assembly hard Size of the human genome = 3.2 x 109 bp (3,200,000,000) Typical short read length = 102 bp (100) A puzzle with millions to billions of pieces Storing in RAM is a challenge ⇢ “succinct data structure” 1. Many pieces (read length is very short compared to the genome)
  26. 26. What makes genome assembly hard Finding overlaps means examining every pair of reads Comparisons = N×(N-1)/2 ~ N2 Lots of smart tricks to reduce this close to ~N 2. Lots of overlaps
  27. 27. What makes genome assembly hard ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA 3. Lots of sky (short repeats) How could we possibly assemble this segment of DNA? All the reads are the same!
  28. 28. What makes genome assembly hard Read 1: GGAACCTTTGGCCCTGT Read 2: GGCGCTGTCCATTTTAGAAACC 4. Dirty pieces (sequencing errors) 1. The error in Read 2 might prevent us from seeing that it overlaps with Read 1 2. How would we know which is the correct sequence?
  29. 29. What makes genome assembly hard 5. Multiple copies (long repeats) Our old nemesis the REPEAT ! Gene Copy 1 Gene Copy 2
  30. 30. Repeats
  31. 31. What is a repeat? A segment of DNA that occurs more than once in the genome
  32. 32. Major classes of repeats in the human genome Repeat Class Arrangement Coverage (Hg) Length (bp) Satellite (micro, mini) Tandem 3% 2-100 SINE Interspersed 15% 100-300 Transposable elements Interspersed 12% 200-5k LINE Interspersed 21% 500-8k rDNA Tandem 0.01% 2k-43k Segmental Duplications Tandem or Interspersed 0.2% 1k-100k
  33. 33. Repeats Repeat copy 1 Repeat copy 2 Collapsed repeat consensus 1 locus 4 contigs
  34. 34. Repeats are hubs in the graph
  35. 35. Long reads can span repeats Repeat copy 1 Repeat copy 2 long reads 1 contig
  36. 36. Draft vs Finished genomes (bacteria) 250 bp - Illumina - $250 12,000 bp - Pacbio - $2500
  37. 37. The two laws of repeats 1. It is impossible to resolve repeats of length L unless you have reads longer than L. 2. It is impossible to resolve repeats of length L unless you have reads longer than L.
  38. 38. How much data do we need?
  39. 39. Coverage vs Depth Coverage Fraction of the genome sequenced by at least one read Depth Average number of reads that cover any given region Intuitively more reads should increase coverage and depth
  40. 40. Example: Read length = ⅛ Genome Length Coverage = ⅜ Depth = ⅜ Genome Reads
  41. 41. Example: Read length = ⅛ Genome Length Coverage = 4.5/8 Depth = 5/8 Genome Reads Newly Covered Regions = 1.5 Reads
  42. 42. Example: Read length = ⅛ Genome Length Coverage = 5.2/8 Depth = 7/8 Genome Reads Newly Covered Regions = 0.7 Reads
  43. 43. Example: Read length = ⅛ Genome Length Coverage = 6.7/8 Depth = 10/8 Genome Reads Newly Covered Regions = 1.5 Reads
  44. 44. Depth is easy to calculate Depth = N × L / G Depth = Y / G N = Number of reads L = Length of a read G = Genome length Y = Sequence yield (N x L)
  45. 45. Example: Coverage of the Tarantula genome “The size estimate of the tarantula genome based on k-mer analysis is 6 Gb and we sequenced at 40x coverage from a single female A. geniculata” Sangaard et al 2014. Nature Comm (5) p1-11 Depth = N × L / G 40 = N x 100 / (6 Billion) N = 40 * 6 Billion / 100 N = 2.4 Billion
  46. 46. Coverage and depth are related Approximate formula for coverage assuming random reads Coverage = 1 - e-Depth
  47. 47. Much more sequencing needed in reality ● Sequencing is not random ○ GC and AT rich regions are under represented ○ Other chemistry quirks ● More depth needed for: ○ sequencing errors ○ polymorphisms
  48. 48. Assessing assemblies
  49. 49. Contiguity Completeness Correctness
  50. 50. Contiguity ● Desire ○ Fewer contigs ○ Longer contigs ● Metrics ○ Number of contigs ○ Average contig length ○ Median contig length ○ Maximum contig length ○ “N50”, “NG50”, “D50”
  51. 51. Contiguity: the N50 statistic
  52. 52. Completeness : Total size Proportion of the original genome represented by the assembly Can be between 0 and 1 Proportion of estimated genome size … but estimates are not perfect
  53. 53. Completeness: core genes Proportion of coding sequences can be estimated based on known core genes thought to be present in a wide variety of organisms. Assumes that the proportion of assembled genes is equal to the proportion of assembled core genes. Number of Core Genes in Assembly Number of Core Genes in Database In the past this was done with a tool called CEGMA There is a new tool for this called BUSCO
  54. 54. Correctness Proportion of the assembly that is free from errors Errors include 1. Mis-joins 2. Repeat compressions 3. Unnecessary duplications 4. Indels / SNPs caused by assembler
  55. 55. Correctness: check for self consistency ● Align all the reads back to the contigs ● Look for inconsistencies Original Reads Align Assembly Mapped Read
  56. 56. Assemble ALL the things
  57. 57. Why assemble genomes ● Produce a reference for new species ● Genomic variation can be studied by comparing against the reference ● A wide variety of molecular biological tools become available or more effective ● Coding sequences can be studied in the context of their related non-coding (eg regulatory) sequences ● High level genome structure (number, arrangement of genes and repeats) can be studied
  58. 58. Vast majority of genomes remain unsequenced Estimated number of species ~ 9 Million Whole genome sequences (including incomplete drafts) ~ 3000 Estimated number of species Millions to Billions Genomic sequences ~ 150,000 Eukaryotes Bacteria & Archaea Every individual has a different genome Some diseases such as cancer give rise to massive genome level variation
  59. 59. Not just genomes ● Transcriptomes ○ One contig for every isoform ○ Do not expect uniform coverage ● Metagenomes ○ Mixture of different organisms ○ Host, bacteria, virus, fungi all at once ○ All different depths ● Metatranscriptomes ○ Combination of above!
  60. 60. Conclusions
  61. 61. Take home points ● De novo assembly is the process of reconstructing a long sequence from many short ones ● Represented as a mathematical “overlap graph” ● Assembly is very challenging (“impossible”) because ○ sequencing bias under represents certain regions ○ Reads are short relative to genome size ○ Repeats create tangled hubs in the assembly graph ○ Sequencing errors cause detours and bubbles in the assembly graph
  62. 62. Contact tseemann.github.io torsten.seemann@gmail.com @torstenseemann
  63. 63. The End

×