Genome Assembly Forensics

Automated assemblies are one thing, good assemblies are another!

This presentation covers the basic concepts of using paired-end and mate-pair read data to identify mis-assemblies. It also covers some of the tools for visualising and correcting mis-assemblies. It attempts to rate these tools on their feature set and their scalability beyond small (<15 Mbase) genomes, and closes with some remarks on the features an ideal genome assembly editing tool should have.

Published in: Education

  1. 1. Genome Assembly Forensics and Visualisation
      Nathan S. Watson-Haigh, Fri 11th May 2012, ACPFG Journal Club
      • Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34.
      • Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55.
      • Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.
  2. 2. Overview
      • Genome Assembly
      • N50/N90/N95
      • Paired-end and Matepair Reads
      • Mis-assembly Signatures
      • Assembly Validation and Manual Editing
  3. 3. Genome Assembly – Shotgun Reads (figure labels: DNA being sequenced; aligned shotgun reads)
  4. 4. Genome Assembly – Repeats
  5. 5. Genome Assembly – Repeats
  6. 6. Genome Assembly – Repeats (figure labels: reads from different repeats; double coverage; can’t be resolved)
  7. 7. Genome Assembly – Repeats
  8. 8. Genome Assembly – Diploid
  9. 9. Assembly Metrics – N50
      • The N50 is the most widely reported metric for de novo assemblies
      • It is a single-number summary of the contig length distribution of an assembly
          – If contigs are sorted in descending order of length, the N50 is the length of the contig at which the running total first reaches at least 50% of the total length of all the contigs
          – Commonly reported with the N90 and N95
  10. 10. Assembly Metrics – N50 (figure: contig lengths summed to illustrate the N50, N90 and N95)
  11. 11. Assembly Metrics – N50
      • The N50 is the most widely reported metric for de novo assemblies
      • It is a single-number summary of the contig length distribution of an assembly
          – If contigs are sorted in descending order of length, the N50 is the length of the contig at which the running total first reaches at least 50% of the total length of all the contigs
          – Commonly reported with the N90 and N95
      • These stats DO NOT imply anything about assembly quality
          – Could simply concatenate contigs together to get a better N50!!
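
The N50 definition above, and the warning that it says nothing about quality, can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function name `nx` and the toy contig lengths are assumptions made for the example.

```python
def nx(contig_lengths, percent):
    """Return the Nx value (percent=50 gives the N50) for a list of contig lengths.

    Hypothetical helper for illustration; lengths are assumed to be positive integers.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths, total 300
n50, n90, n95 = nx(contigs, 50), nx(contigs, 90), nx(contigs, 95)

# Gluing every contig into one sequence "improves" the N50
# without improving the assembly at all.
concatenated = [sum(contigs)]
n50_concatenated = nx(concatenated, 50)
```

With these toy lengths the N50 is 70, while the single concatenated sequence reports an N50 of 300, which is why the metric alone cannot be read as a quality score.
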
  12. 12. Paired-end Reads
  13. 13. Matepair Reads
  14. 14. Paired-end and Matepair Reads (figure labels: Paired-end; Matepair; reverse complement)
  15. 15. So, Why are Pairs so Useful?
  16. 16. So, Why are Pairs so Useful?
  17. 17. Pairs are Useful – Orientation and Separation
  18. 18. Pairs are Useful – Orientation and Separation
  19. 19. Pairs are Useful – Orientation and Separation
  20. 20. Pairs are Useful – Orientation and Separation
  21. 21. Pairs are Useful – Orientation and Separation (figure labels: Incorrect orientation; Incorrect distance)
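
One way to turn the two signatures on this slide into a check is sketched below. It assumes a paired-end library whose mates are expected to face each other (forward read on the left, reverse read on the right); a mate-pair library with outward-facing reads would need the orientation test flipped. The `PlacedRead` type, the `classify_pair` function and the three-standard-deviation tolerance are hypothetical choices for illustration, not taken from any of the tools discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedRead:
    """A read placed on a contig (hypothetical type for this sketch)."""
    start: int      # leftmost contig coordinate, 0-based
    end: int        # exclusive end coordinate
    forward: bool   # True if the read aligns to the forward strand


def classify_pair(a, b, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair placed on the same contig.

    Assumes inward-facing mates and a library with the given mean and
    standard deviation of insert size; n_sd is an arbitrary tolerance.
    """
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Expected layout: left read forward, right read reverse.
    if not (left.forward and not right.forward):
        return "incorrect orientation"
    # Insert size is measured from the outer ends of the two reads.
    span = right.end - left.start
    if abs(span - mean_insert) > n_sd * sd_insert:
        return "incorrect distance"
    return "ok"


good = classify_pair(PlacedRead(100, 200, True), PlacedRead(500, 600, False), 500, 50)
same_strand = classify_pair(PlacedRead(100, 200, True), PlacedRead(500, 600, True), 500, 50)
too_far = classify_pair(PlacedRead(100, 200, True), PlacedRead(2000, 2100, False), 500, 50)
```

A pair flagged on its own means little; it is clusters of such pairs at the same location that point to a mis-assembly.
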
  22. 22. Mis-assembly Signatures – Collapsed Tandem Repeat (figure labels: Correct alignment; Incorrect alignment)
  23. 23. Mis-assembly Signatures – Collapsed Tandem Repeat (figure labels: Correct assembly; Mis-assembly)
  24. 24. Mis-assembly Signatures – Collapsed (small) Tandem Repeat (figure labels: Correct assembly; Mis-assembly)
  25. 25. Mis-assembly Signatures – Collapsed Repeat (figure labels: Correct assembly; Mis-assembly)
  26. 26. Mis-assembly Signatures – Rearrangement (figure labels: Correct assembly; Mis-assembly)
  27. 27. Automated Assemblies Are One Thing, Good Assemblies Are Another
      • Given the computer resources you can generate an automated assembly in a few weeks
          – Not necessarily good
          – Need to optimise assembly parameters
      • For small genomes (< ~15 Mbases)
          – Commodity hardware
          – OLC assemblers
      • For larger genomes
          – More RAM (10s-100s of Gbytes) for OLC assemblers
          – De Bruijn graph assemblers
          – Read mapping step to generate contig read alignments
  28. 28. Automated Assemblies Are One Thing, Good Assemblies Are Another
      • Automated assemblies need to be checked for mis-assemblies
          – Need paired-end/matepair reads
          – Need viewers to visualise paired-end data
          – Need editors to break/join/reassemble parts of the assembly deemed to be inconsistent with read pair info
          – Need enough computer hardware to allow all this data to be loaded, especially with large volumes of Illumina paired-end data
  29. 29. Automated Assemblies Are One Thing, Good Assemblies Are Another
      • Very time consuming and laborious to check/edit
          – Small assemblies (< ~15 Mbases)
              • Several weeks to a few months, moving through 1 scaffold/contig at a time
          – Large assemblies need a team to do the same thing
              • Need enough RAM to load all the paired-end data
              • Need ways to identify regions requiring closer inspection
              • Identify possible mis-assemblies
      • Major hurdles
          – Software inadequacies
          – Time
          – File formats! Grrrr!
  30. 30. Software Inadequacies

      | Software | Contig View | Scaffold View | Editing | Reassemble | Clipping Info | Other |
      |---|---|---|---|---|---|---|
      | SeqMan Pro | 9 | 9 | 6 | 6 | 6 | $$, buggy, not for large assemblies (32bit), 1 template size |
      | Gap5 | 6 | NA | 9 | NA | 8 | Free, join editor, contig comparator, poor visual support for many contigs, shuffle pads, ACE, multiple template sizes |
      | Consed | 6 | 6 | 5 | 9 | 6 | Free/US$2500/US$10k, poor visual support for many contigs, multiple template sizes |
      | Hawkeye | 9 | 9 | NA | NA | 7 | Leverages AMOS, automated detection of mis-assemblies, large assemblies, modular |

  31. 31. SeqMan Pro – Strategy View
  32. 32. SeqMan Pro
  33. 33. Software Inadequacies (same comparison table as slide 30)
  34. 34. Gap5 – Template View
  35. 35. Gap5 – Contig Comparator
  36. 36. Gap5 – Join Editor
  37. 37. Gap5 – Contig Editor
  38. 38. Software Inadequacies (same comparison table as slide 30)
  39. 39. Consed – Assembly View
  40. 40. Consed – Contig Viewer/Editor
  41. 41. Software Inadequacies (same comparison table as slide 30)
  42. 42. Scaffold/Contig Length Distribution
  43. 43. Library Stats
  44. 44. Compression-Expansion (CE) Statistic
      • A measure of the deviation of the local distribution of insert sizes from the global distribution of insert sizes
          – 0 indicates no deviation
          – ≤ -3 indicates strong compression
          – ≥ +3 indicates strong expansion
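
A minimal sketch of the CE statistic, assuming the commonly used z-score form: the difference between the local and global mean insert size, divided by the standard error of the mean for the number of inserts spanning the position. The function name and the example numbers are illustrative, and this is not the AMOS implementation.

```python
import math


def ce_statistic(local_inserts, global_mean, global_sd):
    """Z-score of the local mean insert size against the library-wide distribution.

    local_inserts: sizes of the inserts spanning one position in the assembly.
    global_mean, global_sd: mean and standard deviation for the whole library.
    Assumed form: (local_mean - global_mean) / (global_sd / sqrt(n)).
    """
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this position")
    if global_sd <= 0:
        raise ValueError("global_sd must be positive")
    local_mean = sum(local_inserts) / n
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))


# Toy 3 kb library with a standard deviation of 300 bp.
unchanged = ce_statistic([3000] * 16, 3000, 300)   # local mean matches the library
compressed = ce_statistic([2700] * 16, 3000, 300)  # inserts look too short here
expanded = ce_statistic([3300] * 9, 3000, 300)     # inserts look too long here
```

Because the denominator shrinks with the number of spanning inserts, a modest shift in mean insert size becomes significant once enough pairs agree on it.
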
  45. 45. (figure panels: Insert Coverage; Read Coverage)
  46. 46. (figure panels: 500 bp inserts; 3 kb inserts; 20 kb inserts)
  47. 47. AMOSvalidate
      • An assembly analysis pipeline to identify possible mis-assemblies
          – Paired-end data
              • CE stats
              • Incorrect orientation
              • Missing mate
          – Coverage
          – SNP density
          – Singletons
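
The coverage check in this list can be illustrated with a simple depth profile. The sketch below is not AMOSvalidate's algorithm: it assumes reads are given as hypothetical `(start, end)` intervals (0-based, end exclusive) on a single contig, and it flags runs of positions whose depth exceeds an arbitrary multiple of the median, the kind of excess expected over a collapsed repeat.

```python
from statistics import median


def depth_profile(contig_length, reads):
    """Per-base read depth from (start, end) intervals, end exclusive."""
    diff = [0] * (contig_length + 1)
    for start, end in reads:
        diff[start] += 1   # depth rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for change in diff[:-1]:
        running += change
        depth.append(running)
    return depth


def high_coverage_regions(depth, factor=1.8):
    """Return (start, end) runs where depth exceeds factor * median depth.

    The factor of 1.8 is an arbitrary illustrative cutoff, not a published value.
    """
    cutoff = factor * median(depth)
    regions, run_start = [], None
    for pos, d in enumerate(depth):
        if d > cutoff and run_start is None:
            run_start = pos
        elif d <= cutoff and run_start is not None:
            regions.append((run_start, pos))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions


# Toy contig of 30 bases: evenly tiled reads plus a pile-up over 10-20.
reads = [(0, 10), (5, 15), (10, 20), (15, 25), (20, 30)] + [(10, 20)] * 3
depth = depth_profile(30, reads)
suspect = high_coverage_regions(depth)
```

In practice a coverage signal like this would be combined with the paired-end evidence above before a region is called a mis-assembly.
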
  48. 48. Hawkeye Cons
      • Poor support for correcting mis-assemblies once detected
  49. 49. Software Inadequacies (same comparison table as slide 30)
  50. 50. Closing Remarks
      • Software exists to allow manual editing of assemblies
          – Time consuming
          – Different tools have different features
          – Most fall over with assemblies > ~15 Mbases or with many contigs/scaffolds (10k-100k)
  51. 51. Closing Remarks
      • Ideal Tool
          – Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they fall outside the current contig/scaffold (like SeqMan Pro and Hawkeye)
  52. 52. Closing Remarks
      • Ideal Tool
          – Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they fall outside the current contig/scaffold (like SeqMan Pro and Hawkeye)
          – Contig join editor for manual alignment and editing of contigs (like Gap5)
  53. 53. Gap5 – Join Editor
  54. 54. Closing Remarks
      • Ideal Tool
          – Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they fall outside the current contig/scaffold (like SeqMan Pro and Hawkeye)
          – Contig join editor for manual alignment and editing of contigs (like Gap5)
          – Visualise clipped regions with consensus mismatches (like Gap5)
  55. 55. Gap5 – Contig Editor
  56. 56. Closing Remarks
      • Ideal Tool
          – Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they fall outside the current contig/scaffold (like SeqMan Pro and Hawkeye)
          – Contig join editor for manual alignment and editing of contigs (like Gap5)
          – Visualise clipped regions with consensus mismatches (like Gap5)
          – Automated analysis of assembly to identify regions requiring attention (like AMOSvalidate) and a way to navigate to those regions for editing
          – Minimise mouse-clicks and keyboard presses!!
  57. 57. Newbler Plant Genome Assemblies
      • Pretty conservative in contig construction
      • Seems to split out repetitive regions into their own contigs pretty well
      • Heterozygosity issues
          – SNP alignment issues
          – Indels break contigs
          – Hidden in clipped regions
          – Manual joining of neighbouring contigs can reduce scaffolded contig numbers by 60-70%
          – Many unscaffolded contigs have high sequence similarity to scaffolded contigs; collapsing these could reduce the number of unscaffolded contigs by 50%
  58. 58. Gap5 – Contig Editor
