Fast algorithms for large scale
   genome alignment and
   comparison

                                                   ...
The article(s)

        A.L. Delcher, S. Kasif, R.D. Fleischmann, J.
         Peterson, O. White, S.L. Salzberg: “Alignme...
The problem

        When the genome sequence of two closely
         related organisms becomes available, one of the
   ...
The challenge

        Naïve
          • O(n2) space and time
        Hashing
          • faster, but still partly O(n2)...
The algorithm

       1) Perform a Maximal Unique Match (MUM)
         decomposition of the two genomes
       2) Sort the...
MUM: the suffix tree




p. 6   2007/05/28          ACMB
Longest Increasing Subsequence




p. 7   2007/05/28   ACMB
Closing the gaps




p. 8   2007/05/28         ACMB
MUMmer v2.0

        Relaxes the uniqueness constraint
        Faster, takes less space
        Algorithmic improvement...
Time-space improvements

         The amount of memory used in the suffix tree
          has been reduced
           • fr...
Streaming queries




p. 11   2007/05/28         ACMB
Clustering of matches

         Old version computed a single longest alignment
          between the sequences
        ...
Alignment of incomplete genomes

         In a typical Whole-Genome Shotgun-Sequencing,
          the genome is broken up...
Alignment of incomplete genomes

        1)NUCmer input: two multi-fasta files representing
          partial or complete ...
NUCmer




p. 15   2007/05/28    ACMB
PROmer

        1)Given two multi-fasta files, PROmer translates the
          DNA to amino acids
        2)An index is cr...
MUMmer v3.0

         New improvements in code
           • slightly faster than 2.0, 25% less memory
         More modu...
Graphical interfaces




p. 18   2007/05/28           ACMB
Graphical interfaces




p. 19   2007/05/28           ACMB
Graphical interfaces




p. 20   2007/05/28           ACMB
That's All, Folks



                          Thank you!
                     Questions are welcome




p. 21   2007/05/2...
Upcoming SlideShare
Loading in...5
×

Fast algorithms for large scale genome alignment and comparison

1,913

Published on

A presentation of the three articles about MUMmer by A.L. Delcher et al., made for the PhD course "Algorithms for Computational Molecular Biology" at Politecnico di Milano, May 2007

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,913
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
31
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Fast algorithms for large scale genome alignment and comparison

  1. 1. Fast algorithms for large scale genome alignment and comparison Davide Eynard eynard@elet.polimi.it Dipartimento di Elettronica e Informazione Politecnico di Milano 2007/05/28 Algorithms for Computational Molecular Biology
  2. 2. The article(s)  A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, S.L. Salzberg: “Alignment of whole genomes”, 1999  A.L. Delcher, A. Philippy, J. Carlton, S.L. Salzberg: “Fast algorithms for large-scale genome alignment and comparison”, 2002  S. Kurtz, A. Philippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, S.L. Salzberg: “Versatile and open software for comparing large genomes”, 2004 p. 2 2007/05/28 ACMB
  3. 3. The problem  When the genome sequence of two closely related organisms becomes available, one of the first questions researchers want to ask is how the two genomes align  Aligning (very) long sequences • Single gene sequences may be as long as tens of thousand of nucleotides • Whole genomes are usually millions of nucleotides or larger! p. 3 2007/05/28 ACMB
  4. 4. The challenge  Naïve • O(n2) space and time  Hashing • faster, but still partly O(n2)  Dynamic Programming • O(n) space, takes more time  MUMmer • Suffix trees: O(n) space and time • LIS: O(k log k) where k is the number of MUMs p. 4 2007/05/28 ACMB
  5. 5. The algorithm 1) Perform a Maximal Unique Match (MUM) decomposition of the two genomes 2) Sort the matches found in the MUM alignment, and extract the LIS (Longest Increasing Sequence) of matches that occur in the same order in both genomes 3) Close the gaps in the alignment, performing local identification of large inserts, repeats, small mutated regions, tandem repeats and SNPs 4) Output the alignment p. 5 2007/05/28 ACMB
  6. 6. MUM: the suffix tree p. 6 2007/05/28 ACMB
  7. 7. Longest Increasing Subsequence p. 7 2007/05/28 ACMB
  8. 8. Closing the gaps p. 8 2007/05/28 ACMB
  9. 9. MUMmer v2.0  Relaxes the uniqueness constraint  Faster, takes less space  Algorithmic improvements • memory • streaming query • new module to cluster matches  Able to align not only simple DNA sequences, but also human chromosomes  Able to align incomplete genomes and protein sequences p. 9 2007/05/28 ACMB
  10. 10. Time-space improvements  The amount of memory used in the suffix tree has been reduced • from at most 37bytes/bp to at most 20bytes/bp  Speed has increased • E.coli vs. V.cholerae, from 74sec,293MB to 27sec, 100MB  Suffix tree is used to store only one sequence, while the second one (query) is streamed against the suffix tree • once the suffix tree has been built, multiple queries can be streamed • quick way to find the next match • matches are maximal on the right hand side p. 10 2007/05/28 ACMB
  11. 11. Streaming queries p. 11 2007/05/28 ACMB
  12. 12. Clustering of matches  Old version computed a single longest alignment between the sequences  New version works as follows: • first, the system outputs a series of separate, independent alignment regions • clustering is performed by finding pairs of matches that are sufficiently close • finally, a LIS computation is done within each component to yield the most consistent sequence of matches in the cluster p. 12 2007/05/28 ACMB
  13. 13. Alignment of incomplete genomes  In a typical Whole-Genome Shotgun-Sequencing, the genome is broken up into millions of pieces • If the reads are generated at random, then >99% of a genome will be covered by sequencing enough reads to cover the genome eight times • The result of assembly is usually a collection of large, unordered DNA sequences called contigs  NUCmer (nucleotide MUMmer) is a multiple- contig alignment program that uses MUMmer 2 as its core aligment engine p. 13 2007/05/28 ACMB
  14. 14. Alignment of incomplete genomes 1)NUCmer input: two multi-fasta files representing partial or complete assemblies 2)Create a map of all contig positions within each file 3)Concatenate files separately and run MUMmer to find exact matches 4)Map matches to separate contigs 5)MUMs are clustered together if they are separated by no more than a user-specifiedd distance 6)Dynamic programming is used to align sequences between the MUMs p. 14 2007/05/28 ACMB
  15. 15. NUCmer p. 15 2007/05/28 ACMB
  16. 16. PROmer 1)Given two multi-fasta files, PROmer translates the DNA to amino acids 2)An index is created that maps all protein sequences and lengths to the source DNA 3)Pseudo-proteomes (amino acid sequences) are passed to MUMmer 4)The index is used to translate the matches back to the original DNA input 5)Clustering step p. 16 2007/05/28 ACMB
  17. 17. MUMmer v3.0  New improvements in code • slightly faster than 2.0, 25% less memory  More modular and configurable • possibility to build hybrid systems  Ability to run a multi-contig query against a multi- contig reference  Non-unique maximal matches  Speed-up of Nucmer and Promer modules (approx. 10-fold)  Graphical viewers p. 17 2007/05/28 ACMB
  18. 18. Graphical interfaces p. 18 2007/05/28 ACMB
  19. 19. Graphical interfaces p. 19 2007/05/28 ACMB
  20. 20. Graphical interfaces p. 20 2007/05/28 ACMB
  21. 21. That's All, Folks Thank you! Questions are welcome p. 21 2007/05/28 ACMB
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×