Fast algorithms for large scale genome alignment and comparison

Fast algorithms for large scale
genome alignment and
comparison

Davide Eynard
eynard@elet.polimi.it

Dipartimento di Elettronica e Informazione
Politecnico di Milano

2007/05/28

Algorithms for Computational Molecular Biology

The article(s)

 A.L. Delcher, S. Kasif, R.D. Fleischmann, J.
Peterson, O. White, S.L. Salzberg: “Alignment of
whole genomes”, 1999
 A.L. Delcher, A. Philippy, J. Carlton, S.L.
Salzberg: “Fast algorithms for large-scale
genome alignment and comparison”, 2002
 S. Kurtz, A. Philippy, A.L. Delcher, M. Smoot, M.
Shumway, C. Antonescu, S.L. Salzberg:
“Versatile and open software for comparing large
genomes”, 2004

p. 2 2007/05/28 ACMB

The problem

 When the genome sequence of two closely
related organisms becomes available, one of the
first questions researchers want to ask is how the
two genomes align
 Aligning (very) long sequences
• Single gene sequences may be as long as tens of
thousand of nucleotides
• Whole genomes are usually millions of nucleotides
or larger!

p. 3 2007/05/28 ACMB

The challenge

 Naïve
• O(n2) space and time
 Hashing
• faster, but still partly O(n2)
 Dynamic Programming
• O(n) space, takes more time
 MUMmer
• Suffix trees: O(n) space and time
• LIS: O(k log k) where k is the number of MUMs

p. 4 2007/05/28 ACMB

The algorithm

1) Perform a Maximal Unique Match (MUM)
decomposition of the two genomes
2) Sort the matches found in the MUM alignment,
and extract the LIS (Longest Increasing
Sequence) of matches that occur in the same
order in both genomes
3) Close the gaps in the alignment, performing
local identification of large inserts, repeats, small
mutated regions, tandem repeats and SNPs
4) Output the alignment

p. 5 2007/05/28 ACMB

MUM: the suffix tree

p. 6 2007/05/28 ACMB

Longest Increasing Subsequence

p. 7 2007/05/28 ACMB

Closing the gaps

p. 8 2007/05/28 ACMB

MUMmer v2.0

 Relaxes the uniqueness constraint
 Faster, takes less space
 Algorithmic improvements
• memory
• streaming query
• new module to cluster matches
 Able to align not only simple DNA sequences, but
also human chromosomes
 Able to align incomplete genomes and protein
sequences

p. 9 2007/05/28 ACMB

Time-space improvements

 The amount of memory used in the suffix tree
has been reduced
• from at most 37bytes/bp to at most 20bytes/bp
 Speed has increased
• E.coli vs. V.cholerae, from 74sec,293MB to 27sec,
100MB
 Suffix tree is used to store only one sequence,
while the second one (query) is streamed against
the suffix tree
• once the suffix tree has been built, multiple queries
can be streamed
• quick way to find the next match
• matches are maximal on the right hand side
p. 10 2007/05/28 ACMB

Streaming queries

p. 11 2007/05/28 ACMB

Clustering of matches

 Old version computed a single longest alignment
between the sequences
 New version works as follows:
• first, the system outputs a series of separate,
independent alignment regions
• clustering is performed by finding pairs of matches
that are sufficiently close
• finally, a LIS computation is done within each
component to yield the most consistent sequence
of matches in the cluster

p. 12 2007/05/28 ACMB

Alignment of incomplete genomes

 In a typical Whole-Genome Shotgun-Sequencing,
the genome is broken up into millions of pieces
• If the reads are generated at random, then >99%
of a genome will be covered by sequencing
enough reads to cover the genome eight times
• The result of assembly is usually a collection of
large, unordered DNA sequences called contigs
 NUCmer (nucleotide MUMmer) is a multiple-
contig alignment program that uses MUMmer 2
as its core aligment engine

p. 13 2007/05/28 ACMB

Alignment of incomplete genomes

1)NUCmer input: two multi-fasta files representing
partial or complete assemblies
2)Create a map of all contig positions within each
file
3)Concatenate files separately and run MUMmer to
find exact matches
4)Map matches to separate contigs
5)MUMs are clustered together if they are
separated by no more than a user-specifiedd
distance
6)Dynamic programming is used to align
sequences between the MUMs

p. 14 2007/05/28 ACMB

NUCmer

p. 15 2007/05/28 ACMB

PROmer

1)Given two multi-fasta files, PROmer translates the
DNA to amino acids
2)An index is created that maps all protein
sequences and lengths to the source DNA
3)Pseudo-proteomes (amino acid sequences) are
passed to MUMmer
4)The index is used to translate the matches back
to the original DNA input
5)Clustering step

p. 16 2007/05/28 ACMB

MUMmer v3.0

 New improvements in code
• slightly faster than 2.0, 25% less memory
 More modular and configurable
• possibility to build hybrid systems
 Ability to run a multi-contig query against a multi-
contig reference
 Non-unique maximal matches
 Speed-up of Nucmer and Promer modules
(approx. 10-fold)
 Graphical viewers

p. 17 2007/05/28 ACMB

Graphical interfaces

p. 18 2007/05/28 ACMB


p. 19 2007/05/28 ACMB


p. 20 2007/05/28 ACMB

That's All, Folks

Thank you!
Questions are welcome

p. 21 2007/05/28 ACMB

Fast algorithms for large scale genome alignment and comparison

Recommended

Recommended

More Related Content

Similar to Fast algorithms for large scale genome alignment and comparison

Similar to Fast algorithms for large scale genome alignment and comparison (20)

More from Davide Eynard

More from Davide Eynard (15)

Recently uploaded

Recently uploaded (20)

Fast algorithms for large scale genome alignment and comparison