Suffix Tree and Suffix Array

INDIAN INSTITUTE OF TECHNOLOGY (BHU),
VARANASI
CS-4301: SEMINAR / GROUP DISCUSSION
By:
HARSHIT AGARWAL
11100EN006

Suffix Tree & Suffix Array
Construction
Pattern Searching
Space & Time Complexity
Applications
(In a Nutshell …)

Motivation
• We need a fast text-searching algorithm with low space
cost.
• DNA sequences and protein sequences are too large
to search with traditional algorithms:
• Rabin-Karp
• Naive String Matching
• Some improved algorithms perform efficiently:
• KMP and BM (Boyer-Moore) algorithms for string matching.

Definition
A suffix tree (also called a PAT tree) is a compressed trie
containing all the suffixes of the given text.
• Properties (S is the text of length n, S(i) its suffix starting at position i):
• Each tree edge is labeled by a substring of S.
• Each internal node except the root has >= 2 children.
• No two edges branching out from the same internal node can start
with the same character.
• Each S(i) has its corresponding labeled path from the root to a leaf, for
1 <= i <= n.
• There are n leaves.

Trie
• A Trie represents a set of strings. For e.g. :
{ aeef , ad , bbfe , bbfg , c }

Compressed Trie
• Compress unary nodes, label edges by strings
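
The two slides above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the trie is stored as nested dictionaries, a "$" key marks the end of a word, and the helper names build_trie and compress are hypothetical, not taken from the slides.

```python
def build_trie(words):
    """Build a trie as nested dicts: one dict per node, one key per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-word marker (an assumption of this sketch)
    return root


def compress(node):
    """Merge every chain of single-child (unary) nodes into one edge
    whose label is a string instead of a single character."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
compact = compress(trie)
```

For the example set, the compressed trie has the root edges "a", "bbf" and "c$"; below "a" the edges are "eef$" and "d$", and below "bbf" they are "e$" and "g$".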

Construction of Suffix Tree
Uniqueness:
(1) Here we preprocess the text string instead of the pattern.
(2) Each suffix is padded with a terminal symbol not seen
in the string (usually denoted by $). This ensures that no suffix
is a prefix of another.

Naïve Method
ALGORITHM :
1) Put suffix S[1..m] into the tree.
2) Then put S[i..m] into the tree for 2 <= i <= m.
Naïve method: O(m^2) time (m = text size)
while suffixes remain:
add next shortest suffix to the tree
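
A minimal sketch of the naïve method, assuming the text does not already contain "$" and using 0-based positions instead of the 1-based positions on the slide. The Node class and the function names are hypothetical. Each suffix is walked down from the root and an edge is split where the suffix diverges, which gives the quadratic behaviour noted above.

```python
class Node:
    def __init__(self, suffix_start=None):
        self.children = {}                # first char -> (edge label, child)
        self.suffix_start = suffix_start  # set only on leaves


def build_suffix_tree_naive(text):
    s = text + "$"  # terminal symbol, assumed absent from text
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while rest:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with this character: add a new leaf edge.
                node.children[rest[0]] = (rest, Node(suffix_start=i))
                break
            label, child = entry
            k = 0  # length of the common prefix of edge label and suffix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_starts(node):
    """Collect the suffix start positions stored in the leaves."""
    if not node.children:
        return [node.suffix_start]
    starts = []
    for _, child in node.children.values():
        starts.extend(leaf_starts(child))
    return starts


tree = build_suffix_tree_naive("xtpxtd")
```

For "xtpxtd" the tree has 7 leaves (6 suffixes plus the lone "$"), and the root edge starting with "x" carries the label "xt", shared by the suffixes "xtpxtd$" and "xtd$".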

O(m^2)
Too expensive!
Can we improve it
further?

Ukkonen’s Algorithm
• Build suffix tree incrementally from left-to-right
• Build the tree in ‘m’ phases, one for each character.
• At the end of phase i, we have the tree T_i, which is the implicit suffix tree
of the prefix S[1..i].
• Build “implicit” suffix trees (no end-of-string marker).
• Extend the implicit suffix tree from the previous phase.
• Convert the implicit suffix tree to an “explicit” one in the final step.
• Three extension rules are involved:
• ADD a new leaf edge if no edge from the current node starts with the new character.
• SPLIT the existing edge if the current suffix ends inside an edge and the next character on that edge differs from the new character.
• DO NOTHING if the current suffix followed by the new character is already present.

STEP 6: x t p x t d (ADD)
Implicit Tree

STEP 7: x t p x t d (Finalize)
Explicit Tree

Shortcuts: Improve the complexity
• Suffix Links
• Skip and Count Trick
• Edge Label Compressions
• A Stopper

Pattern Searching
Idea:
Every pattern that is present in the text (that is, every substring
of the text) must be a prefix of some suffix of the text.
Algorithm:
• Starting from the first character of the pattern and the root of the suffix
tree, do the following for every character.
• For the current character of the pattern, if there is a matching edge from the
current node of the suffix tree, follow the edge.
• If there is no edge, print “pattern doesn’t exist in text” and return.
• If all characters of the pattern have been processed, i.e., there is a
path from the root for the characters of the given pattern, then print
“Pattern found”.
Complexity: O(n) (n = size of pattern) rather than O(m) (m = size of text)
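
The search idea can be illustrated with a deliberately simplified structure: an uncompressed suffix trie stored as nested dictionaries. This is an assumption made to keep the sketch short. It costs O(m^2) space, unlike a real suffix tree, but the walk from the root is the same and takes one step per pattern character. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + "$" into a nested-dict trie."""
    s = text + "$"  # terminal symbol, assumed absent from text
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """Follow one edge per pattern character, starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False  # no edge: the pattern doesn't exist in the text
        node = node[ch]
    return True  # all characters processed: pattern found


trie = build_suffix_trie("xtpxtd")
```

With this trie, occurs(trie, "pxt") is True and occurs(trie, "tx") is False.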

Definition
A suffix array is just a sorted array of all the suffixes of a given
string. It contains integers that represent the starting indexes of
all the suffixes of the string, after the suffixes are sorted.
• Properties: Let S be a string of length n and let S[i..j] denote the
substring of S ranging from i to j.
• The suffix array is defined to be an array of integers providing the
starting positions of the suffixes of S in lexicographical order.
• An entry A[i] contains the starting position of the i-th smallest
suffix in S.
• For all 1 < i <= n: S[A[i-1]..n] < S[A[i]..n]

Construction of Suffix Array
• The text ends with the special sentinel letter $ that is unique
and lexicographically smaller than any other character.
• Easy O(n^2 log n) algorithm:
- Sort the n suffixes, which takes O(n log n) comparisons.
- Each comparison takes O(n).
• There are O(n log n) & O(n) algorithms for constructing suffix
arrays that use very little space.
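
The easy algorithm can be sketched by sorting the suffix start positions with the suffix itself as the sort key. The function name suffix_array is hypothetical, positions are 0-based, and the sketch assumes every character of the text compares greater than "$" (true for ASCII letters and digits).

```python
def suffix_array(text):
    """Sort the start positions of all suffixes of text + "$".

    O(n log n) comparisons of O(n) each, i.e. O(n^2 log n) overall."""
    s = text + "$"  # sentinel, assumed smaller than every other character
    return sorted(range(len(s)), key=lambda i: s[i:])


sa = suffix_array("banana")
```

For "banana" this gives [6, 5, 3, 1, 0, 4, 2], corresponding to the sorted suffixes $, a$, ana$, anana$, banana$, na$, nana$.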

Can we do it in
O(n)?
Skew Algorithm:
Divide & Conquer

Pattern Searching
• If Pattern(P) occurs in Text (T) then all its occurrences are
consecutive in the suffix array.
• Do a binary search on the suffix array.
• Complexity: takes O(n log m) time,
where m = length of the text
and n = length of the pattern.
• It can be improved to O(n + log m) time using LCP information.
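
A minimal sketch of the plain binary search, without the LCP speed-up. It assumes a suffix array built by simple sorting and 0-based positions; find_occurrences is a hypothetical name. Two binary searches find the block of consecutive suffixes that start with the pattern, and each step compares at most n characters.

```python
def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    n = len(pattern)
    # First suffix whose first n characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First suffix whose first n characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
```

For "banana", find_occurrences(text, sa, "ana") returns [1, 3]; both matches sit next to each other in the suffix array.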

Accelerate the Search:
• L and R are the suffixes at the left and right ends of the current search interval; M is the middle suffix.
• Maintain l = LCP(P, L)
• Maintain r = LCP(P, R)
• If l = r then start comparing M to P at l + 1

• If l > r then suppose we know LCP(L, M):
• If LCP(L, M) < l we go left
• If LCP(L, M) > l we go right
• If LCP(L, M) = l we start comparing at l + 1
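
The sketch below shows only the simpler part of this idea and is not the full O(n + log m) method: it maintains l and r and starts each comparison at min(l, r), since every suffix between L and R already agrees with P on that many characters. It does not use precomputed LCP(L, M) values, so the worst case stays O(n log m). The function name and the 0-based indexing are assumptions.

```python
def search_with_lcp_bounds(text, sa, pattern):
    """Return one start position of pattern in text, or None."""
    m, n = len(text), len(pattern)
    left, right = -1, len(sa)  # virtual boundaries just outside the array
    l = r = 0                  # l = LCP(P, L), r = LCP(P, R)
    while right - left > 1:
        mid = (left + right) // 2
        k = min(l, r)          # characters known to match already
        start = sa[mid]
        while k < n and start + k < m and text[start + k] == pattern[k]:
            k += 1
        if k == n:
            right, r = mid, k  # pattern is a prefix of M: answer is M or left of it
        elif start + k == m or text[start + k] < pattern[k]:
            left, l = mid, k   # M is smaller than the pattern: go right
        else:
            right, r = mid, k  # M is larger than the pattern: go left
    if right < len(sa) and text[sa[right]:sa[right] + n] == pattern:
        return sa[right]
    return None


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
```

For "banana", searching for "ana" returns 3, the start of the lexicographically smallest suffix that begins with the pattern.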

Suffix Array Vs Suffix Tree
• Suffix arrays are closely related to suffix trees:
• A suffix array can be constructed from a suffix tree by doing a DFS
(Depth First Search) traversal of the suffix tree, visiting edges in lexicographical order.
• A suffix tree can be constructed in linear time by using a
combination of the suffix array and the LCP (Longest Common Prefix) array.
• Slight space vs. time tradeoff: suffix arrays are a more space-efficient
way to store the suffixes, because we store just the original string
plus a list of integers, but they are a bit slower.

Applications
• Finding the longest repeated substring.
• Finding the longest common substring.
• Finding the longest palindrome in a string.
• Others :
• Data Compression & Clustering Algorithms.
• Bioinformatics.
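
As an illustration of the first application, the longest repeated substring is the longest common prefix of two neighbouring suffixes in sorted order. The sketch below assumes a naively sorted suffix array and a direct character-by-character LCP computation, so it is quadratic in the worst case; the function name is hypothetical.

```python
def longest_repeated_substring(text):
    """Return a longest substring that occurs at least twice in text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    best = ""
    for a, b in zip(sa, sa[1:]):  # neighbouring suffixes in sorted order
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best
```

For example, longest_repeated_substring("banana") returns "ana" and longest_repeated_substring("xtpxtd") returns "xt".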
