Pattern Matching Part One: Suffix Trees

1.
Advanced Algorithms –COMS31900 Pattern Matching part one Sufﬁx Trees Benjamin Sach

2.
Exact pattern matching T InputA text string T (length n) and a pattern string P (length m) P ba b c a b a a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a n m for all 0 j m we have that P[j] = T[i + j] (our strings are zero-indexed)

3.
Exact pattern matching T InputA text string T (length n) and a pattern string P (length m) P ba b c a b a a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a n m 4 for all 0 j m we have that P[j] = T[i + j] (our strings are zero-indexed)

4.
Exact pattern matching T InputA text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a n a b a m 6 for all 0 j m we have that P[j] = T[i + j] (our strings are zero-indexed)

5.

6.

7.
Exact pattern matching T InputA text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a n a b a m 6 for all 0 j m we have that P[j] = T[i + j] (our strings are zero-indexed) • A naive algorithm takes O(nm) time

8.
Exact pattern matching T InputA text string T (length n) and a pattern string P (length m) P ba b c a b a cb a Goal: Find all the locations where P matches in T P matches at location i iff a b a n a b a m 6 for all 0 j m we have that P[j] = T[i + j] (our strings are zero-indexed) • A naive algorithm takes O(nm) time • Many O(n) time algorithms are known (for example KMP)

9.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n

10.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m

11.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T.

12.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. e.g. 4, 6, 10 4 6 10

13.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • A naive algorithm takes O(n) query time (using KMP) e.g. 4, 6, 10 4 6 10

14.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • A naive algorithm takes O(n) query time (using KMP) • We want a query time which depends only on m and occ - occ is the number of occurences (matches) e.g. 4, 6, 10 4 6 10

15.
Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • A naive algorithm takes O(n) query time (using KMP) • We want a query time which depends only on m and occ - occ is the number of occurences (matches) • We also want O(n) space and fast preprocessing (prep.) time e.g. 4, 6, 10 4 6 10

16.
The atomic sufﬁxtree TT b n aaa sn n

17.
The atomic sufﬁxtree TT b n aaa sn n b n aaa sn n aaa sn n aa sn aa sn a sn a s s sufﬁxes

18.
The atomic suffixtree sn a s n a s a n a s TT b n aaa sn n a s b n a sn a n a s b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes suffix tree

19.

20.

21.

22.

23.

24.

25.

26.

27.
The atomic suffixtree sn a s n a s a n a s TT b n aaa sn n a s b n a sn a n a s b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes suffix tree 0 1 2 3 4 5 6 0 1 2 3 4 5 6

28.
The atomic suffixtree sn a s n a s a n a s TT b n aaa sn n a s b n a sn a n a s b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes suffix tree 0 1 2 3 4 5 6 0 1 2 3 4 5 6 • The suffix tree contains every suffix of T as a root to leaf path

29.
The atomic suffixtree sn a s n a s a n a s TT b n aaa sn n a s b n a sn a n a s b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes suffix tree 0 1 2 3 4 5 6 0 1 2 3 4 5 6 • The suffix tree contains every suffix of T as a root to leaf path • Every edge is labelled with a character from T

30.
The atomic suffixtree sn a s n a s a n a s TT b n aaa sn n a s b n a sn a n a s b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes suffix tree 0 1 2 3 4 5 6 0 1 2 3 4 5 6 • The suffix tree contains every suffix of T as a root to leaf path • Every edge is labelled with a character from T • No two edges leaving the same node have the same label

31.
The atomic suffixtree sn a s n a s a n a s TT b n aaa sn n a s b n a sn a n a s b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes suffix tree 0 1 2 3 4 5 6 0 1 2 3 4 5 6 • The suffix tree contains every suffix of T as a root to leaf path • Each leaf corresponds to a suffix (so there are n leaves) • Every edge is labelled with a character from T • No two edges leaving the same node have the same label

32.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s 0 1 2 3 4 5 a 6

33.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 a 6

34.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m a 6

35.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a 6

36.

37.

38.

39.

40.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree 6

41.

42.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree 1 6

43.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree 3 6

44.

45.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree P bn a 6

46.

47.

48.

49.

50.

51.

52.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree We can decide whether P matches somewhere in O(m) time P bn a 6

53.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree We can decide whether P matches somewhere in O(m) time (we’ll worry about outputting the matches later) P bn a 6

54.
Searching in anatomic suffix tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you find a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree We can decide whether P matches somewhere in O(m) time (we’ll worry about outputting the matches later) WARNING! How long does it take to find the correct child? There could be n edges here! In this lecture we assume the alphabet size is a constant P bn a 6

55.
Searching in anatomic suffix tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you find a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree We can decide whether P matches somewhere in O(m) time (we’ll worry about outputting the matches later) WARNING! How long does it take to find the correct child? There could be n edges here! In this lecture we assume the alphabet size is a constant This may be fine in some applications (English text or DNA for example) We can remove the assumption via the magic of hashing P bn a 6

56.
Searching in anatomic sufﬁx tree sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s How do you ﬁnd a pattern? 0 1 2 3 4 5 P aa n m start at the root and walk down the tree a . . . matches occur at the leaves of the subtree We can decide whether P matches somewhere in O(m) time (we’ll worry about outputting the matches later) P bn a 6

57.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a 6

58.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? 6

59.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? Unfortunately there can be lots of internal nodes 6

60.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? Unfortunately there can be lots of internal nodes this path is pretty long. . . 6

61.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? Unfortunately there can be lots of internal nodes this path is pretty long. . . 6

62.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? Unfortunately there can be lots of internal nodes this path is pretty long. . . 7 characters 6

63.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? Unfortunately there can be lots of internal nodes this path is pretty long. . . 7 characters 23 nodes 6

64.
how large isthe atomic sufﬁx tree? sn a s n a s a n a s TT b n aaa sn n s b n a sn a n a s There are at most n leaves 0 1 2 3 4 5 a that’s good right? Unfortunately there can be lots of internal nodes this path is pretty long. . . 7 characters 23 nodes that’s not so bad, right? 6

65.
how large isthe atomic sufﬁx tree?

66.
how large isthe atomic sufﬁx tree? T a b 2

67.
how large isthe atomic sufﬁx tree? T a b 2 a b b

68.
how large isthe atomic sufﬁx tree? T a b 2 a b b 4 nodes

69.
how large isthe atomic sufﬁx tree? T a b a b baT 2 4 a b b b b a b b a b b 4 nodes 9 nodes

70.
how large isthe atomic sufﬁx tree? T a b a b ba a b baa bT T 2 4 6 a b b b b a b b a b b b b a b b a b b b b b a b b b 4 nodes 9 nodes 16 nodes

71.
how large isthe atomic sufﬁx tree? T a b a b ba a b baa b a b baa ba bT T T 2 4 6 8 a b b b b a b b a b b b b a b b a b b b b b a b b b 4 nodes 9 nodes 16 nodes b b a b b a b b b b b a b b b 25 nodes b b b b a b b b b

72.
how large isthe atomic sufﬁx tree? T a b a b ba a b baa b a b baa b a b baa b a b aa b b T T T T 2 4 6 8 10 a b b b b a b b a b b b b a b b a b b b b b a b b b 4 nodes 9 nodes 16 nodes b b a b b a b b b b b a b b b 25 nodes b b b b a b b b b b b a b b a b b b b b a b b b 36 nodes b b b b a b b b b b b b b b a b b b b b

73.
how large isthe atomic sufﬁx tree? T a b a b ba a b baa b a b baa b a b baa b a b aa b b T T T T 2 4 6 8 10 a b b b b a b b a b b b b a b b a b b b b b a b b b 4 nodes 9 nodes 16 nodes b b a b b a b b b b b a b b b 25 nodes b b b b a b b b b b b a b b a b b b b b a b b b 36 nodes b b b b a b b b b b b b b b a b b b b b An atomic sufﬁx tree can have ((n/2) + 1)2 nodes

74.
how large isthe atomic sufﬁx tree? T a b a b ba a b baa b a b baa b a b baa b a b aa b b T T T T 2 4 6 8 10 a b b b b a b b a b b b b a b b a b b b b b a b b b 4 nodes 9 nodes 16 nodes b b a b b a b b b b b a b b b 25 nodes b b b b a b b b b b b a b b a b b b b b a b b b 36 nodes b b b b a b b b b b b b b b a b b b b b An atomic sufﬁx tree can have ((n/2) + 1)2 nodes this is too big!far

75.
Compacted sufﬁx trees sn a s n a s a n a s TTb n aaa sn n s b n a sn a n a s 0 1 2 3 4 5 6a Why is the atomic sufﬁx tree so big?

76.
Compacted sufﬁx trees sn a s n a s a n a s TTb n aaa sn n s b n a sn a n a s 0 1 2 3 4 5 6a Why is the atomic sufﬁx tree so big? because it has long paths like this one

77.
Compacted sufﬁx trees sn a s n a s a n a s TTb n aaa sn n s b n a sn a n a s 0 1 2 3 4 5 6a Why is the atomic sufﬁx tree so big? because it has long paths like this one Main Idea replace each non-branching path with a single edge

78.
Compacted sufﬁx trees sn a s n a s a n a s TTb n aaa sn n s b n a sn a n a s 0 1 2 3 4 5 6a Why is the atomic sufﬁx tree so big? because it has long paths like this one Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings

79.
Compacted sufﬁx trees sn a s n a s a n a s TTb n aaa sn n s b n a sn a n a s 0 1 2 3 4 5 6a Why is the atomic sufﬁx tree so big? because it has long paths like this one Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters)

80.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas 1 3 5 0 2 4 6

81.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas • There are at most n leaves 1 3 5 0 2 4 6

82.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas • There are at most n leaves • Every internal node has two or more children 1 3 5 0 2 4 6

83.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas • There are at most n leaves • Every internal node has two or more children so there are O(n) edges 1 3 5 0 2 4 6

84.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas • There are at most n leaves • Every internal node has two or more children so there are O(n) edges don’t the edges take up 1 3 5 0 2 4 6 lots of space?

85.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas • There are at most n leaves • Every internal node has two or more children so there are O(n) edges don’t the edges take up we only store the end points 1 3 5 0 2 4 6 lots of space?

86.
Compacted sufﬁx trees TTb n aaa sn n Main Idea replace each non-branching path with a single edge - edges are now labelled with substrings (instead of single characters) a s nas nas nas s na s bananas • There are at most n leaves • Every internal node has two or more children so there are O(n) edges don’t the edges take up we only store the end points we actually store (4, 6) 4 6 1 3 5 0 2 4 6 lots of space?

87.
Compacted sufﬁx trees TTb n aaa sn n a s nas nas nas s na s bananas 1 3 5 0 2 4 6

88.
Compacted sufﬁx trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Sufﬁx Tree of T 1 3 5 0 2 4 6

89.
Compacted sufﬁx trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Sufﬁx Tree of T • A rooted tree with n leaves 1 3 5 0 2 4 6

90.
Compacted sufﬁx trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Sufﬁx Tree of T • A rooted tree with n leaves • Every internal node has two or more children 1 3 5 0 2 4 6

91.
Compacted sufﬁx trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Sufﬁx Tree of T • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring 1 3 5 0 2 4 6

92.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character 1 3 5 0 2 4 6

93.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T 1 3 5 0 2 4 6

94.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6

95.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6

96.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6 Sanity Check Does the compacted suffix tree always exist?

97.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6 Sanity Check Does the compacted suffix tree always exist? TT b b this doesn’t have n leavesbb

98.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6 Sanity Check Does the compacted suffix tree always exist? TT b b this doesn’t have n leavesbb TT b b $ b$ $ b$ this has n leaves

99.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6

100.
Compacted suffix trees TTb n aaa sn n a s nas nas nas s na s bananas Compacted Suffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6 Step one: Add a $ (unique symbol) to T

101.
Compacted suffix trees a s nas nas nas s na s bananas CompactedSuffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix 1 3 5 0 2 4 6 Step one: Add a $ (unique symbol) to T TT b n aaa sn n $

102.
Compacted suffix trees CompactedSuffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix Step one: Add a $ (unique symbol) to T TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ 1 3 5 0 2 4 6

103.
Compacted suffix trees CompactedSuffix Tree of T Uses O(n) space • A rooted tree with n leaves • Every internal node has two or more children • Every edge is labelled with a substring • No two edges leaving the same node have the same first character • Each leaf is labelled with a location in T • Any root-to-leaf path spells out the corresponding suffix Step one: Add a $ (unique symbol) to T TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ This is normally just called1 3 5 0 2 4 6 a suffix tree

104.
Searching in acompacted sufﬁx tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ 1 3 5 0 2 4 6

105.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ 1 3 5 0 2 4 6

106.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an 1 3 5 0 2 4 6

107.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an 1 3 5 0 2 4 6

108.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an 1 3 5 0 2 4 6

109.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a an 1 3 5 0 2 4 6

110.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an 1 3 5 0 2 4 6

111.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an 1 3 5 0 2 4 6 remember that an edge is actually stored as a pair we’re actually looking in T

112.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an na1 3 5 0 2 4 6

113.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an na nas$ 1 3 5 0 2 4 6

114.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an na nas$ nas$ 1 3 5 0 2 4 6

115.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an na nas$ nas$ 1 3 5 0 2 4 6

116.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree 1 TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ a na an na nas$ nas$ 1 3 5 0 2 4 6

117.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an 1 3 5 0 2 4 6

118.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an 1 3 5 0 2 4 6

119.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an 1 3 5 0 2 4 6

120.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an na 1 3 5 0 2 4 6

121.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an nana 1 3 5 0 2 4 6

122.

123.

124.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an nana 2 1 3 5 0 2 4 6

125.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an nana 4 1 3 5 0 2 4 6

126.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an nana how big is this subtree?1 3 5 0 2 4 6

127.
Searching in acompacted sufﬁx tree How do you ﬁnd a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an nana how big is this subtree? O(occ) because it has occ leaves 1 3 5 0 2 4 6 (and each internal node has at least two children)

128.
Searching in acompacted suffix tree How do you find a pattern? P aa n m start at the root and walk down the tree . . . matches occur at the leaves of the subtree We can find all the matches in O(m + occ) time (by looking at the whole subtree) P n a TT b n aaa sn n $ a s$ nas$ nas$ nas$ s$ na s$ bananas$ 7$ an nana how big is this subtree? O(occ) because it has occ leaves 1 3 5 0 2 4 6 (and each internal node has at least two children)

129.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ never actually do it like this you should

130.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 never actually do it like this you should

131.

132.

133.

134.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 never actually do it like this you should

135.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 we actually store this as (0, 7) never actually do it like this you should

136.

137.

138.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ never actually do it like this you should

139.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ this is stored as (1, 7) never actually do it like this you should

140.

141.

142.

143.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ 2 nanas$ never actually do it like this you should

144.

145.

146.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ 2 nanas$ ananas$ never actually do it like this you should

147.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ 2 nanas$ ananas$ ananas$ never actually do it like this you should

148.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ 2 nanas$ ananas$ ananas$ ananas$ never actually do it like this you should

149.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 1 ananas$ 2 nanas$ ananas$ ananas$ ananas$ ananas$ never actually do it like this you should

150.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ nas$ 1 ana nas$ ana never actually do it like this you should

151.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ nas$ 1 ana nas$ ana ananas$ was stored as (1, 7) never actually do it like this you should

152.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ nas$ 1 ana nas$ ana ananas$ was stored as (1, 7) ana is stored as (1, 3) nas$ is stored as (4, 7) never actually do it like this you should

153.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ nas$ 1 ana nas$ ana never actually do it like this you should

154.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ s$ nas$ 1 3 ana never actually do it like this you should

155.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ s$ nas$ 1 3 ana stored as (6, 7) never actually do it like this you should

156.

157.

158.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ s$ nas$ 1 3 ana nanas$ never actually do it like this you should

159.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ s$ nas$ 1 3 ana nanas$ nanas$ never actually do it like this you should

160.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 nanas$ s$ nas$ 1 3 ana nanas$ nanas$ nanas$ never actually do it like this you should

161.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ na bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 s$ nas$ 1 3 ana na nas$ nas$ never actually do it like this you should

162.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ s$ na bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 4 s$ nas$ 1 3 ana nas$ never actually do it like this you should

163.

164.

165.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ s$ na bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 4 s$ nas$ 1 3 ana ana nas$ never actually do it like this you should

166.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ s$ na bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 0 2 4 s$ nas$ 1 3 ana ana ana nas$ never actually do it like this you should

167.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a nas$ nas$ s$ na bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 0 2 4 nas$ a na never actually do it like this you should

168.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a s$ nas$ nas$ s$ na bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 nas$ never actually do it like this you should

169.

170.

171.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ never actually do it like this you should

172.

173.

174.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ never actually do it like this you should

175.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ never actually do it like this you should

176.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ This takes O(n) time per suffix. . . never actually do it like this you should

177.
Naively constructing acompacted suffix tree Insert the suffixes one at a time (longest first) • Search for the new suffix in the partial suffix tree (as if you were matching a pattern) • Add a new edge and leaf for the new suffix (this may require you to break an edge in two) TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ This takes O(n) time per suffix. . . so O(n2) time in total never actually do it like this you should

178.
Suffix tree summary TTb n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ • The (compacted) suffix tree of a (length n) text uses O(n) space • Finding all matches of a pattern P of length m takes O(m + occ) where occ is the number of matches we assumed that the alphabet contained a constant number of symbols • Suffix trees can be built in O(n) time but we have only seen the O(n2) time method actually do it like this (or build a suffix array instead) you should

179.
Multiple text indexing T1b n aaa sn n1 T2 a p slp e n2 How can we index multiple texts?

180.
Multiple text indexing T1b n aaa sn n1 T2 a p slp e n2 $ & two distinct unique symbols How can we index multiple texts?

181.
Multiple text indexing TT T1b n aaa sn n1 T2 a p slp e n2 $ & two distinct unique symbols b n aaa sn $ a p slp e & n How can we index multiple texts?

182.
Multiple text indexing TT T1b n aaa sn n1 T2 a p slp e n2 $ & b n aaa sn $ a p slp e & n How can we index multiple texts?

183.
Multiple text indexing 6$ 13 & TT a s$ nas$ nas$ s$ na s bananas$ 7$ 13 5 0 2 4 nas$ T1 b n aaa sn n1 T2 a p slp e n2 $ & b n aaa sn $ a p slp e & n 14 & 12 es& les&11 8 pples& p 10 les& 9 ples& How can we index multiple texts?

184.
Multiple text indexing 6$ 13 & TT a s$ nas$ nas$ s$ na s bananas$ 7$ 13 5 0 2 4 nas$ T1 b n aaa sn n1 T2 a p slp e n2 $ & b n aaa sn $ a p slp e & n • Build a generalised sufﬁx tree in O(n1 + n2) space 14 & 12 es& les&11 8 pples& p 10 les& 9 ples& How can we index multiple texts?

185.
Multiple text indexing 6$ 13 & TT a s$ nas$ nas$ s$ na s bananas$ 7$ 13 5 0 2 4 nas$ T1 b n aaa sn n1 T2 a p slp e n2 $ & b n aaa sn $ a p slp e & n • Build a generalised sufﬁx tree in O(n1 + n2) space • Using the linear time method (which we omitted), this takes O(n1 + n2) time 14 & 12 es& les&11 8 pples& p 10 les& 9 ples& How can we index multiple texts?

186.
Multiple text indexing 6$ 13 & TT a s$ nas$ nas$ s$ na s bananas$ 7$ 13 5 0 2 4 nas$ T1 b n aaa sn n1 T2 a p slp e n2 $ & b n aaa sn $ a p slp e & n • Build a generalised sufﬁx tree in O(n1 + n2) space • Using the linear time method (which we omitted), this takes O(n1 + n2) time • Finding all matches of a pattern P of length m still takes O(m + occ) time 14 & 12 es& les&11 8 pples& p 10 les& 9 ples& How can we index multiple texts? where occ is the number of matches

187.
The sufﬁx array- a sneak preview T b n aaT a sn n

188.
The sufﬁx array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn

189.
The sufﬁx array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa snsufﬁx 1

190.
The sufﬁx array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn

191.
The sufﬁx array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the sufﬁxes lexicographically 3 aa sn

192.
The sufﬁx array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the sufﬁxes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order

193.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs:

194.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a

195.

196.

197.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<

198.

199.

200.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a

201.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a

202.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)

203.
The suffix array- a sneak preview T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)

204.
The suffix array- a sneak preview T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)

205.

206.

207.

208.
The suffix array- a sneak preview T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in a dictionary In lexicographical ordering we sort strings based on the first symbol that differs:

209.
The suffix array- a sneak preview T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order If the symbols don’t have a natural order, we use their binary representation in memory b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in a dictionary In lexicographical ordering we sort strings based on the first symbol that differs:

210.
The sufﬁx array- a sneak preview T b n aaT a sn n Sort the sufﬁxes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn

211.
The sufﬁx array- a sneak preview T b n aaT a sn n Sort the sufﬁxes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 10 2 3 4 5 6

212.
The suffix array- a sneak preview T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6

213.

214.

215.
The suffix array- a sneak preview T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6

216.
The suffix array- a sneak preview T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6

217.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array Suffix Tree recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n 10 2 3 4 5 6

218.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array Suffix Tree recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 10 2 3 4 5 6

219.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 10 2 3 4 5 6

220.

221.

222.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 10 2 3 4 5 6

223.

224.

225.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 10 2 3 4 5 6

226.

227.

228.

229.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 5 10 2 3 4 5 6

230.

231.

232.

233.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 5 0 10 2 3 4 5 6

234.

235.

236.

237.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 5 0 2 10 2 3 4 5 6

238.

239.

240.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 5 0 2 4 10 2 3 4 5 6

241.

242.

243.

244.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 5 0 2 4 6 10 2 3 4 5 6

245.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) 1 3 5 0 2 4 6 10 2 3 4 5 6

246.
Constructing the SuffixArray from the Suffix Tree a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array recall that we added a unique symbol $ to make sure the tree exists - the $ is the smallest symbol in the alphabet 1 0 625 43 n To get the Suffix array perform a depth-first search (in lexicographical order) this takes O(n) time 1 3 5 0 2 4 6 10 2 3 4 5 6

247.
Suffix tree summary TTb n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 1 3 5 0 2 4 6 nas$ • The (compacted) suffix tree of a (length n) text uses O(n) space • Finding all matches of a pattern P of length m takes O(m + occ) where occ is the number of matches we assumed that the alphabet contains a constant number of symbols • Suffix trees can be built in O(n) time but we have only seen the O(n2) time method

Pattern Matching Part One: Suffix Trees

More Related Content

What's hot

Similar to Pattern Matching Part One: Suffix Trees

More from Benjamin Sach

Recently uploaded

Pattern Matching Part One: Suffix Trees