Tries - Tree Based Structures for Strings

CS 6213 –
Advanced
Data
Structures
TRIES
AN EXCELLENT DATA
STRUCTURE FOR
STRINGS

Instructor
Prof. Amrinder Arora
amrinder@gwu.edu
Please copy TA on emails
Please feel free to call as well
TA
Iswarya Parupudi
iswarya2291@gwmail.gwu.edu
L6 - Tries CS 6213 - Advanced Data Structures - Arora 2
LOGISTICS

Michael T. Goodrich and Roberto Tamassia
Data Structures and Algorithms in Java (4th edition)
John Wiley & Sons, Inc.
ISBN: 0-471-73884-0
Haim Kaplan, Tel Aviv University
Jörg Liebeherr, University of Toronto
CREDITS

Naïve, brute force for searching a text of size n and a
pattern of size m requires O(nm) time.
Preprocessing the pattern speeds up pattern
matching queries. E.g., KMP algorithm performs
pattern matching in time proportional to the text
size: O(n)
If the text is large, immutable and searched often
(e.g., Shakespeare), we may want to preprocess the
text itself. Want to perform the searching in O(m)
time.
MOTIVATION

A trie is a compact data structure for representing a
set of strings, such as all the words in a text. A trie
supports pattern matching queries in time
proportional to the pattern size: O(m)
MOTIVATION (CONT.)

Standard Tries
Compressed Tries
Compact Representation
Suffix Trie
TRIES: TOPICS

 The standard trie for a set of strings S is an ordered tree
such that:
 Each node but the root is labeled with a character
 The children of a node are alphabetically ordered
 The paths from the root to the leaves yield the strings of S
 Example: set of strings S = { bear, bell, bid, bull, buy, sell, stock,
stop }
STANDARD TRIES
a
e
b
r
l
l
s
u
l
l
y
e t
l
l
o
c
k
p
i
d

A standard trie uses O(n) space and supports
searches, insertions and deletions in time
O(dm), where:
n total size of the strings in S
m size of the string parameter of the operation
d size of the alphabet
ANALYSIS OF STANDARD TRIES
a
e
b
r
l
l
s
u
l
l
y
e t
l
l
o
c
k
p
i
d

 We insert
the words of
the text into
a trie
 Each leaf
stores the
occurrences
of the
associated
word in the
text
WORD MATCHING WITH A TRIE
s e e b e a r ? s e l l s t o c k !
s e e b u l l ? b u y s t o c k !
b i d s t o c k !
a
a
h e t h e b e l l ? s t o p !
b i d s t o c k !
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
a r
87 88
a
e
b
l
s
u
l
e t
e
0, 24
o
c
i
l
r
6
l
78
d
47, 58
l
30
y
36
l
12
k
17, 40,
51, 62
p
84
h
e
r
69
a

 A compressed trie has
internal nodes of
degree at least two
 It is obtained from
standard trie by
compressing chains of
“redundant” nodes
COMPRESSED TRIES
e
b
ar ll
s
u
ll y
ell to
ck p
id
a
e
b
r
l
l
s
u
l
l
y
e t
l
l
o
c
k
p
i
d

 Compact representation of a compressed trie for an array of
strings:
 Stores at the nodes ranges of indices instead of substrings
 Uses O(s) space, where s is the number of strings in the array
 Serves as an auxiliary index structure
COMPACT REPRESENTATION
s e e
b e a r
s e l l
s t o c k
b u l l
b u y
b i d
h e
b e l l
s t o p
0 1 2 3 4
a rS[0] =
S[1] =
S[2] =
S[3] =
S[4] =
S[5] =
S[6] =
S[7] =
S[8] =
S[9] =
0 1 2 3 0 1 2 3
1, 1, 1
1, 0, 0 0, 0, 0
4, 1, 1
0, 2, 2
3, 1, 2
1, 2, 3 8, 2, 3
6, 1, 2
4, 2, 3 5, 2, 2 2, 2, 3 3, 3, 4 9, 3, 3
7, 0, 3
0, 1, 1

Begins with: where name like ‘x%’
Ends with: where name like ‘%x’
Substring: where name like ‘%x%’
STRING SEARCHES

 The suffix trie of a string X is the compressed trie of all
the suffixes of X
SUFFIX TRIE
e nimize
nimize ze
zei mi
mize nimize ze
m i n i z em i
0 1 2 3 4 5 6 7

Compact representation of the suffix trie for a
string X of size n from an alphabet of size d
 Uses O(n) space
 Supports arbitrary pattern matching queries in X in O(dm)
time, where m is the size of the pattern
 Can be constructed in O(n) time
ANALYSIS OF SUFFIX TRIES
7, 7 2, 7
2, 7 6, 7
6, 7
4, 7 2, 7 6, 7
1, 1 0, 1
m i n i z em i
0 1 2 3 4 5 6 7

Auto complete: User types “Rob” and you can type
with all words that begin with Rob, or all contacts
that begin with Rob, etc.
Sequence Assembly in Genetics Sequences
Sorting of Large Sets of Strings: BurstSort
Big Data: See “TeraSort.java” source code
APPLICATIONS OF TRIES

SAMPLE APPLICATION – IP ROUTING
Packets of Fun

ROUTING TABLE LOOKUP
Routing
Decision
Forwarding
Decision
Forwarding
Decision
Routing
Table
Routing
Table
Routing
Table
Switch Fabric
Output
Scheduling

A standardized exterior gateway protocol designed to
exchange routing and reachability information
between autonomous systems (AS) on the Internet.
Makes routing decisions based on paths, network
policies and/or rule-sets configured by a network
administrator.
Plays a key role in the overall operation of the
Internet and is involved in making core routing
decisions.
[Itself uses TCP to exchange its own data.]
BORDER GATEWAY PROTOCOL (BGP)

IPV4 ROUTING TABLE SIZE
Source:GeoffHuston,APNIC

Destination address Next hop
10.0.0.0/8 R1
128.143.0.0/16 R2
128.143.64.0/20
R3
128.143.192.0/20 R3
128.143.71.0/24 R4
128.143.71.55/32 R3
Default R5
With CIDR, there can be multiple
matches for a destination address in the
routing table
Longest Prefix Match: Search for the
routing table entry that has the longest
match with the prefix of the destination
IP address (Most Specific Router):
1. Search for a match on all 32 bits
2. Search for a match for 31 bits
…..
32. Search for a match on 0 bits
Needed: Data structure that supports a FAST
longest prefix match lookup!
ROUTING TABLE LOOKUP: LONGEST
PREFIX MATCH
128.143.71.21
The longest prefix match for
128.143.71.21 is with
128.143.71.0/24
 Datagram will be sent to R4

The following algorithms are suitable for Longest
Prefix Match routing table lookups
 Tries
 Path-Compressed Tries
 Disjoint-prefix binary Tries
 Multibit Tries
 Binary Search on Prefix
 Prefix Range Search
IP ADDRESS LOOKUP ALGORITHMS

t p
te to po
t p
e o
ten tea
n a
top
o
pot
o
t
A trie is a tree-based
data structure for
storing strings:
 There is one node for every
common prefix
 The strings are stored in
extra leaf nodes
 Prefixes are not only stored
at leaf nodes but also at
internal nodes
SLIGHTLY DIFFERENT VERSION OF TRIE

Structure
 Each leaf contains a
possible address
 Prefixes in the table are
marked (dark)
Search
 Traverse the tree
according to destination
address
 Most recent marked node
is the current longest
prefix
 Search ends when a leaf
node is reached
BINARY TRIE

Update
 Search for the
new entry
 Search ends
when a leaf node
is reached
 If there is no
branch to take,
insert new
node(s)
BINARY TRIE
z 1010*
1
z
0

 Path Compression:
 Requires to store additional information with nodes Bit number
field is added to node
 Bit string of prefixes must be explicitly stored at nodes
 Need to make comparison when searching the tree
 Goal: Eliminate long
sequences of 1-child
nodes
 Path compression 
collapses 1-child
branches
COMPRESSED BINARY TRIE
d

 Search: “010110”
 Root node: Inspect 1st bit and move left
 “a” node:
 Check with prefix of a (“0*”) and find a match
 Inspect 3rd bit and move left
 “b” node:
 Check with prefix of b (“01000*”) and determine that there is no match
 Search stops. Longest prefix match is with a
COMPRESSED BINARY TRIE
d

 Disjoint prefix:
 Nodes are split so that there is only one match for each prefix (“Leaf pushing”)
 Consequence: Internal nodes do not match with prefixes
 Results:
 a (0*) is split into: a1 (00*), a3 (010*), a2 (01001*)
 d (1*) is represented as d1 (101*)
 Multiple matches in
longest prefix rule
require backtracking
of search
 Goal: Transform tree
as to avoid multiple
matches
DISJOINT-PREFIX BINARY TRIE

 2-bit stride:
 1-bit prefix for a (0*) is split into 00* and 01*
 1-bit prefix for d (1*) is split into 10* and 11*
 3-bit prefix for c has been expanded to two nodes
 Why are the prefixes for b and e not expanded?
 Goal: Accelerate lookup
by inspecting more than
one bit at a time
 “Stride”: number of bits
inspected at one time
 With k-bit stride, node
has up to 2k child nodes
VARIABLE-STRIDE MULTIBIT TRIE

Scheme Lookup Update Memory
Binary trie O(W) O(W) O(NW)
Path-compressed trie O(W) O(W) O(NW)
k-stride multibit trie O(W/k) O(W/k+2k) O(2kNW/k)
COMPLEXITY OF THE LOOKUP
 Bounds are expressed for
 Look-up time: What is the longest lookup time?
 Update time: How long does it take to change an entry?
 Memory: How much memory is required to store the data structure?
 W: length of the address (32 bits)
 N: number of prefix in the routing table

Excellent data structure for managing Strings
Supports prefix and suffix kind of lookups
Extremely fast – After the Trie has been built, the
search time is O(m) where m is the size of the
pattern.
Can be used to build indexes
Various applications in areas that use Strings
(Literature/Dictionary/Content, as well as Networks
and Bioinformatics)
CONCLUSIONS: TRIES

Tries - Tree Based Structures for Strings

More Related Content

What's hot

Viewers also liked

Similar to Tries - Tree Based Structures for Strings

More from Amrinder Arora

Recently uploaded

Tries - Tree Based Structures for Strings