CS 6213 –
Advanced
Data
Structures
TRIES
AN EXCELLENT DATA
STRUCTURE FOR
STRINGS
Instructor
Prof. Amrinder Arora
amrinder@gwu.edu
Please copy TA on emails
Please feel free to call as well
TA
Iswarya Parupudi
iswarya2291@gwmail.gwu.edu
L6 - Tries CS 6213 - Advanced Data Structures - Arora 2
LOGISTICS
Michael T. Goodrich and Roberto Tamassia
Data Structures and Algorithms in Java (4th edition)
John Wiley & Sons, Inc.
ISBN: 0-471-73884-0
Haim Kaplan, Tel Aviv University
Jörg Liebeherr, University of Toronto
L6 - Tries CS 6213 - Advanced Data Structures - Arora 3
CREDITS
Naïve, brute force for searching a text of size n and a
pattern of size m requires O(nm) time.
Preprocessing the pattern speeds up pattern
matching queries. E.g., KMP algorithm performs
pattern matching in time proportional to the text
size: O(n)
If the text is large, immutable and searched often
(e.g., Shakespeare), we may want to preprocess the
text itself. Want to perform the searching in O(m)
time.
L6 - Tries CS 6213 - Advanced Data Structures - Arora 4
MOTIVATION
A trie is a compact data structure for representing a
set of strings, such as all the words in a text. A trie
supports pattern matching queries in time
proportional to the pattern size: O(m)
L6 - Tries CS 6213 - Advanced Data Structures - Arora 5
MOTIVATION (CONT.)
Standard Tries
Compressed Tries
Compact Representation
Suffix Trie
L6 - Tries CS 6213 - Advanced Data Structures - Arora 6
TRIES: TOPICS
 The standard trie for a set of strings S is an ordered tree
such that:
 Each node but the root is labeled with a character
 The children of a node are alphabetically ordered
 The paths from the root to the leaves yield the strings of S
 Example: set of strings S = { bear, bell, bid, bull, buy, sell, stock,
stop }
L6 - Tries CS 6213 - Advanced Data Structures - Arora 7
STANDARD TRIES
a
e
b
r
l
l
s
u
l
l
y
e t
l
l
o
c
k
p
i
d
A standard trie uses O(n) space and supports
searches, insertions and deletions in time
O(dm), where:
n total size of the strings in S
m size of the string parameter of the operation
d size of the alphabet
L6 - Tries CS 6213 - Advanced Data Structures - Arora 8
ANALYSIS OF STANDARD TRIES
a
e
b
r
l
l
s
u
l
l
y
e t
l
l
o
c
k
p
i
d
 We insert
the words of
the text into
a trie
 Each leaf
stores the
occurrences
of the
associated
word in the
text
L6 - Tries CS 6213 - Advanced Data Structures - Arora 9
WORD MATCHING WITH A TRIE
s e e b e a r ? s e l l s t o c k !
s e e b u l l ? b u y s t o c k !
b i d s t o c k !
a
a
h e t h e b e l l ? s t o p !
b i d s t o c k !
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
a r
87 88
a
e
b
l
s
u
l
e t
e
0, 24
o
c
i
l
r
6
l
78
d
47, 58
l
30
y
36
l
12
k
17, 40,
51, 62
p
84
h
e
r
69
a
 A compressed trie has
internal nodes of
degree at least two
 It is obtained from
standard trie by
compressing chains of
“redundant” nodes
L6 - Tries CS 6213 - Advanced Data Structures - Arora 10
COMPRESSED TRIES
e
b
ar ll
s
u
ll y
ell to
ck p
id
a
e
b
r
l
l
s
u
l
l
y
e t
l
l
o
c
k
p
i
d
 Compact representation of a compressed trie for an array of
strings:
 Stores at the nodes ranges of indices instead of substrings
 Uses O(s) space, where s is the number of strings in the array
 Serves as an auxiliary index structure
L6 - Tries CS 6213 - Advanced Data Structures - Arora 11
COMPACT REPRESENTATION
s e e
b e a r
s e l l
s t o c k
b u l l
b u y
b i d
h e
b e l l
s t o p
0 1 2 3 4
a rS[0] =
S[1] =
S[2] =
S[3] =
S[4] =
S[5] =
S[6] =
S[7] =
S[8] =
S[9] =
0 1 2 3 0 1 2 3
1, 1, 1
1, 0, 0 0, 0, 0
4, 1, 1
0, 2, 2
3, 1, 2
1, 2, 3 8, 2, 3
6, 1, 2
4, 2, 3 5, 2, 2 2, 2, 3 3, 3, 4 9, 3, 3
7, 0, 3
0, 1, 1
Begins with: where name like ‘x%’
Ends with: where name like ‘%x’
Substring: where name like ‘%x%’
L6 - Tries CS 6213 - Advanced Data Structures - Arora 12
STRING SEARCHES
 The suffix trie of a string X is the compressed trie of all
the suffixes of X
L6 - Tries CS 6213 - Advanced Data Structures - Arora 13
SUFFIX TRIE
e nimize
nimize ze
zei mi
mize nimize ze
m i n i z em i
0 1 2 3 4 5 6 7
Compact representation of the suffix trie for a
string X of size n from an alphabet of size d
 Uses O(n) space
 Supports arbitrary pattern matching queries in X in O(dm)
time, where m is the size of the pattern
 Can be constructed in O(n) time
L6 - Tries CS 6213 - Advanced Data Structures - Arora 14
ANALYSIS OF SUFFIX TRIES
7, 7 2, 7
2, 7 6, 7
6, 7
4, 7 2, 7 6, 7
1, 1 0, 1
m i n i z em i
0 1 2 3 4 5 6 7
Auto complete: User types “Rob” and you can type
with all words that begin with Rob, or all contacts
that begin with Rob, etc.
Sequence Assembly in Genetics Sequences
Sorting of Large Sets of Strings: BurstSort
Big Data: See “TeraSort.java” source code
L6 - Tries CS 6213 - Advanced Data Structures - Arora 15
APPLICATIONS OF TRIES
L6 - Tries CS 6213 - Advanced Data Structures - Arora 16
SAMPLE APPLICATION – IP ROUTING
Packets of Fun
L6 - Tries CS 6213 - Advanced Data Structures - Arora 17
ROUTING TABLE LOOKUP
Routing
Decision
Forwarding
Decision
Forwarding
Decision
Routing
Table
Routing
Table
Routing
Table
Switch Fabric
Output
Scheduling
A standardized exterior gateway protocol designed to
exchange routing and reachability information
between autonomous systems (AS) on the Internet.
Makes routing decisions based on paths, network
policies and/or rule-sets configured by a network
administrator.
Plays a key role in the overall operation of the
Internet and is involved in making core routing
decisions.
[Itself uses TCP to exchange its own data.]
L6 - Tries CS 6213 - Advanced Data Structures - Arora 18
BORDER GATEWAY PROTOCOL (BGP)
L6 - Tries CS 6213 - Advanced Data Structures - Arora 19
IPV4 ROUTING TABLE SIZE
Source:GeoffHuston,APNIC
Destination address Next hop
10.0.0.0/8 R1
128.143.0.0/16 R2
128.143.64.0/20
R3
128.143.192.0/20 R3
128.143.71.0/24 R4
128.143.71.55/32 R3
Default R5
With CIDR, there can be multiple
matches for a destination address in the
routing table
Longest Prefix Match: Search for the
routing table entry that has the longest
match with the prefix of the destination
IP address (Most Specific Router):
1. Search for a match on all 32 bits
2. Search for a match for 31 bits
…..
32. Search for a match on 0 bits
Needed: Data structure that supports a FAST
longest prefix match lookup!
L6 - Tries CS 6213 - Advanced Data Structures - Arora 20
ROUTING TABLE LOOKUP: LONGEST
PREFIX MATCH
128.143.71.21
The longest prefix match for
128.143.71.21 is with
128.143.71.0/24
 Datagram will be sent to R4
The following algorithms are suitable for Longest
Prefix Match routing table lookups
 Tries
 Path-Compressed Tries
 Disjoint-prefix binary Tries
 Multibit Tries
 Binary Search on Prefix
 Prefix Range Search
L6 - Tries CS 6213 - Advanced Data Structures - Arora 21
IP ADDRESS LOOKUP ALGORITHMS
t p
te to po
t p
e o
ten tea
n a
top
o
pot
o
t
A trie is a tree-based
data structure for
storing strings:
 There is one node for every
common prefix
 The strings are stored in
extra leaf nodes
 Prefixes are not only stored
at leaf nodes but also at
internal nodes
L6 - Tries CS 6213 - Advanced Data Structures - Arora 22
SLIGHTLY DIFFERENT VERSION OF TRIE
Structure
 Each leaf contains a
possible address
 Prefixes in the table are
marked (dark)
Search
 Traverse the tree
according to destination
address
 Most recent marked node
is the current longest
prefix
 Search ends when a leaf
node is reached
L6 - Tries CS 6213 - Advanced Data Structures - Arora 23
BINARY TRIE
Update
 Search for the
new entry
 Search ends
when a leaf node
is reached
 If there is no
branch to take,
insert new
node(s)
L6 - Tries CS 6213 - Advanced Data Structures - Arora 24
BINARY TRIE
z 1010*
1
z
0
 Path Compression:
 Requires to store additional information with nodes Bit number
field is added to node
 Bit string of prefixes must be explicitly stored at nodes
 Need to make comparison when searching the tree
 Goal: Eliminate long
sequences of 1-child
nodes
 Path compression 
collapses 1-child
branches
L6 - Tries CS 6213 - Advanced Data Structures - Arora 25
COMPRESSED BINARY TRIE
d
 Search: “010110”
 Root node: Inspect 1st bit and move left
 “a” node:
 Check with prefix of a (“0*”) and find a match
 Inspect 3rd bit and move left
 “b” node:
 Check with prefix of b (“01000*”) and determine that there is no match
 Search stops. Longest prefix match is with a
L6 - Tries CS 6213 - Advanced Data Structures - Arora 26
COMPRESSED BINARY TRIE
d
 Disjoint prefix:
 Nodes are split so that there is only one match for each prefix (“Leaf pushing”)
 Consequence: Internal nodes do not match with prefixes
 Results:
 a (0*) is split into: a1 (00*), a3 (010*), a2 (01001*)
 d (1*) is represented as d1 (101*)
 Multiple matches in
longest prefix rule
require backtracking
of search
 Goal: Transform tree
as to avoid multiple
matches
L6 - Tries CS 6213 - Advanced Data Structures - Arora 27
DISJOINT-PREFIX BINARY TRIE
 2-bit stride:
 1-bit prefix for a (0*) is split into 00* and 01*
 1-bit prefix for d (1*) is split into 10* and 11*
 3-bit prefix for c has been expanded to two nodes
 Why are the prefixes for b and e not expanded?
 Goal: Accelerate lookup
by inspecting more than
one bit at a time
 “Stride”: number of bits
inspected at one time
 With k-bit stride, node
has up to 2k child nodes
L6 - Tries CS 6213 - Advanced Data Structures - Arora 28
VARIABLE-STRIDE MULTIBIT TRIE
Scheme Lookup Update Memory
Binary trie O(W) O(W) O(NW)
Path-compressed trie O(W) O(W) O(NW)
k-stride multibit trie O(W/k) O(W/k+2k) O(2kNW/k)
L6 - Tries CS 6213 - Advanced Data Structures - Arora 29
COMPLEXITY OF THE LOOKUP
 Bounds are expressed for
 Look-up time: What is the longest lookup time?
 Update time: How long does it take to change an entry?
 Memory: How much memory is required to store the data structure?
 W: length of the address (32 bits)
 N: number of prefix in the routing table
Excellent data structure for managing Strings
Supports prefix and suffix kind of lookups
Extremely fast – After the Trie has been built, the
search time is O(m) where m is the size of the
pattern.
Can be used to build indexes
Various applications in areas that use Strings
(Literature/Dictionary/Content, as well as Networks
and Bioinformatics)
L6 - Tries CS 6213 - Advanced Data Structures - Arora 30
CONCLUSIONS: TRIES

Tries - Tree Based Structures for Strings

  • 1.
    CS 6213 – Advanced Data Structures TRIES ANEXCELLENT DATA STRUCTURE FOR STRINGS
  • 2.
    Instructor Prof. Amrinder Arora amrinder@gwu.edu Pleasecopy TA on emails Please feel free to call as well TA Iswarya Parupudi iswarya2291@gwmail.gwu.edu L6 - Tries CS 6213 - Advanced Data Structures - Arora 2 LOGISTICS
  • 3.
    Michael T. Goodrichand Roberto Tamassia Data Structures and Algorithms in Java (4th edition) John Wiley & Sons, Inc. ISBN: 0-471-73884-0 Haim Kaplan, Tel Aviv University Jörg Liebeherr, University of Toronto L6 - Tries CS 6213 - Advanced Data Structures - Arora 3 CREDITS
  • 4.
    Naïve, brute forcefor searching a text of size n and a pattern of size m requires O(nm) time. Preprocessing the pattern speeds up pattern matching queries. E.g., KMP algorithm performs pattern matching in time proportional to the text size: O(n) If the text is large, immutable and searched often (e.g., Shakespeare), we may want to preprocess the text itself. Want to perform the searching in O(m) time. L6 - Tries CS 6213 - Advanced Data Structures - Arora 4 MOTIVATION
  • 5.
    A trie isa compact data structure for representing a set of strings, such as all the words in a text. A trie supports pattern matching queries in time proportional to the pattern size: O(m) L6 - Tries CS 6213 - Advanced Data Structures - Arora 5 MOTIVATION (CONT.)
  • 6.
    Standard Tries Compressed Tries CompactRepresentation Suffix Trie L6 - Tries CS 6213 - Advanced Data Structures - Arora 6 TRIES: TOPICS
  • 7.
     The standardtrie for a set of strings S is an ordered tree such that:  Each node but the root is labeled with a character  The children of a node are alphabetically ordered  The paths from the root to the leaves yield the strings of S  Example: set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop } L6 - Tries CS 6213 - Advanced Data Structures - Arora 7 STANDARD TRIES a e b r l l s u l l y e t l l o c k p i d
  • 8.
    A standard trieuses O(n) space and supports searches, insertions and deletions in time O(dm), where: n total size of the strings in S m size of the string parameter of the operation d size of the alphabet L6 - Tries CS 6213 - Advanced Data Structures - Arora 8 ANALYSIS OF STANDARD TRIES a e b r l l s u l l y e t l l o c k p i d
  • 9.
     We insert thewords of the text into a trie  Each leaf stores the occurrences of the associated word in the text L6 - Tries CS 6213 - Advanced Data Structures - Arora 9 WORD MATCHING WITH A TRIE s e e b e a r ? s e l l s t o c k ! s e e b u l l ? b u y s t o c k ! b i d s t o c k ! a a h e t h e b e l l ? s t o p ! b i d s t o c k ! 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 a r 87 88 a e b l s u l e t e 0, 24 o c i l r 6 l 78 d 47, 58 l 30 y 36 l 12 k 17, 40, 51, 62 p 84 h e r 69 a
  • 10.
     A compressedtrie has internal nodes of degree at least two  It is obtained from standard trie by compressing chains of “redundant” nodes L6 - Tries CS 6213 - Advanced Data Structures - Arora 10 COMPRESSED TRIES e b ar ll s u ll y ell to ck p id a e b r l l s u l l y e t l l o c k p i d
  • 11.
     Compact representationof a compressed trie for an array of strings:  Stores at the nodes ranges of indices instead of substrings  Uses O(s) space, where s is the number of strings in the array  Serves as an auxiliary index structure L6 - Tries CS 6213 - Advanced Data Structures - Arora 11 COMPACT REPRESENTATION s e e b e a r s e l l s t o c k b u l l b u y b i d h e b e l l s t o p 0 1 2 3 4 a rS[0] = S[1] = S[2] = S[3] = S[4] = S[5] = S[6] = S[7] = S[8] = S[9] = 0 1 2 3 0 1 2 3 1, 1, 1 1, 0, 0 0, 0, 0 4, 1, 1 0, 2, 2 3, 1, 2 1, 2, 3 8, 2, 3 6, 1, 2 4, 2, 3 5, 2, 2 2, 2, 3 3, 3, 4 9, 3, 3 7, 0, 3 0, 1, 1
  • 12.
    Begins with: wherename like ‘x%’ Ends with: where name like ‘%x’ Substring: where name like ‘%x%’ L6 - Tries CS 6213 - Advanced Data Structures - Arora 12 STRING SEARCHES
  • 13.
     The suffixtrie of a string X is the compressed trie of all the suffixes of X L6 - Tries CS 6213 - Advanced Data Structures - Arora 13 SUFFIX TRIE e nimize nimize ze zei mi mize nimize ze m i n i z em i 0 1 2 3 4 5 6 7
  • 14.
    Compact representation ofthe suffix trie for a string X of size n from an alphabet of size d  Uses O(n) space  Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern  Can be constructed in O(n) time L6 - Tries CS 6213 - Advanced Data Structures - Arora 14 ANALYSIS OF SUFFIX TRIES 7, 7 2, 7 2, 7 6, 7 6, 7 4, 7 2, 7 6, 7 1, 1 0, 1 m i n i z em i 0 1 2 3 4 5 6 7
  • 15.
    Auto complete: Usertypes “Rob” and you can type with all words that begin with Rob, or all contacts that begin with Rob, etc. Sequence Assembly in Genetics Sequences Sorting of Large Sets of Strings: BurstSort Big Data: See “TeraSort.java” source code L6 - Tries CS 6213 - Advanced Data Structures - Arora 15 APPLICATIONS OF TRIES
  • 16.
    L6 - TriesCS 6213 - Advanced Data Structures - Arora 16 SAMPLE APPLICATION – IP ROUTING Packets of Fun
  • 17.
    L6 - TriesCS 6213 - Advanced Data Structures - Arora 17 ROUTING TABLE LOOKUP Routing Decision Forwarding Decision Forwarding Decision Routing Table Routing Table Routing Table Switch Fabric Output Scheduling
  • 18.
    A standardized exteriorgateway protocol designed to exchange routing and reachability information between autonomous systems (AS) on the Internet. Makes routing decisions based on paths, network policies and/or rule-sets configured by a network administrator. Plays a key role in the overall operation of the Internet and is involved in making core routing decisions. [Itself uses TCP to exchange its own data.] L6 - Tries CS 6213 - Advanced Data Structures - Arora 18 BORDER GATEWAY PROTOCOL (BGP)
  • 19.
    L6 - TriesCS 6213 - Advanced Data Structures - Arora 19 IPV4 ROUTING TABLE SIZE Source:GeoffHuston,APNIC
  • 20.
    Destination address Nexthop 10.0.0.0/8 R1 128.143.0.0/16 R2 128.143.64.0/20 R3 128.143.192.0/20 R3 128.143.71.0/24 R4 128.143.71.55/32 R3 Default R5 With CIDR, there can be multiple matches for a destination address in the routing table Longest Prefix Match: Search for the routing table entry that has the longest match with the prefix of the destination IP address (Most Specific Router): 1. Search for a match on all 32 bits 2. Search for a match for 31 bits ….. 32. Search for a match on 0 bits Needed: Data structure that supports a FAST longest prefix match lookup! L6 - Tries CS 6213 - Advanced Data Structures - Arora 20 ROUTING TABLE LOOKUP: LONGEST PREFIX MATCH 128.143.71.21 The longest prefix match for 128.143.71.21 is with 128.143.71.0/24  Datagram will be sent to R4
  • 21.
    The following algorithmsare suitable for Longest Prefix Match routing table lookups  Tries  Path-Compressed Tries  Disjoint-prefix binary Tries  Multibit Tries  Binary Search on Prefix  Prefix Range Search L6 - Tries CS 6213 - Advanced Data Structures - Arora 21 IP ADDRESS LOOKUP ALGORITHMS
  • 22.
    t p te topo t p e o ten tea n a top o pot o t A trie is a tree-based data structure for storing strings:  There is one node for every common prefix  The strings are stored in extra leaf nodes  Prefixes are not only stored at leaf nodes but also at internal nodes L6 - Tries CS 6213 - Advanced Data Structures - Arora 22 SLIGHTLY DIFFERENT VERSION OF TRIE
  • 23.
    Structure  Each leafcontains a possible address  Prefixes in the table are marked (dark) Search  Traverse the tree according to destination address  Most recent marked node is the current longest prefix  Search ends when a leaf node is reached L6 - Tries CS 6213 - Advanced Data Structures - Arora 23 BINARY TRIE
  • 24.
    Update  Search forthe new entry  Search ends when a leaf node is reached  If there is no branch to take, insert new node(s) L6 - Tries CS 6213 - Advanced Data Structures - Arora 24 BINARY TRIE z 1010* 1 z 0
  • 25.
     Path Compression: Requires to store additional information with nodes Bit number field is added to node  Bit string of prefixes must be explicitly stored at nodes  Need to make comparison when searching the tree  Goal: Eliminate long sequences of 1-child nodes  Path compression  collapses 1-child branches L6 - Tries CS 6213 - Advanced Data Structures - Arora 25 COMPRESSED BINARY TRIE d
  • 26.
     Search: “010110” Root node: Inspect 1st bit and move left  “a” node:  Check with prefix of a (“0*”) and find a match  Inspect 3rd bit and move left  “b” node:  Check with prefix of b (“01000*”) and determine that there is no match  Search stops. Longest prefix match is with a L6 - Tries CS 6213 - Advanced Data Structures - Arora 26 COMPRESSED BINARY TRIE d
  • 27.
     Disjoint prefix: Nodes are split so that there is only one match for each prefix (“Leaf pushing”)  Consequence: Internal nodes do not match with prefixes  Results:  a (0*) is split into: a1 (00*), a3 (010*), a2 (01001*)  d (1*) is represented as d1 (101*)  Multiple matches in longest prefix rule require backtracking of search  Goal: Transform tree as to avoid multiple matches L6 - Tries CS 6213 - Advanced Data Structures - Arora 27 DISJOINT-PREFIX BINARY TRIE
  • 28.
     2-bit stride: 1-bit prefix for a (0*) is split into 00* and 01*  1-bit prefix for d (1*) is split into 10* and 11*  3-bit prefix for c has been expanded to two nodes  Why are the prefixes for b and e not expanded?  Goal: Accelerate lookup by inspecting more than one bit at a time  “Stride”: number of bits inspected at one time  With k-bit stride, node has up to 2k child nodes L6 - Tries CS 6213 - Advanced Data Structures - Arora 28 VARIABLE-STRIDE MULTIBIT TRIE
  • 29.
    Scheme Lookup UpdateMemory Binary trie O(W) O(W) O(NW) Path-compressed trie O(W) O(W) O(NW) k-stride multibit trie O(W/k) O(W/k+2k) O(2kNW/k) L6 - Tries CS 6213 - Advanced Data Structures - Arora 29 COMPLEXITY OF THE LOOKUP  Bounds are expressed for  Look-up time: What is the longest lookup time?  Update time: How long does it take to change an entry?  Memory: How much memory is required to store the data structure?  W: length of the address (32 bits)  N: number of prefix in the routing table
  • 30.
    Excellent data structurefor managing Strings Supports prefix and suffix kind of lookups Extremely fast – After the Trie has been built, the search time is O(m) where m is the size of the pattern. Can be used to build indexes Various applications in areas that use Strings (Literature/Dictionary/Content, as well as Networks and Bioinformatics) L6 - Tries CS 6213 - Advanced Data Structures - Arora 30 CONCLUSIONS: TRIES