Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries

Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
Tokushima Univ., JAPAN
24th International Symposium on String Processing and Information Retrieval
26th–29th Sep. 2017, Palermo, ITALY

Keyword Dictionaries
¨ Associative array with string keys
¤ Such as std::map<std::string, type> in C++
¨ Recent target data are massive [Martínez-Prieto+ 16]
¤ NLP, IR, Web Graph, RDF, Bioinformatics, etc.
Keyword      Value
String       1
Processing   2
And          3
Information  4
Retrieval    5
Example lookup: Information → 4
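
As a minimal sketch of the associative-array view above, the example table can be held in a std::map<std::string, int>. The names KeywordDict, insert, search and make_example_dict are illustrative assumptions and do not come from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Illustrative keyword dictionary backed by std::map (hypothetical wrapper).
class KeywordDict {
public:
    // Registers `key` with `value`; returns false if the key already exists.
    bool insert(const std::string& key, int value) {
        return table_.emplace(key, value).second;
    }

    // Returns a pointer to the value of `key`, or nullptr if it is absent.
    const int* search(const std::string& key) const {
        const std::map<std::string, int>::const_iterator it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return table_.size(); }

private:
    std::map<std::string, int> table_;
};

// Builds the five-keyword example table, assigning values 1 to 5 in order.
inline KeywordDict make_example_dict() {
    KeywordDict dict;
    const char* words[] = {"String", "Processing", "And", "Information", "Retrieval"};
    int value = 1;
    for (const char* word : words) dict.insert(word, value++);
    return dict;
}
```

Searching the dictionary built by make_example_dict for Information yields a pointer to 4, and a keyword that was never inserted yields nullptr.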

Keyword Dictionaries (cont.)
¨ Research objective
¤ Engineering practical implementation of space-efficient dynamic keyword dictionaries
[Figure: existing dictionary libraries arranged by type (Static / Dynamic) and dictionary size (SMALL to LARGE): libCSD, path_decomposed_tries, Marisa-trie, Xcdat, Judy, Hat-Trie, libart, Cedar]

Data structures related to our work
Trie & Path Decomposition

Trie
¨ Edge-labeled tree for storing a set of strings
¤ Search time: O(k) time where k denotes the query length
Keyword      Value
ideal$       1
ideology$    2
tec$         3
technology$  4
tie$         5
trie$        6
[Figure: trie for the keywords above; each leaf holds the value of its keyword]
(root)
  ide
    al$ → 1
    ology$ → 2
  t
    ec
      $ → 3
      hnology$ → 4
    ie$ → 5
    rie$ → 6
Search technology$: t → ec → hnology$ → 4
※ Strictly speaking, this tree is a Patricia trie
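
The O(k) search can be sketched with a plain character-per-edge trie, which is simpler than the Patricia trie drawn on the slide. SimpleTrie and its members are hypothetical names; the children are kept in a std::map, so each of the k steps costs O(log σ) for alphabet size σ, which is constant for a byte alphabet.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative uncompressed trie: one edge per character (hypothetical sketch).
class SimpleTrie {
public:
    SimpleTrie() : nodes_(1) {}  // node 0 is the root

    // Inserts `key`, creating missing nodes along its path.
    void insert(const std::string& key, int value) {
        std::size_t cur = 0;
        for (char c : key) {
            const std::map<char, std::size_t>::const_iterator it =
                nodes_[cur].children.find(c);
            std::size_t next;
            if (it != nodes_[cur].children.end()) {
                next = it->second;
            } else {
                next = nodes_.size();
                nodes_.push_back(Node());
                nodes_[cur].children[c] = next;
            }
            cur = next;
        }
        nodes_[cur].terminal = true;
        nodes_[cur].value = value;
    }

    // Follows one edge per query character, so the loop runs k times.
    const int* search(const std::string& key) const {
        std::size_t cur = 0;
        for (char c : key) {
            const std::map<char, std::size_t>::const_iterator it =
                nodes_[cur].children.find(c);
            if (it == nodes_[cur].children.end()) return nullptr;
            cur = it->second;
        }
        return nodes_[cur].terminal ? &nodes_[cur].value : nullptr;
    }

private:
    struct Node {
        std::map<char, std::size_t> children;  // edge character -> child node ID
        bool terminal;
        int value;
        Node() : terminal(false), value(0) {}
    };
    std::vector<Node> nodes_;
};
```

With the six keywords of the slide inserted, searching technology$ walks eleven edges and returns a pointer to 4.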

Path Decomposition [Ferragina+ 08]
¨ Procedure of transforming a trie to improve the cache efficiency by lowering the height
¤ Application: Static compressed dictionaries [Grossi+ 14]
[Figure: the trie from the previous slide and its Path-Decomposed Trie (PDT); the PDT root holds the path label technology$ and has child edges labeled i, r, $ and i]

New implementation of space-efficient dynamic
keyword dictionaries
DynPDT: Dynamic Path-Decomposed Trie

Incremental Path Decomposition
¨ Incrementally constructing a PDT
EX1 Add key technology$ to an empty dictionary
[Figure: a single node v1 with label technology$]

Incremental Path Decomposition (cont.)
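EX2 Add key technics$
¤ technics$ first differs from the label technology$ of v1 at position 5 (counting from 0), where it has the character i
¤ A new node v2 with the remaining suffix cs$ as its label is attached to v1 by edge (5, i)
[Figure: v1 (technology$) with child v2 (cs$) via edge (5, i)]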

Incremental Path Decomposition (cont.)
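EX3 Add key technique$
¤ technique$ follows edge (5, i) from v1 to v2, leaving the suffix que$
¤ que$ differs from the label cs$ of v2 at position 0, where it has the character q
¤ A new node v3 with the remaining suffix ue$ as its label is attached to v2 by edge (0, q)
[Figure: Dynamic Path-Decomposed Trie (DynPDT) with v1 (technology$), v2 (cs$) via edge (5, i), and v3 (ue$) via edge (0, q)]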
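
The three insertion examples can be reproduced with the following minimal sketch of incremental path decomposition. It is not the authors' implementation: DynPdtSketch and its members are hypothetical, the tree is kept in ordinary std::map containers rather than a compact trie, node IDs are 0-based (the slide's v1 is node 0), and every key is assumed to end with a terminator $ that appears nowhere else in the key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a dynamic path-decomposed trie.
class DynPdtSketch {
public:
    // Inserts `key` and returns the ID of the node that stores its value.
    std::size_t insert(const std::string& key, int value) {
        assert(!key.empty() && key.back() == '$');
        if (nodes_.empty()) {  // EX1: the first key becomes the root label
            nodes_.push_back(Node{key, value, Children()});
            return 0;
        }
        std::size_t v = 0;
        std::string rest = key;
        for (;;) {
            const std::size_t i = first_mismatch(nodes_[v].label, rest);
            if (i == rest.size()) {  // rest equals the label: key already stored
                assert(i == nodes_[v].label.size());
                nodes_[v].value = value;
                return v;
            }
            const Edge edge(i, rest[i]);
            const Children::const_iterator it = nodes_[v].children.find(edge);
            if (it == nodes_[v].children.end()) {  // EX2, EX3: attach a new node
                const std::size_t child = nodes_.size();
                nodes_[v].children.emplace(edge, child);
                nodes_.push_back(Node{rest.substr(i + 1), value, Children()});
                return child;
            }
            v = it->second;  // follow the existing edge (position, character)
            rest = rest.substr(i + 1);
        }
    }

    // Returns a pointer to the value of `key`, or nullptr if it is not stored.
    const int* search(const std::string& key) const {
        if (nodes_.empty()) return nullptr;
        std::size_t v = 0;
        std::string rest = key;
        for (;;) {
            const std::size_t i = first_mismatch(nodes_[v].label, rest);
            if (i == rest.size()) {
                return i == nodes_[v].label.size() ? &nodes_[v].value : nullptr;
            }
            const Children::const_iterator it =
                nodes_[v].children.find(Edge(i, rest[i]));
            if (it == nodes_[v].children.end()) return nullptr;
            v = it->second;
            rest = rest.substr(i + 1);
        }
    }

    const std::string& label(std::size_t v) const { return nodes_[v].label; }

private:
    typedef std::pair<std::size_t, char> Edge;    // (mismatch position, character)
    typedef std::map<Edge, std::size_t> Children; // edge -> child node ID
    struct Node {
        std::string label;  // node label: the suffix left when the node was created
        int value;
        Children children;
    };

    // Index of the first position where a and b differ (or the shorter length).
    static std::size_t first_mismatch(const std::string& a, const std::string& b) {
        const std::size_t n = std::min(a.size(), b.size());
        std::size_t i = 0;
        while (i < n && a[i] == b[i]) ++i;
        return i;
    }

    std::vector<Node> nodes_;
};
```

Inserting technology$, technics$ and technique$ in this order creates nodes 0, 1 and 2 with labels technology$, cs$ and ue$, matching v1, v2 and v3 in the examples.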

Implementation Approaches
¨ Tree representation (with edge labels)
¤ Using a compact dynamic trie, m-Bonsai [Poyias+ 14]
¨ Node label management
¤ Separately storing the labels from the tree structure
¤ Pointer management is important
[Figure: the tree structure (nodes v1, v2, v3 with edges (5, i) and (0, q)) stored separately from the node labels technology$, cs$ and ue$, for the keys technology$, technics$ and technique$]

Plain Label Management (PLAIN)
¨ Introducing pointers to each label
¤ Using array P where P[v] holds the pointer to the label of node v
¤ GOOD: Accessing a label in constant time
¤ BAD: Using 64 bits for each slot (too large)
[Figure: array P with slots for nodes v4, v1, v2, v6 and v3, each pointing to its label among technology$, cs$, ue$, lly$ and cal$]
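
A minimal sketch of the PLAIN idea, assuming node IDs are slot indices: one owning pointer per slot gives constant-time access, but every slot pays for a pointer whether or not it has a label. PlainLabels and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of plain label management: array P of per-slot pointers.
class PlainLabels {
    typedef std::unique_ptr<char[]> Pointer;

public:
    // Stores a copy of `label` for slot v, growing P when v is out of range.
    void set(std::size_t v, const std::string& label) {
        if (v >= P_.size()) P_.resize(v + 1);
        P_[v].reset(new char[label.size() + 1]);
        std::memcpy(P_[v].get(), label.c_str(), label.size() + 1);
    }

    // Constant-time access: P[v] points directly at the label (or is null).
    const char* get(std::size_t v) const {
        return v < P_.size() ? P_[v].get() : nullptr;
    }

    // Space spent on pointers alone, including slots that hold no label.
    std::size_t pointer_bytes() const { return P_.size() * sizeof(Pointer); }

private:
    std::vector<Pointer> P_;
};
```

On a typical 64-bit platform each element of P occupies 8 bytes, which is the per-slot overhead the slide calls too large.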

Compact Label Management (BITMAP)
¨ Reducing the pointer overhead
¤ Dividing the node IDs into groups of ℓ consecutive IDs
¤ Concatenating the labels for each group
▪ The number of pointers is divided by ℓ
¤ Using bit array B such that B[v] = 1 if node v has a label
[Figure, for ℓ = 4: eight slots, of which v4, v1, v2, v6 and v3 have labels]
B = 1 0 1 1 | 0 1 0 1
Group 1: 3lly 10technology 2cs
Group 2: 3cal 2ue
P holds one pointer per group; each label is prefixed with its length

Compact Label Management (BITMAP)
¨ Access procedure
¤ Calculating the target label position in the group
▪ Constant time using Popcnt
¤ Scanning the concatenated label string until the target label
▪ O(ℓ) time using the skipping technique (constant time when ℓ is a constant)
[Figure, for ℓ = 4: accessing the label of node v1 in Group 1 (3lly 10technology 2cs)]
Popcnt(B[v4..v1]) = 2, so the label of node v1 is the 2nd label of its group
Skipping the 1st label (3lly) by its length prefix reaches 10technology
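
The grouping and the access procedure can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: BitmapLabels and its members are hypothetical, each group keeps its bits of B in one 64-bit word (so ℓ ≤ 64), each label is stored with a one-byte length prefix (so labels are shorter than 256 bytes), and std::bitset::count stands in for the Popcnt instruction. The sketch counts the labeled slots strictly before the target, which is the number of labels to skip.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of compact label management with a bitmap.
class BitmapLabels {
public:
    explicit BitmapLabels(std::size_t num_slots, std::size_t ell = 4)
        : ell_(ell), bits_((num_slots + ell - 1) / ell), groups_(bits_.size()) {
        assert(ell >= 1 && ell <= 64);
    }

    bool has(std::size_t v) const {
        return (bits_[v / ell_] >> (v % ell_)) & 1u;
    }

    // Stores `label` for slot v, which must not have a label yet.
    void set(std::size_t v, const std::string& label) {
        assert(label.size() < 256 && !has(v));
        const std::size_t g = v / ell_, r = v % ell_;
        const std::size_t pos = skip(groups_[g], rank_before(g, r));
        std::string encoded(1, static_cast<char>(label.size()));
        encoded += label;
        groups_[g].insert(pos, encoded);          // keep labels in slot order
        bits_[g] |= std::uint64_t(1) << r;        // B[v] = 1
    }

    // Returns the label of slot v, or an empty string if it has none.
    std::string get(std::size_t v) const {
        if (!has(v)) return std::string();
        const std::size_t g = v / ell_, r = v % ell_;
        const std::string& s = groups_[g];
        const std::size_t pos = skip(s, rank_before(g, r));
        const std::size_t len = static_cast<unsigned char>(s[pos]);
        return s.substr(pos + 1, len);
    }

private:
    // Number of labeled slots in group g strictly before offset r (popcount).
    std::size_t rank_before(std::size_t g, std::size_t r) const {
        const std::uint64_t mask = (std::uint64_t(1) << r) - 1;
        return std::bitset<64>(bits_[g] & mask).count();
    }

    // Byte offset reached after skipping k length-prefixed labels.
    static std::size_t skip(const std::string& s, std::size_t k) {
        std::size_t pos = 0;
        while (k-- > 0) pos += 1 + static_cast<unsigned char>(s[pos]);
        return pos;
    }

    std::size_t ell_;
    std::vector<std::uint64_t> bits_;   // bit array B, one word per group
    std::vector<std::string> groups_;   // concatenated labels, one string per group
};
```

With eight slots and ℓ = 4, storing lly, technology, cs, cal and ue at slots 0, 2, 3, 5 and 7 reproduces the layout of the figure, and reading slot 2 skips one label before returning technology.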

Settings
¨ Machine
¤ Intel Xeon E5540 @2.53 GHz CPU, 32GB RAM
¤ Ubuntu Server 16.04 LTS
¨ Datasets
¤ Wiki: Titles from English Wikipedia
▪ Size: 227MiB, Keys: 11.5M, Ave. length: 20.7 bytes
¤ WebBase: URLs from WebBase crawler
▪ Size: 6.6GiB, Keys: 118.1M, Ave. length: 60.2 bytes
¤ LUBM: URIs from LUBM benchmark
▪ Size: 3.1GiB, Keys: 52.6M, Ave. length: 63.7 bytes

Settings (cont.)
¨ Data structures
¤ DynPDT: PLAIN and BITMAP (ℓ = 16)
¤ m-Bonsai: As a naïve trie (not as a dictionary)
¤ Judy: Trie dictionary developed at HP Labs
¤ HAT-trie: Hybrid dictionary of trie and array hashing
¤ Cedar: Minimal-prefix double-array trie dictionary
¨ Details
¤ Language: C++
¤ Associated value type: int (4 bytes)
¤ Keyword order: random

Results for Space Usage
[Figure: bar chart of space usage in bytes per keyword]
               Wiki  WebBase  LUBM
DynPDT+PLAIN   46.6  47.5     45.0
DynPDT+BITMAP  18.8  21.0     13.8
m-Bonsai       23.6  29.3     11.4
Judy           50.5  53.5     33.9
HAT-trie       40.2  68.9     64.7
Cedar          41.1  N/A      29.7
Ratios annotated on the chart: 3.3x, 4.7x

Results for Insertion Time
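[Figure: bar chart of insertion time in microseconds per keyword]
               Wiki  WebBase  LUBM
DynPDT+PLAIN   1.14  2.37     1.65
DynPDT+BITMAP  1.57  2.93     1.99
m-Bonsai       2.22  7.69     4.80
Judy           1.06  2.94     1.53
HAT-trie       1.13  1.75     2.58
Cedar          1.07  N/A      2.50
Ratio annotated on the chart: 2.6x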

Results for Search Time
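[Figure: bar chart of search time in microseconds per keyword]
               Wiki  WebBase  LUBM
DynPDT+PLAIN   1.13  2.20     1.12
DynPDT+BITMAP  1.61  2.74     1.43
m-Bonsai       2.06  8.30     3.08
Judy           0.88  2.42     0.79
HAT-trie       0.35  0.80     0.51
Cedar          0.69  N/A      0.69
Ratio annotated on the chart: 4.6x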

Summary
¨ Proposed a new dictionary structure, DynPDT
¤ GOOD: Space efficiency, BAD: Time performance
¤ The traversal speed of m-Bonsai is a bottleneck
▪ Xorshift random number generator can solve this problem
¨ Future work
¤ To improve m-Bonsai or engineer an alternative trie
¤ To support more complex operations (possible in principle)
▪ Invertible mapping between keywords and unique IDs
▪ Prefix-based operations
¤ To develop and publish a useful dictionary library

Thank you for your attention!
My English skills are limited
If you have any questions,
please speak slowly and clearly :)
My experimental implementation is available at
https://github.com/kampersanda/dynpdt
