Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries
1. Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries
Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
Tokushima Univ., Japan
24th International Symposium on String Processing and Information Retrieval
26th–29th Sep. 2017, Palermo, Italy
2. Keyword Dictionaries
¨ Associative array with string keys
¤ Such as std::map<std::string, type> in C++
¨ Recent target data are massive [Martínez-Prieto+ 16]
¤ NLP, IR, Web Graph, RDF, Bioinformatics, etc.
Keyword        Value
String         1
Processing     2
And            3
Information    4
Retrieval      5

Example lookup: Information → 4
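The keyword/value table above can be reproduced with the `std::map<std::string, type>` mentioned on this slide. The following is a minimal sketch, not code from the talk; the helper names `build_dictionary` and `lookup` and the `-1` "not found" sentinel are assumptions made for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Assigns each distinct keyword its 1-based insertion rank, as in the table.
// (Hypothetical helper, not part of any library.)
std::map<std::string, int> build_dictionary(const std::vector<std::string>& keywords) {
    std::map<std::string, int> dict;
    int next_value = 1;
    for (const std::string& key : keywords) {
        // emplace leaves an existing entry untouched and reports whether it inserted.
        if (dict.emplace(key, next_value).second) {
            ++next_value;
        }
    }
    return dict;
}

// Returns the value stored for `key`, or -1 (an arbitrary sentinel for this
// sketch) when the keyword is not registered.
int lookup(const std::map<std::string, int>& dict, const std::string& key) {
    auto it = dict.find(key);
    return it == dict.end() ? -1 : it->second;
}
```

Building the dictionary from String, Processing, And, Information, Retrieval and looking up Information gives 4, matching the example lookup.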
3. Keyword Dictionaries (cont.)
¨ Research objective
¤ Engineering a practical implementation of space-efficient dynamic keyword dictionaries
[Figure: existing dictionary libraries arranged by type (static vs. dynamic) and dictionary size (small to large)]
Static:  libCSD, path_decomposed_tries, Marisa-trie, Xcdat
Dynamic: Judy, Hat-Trie, libart, Cedar
5. Trie
¨ Edge-labeled tree for storing a set of strings
¤ Search time: O(k), where k denotes the query length
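Keyword        Value
ideal$         1
ideology$      2
tec$           3
technology$    4
tie$           5
trie$          6

[Figure: trie storing the six keywords above, with edge labels such as t, ide, ec, al$, ology$, hnology$, ie$, rie$ and $, and the values at the leaves. Example query: Search technology$]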
※ Strictly speaking, this tree is a Patricia trie
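To make the O(k) search bound concrete, here is a minimal sketch of a trie with one character per edge. It is a simpler variant than the Patricia trie drawn on the slide (no multi-character edges), and the names `TrieNode`, `Trie` and the `-1` sentinel are assumptions for illustration. The search loop performs one child lookup per query character.

```cpp
#include <map>
#include <memory>
#include <string>

// One node per prefix; children are indexed by the next character.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int value = -1;  // -1 means "no keyword ends here" (sentinel for this sketch)
};

class Trie {
public:
    // Walks down from the root, creating missing nodes, then stores the value.
    void insert(const std::string& key, int value) {
        TrieNode* node = &root_;
        for (char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) {
                child.reset(new TrieNode());
            }
            node = child.get();
        }
        node->value = value;
    }

    // Follows one edge per character of the query: k steps for a query of length k.
    int search(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) {
                return -1;
            }
            node = it->second.get();
        }
        return node->value;
    }

private:
    TrieNode root_;
};
```

Each child lookup here goes through a `std::map`, so a step costs a logarithmic factor in the alphabet size; an array indexed by character would make each step constant time at the cost of more space.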
6. Path Decomposition [Ferragina+ 08]
¨ A procedure that transforms a trie to improve cache efficiency by lowering its height
¤ Application: Static compressed dictionaries [Grossi+ 14]
[Figure: the example trie transformed into a Path-Decomposed Trie (PDT), whose root node holds the whole path technology$]
7. New implementation of space-efficient dynamic keyword dictionaries
DynPDT: Dynamic Path-Decomposed Trie
11. Implementation Approaches
¨ Tree representation (with edge labels)
¤ Using a compact dynamic trie, namely m-Bonsai [Poyias+ 14]
¨ Node label management
¤ Separately storing the labels from the tree structure
¤ Pointer management is important
[Figure: DynPDT storing the keywords technology$, technics$ and technique$]
Node   Label          Incoming edge
v1     technology$    (root)
v2     cs$            (5, i) from v1
v3     ue$            (0, q) from v2
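The example above (edges labelled with a mismatch position and a character, node labels stored separately) can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: a `std::map` stands in for the m-Bonsai trie, labels live in a plain `std::vector`, and the names `DynPdtSketch`, `kNotFound` and the choice to append `$` internally are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

constexpr int kNotFound = -1;  // sentinel chosen for this sketch

class DynPdtSketch {
public:
    // Inserts `key` (must not contain '$'); returns false if already present.
    bool insert(const std::string& key, int value) {
        assert(key.find('$') == std::string::npos);
        std::string suffix = key + '$';
        if (labels_.empty()) {  // the first keyword becomes the root label
            add_node(suffix, value);
            return true;
        }
        std::size_t node = 0;
        while (true) {
            const std::string& label = labels_[node];
            std::size_t i = mismatch(suffix, label);
            if (i == suffix.size() && i == label.size()) return false;
            // '$' only terminates keys, so a real mismatch must exist here.
            assert(i < suffix.size() && i < label.size());
            EdgeKey edge(node, i, suffix[i]);
            auto it = edges_.find(edge);
            if (it == edges_.end()) {
                // Branch off: the rest of the key becomes the new node's label.
                edges_[edge] = add_node(suffix.substr(i + 1), value);
                return true;
            }
            node = it->second;
            suffix = suffix.substr(i + 1);
        }
    }

    // Returns the stored value, or kNotFound.
    int search(const std::string& key) const {
        if (labels_.empty()) return kNotFound;
        std::string suffix = key + '$';
        std::size_t node = 0;
        while (true) {
            const std::string& label = labels_[node];
            std::size_t i = mismatch(suffix, label);
            if (i == suffix.size() && i == label.size()) return values_[node];
            if (i >= suffix.size() || i >= label.size()) return kNotFound;
            auto it = edges_.find(EdgeKey(node, i, suffix[i]));
            if (it == edges_.end()) return kNotFound;
            node = it->second;
            suffix = suffix.substr(i + 1);
        }
    }

    const std::string& label(std::size_t node) const { return labels_[node]; }

private:
    // (parent node, mismatch position in parent's label, branching character)
    using EdgeKey = std::tuple<std::size_t, std::size_t, char>;

    static std::size_t mismatch(const std::string& a, const std::string& b) {
        std::size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        return i;
    }

    std::size_t add_node(const std::string& label, int value) {
        labels_.push_back(label);
        values_.push_back(value);
        return labels_.size() - 1;
    }

    std::vector<std::string> labels_;  // node labels, kept apart from the tree
    std::vector<int> values_;
    std::map<EdgeKey, std::size_t> edges_;  // stand-in for m-Bonsai
};
```

Inserting technology, technics and technique in that order yields the node labels technology$, cs$ and ue$ with edges (5, i) and (0, q), as in the figure (nodes are numbered from 0 in the sketch rather than v1 to v3).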
13. Plain Label Management (PLAIN)
¨ Introducing pointers to each label
¤ Using array P where P[v] holds the pointer to the label of node v
¤ GOOD: Accessing a label in constant time
¤ BAD: Using 64 bits for each slot (too large)
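[Figure: array P indexed by node ID (v4, v1, v2, v6, v3); each used slot points to one of the separately stored labels technology$, cs$, ue$, lly$ and cal$]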
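A minimal sketch of the PLAIN idea, assuming node IDs are slot indices in a fixed-size table. The class name `PlainLabelStore` and its methods are hypothetical; the point is only that every slot, used or not, pays for one full pointer.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class PlainLabelStore {
public:
    // One pointer slot per node ID, whether or not the node has a label.
    explicit PlainLabelStore(std::size_t num_slots) : pointers_(num_slots) {}

    // Stores a NUL-terminated copy of `label` and points P[v] at it.
    void set(std::size_t v, const std::string& label) {
        std::unique_ptr<char[]> buffer(new char[label.size() + 1]);
        std::memcpy(buffer.get(), label.c_str(), label.size() + 1);
        pointers_[v] = std::move(buffer);
    }

    // Constant-time access: a single array lookup. Returns nullptr if unset.
    const char* get(std::size_t v) const { return pointers_[v].get(); }

    // Space spent on pointers alone (typically 8 bytes per slot on 64-bit systems).
    std::size_t pointer_bytes() const {
        return pointers_.size() * sizeof(std::unique_ptr<char[]>);
    }

private:
    std::vector<std::unique_ptr<char[]>> pointers_;  // the array P
};
```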
14. Compact Label Management (BITMAP)
¨ Reducing the pointer overhead
¤ Grouping the node labels by node ID, ℓ consecutive IDs per group
¤ Concatenating the labels for each group
■ The number of pointers is divided by ℓ
¤ Using bit array B such that B[v] = 1 if node v has a label
Example with ℓ = 4 (nodes v4, v1, v2, v6 and v3 have labels):
B        1 0 1 1 | 0 1 0 1
Group 1  3lly10technology2cs
Group 2  3cal2ue
P holds one pointer per group.
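As a rough back-of-the-envelope comparison of the pointer overhead per slot, assuming group pointers are also 64 bits wide and ignoring the length prefixes stored in front of each label:

```latex
\[
\text{PLAIN: } 64 \text{ bits per slot},
\qquad
\text{BITMAP: } \frac{64}{\ell} + 1 \text{ bits per slot}
\]
\[
\ell = 4:\ \frac{64}{4} + 1 = 17 \text{ bits},
\qquad
\ell = 16:\ \frac{64}{16} + 1 = 5 \text{ bits}
\]
```

The extra 1 bit per slot is the entry in B.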
15. Compact Label Management (BITMAP)
¨ Access procedure
¤ Calculating the position of the target label within its group
■ Constant time using Popcnt
¤ Scanning the concatenated label string up to the target label
■ O(ℓ) time using a skipping technique (or constant time)
Example with ℓ = 4: to access the label of node v1, Popcnt(B[v4..v1]) = 2, so the target is the second label in Group 1 (3lly10technology2cs). Skipping over 3lly reaches 10technology.
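The access procedure can be sketched as below. This is a hedged illustration rather than the authors' code: the class name `BitmapLabelStore`, the one-byte length prefix (so labels are assumed shorter than 256 bytes), the use of one `std::string` per group in place of a raw pointer, and the restriction that the group size divides 64 are all assumptions. The rank is computed with a popcount over the group's bits that precede the target (0-based, whereas the slide counts inclusively), and the scan skips that many length-prefixed labels.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class BitmapLabelStore {
public:
    BitmapLabelStore(std::size_t num_slots, std::size_t group_size)
        : group_size_(group_size),
          bits_((num_slots + 63) / 64, 0),
          groups_((num_slots + group_size - 1) / group_size) {
        // Keeps every group inside a single 64-bit word of B.
        assert(group_size >= 1 && group_size <= 64 && 64 % group_size == 0);
    }

    bool has_label(std::size_t v) const {
        return ((bits_[v / 64] >> (v % 64)) & 1u) != 0;
    }

    // Inserts the label of node v at its rank position inside the group string.
    void set(std::size_t v, const std::string& label) {
        assert(label.size() < 256 && !has_label(v));
        std::string& group = groups_[v / group_size_];
        std::size_t offset = skip(group, rank_in_group(v));
        std::string encoded(1, static_cast<char>(label.size()));
        encoded += label;
        group.insert(offset, encoded);
        bits_[v / 64] |= std::uint64_t(1) << (v % 64);
    }

    // Popcount gives the rank; skipping `rank` labels reaches the target.
    std::string get(std::size_t v) const {
        assert(has_label(v));
        const std::string& group = groups_[v / group_size_];
        std::size_t offset = skip(group, rank_in_group(v));
        std::size_t length = static_cast<unsigned char>(group[offset]);
        return group.substr(offset + 1, length);
    }

private:
    // Number of labelled slots in v's group that come before v.
    std::size_t rank_in_group(std::size_t v) const {
        std::size_t group_begin = v - v % group_size_;
        std::uint64_t word = bits_[v / 64] >> (group_begin % 64);
        std::size_t width = v - group_begin;  // always below 64
        std::uint64_t mask = (std::uint64_t(1) << width) - 1;
        return std::bitset<64>(word & mask).count();
    }

    // Jumps over `count` length-prefixed labels; O(count) <= O(group size).
    static std::size_t skip(const std::string& group, std::size_t count) {
        std::size_t offset = 0;
        for (std::size_t i = 0; i < count; ++i) {
            offset += 1 + static_cast<unsigned char>(group[offset]);
        }
        return offset;
    }

    std::size_t group_size_;
    std::vector<std::uint64_t> bits_;  // bit array B
    std::vector<std::string> groups_;  // one concatenated label string per group
};
```

With eight slots and a group size of 4, setting slots 0, 2, 3, 5 and 7 to lly, technology, cs, cal and ue reproduces the two groups from the slide, and reading slot 2 skips one label (lly) before returning technology.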
17. Settings
¨ Machine
¤ Intel Xeon E5540 @ 2.53 GHz CPU, 32 GB RAM
¤ Ubuntu Server 16.04 LTS
¨ Datasets
¤ Wiki: Titles from English Wikipedia
■ Size: 227 MiB, Keys: 11.5M, Avg. length: 20.7 bytes
¤ WebBase: URLs from WebBase crawler
■ Size: 6.6 GiB, Keys: 118.1M, Avg. length: 60.2 bytes
¤ LUBM: URIs from LUBM benchmark
■ Size: 3.1 GiB, Keys: 52.6M, Avg. length: 63.7 bytes
18. Settings (cont.)
¨ Data structures
¤ DynPDT: PLAIN and BITMAP (ℓ = 16)
¤ m-Bonsai: As a naïve trie (not as a dictionary)
¤ Judy: Trie dictionary developed at HP Labs
¤ HAT-trie: Hybrid dictionary of trie and array hashing
¤ Cedar: Minimal-prefix double-array trie dictionary
¨ Details
¤ Language: C++
¤ Associated value type: int (4 bytes)
¤ Keyword order: random
19. Results for Space Usage
Bytes per keyword
                 Wiki    WebBase   LUBM
DynPDT+PLAIN     46.6    47.5      45.0
DynPDT+BITMAP    18.8    21.0      13.8
m-Bonsai         23.6    29.3      11.4
Judy             50.5    53.5      33.9
HAT-trie         40.2    68.9      64.7
Cedar            41.1    N/A       29.7
Annotations on the chart: 3.3x, 4.7x
20. Results for Insertion Time
Microseconds per keyword
                 Wiki    WebBase   LUBM
DynPDT+PLAIN     1.14    2.37      1.65
DynPDT+BITMAP    1.57    2.93      1.99
m-Bonsai         2.22    7.69      4.80
Judy             1.06    2.94      1.53
HAT-trie         1.13    1.75      2.58
Cedar            1.07    N/A       2.50
Annotation on the chart: 2.6x
21. Results for Search Time
Microseconds per keyword
                 Wiki    WebBase   LUBM
DynPDT+PLAIN     1.13    2.20      1.12
DynPDT+BITMAP    1.61    2.74      1.43
m-Bonsai         2.06    8.30      3.08
Judy             0.88    2.42      0.79
HAT-trie         0.35    0.80      0.51
Cedar            0.69    N/A       0.69
Annotation on the chart: 4.6x
22. Summary
¨ Proposed a new dictionary structure, DynPDT
¤ GOOD: Space efficiency, BAD: Time performance
¤ The traversal speed of m-Bonsai is a bottleneck
■ A Xorshift random number generator can solve this problem
¨ Future work
¤ To improve m-Bonsai or engineer an alternative trie
¤ To support more complex operations (possible in principle)
■ Invertible mapping between keywords and unique IDs
■ Prefix-based operations
¤ To develop and publish a useful dictionary library
23. Thank you for your attention!
My English skills are limited
If you have any questions,
please speak slowly and clearly :)
My experimental implementation is available at
https://github.com/kampersanda/dynpdt