PRACTICAL IMPLEMENTATION OF
SPACE-EFFICIENT DYNAMIC
KEYWORD DICTIONARIES
Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
26th–29th Sep. 2017, Palermo, ITALY
24th International Symposium on
String Processing and Information Retrieval
Tokushima Univ.
JAPAN
Keyword Dictionaries
¨ Associative array with string keys
¤ Such as std::map<std::string, type> in C++
¨ Recent target data are massive [Martínez-Prieto+ 16]
¤ NLP, IR, Web Graph, RDF, Bioinformatics, etc.
Keyword       Value
String        1
Processing    2
And           3
Information   4
Retrieval     5
Example lookup: Information → 4
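As a minimal sketch of the associative-array interface described above, the example table can be held in the std::map<std::string, int> that the slide mentions. The function names build_example_dictionary and lookup, and the -1 "not found" convention, are assumptions made for illustration only.

```cpp
#include <map>
#include <string>

// Build the example dictionary from the slide: each keyword maps to an integer value.
std::map<std::string, int> build_example_dictionary() {
    return {
        {"String", 1}, {"Processing", 2}, {"And", 3},
        {"Information", 4}, {"Retrieval", 5},
    };
}

// Look up a keyword; returns -1 (an assumed sentinel) when the keyword is not registered.
int lookup(const std::map<std::string, int>& dict, const std::string& key) {
    auto it = dict.find(key);
    return it == dict.end() ? -1 : it->second;
}
```

Calling lookup(dict, "Information") on the example dictionary returns 4. Because the dictionary is dynamic, a statement such as dict["Palermo"] = 6 registers a new keyword after construction.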
Keyword Dictionaries (cont.)
¨ Research objective
¤ Engineering a practical implementation of space-efficient dynamic keyword dictionaries
[Figure: existing libraries placed by type (static or dynamic) and dictionary size (small to large): libCSD, path_decomposed_tries, Marisa-trie, Xcdat, Judy, Hat-Trie, libart, Cedar]
Data structures related to our work
Trie & Path Decomposition
Trie
¨ Edge-labeled tree for storing a set of strings
¤ Search time: O(k), where k is the query length
Keyword Value
ideal$ 1
ideology$ 2
tec$ 3
technology$ 4
tie$ 5
trie$ 6
[Figure: trie storing the six keywords above, with an example search for technology$]
※ Strictly speaking, this tree is a Patricia trie
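The O(k) search bound can be illustrated with a minimal sketch of a plain trie that has one node per character (not the Patricia variant drawn on the slide). The names TrieNode, Trie and build_example_trie, and the -1 "not found" convention, are assumptions for illustration. Children are kept in a std::map, so each step costs a small logarithmic factor in the alphabet size.

```cpp
#include <map>
#include <memory>
#include <string>

// One trie node: children are indexed by the next character of the key.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool has_value = false;  // true only where a keyword ends
    int value = 0;
};

class Trie {
public:
    // Insert a keyword by walking (and creating) one node per character.
    void insert(const std::string& key, int value) {
        TrieNode* node = &root_;
        for (char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->has_value = true;
        node->value = value;
    }

    // Search follows at most k edges for a query of length k.
    // Returns -1 (an assumed sentinel) when the keyword is absent.
    int search(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return -1;
            node = it->second.get();
        }
        return node->has_value ? node->value : -1;
    }

private:
    TrieNode root_;
};

// The six keywords from the slide, each ending with the terminator '$'.
Trie build_example_trie() {
    Trie t;
    t.insert("ideal$", 1);
    t.insert("ideology$", 2);
    t.insert("tec$", 3);
    t.insert("technology$", 4);
    t.insert("tie$", 5);
    t.insert("trie$", 6);
    return t;
}
```

With this sketch, build_example_trie().search("technology$") returns 4, matching the table above.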
Path Decomposition [Ferragina+ 08]
¨ A procedure that transforms a trie to improve cache efficiency by lowering its height
¤ Application: Static compressed dictionaries [Grossi+ 14]
[Figure: Path-Decomposed Trie (PDT) obtained from the example trie]
New implementation of space-efficient dynamic keyword dictionaries
DynPDT: Dynamic Path-Decomposed Trie
Incremental Path Decomposition
¨ Incrementally constructing a PDT
¤ EX1: Add key technology$ to an empty dictionary
¤ The first node v1 stores the whole key technology$ as its label
Incremental Path Decomposition (cont.)
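¨ Incrementally constructing a PDT
¤ EX2: Add key technics$
¤ technics$ matches the label technology$ of v1 at positions 0 to 4 and differs at position 5 (i instead of o)
¤ A new node v2 is attached to v1 by the edge (5, i); it stores the remaining suffix cs$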
Incremental Path Decomposition (cont.)
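¨ Incrementally constructing a PDT
¤ EX3: Add key technique$
¤ technique$ differs from the label of v1 at position 5 with character i, so the existing edge (5, i) is followed to v2 with the remaining suffix que$
¤ que$ differs from the label cs$ of v2 at position 0 (q instead of c)
¤ A new node v3 is attached to v2 by the edge (0, q); it stores the remaining suffix ue$
¤ The resulting structure is the Dynamic Path-Decomposed Trie (DynPDT)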
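The three insertion examples can be replayed with a minimal pointer-based sketch of incremental path decomposition. It is not the authors' implementation: the slides use m-Bonsai for the tree and store labels separately, whereas this sketch keeps everything in ordinary heap nodes. The names PdtNode and ToyDynPdt are hypothetical, and keys are assumed to end with the terminator '$' so that no key is a prefix of another.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct PdtNode {
    std::string label;  // suffix of the key that created this node
    int value = 0;
    // Edge (i, c): a key diverges from `label` at offset i with character c.
    std::map<std::pair<std::size_t, char>, std::unique_ptr<PdtNode>> children;
};

class ToyDynPdt {
public:
    void insert(const std::string& key, int value) {
        if (!root_) { root_ = make_node(key, value); return; }
        PdtNode* node = root_.get();
        std::string rest = key;
        while (true) {
            const std::size_t i = mismatch(rest, node->label);
            if (i == rest.size()) { node->value = value; return; }  // key already present
            const std::pair<std::size_t, char> edge(i, rest[i]);
            auto it = node->children.find(edge);
            if (it == node->children.end()) {
                // No such edge yet: the unmatched suffix becomes a new node label.
                node->children[edge] = make_node(rest.substr(i + 1), value);
                return;
            }
            node = it->second.get();
            rest = rest.substr(i + 1);
        }
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* search(const std::string& key) const {
        const PdtNode* node = root_.get();
        std::string rest = key;
        while (node) {
            const std::size_t i = mismatch(rest, node->label);
            if (i == rest.size()) return i == node->label.size() ? &node->value : nullptr;
            auto it = node->children.find(std::make_pair(i, rest[i]));
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
            rest = rest.substr(i + 1);
        }
        return nullptr;
    }

    const PdtNode* root() const { return root_.get(); }

private:
    // Index of the first position where a and b differ (or the shorter length).
    static std::size_t mismatch(const std::string& a, const std::string& b) {
        std::size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        return i;
    }
    static std::unique_ptr<PdtNode> make_node(const std::string& label, int value) {
        std::unique_ptr<PdtNode> n(new PdtNode());
        n->label = label;
        n->value = value;
        return n;
    }
    std::unique_ptr<PdtNode> root_;
};
```

Inserting technology$, technics$ and technique$ in that order yields a root labelled technology$, a child reached by edge (5, i) labelled cs$, and a grandchild reached by edge (0, q) labelled ue$, as in EX1 to EX3.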
Implementation Approaches
¨ Tree representation (with edge labels)
¤ Using a compact dynamic trie, m-Bonsai [Poyias+ 14]
¨ Node label management
¤ Storing the labels separately from the tree structure
¤ Pointer management is important
[Figure: DynPDT for technology$, technics$ and technique$: nodes v1, v2 and v3 linked by the edges (5, i) and (0, q), with the node labels technology$, cs$ and ue$]
Plain Label Management (PLAIN)
¨ Introducing a pointer to each label
¤ Using an array P where P[v] holds the pointer to the label of node v
¤ GOOD: Accessing a label in constant time
¤ BAD: Using 64 bits for each slot (too large)
[Figure: array P with slots for nodes v4, v1, v2, v6 and v3, each pointing to one of the labels technology$, cs$, ue$, lly$ and cal$]
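A minimal sketch of the PLAIN idea, assuming node IDs are plain array indices: every slot of P owns a pointer to a separately allocated label, so access is a single lookup, but each slot pays for a full pointer whether or not it is used. The class name PlainLabels and its methods are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class PlainLabels {
public:
    explicit PlainLabels(std::size_t num_slots) : P_(num_slots) {}

    // Store a private, NUL-terminated copy of the label and point P[v] at it.
    void set(std::size_t v, const std::string& label) {
        std::unique_ptr<char[]> copy(new char[label.size() + 1]);
        std::memcpy(copy.get(), label.c_str(), label.size() + 1);
        P_[v] = std::move(copy);
    }

    // Constant-time access: one array lookup; nullptr means "no label".
    const char* get(std::size_t v) const { return P_[v].get(); }

    // Pointer overhead in bytes: one pointer per slot, used or not.
    std::size_t pointer_bytes() const { return P_.size() * sizeof(char*); }

private:
    std::vector<std::unique_ptr<char[]>> P_;
};
```

On a 64-bit platform, a table of 8 slots spends 64 bytes on pointers alone, even if only 5 slots hold labels. This is the overhead the slide marks as too large.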
Compact Label Management (BITMAP)
¨ Reducing the pointer overhead
¤ Dividing the node IDs into groups of ℓ consecutive IDs
¤ Concatenating the labels within each group (each label is preceded by its length)
■ The number of pointers is divided by ℓ
¤ Using a bit array B such that B[v] = 1 if node v has a label
Example with ℓ = 4:
Slot     v4  -   v1  v2  |  -   v6  -   v3
B        1   0   1   1   |  0   1   0   1
Group 1: P → 3lly10technology2cs
Group 2: P → 3cal2ue
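The grouping step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the names BitmapLayout and build_bitmap_layout are hypothetical, each length prefix is stored as a single byte (so labels are assumed shorter than 256 bytes), and the caller is assumed to pass slot numbers below num_slots.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct BitmapLayout {
    std::size_t ell;                  // slots per group
    std::vector<bool> B;              // B[v] is true iff slot v has a label
    std::vector<std::string> groups;  // one concatenated, length-prefixed string per group
};

// labels maps a slot number v to its label; num_slots is the size of the ID space.
BitmapLayout build_bitmap_layout(const std::map<std::size_t, std::string>& labels,
                                 std::size_t num_slots, std::size_t ell) {
    BitmapLayout out;
    out.ell = ell;
    out.B.assign(num_slots, false);
    out.groups.assign((num_slots + ell - 1) / ell, std::string());
    // std::map iterates in increasing slot order, so labels are appended in ID order.
    for (const auto& entry : labels) {
        const std::size_t v = entry.first;
        const std::string& label = entry.second;
        out.B[v] = true;
        out.groups[v / ell] += static_cast<char>(label.size());  // one-byte length prefix
        out.groups[v / ell] += label;
    }
    return out;
}
```

For the 8-slot example with ℓ = 4, the sketch produces two group strings, so two pointers replace eight. With 64-bit pointers the per-slot overhead drops from 64 bits to 64/ℓ + 1 bits (one shared pointer per group plus one bit of B), which is 5 bits for ℓ = 16.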
Compact Label Management (BITMAP)
¨ Access procedure
¤ Calculating the position of the target label within its group
■ Constant time using Popcnt
¤ Scanning the concatenated label string up to the target label
■ O(ℓ) time using the skipping technique (constant time when ℓ is a constant)
Example with ℓ = 4: to read the label of node v1, Popcnt(B[v4..v1]) = 2, so the target is the second label in Group 1 (3lly10technology2cs); the length prefix 3 is used to skip lly and reach 10technology
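The access procedure can be sketched like this, under the same assumptions as the grouping example (one-byte length prefixes) plus one more: each group's bits of B are kept in their own 64-bit word, so ℓ must be at most 64. The names BitmapLabels, encode, rank_in_group and get_label are hypothetical. Popcnt is expressed with std::bitset::count, which compilers commonly map to a hardware instruction.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct BitmapLabels {
    std::size_t ell;                  // slots per group, assumed <= 64 in this sketch
    std::vector<std::uint64_t> bits;  // bits[g]: bit j set iff slot g*ell + j has a label
    std::vector<std::string> groups;  // groups[g]: [length byte][label bytes]... in slot order
};

// Helper for building group strings: one length byte followed by the label.
std::string encode(const std::string& label) {
    return std::string(1, static_cast<char>(label.size())) + label;
}

// Popcnt over the bits of the group that precede slot v: the number of labels to skip.
std::size_t rank_in_group(const BitmapLabels& s, std::size_t v) {
    const std::size_t j = v % s.ell;
    const std::uint64_t below = s.bits[v / s.ell] & ((std::uint64_t(1) << j) - 1);
    return std::bitset<64>(below).count();
}

// Returns the label of slot v, or an empty string if the slot has no label.
std::string get_label(const BitmapLabels& s, std::size_t v) {
    const std::size_t g = v / s.ell;
    if (((s.bits[g] >> (v % s.ell)) & 1) == 0) return std::string();
    const std::string& group = s.groups[g];
    std::size_t pos = 0;
    // Skipping: jump over each preceding label using its length prefix.
    for (std::size_t k = rank_in_group(s, v); k > 0; --k) {
        pos += 1 + static_cast<unsigned char>(group[pos]);
    }
    const std::size_t len = static_cast<unsigned char>(group[pos]);
    return group.substr(pos + 1, len);
}
```

The slide counts bits inclusively (Popcnt(B[v4..v1]) = 2, the 1-based position of the target), while rank_in_group counts only the preceding bits (1 label to skip); both identify the same label. For node v1 in the example, one label (lly) is skipped and technology is returned.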
Space & Time
Experiments
Settings
¨ Machine
¤ Intel Xeon E5540 @ 2.53 GHz CPU, 32 GB RAM
¤ Ubuntu Server 16.04 LTS
¨ Datasets
¤ Wiki: Titles from English Wikipedia
■ Size: 227 MiB, Keys: 11.5M, Avg. length: 20.7 bytes
¤ WebBase: URLs from WebBase crawler
■ Size: 6.6 GiB, Keys: 118.1M, Avg. length: 60.2 bytes
¤ LUBM: URIs from LUBM benchmark
■ Size: 3.1 GiB, Keys: 52.6M, Avg. length: 63.7 bytes
Settings (cont.)
¨ Data structures
¤ DynPDT: PLAIN and BITMAP (ℓ = 16)
¤ m-Bonsai: As a naïve trie (not as a dictionary)
¤ Judy: Trie dictionary developed by HP-Lab.
¤ HAT-trie: Hybrid dictionary of trie and array hashing
¤ Cedar: Minimal-prefix double-array trie dictionary
¨ Details
¤ Language: C++
¤ Associated value type: int (4 bytes)
¤ Keyword order: random
Results for Space Usage
Bytes per keyword
                Wiki    WebBase  LUBM
DynPDT+PLAIN    46.6    47.5     45.0
DynPDT+BITMAP   18.8    21.0     13.8
m-Bonsai        23.6    29.3     11.4
Judy            50.5    53.5     33.9
HAT-trie        40.2    68.9     64.7
Cedar           41.1    N/A      29.7
Chart annotations: 3.3x, 4.7x
Results for Insertion Time
Microseconds per keyword
                Wiki    WebBase  LUBM
DynPDT+PLAIN    1.14    2.37     1.65
DynPDT+BITMAP   1.57    2.93     1.99
m-Bonsai        2.22    7.69     4.80
Judy            1.06    2.94     1.53
HAT-trie        1.13    1.75     2.58
Cedar           1.07    N/A      2.50
Chart annotation: 2.6x
Results for Search Time
Microseconds per keyword
                Wiki    WebBase  LUBM
DynPDT+PLAIN    1.13    2.20     1.12
DynPDT+BITMAP   1.61    2.74     1.43
m-Bonsai        2.06    8.30     3.08
Judy            0.88    2.42     0.79
HAT-trie        0.35    0.80     0.51
Cedar           0.69    N/A      0.69
Chart annotation: 4.6x
Summary
¨ Proposed a new dictionary structure, DynPDT
¤ GOOD: Space efficiency, BAD: Time performance
¤ The traversal speed of m-Bonsai is a bottleneck
■ A Xorshift random number generator can solve this problem
¨ Future work
¤ To improve m-Bonsai or engineer an alternative trie
¤ To support more complex operations (possible in principle)
■ Invertible mapping between keywords and unique IDs
■ Prefix-based operations
¤ To develop and publish a useful dictionary library
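For reference, the Xorshift generator named in the summary can be sketched as below. The slides do not say how it would be applied to m-Bonsai, so this shows only the generator itself, using Marsaglia's 64-bit variant with the shift triple (13, 7, 17). The function name xorshift64 is chosen for illustration.

```cpp
#include <cstdint>

// One step of a 64-bit xorshift generator with shifts (13, 7, 17).
// The state must be non-zero: zero maps to zero.
std::uint64_t xorshift64(std::uint64_t x) {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
}
```

Each step uses only three shifts and three XORs, and maps distinct states to distinct states.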
My experimental implementation is available at
https://github.com/kampersanda/dynpdt
