Faster Practical Block Compression
for Rank/Select Dictionaries
Yusaku Kaneta | yusaku.kaneta@rakuten.com
Rakuten Institute of Technology, Rakuten, Inc.
Background
§ Compressed data structures in Web companies.
• Web companies generate massive amounts of logs in text formats.
• Analyzing such huge logs is vital for our decision making.
• Practical improvements of compressed data structures are important.
§ RRR compression [Raman, Raman, Rao, SODA’02]
• Basic building block in many compressed data structures.
• Rank/Select queries on compressed bit strings in constant time:
‣ Rank_b(B, i): Number of b’s in B’s prefix of length i.
‣ Select_b(B, i): Position of B’s i-th b.
Here B is the input bit string and b is a bit in {0, 1}.
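To make the two query definitions concrete, here is a minimal linear-time sketch on an uncompressed bit list. It only illustrates what the queries return; it is not the constant-time RRR structure, and the function names and the 0-based position convention for select are assumptions of this sketch.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b in the prefix bits[0:i]."""
    return sum(1 for x in bits[:i] if x == b)


def select(bits, b, i):
    """0-based position of the i-th occurrence (i >= 1) of bit b, or -1 if absent."""
    seen = 0
    for pos, x in enumerate(bits):
        if x == b:
            seen += 1
            if seen == i:
                return pos
    return -1


B = [1, 0, 1, 1, 0, 0, 1, 0]
print(rank(B, 1, 4))    # ones in the prefix 1,0,1,1
print(select(B, 1, 3))  # position of the third one
```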
RRR = Block compression + succinct index
§ Represents a block B of w bits as a pair (class(B), offset(B)).
• class(B): Number of ones in B.
• offset(B): Number of blocks of the same class as B that precede B in some fixed order (e.g., lexicographical order of bit strings).
§ $\log w$ bits for class(B) and $\log_2 \binom{w}{\mathrm{class}(B)}$ bits for offset(B).
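The (class, offset) pair and its code lengths can be sketched by brute force for small w. This assumes lexicographical order with the most significant bit first (which coincides with numeric order for fixed-length blocks) and rounds the bit counts up to whole bits; the helper names are hypothetical.

```python
from math import comb


def class_of(block):
    """class(B): number of ones in the block."""
    return bin(block).count("1")


def offset_of(block):
    """offset(B): number of same-class blocks that are numerically smaller."""
    k = class_of(block)
    return sum(1 for x in range(block) if bin(x).count("1") == k)


def code_lengths(w, k):
    """Whole bits needed for a class in 0..w and for an offset in 0..C(w,k)-1."""
    class_bits = w.bit_length()
    offset_bits = (comb(w, k) - 1).bit_length()
    return class_bits, offset_bits


# The 4-bit blocks of class 2 in order are 0011, 0101, 0110, 1001, 1010, 1100.
print(class_of(0b0110), offset_of(0b0110))
print(code_lengths(15, 7))
```

For w = 15 the class field takes 4 bits regardless of the block's content, which is the fixed per-block cost that becomes noticeable for small w.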
§ Two practical approaches to block compression:
• Blockwise approach [Claude and Navarro, SPIRE’09]
• Bitwise approach [Navarro and Providel, SEA’12]
Block compression in practice
1. Blockwise approach [Claude and Navarro, SPIRE’09]
Idea: $O(2^w w)$-bit universal tables.
Good: O(1) time.
Bad: Low compression ratio.
§ The tables limit use of larger w.
§ $\log w$ bits for class(B) become non-negligible.
§ Ex) 25% overhead for w = 15.
2. Bitwise approach [Navarro and Providel, SEA’12]
Idea: $O(w^3)$-bit binomial coefficients.
Good: High compression ratio.
Bad: O(w) time.
§ Count bit strings lexicographically smaller than block B bit by bit.
§ In practice, heuristics of encoding and decoding blocks with a few ones in O(1) time can be used.
Less flexible in practice
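The bitwise idea of counting lexicographically smaller bit strings one bit at a time can be sketched as follows. This is an illustration under assumptions, not the cited implementation: it calls `math.comb` where a real encoder would look up a precomputed table of binomial coefficients, it reads bits most significant first, and the function names are hypothetical.

```python
from math import comb


def bitwise_encode(block, w):
    """Return (class, offset) of a w-bit block, scanning one bit per step."""
    k = bin(block).count("1")
    offset, ones_left = 0, k
    for pos in range(w):
        remaining = w - pos - 1          # bits to the right of this one
        if (block >> remaining) & 1:
            # Every string that agrees so far but has 0 here is smaller.
            offset += comb(remaining, ones_left)
            ones_left -= 1
    return k, offset


def bitwise_decode(k, offset, w):
    """Inverse of bitwise_encode: rebuild the block from (class, offset)."""
    block, ones_left = 0, k
    for pos in range(w):
        remaining = w - pos - 1
        smaller = comb(remaining, ones_left)  # strings with 0 at this bit
        if ones_left > 0 and offset >= smaller:
            block |= 1 << remaining
            offset -= smaller
            ones_left -= 1
    return block


print(bitwise_encode(0b0110, 4))
print(bin(bitwise_decode(2, 2, 4)))
```

Both loops run w iterations, which matches the O(w) time listed for the bitwise approach.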
Main result
§ Practical encoder/decoder for block compression
• Generalization of existing blockwise and bitwise approaches.
• Idea: chunkwise processing with multiple universal tables.
• Faster and more stable on artificial data.
Method                                      Encode   Decode           Space (in bits)
Blockwise [Claude and Navarro, SPIRE’09]    O(1)     O(1)             $O(2^w w)$
Bitwise [Navarro and Providel, SEA’12]      O(w)     O(w)             $O(w^3)$
Chunkwise (This work)                       O(w/t)   O((w/t) log t)   $O(w^3 + 2^t t)$
This talk uses w and t for block and chunk lengths, respectively.
Our algorithm
Overview of our algorithm
§ Main idea: Process a block B in a chunkwise fashion.
• B_i: The i-th chunk of length t. (Suppose t divides w.)
‣ Encoded/Decoded in O(1) time using $O(2^t t)$-bit universal tables.
• Efficiently count up blocks X satisfying X < B by using a combination formula and chunkwise order:
‣ Combination formula: the number of blocks X of n bits with m ones whose first chunk (t bits) has c ones is $\binom{t}{c} \times \binom{n-t}{m-c}$.
‣ Chunkwise order X < B: a lexicographical order over chunks in which X_i precedes B_i when
1. class(X_i) < class(B_i), or
2. class(X_i) = class(B_i) and offset(X_i) < offset(B_i).
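The combination formula can be checked by brute force on small parameters. The sketch below enumerates all n-bit blocks and compares the count with the product of the two binomial coefficients; the function name and the choice n = 8, m = 3, t = 4 are only for illustration.

```python
from math import comb


def count_by_enumeration(n, m, t, c):
    """Count n-bit blocks with m ones whose leading t-bit chunk has c ones."""
    count = 0
    for x in range(1 << n):
        first_chunk = x >> (n - t)
        if bin(x).count("1") == m and bin(first_chunk).count("1") == c:
            count += 1
    return count


n, m, t = 8, 3, 4
for c in range(min(t, m) + 1):
    assert count_by_enumeration(n, m, t, c) == comb(t, c) * comb(n - t, m - c)

# Summing over all c recovers the size of the whole class (Vandermonde's identity).
print(sum(comb(t, c) * comb(n - t, m - c) for c in range(min(t, m) + 1)), comb(n, m))
```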
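Block encoding in O(w/t) time
Lemma: Block encoding can be implemented in O(w/t) time with $O(w^3 + 2^t t)$-bit universal tables.
Notation: n_i and c_i are the numbers of bits and ones in B_0···B_{i-1}, and o_i is the number of blocks X of the same class as B whose first i chunks precede those of B:
$$o_i = \left|\{\, X : X_0 \cdots X_{i-1} < B_0 \cdots B_{i-1} \,\}\right|$$
The remaining part B_{i+1}···B_{w/t-1} has w − n_i − t bits and class(B) − c_i − class(B_i) ones.
o_{i+1} is obtained from o_i by adding the blocks X that agree with B on the first i chunks and satisfy X_i < B_i:
1. class(X_i) < class(B_i). Idea: Summation
$$\sum_{c=0}^{\mathrm{class}(B_i)-1} \binom{t}{c} \times \binom{w - n_i - t}{\mathrm{class}(B) - c_i - c}$$
2. class(X_i) = class(B_i) and offset(X_i) < offset(B_i). Idea: Multiplication
$$\mathrm{offset}(B_i) \times \binom{w - n_i - t}{\mathrm{class}(B) - c_i - \mathrm{class}(B_i)}$$
• w − n_i − t is in {0, t, 2t, …, (w/t)t = w}.
• class(B) − c_i − c ranges in [0, w).
• class(B_i) ranges in [0, t).
• Each value can be represented in w bits.
[Figure: blocks X of the same class as B, listed in descending order of offset(X) from top to bottom; regions 1 and 2 between o_i and o_{i+1} correspond to the two cases above.]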
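The two cases can be combined into a small encoder. This is a sketch under assumptions rather than the talk's implementation: offsets are 0-based, chunk offsets follow numeric order within a class, the Case 1 sum is evaluated term by term with `math.comb` instead of being read from a precomputed table (so each chunk costs O(t) here, not O(1)), and all function names are hypothetical.

```python
from math import comb


def build_chunk_table(t):
    """Universal table mapping every t-bit chunk to its (class, offset)."""
    seen = [0] * (t + 1)              # seen[c]: chunks of class c listed so far
    table = []
    for chunk in range(1 << t):
        c = bin(chunk).count("1")
        table.append((c, seen[c]))
        seen[c] += 1
    return table


def chunkwise_encode(block, w, t, table):
    """Return (class, offset) of a w-bit block in chunkwise order."""
    assert w % t == 0
    k = bin(block).count("1")
    mask = (1 << t) - 1
    offset, ones_seen = 0, 0
    for i in range(w // t):
        rest_bits = w - (i + 1) * t           # w - n_i - t
        chunk_class, chunk_offset = table[(block >> rest_bits) & mask]
        ones_left = k - ones_seen             # class(B) - c_i
        # Case 1: chunks X_i of a smaller class than B_i.
        for c in range(chunk_class):
            offset += comb(t, c) * comb(rest_bits, ones_left - c)
        # Case 2: chunks X_i of the same class with a smaller offset.
        offset += chunk_offset * comb(rest_bits, ones_left - chunk_class)
        ones_seen += chunk_class
    return k, offset


# With w = 4 and t = 2 the class-2 blocks in chunkwise order are
# 0011, 0101, 0110, 1001, 1010, 1100.
print(chunkwise_encode(0b0110, 4, 2, build_chunk_table(2)))
```

Replacing the inner loop with a lookup into a table of prefix sums, indexed by the remaining length, the remaining ones and class(B_i), is what would bring the per-chunk cost down to a constant.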
Block decoding in O((w/t)log t) time
§ Reverse operation of block encoding.
• class(B_i): O(log t) time by a successor query on a universal table. Idea: Successor query
$$\mathrm{class}(B_i) = \min\left\{ k : \sum_{c=0}^{k} \binom{t}{c} \times \binom{w - n_i - t}{\mathrm{class}(B) - c_i - c} \ge \mathrm{offset}(B) - o_i \right\}$$
• offset(B_i): O(1) time by integer division.
Lemma: Block decoding can be implemented in O((w/t) log t) time with $O(w^3 + 2^t t)$-bit universal tables.
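Decoding can be sketched as the reverse walk over chunks. The assumptions here are the same as for a 0-based chunkwise offset: under that convention the successor test becomes the strict comparison "cumulative count > remaining offset", a linear scan over classes stands in for the O(log t) successor query on a precomputed table, and the function names are hypothetical.

```python
from math import comb


def build_decode_table(t):
    """chunks_by_class[c][o] is the t-bit chunk with class c and offset o."""
    chunks_by_class = [[] for _ in range(t + 1)]
    for chunk in range(1 << t):
        chunks_by_class[bin(chunk).count("1")].append(chunk)
    return chunks_by_class


def chunkwise_decode(k, offset, w, t, chunks_by_class):
    """Rebuild the w-bit block of class k with the given chunkwise offset."""
    if w % t != 0 or not 0 <= offset < comb(w, k):
        raise ValueError("invalid parameters")
    block, ones_left = 0, k
    for i in range(w // t):
        rest_bits = w - (i + 1) * t
        # Successor query (linear scan here): the first class c whose
        # cumulative count of smaller-or-equal classes exceeds the offset.
        below = 0
        for c in range(min(t, ones_left) + 1):
            count = comb(t, c) * comb(rest_bits, ones_left - c)
            if offset < below + count:
                break
            below += count
        offset -= below
        # Integer division splits what is left into offset(B_i) and the
        # offset of the remaining chunks.
        chunk_offset, offset = divmod(offset, comb(rest_bits, ones_left - c))
        block = (block << t) | chunks_by_class[c][chunk_offset]
        ones_left -= c
    return block


print(bin(chunkwise_decode(2, 2, 4, 2, build_decode_table(2))))
```

Storing the cumulative counts in a table and binary searching over at most t + 1 classes would give the O(log t) cost per chunk stated in the lemma.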
Experimental results
Experiment 1: Encoding/Decoding
§ Method: Measured average time for block encoding and decoding.
§ Input: 1M random blocks of length w = 64 for each class.
Our chunkwise encoding and decoding:
§ Time: Significantly faster and less sensitive to densities.
§ Space: Comparable to the bitwise approach (t = 8) and 10 times more (t = 16).
[Figure: Average time (in microseconds) for encoding and decoding, plotted against the class of blocks, for Bitwise, Our chunkwise (t = 8), and Our chunkwise (t = 16).]
Experiment 2: Rank/Select queries
§ Method: Measured average time for 1M rank/select on RRR.
§ Input: Random bit strings of length $2^{28}$ with densities of 5%, 10%, and 20%.
Density             5%                 10%                20%
Operation           Rank1   Select1    Rank1   Select1    Rank1   Select1
bitwise             0.226   0.276      0.288   0.310      0.375   0.417
chunkwise (t=8)     0.212   0.288      0.279   0.312      0.297   0.321
chunkwise (t=16)    0.187   0.250      0.219   0.254      0.235   0.265
Average time (in microseconds) for rank and select
Our chunkwise approach improved rank/select queries on RRR
although our improvement is smaller than that in Experiment 1.
Conclusion
§ Practical block encoding and decoding for RRR
• New time-space tradeoff based on chunkwise processing:
‣ O(w/t) encoding
‣ O((w/t)log t) decoding
‣ $O(w^3 + 2^t t)$ bits of space.
• Generalize previous blockwise and bitwise approaches.
• Fast and stable on artificial data with various densities.
§ Future work:
• More experimental evaluation on real data.
THANK YOU


  • 1. Faster Practical Block Compression for Rank/Select Dictionaries Yusaku Kaneta | yusaku.kaneta@rakuten.com Rakuten Institute of Technology, Rakuten, Inc.
  • 2. Background
    § Compressed data structures in Web companies
      • Web companies generate massive amounts of logs in text formats.
      • Analyzing such huge logs is vital for our decision making.
      • Practical improvements of compressed data structures are important.
    § RRR compression [Raman, Raman, Rao, SODA’02]
      • Basic building block in many compressed data structures.
      • Rank/Select queries on compressed bit strings in constant time:
        ‣ Rank_b(B, i): Number of b’s in B’s prefix of length i.
        ‣ Select_b(B, i): Position of B’s i-th b.
        Here B is an input bit string and b is a bit in {0, 1}.
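The two queries defined on this slide can be illustrated with a naive, uncompressed sketch. It runs in linear time, not the constant time RRR provides, and the indexing conventions (0-based positions, 1-based occurrence counts, -1 when no such occurrence exists) are assumptions made for the example, not taken from the slides.

```python
def rank(bits: str, b: str, i: int) -> int:
    """Number of occurrences of bit b in the prefix of bits of length i."""
    return bits[:i].count(b)


def select(bits: str, b: str, i: int) -> int:
    """0-based position of the i-th (1-based) occurrence of b, or -1 if absent."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    return -1


B = "0110100"
print(rank(B, "1", 4))    # ones in the prefix "0110"
print(select(B, "1", 3))  # position of the third one
```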
  • 3. RRR = Block compression + succinct index
    § Represents a block B of w bits as a pair (class(B), offset(B)).
      • class(B): Number of ones in B.
      • offset(B): Number of preceding blocks of the same class as B in some order (e.g., lexicographical order of bit strings).
    § $\log w$ bits for class(B) and $\log_2 \binom{w}{\mathrm{class}(B)}$ bits for offset(B).
    § Two practical approaches to block compression:
      • Blockwise approach [Claude and Navarro, SPIRE’09]
      • Bitwise approach [Navarro and Providel, SEA’12]
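A brute-force sketch of the (class, offset) pair may help make the definition concrete. It assumes lexicographical order of bit strings, represents blocks as Python strings of "0" and "1", and rounds the bit counts up to whole bits (using $\lceil \log_2 (w+1) \rceil$ for the class, since a class takes w + 1 possible values). The function names are hypothetical.

```python
from itertools import combinations
from math import ceil, comb, log2


def block_class(block: str) -> int:
    """class(B): number of ones in the block."""
    return block.count("1")


def block_offset(block: str) -> int:
    """offset(B): lexicographic rank of the block among blocks of its class.

    Brute force: enumerate every w-bit string with the same number of ones.
    """
    w, k = len(block), block.count("1")
    same_class = sorted(
        "".join("1" if i in ones else "0" for i in range(w))
        for ones in map(set, combinations(range(w), k))
    )
    return same_class.index(block)


def encoded_bits(w: int, k: int) -> tuple[int, int]:
    """Whole bits needed for the class field and for the offset field."""
    return ceil(log2(w + 1)), ceil(log2(comb(w, k)))


print(block_class("0110"), block_offset("0110"))
print(encoded_bits(15, 7))
```

For w = 15 and class 7, the class field takes 4 bits and the offset field 13 bits, which shows why the class field is not negligible for small blocks.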
  • 4. Block compression in practice
    1. Blockwise approach [Claude and Navarro, SPIRE’09]
      Idea: $O(2^w w)$-bit universal tables.
      Good: O(1) time.
      Bad: Low compression ratio.
      § The tables limit the use of larger w.
      § $\log w$ bits for class(B) become non-negligible.
      § Ex) 25% overhead for w = 15.
    2. Bitwise approach [Navarro and Providel, SEA’12]
      Idea: $O(w^3)$-bit binomial coefficients.
      Good: High compression ratio.
      Bad: O(w) time.
      § Count bit strings lexicographically smaller than block B bit by bit.
      § In practice, heuristics that encode and decode blocks with a few ones in O(1) time can be used.
    Less flexible in practice.
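The bitwise idea of counting lexicographically smaller bit strings bit by bit can be sketched as follows. This is a minimal illustration, not the implementation of Navarro and Providel: it calls `math.comb` where a real implementation would look up precomputed binomial coefficients, and the function names are hypothetical.

```python
from math import comb


def bitwise_offset(block: str) -> int:
    """Count same-class strings that are lexicographically smaller than block."""
    w = len(block)
    remaining_ones = block.count("1")
    offset = 0
    for pos, bit in enumerate(block):
        if bit == "1":
            # Strings that match so far but have 0 here must fit all
            # remaining ones into the w - pos - 1 bits that follow.
            offset += comb(w - pos - 1, remaining_ones)
            remaining_ones -= 1
    return offset


def bitwise_decode(w: int, k: int, offset: int) -> str:
    """Rebuild the w-bit block with k ones from its offset, one bit at a time."""
    bits = []
    for pos in range(w):
        smaller = comb(w - pos - 1, k)  # strings with 0 at this position
        if k > 0 and offset >= smaller:
            bits.append("1")
            offset -= smaller
            k -= 1
        else:
            bits.append("0")
    return "".join(bits)


print(bitwise_offset("0110"))
print(bitwise_decode(4, 2, 2))
```

Both loops visit every bit of the block, which is the O(w) cost listed on the slide.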
  • 5. Main result
    § Practical encoder/decoder for block compression
      • Generalization of existing blockwise and bitwise approaches.
      • Idea: chunkwise processing with multiple universal tables.
      • Faster and more stable on artificial data.
    Method                                      Encode    Decode            Space (in bits)
    Blockwise [Claude and Navarro, SPIRE’09]    O(1)      O(1)              $O(2^w w)$
    Bitwise [Navarro and Providel, SEA’12]      O(w)      O(w)              $O(w^3)$
    Chunkwise (this work)                       O(w/t)    O((w/t) log t)    $O(w^3 + 2^t t)$
    This talk uses w and t for block and chunk lengths, respectively.
  • 7. Overview of our algorithm
    § Main idea: Process a block B in a chunkwise fashion.
      • B_i: The i-th chunk of length t. (Suppose t divides w.)
        ‣ Encoded/decoded in O(1) time using $O(2^t t)$-bit universal tables.
      • Efficiently count the blocks X satisfying X < B by using a combination formula and the chunkwise order.
        ‣ Combination formula: $\binom{t}{c} \times \binom{n-t}{m-c}$, for a block X of n bits and m ones whose first t bits contain c ones (so the remaining n − t bits contain m − c ones).
        ‣ Chunkwise order: a lexicographical order over chunks, in which X < B if, at the first chunk i where they differ,
          1. class(X_i) < class(B_i), or
          2. class(X_i) = class(B_i) and offset(X_i) < offset(B_i).
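The combination formula can be checked by brute force on a small case. The sketch below assumes the reading that the formula counts n-bit blocks with m ones whose first t bits contain c ones. The function name is hypothetical.

```python
from itertools import product
from math import comb


def count_by_first_chunk(n: int, t: int, m: int, c: int) -> int:
    """Brute-force count of n-bit blocks with m ones and c ones in the first t bits."""
    return sum(
        1
        for bits in product("01", repeat=n)
        if bits.count("1") == m and bits[:t].count("1") == c
    )


n, t, m = 8, 4, 3
for c in range(m + 1):
    # Each brute-force count should match comb(t, c) * comb(n - t, m - c).
    print(c, count_by_first_chunk(n, t, m, c), comb(t, c) * comb(n - t, m - c))
```

Summing the formula over all c gives $\binom{n}{m}$, the total number of blocks of that class, so the per-chunk counts partition the class.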
  • 8. Block encoding in O(w/t) time
    Lemma: Block encoding can be implemented in O(w/t) time with $O(w^3 + 2^t t)$-bit universal tables.
    [Figure: blocks X of the same class as B, listed in descending order of offset(X) from top to bottom. B is split into B_0···B_{i-1} (n_i bits, c_i ones), B_i, and B_{i+1}···B_{w/t-1} (w − n_i − t bits, class(B) − c_i − class(B_i) ones).]
    • $o_i = |\{X : X_0 \cdots X_{i-1} < B_0 \cdots B_{i-1}\}|$
    • Two cases separate o_i from o_{i+1}:
      1. class(X_i) < class(B_i). Idea: Summation
         $\sum_{c=0}^{\mathrm{class}(B_i)-1} \binom{t}{c} \times \binom{w-n_i-t}{\mathrm{class}(B)-c_i-c}$
      2. class(X_i) = class(B_i) and offset(X_i) < offset(B_i). Idea: Multiplication
         $\mathrm{offset}(B_i) \times \binom{w-n_i-t}{\mathrm{class}(B)-c_i-\mathrm{class}(B_i)}$
    • w − n_i − t is in {0, t, 2t, …, (w/t)t = w}.
    • class(B) − c_i − c ranges in [0, w).
    • class(B_i) ranges in [0, t).
    • Each value can be represented in w bits.
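The summation and multiplication steps can be put together in a small sketch of chunkwise encoding. This is an illustration under stated assumptions, not the author's implementation: offsets are 0-based, chunk offsets follow lexicographical order, and the summation is evaluated directly with `math.comb` instead of being read from precomputed tables, so the sketch does not reach the O(w/t) bound. All names are hypothetical.

```python
from math import comb


def build_chunk_tables(t: int) -> tuple[list[int], list[int]]:
    """Universal tables giving class and offset for every t-bit chunk value."""
    chunk_class = [bin(v).count("1") for v in range(1 << t)]
    chunk_offset = [0] * (1 << t)
    seen = [0] * (t + 1)
    for v in range(1 << t):  # increasing value = lexicographic order of t-bit strings
        chunk_offset[v] = seen[chunk_class[v]]
        seen[chunk_class[v]] += 1
    return chunk_class, chunk_offset


def chunkwise_encode(block: str, t: int, tables) -> tuple[int, int]:
    """Return (class, offset) of block under the chunkwise order."""
    w = len(block)
    assert w % t == 0, "t must divide w"
    chunk_class, chunk_offset = tables
    remaining_ones = block.count("1")  # class(B) - c_i
    offset = 0
    for start in range(0, w, t):
        rest_bits = w - start - t  # w - n_i - t
        v = int(block[start:start + t], 2)
        k = chunk_class[v]
        # Case 1: chunks with fewer ones than B_i (the summation).
        for c in range(k):
            offset += comb(t, c) * comb(rest_bits, remaining_ones - c)
        # Case 2: same chunk class, smaller chunk offset (the multiplication).
        offset += chunk_offset[v] * comb(rest_bits, remaining_ones - k)
        remaining_ones -= k
    return block.count("1"), offset


print(chunkwise_encode("0110", 2, build_chunk_tables(2)))
```

Within each class the resulting offsets cover 0 to $\binom{w}{\mathrm{class}(B)} - 1$ exactly once, which is what makes the pair a lossless encoding of the block.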
  • 9. Block decoding in O((w/t) log t) time
    § Reverse operation of block encoding.
      • class(B_i): O(log t) time by a successor query on a universal table.
        Idea: Successor query
        $\min \{ k : \sum_{c=0}^{k} \binom{t}{c} \times \binom{w-n_i-t}{\mathrm{class}(B)-c_i-c} \ge \mathrm{offset}(B) - o_i \}$
      • offset(B_i): O(1) time by integer division.
    Lemma: Block decoding can be implemented in O((w/t) log t) time with $O(w^3 + 2^t t)$-bit universal tables.
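The reverse direction can be sketched in the same spirit. The assumptions are labeled here and in the comments: offsets are 0-based (so the class search uses a strict comparison), chunk offsets follow lexicographical order, and the successor query is replaced by a linear scan over chunk classes, which costs O(t) per chunk instead of O(log t). All names are hypothetical.

```python
from math import comb


def build_decode_tables(t: int) -> list[list[int]]:
    """chunk_at[c][o] is the t-bit chunk value with class c and offset o."""
    chunk_at = [[] for _ in range(t + 1)]
    for v in range(1 << t):  # increasing value = lexicographic order
        chunk_at[bin(v).count("1")].append(v)
    return chunk_at


def chunkwise_decode(w: int, t: int, k: int, offset: int, chunk_at) -> str:
    """Rebuild the w-bit block of class k from its 0-based chunkwise offset."""
    assert w % t == 0, "t must divide w"
    bits = []
    remaining_ones = k
    for start in range(0, w, t):
        rest_bits = w - start - t
        # Linear scan standing in for the successor query: find the chunk
        # class c whose range of offsets contains the remaining offset.
        for c in range(min(t, remaining_ones) + 1):
            count = comb(t, c) * comb(rest_bits, remaining_ones - c)
            if offset < count:
                break
            offset -= count
        else:
            raise ValueError("offset out of range for this class")
        # Integer division splits the chunk offset from the offset of the rest.
        tail = comb(rest_bits, remaining_ones - c)
        chunk_off, offset = divmod(offset, tail)
        bits.append(format(chunk_at[c][chunk_off], f"0{t}b"))
        remaining_ones -= c
    return "".join(bits)


print(chunkwise_decode(4, 2, 2, 2, build_decode_tables(2)))
```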
  • 11. Experiment 1: Encoding/Decoding
    § Method: Measured average time for block encoding and decoding.
    § Input: 1M random blocks of length w = 64 for each class.
    Our chunkwise encoding and decoding:
    § Time: Significantly faster and less sensitive to densities.
    § Space: Comparable (t = 8) and 10 times more (t = 16).
    [Figure: Average time (in microseconds) for encoding and decoding. Panels: encoding time and decoding time, each plotted against the class of blocks. Series: bitwise, our chunkwise (t = 8), our chunkwise (t = 16).]
  • 12. Experiment 2: Rank/Select queries
    § Method: Measured average time for 1M rank/select queries on RRR.
    § Input: Random bit strings of length $2^{28}$ with densities 5, 10, and 20%.
    Average time (in microseconds) for rank and select:
    Density             5%                  10%                 20%
    Operation           Rank1    Select1    Rank1    Select1    Rank1    Select1
    bitwise             0.226    0.276      0.288    0.310      0.375    0.417
    chunkwise (t=8)     0.212    0.288      0.279    0.312      0.297    0.321
    chunkwise (t=16)    0.187    0.250      0.219    0.254      0.235    0.265
    Our chunkwise approach improved rank/select queries on RRR, although the improvement is smaller than that in Experiment 1.
  • 13. Conclusion
    § Practical block encoding and decoding for RRR
      • New time-space tradeoff based on chunkwise processing:
        ‣ O(w/t) encoding
        ‣ O((w/t) log t) decoding
        ‣ $O(w^3 + 2^t t)$ bits of space
      • Generalizes previous blockwise and bitwise approaches.
      • Fast and stable on artificial data with various densities.
    § Future work:
      • More experimental evaluation on real data.