SlideShare a Scribd company logo
20th String Processing and Information Retrieval (SPIRE2013),
Jerusalem, Israel, October 9th, 2013

Fully-Online Grammar
Compression
Yasuo Tabei (PREST, JST)
Collaboration with
Shirou Maruyama (PFI, Inc)
Hiroshi Sakamoto (Kyutech)
Kunihiko Sadakane (NII)
Motivation
• Large-scale and highly repetitive text collections
have become ubiquitous
– Personal genomes, version controlled documents,
source codes in repository, reports by studentsnew

• Repair = representative grammar compression
– Not applicable to large-scale repetitive texts

• Present a scalable grammar compression
Straight Line Program (SLP)
• Canonical form of a CFG deriving a single string
• Every production rule satisfies
– Right-hand side is a digram
– Subscripts of the left symbol is larger than subscripts
of the right symbols
X5

Example:
aabbabb

X1➝ab
X2➝X1a
X3➝X1X2
X4➝X3X2

X2
a

X4
X1

ab

b

X3

b

X1
ab
Straight Line Program (SLP)
• Canonical form of a CFG deriving a single string
• Every production rule satisfies
– Right-hand side is a digram
– Subscripts of the left symbol is larger than subscripts
of the right symbols
X5

Example:
aabbabb
N:text length

X1➝ab
X2➝X1a
n
X3➝X1X2
X4➝X3X2

X2
a

X4
X1

ab

h:
b height

X3

b

X1
ab
Grammar Compression (GC)
• Build a small CFG from an input string
– Size n = number of production rules

• Two crucial data structures
1. Dictionary : Given Xk, returns XiXj for Xk ➝ XiXj
- Array : 2nlgn bits
2. Reverse dictionary: Given XiXj, return Xk
- Hash table : O(nlgn) bits
X1➝ab
X2➝X1a
X3➝X1X2
X4➝X3X2

Access : Xk ➝ A[2k-1][2k]
Existing grammar compression
• Compression time and working space are
important for scalability
• Online LCA (OLCA) [CCP,2011] = efficient GC
Compression
Method
time
CCP,2011 O(N/α)
SPIRE,2012 O(N/α)
CPM,2013 O(Nlgn)

Working space (bits)
(3+α)nlgn
(11/4+α nlgn
2nlgn(1+o(1))+2nlgp (p << √n)

• Drawbacks : they need a large working space
• Challenge : developing fast GC of smaller
working space
Fully-Online LCA (FOLCA)
Direct encoding of an SLP
SLP (Parse Tree)
Text
abaababa

Partial Parse Tree

Succinct
Representation
12345678910
B:0010101011
L:abaX1X2
P:123469

• Smaller working space : (1+α)nlgn+n(3+lg(αn))
bits
• Optimal encoding: lgn+2n+o(n) bits
– Almost equal to the lower bound [CPM,2013]
Menu
• Review of Online LCA
• FOLCA

• Compressed hash table for smaller working space
• Substring extractions
• Experiments
Basic idea of OLCA
• Replace the same pairs of symbols in common
substrings by the same non-terminal symbols as
many as possible
• Build 2-trees or 2-2-trees
X2
X1

X2

X2
X3

X4

X1

X3

X4

X1

a b r a k a d a b r a k a d a b r
common substrings

• Iterate this procedure to novel non-terminal
symbols until it builds a single parse tree
Land mark : local feature decided by
a triple of symbols ABC
• B is a landmark if B belongs to one of the
following : i) repetitive: A = B = C, ii) maximum:
A < B > C, iii) minimum: A > B < C
• Enable an bottom up construction of a parse tree
in an online manner
• Build a parse subtree from a sequence of
symbols of length four
i)B is a landmark
Z

ii) Otherwise
Z

Y

ABCD
A B C

D
Online construction of a parse tree
• Use a queue corresponding to each level of a parse tree
• Read a character, build a subtree in each queue, and
enqueue a non-terminal symbol of the root to the higher
queue
(i) q1 is land mark

enqueue

z

Qi+1

(ii) Otherwise

z

Qi+1
z

z

y
Qi q0 q1 q2 q3
q0q1

Qi q0 q1 q2 q3
dequeue

q0q1q2

enqueue

dequeue
Demonstration of OLCA
Q3
d

X3

1

2

d

X1 X1

b X2

1

2

4

Rules
X1→aa
X2→ab
X3→X1X1

3

4

5

Q2
3

5

Q1

Input string

d
1

a a a a b a b a a a a b
2

3

4

5

Courtesy by Shirou Maruyama
Efficiency of OLCA
• The approximation ratio : O(lg2N)
• Compression time : O(N/α)

• Working space : (3 + α)nlgn bits
• Parse tree is balanced and its height is h =
O(lgN)
Fully-Online LCA (FOLCA)
• Build post-order partial parse tree (POPPT)
– Partial parse tree whose internal nodes have postorder variables
Parse tree

POPPT

• Enable direct encoding to a post-order
succinct tree : nlgn + 2n + o(n) bits
Online construction of POPPT
• A replacing pair in queues are shifted to the right
position of OLCA
(i) q1 is land mark

enqueue

z

Qi+1

(ii) otherwise

enqueue

z

Qi+1
z

z

y
Qi q0 q1 q2 q3 q4
q0q1

Qi q0 q1 q2 q3 q4
dequeue

q0q1q2

dequeue

• Approximation ratio is the same as that of OLCA
Succinct encoding of POPPT
• FOLCA builds POPPT in an online manner, it
encodes the POPPT into dynamic RMM tree
[Sadakane and Navarro,2009]
– ‘0’ for a leaf and ‘1’ for an internal node
– L : a label sequence for leaves
POPPT

Succinct tree
B : 0010101011
L : abaX1X2

nlgn + 2n + o(n) bits
• Simulate tree operations using rank/select
dictionary : random access to Xk ➝ XiXj
Compression of reverse dictionary :
Given XiXj, it returens Xk for Xk➝XiXj
• Implemented as chaining hash table
– αnlgn bits for the table, n(1+α)lgn bits for the lists (α:
load factor of hash table)

• Observation : FOLCA generates post-order
variables in increasing order
– Variables in each list can be organized in increasing
order.

• Compress each list by gap-encoding and the
delta code
• Space : (1+α)nlgn + n(3+lg(αn)) bits
• Access time : O(1/α)
Substring extraction
• Keep the starting position of the substring
encoded by each variable Xi in position array P
– Naïve representation : nlgN bits

• Observation : position array is a monotonically
increasing sequence [Grossi et al., 2003]
• nlg(N/n)+3n+o(n) bits
• Extraction time of a substring
of length l is O(l+h)

P
Increasing
Experiments
• Ecoli (108MB) and kernel texts (247MB) from
repetitive collections in pizza & chili corpus
• Evaluate compression time, working space and
substring extraction time
• Compare FOLCA with LZend [Kreft and
Navarro’10]
• Applicability to 100 human genomes (300GB)
Compression time and working
space for the Ecoli text
FOLCA: Spaces for hash table (H) dictionary (D)
and position array (P)
load
H+D
H+D+P
factor
time (sec) H (MB) (MB)
(MB)
0.01
1,328
23
45
50
0.05
728
37
59
64
0.1
553
48
70
75
0.3
416
65
87
92
0.5
408
90
112
117
LZend
time (sec) space (MB)
2,217
2,410
Compression time and working
space for the kernel text
FOLCA: Spaces for hash table (H) dictionary (D)
and position array (P)
load
H+D
H+D+P
factor
time (sec) H (MB) (MB)
(MB)
0.01
2,891
11
21
23
0.05
2,071
13
23
25
0.1
1,472
16
26
28
0.3
951
30
40
42
0.5
882
42
52
54
LZend
time (sec) space (MB)
4,547
4,653
Substring extraction time and
working space for the kernel text
Time [sec]
Length
101
102
103
104
105

FOLCA

LZend

0.00007
0.00026
0.00224
0.02176
0.21328

0.00002
0.00011
0.00100
0.00954
0.09215

Working space [MB]
FOLCA
12

LZend
14
Compression size for 100 human
genomes (300GB)
Compression time for 100 human
genomes (300GB)
Summary of FOLCA
• Directly encode an SLP into a succinct
representation of nlgn+2n+o(o) bits
• Asymptotically equivalent to the information
theoretic lower bound [CPM,2013]
• Compressed hash table for small working space
of (1+α)nlgn+n(3+lg(αn)) bits
• Support substring extraction in O(l+h) time using
additional space of nlg(N/n)+3n+o(n) bits

More Related Content

What's hot

Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
Fabian Pedregosa
 
Big o notation
Big o notationBig o notation
Big o notation
hamza mushtaq
 
Sortsearch
SortsearchSortsearch
Linear sorting
Linear sortingLinear sorting
Linear sorting
Krishna Chaytaniah
 
Using R in remote computer clusters
Using R in remote computer clustersUsing R in remote computer clusters
Using R in remote computer clusters
Burak Himmetoglu
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
Albert Bifet
 
01. haskell introduction
01. haskell introduction01. haskell introduction
01. haskell introduction
Sebastian Rettig
 
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
Casiano Rodriguez-leon
 
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
Casiano Rodriguez-leon
 
[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...
[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...
[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...
Asai Masataro
 
NumPy/SciPy Statistics
NumPy/SciPy StatisticsNumPy/SciPy Statistics
NumPy/SciPy Statistics
Enthought, Inc.
 
Heapsort 1
Heapsort 1Heapsort 1
Heapsort 1
sana younas
 
Monad presentation scala as a category
Monad presentation   scala as a categoryMonad presentation   scala as a category
Monad presentation scala as a category
samthemonad
 
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Sean Moran
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
xlight
 
Matlab bode diagram_instructions
Matlab bode diagram_instructionsMatlab bode diagram_instructions
Matlab bode diagram_instructions
Keihin de Mexico S.A. de C.V.
 
LSH
LSHLSH
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
Sameera Horawalavithana
 
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
Of Sampling and Smoothing: Approximating Distributions over Linked Open DataOf Sampling and Smoothing: Approximating Distributions over Linked Open Data
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
Thomas Gottron
 
Introducing: A Complete Algebra of Data
Introducing: A Complete Algebra of DataIntroducing: A Complete Algebra of Data
Introducing: A Complete Algebra of Data
Inside Analysis
 

What's hot (20)

Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
Big o notation
Big o notationBig o notation
Big o notation
 
Sortsearch
SortsearchSortsearch
Sortsearch
 
Linear sorting
Linear sortingLinear sorting
Linear sorting
 
Using R in remote computer clusters
Using R in remote computer clustersUsing R in remote computer clusters
Using R in remote computer clusters
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
01. haskell introduction
01. haskell introduction01. haskell introduction
01. haskell introduction
 
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
 
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
PREDICTING THE TIME OF OBLIVIOUS PROGRAMS. Euromicro 2001
 
[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...
[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...
[AAAI-16] Tiebreaking Strategies for A* Search: How to Explore the Final Fron...
 
NumPy/SciPy Statistics
NumPy/SciPy StatisticsNumPy/SciPy Statistics
NumPy/SciPy Statistics
 
Heapsort 1
Heapsort 1Heapsort 1
Heapsort 1
 
Monad presentation scala as a category
Monad presentation   scala as a categoryMonad presentation   scala as a category
Monad presentation scala as a category
 
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)Regularised Cross-Modal Hashing (SIGIR'15 Poster)
Regularised Cross-Modal Hashing (SIGIR'15 Poster)
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
Matlab bode diagram_instructions
Matlab bode diagram_instructionsMatlab bode diagram_instructions
Matlab bode diagram_instructions
 
LSH
LSHLSH
LSH
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
Of Sampling and Smoothing: Approximating Distributions over Linked Open DataOf Sampling and Smoothing: Approximating Distributions over Linked Open Data
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
 
Introducing: A Complete Algebra of Data
Introducing: A Complete Algebra of DataIntroducing: A Complete Algebra of Data
Introducing: A Complete Algebra of Data
 

Similar to SPIRE2013-tabei20131009

Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
Dev Nath
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
fnothaft
 
Speech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognitionSpeech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognition
Takuya Yoshioka
 
Prolog & lisp
Prolog & lispProlog & lisp
Prolog & lisp
Ismail El Gayar
 
lex and yacc.pdf
lex and yacc.pdflex and yacc.pdf
lex and yacc.pdf
SurajRavi16
 
PS
PSPS
Verification with LoLA
Verification with LoLAVerification with LoLA
Verification with LoLA
Universität Rostock
 
DSJ_Unit I & II.pdf
DSJ_Unit I & II.pdfDSJ_Unit I & II.pdf
DSJ_Unit I & II.pdf
Arumugam90
 
Introduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applicationsIntroduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applications
Yu Liu
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
Rajeev Raman
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functions
Rebekah Mercer
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
Alex Pruden
 
Mit cilk
Mit cilkMit cilk
Mit cilk
Raymond Kung
 
Query Rewriting in RDF Stream Processing
Query Rewriting in RDF Stream ProcessingQuery Rewriting in RDF Stream Processing
Query Rewriting in RDF Stream Processing
Jean-Paul Calbimonte
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)
Qiangning Hong
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
Association for Computational Linguistics
 
Memory allocation
Memory allocationMemory allocation
Memory allocation
sanya6900
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
fnothaft
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Tokyo Institute of Technology
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
DataWorks Summit/Hadoop Summit
 

Similar to SPIRE2013-tabei20131009 (20)

Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
 
Speech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognitionSpeech enhancement for distant talking speech recognition
Speech enhancement for distant talking speech recognition
 
Prolog & lisp
Prolog & lispProlog & lisp
Prolog & lisp
 
lex and yacc.pdf
lex and yacc.pdflex and yacc.pdf
lex and yacc.pdf
 
PS
PSPS
PS
 
Verification with LoLA
Verification with LoLAVerification with LoLA
Verification with LoLA
 
DSJ_Unit I & II.pdf
DSJ_Unit I & II.pdfDSJ_Unit I & II.pdf
DSJ_Unit I & II.pdf
 
Introduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applicationsIntroduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applications
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
 
snarks <3 hash functions
snarks <3 hash functionssnarks <3 hash functions
snarks <3 hash functions
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
 
Mit cilk
Mit cilkMit cilk
Mit cilk
 
Query Rewriting in RDF Stream Processing
Query Rewriting in RDF Stream ProcessingQuery Rewriting in RDF Stream Processing
Query Rewriting in RDF Stream Processing
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Memory allocation
Memory allocationMemory allocation
Memory allocation
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 

More from Yasuo Tabei

Space-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment Kernels
Yasuo Tabei
 
SISAP17
SISAP17SISAP17
SISAP17
Yasuo Tabei
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Yasuo Tabei
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
Yasuo Tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
 
GIW2013
GIW2013GIW2013
GIW2013
Yasuo Tabei
 
WABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTree
Yasuo Tabei
 
Mlab2012 tabei 20120806
Mlab2012 tabei 20120806Mlab2012 tabei 20120806
Mlab2012 tabei 20120806
Yasuo Tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20Yasuo Tabei
 
Gwt presen alsip-20111201
Gwt presen alsip-20111201Gwt presen alsip-20111201
Gwt presen alsip-20111201
Yasuo Tabei
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
Yasuo Tabei
 
Dmss2011 public
Dmss2011 publicDmss2011 public
Dmss2011 public
Yasuo Tabei
 
Gwt sdm public
Gwt sdm publicGwt sdm public
Gwt sdm public
Yasuo Tabei
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
Yasuo Tabei
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
Yasuo Tabei
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
Yasuo Tabei
 

More from Yasuo Tabei (17)

Space-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment Kernels
 
SISAP17
SISAP17SISAP17
SISAP17
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
 
GIW2013
GIW2013GIW2013
GIW2013
 
WABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTree
 
Mlab2012 tabei 20120806
Mlab2012 tabei 20120806Mlab2012 tabei 20120806
Mlab2012 tabei 20120806
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
 
Gwt presen alsip-20111201
Gwt presen alsip-20111201Gwt presen alsip-20111201
Gwt presen alsip-20111201
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
 
Dmss2011 public
Dmss2011 publicDmss2011 public
Dmss2011 public
 
Gwt sdm public
Gwt sdm publicGwt sdm public
Gwt sdm public
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
 
Lp Boost
Lp BoostLp Boost
Lp Boost
 

Recently uploaded

“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 

Recently uploaded (20)

“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 

SPIRE2013-tabei20131009

  • 1. 20th String Processing and Information Retrieval (SPIRE2013), Jerusalem, Israel, October 9th, 2013 Fully-Online Grammar Compression Yasuo Tabei (PREST, JST) Collaboration with Shirou Maruyama (PFI, Inc) Hiroshi Sakamoto (Kyutech) Kunihiko Sadakane (NII)
  • 2. Motivation • Large-scale and highly repetitive text collections have become ubiquitous – Personal genomes, version controlled documents, source codes in repository, reports by studentsnew • Repair = representative grammar compression – Not applicable to large-scale repetitive texts • Present a scalable grammar compression
  • 3. Straight Line Program (SLP) • Canonical form of a CFG deriving a single string • Every production rule satisfies – Right-hand side is a digram – Subscripts of the left symbol is larger than subscripts of the right symbols X5 Example: aabbabb X1➝ab X2➝X1a X3➝X1X2 X4➝X3X2 X2 a X4 X1 ab b X3 b X1 ab
  • 4. Straight Line Program (SLP) • Canonical form of a CFG deriving a single string • Every production rule satisfies – Right-hand side is a digram – Subscripts of the left symbol is larger than subscripts of the right symbols X5 Example: aabbabb N:text length X1➝ab X2➝X1a n X3➝X1X2 X4➝X3X2 X2 a X4 X1 ab h: b height X3 b X1 ab
  • 5. Grammar Compression (GC) • Build a small CFG from an input string – Size n = number of production rules • Two crucial data structures 1. Dictionary : Given Xk, returns XiXj for Xk ➝ XiXj - Array : 2nlgn bits 2. Reverse dictionary: Given XiXj, return Xk - Hash table : O(nlgn) bits X1➝ab X2➝X1a X3➝X1X2 X4➝X3X2 Access : Xk ➝ A[2k-1][2k]
  • 6. Existing grammar compression • Compression time and working space are important for scalability • Online LCA (OLCA) [CCP,2011] = efficient GC Compression Method time CCP,2011 O(N/α) SPIRE,2012 O(N/α) CPM,2013 O(Nlgn) Working space (bits) (3+α)nlgn (11/4+α nlgn 2nlgn(1+o(1))+2nlgp (p << √n) • Drawbacks : they need a large working space • Challenge : developing fast GC of smaller working space
  • 7. Fully-Online LCA (FOLCA) Direct encoding of an SLP SLP (Parse Tree) Text abaababa Partial Parse Tree Succinct Representation 12345678910 B:0010101011 L:abaX1X2 P:123469 • Smaller working space : (1+α)nlgn+n(3+lg(αn)) bits • Optimal encoding: lgn+2n+o(n) bits – Almost equal to the lower bound [CPM,2013]
  • 8. Menu • Review of Online LCA • FOLCA • Compressed hash table for smaller working space • Substring extractions • Experiments
  • 9. Basic idea of OLCA • Replace the same pairs of symbols in common substrings by the same non-terminal symbols as many as possible • Build 2-trees or 2-2-trees X2 X1 X2 X2 X3 X4 X1 X3 X4 X1 a b r a k a d a b r a k a d a b r common substrings • Iterate this procedure to novel non-terminal symbols until it builds a single parse tree
  • 10. Land mark : local feature decided by a triple of symbols ABC • B is a landmark if B belongs to one of the following : i) repetitive: A = B = C, ii) maximum: A < B > C, iii) minimum: A > B < C • Enable an bottom up construction of a parse tree in an online manner • Build a parse subtree from a sequence of symbols of length four i)B is a landmark Z ii) Otherwise Z Y ABCD A B C D
  • 11. Online construction of a parse tree • Use a queue corresponding to each level of a parse tree • Read a character, build a subtree in each queue, and enqueue a non-terminal symbol of the root to the higher queue (i) q1 is land mark enqueue z Qi+1 (ii) Otherwise z Qi+1 z z y Qi q0 q1 q2 q3 q0q1 Qi q0 q1 q2 q3 dequeue q0q1q2 enqueue dequeue
  • 12. Demonstration of OLCA Q3 d X3 1 2 d X1 X1 b X2 1 2 4 Rules X1→aa X2→ab X3→X1X1 3 4 5 Q2 3 5 Q1 Input string d 1 a a a a b a b a a a a b 2 3 4 5 Courtesy by Shirou Maruyama
  • 13. Efficiency of OLCA • The approximation ratio : O(lg2N) • Compression time : O(N/α) • Working space : (3 + α)nlgn bits • Parse tree is balanced and its height is h = O(lgN)
  • 14. Fully-Online LCA (FOLCA) • Build post-order partial parse tree (POPPT) – Partial parse tree whose internal nodes have postorder variables Parse tree POPPT • Enable direct encoding to a post-order succinct tree : nlgn + 2n + o(n) bits
  • 15. Online construction of POPPT • A replacing pair in queues are shifted to the right position of OLCA (i) q1 is land mark enqueue z Qi+1 (ii) otherwise enqueue z Qi+1 z z y Qi q0 q1 q2 q3 q4 q0q1 Qi q0 q1 q2 q3 q4 dequeue q0q1q2 dequeue • Approximation ratio is the same as that of OLCA
  • 16. Succinct encoding of POPPT • FOLCA builds POPPT in an online manner, it encodes the POPPT into dynamic RMM tree [Sadakane and Navarro,2009] – ‘0’ for a leaf and ‘1’ for an internal node – L : a label sequence for leaves POPPT Succinct tree B : 0010101011 L : abaX1X2 nlgn + 2n + o(n) bits • Simulate tree operations using rank/select dictionary : random access to Xk ➝ XiXj
  • 17. Compression of reverse dictionary : Given XiXj, it returens Xk for Xk➝XiXj • Implemented as chaining hash table – αnlgn bits for the table, n(1+α)lgn bits for the lists (α: load factor of hash table) • Observation : FOLCA generates post-order variables in increasing order – Variables in each list can be organized in increasing order. • Compress each list by gap-encoding and the delta code • Space : (1+α)nlgn + n(3+lg(αn)) bits • Access time : O(1/α)
  • 18. Substring extraction • Keep the starting position of the substring encoded by each variable Xi in position array P – Naïve representation : nlgN bits • Observation : position array is a monotonically increasing sequence [Grossi et al., 2003] • nlg(N/n)+3n+o(n) bits • Extraction time of a substring of length l is O(l+h) P Increasing
  • 19. Experiments • Ecoli (108MB) and kernel texts (247MB) from repetitive collections in pizza & chili corpus • Evaluate compression time, working space and substring extraction time • Compare FOLCA with LZend [Kreft and Navarro’10] • Applicability to 100 human genomes (300GB)
  • 20. Compression time and working space for the Ecoli text FOLCA: Spaces for hash table (H) dictionary (D) and position array (P) load H+D H+D+P factor time (sec) H (MB) (MB) (MB) 0.01 1,328 23 45 50 0.05 728 37 59 64 0.1 553 48 70 75 0.3 416 65 87 92 0.5 408 90 112 117 LZend time (sec) space (MB) 2,217 2,410
  • 21. Compression time and working space for the kernel text FOLCA: Spaces for hash table (H) dictionary (D) and position array (P) load H+D H+D+P factor time (sec) H (MB) (MB) (MB) 0.01 2,891 11 21 23 0.05 2,071 13 23 25 0.1 1,472 16 26 28 0.3 951 30 40 42 0.5 882 42 52 54 LZend time (sec) space (MB) 4,547 4,653
  • 22. Substring extraction time and working space for the kernel text Time [sec] Length 101 102 103 104 105 FOLCA LZend 0.00007 0.00026 0.00224 0.02176 0.21328 0.00002 0.00011 0.00100 0.00954 0.09215 Working space [MB] FOLCA 12 LZend 14
  • 23. Compression size for 100 human genomes (300GB)
  • 24. Compression time for 100 human genomes (300GB)
  • 25. Summary of FOLCA • Directly encode an SLP into a succinct representation of nlgn+2n+o(o) bits • Asymptotically equivalent to the information theoretic lower bound [CPM,2013] • Compressed hash table for small working space of (1+α)nlgn+n(3+lg(αn)) bits • Support substring extraction in O(l+h) time using additional space of nlg(N/n)+3n+o(n) bits