DCC 2014 - Fully Online Grammar Compression in Constant Space
Shirou Maruyama (Preferred Infrastructure, Inc.) and Yasuo Tabei (PRESTO, JST)
Data Compression Conference (DCC), March 26, 2014
Compression of large-scale repetitive texts
e.g., personal genomes, version-controlled documents, source code in repositories
• Fully online LCA (FOLCA) [SPIRE, 2013]: builds a CFG and directly encodes it into a succinct representation
  – Works in space proportional to the CFG size and in time linear in the length of the text
• FOLCA requires a large working space for noisy repetitive texts
  – 9% difference on average between human genomes in a recent database [Nature, 2010]
• We present novel variants of FOLCA working in constant space
Straight Line Program (SLP)
• Canonical form of a CFG deriving a single text
• Every production rule satisfies:
  – The right-hand side is a digram
  – The subscript of the left symbol is larger than the subscripts of the right symbols
Example (text: aabbabb):
  X1 → ab
  X2 → X1a
  X3 → X1X2
  X4 → X3X2
[Figure: parse tree of the example SLP, rooted at X5]
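As a concrete illustration of how digram rules derive a text, here is a minimal Python sketch that expands the example rules above; because the slide's figure was garbled in extraction, the string it prints may not match the slide's text exactly.

# A minimal sketch (not the paper's code) of how an SLP derives its text:
# each non-terminal expands its digram recursively; symbols without a rule
# are terminals. The rules are taken from the slide's example.

rules = {
    "X1": ("a", "b"),
    "X2": ("X1", "a"),
    "X3": ("X1", "X2"),
    "X4": ("X3", "X2"),
}

def expand(symbol):
    """Recursively expand a symbol into the substring it derives."""
    if symbol not in rules:              # terminal character
        return symbol
    left, right = rules[symbol]          # right-hand side is always a digram
    return expand(left) + expand(right)

print(expand("X4"))   # -> "ababaaba"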
Notation: N is the text length and n is the SLP size (number of rules)
Grammar compression (GC)
• Build a small SLP from an input text
  – Bottom-up construction of a parse tree
• A hash table (a.k.a. reverse dictionary) is a crucial data structure
  – Given XiXj, it returns Xk for Xk→XiXj
  – Access time: O(1/α); memory: n(3+α)lg(n+σ) bits
    (α: load factor, σ: alphabet size)
[Figure: bottom-up parsing of aabbabb, building X1, X2, and X3 over the text]
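As a hedged sketch of the reverse dictionary, the following Python code uses a plain dict in place of the O(1/α)-access hash table; the class name and the create-on-first-lookup policy are illustrative stand-ins, not the paper's implementation.

# A minimal sketch of the reverse dictionary: given a digram (Xi, Xj),
# return the non-terminal Xk with Xk -> Xi Xj, creating Xk on first sight.

class ReverseDictionary:
    def __init__(self):
        self.table = {}    # (Xi, Xj) -> Xk
        self.counter = 0

    def lookup(self, left, right):
        digram = (left, right)
        if digram not in self.table:     # new digram: mint a fresh rule
            self.counter += 1
            self.table[digram] = f"X{self.counter}"
        return self.table[digram]

rd = ReverseDictionary()
print(rd.lookup("a", "b"))    # X1 (new rule X1 -> ab)
print(rd.lookup("a", "b"))    # X1 again (shared non-terminal)
print(rd.lookup("X1", "a"))   # X2 (new rule X2 -> X1 a)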
Existing GCs
• Compression time and working space are important for scalability
• Online LCA (OLCA) [CCP, 2011] = efficient GC
• Drawback: existing methods need a large working space
• Challenge: developing a fast GC with smaller working space

Method         Compression time   Working space (bits)
[CCP, 2011]    O(N/α)             (3+α)n lg n
[SPIRE, 2012]  O(N/α)             (11/4+α)n lg n
[CPM, 2013]    O(N lg n)          2n lg n(1+o(1)) + 2n lg p   (p << √n)
Menu
• Review of FOLCA in compressed space
• FOLCA in constant space
• Decompression in constant space
• Experiments
Fully Online LCA (FOLCA) [SPIRE, 2013]
• Smaller working space: (1+α)n lg n + n(3+lg(αn)) bits
• Optimal encoding: n lg n + 2n + o(n) bits
  – Almost equal to the lower bound [CPM, 2013]
Direct encoding of an SLP: Text abaababa → SLP (parse tree) → partial parse tree → succinct representation
  B: 0010101011
  L: abaX1X2
  P: 1, 2, 3, 4, 6, 9
Basic idea of FOLCA
• Replace identical pairs of symbols in common substrings with the same non-terminal symbols as much as possible
• Build 2-trees or 2-2-trees
[Figure: the text abrakadabrakadabr; its common substrings are parsed into identical subtrees labeled X1–X4]
• Iterate this procedure over the new non-terminal symbols until a single parse tree is built
Online construction of a parse tree
• Use a queue corresponding to each level of the parse tree
• (i) Read a character, (ii) build a subtree in each queue, and (iii) enqueue the non-terminal symbol of its root to the queue one level higher
[Figure: queue Qi holding symbols q0...q4; a subtree with root z is built and z is enqueued to Qi+1. Case (i): q1 is a landmark; case (ii): otherwise]
Demonstration
[Figure: step-by-step run over queues Q1, Q2, Q3 on an input string, producing the rules X1→aa, X2→ab, X3→X1X1]
Courtesy of S. Maruyama
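The following Python sketch shows only the level-queue data flow; it is not FOLCA itself. Real FOLCA chooses between 2-trees and 2-2-trees with a deterministic landmark rule so that common substrings are parsed identically, whereas this simplified version always pairs two adjacent symbols, and it leaves partially filled queues unmerged at the end.

# Simplified level-queue construction: (i) read one character, (ii) build a
# subtree whenever a queue holds two symbols, (iii) enqueue the subtree's
# root one level up. The landmark logic of FOLCA is deliberately omitted.

def build_grammar(text):
    rd = {}                      # reverse dictionary: (left, right) -> Xk
    def name(left, right):
        if (left, right) not in rd:
            rd[(left, right)] = f"X{len(rd) + 1}"
        return rd[(left, right)]

    queues = [[]]                # queues[i] = queue of parse-tree level i
    for ch in text:              # (i) read a character
        queues[0].append(ch)
        level = 0
        while len(queues[level]) == 2:       # (ii) a subtree is ready
            left, right = queues[level]
            queues[level].clear()
            if level + 1 == len(queues):
                queues.append([])
            queues[level + 1].append(name(left, right))  # (iii) enqueue root
            level += 1
    return rd

rules = build_grammar("aabbaaab")
for (l, r), k in rules.items():
    print(f"{k} -> {l}{r}")

On the input "aabbaaab" this prints six rules whose final non-terminal X6 derives the whole input.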
FOLCA in compressed space
• The succinct PPT is output to secondary storage
  – Size: n lg n + 2n bits
• The hash table is kept in main memory
  – Each element = a triple (Xk, Xi, Xj) for Xk→XiXj
• The working space depends only on the SLP size n
  – n(3+α)lg(n+σ) bits
Example: partial parse tree (PPT) → succinct PPT
  Secondary storage: B: 0010101011, L: abaX1X2
  Main memory (hash table): ab→X1, X1a→X2, X2X1→X3, X3X2→X4
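To make the encoding concrete, here is a Python sketch that produces the succinct PPT (B, L) from the rules listed on this slide; the one assumption is that the trailing 1 in the slide's B closes the tree.

# A sketch of producing the succinct PPT: traverse the parse tree in
# post-order; the first visit to a non-terminal keeps its subtree (emit its
# children, then a 1), every later visit becomes a leaf (emit 0 and record
# the label in L). Terminals are always leaves.

def encode_ppt(rules, root):
    B, L, seen = [], [], set()

    def visit(sym):
        if sym not in rules or sym in seen:   # terminal or repeated non-terminal
            B.append("0")
            L.append(sym)
            return
        seen.add(sym)                          # first occurrence: keep subtree
        left, right = rules[sym]
        visit(left)
        visit(right)
        B.append("1")                          # internal node in post-order

    visit(root)
    B.append("1")                              # assumed tree-closing bit
    return "".join(B), L

rules = {"X1": ("a", "b"), "X2": ("X1", "a"),
         "X3": ("X2", "X1"), "X4": ("X3", "X2")}
print(encode_ppt(rules, "X4"))   # ('0010101011', ['a', 'b', 'a', 'X1', 'X2'])

The output reproduces the B and L shown on the slide for the text abaababa.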
FOLCA in constant space
• Basic idea: compute the frequencies of production rules in the hash table and remove infrequent ones
• Naive approach: divide the text into fixed-length blocks and apply FOLCA to each block
• Instead, apply stream mining techniques:
  – frequency counting [Demaine et al., 02]: FREQ_FOLCA
  – lossy counting [Manku et al., 02]: LOSSY_FOLCA
[Figure: parsing of aabbabbaba... with rule frequencies in the hash table: X1: 2, X2: 2, X3: 1]
FREQ_FOLCA
• Basic idea: (i) use a hash table with at most k entries and (ii) remove the lowest ε percent of infrequent production rules
• Infrequent production rules are removed every time the hash table size reaches k
• Removal is based on relative frequencies
• Working space: … bits
• Computational time: …
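A hedged Python sketch of the frequency-counting idea [Demaine et al., 02] as used here: cap the dictionary at k entries and evict the lowest-ε fraction by frequency whenever it fills. The class, constants, and eviction policy are simplified stand-ins, not the paper's algorithm.

import math

# Keep at most k digram counters; on overflow, drop the ceil(eps*k) least
# frequent entries so that frequent production rules survive.

class FrequencyCappedDict:
    def __init__(self, k, eps):
        self.k, self.eps = k, eps
        self.freq = {}                 # digram -> frequency

    def observe(self, digram):
        if digram not in self.freq and len(self.freq) >= self.k:
            self._evict()
        self.freq[digram] = self.freq.get(digram, 0) + 1

    def _evict(self):
        drop = math.ceil(self.eps * self.k)
        for d, _ in sorted(self.freq.items(), key=lambda kv: kv[1])[:drop]:
            del self.freq[d]

d = FrequencyCappedDict(k=3, eps=0.5)
for digram in [("a", "b"), ("a", "b"), ("b", "a"), ("X1", "a"), ("b", "b")]:
    d.observe(digram)
print(d.freq)   # {('a', 'b'): 2, ('b', 'b'): 1}: the rare digrams were evicted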
LOSSY_FOLCA
• Basic idea: (i) divide the text into blocks of fixed length l, and (ii) keep production rules across successive blocks according to their frequencies
  – A production rule appearing q times is kept for the next q successive blocks
• Infrequent production rules are removed based on absolute frequencies
• Working space: … bits
• Computational time: …
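A minimal Python sketch of the lossy-counting idea [Manku et al., 02]: counts are decremented at every block boundary, so a rule observed q times survives roughly q more blocks. Error tracking and other details of the full algorithm are omitted.

# Cut the stream into blocks of length l; at each block boundary, age every
# counter by one and delete the ones that reach zero.

def lossy_count(stream, l):
    freq, out = {}, []
    for i, item in enumerate(stream, 1):
        freq[item] = freq.get(item, 0) + 1
        if i % l == 0:                      # end of a block
            for key in list(freq):
                freq[key] -= 1              # age every counter by one block
                if freq[key] == 0:
                    del freq[key]           # infrequent rules are removed
            out.append(dict(freq))          # survivors after this block
    return out

stream = ["ab", "ab", "ba", "ab", "bb", "ab", "ab", "ca"]
for i, survivors in enumerate(lossy_count(stream, l=4), 1):
    print(f"after block {i}: {survivors}")   # only the frequent "ab" persists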
Decompression in constant space
• FREQ/LOSSY_FOLCA outputs multiple succinct PPTs
• Recover one subtext per PPT
  – Detect the end of each PPT by counting the 0s and 1s in B
• The working space is the same as for FREQ/LOSSY_FOLCA
Example: I) succinct PPT (B: 0010101011, L: abaX1X2) → II) recover the SLP → III) recover the subtext abaababa
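A minimal Python sketch of steps II and III for a single PPT, using the slide's example values; as in the encoding sketch above, the final 1 in B is assumed to close the tree.

# Scan B with a stack: 0 consumes the next label from L, 1 pops two children
# and defines the next non-terminal in post-order; then expand the root.

def decode_ppt(B, labels):
    leaves = iter(labels)
    stack, rules = [], {}
    for bit in B:
        if bit == "0":                       # leaf: terminal or earlier rule
            stack.append(next(leaves))
        elif len(stack) >= 2:                # internal: new rule in post-order
            right, left = stack.pop(), stack.pop()
            name = f"X{len(rules) + 1}"
            rules[name] = (left, right)
            stack.append(name)
        else:                                # assumed tree-closing bit
            break

    def expand(sym):                         # recover the subtext from the SLP
        if sym not in rules:
            return sym
        left, right = rules[sym]
        return expand(left) + expand(right)

    return expand(stack[-1])

print(decode_ppt("0010101011", ["a", "b", "a", "X1", "X2"]))   # -> abaababa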
Experiments
• Use 100 human genomes (≈300GB) from the 1000 Genomes Project [Nature, 2010]
• Compare FREQ_FOLCA, LOSSY_FOLCA, and the naive approach (BLOCK_FOLCA)
• Use working space, compression ratio, and compression time as evaluation measures
Compression ratio and working space for 100 human genomes (≈306GB)
• Compression ratio (CR), compression time (CT) in seconds (s), maximum working space (WS) in megabytes (MB)

Method                     CR     WS (MB)  CT (s)
FREQ_FOLCA (k=1000MB)      31.39  38,048   86,098
FREQ_FOLCA (k=2000MB)      19.71  76,096   93,823
LOSSY_FOLCA (l=5000MB)     20.07  36,246   87,548
LOSSY_FOLCA (l=10000MB)    17.45  56,878   87,446
BLOCK_FOLCA (l=5000MB)     31.85  23,276   88,501
BLOCK_FOLCA (l=10000MB)    25.91  34,665   92,007
Summary
• Two variants of FOLCA working in constant space
• Frequency-based algorithms:
  – compute the frequencies of production rules in a hash table and remove infrequent ones
• Built on stream mining techniques
• Can compress 100 human genomes (300GB) in about one day
Editor's Notes
In this talk, I will deal with compression of large-scale repetitive texts.
Examples are personal genomes, version controlled documents, source code in repositories.
We presented fully online LCA called FOLCA that builds an SLP and directly encodes it into a succinct representation.
Its working space is proportional to the SLP size and its computational time is linear in the length of the text.
However, recent sequencing technology generates noisy repetitive texts. In fact, there is a 9% difference on average between human genomes in a recent database,
although it is said that the difference between individual genomes is 0.01%.
For such noisy repetitive texts, FOLCA, whose working space grows with the SLP size, consumes a large amount of memory.
We present novel variants of FOLCA working in constant space.
In this talk, we assume straight line programs as our grammars.
SLP is a canonical form of a CFG deriving a single string.
Every production rule satisfies:
the right-hand side is a digram, and
the subscript of the left symbol is larger than the subscripts of the right symbols.
Grammar compression (GC) builds a small SLP from an input text.
It builds a parse tree corresponding to an SLP in a bottom-up manner.
A hash table, also known as a reverse dictionary, is a crucial data structure in grammar compression.
Given the right-hand side XiXj, it returns the left-hand-side symbol Xk of the production rule Xk → XiXj.
Access time is O(1/α) and memory is n(3+α)lg(n+σ) bits,
where α is the load factor and σ is the alphabet size.
Compression time and working space are important for applying grammar compression for large-scale repetitive texts.
Online LCA (OLCA) is an efficient grammar compression.
OLCA has been extended to achieve a smaller working space.
But, they still need a large working space.
Now our challenge is to develop fast GC of smaller working space.
We modify FOLCA as working in compressed space.
FOLCA builds a POPPT that is output to a secondary storage device.
The succinct representation is indexed by a rank/select dictionary.
There is no o(n) term here.
In addition, hash table is kept in a main memory.
The hash table consumes most of the memory.
Working space is n(3+alpha)lg(n+sigma) bits.
Thus, the working space depends only on the SLP size n.
From this slide, I will present FOLCA working in constant space.
The basic idea of our novel variants of FOLCA is to compute the frequencies of production rules in the hash table and remove infrequent ones at certain points.
We apply stream mining techniques from the data mining area for extracting frequent items in data streams.
We apply two techniques.
First is frequency counting, proposed by Demaine et al. in 2002.
We refer to FOLCA using frequency counting as FREQ_FOLCA.
Second is lossy counting, proposed by Manku et al. in 2002.
We refer to FOLCA using lossy counting as LOSSY_FOLCA.
A naive approach to compressing long repetitive texts is to divide the text into fixed-length blocks and apply a compressor to each block.
Compression suffers because long-range repetitions are not captured.
On the other hand, our variants of FOLCA can capture long-range repetitions.
The basic idea of FREQ_FOLCA is to use a hash table with at most k entries and remove the lowest ε percent of infrequent ones.
The basic idea of LOSSY_FOLCA is to divide the text into fixed-length blocks and keep production rules over successive blocks according to their frequencies.
The first figure shows the working space as the length of the text increases.
The horizontal axis represents the length of texts.
The vertical axis represents working space in megabytes.
We tried two parameter settings for LOSSY_FOLCA and FREQ_FOLCA.
The working space of FOLCA keeps increasing for long input texts.
FOLCA works in space proportional to the SLP size, so it is not applicable to large-scale, noisy repetitive texts.
On the other hand, our methods, LOSSY_FOLCA and FREQ_FOLCA, work in constant space that does not depend on the text length.
The second figure shows the working space for decompression.
The horizontal axis represents the length of texts.
The vertical axis represents the working space in megabytes.
You can see the same trends in the working space for decompression as in that for compression.
The working space of LOSSY_FOLCA and FREQ_FOLCA remains constant, independent of the length of the text.
The last figure shows the compression ratio and working space for 100 human genomes.
Compression finished in about one day.
You can see a trade-off between compression ratio and working space for each method:
larger parameter values achieve higher compression ratios at the cost of more working space.
The compression ratio of LOSSY_FOLCA is better than that of BLOCK_FOLCA for the same block length, which shows that LOSSY_FOLCA's strategy for removing infrequent production rules
was more effective than that of BLOCK_FOLCA.
LOSSY_FOLCA achieved a higher compression ratio than FREQ_FOLCA while using a smaller working space.
These results demonstrate the applicability of our methods to large-scale repetitive texts.