DCC2014 - Fully Online Grammar Compression in Constant Space

Fully Online Grammar Compression in Constant Space
Shirou Maruyama (1) and Yasuo Tabei (2)
(1) Preferred Infrastructure, Inc.
(2) PRESTO, JST
Data Compression Conference (DCC), March 26, 2014

Compression of large-scale repetitive texts
Ex) Personal genomes, version-controlled documents, source code in repositories
• Fully online LCA (FOLCA) [SPIRE,2013]: builds a CFG and directly encodes it into a succinct representation
– Works in space proportional to the CFG size and takes time linear in the length of the text
• Requires a large working space for noisy repetitive texts
– Average 9% differences between human genomes in recent database [Nature, 2010]
• We present novel variants of FOLCA that work in constant space

Straight Line Program (SLP)
• Canonical form of a CFG deriving a single text
• Every production rule satisfies
– Right-hand side is a digram
– The subscript of the left-hand symbol is larger than the subscripts of the right-hand symbols
Example (text: aabbabb):
X1➝ab
X2➝X1a
X3➝X1X2
X4➝X3X2
[Figure: parse tree of the example SLP]
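The SLP definition above can be made concrete with a small sketch. The dictionary layout and the function name `expand` are assumptions for illustration, not part of the talk; the rules are the ones shown in the hash-table figure later in the deck (ab→X1, X1a→X2, X2X1→X3, X3X2→X4), which derive the text abaababa.

```python
def expand(symbol, rules):
    """Return the text derived from `symbol` under the SLP `rules`.

    `rules` maps a non-terminal to its digram (left, right);
    any symbol that is not a key is treated as a terminal.
    """
    out = []
    stack = [symbol]
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)


# Hypothetical encoding of the rules from the deck's hash-table figure.
rules = {
    "X1": ("a", "b"),
    "X2": ("X1", "a"),
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
}
text = expand("X4", rules)
print(text)  # abaababa
```

Every right-hand side is a digram, and each rule only refers to symbols with smaller subscripts, so the expansion always terminates.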

Notation: n is the size of the SLP; N is the text length.

Grammar compression (GC)
• Build a small SLP from an input text
– Bottom-up construction of a parse tree
• A hash table (a.k.a. reverse dictionary) is the crucial data structure
– Given XiXj, it returns Xk for Xk→XiXj
– Access time: O(1/α); memory: n(3+α)lg(n+σ) bits (α: load factor, σ: alphabet size)
[Figure: bottom-up parse of a a b b a b b using X1, X1, X2 and X3]
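A minimal sketch of the reverse dictionary described above, assuming a plain Python `dict` as a stand-in for the load-factor-α hash table of the talk. The class and method names are hypothetical.

```python
class ReverseDictionary:
    """Maps a digram (Xi, Xj) to the non-terminal Xk with Xk -> Xi Xj."""

    def __init__(self):
        self.table = {}   # (Xi, Xj) -> Xk
        self.rules = []   # rules[k - 1] is the digram of Xk

    def lookup_or_add(self, left, right):
        """Return the existing non-terminal for the digram, or create one."""
        key = (left, right)
        name = self.table.get(key)
        if name is None:
            self.rules.append(key)
            name = f"X{len(self.rules)}"
            self.table[key] = name
        return name


rd = ReverseDictionary()
first = rd.lookup_or_add("a", "b")    # new rule X1 -> ab
second = rd.lookup_or_add("a", "b")   # same digram, same non-terminal
third = rd.lookup_or_add(first, "a")  # new rule X2 -> X1 a
print(first, second, third)  # X1 X1 X2
```

Reusing the same non-terminal for a repeated digram is what keeps the grammar small; the table itself is the memory cost that the rest of the talk tries to bound.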

Existing GCs
• Compression time and working space are
important for scalability
• Online LCA (OLCA) [CCP,2011] = efficient GC
• Drawbacks: they need a large working space
• Challenge: developing a fast GC with a smaller working space
Method       Compression time   Working space (bits)
CCP,2011     O(N/α)             (3+α)n lg n
SPIRE,2012   O(N/α)             (11/4+α)n lg n
CPM,2013     O(N lg n)          2n lg n(1+o(1)) + 2n lg p (p << √n)

Menu
• Review of FOLCA in compressed space
• FOLCA in constant space
• Decompression in constant space
• Experiments

Fully Online LCA (FOLCA) [SPIRE,2013]
• Smaller working space: (1+α)n lg n + n(3+lg(αn)) bits
• Optimal encoding: n lg n + 2n + o(n) bits
– Almost equal to the lower bound [CPM,2013]
[Figure: direct encoding of an SLP. Text abaababa → SLP (parse tree) → partial parse tree → succinct representation, with B: 0010101011 (positions 1 to 10), L: abaX1X2, P: 123469]
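The B and L strings in the figure can be reproduced with a short sketch. It assumes that B is a post-order traversal of the partial parse tree (0 for a leaf, 1 for an internal node), that a repeated non-terminal becomes a leaf, and that the final 1 stands for a virtual root. These conventions are inferred from the slide's example rather than stated on it, and `encode_ppt` is a hypothetical name. The sketch is recursive, so it only suits small grammars.

```python
def encode_ppt(root, rules):
    """Encode an SLP as (B, L) via a post-order walk of its partial parse tree."""
    bits, labels, seen = [], [], set()

    def visit(sym):
        # Terminals and already-expanded non-terminals are leaves.
        if sym not in rules or sym in seen:
            bits.append("0")
            labels.append(sym)
            return
        left, right = rules[sym]
        visit(left)
        visit(right)
        bits.append("1")   # internal node, emitted after its children
        seen.add(sym)

    visit(root)
    bits.append("1")       # assumed virtual root closing the tree
    return "".join(bits), labels


rules = {
    "X1": ("a", "b"),
    "X2": ("X1", "a"),
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
}
B, L = encode_ppt("X4", rules)
print(B)  # 0010101011
print(L)  # ['a', 'b', 'a', 'X1', 'X2']
```

With n rules the tree has n internal nodes and n+1 leaves, so B takes about 2n bits and L holds n+1 labels, which is consistent with the n lg n + 2n + o(n) bound quoted above.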

Basic idea of FOLCA
• Replace identical pairs of symbols in common substrings with the same non-terminal symbols as often as possible
• Build 2-trees or 2-2-trees
[Figure: the text a b r a k a d a b r a k a d a b r, with its common substrings parsed into the same non-terminals X1 to X4]
• Iterate this procedure on the new non-terminal symbols until a single parse tree is built

Online construction of a parse tree
• Use a queue corresponding to each level of a parse tree
• (i)Read a character, (ii)build a subtree in each queue,
and (iii)enqueue a non-terminal symbol of the root to the
higher queue
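[Figure: queue Qi holding q0 to q4 in two cases: (i) q1 is a landmark, (ii) otherwise. In one panel q0q1 is dequeued and a new symbol z is enqueued to Qi+1; in the other q0q1q2 is dequeued, with symbols y and z, and z is enqueued to Qi+1.]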
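The level-queue mechanism can be sketched in a deliberately simplified form. The version below always pairs two symbols as soon as a queue holds them, so it builds only 2-trees and omits the landmark test and 2-2-trees entirely. It is therefore not FOLCA itself and produces a different grammar than the slides; the names `LevelQueues`, `push`, and `finish` are hypothetical.

```python
class LevelQueues:
    """One queue per parse-tree level; a pair is replaced as soon as it forms."""

    def __init__(self):
        self.queues = []   # queues[i] holds pending symbols of level i
        self.table = {}    # reverse dictionary: (Xi, Xj) -> Xk
        self.rules = {}    # Xk -> (Xi, Xj)

    def _rule(self, left, right):
        key = (left, right)
        if key not in self.table:
            name = f"X{len(self.table) + 1}"
            self.table[key] = name
            self.rules[name] = key
        return self.table[key]

    def push(self, symbol, level=0):
        """(i) read a symbol, (ii) build a subtree, (iii) enqueue its root above."""
        if level == len(self.queues):
            self.queues.append([])
        q = self.queues[level]
        q.append(symbol)
        if len(q) == 2:
            left, right = q
            q.clear()
            self.push(self._rule(left, right), level + 1)

    def finish(self):
        """Promote leftover symbols upward until a single root remains."""
        level = 0
        while level + 1 < len(self.queues):
            q = self.queues[level]
            if q:
                self.push(q.pop(), level + 1)
            level += 1
        return self.queues[-1][0] if self.queues else None


def expand(symbol, rules):
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


lq = LevelQueues()
for ch in "abaababa":
    lq.push(ch)
root = lq.finish()
print(root, expand(root, lq.rules))  # X6 abaababa
```

Each character is processed as it arrives and only one short queue per level is kept, which is the property that makes the construction online.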

Demonstration
[Figure: animated demonstration of queues Q1 to Q3 processing an input string and producing the rules X1→aa, X2→ab, X3→X1X1. Courtesy of S. Maruyama]

FOLCA in compressed space
• The succinct PPT is output to secondary storage
– Size: n lg n + 2n bits
• The hash table is kept in main memory
– Each element = triple (Xk,Xi,Xj) for Xk→XiXj
• Working space depends only on the SLP size n
– n(3+α)lg(n+σ) bits
[Figure: the partial parse tree (PPT) and its succinct form (B: 0010101011, L: abaX1X2) go to secondary storage; the hash table (ab→X1, X1a→X2, X2X1→X3, X3X2→X4) stays in main memory]

FOLCA in constant space
• Basic idea: compute the frequencies of production rules in the hash table and remove infrequent ones
• Naive = divide a text into fixed-length blocks and apply FOLCA to each block
• Apply stream mining techniques
– frequency counting [Demaine et al., 02]: FREQ_FOLCA
– lossy counting [Manku et al., 02]: LOSSY_FOLCA
[Figure: parse of a a b b a b b a b a… using X1, X1, X2 and X3, with a frequency column (values shown: 2, 2, 1)]

FREQ_FOLCA
• Basic idea: (i) use a hash table with at most k entries and (ii) remove the lowest ε percent of entries by frequency
• Remove infrequent production rules every time the hash table size reaches k
• Built on relative frequencies
• Working space: bits
• Computational time:

LOSSY_FOLCA
• Basic idea: (i) divide a text into blocks of fixed length l, and (ii) keep each production rule for a number of successive blocks determined by its frequency
– A production rule that appears q times is kept for q successive blocks
• Remove infrequent production rules based on absolute frequencies
• Working space: bits
• Computational time:
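One way to read the rule "appearing q times, kept for q successive blocks" is as a per-rule credit that grows with every occurrence and shrinks by one at every block boundary. The sketch below implements that reading only; it is a simplification of lossy counting, and the names `LossyTable`, `use`, and `advance` are hypothetical.

```python
class LossyTable:
    """Reverse dictionary whose entries expire after unused blocks."""

    def __init__(self, block_len):
        self.block_len = block_len   # block length l, in characters
        self.seen = 0                # characters read in the current block
        self.credit = {}             # (Xi, Xj) -> blocks the rule may still live
        self.name = {}               # (Xi, Xj) -> Xk
        self.next_id = 1

    def use(self, left, right):
        """Return the non-terminal for the digram and add one block of credit."""
        key = (left, right)
        if key not in self.name:
            self.name[key] = f"X{self.next_id}"
            self.next_id += 1
            self.credit[key] = 0
        self.credit[key] += 1
        return self.name[key]

    def advance(self, n_chars=1):
        """Account for n_chars of input; expire rules at each block boundary."""
        self.seen += n_chars
        while self.seen >= self.block_len:
            self.seen -= self.block_len
            self._end_block()

    def _end_block(self):
        for key in list(self.credit):
            self.credit[key] -= 1
            if self.credit[key] <= 0:
                del self.credit[key]
                del self.name[key]


t = LossyTable(block_len=4)
t.use("a", "b")
t.use("a", "b")
t.use("b", "c")
t.advance(4)               # first boundary: (b, c) expires, (a, b) survives
after_first = sorted(t.name)
t.advance(4)               # second boundary: (a, b) expires as well
print(after_first, sorted(t.name))  # [('a', 'b')] []
```

Unlike the relative ranking used in FREQ_FOLCA, removal here depends on each rule's own absolute count.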

Decompression in constant space
• FREQ/LOSSY_FOLCA output multiple succinct PPTs
• Recover one subtext per PPT
– Detect the end of each PPT by counting the 0s and 1s in B
• Working space is the same as for FREQ/LOSSY_FOLCA
[Figure: I) succinct PPT (B: 0010101011, L: abaX1X2), II) recover the SLP, III) recover the subtext abaababa]
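The three steps in the figure can be sketched as follows, under the same assumed conventions as the slide's example: B is a post-order bit string that ends with one extra 1 for a virtual root, so a PPT ends at the first position where the number of 1s equals the number of 0s, and non-terminals are numbered in the order their 1 bits appear. The function names and the list form of L are hypothetical.

```python
def split_ppts(bits):
    """Yield the B string of each PPT from a concatenated bit string."""
    start = zeros = ones = 0
    for i, b in enumerate(bits):
        if b == "0":
            zeros += 1
        else:
            ones += 1
        if zeros == ones:            # virtual root reached: one PPT is complete
            yield bits[start:i + 1]
            start = i + 1
            zeros = ones = 0


def decode_ppt(bits, labels):
    """Rebuild the SLP rules from (B, L); return (root, rules)."""
    rules, stack = {}, []
    leaf = iter(labels)
    for b in bits:
        if b == "0":
            stack.append(next(leaf))
        elif len(stack) >= 2:
            right = stack.pop()
            left = stack.pop()
            name = f"X{len(rules) + 1}"
            rules[name] = (left, right)
            stack.append(name)
        # Otherwise this is the trailing 1 of the virtual root; nothing to do.
    return stack[0], rules


def expand(symbol, rules):
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


parts = list(split_ppts("0010101011" + "0011"))
root, rules = decode_ppt("0010101011", ["a", "b", "a", "X1", "X2"])
subtext = expand(root, rules)
print(parts)    # ['0010101011', '0011']
print(subtext)  # abaababa
```

Only the rules of the PPT being decoded need to be held in memory at once, which is why decompression can stay within the same space budget as compression.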

Experiments
• Use 100 human genomes (≒300GB) from the 1000 Genomes Project [Nature, 2010]
• Compare FREQ_FOLCA, LOSSY_FOLCA and the naïve approach (BLOCK_FOLCA)
• Use working space, compression ratio, and compression time as evaluation measures

Working space for decompression

Compression ratio and working space for
100 human genomes (≒306GB)
• Compression ratio (CR)
• Compression time (CT) in seconds (s)
• Maximum working space (WS) in megabytes (MB)
Method                      CR      WS (MB)   CT (s)
FREQ_FOLCA (k=1000MB)       31.39   38,048    86,098
FREQ_FOLCA (k=2000MB)       19.71   76,096    93,823
LOSSY_FOLCA (l=5000MB)      20.07   36,246    87,548
LOSSY_FOLCA (l=10000MB)     17.45   56,878    87,446
BLOCK_FOLCA (l=5000MB)      31.85   23,276    88,501
BLOCK_FOLCA (l=10000MB)     25.91   34,665    92,007

Summary
• Two variants of FOLCA working in constant
space
• Frequency-based algorithm:
– compute frequencies of production rules in a hash
table and remove infrequent ones
• Built on stream mining techniques
• Can compress 100 human genomes (300GB) in
about one day
