Algorithmics on SLP-compressed strings

Antonis Antonopoulos
CoReLab
National Technical University of Athens

Introduction
In many areas, large string data must not only be stored
in compressed form; the original data must also be
processed and analyzed.
Goal: design algorithms that operate directly on the
compressed data.
The decompress-and-solve strategy needs time and space
proportional to the size of the uncompressed data.

A compressed representation of a string makes
regularities in the string explicit. These regularities may
be exploited in a second phase for speeding up an
algorithm.
So, we need a mathematical model for compressed
representations of strings.

Such a model should have two properties:
Cover many compression schemes used in practice
Be mathematically easy to handle

Straight Line Programs
Definition (Straight Line Programs)
A Straight Line Program (SLP) over the terminal alphabet Σ is
a context-free grammar A = (V, Σ, S, P), P ⊆ V × (V ∪ Σ)∗, such
that:
1 For every X ∈ V there exists exactly one production of the
form (X, α) ∈ P.
2 The relation {(X, Y) | (X, α) ∈ P, Y ∈ alph(α)} is acyclic.
Here alph(α) denotes the set of symbols occurring in α.
The size of an SLP is $|A| = \sum_{(X,\alpha) \in P} |\alpha|$.
The language generated by A is a singleton; its unique
element is denoted eval(A).
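The two conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming productions are stored in a Python dict from nonterminal to a list of symbols (so "exactly one production per nonterminal" holds by construction) and that terminal and nonterminal names are disjoint. The function name check_slp is hypothetical; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter, CycleError

def check_slp(productions, terminals, start):
    """Validate the SLP conditions and return the size (sum of |rhs|).

    productions: dict mapping each nonterminal to its single right-hand
    side, given as a list of symbols (assumed representation).
    """
    if start not in productions:
        raise ValueError("start symbol has no production")
    deps = {}
    for lhs, rhs in productions.items():
        for sym in rhs:
            if sym not in productions and sym not in terminals:
                raise ValueError(f"unknown symbol {sym!r} in production of {lhs!r}")
        # Nonterminals occurring on the right-hand side of lhs.
        deps[lhs] = {sym for sym in rhs if sym in productions}
    try:
        # Iterating static_order() raises CycleError on a cyclic relation.
        list(TopologicalSorter(deps).static_order())
    except CycleError:
        raise ValueError("dependency relation is cyclic") from None
    return sum(len(rhs) for rhs in productions.values())

# S -> X X, X -> a b has size 2 + 2 = 4.
print(check_slp({"S": ["X", "X"], "X": ["a", "b"]}, {"a", "b"}, "S"))
```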

Example (Fibonacci Words)
Let A be the SLP over the terminal alphabet {a, b} with the following
productions:
A_1 → b
A_2 → a
A_i → A_{i−1} A_{i−2}, for 3 ≤ i ≤ 7
The start symbol is A_7.
Then eval(A) = abaababaabaab, the 7th Fibonacci word.
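The example can be reproduced by expanding the productions directly. This is a small sketch; the helper names fibonacci_slp and evaluate are illustrative choices, and nonterminal A_i is represented simply by the integer i.

```python
from functools import lru_cache

def fibonacci_slp(k):
    """Productions A_1 -> b, A_2 -> a, A_i -> A_(i-1) A_(i-2) for 3 <= i <= k."""
    prods = {1: "b", 2: "a"}
    for i in range(3, k + 1):
        prods[i] = (i - 1, i - 2)
    return prods

def evaluate(prods, start):
    """Fully expand the start symbol (the output can be exponentially long)."""
    @lru_cache(maxsize=None)
    def ev(x):
        rhs = prods[x]
        if isinstance(rhs, str):          # terminal production
            return rhs
        return "".join(ev(y) for y in rhs)
    return ev(start)

print(evaluate(fibonacci_slp(7), 7))  # abaababaabaab
```

The grammar has size 12 while the generated word has length 13; for larger k the word length grows like the Fibonacci numbers while the grammar grows linearly.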

SLPs can capture many of the usual compression methods. For
example:
Theorem
From the LZ77-factorization of a given string w ∈ Σ∗, we can
compute an SLP of size $O\left(\log\left(\frac{|w|}{m}\right) \cdot m\right)$ for w in time
$O\left(\log\left(\frac{|w|}{m}\right) \cdot m\right)$, where m is the number of factors in the
LZ77-factorization of w.
Also, we can easily design polynomial-time algorithms to
compute |eval(A)| and eval(A)[i], given an SLP A.
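One possible way to compute |eval(A)| and eval(A)[i] without decompressing is sketched here. It assumes productions are stored as a dict from nonterminal to a list of symbols and that positions are 1-based; the names lengths and char_at are hypothetical. Python integers are unbounded, so lengths that are exponential in the grammar size cause no overflow.

```python
def lengths(prods, terminals):
    """Return {X: |eval(X)|} for every nonterminal X, computed with memoisation."""
    memo = {}
    def length(sym):
        if sym in terminals:
            return 1
        if sym not in memo:
            memo[sym] = sum(length(y) for y in prods[sym])
        return memo[sym]
    for x in prods:
        length(x)
    return memo

def char_at(prods, terminals, start, i):
    """Return eval(start)[i] (1-based) by walking down the derivation tree."""
    size = lengths(prods, terminals)
    def length(sym):
        return 1 if sym in terminals else size[sym]
    if not 1 <= i <= length(start):
        raise IndexError("position out of range")
    sym = start
    while sym not in terminals:
        for child in prods[sym]:
            n = length(child)
            if i <= n:          # position i lies inside this child
                sym = child
                break
            i -= n              # skip this child entirely
    return sym
```

Each step of the walk descends one level in the grammar, so the work is bounded by the grammar size times its height rather than by the length of the generated string.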

The Smallest Grammar Problem
Given a string, what is the smallest SLP for it?
This is a decidable variant of Kolmogorov complexity.
Let opt(w) be the size of a minimal SLP for w, that is, of an SLP
A with eval(A) = w such that |B| ≥ |A| for every SLP B with
eval(B) = w.
Definition (Smallest Grammar Problem)
Given a string w, compute a minimal SLP for w.
Theorem (Approximation of an SLP)
There is an O(log |Σ| · n)-time algorithm that computes for a
given word w ∈ Σ∗ of length n an SLP A such that eval(A) = w
and $|A| \le O\left(\log\left(\frac{n}{\mathrm{opt}(w)}\right)\right) \cdot \mathrm{opt}(w)$.
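To get a feeling for how a grammar can be much smaller than its string, here is a deliberately naive construction: split the string in half recursively and reuse a nonterminal whenever the same substring appears again. This is not the algorithm behind the theorem above, and no approximation guarantee is claimed for it; the name halving_slp is hypothetical.

```python
def halving_slp(w):
    """Build an SLP for a non-empty string w by recursive halving,
    sharing one nonterminal among equal substrings."""
    prods = {}   # nonterminal id -> a single character or a pair of ids
    ids = {}     # substring -> nonterminal id (the sharing step)
    def build(u):
        if u not in ids:
            if len(u) == 1:
                rhs = u
            else:
                mid = len(u) // 2
                rhs = (build(u[:mid]), build(u[mid:]))
            ids[u] = len(prods)      # children were added first, so this id is fresh
            prods[ids[u]] = rhs
        return ids[u]
    start = build(w)
    size = sum(1 if isinstance(r, str) else 2 for r in prods.values())
    return prods, start, size

prods, start, size = halving_slp("a" * 1024)
print(size)  # 21: one terminal production plus ten doubling productions
```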

Theorem
Unless P = NP, there is no polynomial-time algorithm with the
following properties:
The input consists of a string w over some alphabet Σ.
The output is an SLP A such that eval(A) = w and
$|A| \le \frac{8569}{8568} \cdot \mathrm{opt}(w)$.
The proof uses a reduction from the vertex cover problem
for graphs with maximum degree 3, which is hard to
approximate below a ratio of $\frac{145}{144}$, unless P = NP.

Algorithms and Hardness for Compressed Problems
Compressed Equality Checking
Definition (Compressed Equality Checking)
Given two SLPs A and B, is eval(A) = eval(B)?
The algorithms for equality checking use combinatorial
properties of strings, such as the periodicity lemma.
Some results on sequential and parallel models:
Theorem
Compressed Equality Checking can be solved in time
$O\left((|A| + |B|)^2\right)$.
Theorem
Compressed Equality Checking belongs to $\mathrm{coRNC}^2$.
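As an illustration of how randomness can help with equality testing, the sketch below compares polynomial (Karp-Rabin style) fingerprints computed bottom-up on the two grammars. This is a swapped-in textbook technique, not the algorithm behind either theorem above. The modulus, the function names, and the dict-of-lists representation are assumptions of this sketch. Equal strings always pass; for unequal strings of length at most N the chance of a false "equal" in one round is at most about N divided by the modulus, so for strings that are exponentially long a correspondingly larger prime would be needed.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime used as modulus (assumption of this sketch)

def fingerprint(prods, start, base):
    """Return (hash, base**length mod P) of eval(start) without expanding it."""
    memo = {}
    def fp(sym):
        if sym not in prods:                  # terminal: a single character
            return (ord(sym) + 1) % P, base % P
        if sym not in memo:
            h, power = 0, 1                   # fingerprint of the empty string
            for child in prods[sym]:
                ch, cp = fp(child)
                h = (h * cp + ch) % P         # hash(uv) = hash(u) * base^|v| + hash(v)
                power = (power * cp) % P
            memo[sym] = (h, power)
        return memo[sym]
    return fp(start)

def probably_equal(prods_a, start_a, prods_b, start_b, rounds=3):
    """One-sided Monte Carlo test: never rejects two equal strings."""
    for _ in range(rounds):
        base = random.randrange(256, P - 1)
        if fingerprint(prods_a, start_a, base) != fingerprint(prods_b, start_b, base):
            return False
    return True
```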

Compressed Hamming Distance
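Let $d_H(a, b)$ denote the Hamming distance of two strings a, b ∈ Σ∗
of equal length, that is, the number of positions at which a and b differ.
Theorem
Computing $d_H(\mathrm{eval}(A), \mathrm{eval}(B))$, where A, B are SLPs, is
#P-complete.
Proof:
Membership in #P: let
$G(S, T, y) = \begin{cases} \text{“yes”} & \text{if } T_y \neq S_y \\ \text{“no”} & \text{otherwise} \end{cases}$
where $S_y$ and $T_y$ denote the y-th symbols of the two texts.
Then G ∈ FP, and the Hamming distance is the number of
positions y with G(S, T, y) = “yes”.
Hardness: we reduce the #P-complete problem #SUBSET SUM
to $d_H$ using a 1-Turing reduction.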

#SUBSET SUM asks, given W = {w_1, . . . , w_n} and t in binary, for
the number of subsets of W whose elements sum to
t, that is, the number of $x \in \{0,1\}^n$ for which
$x \cdot w = t$, where $w = (w_1, \ldots, w_n)$.
Let $s = \sum_{i=1}^{n} w_i$.
Consider the texts:
$T = \left(0^{t}\, 1\, 0^{s-t}\right)^{2^n}$
$S = \bigodot_{x \in \{0,1\}^n} \left(0^{x \cdot w}\, 1\, 0^{s - x \cdot w}\right)$
where $\bigodot$ denotes the concatenation of the blocks over all x.
Notice that $d_H(T, S)$ is exactly twice the number of
subsets of W whose elements do not sum to t.
We can easily construct an SLP B such that eval(B) = T.

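Consider the SLP A with the following rules:
$A_1 \to 1\, 0^{s + w_1}\, 1$
$A_{k+1} \to A_k\, 0^{s - s_k + w_{k+1}}\, A_k$ for $1 \le k \le n - 1$,
where $s_k = w_1 + \cdots + w_k$ and $A_n$ is the start symbol.
Using induction, we can prove that
$\mathrm{eval}(A) = \bigodot_{x \in \{0,1\}^n} \left(0^{x \cdot w}\, 1\, 0^{s - x \cdot w}\right) = S$,
with the blocks ordered so that $x_1$ varies fastest.
The size of A is polynomial in the length of the binary
encoding of w, since a block of zeros $0^m$ has an SLP of size O(log m).
Thus we can compute the answer to #SUBSET SUM as
$2^n - \frac{1}{2} d_H(T, S)$.
□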
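The reduction can be sanity-checked on tiny instances by writing the texts out explicitly. The sketch below expands the rules for A_1, ..., A_n into plain strings (so it takes exponential time and is only meant for very small n), computes the Hamming distance to T, and compares the result with a brute-force count. The function names are hypothetical, and 0 ≤ t ≤ s is assumed.

```python
from itertools import product

def build_S(w):
    """Expand A_1 -> 1 0^(s+w_1) 1 and A_(k+1) -> A_k 0^(s-s_k+w_(k+1)) A_k."""
    s = sum(w)
    a, prefix_sum = "1" + "0" * (s + w[0]) + "1", w[0]   # eval(A_1), s_1
    for k in range(1, len(w)):
        a = a + "0" * (s - prefix_sum + w[k]) + a        # eval(A_(k+1))
        prefix_sum += w[k]                                # s_(k+1)
    return a

def count_via_hamming(w, t):
    """Number of subsets of w summing to t, obtained as 2^n - d_H(T, S) / 2."""
    n, s = len(w), sum(w)
    T = ("0" * t + "1" + "0" * (s - t)) * 2 ** n
    S = build_S(w)
    assert len(T) == len(S)
    d = sum(x != y for x, y in zip(T, S))                 # Hamming distance
    return 2 ** n - d // 2

def count_brute_force(w, t):
    return sum(sum(xi * wi for xi, wi in zip(x, w)) == t
               for x in product((0, 1), repeat=len(w)))

print(count_via_hamming([1, 2, 3], 3), count_brute_force([1, 2, 3], 3))  # 2 2
```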

Fully Compressed Pattern Matching
In its most general form, the Pattern Matching Problem
asks, for given strings T and P, whether P is a factor of T.
There are many linear-time algorithms for the uncompressed case
(e.g. Knuth-Morris-Pratt).
Definition (Fully Compressed Pattern Matching)
Given two SLPs P and T, is eval(P) a factor of eval(T)?
An important observation underlying most algorithms is
that if a pattern p is a factor of eval(T), then there exists a
production X → YZ in T such that p has an
occurrence in eval_T(X) = eval_T(Y) eval_T(Z) that touches
the cut between eval_T(Y) and eval_T(Z).

It is a consequence of the Periodicity Lemma (Fine,
Wilf, 1965) that the set of all starting positions of the
occurrences of p in eval_T(X) that touch the cut of X forms
an arithmetic progression.
Lifshits' algorithm, for example, computes for every
nonterminal A of the pattern SLP P and each nonterminal
X of the text SLP T the arithmetic progression
corresponding to the occurrences of eval_P(A) in eval_T(X)
that touch the cut of X.
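The arithmetic-progression property can be observed on explicit strings with a brute-force check. In this sketch, "touching the cut" is taken to mean that the occurrence contains at least one symbol from each side of the cut, which is an assumption about the intended definition; the function names are hypothetical.

```python
def cut_occurrences(p, y, z):
    """Start positions (0-based, in y + z) of occurrences of p that
    contain symbols from both y and z."""
    text, cut = y + z, len(y)
    lo = max(0, cut - len(p) + 1)      # earliest start that still reaches into z
    return [i for i in range(lo, cut) if text.startswith(p, i)]

def is_arithmetic_progression(xs):
    """True if consecutive elements of xs all have the same difference."""
    return len({b - a for a, b in zip(xs, xs[1:])}) <= 1

positions = cut_occurrences("abababab", "ababab", "ababab")
print(positions, is_arithmetic_progression(positions))  # [0, 2, 4] True
```

A progression is described by three numbers (first element, difference, count), which is why one such triple per pair of nonterminals suffices as a table entry.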

These arithmetic progressions can be computed
bottom-up, resulting in an overall time bound of $O\left(|P|^2 |T|\right)$.
Jeż's algorithm beats that time bound with
$O\left((|T| + |P|) \log(|\mathrm{eval}(P)|) \log(|P| + |T|)\right) = O\left(n^2 \log n\right)$, where n = |P| + |T|.
For uncompressed patterns p, we can achieve a bound of
$O(|p| \cdot |T|)$.
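For an explicitly given pattern, one simple way to exploit the cut observation is to store, for every nonterminal, only its first and last |p| − 1 symbols and to search for p across each cut. The sketch below assumes a text SLP whose productions are either a single terminal character or a pair (Y, Z); the function name occurs is hypothetical, and the per-production cost is roughly linear in |p| provided the substring search itself is linear.

```python
def occurs(p, prods, start):
    """Decide whether the explicit pattern p is a factor of eval(start)."""
    if not p:
        return True
    k = len(p) - 1
    info = {}   # nonterminal -> (prefix of length <= k, suffix of length <= k, found)

    def tail(s):
        return s[len(s) - k:] if k < len(s) else s

    def visit(x):
        if x not in info:
            rhs = prods[x]
            if isinstance(rhs, str):                     # X -> a
                info[x] = (rhs[:k], tail(rhs), rhs == p)
            else:                                        # X -> Y Z
                py, sy, fy = visit(rhs[0])
                pz, sz, fz = visit(rhs[1])
                # An occurrence touching the cut lies inside suffix(Y) + prefix(Z).
                found = fy or fz or p in sy + pz
                info[x] = ((py + pz)[:k], tail(sy + sz), found)
        return info[x]

    return visit(start)[2]

fib = {1: "b", 2: "a"}
for i in range(3, 8):
    fib[i] = (i - 1, i - 2)
print(occurs("baab", fib, 7), occurs("bb", fib, 7))  # True False
```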

Subsequence Problems
In many applications, especially in Computational
Biology, approximate occurrences of a pattern are more
relevant than exact matches.
Subsequence problems provide very useful similarity
measures between sequences.
Definition (Fully Compressed Subsequence Problem)
Given two SLPs P and T, is eval(P) a subsequence of eval(T)?
Theorem
The Fully Compressed Subsequence Problem is in PSPACE and
it is PP-hard.
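The hardness above concerns the case where both pattern and text are compressed. For contrast, when the pattern is given explicitly, a simple table-based approach works in time proportional to |p| times the grammar size. The sketch below covers only that easier case and is not an algorithm for the fully compressed problem; it assumes productions of the form X → a or X → YZ, and the function name is hypothetical. For each nonterminal X it stores f_X(i), the pattern position reached after greedily matching p from position i against eval(X); for X → YZ these tables compose as f_X(i) = f_Z(f_Y(i)).

```python
def is_subsequence(p, prods, start):
    """Decide whether the explicit pattern p is a subsequence of eval(start)."""
    m = len(p)
    table = {}   # nonterminal -> list f, where f[i] is the position reached from i

    def visit(x):
        if x not in table:
            rhs = prods[x]
            if isinstance(rhs, str):          # X -> a: advance only if p[i] == a
                f = [i + 1 if i < m and p[i] == rhs else i for i in range(m + 1)]
            else:                             # X -> Y Z: compose the two tables
                fy, fz = visit(rhs[0]), visit(rhs[1])
                f = [fz[fy[i]] for i in range(m + 1)]
            table[x] = f
        return table[x]

    return visit(start)[0] == m

fib = {1: "b", 2: "a"}
for i in range(3, 8):
    fib[i] = (i - 1, i - 2)
print(is_subsequence("bbbbb", fib, 7), is_subsequence("bbbbbb", fib, 7))  # True False
```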

Querying Compressed Texts
Definition (Compressed Querying)
Given an SLP A, a binary-coded number i, 1 ≤ i ≤ |eval(A)|
and a symbol a ∈ Σ, does eval(A)[i] = a hold?
Theorem
Compressed Querying is P-complete.
The proof of the aforementioned result is a logspace
reduction from the P-complete problem “super
increasing subset sum”.

References
Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy,
Manoj Prabhakaran, Amit Sahai, and Abhi Shelat.
The smallest grammar problem.
IEEE Trans. Information Theory, 51(7):2554–2576, 2005.
Dan Gusfield.
Algorithms on Strings, Trees, and Sequences: Computer
Science and Computational Biology.
Cambridge University Press, New York, NY, USA, 1997.
Yury Lifshits.
Processing compressed texts: A tractability border.
In Combinatorial Pattern Matching, 18th Annual
Symposium, CPM 2007, London, Canada, July 9-11, 2007,
Proceedings, pages 228–240, 2007.

Markus Lohrey.
Word problems and membership problems on
compressed words.
SIAM J. Comput., 35(5):1210–1240, 2006.
Markus Lohrey.
Algorithmics on SLP-compressed strings: A survey.
Groups Complexity Cryptology, 4(2):241–299, 2012.
Markus Lohrey.
Equality testing of compressed strings.
In Combinatorics on Words - 10th International
Conference, WORDS 2015, Kiel, Germany, September
14-17, 2015, Proceedings, pages 14–26, 2015.
