Space-efficient detection of unusual words

Djamal Belazzougui (1), Fabio Cunial (2)
(1) Department of Computer Science, University of Helsinki, Finland.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.

Exact substring discovery
n. of exact occurrences
model of the source: Markov chain of order 3
measure of "surprise"
[Figure: string T with a substring W highlighted; the quantity shown equals 8]

Find all substrings W with or

Quadratic output

[1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003.
Some surprise scores are monotonic

[Figure: a substring W and its extension XWY, marked "same", on the suffix tree ST_T]
Linear output

[2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998.
IID source
The variance of one substring

can be computed in constant time from:

By the recursive structure of borders:
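A minimal sketch of how borders enter the variance, assuming an IID source with character probabilities `p` and counting possibly overlapping occurrences of W in a random string of length n. Two occurrences at distance d < |W| can coexist only if d is a period of W, that is, only if W has a border of length |W| − d. The function name `iid_count_stats` is illustrative, and this direct sum over periods takes O(|W|²) time; it is not the constant-time computation the slides refer to.

```python
from math import prod


def iid_count_stats(w, n, p):
    """Mean and variance of the number of occurrences of w in an IID string of length n.

    p maps each character to its probability (assumed to sum to 1).
    Requires n >= len(w) >= 1.
    """
    m = len(w)
    positions = n - m + 1                  # number of possible starting positions
    p_hat = prod(p[c] for c in w)          # probability of w at a fixed position
    mean = positions * p_hat
    var = positions * p_hat * (1 - p_hat)  # sum of the per-position variances
    for d in range(1, min(m, positions)):
        joint = 0.0
        # d is a period of w  <=>  w has a border of length m - d
        if w[d:] == w[:m - d]:
            # both occurrences together spell w followed by its last d characters
            joint = p_hat * prod(p[c] for c in w[m - d:])
        # (positions - d) pairs of starting positions lie at distance d
        var += 2 * (positions - d) * (joint - p_hat * p_hat)
    return mean, var
```

Only the distances d that are periods of W contribute a positive joint term, which is why knowing the borders of W is enough to evaluate the sum.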

time with the Morris-Pratt algorithm
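For reference, a short sketch of the Morris-Pratt border computation on a single string. The names `border_array` and `all_borders` are illustrative; the second function follows the recursive structure of borders (every border of W is either the longest border of W or a border of it).

```python
def border_array(w):
    """Morris-Pratt: b[i] is the length of the longest proper border of w[:i]."""
    b = [0] * (len(w) + 1)
    b[0] = -1          # sentinel: the empty prefix has no border
    k = -1
    for i, c in enumerate(w):
        # fall back along shorter and shorter borders until c can extend one
        while k >= 0 and w[k] != c:
            k = b[k]
        k += 1
        b[i + 1] = k
    return b


def all_borders(w):
    """All non-empty proper borders of w, from longest to shortest."""
    b = border_array(w)
    out = []
    k = b[len(w)]
    while k > 0:
        out.append(w[:k])
        k = b[k]       # the next border of w is the longest border of w[:k]
    return out
```

On a string of length m the whole array is filled in O(m) time, since `k` increases by at most one per character and each fallback strictly decreases it.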

The border of all right-maximal substrings
[Figure: suffix tree ST_T with a node W, its left-extensions eW and dW, and right-extensions a, b, c]
[3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000.

[Figure: on ST_T, the node W, its left-extensions eW and dW, and the border V = bord(eW)]
The border of all minimal elements

words
time time
From the string:
Reduce the output
Score minimal absent words
(Truncated) suffix tree; BWT + rangeDistinct
bits
In this work
O(n log σ) bits
Randomized O(n) time

Removing the dependency on σ
[Figure: tree T with a node v, arcs labelled a, b, c, d, and "return arcs"]
[4] Simon. String matching algorithms and automata. In First South American Workshop on String Processing, 1993.
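A minimal sketch of Simon's observation [4] on a single string, assuming the "return arcs" of the figure are the backward arcs of a string-matching automaton. In the automaton of a word of length m, at most m backward arcs lead to a state other than the initial one, so the arcs worth storing do not depend on the alphabet size σ. The names `matching_automaton` and `return_arcs` are illustrative.

```python
def matching_automaton(w, alphabet):
    """Transition table delta[q][c] of the DFA that tracks the longest prefix of w
    that is a suffix of the text read so far (states 0..len(w))."""
    m = len(w)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 1 if m > 0 and w[0] == c else 0
    x = 0  # state of the longest proper border of w[:q]
    for q in range(1, m + 1):
        for c in alphabet:
            if q < m and w[q] == c:
                delta[q][c] = q + 1          # forward arc
            else:
                delta[q][c] = delta[x][c]    # copy the arc of the border state
        if q < m:
            x = delta[x][w[q]]
    return delta


def return_arcs(delta):
    """Backward arcs (q, c, r) with 0 < r <= q, i.e. those not leading to the root."""
    return [(q, c, r)
            for q, row in enumerate(delta)
            for c, r in row.items()
            if 0 < r <= q]
```

For example, the automaton of `abab` over {a, b} has three such arcs, (1, a, 1), (3, a, 1) and (4, a, 3); every other backward arc leads to state 0 and can be left implicit.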

w cannot be the root
[Figure: tree T with nodes v, v1, v2 and w, arcs labelled a, b, c, d, and "return arcs"]

[Figure: suffix tree ST_T with nodes W and eW, the border V = bord(eW), and stacks labelled a, b, c, d]
Minimal elements
words

Variance
So, we store:
In every node: ,
In every element of stack b, for every b:
[Figure: stacks labelled a, b, c, d, with entries W, eW and bord(eW)]

left-extensions, unordered
all right-maximal W of T
no order
O(nd) time
Enumerating nodes in no order
[5] D. Belazzougui. Linear time construction of compressed text indices in compact space. STOC 2014.
[6] D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. ESA 2013.
node enumerator
rangeDistinct stack
n log σ
O(n log log σ), O(σ² log² n)
Burrows-Wheeler transform
bits

rangeDistinct queries
T
[7] D. Belazzougui, G. Navarro, D. Valenzuela. Improved compressed indexes for full-text document retrieval. JDA, 18:3–13, January 2013.
O(out) time, σ log n bits of working space using [7]
Let's say O(out d) time in general
unordered
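A naive sketch of what a rangeDistinct query returns and how it yields the left-extensions of a substring, assuming a text over lowercase letters terminated by `$`. The function names are illustrative, and the scan below takes time proportional to the interval length (times a count), not the O(out) time of the structure in [7].

```python
def bwt_of(text):
    """BWT of text + '$' via sorted rotations (quadratic; for illustration only)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def range_distinct(bwt, i, j):
    """Distinct characters of bwt[i:j], each with its half-open range of ranks."""
    out = {}
    for k in range(i, j):
        c = bwt[k]
        if c not in out:
            first = bwt.count(c, 0, k)   # occurrences of c before position k
            out[c] = [first, first]
        out[c][1] += 1
    return out


def left_extensions(bwt, i, j):
    """Given the suffix-array interval [i, j) of W, return the interval of cW
    for every character c that precedes an occurrence of W."""
    C, total = {}, 0                     # C[c] = number of characters smaller than c
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)
    return {c: (C[c] + lo, C[c] + hi)
            for c, (lo, hi) in range_distinct(bwt, i, j).items()}
```

The characters come back in order of first appearance inside the interval, not in alphabetical order, which is the "unordered" behaviour noted on the slide.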

T
ordered
time
for statistical reasons
rangeDistinct
n log σ
O(n log log σ)
Burrows-Wheeler transform
node enumerator
border computations
enumerator stack, border+variance stack
bits

[Figure: the same components, now starting from the string, with two steps annotated "randomized O(n) time, O(n log σ) bits"]

We can limit the computation to maximal repeats and minimal rare words.
We can compute the score of all minimal absent words in time

Minimal absent words
Total number:
Time:
Space:
Total number:
Time:
Space:
Minimal rare words
for every proper substring V of W
Minimal unique substrings
aWb : W is a maximal repeat of T
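A brute-force sketch of the definition of minimal absent words, assuming the usual formulation: aWb does not occur in T, while both aW and Wb do. It enumerates all substrings explicitly, so it is only meant for tiny inputs; the function names are illustrative and this is not the space-efficient method of the talk.

```python
def substrings(t):
    """All non-empty substrings of t, as a set."""
    return {t[i:j] for i in range(len(t)) for j in range(i + 1, len(t) + 1)}


def minimal_absent_words(t, alphabet):
    """Strings aWb (a, b single characters) absent from t such that aW and Wb occur in t."""
    occ = substrings(t) | {""}           # the empty string occurs everywhere
    maws = set()
    for w in occ:
        for a in alphabet:
            if a + w not in occ:
                continue                 # aW must occur
            for b in alphabet:
                if w + b in occ and a + w + b not in occ:
                    maws.add(a + w + b)
    return maws
```

Since aW and Wb both occur but aWb does not, W is preceded and followed by at least two distinct characters in T, which matches the condition that W is a maximal repeat.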
[8] Belazzougui, Cunial. A framework for space-efficient string kernels. CPM 2015.
[9] Crochemore, Mignosi, Restivo. Automata and forbidden words. IPL, 1998.
[10] Herold, Kurtz, Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 2008.
[11] Ileri, Külekci, Xu. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries. TCS, 2015.

[Figure: suffix tree ST_T with nodes W, eW and heW, the border B = bord(heW) and suf(bord(heW)), and stacks labelled a to e holding pairs such as eW|a and B|a. Panels: "Right-extensions of W", "Right-extensions of eW", "Right-extensions of suf(B)"]
words

Prototype implementation
Genome of length 14.8 million:
Verbumculus, length ≤ 12: 57 seconds, 6 GB
Verbumculus, length ≤ 24: 2 minutes, 14 GB
Verbumculus, length ≤ 36: 4 minutes, 14 GB
Our prototype, any length: 33 seconds
[12] Apostolico, Gong, Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 2004.
