Paper: https://link.springer.com/chapter/10.1007/978-3-319-23826-5_22
Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that \emph{do not occur} in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
1. Space-efficient detection of unusual words
Djamal Belazzougui (1), Fabio Cunial (2)
(1) Department of Computer Science, University of Helsinki, Finland.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
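4. Exact substring discovery
Find all substrings W of T that occur more frequently or less frequently than expected.
Number of exact occurrences of W in T
Model of the source: Markov chain of order 3
Measure of "surprise"
[Figure: example string T with a substring W; label "= 8"]
Quadratic output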
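A brute-force sketch can make the three ingredients on this slide concrete: an occurrence count, an expectation under a model of the source, and a score. It is not the algorithm of the paper. As a simplifying assumption it uses an IID model estimated from the character frequencies of T rather than a Markov chain of order 3, and the score `(count - expectation) / sqrt(expectation)` is only one possible measure of "surprise". All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def substring_counts(text, max_len):
    """Count every substring of length <= max_len by brute force."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            counts[text[i:j]] += 1
    return counts

def iid_expectation(word, text):
    """Expected number of occurrences of word under an IID model whose
    character probabilities are the frequencies observed in text."""
    n = len(text)
    freq = Counter(text)
    p = 1.0
    for c in word:
        p *= freq[c] / n
    return (n - len(word) + 1) * p

def surprising_substrings(text, max_len, threshold):
    """Return (word, count, score) for occurring words with |score| >= threshold,
    where score = (count - expectation) / sqrt(expectation)."""
    result = []
    for word, f in substring_counts(text, max_len).items():
        e = iid_expectation(word, text)
        z = (f - e) / sqrt(e)
        if abs(z) >= threshold:
            result.append((word, f, z))
    return sorted(result, key=lambda t: -t[2])
```

The number of distinct substrings can grow quadratically with the length of T, which is the "quadratic output" problem named on the slide. This sketch also scores only strings that occur in T.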
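10. Some surprise scores are monotonic
When W and an extension XWY have the same number of occurrences in T, some surprise scores are monotonic between W and XWY [1]. This restricts the candidates to strings associated with the suffix tree ST_T.
Linear output
[1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003.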
14. The variance of one substring
IID source
The variance of one substring can be computed in constant time from precomputed quantities [2].
By the recursive structure of borders, these can be obtained in [bound missing] time with the Morris-Pratt algorithm.
[2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998.
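The recursive structure of borders mentioned on this slide can be sketched with the classical Morris-Pratt failure function. The code below computes only border lengths. It does not compute the variance, whose formula is not preserved in this text. The function names are hypothetical.

```python
def border_array(word):
    """border[i] is the length of the longest border of word[:i], i.e. the
    longest proper prefix of word[:i] that is also a suffix of it."""
    m = len(word)
    border = [0] * (m + 1)
    k = 0  # length of the longest border of the prefix processed so far
    for i in range(1, m):
        # Fall back through shorter and shorter borders until one can be extended.
        while k > 0 and word[i] != word[k]:
            k = border[k]
        if word[i] == word[k]:
            k += 1
        border[i + 1] = k
    return border

def all_borders(word):
    """Lengths of all borders of word, longest first: every border of word
    is either its longest border or a border of that longest border."""
    border = border_array(word)
    lengths = []
    k = border[len(word)]
    while k > 0:
        lengths.append(k)
        k = border[k]
    return lengths
```

For example, `all_borders("abracadabra")` walks from the longest border "abra" to its own longest border "a".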
15. The border of all right-maximal substrings
[Figure: substring W followed by the characters a, b, c, d, e; label "= Waaaaa"]
20. The border of all right-maximal substrings
[Figure: suffix tree ST_T with node W, its left-extensions dW and eW, and V=bord(eW); other labels: 0, 7, a, b, c, f, g, h, "= Wee"]
[3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000.
21. The border of all right-maximal substrings
[Figure: suffix tree ST_T with node W, its left-extensions dW and eW, and V=bord(eW); other labels: 0, 7, a, b, c, f, g, h, "= Wefe"]
[3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000.
22. The border of all minimal elements
[Figure: suffix tree ST_T with node W, its left-extensions dW and eW, and V=bord(eW); other labels: 0, 7, a, b, c, f, g, h, "= Wefe"]
[3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000.
23. In this work
(Truncated) suffix tree: space in words.
BWT + rangeDistinct: space in bits.
From the string: O(n log σ) bits, randomized O(n) time.
Reduce the output
Score minimal absent words
26. Removing the dependency on σ
[Figure: "Return arcs"; labels Tv, a, b, c, d]
[4] Simon. String matching algorithms and automata. In First South American Workshop on String Processing, 1993.
42. Enumerating nodes in no order
Input: Burrows-Wheeler tr. (n log σ bits), rangeDistinct (O(n log log σ) bits), stack (O(σ² log² n) bits).
Node enumerator: reports all right-maximal W of T, in no order, with their left-extensions (unordered), in O(nd) time.
[5] D. Belazzougui. Linear time construction of compressed text indices in compact space. STOC 2014.
[6] D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. ESA 2013.
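To make the output of the node enumerator concrete, the following brute-force sketch lists the right-maximal substrings of a string directly from their definition: W is right-maximal if it is followed by at least two distinct characters. It takes quadratic time and space and does not use the BWT, so it illustrates what is enumerated, not how the slide's enumerator works. The end-of-text sentinel `$` and the function name are assumptions.

```python
from collections import defaultdict

def right_maximal_substrings(text, sentinel="$"):
    """Return the set of substrings of text that are followed by at least two
    distinct characters, counting an end-of-text sentinel as a character.
    Assumes the sentinel does not occur in text."""
    s = text + sentinel
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            # s[i:j] is a (possibly empty) substring of text, followed by s[j].
            followers[s[i:j]].add(s[j])
    return {w for w, chars in followers.items() if len(chars) >= 2}
```

With the sentinel appended, these strings are the labels of the internal nodes of the suffix tree, and the empty string corresponds to the root.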
44. rangeDistinct queries
[Figure: a rangeDistinct query on T, reporting the distinct characters of a range, unordered]
O(out) time, σ log n bits of working space using [7].
Let's say O(out d) time in general.
[7] D. Belazzougui, G. Navarro, D. Valenzuela. Improved compressed indexes for full-text document retrieval. JDA, 18:3–13, January 2013.
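A naive sketch of how a rangeDistinct query on the BWT yields left-extensions follows. The query here is a linear scan of the range, whereas the slide's data structure answers it in time proportional to the output. The BWT is built by sorting rotations, which is for illustration only. The sentinel `$` and all function names are assumptions.

```python
from collections import Counter

def bwt_of(text):
    """Naive BWT via sorted rotations of text + '$' (assumes '$' is not in
    text and sorts before every other character)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def range_distinct(bwt, i, j):
    """Distinct characters of bwt[i:j], each with its number of occurrences
    in that range. Naive scan, O(j - i) time."""
    return dict(Counter(bwt[i:j]))

def left_extensions(bwt, i, j):
    """Given the half-open interval [i, j) of the rows prefixed by W in the
    sorted rotations, return {a: interval of aW} for every character a that
    precedes an occurrence of W."""
    # C[c] = number of characters in the text that are smaller than c.
    totals = Counter(bwt)
    C, acc = {}, 0
    for c in sorted(totals):
        C[c] = acc
        acc += totals[c]
    result = {}
    for c, occ in range_distinct(bwt, i, j).items():
        start = C[c] + bwt[:i].count(c)  # rank of c before position i
        result[c] = (start, start + occ)
    return result
```

For `banana`, the rows prefixed by `a` form the interval [1, 4), and the query reports the left-extensions `na` and `ba` with their intervals. The sentinel is reported like any other character when W is a prefix of the text.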
46. [Figure: pipeline from the string T to the border computations; label "ordered"]
String to Burrows-Wheeler tr. (n log σ bits): randomized O(n) time, O(n log σ) bits.
Burrows-Wheeler tr. to rangeDistinct (O(n log log σ) bits): randomized O(n) time, O(n log σ) bits.
rangeDistinct to node enumerator to border computations, with an enumerator stack and a border+variance stack.
[5] D. Belazzougui. Linear time construction of compressed text indices in compact space. STOC 2014.
[6] D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. ESA 2013.
[7] D. Belazzougui, G. Navarro, D. Valenzuela. Improved compressed indexes for full-text document retrieval. JDA, 18:3–13, January 2013.
48. We can limit the computation to maximal repeats and minimal rare words.
We can compute the score of all minimal absent words in time [bound missing].
49. Minimal absent words
Total number, time, space: [bounds missing]
Minimal rare words
Total number, time, space: [bounds missing]
for every proper substring V of W
Minimal unique substrings
aWb : W is a maximal repeat of T
[8] Belazzougui, Cunial. A framework for space-efficient string kernels. CPM 2015.
[9] Crochemore, Mignosi, Restivo. Automata and forbidden words. IPL, 1998.
[10] Herold, Kurtz, Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 2008.
[11] Ileri, Külekci, Xu. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries. TCS, 2015.
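The form aWb on this slide can be illustrated with a brute-force sketch for minimal absent words. The definition on the slide is not preserved in this text, so the sketch assumes the standard one: aWb is a minimal absent word of T if aWb does not occur in T but both aW and Wb do. It builds the set of all substrings explicitly, so it is only suitable for very short strings. The function names are hypothetical.

```python
def substrings_of(text):
    """Set of all non-empty substrings of text (quadratically many)."""
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}

def minimal_absent_words(text):
    """Minimal absent words of text over the alphabet of characters that
    occur in text: strings aWb such that aWb does not occur, while aW and
    Wb both occur (W may be empty)."""
    alphabet = sorted(set(text))
    occurring = substrings_of(text)
    result = set()
    for left in occurring:       # left plays the role of aW
        w = left[1:]             # W, possibly empty
        for b in alphabet:
            candidate = left + b  # aWb
            if candidate not in occurring and (w + b) in occurring:
                result.add(candidate)
    return result
```

For `abaab` this returns `aaa`, `aaba`, `bab` and `bb`. In each of them the middle part W (`a`, `ab`, `a`, and the empty string) is a repeat that is preceded and followed by different characters across its occurrences.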
56. Prototype implementation
Genome of length 14.8 million
Our prototype, any length: 33 seconds
Verbumculus, length ≤ 12: 57 seconds, 6 GB
Verbumculus, length ≤ 24: 2 minutes, 14 GB
Verbumculus, length ≤ 36: 4 minutes, 14 GB
[12] Apostolico, Gong, Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 2004.