Space-efficient detection of unusual words
Djamal Belazzougui (1), Fabio Cunial (2)
(1) Department of Computer Science, University of Helsinki, Finland.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
Exact substring discovery
[Figure: example string T with a substring W highlighted; one annotated quantity equals 8.]
n. of exact occurrences
model of the source: Markov chain of order 3
measure of "surprise"
Find all substrings W with [condition lost in extraction] or [condition lost in extraction]
Quadratic output
SPIRE 2001
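To see why the output can be quadratic, here is a minimal sketch that counts the exact (possibly overlapping) occurrences of every substring of a short string. The function name `substring_counts` is hypothetical and not taken from the slides; the brute-force enumeration is only meant for tiny inputs.

```python
from collections import Counter


def substring_counts(text):
    """Count the exact occurrences of every non-empty substring of text.

    Brute force: there are n(n+1)/2 (start, end) pairs, and in the worst
    case they are all distinct strings, so the table itself is quadratic.
    """
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[text[i:j]] += 1
    return counts


counts = substring_counts("abracadabra")
print(counts["abra"])                   # occurrences of one word W
print(len(substring_counts("abcd")))    # all 4*5/2 substrings are distinct
```

When all characters are distinct, every (start, end) pair yields a different word, which is the quadratic worst case the slide refers to.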
Some surprise scores are monotonic
[Figure: on the suffix tree ST_T, a substring W and an extension XWY, marked "same".]
Linear output
[1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003.
The variance of one substring
IID source
can be computed in constant time from: [quantities lost in extraction]
By the recursive structure of borders: [formula lost in extraction]
[bound lost in extraction] time with the Morris-Pratt algorithm
[2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998.
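The border computation mentioned above can be sketched with the classic Morris-Pratt failure function. This is a generic textbook version, not the authors' code; the names `border_lengths` and `all_borders` are hypothetical. The second function follows the recursive structure of borders: the borders of a word are its longest border, then the longest border of that border, and so on.

```python
def border_lengths(word):
    """border[i] = length of the longest proper border of word[:i].

    border[0] is the sentinel -1. Runs in time linear in len(word).
    """
    border = [0] * (len(word) + 1)
    border[0] = -1
    k = -1
    for i, c in enumerate(word):
        # Fall back along shorter borders until one can be extended by c.
        while k >= 0 and word[k] != c:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def all_borders(word):
    """All non-empty proper borders of word, longest first."""
    border = border_lengths(word)
    result = []
    k = border[len(word)]
    while k > 0:
        result.append(word[:k])
        k = border[k]
    return result


print(border_lengths("abaab"))
print(all_borders("abacaba"))
```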
The border of all right-maximal substrings
[Figure: suffix tree ST_T with a node W, its left-extensions dW and eW, and child labels a, b, c, d, e, f, g, h. The build steps show the strings Waaaaa, Wee and Wefe, and the label V=bord(eW).]
[3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000.
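As a small illustration of the objects on this slide, the sketch below enumerates right-maximal substrings by brute force and computes the longest border of each. It assumes the usual definition (W is right-maximal if it is followed by at least two distinct characters in T) and uses no end-of-string terminator; both choices, and the function names, are assumptions for illustration only.

```python
from collections import defaultdict


def right_maximal_substrings(text):
    """Substrings of text that are followed by >= 2 distinct characters.

    Quadratic brute force; the empty string is included when text
    contains at least two distinct characters after some position.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            followers[text[i:j]].add(text[j])
    return sorted(w for w, chars in followers.items() if len(chars) >= 2)


def longest_border(word):
    """Longest proper prefix of word that is also a suffix (naive check)."""
    for k in range(len(word) - 1, 0, -1):
        if word[:k] == word[-k:]:
            return word[:k]
    return ""


for w in right_maximal_substrings("mississippi"):
    print(repr(w), "->", repr(longest_border(w)))
```

On this example there are fewer right-maximal substrings than characters in the text, which is what makes restricting the computation to them attractive.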
The border of all minimal elements
[bounds in words and time lost in extraction]
From the string:
Reduce the output
Score minimal absent words
(Truncated) Suffix tree BWT + rangeDistinct
bits
In this work
O(n log σ) bits
Randomized O(n) time
Removing the dependency on σ
[Figure: a tree T with nodes v, v1, v2 and w, child labels a, b, c, d, and "return arcs"; w cannot be the root. Later build steps show the suffix tree ST_T with W, dW, eW, V=bord(eW), the string Wefe, and stacks labeled a, b, c, d.]
[4] Simon. String matching algorithms and automata. In First South American Workshop on String Processing, 1993.
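The slides apply the idea of "return arcs" to a tree. As background only, the sketch below shows the classic single-pattern setting of [4]: in the string-matching automaton of a pattern of length m, the backward transitions that do not lead to the initial state number at most m, whatever the alphabet size. The function names are hypothetical, and this is not the authors' tree-based construction.

```python
def matching_automaton(pattern, alphabet):
    """Complete string-matching automaton: delta[state][char] -> state."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 0
    if m > 0:
        delta[0][pattern[0]] = 1
    x = 0  # state of the longest proper border of the current prefix
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]  # mismatch: behave like the border
        if q < m:
            delta[q][pattern[q]] = q + 1  # match: move forward
            x = delta[x][pattern[q]]
    return delta


def count_return_arcs(pattern, alphabet):
    """Backward transitions whose target is not the initial state."""
    delta = matching_automaton(pattern, alphabet)
    return sum(
        1
        for q, row in enumerate(delta)
        for target in row.values()
        if 0 < target <= q
    )


print(count_return_arcs("abab", "ab"))
print(count_return_arcs("abab", "abcd"))  # larger alphabet, same count
```

Enlarging the alphabet only adds transitions to the initial state, so storing just the return arcs keeps the representation independent of σ.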
Minimal elements
[Figure: suffix tree ST_T with W, dW, eW and bord(eW), the label V=bord(eW), child labels f, g, h, and stacks labeled a, b, c, d.]
words
Variance
can be computed in constant time from: [quantities lost in extraction]
So, we store:
In every node: [two values lost in extraction]
In every element of stack b, for every b: [values lost in extraction]
[Figure: W, eW and bord(eW) with stacks labeled a, b, c, d.]
In small space
left-extensions, unordered
all right-maximal W of T
no order
O(nd) time
Enumerating nodes in no order
[5] D. Belazzougui. Linear time construction of compressed text indices in compact space. STOC 2014.
[6] D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. ESA 2013.
[Diagram: Burrows-Wheeler tr., node enumerator, rangeDistinct, stack; labels: n log σ, O(n log log σ), O(σ² log² n), bits.]
rangeDistinct queries
[Figure labels: T, ordered.]
[7] D. Belazzougui, G. Navarro, D. Valenzuela. Improved compressed indexes for full-text
document retrieval. JDA, 18:3–13, January 2013.
O(out) time, σ log n bits of working space using [1]
Let's say O(out d) time in general
unordered
T
ordered
time
for statistical reasons
[Diagram: string, Burrows-Wheeler tr., rangeDistinct, node enumerator, border computations, enumerator stack, border+variance stack; labels: n log σ, O(n log log σ), bits.]
randomized O(n) time, O(n log σ) bits
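To make the two building blocks of this pipeline concrete, here is a naive sketch of the Burrows-Wheeler transform and of a rangeDistinct query. It assumes a terminator `$` that is absent from the text and smaller than every other character; the function names are hypothetical. Real implementations answer rangeDistinct on a compact representation rather than by scanning, so this only shows what the query returns, not how to meet the bounds on the slides.

```python
from collections import Counter


def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive)."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def range_distinct(bwt_string, i, j):
    """Distinct characters in bwt_string[i..j] (inclusive), with counts.

    If [i..j] is the interval of sorted rotations that start with W,
    these characters are exactly the left-extensions of W.
    """
    return dict(Counter(bwt_string[i:j + 1]))


b = bwt("banana")
print(b)
# Rows 1..3 of the sorted rotations start with "a".
print(range_distinct(b, 1, 3))
```

In the example, the interval of "a" yields the left-extensions n (twice) and b (once), matching the occurrences of "na" and "ba" in the text.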
Extensions
We can limit the computation to
maximal repeats and
minimal rare words
We can compute the score of all
minimal absent words
in time
Minimal absent words
Total number:
Time:
Space:
Total number:
Time:
Space:
Minimal rare words
for every proper substring V of W
Minimal unique substrings
aWb : W is a maximal repeat of T
[8] Belazzougui, Cunial. A framework for space-efficient string kernels. CPM 2015.
[9] Crochemore, Mignosi, Restivo. Automata and forbidden words. IPL, 1998.
[10] Herold, Kurtz, Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 2008.
[11] Ileri, Külekci, Xu. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries. TCS, 2015.
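For reference, minimal absent words can be enumerated by brute force on small inputs. The sketch assumes the standard definition from the literature cited above: aWb is a minimal absent word of T if aWb does not occur in T but both aW and Wb do. The function names are hypothetical, single characters missing from T are not reported, and the quadratic substring set is for illustration only.

```python
def substrings(text):
    """Set of all non-empty substrings of text."""
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def minimal_absent_words(text):
    """Words aWb absent from text such that aW and Wb both occur."""
    subs = substrings(text)
    alphabet = sorted(set(text))
    result = set()
    for w in subs | {""}:
        for a in alphabet:
            if a + w not in subs:
                continue
            for b in alphabet:
                if w + b in subs and a + w + b not in subs:
                    result.add(a + w + b)
    return result


print(sorted(minimal_absent_words("abaab")))
```

In every reported word aWb, the core W occurs with at least two distinct left contexts and two distinct right contexts, which is consistent with the slide's remark that W is a maximal repeat of T.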
Minimal absent words
[Figure: suffix tree ST_T with W, eW, heW, B=bord(heW) and suf(bord(heW)), stacks labeled a, b, c, d, e, and the strings We and Weh. Pairs shown: eW|a, eW|b, eW|c, eW|d and B|a, B|b, B|c, B|d, B|e. Captions: right-extensions of W, right-extensions of eW, right-extensions of suf(B).]
words
Prototype implementation
Genome of length 14.8 million
Our prototype, any length: 33 seconds
Verbumculus, length ≤ 12: 57 seconds, 6 GB
Verbumculus, length ≤ 24: 2 minutes, 14 GB
Verbumculus, length ≤ 36: 4 minutes, 14 GB
[12] Apostolico, Gong, Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 2004.
More Related Content

What's hot

On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...
On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...
On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...inventionjournals
 
The End-to-End Distance of RNA as a Randomly Self-Paired Polymer
The End-to-End Distance of RNA as a Randomly Self-Paired PolymerThe End-to-End Distance of RNA as a Randomly Self-Paired Polymer
The End-to-End Distance of RNA as a Randomly Self-Paired PolymerLi Tai Fang
 
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)irjes
 
Some Fixed Point Theorems in b G -cone Metric Space
Some Fixed Point Theorems in b G -cone Metric Space Some Fixed Point Theorems in b G -cone Metric Space
Some Fixed Point Theorems in b G -cone Metric Space Komal Goyal
 
Seminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasSeminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasHeinrich Hartmann
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)inventionjournals
 
Cd32939943
Cd32939943Cd32939943
Cd32939943IJMER
 
Best Approximation in Real Linear 2-Normed Spaces
Best Approximation in Real Linear 2-Normed SpacesBest Approximation in Real Linear 2-Normed Spaces
Best Approximation in Real Linear 2-Normed SpacesIOSR Journals
 
Decomposition of continuity and separation axioms via lower and upper approxi...
Decomposition of continuity and separation axioms via lower and upper approxi...Decomposition of continuity and separation axioms via lower and upper approxi...
Decomposition of continuity and separation axioms via lower and upper approxi...Alexander Decker
 
(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces
(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces
(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological SpacesIOSR Journals
 
A Note on a Three Variables Analogue of Bessel Polynomials
A Note on a Three Variables Analogue of Bessel PolynomialsA Note on a Three Variables Analogue of Bessel Polynomials
A Note on a Three Variables Analogue of Bessel PolynomialsIJMER
 

What's hot (13)

On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...
On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...
On Coincidence Points in Pseudocompact Tichonov Spaces and Common Fixed Point...
 
The End-to-End Distance of RNA as a Randomly Self-Paired Polymer
The End-to-End Distance of RNA as a Randomly Self-Paired PolymerThe End-to-End Distance of RNA as a Randomly Self-Paired Polymer
The End-to-End Distance of RNA as a Randomly Self-Paired Polymer
 
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)
 
Some Fixed Point Theorems in b G -cone Metric Space
Some Fixed Point Theorems in b G -cone Metric Space Some Fixed Point Theorems in b G -cone Metric Space
Some Fixed Point Theorems in b G -cone Metric Space
 
Seminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasSeminar on Motivic Hall Algebras
Seminar on Motivic Hall Algebras
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)
 
Cd32939943
Cd32939943Cd32939943
Cd32939943
 
Best Approximation in Real Linear 2-Normed Spaces
Best Approximation in Real Linear 2-Normed SpacesBest Approximation in Real Linear 2-Normed Spaces
Best Approximation in Real Linear 2-Normed Spaces
 
Structure of unital 3-fields, by S.Duplij, W.Werner
Structure of unital 3-fields, by S.Duplij, W.WernerStructure of unital 3-fields, by S.Duplij, W.Werner
Structure of unital 3-fields, by S.Duplij, W.Werner
 
E42012426
E42012426E42012426
E42012426
 
Decomposition of continuity and separation axioms via lower and upper approxi...
Decomposition of continuity and separation axioms via lower and upper approxi...Decomposition of continuity and separation axioms via lower and upper approxi...
Decomposition of continuity and separation axioms via lower and upper approxi...
 
(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces
(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces
(𝛕𝐢, 𝛕𝐣)− RGB Closed Sets in Bitopological Spaces
 
A Note on a Three Variables Analogue of Bessel Polynomials
A Note on a Three Variables Analogue of Bessel PolynomialsA Note on a Three Variables Analogue of Bessel Polynomials
A Note on a Three Variables Analogue of Bessel Polynomials
 

Similar to Space-efficient detection of unusual words

Cambridge 2014 Complexity, tails and trends
Cambridge 2014  Complexity, tails and trendsCambridge 2014  Complexity, tails and trends
Cambridge 2014 Complexity, tails and trendsNick Watkins
 
ON 2-REPEATED SOLID BURST ERRORS
ON 2-REPEATED SOLID BURST ERRORSON 2-REPEATED SOLID BURST ERRORS
ON 2-REPEATED SOLID BURST ERRORSijfcstjournal
 
Mimo radar detection in compound gaussian clutter using orthogonal discrete f...
Mimo radar detection in compound gaussian clutter using orthogonal discrete f...Mimo radar detection in compound gaussian clutter using orthogonal discrete f...
Mimo radar detection in compound gaussian clutter using orthogonal discrete f...ijma
 
Sub artex spaces of an artex space over a bi monoid
Sub artex spaces of an artex space over a bi monoidSub artex spaces of an artex space over a bi monoid
Sub artex spaces of an artex space over a bi monoidAlexander Decker
 
新たなRNNと自然言語処理
新たなRNNと自然言語処理新たなRNNと自然言語処理
新たなRNNと自然言語処理hytae
 
redes neuronales tipo Som
redes neuronales tipo Somredes neuronales tipo Som
redes neuronales tipo SomESCOM
 
Smaller fully-functional bidirectional BWT indexes
Smaller fully-functional bidirectional BWT indexesSmaller fully-functional bidirectional BWT indexes
Smaller fully-functional bidirectional BWT indexesFabio Cunial
 
Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...
Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...
Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...Animesh Chaturvedi
 
The Missing Fundamental Element
The Missing Fundamental ElementThe Missing Fundamental Element
The Missing Fundamental ElementSaurav Roy
 
As pi re2015_abstracts
As pi re2015_abstractsAs pi re2015_abstracts
As pi re2015_abstractsJoseph Park
 
Dictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.pptDictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.pptManimaran A
 
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...cseiitgn
 
Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...André Panisson
 
Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...
Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...
Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...Federico Gobbo
 
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...IJERA Editor
 
Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...
Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...
Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...MattiaMantovani6
 

Similar to Space-efficient detection of unusual words (20)

Cambridge 2014 Complexity, tails and trends
Cambridge 2014  Complexity, tails and trendsCambridge 2014  Complexity, tails and trends
Cambridge 2014 Complexity, tails and trends
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
ON 2-REPEATED SOLID BURST ERRORS
ON 2-REPEATED SOLID BURST ERRORSON 2-REPEATED SOLID BURST ERRORS
ON 2-REPEATED SOLID BURST ERRORS
 
Mimo radar detection in compound gaussian clutter using orthogonal discrete f...
Mimo radar detection in compound gaussian clutter using orthogonal discrete f...Mimo radar detection in compound gaussian clutter using orthogonal discrete f...
Mimo radar detection in compound gaussian clutter using orthogonal discrete f...
 
Sub artex spaces of an artex space over a bi monoid
Sub artex spaces of an artex space over a bi monoidSub artex spaces of an artex space over a bi monoid
Sub artex spaces of an artex space over a bi monoid
 
Presentation
PresentationPresentation
Presentation
 
新たなRNNと自然言語処理
新たなRNNと自然言語処理新たなRNNと自然言語処理
新たなRNNと自然言語処理
 
redes neuronales tipo Som
redes neuronales tipo Somredes neuronales tipo Som
redes neuronales tipo Som
 
Smaller fully-functional bidirectional BWT indexes
Smaller fully-functional bidirectional BWT indexesSmaller fully-functional bidirectional BWT indexes
Smaller fully-functional bidirectional BWT indexes
 
Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...
Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...
Shortest path, Bellman-Ford's algorithm, Dijkastra's algorithm, their Java co...
 
The Missing Fundamental Element
The Missing Fundamental ElementThe Missing Fundamental Element
The Missing Fundamental Element
 
fored
foredfored
fored
 
As pi re2015_abstracts
As pi re2015_abstractsAs pi re2015_abstracts
As pi re2015_abstracts
 
Dictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.pptDictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.ppt
 
Unit 6: All
Unit 6: AllUnit 6: All
Unit 6: All
 
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
 
Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...Exploring temporal graph data with Python: 
a study on tensor decomposition o...
Exploring temporal graph data with Python: 
a study on tensor decomposition o...
 
Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...
Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...
Adpositional Argumentation: How Logic Originates In Natural Argumentative Dis...
 
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara...
 
Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...
Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...
Microwave spectroscopy reveals the quantum geometric tensor of topological Jo...
 

More from Fabio Cunial

Accurate repeat reconstruction from long reads
Accurate repeat reconstruction from long readsAccurate repeat reconstruction from long reads
Accurate repeat reconstruction from long readsFabio Cunial
 
Fast matching statistics in small space
Fast matching statistics in small spaceFast matching statistics in small space
Fast matching statistics in small spaceFabio Cunial
 
Composite repetition-aware data structures
Composite repetition-aware data structuresComposite repetition-aware data structures
Composite repetition-aware data structuresFabio Cunial
 
Suffix links survival kit
Suffix links survival kitSuffix links survival kit
Suffix links survival kitFabio Cunial
 
Indexed matching statistics and shortest unique substrings
Indexed matching statistics and shortest unique substringsIndexed matching statistics and shortest unique substrings
Indexed matching statistics and shortest unique substringsFabio Cunial
 
Fully-functional bidirectional Burrows-Wheeler indexes
Fully-functional bidirectional Burrows-Wheeler indexesFully-functional bidirectional Burrows-Wheeler indexes
Fully-functional bidirectional Burrows-Wheeler indexesFabio Cunial
 

More from Fabio Cunial (6)

Accurate repeat reconstruction from long reads
Accurate repeat reconstruction from long readsAccurate repeat reconstruction from long reads
Accurate repeat reconstruction from long reads
 
Fast matching statistics in small space
Fast matching statistics in small spaceFast matching statistics in small space
Fast matching statistics in small space
 
Composite repetition-aware data structures
Composite repetition-aware data structuresComposite repetition-aware data structures
Composite repetition-aware data structures
 
Suffix links survival kit
Suffix links survival kitSuffix links survival kit
Suffix links survival kit
 
Indexed matching statistics and shortest unique substrings
Indexed matching statistics and shortest unique substringsIndexed matching statistics and shortest unique substrings
Indexed matching statistics and shortest unique substrings
 
Fully-functional bidirectional Burrows-Wheeler indexes
Fully-functional bidirectional Burrows-Wheeler indexesFully-functional bidirectional Burrows-Wheeler indexes
Fully-functional bidirectional Burrows-Wheeler indexes
 

Recently uploaded

STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxyaramohamed343013
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxAleenaTreesaSaji
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravitySubhadipsau21168
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 

Recently uploaded (20)

STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Space-efficient detection of unusual words

  • 1. Space-efficient detection of unusual words. Djamal Belazzougui (1), Fabio Cunial (2). (1) Department of Computer Science, University of Helsinki, Finland. (2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
  • 2. Exact substring discovery number of exact occurrences model of the source: Markov chain of order 3 measure of "surprise" T = = 8 W
  • 3. Exact substring discovery number of exact occurrences model of the source: Markov chain of order 3 measure of "surprise" T = = 8 W Find all substrings W with or
  • 4. Exact substring discovery Find all substrings W with or number of exact occurrences model of the source: Markov chain of order 3 measure of "surprise" T = = 8 W Quadratic output
  • 6. W [1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003. Some surprise scores are monotonic
  • 7. W YX [1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003. Some surprise scores are monotonic
  • 9. W YX [1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003. same W XWY Some surprise scores are monotonic
  • 10. Some surprise scores are monotonic W YX [1] Apostolico, Bock, Lonardi. Monotony of surprise and large-scale quest for unusual words. Journal of Computational Biology, 2003. same W XWY STT Linear output
  • 11. [2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998. IID source The variance of one substring
  • 12. [2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998. IID source can be computed in constant time from: The variance of one substring
  • 13. [2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998. IID source can be computed in constant time from: By the recursive structure of borders: The variance of one substring
  • 14. The variance of one substring [2] Apostolico, Bock, Xu. Annotated statistical indices for sequence analysis. In Proceedings of Compression and Complexity of Sequences 1997, pages 215–229. IEEE, 1998. IID source can be computed in constant time from: By the recursive structure of borders: time with the Morris-Pratt algorithm
  • 15. The border of all right-maximal substrings W = Waaaaa d e a b c
  • 17. 0 7 STT W eW dW a b c = Wee [3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000. The border of all right-maximal substrings
  • 18. 0 7 STT W eW dW a b c = Wee [3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000. V=bord(eW) The border of all right-maximal substrings
  • 19. 0 7 STT W eW f f h g h dW a b c = Wee [3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000. V=bord(eW) The border of all right-maximal substrings
  • 21. 0 7 STT W eW f f h g h dW a b c = Wefe [3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000. V=bord(eW) The border of all right-maximal substrings
  • 22. 0 7 STT W eW f f h g h dW a b c = Wefe [3] Apostolico, Bock, Lonardi, Xu. Efficient detection of unusual words. Journal of Computational Biology, 2000. V=bord(eW) The border of all minimal elements
  • 23. words time time From the string: Reduce the output Score minimal absent words (Truncated) Suffix tree BWT + rangeDistinct bits In this work O(n log σ) bits Randomized O(n) time
  • 25. Removing the dependency on σ Tv b c d
  • 26. Removing the dependency on σ Tv b "Return arcs" c d a [4] Simon. String matching algorithms and automata. In First South American Workshop on String Processing, 1993.
  • 28. Removing the dependency on σ Tv w w cannot be the root b c d a "Return arcs"
  • 29. Removing the dependency on σ Tv1 v2 wc d
  • 30. Removing the dependency on σ Tv1 v2 wc c d d
  • 32. Removing the dependency on σ Tv1 v2 w
  • 33. Removing the dependency on σ Tv1 v2 wa b a
  • 34. 0 7 STT W eW f f h g h dW a b c = Wefe V=bord(eW) Removing the dependency on σ
  • 35. STT W eW f f h g h dW a b c = Wefe V=bord(eW) Removing the dependency on σ 7+1
  • 39. Variance can be computed in constant time from:
  • 40. Variance can be computed in constant time from: So, we store: In every node: , In every element of stack b, for every b: a bord(eW) W eW b c d
  • 42. left-extensions, unordered all right-maximal W of T no order O(nd) time Enumerating nodes in no order [5] D. Belazzougui. Linear time construction of compressed text indices in compact space. STOC 2014. [6] D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. ESA 2013. node enumerator rangeDistinct stack n log σ O(n log log σ), O(σ^2 log^2 n) Burrows-Wheeler transform bits
  • 44. rangeDistinct queries T [7] D. Belazzougui, G. Navarro, D. Valenzuela. Improved compressed indexes for full-text document retrieval. JDA, 18:3–13, January 2013. O(out) time, σ log n bits of working space using [1] Let's say O(out d) time in general unordered
  • 45. T ordered time for statistical reasons rangeDistinct n log σ O(n log log σ) Burrows-Wheeler transform node enumerator border computations enumerator stack, border+variance stack bits
  • 46. T ordered rangeDistinct string n log σ randomized O(n) time, O(n log σ) bits randomized O(n) time, O(n log σ) bits O(n log log σ) Burrows-Wheeler transform node enumerator border computations [5] D. Belazzougui. Linear time construction of compressed text indices in compact space. STOC 2014. [6] D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. ESA 2013. [7] D. Belazzougui, G. Navarro, D. Valenzuela. Improved compressed indexes for full-text document retrieval. JDA, 18:3–13, January 2013. enumerator stack border+variance stack bits
  • 48. We can limit the computation to maximal repeats and minimal rare words We can compute the score of all minimal absent words in time
  • 49. Minimal absent words Total number: Time: Space: Total number: Time: Space: Minimal rare words for every proper substring V of W Minimal unique substrings aWb : W is a maximal repeat of T [8] Belazzougui, Cunial. A framework for space-efficient string kernels. CPM 2015. [9] Crochemore, Mignosi, Restivo. Automata and forbidden words. IPL, 1998. [10] Herold, Kurtz, Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 2008. [11] Ileri, Külekci, Xu. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries. TCS, 2015.
  • 50. STT W eW a a a eW W b b b c c d c d e Minimal absent words eW|deW|ceW|beW|a Right-extensions of W We
  • 51. STT W eW heW a a a a eW W heW b b b c c d c c d e Minimal absent words eW|deW|ceW|beW|a Right-extensions of W Right-extensions of eW Weh
  • 52. STT W eW heW a a a a eW W heW bord(heW) B=bord(heW) B=bord(heW) suf(bord(heW)) b b b c c d c c d e Minimal absent words eW|d B|d B|e eW|c B|c eW|b B|b eW|a B|a Right-extensions of W Right-extensions of suf(B) Right-extensions of eW Weh
  • 53. STT W eW heW a a a a eW W heW bord(heW) B=bord(heW) B=bord(heW) suf(bord(heW)) b b b c c d c c d e Minimal absent words eW|d B|d B|e eW|c B|c eW|b B|b eW|a B|a Right-extensions of W Right-extensions of suf(B) Right-extensions of eW Weh words
  • 54. Prototype implementation Genome of length 14.8 million
  • 56. Genome of length 14.8 million 33 seconds Verbumculus length ≤ 12 Verbumculus length ≤ 24 Verbumculus length ≤ 36 57 seconds, 6 GB 2 minutes, 14 GB 4 minutes, 14 GB Our prototype any length [12] Apostolico, Gong, Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 2004. Prototype implementation
  • 57. Space-efficient detection of unusual words. Djamal Belazzougui (1), Fabio Cunial (2). (1) Department of Computer Science, University of Helsinki, Finland. (2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
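
Two building blocks named in the slides can be illustrated on small inputs: the borders of a string, computed with the Morris-Pratt failure function (slide 14), and minimal absent words (slide 49). The Python sketch below is only an illustration and is not the authors' space-efficient method: it works on plain strings rather than on the Burrows-Wheeler transform, and the minimal absent word routine is a brute-force enumeration. The function names and the `max_len` parameter are assumptions introduced here. A word aWb is taken to be a minimal absent word of T when it does not occur in T while both aW and Wb do.

```python
def border_array(w: str) -> list[int]:
    """border[i] = length of the longest proper border of w[:i+1].

    A border is a string that is both a proper prefix and a suffix.
    This is the classical Morris-Pratt failure function, linear in len(w).
    """
    border = [0] * len(w)
    k = 0  # length of the border currently being extended
    for i in range(1, len(w)):
        # Fall back through shorter borders until one can be extended.
        while k > 0 and w[i] != w[k]:
            k = border[k - 1]
        if w[i] == w[k]:
            k += 1
        border[i] = k
    return border


def all_borders(w: str) -> list[int]:
    """Lengths of all proper borders of w, longest first.

    Uses the recursive structure of borders: every border of w is
    a border of the longest border of w.
    """
    if not w:
        return []
    border = border_array(w)
    lengths = []
    k = border[-1]
    while k > 0:
        lengths.append(k)
        k = border[k - 1]
    return lengths


def minimal_absent_words(t: str, max_len: int) -> set[str]:
    """Brute-force minimal absent words of t with length <= max_len.

    Only letters that occur in t are considered, so every reported
    word has length at least 2.
    """
    # All substrings of t with length at most max_len.
    substrings = {
        t[i:j]
        for i in range(len(t))
        for j in range(i + 1, min(len(t), i + max_len) + 1)
    }
    alphabet = set(t)
    result = set()
    for aw in substrings:          # aw plays the role of aW, which occurs in t
        if len(aw) >= max_len:
            continue
        for b in alphabet:
            x = aw + b             # candidate aWb
            # x is absent, and its suffix Wb occurs in t.
            if x not in substrings and x[1:] in substrings:
                result.add(x)
    return result
```

For example, `all_borders("abab")` returns `[2]` because "ab" is the only proper border of "abab", and `minimal_absent_words("abaab", 4)` returns the four words bb, bab, aaa and aaba.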