SlideShare a Scribd company logo
1 of 45
Download to read offline
Annotated sux trees
Algorithms
Implementation
LM Monitor
Text Analysis with
Enhanced Annotated Sux Trees
Algorithms and Implementation
Mikhail Dubov
1
National Research University Higher School of Economics
Computer Science faculty, Moscow, Russia
AIST'2015
1/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
2/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Annotated sux trees
Letter-based method for text analysis
Annotated sux trees: full-text index
Basic computation: relevance score of a keyphrase to the
text collection indexed by AST
Range of applications:
Text classication (e.g. spam ltering)
Feature extraction
Keyphrase analysis (stay tuned)
3/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Sux trees
Figure: Sux tree for string XABXAC
Sux tree for a string S (|S| = n) is a rooted directed tree
encoding all the suxes of that string [3]
The concatenation of edge labels on every path from the root node
to one of the leaves makes up one of the suxes of that string, i.e.
S[i . . . n].
It is also required that each internal node has two or more children,
and each edge is labeled with a non-empty substring of S.
4/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Sux trees
Various O(n) construction algorithms exist (Ukkonen, Weiner)
Establishes a linear-time solution for the exact pattern
matching problem
Sux tree is a full-text index
5/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Annotated sux trees
Figure: Annotated sux tree for string XABXAC
Extension: node labels
Node label f (v) indicates the number of entries of the
substring on the path from root to v in the text collection
6/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score
Figure: Naive AST representation (as a trie) for a collection of 3 strings
gondition—l pro˜—˜ility of — node given its parent:
ˆp(v) =
f (v)
f (parent(v))
7/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score
Relevance score computation for keyword S in text collection T
(described in terms of a trie, not a tree):
For each sux S[i . . . n] of S, try to match it against the sux
tree AST(T), starting at the root.
If, for sux s, we matched exactly k symbols in the tree, then
scoresuff (s) =
k
i=1
ˆp(vi )
k
,
where vi is the i-th node on the matching path starting at the
root (if k = 0, then scoresuff (s) = 0).
The nal score for keyword S is obtained as
SCORE(S) =
|S|
i=1
scoresuff (S[i :])
|S|
.
8/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: Example 1
T = [“XAB“, “XAC“, “CAB“]
SCORE(“ABC“) =
score(“ABC“) + score(“BC“) + score(“C“)
3
=
=
(0.33 + 0.67)/2 + (0.22)/1 + (0.22)/1
3
= 0.31
9/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: Example 2
T = [“XAB“, “XAC“, “CAB“]
SCORE(“XYZ“) =
score(“XYZ“) + score(“YZ“) + score(“Z“)
3
=
=
(0.22)/1 + 0 + 0
3
= 0.07
10/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: Example 3
SCORE(“Alice“) = 0.32
SCORE(“Bob“) = 0.04
@…su—llyD ƒgy‚i b HFP is — strong eviden™e of relev—n™eA
11/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: alternatives  summary
Alternative solution: count the number of occurrences of a
keyword in the text colleciton
Word-based approach
Requires at least normalization, NLP involved
Can also use the Levenstein distance for more sensitivity
Relevance score denition  interpretation is not obvious
AST Relevance score:
Letter-based, fuzzy approach
Language-independent, no NLP involved
Interpretation: average conditional probability of an
occurrence of a single symbol of the input key phrase in the
text collection
12/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
13/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux trees
In the original papers, AST was represented as a trie =⇒
O(n2
) time  space complexity.
Even when implemented properly with sux trees, the AST
construction time  space usage still has a large hidden
constant behind O(n).
We propose an enhanced implementation that uses sux
arrays.
14/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux tries to sux trees
Ensure the linearity of our data structure:
Sux trie: one node per letter, O(n2
) time  space
Sux tree: compacted edges, no chains, O(n) time  space
Figure: Sux trie
Figure: Sux tree
15/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux tries to sux trees
To construct —nnot—ted su0x trees in O(n), simple preprocessing is
needed:
Algorithm LinearASTConstruction(C)
snputF String collection C = {S1, . . . , Sm}
yutputF Generalized annotated sux tree for C.
1 Construct C = {S1$1, . . . , Sm$m}, where $i are unique
characters that do not appear in S1 . . . Sm.
2 Construct a generalized sux tree T for collection C using a
linear-time algorithm (e.g. the …kkonen —lgorithm).
3 for l in leaves(T)
4 do set f (l) ← 1
5 Run a postx depth-rst tree traversal on the sux tree T.
For each inner node v, set f (v) ← u∈children(v) f (u).
16/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux tries to sux trees
One minor change in the sux relevance score:
Figure: Sux trie
scoresuff (s) =
k
i=1
ˆp(vi )
k
Figure: Sux tree
If l is the number of symbols in
the match, then
scoresuff (s) =
k
i=1
ˆp(vi ) + l − k
l
17/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux trees to sux arrays
Sux array for a string S (|S| = n) is an array of n integer
numbers, enumerating the n suxes of S in lexicographic
order.
Table: Sux array for string XABXAC
(the suxes are not actually stored)
i sux array S[su[i]:]
0 2 ABXAC
1 5 AC
2 3 BXAC
3 6 C
4 1 XABXAC
5 4 XAC
Sux arrays are more space ecient than sux trees.
18/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced sux arrays
Abouelhoda, Kurtz,  Ohlebusch [1] have shown that it is
possible to systematically replace every algorithm that uses
sux trees with another one based on sux arrays.
Need to enhance the sux array with two auxiliary arrays:
lcp-table for bottom-up traversal
child-table for top-down traversal
Can be implemented to take no more than 10 bytes per input
symbol (at least 20 for sux trees)
19/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced sux arrays
Table: Enhanced sux array for string XABXAC
i sux array lcp-table child-table S[su[i]:]1. 2. 3.
0 1 0 1 2 ABXAC
1 4 1 AC
2 2 0 1 3 BXAC
3 5 0 4 C
4 0 0 XABXAC
5 3 2 XAC
20/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
We need to store annotations for sux tree nodes
The number of nodes in a sux tree cannot exceed (2n − 1)
After preprocessing, all the leaves will be annotated with 1, so
there is no need to store these annotations explicitly
We are left with at most (n − 1) numbers to store =⇒ can
introduce one more auxiliary array of length n
(annotation-table)
21/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Table: Enhanced annotated sux array for string XABXAC
i sux array lcp-table child-table annotation S[su[i]:]1. 2. 3.
0 1 0 1 2 6 ABXAC
1 4 1 2 AC
2 2 0 1 3 BXAC
3 5 0 4 C
4 0 0 XABXAC
5 3 2 2 XAC
22/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Figure: Annotated sux tree for string XABXAC
23/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Node-to-array mapping  via virtual lcp-trees:
Can be restored from the lcp-table
Nodes correspond to the inner nodes of the sux tree
Nodes are represented as l, i, j : the l™pEv—lue l and the left
and right boundaries of the l™pEinterv—l (i, j)
For each l™pEinterv—l v = l, i, j there exists a unique index,
index(v) ∈ [0; n − 1], which is equal to the smallest k, such
that k  i and lcp[k] = l. It is this mapping that we use to
store the inner node frequency annotations.
24/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Algorithm LinearEASAConstruction(C)
snputF String collection C = {S1, . . . , Sm}
yutputF Enhanced sux array for C with substring frequency
annotations.
1 Construct a string S = S1$1 + · · · + Sm$m, where $i are
unique termination symbols.
2 Construct a sux array A for string S using a linear-time
algorithm (e.g. the u¤—rkk¤—inenEƒ—nders —lgorithm) and two
auxiliary arrays: l™pE—rr—y and ™hildE—rr—y.
3 Simulate a postx depth-rst tree traversal on the sux array
A. At each of the virtu—l inner nodes, corresponding to an
l™pEinterv—l v = l, i, j , where i  j, set
annotation[index(v)] =
u∈children(v) annotation[index(u)] + #( l, i, j : i = j).
25/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays: Experimental results
Figure: Experimental results
Implementation: Python 2.7
10x less memory  due to sux arrays + the xumpy library
26/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
27/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Package EAST
EAST = inh—n™ed ennot—ted ƒu0x „rees
Open-source:
https://github.com/msdubov/AST-text-analysis
Registered in Python Package Index and is easy to install:
$ pip install EAST
28/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Package EAST
Provides command-line user interface:
$ east keyphrases table keyphrases_list.txt
path/to/the/text/collection/
Can be used as a Python library:
 from east.asts.base import AST
 ast = AST.get_ast([''XAB'', ''XAC'', ''CAB''])
 ast.score(''ABC'')
0.3148148148148149
29/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Synonym extraction
EAST implements one laguage-dependent feature: synonym
extraction
Motivation: Relevance scores should be similar, say, for pl—nt
t—xonomy and pl—nt ™l—ssi(™—tion, even if the latter can be
rarely found in the text collection.
Algorithm: distributional synonym extraction algorithm based
on that by Lin [4], which employs the so-called dependency
triples (w1, r, w2) (idea: simil—r texts —ppe—r in simil—r
™ontexts)
Domain-specic synonyms are likely to be found with this
context-based approach
30/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Synonym extraction
Dependency triples extraction is done by Yandex Tomita
parser (based on grammatical templates like —dje™tive C
su˜st—ntive or ver˜ C —rver˜)
Grammar:
S - adj_mod_of interp (Relation.adj_mod_of::...) |
adv_of interp (Relation.adv_of::norm=inf) |
adv interp (Relation.adv::norm=inf) |
...
adj_mod_of - Adjgnc-agr[1] Noungnc-agr[1];
adv_of - Adv Verb;
adv - Verb Adv;
...
31/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Synonym extraction
Synonyms extracted from a text collection from the Izvestia
newspaper:
head (ãëàâà) ⇔ CEO (ãåíäèðåêòîð)
high (âûñîêèé) ⇔ low (íèçêèé)
. . .
Low precision is not very critical: among synonimous key
phrases we chose the one that has maxw∈syn(S) SCORE(w)
To extract synonyms before computing relevance scores:
$ east -s keyphrases table keyphrases_list.txt
path/to/the/text/collection/
32/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
33/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Concept
LM Monitor = v—tent we—ning wonitor [2]
1 Web crawling
RuNeWC: Russian Newspaper Web Corpus
5 sources available now
2 Keyphrase analysis
Keyphrases are provided by the user
Using the AST relevance scores for these keyphrases, a
keyphrase reference graph is built
Text visualization tool
34/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Development
1 Research group Text analysis and visualization methods
2 Head: Boris Mirkin (Sc.D, prof.)
3 Sta: Bachelor/Master/PhD students
35/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Architecture
36/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: System
37/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
Keyphrase reference graphs
Keyphrase reference graphs:
Model directed relations between keyphrases
Nodes are keyphrases
For a keyphrase A,
r ∈ [0; 1] is a relevance threshold : if SCOREAST(T)(A)  r,
then A is considered to be relevant to text T (usually r = 0.2)
F(A) = {T : SCOREAST(T)(A)  r}
c ∈ [0; 1] is a ™on(den™e threshold
For keyphrases A and B, if
|F(B)∩F(A)|
|F(A)| ≥ c, then there is an
edge in the graph from keyphrase A to keyphrase B (usually
c = 0.6)
A → B is like an —sso™i—tive rule
38/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
39/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
40/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
Figure: Keyphrase reference graph built for Oct/Nov 2014
41/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
Figure: Keyphrase reference graph built USA/France constitutions
42/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
Future work
Automated graph analysis (central nodes visualization etc.)
Temporal graph analysis (how do graphs change over time?)
Better support for synonyms
43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
e˜ouelhod—D wF sF Replacing Sux Trees with Enhanced Sux
Arrays / M. I. Abouelhoda, S. Kurtz, E. Ohlebusch // Journal
of Discrete Algorithms, Amsterdam: Elsevier.  2004.   2. 
pp. 53-86.
hu˜ovD wF Automatic Russian Text Processing System /
M. Dubov, B. Mirkin, A. Shal // Open Systems. DBMS 
2014.  v. 22  10  pp. 15-17
qus(eldD hF Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology / D. Guseld. 
Cambridge University Press, 1997.
vinD hF Automatic Retrieval and Clustering of Similar Words. /
D. Lin // Proceedings of the 17th International Conference on
Computational Linguistics.  1998.  pp. 768-774.
43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
wirkinD fF Method of annotated sux tree for scoring the
extent of presence of a string in text / B. Mirkin, E. Chernyak,
O. Chugunova // Business-Informatics  2012.   3(21). 
pp. 31-41.
43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees

More Related Content

What's hot

Introduction to matplotlib
Introduction to matplotlibIntroduction to matplotlib
Introduction to matplotlibPiyush rai
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)PyData
 
Implementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLABImplementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLABLeo Asselborn
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
 
Stack data structure in Data Structure using C
Stack data structure in Data Structure using C Stack data structure in Data Structure using C
Stack data structure in Data Structure using C Meghaj Mallick
 
Graphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their RepresentationsGraphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their RepresentationsAmrinder Arora
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithmsguest084d20
 
La tex basics
La tex basicsLa tex basics
La tex basicsawverret
 
Advanced latex
Advanced latexAdvanced latex
Advanced latexawverret
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms宇 傅
 
Java 8 Concurrency Updates
Java 8 Concurrency UpdatesJava 8 Concurrency Updates
Java 8 Concurrency UpdatesDamian Łukasik
 

What's hot (17)

Introduction to matplotlib
Introduction to matplotlibIntroduction to matplotlib
Introduction to matplotlib
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
 
Go Course Day2
Go Course Day2Go Course Day2
Go Course Day2
 
Lec18
Lec18Lec18
Lec18
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)
 
Implementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLABImplementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLAB
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPy
 
Lecture12
Lecture12Lecture12
Lecture12
 
Numpy
NumpyNumpy
Numpy
 
Stack data structure in Data Structure using C
Stack data structure in Data Structure using C Stack data structure in Data Structure using C
Stack data structure in Data Structure using C
 
Graphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their RepresentationsGraphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their Representations
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithms
 
La tex basics
La tex basicsLa tex basics
La tex basics
 
Advanced latex
Advanced latexAdvanced latex
Advanced latex
 
Numpy
NumpyNumpy
Numpy
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms
 
Java 8 Concurrency Updates
Java 8 Concurrency UpdatesJava 8 Concurrency Updates
Java 8 Concurrency Updates
 

Similar to Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsunyil96
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefixSanjeev Gupta
 
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingCompiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingEelco Visser
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticIJERA Editor
 
メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介Takashi Hoshino
 
03-data-structures.pdf
03-data-structures.pdf03-data-structures.pdf
03-data-structures.pdfNash229987
 
Signals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 BatchSignals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 BatchAmairullah Khan Lodhi
 
Asymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAsymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAmrinder Arora
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOpsPooyan Jamshidi
 
DSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxDSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxHamedNassar5
 
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingCS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingEelco Visser
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsPhilip Schwarz
 
Dynamic programming - fundamentals review
Dynamic programming - fundamentals reviewDynamic programming - fundamentals review
Dynamic programming - fundamentals reviewElifTech
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
Review session2
Review session2Review session2
Review session2NEEDY12345
 

Similar to Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation (20)

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithms
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefix
 
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingCompiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
 
MLMM_16_08_2022.pdf
MLMM_16_08_2022.pdfMLMM_16_08_2022.pdf
MLMM_16_08_2022.pdf
 
メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介
 
03-data-structures.pdf
03-data-structures.pdf03-data-structures.pdf
03-data-structures.pdf
 
Lecture 2 f17
Lecture 2 f17Lecture 2 f17
Lecture 2 f17
 
Signals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 BatchSignals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 Batch
 
Asymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAsymptotic Notation and Data Structures
Asymptotic Notation and Data Structures
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOps
 
DSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxDSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptx
 
algorithm Unit 5
algorithm Unit 5 algorithm Unit 5
algorithm Unit 5
 
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingCS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with Views
 
Dynamic programming - fundamentals review
Dynamic programming - fundamentals reviewDynamic programming - fundamentals review
Dynamic programming - fundamentals review
 
Algorithms Question bank
Algorithms Question bankAlgorithms Question bank
Algorithms Question bank
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
Scala Introduction
Scala IntroductionScala Introduction
Scala Introduction
 
Review session2
Review session2Review session2
Review session2
 

More from AIST

Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  ImagesAlexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray ImagesAIST
 
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныАлена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныAIST
 
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...AIST
 
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискПавел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискAIST
 
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...AIST
 
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...AIST
 
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...AIST
 
Иосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAИосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAAIST
 
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge ExchangeNikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge ExchangeAIST
 
George Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesGeorge Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesAIST
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationAIST
 
Marina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsMarina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsAIST
 
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceEdward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceAIST
 
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...AIST
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...AIST
 
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...AIST
 
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumValeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumAIST
 
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...AIST
 
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...AIST
 
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingArtyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingAIST
 

More from AIST (20)

Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  ImagesAlexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray  Images
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
 
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоныАлена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
 
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
 
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поискПавел Браславский,Velpas - Velpas: мобильный визуальный поиск
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
 
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
 
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
 
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
 
Иосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBAИосиф Иткин, Exactpro - TBA
Иосиф Иткин, Exactpro - TBA
 
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge ExchangeNikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
 
George Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product CategoriesGeorge Moiseev - Classification of E-commerce Websites by Product Categories
George Moiseev - Classification of E-commerce Websites by Product Categories
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
 
Marina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chantsMarina Danshina - The methodology of automated decryption of znamenny chants
Marina Danshina - The methodology of automated decryption of znamenny chants
 
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First GlanceEdward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
 
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
 
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
 
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamediumValeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
 
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
 
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
 
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation DenoisingArtyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
 

Recently uploaded

CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxNikitaBankoti2
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Chameera Dedduwage
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...Sheetaleventcompany
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )Pooja Nehwal
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AITatiana Gurgel
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMoumonDas2
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Vipesco
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Hasting Chen
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardsticksaastr
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxmohammadalnahdi22
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesPooja Nehwal
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyPooja Nehwal
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 

Recently uploaded (20)

CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AI
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptx
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 

Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

  • 1. Annotated sux trees Algorithms Implementation LM Monitor Text Analysis with Enhanced Annotated Sux Trees Algorithms and Implementation Mikhail Dubov 1 National Research University Higher School of Economics Computer Science faculty, Moscow, Russia AIST'2015 1/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 2. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 2/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 3. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Annotated sux trees Letter-based method for text analysis Annotated sux trees: full-text index Basic computation: relevance score of a keyphrase to the text collection indexed by AST Range of applications: Text classication (e.g. spam ltering) Feature extraction Keyphrase analysis (stay tuned) 3/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 4. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Sux trees Figure: Sux tree for string XABXAC Sux tree for a string S (|S| = n) is a rooted directed tree encoding all the suxes of that string [3] The concatenation of edge labels on every path from the root node to one of the leaves makes up one of the suxes of that string, i.e. S[i . . . n]. It is also required that each internal node has two or more children, and each edge is labeled with a non-empty substring of S. 4/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 5. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Sux trees Various O(n) construction algorithms exist (Ukkonen, Weiner) Establishes a linear-time solution for the exact pattern matching problem Sux tree is a full-text index 5/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 6. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Annotated sux trees Figure: Annotated sux tree for string XABXAC Extension: node labels Node label f (v) indicates the number of entries of the substring on the path from root to v in the text collection 6/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 7. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score Figure: Naive AST representation (as a trie) for a collection of 3 strings gondition—l pro˜—˜ility of — node given its parent: ˆp(v) = f (v) f (parent(v)) 7/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 8. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score Relevance score computation for keyword S in text collection T (described in terms of a trie, not a tree): For each sux S[i . . . n] of S, try to match it against the sux tree AST(T), starting at the root. If, for sux s, we matched exactly k symbols in the tree, then scoresuff (s) = k i=1 ˆp(vi ) k , where vi is the i-th node on the matching path starting at the root (if k = 0, then scoresuff (s) = 0). The nal score for keyword S is obtained as SCORE(S) = |S| i=1 scoresuff (S[i :]) |S| . 8/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 9. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: Example 1 T = [“XAB“, “XAC“, “CAB“] SCORE(“ABC“) = score(“ABC“) + score(“BC“) + score(“C“) 3 = = (0.33 + 0.67)/2 + (0.22)/1 + (0.22)/1 3 = 0.31 9/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 10. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: Example 2 T = [“XAB“, “XAC“, “CAB“] SCORE(“XYZ“) = score(“XYZ“) + score(“YZ“) + score(“Z“) 3 = = (0.22)/1 + 0 + 0 3 = 0.07 10/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 11. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: Example 3 SCORE(“Alice“) = 0.32 SCORE(“Bob“) = 0.04 @…su—llyD ƒgy‚i b HFP is — strong eviden™e of relev—n™eA 11/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 12. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: alternatives summary Alternative solution: count the number of occurrences of a keyword in the text colleciton Word-based approach Requires at least normalization, NLP involved Can also use the Levenstein distance for more sensitivity Relevance score denition interpretation is not obvious AST Relevance score: Letter-based, fuzzy approach Language-independent, no NLP involved Interpretation: average conditional probability of an occurrence of a single symbol of the input key phrase in the text collection 12/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 13. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 13/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 14. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux trees In the original papers, AST was represented as a trie =⇒ O(n2 ) time space complexity. Even when implemented properly with sux trees, the AST construction time space usage still has a large hidden constant behind O(n). We propose an enhanced implementation that uses sux arrays. 14/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 15. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux tries to sux trees Ensure the linearity of our data structure: Sux trie: one node per letter, O(n2 ) time space Sux tree: compacted edges, no chains, O(n) time space Figure: Sux trie Figure: Sux tree 15/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 16. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux tries to sux trees To construct —nnot—ted su0x trees in O(n), simple preprocessing is needed: Algorithm LinearASTConstruction(C) snputF String collection C = {S1, . . . , Sm} yutputF Generalized annotated sux tree for C. 1 Construct C = {S1$1, . . . , Sm$m}, where $i are unique characters that do not appear in S1 . . . Sm. 2 Construct a generalized sux tree T for collection C using a linear-time algorithm (e.g. the …kkonen —lgorithm). 3 for l in leaves(T) 4 do set f (l) ← 1 5 Run a postx depth-rst tree traversal on the sux tree T. For each inner node v, set f (v) ← u∈children(v) f (u). 16/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 17. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux tries to sux trees One minor change in the sux relevance score: Figure: Sux trie scoresuff (s) = k i=1 ˆp(vi ) k Figure: Sux tree If l is the number of symbols in the match, then scoresuff (s) = k i=1 ˆp(vi ) + l − k l 17/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 18. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux trees to sux arrays Sux array for a string S (|S| = n) is an array of n integer numbers, enumerating the n suxes of S in lexicographic order. Table: Sux array for string XABXAC (the suxes are not actually stored) i sux array S[su[i]:] 0 2 ABXAC 1 5 AC 2 3 BXAC 3 6 C 4 1 XABXAC 5 4 XAC Sux arrays are more space ecient than sux trees. 18/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 19. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced sux arrays Abouelhoda, Kurtz, Ohlebusch [1] have shown that it is possible to systematically replace every algorithm that uses sux trees with another one based on sux arrays. Need to enhance the sux array with two auxiliary arrays: lcp-table for bottom-up traversal child-table for top-down traversal Can be implemented to take no more than 10 bytes per input symbol (at least 20 for sux trees) 19/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 20. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced sux arrays Table: Enhanced sux array for string XABXAC i sux array lcp-table child-table S[su[i]:]1. 2. 3. 0 1 0 1 2 ABXAC 1 4 1 AC 2 2 0 1 3 BXAC 3 5 0 4 C 4 0 0 XABXAC 5 3 2 XAC 20/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 21. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays We need to store annotations for sux tree nodes The number of nodes in a sux tree cannot exceed (2n − 1) After preprocessing, all the leaves will be annotated with 1, so there is no need to store these annotations explicitly We are left with at most (n − 1) numbers to store =⇒ can introduce one more auxiliary array of length n (annotation-table) 21/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 22. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Table: Enhanced annotated sux array for string XABXAC i sux array lcp-table child-table annotation S[su[i]:]1. 2. 3. 0 1 0 1 2 6 ABXAC 1 4 1 2 AC 2 2 0 1 3 BXAC 3 5 0 4 C 4 0 0 XABXAC 5 3 2 2 XAC 22/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 23. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Figure: Annotated sux tree for string XABXAC 23/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 24. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Node-to-array mapping via virtual lcp-trees: Can be restored from the lcp-table Nodes correspond to the inner nodes of the sux tree Nodes are represented as l, i, j : the l™pEv—lue l and the left and right boundaries of the l™pEinterv—l (i, j) For each l™pEinterv—l v = l, i, j there exists a unique index, index(v) ∈ [0; n − 1], which is equal to the smallest k, such that k i and lcp[k] = l. It is this mapping that we use to store the inner node frequency annotations. 24/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 25. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Algorithm LinearEASAConstruction(C) snputF String collection C = {S1, . . . , Sm} yutputF Enhanced sux array for C with substring frequency annotations. 1 Construct a string S = S1$1 + · · · + Sm$m, where $i are unique termination symbols. 2 Construct a sux array A for string S using a linear-time algorithm (e.g. the u¤—rkk¤—inenEƒ—nders —lgorithm) and two auxiliary arrays: l™pE—rr—y and ™hildE—rr—y. 3 Simulate a postx depth-rst tree traversal on the sux array A. At each of the virtu—l inner nodes, corresponding to an l™pEinterv—l v = l, i, j , where i j, set annotation[index(v)] = u∈children(v) annotation[index(u)] + #( l, i, j : i = j). 25/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 26. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays: Experimental results Figure: Experimental results Implementation: Python 2.7 10x less memory due to sux arrays + the xumpy library 26/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 27. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 27/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 28. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Package EAST EAST = inh—n™ed ennot—ted ƒu0x „rees Open-source: https://github.com/msdubov/AST-text-analysis Registered in Python Package Index and is easy to install: $ pip install EAST 28/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 29. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Package EAST Provides command-line user interface: $ east keyphrases table keyphrases_list.txt path/to/the/text/collection/ Can be used as a Python library: from east.asts.base import AST ast = AST.get_ast([''XAB'', ''XAC'', ''CAB'']) ast.score(''ABC'') 0.3148148148148149 29/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 30. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Synonym extraction EAST implements one laguage-dependent feature: synonym extraction Motivation: Relevance scores should be similar, say, for pl—nt t—xonomy and pl—nt ™l—ssi(™—tion, even if the latter can be rarely found in the text collection. Algorithm: distributional synonym extraction algorithm based on that by Lin [4], which employs the so-called dependency triples (w1, r, w2) (idea: simil—r texts —ppe—r in simil—r ™ontexts) Domain-specic synonyms are likely to be found with this context-based approach 30/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 31. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Synonym extraction Dependency triples extraction is done by Yandex Tomita parser (based on grammatical templates like —dje™tive C su˜st—ntive or ver˜ C —rver˜) Grammar: S - adj_mod_of interp (Relation.adj_mod_of::...) | adv_of interp (Relation.adv_of::norm=inf) | adv interp (Relation.adv::norm=inf) | ... adj_mod_of - Adjgnc-agr[1] Noungnc-agr[1]; adv_of - Adv Verb; adv - Verb Adv; ... 31/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 32. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Synonym extraction Synonyms extracted from a text collection from the Izvestia newspaper: head (ãëàâà) ⇔ CEO (ãåíäèðåêòîð) high (âûñîêèé) ⇔ low (íèçêèé) . . . Low precision is not very critical: among synonimous key phrases we chose the one that has maxw∈syn(S) SCORE(w) To extract synonyms before computing relevance scores: $ east -s keyphrases table keyphrases_list.txt path/to/the/text/collection/ 32/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 33. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 33/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 34. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Concept LM Monitor = v—tent we—ning wonitor [2] 1 Web crawling RuNeWC: Russian Newspaper Web Corpus 5 sources available now 2 Keyphrase analysis Keyphrases are provided by the user Using the AST relevance scores for these keyphrases, a keyphrase reference graph is built Text visualization tool 34/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 35. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Development 1 Research group Text analysis and visualization methods 2 Head: Boris Mirkin (Sc.D, prof.) 3 Sta: Bachelor/Master/PhD students 35/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 36. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Architecture 36/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 37. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: System 37/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 38. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs Keyphrase reference graphs Keyphrase reference graphs: Model directed relations between keyphrases Nodes are keyphrases For a keyphrase A, r ∈ [0; 1] is a relevance threshold : if SCOREAST(T)(A) r, then A is considered to be relevant to text T (usually r = 0.2) F(A) = {T : SCOREAST(T)(A) r} c ∈ [0; 1] is a ™on(den™e threshold For keyphrases A and B, if |F(B)∩F(A)| |F(A)| ≥ c, then there is an edge in the graph from keyphrase A to keyphrase B (usually c = 0.6) A → B is like an —sso™i—tive rule 38/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 39. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs 39/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 40. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs 40/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 41. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs Figure: Keyphrase reference graph built for Oct/Nov 2014 41/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 42. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs Figure: Keyphrase reference graph built USA/France constitutions 42/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 43. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs Future work Automated graph analysis (central nodes visualization etc.) Temporal graph analysis (how do graphs change over time?) Better support for synonyms 43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 44. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs e˜ouelhod—D wF sF Replacing Sux Trees with Enhanced Sux Arrays / M. I. Abouelhoda, S. Kurtz, E. Ohlebusch // Journal of Discrete Algorithms, Amsterdam: Elsevier. 2004.  2. pp. 53-86. hu˜ovD wF Automatic Russian Text Processing System / M. Dubov, B. Mirkin, A. Shal // Open Systems. DBMS 2014. v. 22  10 pp. 15-17 qus(eldD hF Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology / D. Guseld. Cambridge University Press, 1997. vinD hF Automatic Retrieval and Clustering of Similar Words. / D. Lin // Proceedings of the 17th International Conference on Computational Linguistics. 1998. pp. 768-774. 43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 45. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs wirkinD fF Method of annotated sux tree for scoring the extent of presence of a string in text / B. Mirkin, E. Chernyak, O. Chugunova // Business-Informatics 2012.  3(21). pp. 31-41. 43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees