SlideShare a Scribd company logo
1 of 45
Download to read offline
Annotated sux trees
Algorithms
Implementation
LM Monitor
Text Analysis with
Enhanced Annotated Sux Trees
Algorithms and Implementation
Mikhail Dubov
1
National Research University Higher School of Economics
Computer Science faculty, Moscow, Russia
AIST'2015
1/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
2/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Annotated sux trees
Letter-based method for text analysis
Annotated sux trees: full-text index
Basic computation: relevance score of a keyphrase to the
text collection indexed by AST
Range of applications:
Text classication (e.g. spam ltering)
Feature extraction
Keyphrase analysis (stay tuned)
3/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Sux trees
Figure: Sux tree for string XABXAC
Sux tree for a string S (|S| = n) is a rooted directed tree
encoding all the suxes of that string [3]
The concatenation of edge labels on every path from the root node
to one of the leaves makes up one of the suxes of that string, i.e.
S[i . . . n].
It is also required that each internal node has two or more children,
and each edge is labeled with a non-empty substring of S.
4/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Sux trees
Various O(n) construction algorithms exist (Ukkonen, Weiner)
Establishes a linear-time solution for the exact pattern
matching problem
Sux tree is a full-text index
5/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
Annotated sux trees
Figure: Annotated sux tree for string XABXAC
Extension: node labels
Node label f (v) indicates the number of entries of the
substring on the path from root to v in the text collection
6/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score
Figure: Naive AST representation (as a trie) for a collection of 3 strings
gondition—l pro˜—˜ility of — node given its parent:
ˆp(v) =
f (v)
f (parent(v))
7/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score
Relevance score computation for keyword S in text collection T
(described in terms of a trie, not a tree):
For each sux S[i . . . n] of S, try to match it against the sux
tree AST(T), starting at the root.
If, for sux s, we matched exactly k symbols in the tree, then
scoresuff (s) =
k
i=1
ˆp(vi )
k
,
where vi is the i-th node on the matching path starting at the
root (if k = 0, then scoresuff (s) = 0).
The nal score for keyword S is obtained as
SCORE(S) =
|S|
i=1
scoresuff (S[i :])
|S|
.
8/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: Example 1
T = [“XAB“, “XAC“, “CAB“]
SCORE(“ABC“) =
score(“ABC“) + score(“BC“) + score(“C“)
3
=
=
(0.33 + 0.67)/2 + (0.22)/1 + (0.22)/1
3
= 0.31
9/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: Example 2
T = [“XAB“, “XAC“, “CAB“]
SCORE(“XYZ“) =
score(“XYZ“) + score(“YZ“) + score(“Z“)
3
=
=
(0.22)/1 + 0 + 0
3
= 0.07
10/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: Example 3
SCORE(“Alice“) = 0.32
SCORE(“Bob“) = 0.04
@…su—llyD ƒgy‚i b HFP is — strong eviden™e of relev—n™eA
11/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Sux trees
Annotated sux trees
AST relevance score
AST relevance score: alternatives  summary
Alternative solution: count the number of occurrences of a
keyword in the text colleciton
Word-based approach
Requires at least normalization, NLP involved
Can also use the Levenstein distance for more sensitivity
Relevance score denition  interpretation is not obvious
AST Relevance score:
Letter-based, fuzzy approach
Language-independent, no NLP involved
Interpretation: average conditional probability of an
occurrence of a single symbol of the input key phrase in the
text collection
12/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
13/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux trees
In the original papers, AST was represented as a trie =⇒
O(n2
) time  space complexity.
Even when implemented properly with sux trees, the AST
construction time  space usage still has a large hidden
constant behind O(n).
We propose an enhanced implementation that uses sux
arrays.
14/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux tries to sux trees
Ensure the linearity of our data structure:
Sux trie: one node per letter, O(n2
) time  space
Sux tree: compacted edges, no chains, O(n) time  space
Figure: Sux trie
Figure: Sux tree
15/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux tries to sux trees
To construct —nnot—ted su0x trees in O(n), simple preprocessing is
needed:
Algorithm LinearASTConstruction(C)
snputF String collection C = {S1, . . . , Sm}
yutputF Generalized annotated sux tree for C.
1 Construct C = {S1$1, . . . , Sm$m}, where $i are unique
characters that do not appear in S1 . . . Sm.
2 Construct a generalized sux tree T for collection C using a
linear-time algorithm (e.g. the …kkonen —lgorithm).
3 for l in leaves(T)
4 do set f (l) ← 1
5 Run a postx depth-rst tree traversal on the sux tree T.
For each inner node v, set f (v) ← u∈children(v) f (u).
16/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux tries to sux trees
One minor change in the sux relevance score:
Figure: Sux trie
scoresuff (s) =
k
i=1
ˆp(vi )
k
Figure: Sux tree
If l is the number of symbols in
the match, then
scoresuff (s) =
k
i=1
ˆp(vi ) + l − k
l
17/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
From sux trees to sux arrays
Sux array for a string S (|S| = n) is an array of n integer
numbers, enumerating the n suxes of S in lexicographic
order.
Table: Sux array for string XABXAC
(the suxes are not actually stored)
i sux array S[su[i]:]
0 2 ABXAC
1 5 AC
2 3 BXAC
3 6 C
4 1 XABXAC
5 4 XAC
Sux arrays are more space ecient than sux trees.
18/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced sux arrays
Abouelhoda, Kurtz,  Ohlebusch [1] have shown that it is
possible to systematically replace every algorithm that uses
sux trees with another one based on sux arrays.
Need to enhance the sux array with two auxiliary arrays:
lcp-table for bottom-up traversal
child-table for top-down traversal
Can be implemented to take no more than 10 bytes per input
symbol (at least 20 for sux trees)
19/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced sux arrays
Table: Enhanced sux array for string XABXAC
i sux array lcp-table child-table S[su[i]:]1. 2. 3.
0 1 0 1 2 ABXAC
1 4 1 AC
2 2 0 1 3 BXAC
3 5 0 4 C
4 0 0 XABXAC
5 3 2 XAC
20/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
We need to store annotations for sux tree nodes
The number of nodes in a sux tree cannot exceed (2n − 1)
After preprocessing, all the leaves will be annotated with 1, so
there is no need to store these annotations explicitly
We are left with at most (n − 1) numbers to store =⇒ can
introduce one more auxiliary array of length n
(annotation-table)
21/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Table: Enhanced annotated sux array for string XABXAC
i sux array lcp-table child-table annotation S[su[i]:]1. 2. 3.
0 1 0 1 2 6 ABXAC
1 4 1 2 AC
2 2 0 1 3 BXAC
3 5 0 4 C
4 0 0 XABXAC
5 3 2 2 XAC
22/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Figure: Annotated sux tree for string XABXAC
23/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Node-to-array mapping  via virtual lcp-trees:
Can be restored from the lcp-table
Nodes correspond to the inner nodes of the sux tree
Nodes are represented as l, i, j : the l™pEv—lue l and the left
and right boundaries of the l™pEinterv—l (i, j)
For each l™pEinterv—l v = l, i, j there exists a unique index,
index(v) ∈ [0; n − 1], which is equal to the smallest k, such
that k  i and lcp[k] = l. It is this mapping that we use to
store the inner node frequency annotations.
24/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays
Algorithm LinearEASAConstruction(C)
snputF String collection C = {S1, . . . , Sm}
yutputF Enhanced sux array for C with substring frequency
annotations.
1 Construct a string S = S1$1 + · · · + Sm$m, where $i are
unique termination symbols.
2 Construct a sux array A for string S using a linear-time
algorithm (e.g. the u¤—rkk¤—inenEƒ—nders —lgorithm) and two
auxiliary arrays: l™pE—rr—y and ™hildE—rr—y.
3 Simulate a postx depth-rst tree traversal on the sux array
A. At each of the virtu—l inner nodes, corresponding to an
l™pEinterv—l v = l, i, j , where i  j, set
annotation[index(v)] =
u∈children(v) annotation[index(u)] + #( l, i, j : i = j).
25/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
From sux tries to sux trees
From sux trees to sux arrays
Enhanced annotated sux arrays: Experimental results
Figure: Experimental results
Implementation: Python 2.7
10x less memory  due to sux arrays + the xumpy library
26/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
27/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Package EAST
EAST = inh—n™ed ennot—ted ƒu0x „rees
Open-source:
https://github.com/msdubov/AST-text-analysis
Registered in Python Package Index and is easy to install:
$ pip install EAST
28/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Package EAST
Provides command-line user interface:
$ east keyphrases table keyphrases_list.txt
path/to/the/text/collection/
Can be used as a Python library:
 from east.asts.base import AST
 ast = AST.get_ast([''XAB'', ''XAC'', ''CAB''])
 ast.score(''ABC'')
0.3148148148148149
29/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Synonym extraction
EAST implements one laguage-dependent feature: synonym
extraction
Motivation: Relevance scores should be similar, say, for pl—nt
t—xonomy and pl—nt ™l—ssi(™—tion, even if the latter can be
rarely found in the text collection.
Algorithm: distributional synonym extraction algorithm based
on that by Lin [4], which employs the so-called dependency
triples (w1, r, w2) (idea: simil—r texts —ppe—r in simil—r
™ontexts)
Domain-specic synonyms are likely to be found with this
context-based approach
30/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Synonym extraction
Dependency triples extraction is done by Yandex Tomita
parser (based on grammatical templates like —dje™tive C
su˜st—ntive or ver˜ C —rver˜)
Grammar:
S - adj_mod_of interp (Relation.adj_mod_of::...) |
adv_of interp (Relation.adv_of::norm=inf) |
adv interp (Relation.adv::norm=inf) |
...
adj_mod_of - Adjgnc-agr[1] Noungnc-agr[1];
adv_of - Adv Verb;
adv - Verb Adv;
...
31/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Package EAST
Synonym extraction
Synonym extraction
Synonyms extracted from a text collection from the Izvestia
newspaper:
head (ãëàâà) ⇔ CEO (ãåíäèðåêòîð)
high (âûñîêèé) ⇔ low (íèçêèé)
. . .
Low precision is not very critical: among synonimous key
phrases we chose the one that has maxw∈syn(S) SCORE(w)
To extract synonyms before computing relevance scores:
$ east -s keyphrases table keyphrases_list.txt
path/to/the/text/collection/
32/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
Table of Contents
1 Annotated sux trees
Sux trees
Annotated sux trees
AST relevance score
2 Algorithms
From sux tries to sux trees
From sux trees to sux arrays
3 Implementation
Package EAST
Synonym extraction
4 LM Monitor
Concept  architecture
Keyphrase reference graphs
33/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Concept
LM Monitor = v—tent we—ning wonitor [2]
1 Web crawling
RuNeWC: Russian Newspaper Web Corpus
5 sources available now
2 Keyphrase analysis
Keyphrases are provided by the user
Using the AST relevance scores for these keyphrases, a
keyphrase reference graph is built
Text visualization tool
34/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Development
1 Research group Text analysis and visualization methods
2 Head: Boris Mirkin (Sc.D, prof.)
3 Sta: Bachelor/Master/PhD students
35/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Architecture
36/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: System
37/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
Keyphrase reference graphs
Keyphrase reference graphs:
Model directed relations between keyphrases
Nodes are keyphrases
For a keyphrase A,
r ∈ [0; 1] is a relevance threshold : if SCOREAST(T)(A)  r,
then A is considered to be relevant to text T (usually r = 0.2)
F(A) = {T : SCOREAST(T)(A)  r}
c ∈ [0; 1] is a ™on(den™e threshold
For keyphrases A and B, if
|F(B)∩F(A)|
|F(A)| ≥ c, then there is an
edge in the graph from keyphrase A to keyphrase B (usually
c = 0.6)
A → B is like an —sso™i—tive rule
38/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
39/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
40/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
Figure: Keyphrase reference graph built for Oct/Nov 2014
41/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
LM Monitor: Keyphrase reference graphs
Figure: Keyphrase reference graph built USA/France constitutions
42/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
Future work
Automated graph analysis (central nodes visualization etc.)
Temporal graph analysis (how do graphs change over time?)
Better support for synonyms
43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
e˜ouelhod—D wF sF Replacing Sux Trees with Enhanced Sux
Arrays / M. I. Abouelhoda, S. Kurtz, E. Ohlebusch // Journal
of Discrete Algorithms, Amsterdam: Elsevier.  2004.   2. 
pp. 53-86.
hu˜ovD wF Automatic Russian Text Processing System /
M. Dubov, B. Mirkin, A. Shal // Open Systems. DBMS 
2014.  v. 22  10  pp. 15-17
qus(eldD hF Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology / D. Guseld. 
Cambridge University Press, 1997.
vinD hF Automatic Retrieval and Clustering of Similar Words. /
D. Lin // Proceedings of the 17th International Conference on
Computational Linguistics.  1998.  pp. 768-774.
43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
Annotated sux trees
Algorithms
Implementation
LM Monitor
Concept  architecture
Keyphrase reference graphs
wirkinD fF Method of annotated sux tree for scoring the
extent of presence of a string in text / B. Mirkin, E. Chernyak,
O. Chugunova // Business-Informatics  2012.   3(21). 
pp. 31-41.
43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees

More Related Content

What's hot

Introduction to matplotlib
Introduction to matplotlibIntroduction to matplotlib
Introduction to matplotlibPiyush rai
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)PyData
 
Implementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLABImplementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLABLeo Asselborn
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
 
Stack data structure in Data Structure using C
Stack data structure in Data Structure using C Stack data structure in Data Structure using C
Stack data structure in Data Structure using C Meghaj Mallick
 
Graphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their RepresentationsGraphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their RepresentationsAmrinder Arora
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithmsguest084d20
 
La tex basics
La tex basicsLa tex basics
La tex basicsawverret
 
Advanced latex
Advanced latexAdvanced latex
Advanced latexawverret
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms宇 傅
 
Java 8 Concurrency Updates
Java 8 Concurrency UpdatesJava 8 Concurrency Updates
Java 8 Concurrency UpdatesDamian Łukasik
 

What's hot (17)

Introduction to matplotlib
Introduction to matplotlibIntroduction to matplotlib
Introduction to matplotlib
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
 
Go Course Day2
Go Course Day2Go Course Day2
Go Course Day2
 
Lec18
Lec18Lec18
Lec18
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)
 
Implementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLABImplementation of “Parma Polyhedron Library”-functions in MATLAB
Implementation of “Parma Polyhedron Library”-functions in MATLAB
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPy
 
Lecture12
Lecture12Lecture12
Lecture12
 
Numpy
NumpyNumpy
Numpy
 
Stack data structure in Data Structure using C
Stack data structure in Data Structure using C Stack data structure in Data Structure using C
Stack data structure in Data Structure using C
 
Graphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their RepresentationsGraphs, Trees, Paths and Their Representations
Graphs, Trees, Paths and Their Representations
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithms
 
La tex basics
La tex basicsLa tex basics
La tex basics
 
Advanced latex
Advanced latexAdvanced latex
Advanced latex
 
Numpy
NumpyNumpy
Numpy
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms
 
Java 8 Concurrency Updates
Java 8 Concurrency UpdatesJava 8 Concurrency Updates
Java 8 Concurrency Updates
 

Similar to presentation

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsunyil96
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefixSanjeev Gupta
 
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingCompiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingEelco Visser
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticIJERA Editor
 
メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介Takashi Hoshino
 
03-data-structures.pdf
03-data-structures.pdf03-data-structures.pdf
03-data-structures.pdfNash229987
 
Signals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 BatchSignals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 BatchAmairullah Khan Lodhi
 
Asymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAsymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAmrinder Arora
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOpsPooyan Jamshidi
 
DSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxDSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxHamedNassar5
 
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingCS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingEelco Visser
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsPhilip Schwarz
 
Dynamic programming - fundamentals review
Dynamic programming - fundamentals reviewDynamic programming - fundamentals review
Dynamic programming - fundamentals reviewElifTech
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
Review session2
Review session2Review session2
Review session2NEEDY12345
 

Similar to presentation (20)

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithms
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefix
 
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term RewritingCompiler Construction | Lecture 5 | Transformation by Term Rewriting
Compiler Construction | Lecture 5 | Transformation by Term Rewriting
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
 
MLMM_16_08_2022.pdf
MLMM_16_08_2022.pdfMLMM_16_08_2022.pdf
MLMM_16_08_2022.pdf
 
メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介メモリより大きなデータの Sufix Array 構築方法の紹介
メモリより大きなデータの Sufix Array 構築方法の紹介
 
03-data-structures.pdf
03-data-structures.pdf03-data-structures.pdf
03-data-structures.pdf
 
Lecture 2 f17
Lecture 2 f17Lecture 2 f17
Lecture 2 f17
 
Signals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 BatchSignals And Systems Lab Manual, R18 Batch
Signals And Systems Lab Manual, R18 Batch
 
Asymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAsymptotic Notation and Data Structures
Asymptotic Notation and Data Structures
 
Machine Learning meets DevOps
Machine Learning meets DevOpsMachine Learning meets DevOps
Machine Learning meets DevOps
 
DSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxDSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptx
 
algorithm Unit 5
algorithm Unit 5 algorithm Unit 5
algorithm Unit 5
 
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term RewritingCS4200 2019 | Lecture 5 | Transformation by Term Rewriting
CS4200 2019 | Lecture 5 | Transformation by Term Rewriting
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with Views
 
Dynamic programming - fundamentals review
Dynamic programming - fundamentals reviewDynamic programming - fundamentals review
Dynamic programming - fundamentals review
 
Algorithms Question bank
Algorithms Question bankAlgorithms Question bank
Algorithms Question bank
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
Scala Introduction
Scala IntroductionScala Introduction
Scala Introduction
 
Review session2
Review session2Review session2
Review session2
 

presentation

  • 1. Annotated sux trees Algorithms Implementation LM Monitor Text Analysis with Enhanced Annotated Sux Trees Algorithms and Implementation Mikhail Dubov 1 National Research University Higher School of Economics Computer Science faculty, Moscow, Russia AIST'2015 1/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 2. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 2/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 3. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Annotated sux trees Letter-based method for text analysis Annotated sux trees: full-text index Basic computation: relevance score of a keyphrase to the text collection indexed by AST Range of applications: Text classication (e.g. spam ltering) Feature extraction Keyphrase analysis (stay tuned) 3/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 4. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Sux trees Figure: Sux tree for string XABXAC Sux tree for a string S (|S| = n) is a rooted directed tree encoding all the suxes of that string [3] The concatenation of edge labels on every path from the root node to one of the leaves makes up one of the suxes of that string, i.e. S[i . . . n]. It is also required that each internal node has two or more children, and each edge is labeled with a non-empty substring of S. 4/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 5. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Sux trees Various O(n) construction algorithms exist (Ukkonen, Weiner) Establishes a linear-time solution for the exact pattern matching problem Sux tree is a full-text index 5/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 6. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score Annotated sux trees Figure: Annotated sux tree for string XABXAC Extension: node labels Node label f (v) indicates the number of entries of the substring on the path from root to v in the text collection 6/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 7. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score Figure: Naive AST representation (as a trie) for a collection of 3 strings gondition—l pro˜—˜ility of — node given its parent: ˆp(v) = f (v) f (parent(v)) 7/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 8. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score Relevance score computation for keyword S in text collection T (described in terms of a trie, not a tree): For each sux S[i . . . n] of S, try to match it against the sux tree AST(T), starting at the root. If, for sux s, we matched exactly k symbols in the tree, then scoresuff (s) = k i=1 ˆp(vi ) k , where vi is the i-th node on the matching path starting at the root (if k = 0, then scoresuff (s) = 0). The nal score for keyword S is obtained as SCORE(S) = |S| i=1 scoresuff (S[i :]) |S| . 8/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 9. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: Example 1 T = [“XAB“, “XAC“, “CAB“] SCORE(“ABC“) = score(“ABC“) + score(“BC“) + score(“C“) 3 = = (0.33 + 0.67)/2 + (0.22)/1 + (0.22)/1 3 = 0.31 9/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 10. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: Example 2 T = [“XAB“, “XAC“, “CAB“] SCORE(“XYZ“) = score(“XYZ“) + score(“YZ“) + score(“Z“) 3 = = (0.22)/1 + 0 + 0 3 = 0.07 10/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 11. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: Example 3 SCORE(“Alice“) = 0.32 SCORE(“Bob“) = 0.04 @…su—llyD ƒgy‚i b HFP is — strong eviden™e of relev—n™eA 11/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 12. Annotated sux trees Algorithms Implementation LM Monitor Sux trees Annotated sux trees AST relevance score AST relevance score: alternatives summary Alternative solution: count the number of occurrences of a keyword in the text colleciton Word-based approach Requires at least normalization, NLP involved Can also use the Levenstein distance for more sensitivity Relevance score denition interpretation is not obvious AST Relevance score: Letter-based, fuzzy approach Language-independent, no NLP involved Interpretation: average conditional probability of an occurrence of a single symbol of the input key phrase in the text collection 12/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 13. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 13/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 14. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux trees In the original papers, AST was represented as a trie =⇒ O(n2 ) time space complexity. Even when implemented properly with sux trees, the AST construction time space usage still has a large hidden constant behind O(n). We propose an enhanced implementation that uses sux arrays. 14/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 15. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux tries to sux trees Ensure the linearity of our data structure: Sux trie: one node per letter, O(n2 ) time space Sux tree: compacted edges, no chains, O(n) time space Figure: Sux trie Figure: Sux tree 15/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 16. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux tries to sux trees To construct —nnot—ted su0x trees in O(n), simple preprocessing is needed: Algorithm LinearASTConstruction(C) snputF String collection C = {S1, . . . , Sm} yutputF Generalized annotated sux tree for C. 1 Construct C = {S1$1, . . . , Sm$m}, where $i are unique characters that do not appear in S1 . . . Sm. 2 Construct a generalized sux tree T for collection C using a linear-time algorithm (e.g. the …kkonen —lgorithm). 3 for l in leaves(T) 4 do set f (l) ← 1 5 Run a postx depth-rst tree traversal on the sux tree T. For each inner node v, set f (v) ← u∈children(v) f (u). 16/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 17. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux tries to sux trees One minor change in the sux relevance score: Figure: Sux trie scoresuff (s) = k i=1 ˆp(vi ) k Figure: Sux tree If l is the number of symbols in the match, then scoresuff (s) = k i=1 ˆp(vi ) + l − k l 17/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 18. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays From sux trees to sux arrays Sux array for a string S (|S| = n) is an array of n integer numbers, enumerating the n suxes of S in lexicographic order. Table: Sux array for string XABXAC (the suxes are not actually stored) i sux array S[su[i]:] 0 2 ABXAC 1 5 AC 2 3 BXAC 3 6 C 4 1 XABXAC 5 4 XAC Sux arrays are more space ecient than sux trees. 18/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 19. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced sux arrays Abouelhoda, Kurtz, Ohlebusch [1] have shown that it is possible to systematically replace every algorithm that uses sux trees with another one based on sux arrays. Need to enhance the sux array with two auxiliary arrays: lcp-table for bottom-up traversal child-table for top-down traversal Can be implemented to take no more than 10 bytes per input symbol (at least 20 for sux trees) 19/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 20. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced sux arrays Table: Enhanced sux array for string XABXAC i sux array lcp-table child-table S[su[i]:]1. 2. 3. 0 1 0 1 2 ABXAC 1 4 1 AC 2 2 0 1 3 BXAC 3 5 0 4 C 4 0 0 XABXAC 5 3 2 XAC 20/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 21. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays We need to store annotations for sux tree nodes The number of nodes in a sux tree cannot exceed (2n − 1) After preprocessing, all the leaves will be annotated with 1, so there is no need to store these annotations explicitly We are left with at most (n − 1) numbers to store =⇒ can introduce one more auxiliary array of length n (annotation-table) 21/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 22. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Table: Enhanced annotated sux array for string XABXAC i sux array lcp-table child-table annotation S[su[i]:]1. 2. 3. 0 1 0 1 2 6 ABXAC 1 4 1 2 AC 2 2 0 1 3 BXAC 3 5 0 4 C 4 0 0 XABXAC 5 3 2 2 XAC 22/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 23. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Figure: Annotated sux tree for string XABXAC 23/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 24. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Node-to-array mapping via virtual lcp-trees: Can be restored from the lcp-table Nodes correspond to the inner nodes of the sux tree Nodes are represented as l, i, j : the l™pEv—lue l and the left and right boundaries of the l™pEinterv—l (i, j) For each l™pEinterv—l v = l, i, j there exists a unique index, index(v) ∈ [0; n − 1], which is equal to the smallest k, such that k i and lcp[k] = l. It is this mapping that we use to store the inner node frequency annotations. 24/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 25. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays Algorithm LinearEASAConstruction(C) snputF String collection C = {S1, . . . , Sm} yutputF Enhanced sux array for C with substring frequency annotations. 1 Construct a string S = S1$1 + · · · + Sm$m, where $i are unique termination symbols. 2 Construct a sux array A for string S using a linear-time algorithm (e.g. the u¤—rkk¤—inenEƒ—nders —lgorithm) and two auxiliary arrays: l™pE—rr—y and ™hildE—rr—y. 3 Simulate a postx depth-rst tree traversal on the sux array A. At each of the virtu—l inner nodes, corresponding to an l™pEinterv—l v = l, i, j , where i j, set annotation[index(v)] = u∈children(v) annotation[index(u)] + #( l, i, j : i = j). 25/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 26. Annotated sux trees Algorithms Implementation LM Monitor From sux tries to sux trees From sux trees to sux arrays Enhanced annotated sux arrays: Experimental results Figure: Experimental results Implementation: Python 2.7 10x less memory due to sux arrays + the xumpy library 26/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 27. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 27/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 28. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Package EAST EAST = inh—n™ed ennot—ted ƒu0x „rees Open-source: https://github.com/msdubov/AST-text-analysis Registered in Python Package Index and is easy to install: $ pip install EAST 28/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 29. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Package EAST Provides command-line user interface: $ east keyphrases table keyphrases_list.txt path/to/the/text/collection/ Can be used as a Python library: from east.asts.base import AST ast = AST.get_ast([''XAB'', ''XAC'', ''CAB'']) ast.score(''ABC'') 0.3148148148148149 29/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 30. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Synonym extraction EAST implements one laguage-dependent feature: synonym extraction Motivation: Relevance scores should be similar, say, for pl—nt t—xonomy and pl—nt ™l—ssi(™—tion, even if the latter can be rarely found in the text collection. Algorithm: distributional synonym extraction algorithm based on that by Lin [4], which employs the so-called dependency triples (w1, r, w2) (idea: simil—r texts —ppe—r in simil—r ™ontexts) Domain-specic synonyms are likely to be found with this context-based approach 30/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 31. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Synonym extraction Dependency triples extraction is done by Yandex Tomita parser (based on grammatical templates like —dje™tive C su˜st—ntive or ver˜ C —rver˜) Grammar: S - adj_mod_of interp (Relation.adj_mod_of::...) | adv_of interp (Relation.adv_of::norm=inf) | adv interp (Relation.adv::norm=inf) | ... adj_mod_of - Adjgnc-agr[1] Noungnc-agr[1]; adv_of - Adv Verb; adv - Verb Adv; ... 31/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 32. Annotated sux trees Algorithms Implementation LM Monitor Package EAST Synonym extraction Synonym extraction Synonyms extracted from a text collection from the Izvestia newspaper: head (ãëàâà) ⇔ CEO (ãåíäèðåêòîð) high (âûñîêèé) ⇔ low (íèçêèé) . . . Low precision is not very critical: among synonimous key phrases we chose the one that has maxw∈syn(S) SCORE(w) To extract synonyms before computing relevance scores: $ east -s keyphrases table keyphrases_list.txt path/to/the/text/collection/ 32/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 33. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs Table of Contents 1 Annotated sux trees Sux trees Annotated sux trees AST relevance score 2 Algorithms From sux tries to sux trees From sux trees to sux arrays 3 Implementation Package EAST Synonym extraction 4 LM Monitor Concept architecture Keyphrase reference graphs 33/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 34. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Concept LM Monitor = v—tent we—ning wonitor [2] 1 Web crawling RuNeWC: Russian Newspaper Web Corpus 5 sources available now 2 Keyphrase analysis Keyphrases are provided by the user Using the AST relevance scores for these keyphrases, a keyphrase reference graph is built Text visualization tool 34/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 35. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Development 1 Research group Text analysis and visualization methods 2 Head: Boris Mirkin (Sc.D, prof.) 3 Sta: Bachelor/Master/PhD students 35/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 36. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Architecture 36/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 37. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: System 37/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 38. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs Keyphrase reference graphs Keyphrase reference graphs: Model directed relations between keyphrases Nodes are keyphrases For a keyphrase A, r ∈ [0; 1] is a relevance threshold : if SCOREAST(T)(A) r, then A is considered to be relevant to text T (usually r = 0.2) F(A) = {T : SCOREAST(T)(A) r} c ∈ [0; 1] is a ™on(den™e threshold For keyphrases A and B, if |F(B)∩F(A)| |F(A)| ≥ c, then there is an edge in the graph from keyphrase A to keyphrase B (usually c = 0.6) A → B is like an —sso™i—tive rule 38/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 39. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs 39/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 40. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs 40/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 41. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs Figure: Keyphrase reference graph built for Oct/Nov 2014 41/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 42. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs LM Monitor: Keyphrase reference graphs Figure: Keyphrase reference graph built USA/France constitutions 42/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 43. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs Future work Automated graph analysis (central nodes visualization etc.) Temporal graph analysis (how do graphs change over time?) Better support for synonyms 43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 44. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs e˜ouelhod—D wF sF Replacing Sux Trees with Enhanced Sux Arrays / M. I. Abouelhoda, S. Kurtz, E. Ohlebusch // Journal of Discrete Algorithms, Amsterdam: Elsevier. 2004.  2. pp. 53-86. hu˜ovD wF Automatic Russian Text Processing System / M. Dubov, B. Mirkin, A. Shal // Open Systems. DBMS 2014. v. 22  10 pp. 15-17 qus(eldD hF Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology / D. Guseld. Cambridge University Press, 1997. vinD hF Automatic Retrieval and Clustering of Similar Words. / D. Lin // Proceedings of the 17th International Conference on Computational Linguistics. 1998. pp. 768-774. 43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees
  • 45. Annotated sux trees Algorithms Implementation LM Monitor Concept architecture Keyphrase reference graphs wirkinD fF Method of annotated sux tree for scoring the extent of presence of a string in text / B. Mirkin, E. Chernyak, O. Chugunova // Business-Informatics 2012.  3(21). pp. 31-41. 43/43 Mikhail Dubov Text Analysis with Enhanced Annotated Sux Trees